Hybrid Page Layout Analysis via Tab-Stop Detection


Abstract:

A new hybrid page layout analysis algorithm is proposed, which uses bottom-up methods to form an initial data-type hypothesis and locate the tab-stops that were used when the page was formatted. The detected tab-stops are used to deduce the column layout of the page. The column layout is then applied in a top-down manner to impose structure and reading order on the detected regions. The complete C++ source code implementation is available as part of the Tesseract open source OCR engine at http://code.google.com/p/tesseract-ocr.
Date of Conference: 26-29 July 2009
Date Added to IEEE Xplore: 02 October 2009
Conference Location: Barcelona, Spain

1. Introduction

Physical page layout analysis, one of the first steps of OCR, divides an image into text and non-text areas and splits multi-column text into columns. This paper does not address logical layout analysis, which detects headers, footers, body text, and numbered lists, and segments the page into articles.
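
To make the tab-stop idea from the abstract concrete, the following is a minimal illustrative sketch, not the algorithm implemented in Tesseract. It assumes that the bottom-up stage has already produced bounding boxes for the first component on each text line, and it treats any x position shared by several left edges as a candidate left tab-stop. The function name `left_tab_stop_candidates` and the parameters `tolerance` and `min_support` are hypothetical and chosen only for this example.

```python
from typing import List, Tuple

# Hypothetical box format for this sketch: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def left_tab_stop_candidates(
    boxes: List[Box], tolerance: int = 3, min_support: int = 3
) -> List[int]:
    """Return x positions where at least `min_support` left edges line up.

    Left edges are sorted and grouped greedily: an edge joins the current
    group if it lies within `tolerance` pixels of the group's first edge.
    Each sufficiently large group yields one candidate, placed at the
    rounded mean of its members.
    """
    lefts = sorted(box[0] for box in boxes)
    stops: List[int] = []
    cluster: List[int] = []
    for x in lefts:
        if cluster and x - cluster[0] > tolerance:
            # The current group is finished; keep it only if well supported.
            if len(cluster) >= min_support:
                stops.append(round(sum(cluster) / len(cluster)))
            cluster = []
        cluster.append(x)
    if len(cluster) >= min_support:
        stops.append(round(sum(cluster) / len(cluster)))
    return stops


# Assumed toy input: three line starts per column and one stray box.
example_boxes: List[Box] = [
    (100, 10, 140, 30), (101, 40, 150, 60), (99, 70, 130, 90),
    (400, 10, 450, 30), (402, 40, 460, 60), (401, 70, 440, 90),
    (250, 200, 300, 220),
]
candidates = left_tab_stop_candidates(example_boxes)
```

On this toy input the two aligned groups are reported as candidates and the isolated box at x = 250 is ignored for lack of support. A practical detector would also need to check vertical extent, handle right-aligned and ragged edges, and tolerate skew, none of which this sketch attempts.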
