LAYOUT SEGMENTATION OF SCANNED NEWSPAPER DOCUMENTS

A. BANDYOPADHYAY1, A. GANGULY2, U. PAL3
1CVPR Unit, Indian Statistical Institute 203 B T Road, Kolkata, India.
2CVPR Unit, Indian Statistical Institute 203 B T Road, Kolkata, India.
3CVPR Unit, Indian Statistical Institute 203 B T Road, Kolkata, India.

Received : -     Accepted : -     Published : 15-12-2011
Volume : 1     Issue : 1       Pages : 5 - 10
J Comput Ling 1.1 (2011):5-10

Cite - MLA : A. BANDYOPADHYAY, et al. "LAYOUT SEGMENTATION OF SCANNED NEWSPAPER DOCUMENTS." Journal of Computational Linguistics 1.1 (2011):5-10.

Cite - APA : A. BANDYOPADHYAY, A. GANGULY, U. PAL (2011). LAYOUT SEGMENTATION OF SCANNED NEWSPAPER DOCUMENTS. Journal of Computational Linguistics, 1 (1), 5-10.

Cite - Chicago : A. BANDYOPADHYAY, A. GANGULY, and U. PAL. "LAYOUT SEGMENTATION OF SCANNED NEWSPAPER DOCUMENTS." Journal of Computational Linguistics 1, no. 1 (2011):5-10.

Copyright : © 2011, A. BANDYOPADHYAY, et al, Published by Bioinfo Publications. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution and reproduction in any medium, provided the original author and source are credited.

Abstract

Published layout segmentation algorithms often rely on predetermined parameters such as typical font sizes, distances between text lines, the presence of images, and document scan resolution. Variation in these parameters across real document images greatly affects the performance of these algorithms. In this paper we present a simple and novel approach for segmenting document pages that are complex in nature (having more than one picture or header). The approach segments a scanned document into images, headers, columns, and finally paragraphs. We first separate the images and the text set in larger fonts, then separate the columns, and finally divide the columns into paragraphs.
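
The abstract does not say how the column separation step is carried out. As an illustration only, the sketch below uses a vertical projection profile, a common technique for finding the white gutters between columns; it is not confirmed to be the authors' method. The function name `find_column_spans`, the `min_gap` parameter, and the list-of-rows binary image format are assumptions made for this example.

```python
from typing import List, Tuple


def find_column_spans(page: List[List[int]], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges (end exclusive) of candidate text columns.

    `page` is a binary image given as rows of 0 (background) and 1 (ink).
    A run of at least `min_gap` ink-free x-positions is treated as a gutter.
    Both the name and the parameters are hypothetical, chosen for illustration.
    """
    if not page:
        return []
    width = len(page[0])

    # Vertical projection profile: number of ink pixels at each x-position.
    profile = [sum(row[x] for row in page) for x in range(width)]

    spans: List[Tuple[int, int]] = []
    start = None  # x where the current column began, or None if in a gutter
    gap = 0       # length of the current run of empty x-positions
    for x, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The column ended where the empty run began.
                spans.append((start, x - gap + 1))
                start = None
                gap = 0

    # Close a column that runs to the right edge, trimming trailing blanks.
    if start is not None:
        spans.append((start, width - gap))
    return spans


# Tiny example: two ink regions separated by a 3-pixel-wide gutter.
page = [
    [1, 1, 0, 0, 0, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 1, 1],
]
print(find_column_spans(page))  # [(0, 2), (5, 8)]
```

A fixed `min_gap` is exactly the kind of predetermined parameter the abstract warns about, so a practical version would have to derive the threshold from the page itself; this sketch keeps it fixed for brevity.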
