Automatic content extraction of filled form images based on clustering component block projection vectors

Hanchuan Peng, Xiaofeng He, Fuhui Long

Research output: Contribution to journal › Conference article › peer-review

3 Scopus citations

Abstract

Automatic understanding of document images is a hard problem. Here we consider a sub-problem: automatically extracting content from filled form images. We propose a novel approach that requires neither pre-selected templates nor sophisticated structural or semantic analysis; instead, it clusters component block projection vectors. By combining spectral clustering and minimal spanning tree clustering, we generate highly accurate clusters, from which adaptive templates are constructed to extract the filled-in content. Our experiments show that this approach is effective on a set of 1040 US IRS tax form images belonging to 208 types.
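The clustering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the projection vector or the clustering parameters, so the sketch assumes a projection vector is the row sums of a binary block followed by its column sums, and it covers only the spanning-tree half of the method (the spectral clustering step is omitted). The names `projection_vector`, `mst_clusters`, and `cut_length` are hypothetical.

```python
from itertools import combinations
from math import dist


def projection_vector(block):
    """Assumed definition: row sums followed by column sums of a binary block."""
    row_sums = [sum(row) for row in block]
    col_sums = [sum(col) for col in zip(*block)]
    return row_sums + col_sums


def mst_clusters(vectors, cut_length):
    """Kruskal-style spanning tree clustering.

    Edges are processed from shortest to longest and merged with union-find;
    edges longer than cut_length (a hypothetical threshold) are never used,
    which is equivalent to cutting them out of the minimum spanning tree.
    Returns one cluster label per input vector.
    """
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (dist(vectors[i], vectors[j]), i, j)
        for i, j in combinations(range(len(vectors)), 2)
    )
    for length, i, j in edges:
        if length > cut_length:
            break  # all remaining edges are longer still
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(len(vectors))]


# Toy 3x3 blocks (made-up data): two horizontal strokes, two vertical strokes.
blocks = [
    [[1, 1, 1], [0, 0, 0], [0, 0, 0]],
    [[1, 1, 1], [0, 0, 0], [0, 1, 0]],
    [[1, 0, 0], [1, 0, 0], [1, 0, 0]],
    [[1, 0, 0], [1, 0, 0], [1, 1, 0]],
]
vectors = [projection_vector(b) for b in blocks]
labels = mst_clusters(vectors, cut_length=2.0)
print(labels)
```

On this toy input the two horizontal-stroke blocks share one label and the two vertical-stroke blocks share another. In a form-processing setting, each resulting cluster would be a candidate for building a template, though how the paper constructs its adaptive templates is not specified in the abstract.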

Original language: English
Pages (from-to): 204-212
Number of pages: 9
Journal: Proceedings of SPIE - The International Society for Optical Engineering
Volume: 5296
State: Published - 2004
Externally published: Yes
Event: Document Recognition and Retrieval XI - San Jose, CA, United States
Duration: 21 Jan 2004 to 22 Jan 2004

Keywords

  • Clustering
  • Document analysis
  • Form processing
  • Image classification
  • Image understanding
