Abstract
This paper proposes an approach to searching for user-specified words in imaged Chinese documents without requiring layout analysis or OCR of the entire document. A small number of Chinese characters that cannot be bounded reliably by connected component analysis, because of large gaps between their constituent elements, are blacklisted. A suitable character not on the blacklist is chosen from the user-specified word as the initial character and used to search the document for a matching candidate. Once a matching candidate is found, the adjacent characters in the horizontal and vertical directions are examined for matches with the remaining characters of the user-specified word, subject to constraints on alignment (horizontal or vertical) and size similarity. A weighted Hausdorff distance is proposed for character matching. Experimental results show that the method can effectively locate user-specified Chinese words in document images containing horizontal text lines, vertical text lines, or both on the same image.
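The abstract does not define the weighted Hausdorff distance used for character matching, so the following is only an illustrative sketch, not the paper's formulation. It assumes each character is a set of foreground pixel coordinates and uses a weighted form of the averaged (modified) Hausdorff distance. The function names and the per-point weights are hypothetical.

```python
from math import hypot


def directed_weighted_hausdorff(a, b, weights=None):
    """Weighted mean of the distances from each point in `a` to its nearest point in `b`.

    Assumption: this averaged form stands in for the paper's (unspecified)
    weighting scheme. `weights` holds one non-negative weight per point in `a`.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    if weights is None:
        weights = [1.0] * len(a)  # uniform weights reduce to a plain average
    total = 0.0
    for (ax, ay), w in zip(a, weights):
        # Distance from this point to the closest point of the other set.
        nearest = min(hypot(ax - bx, ay - by) for bx, by in b)
        total += w * nearest
    return total / sum(weights)


def weighted_hausdorff(a, b, weights_a=None, weights_b=None):
    """Symmetric distance: the larger of the two directed distances."""
    return max(
        directed_weighted_hausdorff(a, b, weights_a),
        directed_weighted_hausdorff(b, a, weights_b),
    )
```

A smaller value means the two point sets are more alike, so a candidate character could be accepted when the distance to the query character falls below a chosen threshold. The threshold and the choice of weights (for example, emphasising stroke pixels over noise) would have to be tuned for the documents at hand.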
| Original language | English |
|---|---|
| Pages (from-to) | 229-246 |
| Number of pages | 18 |
| Journal | International Journal of Pattern Recognition and Artificial Intelligence |
| Volume | 18 |
| Issue number | 2 |
| State | Published - Mar 2004 |
| Externally published | Yes |
Keywords
- Character matching
- Character segmentation
- Chinese document image
- Weighted Hausdorff distance
- Word searching