
Retrieving Imaged Documents in Digital Libraries Based on Word Image Coding

  • Yue Lu*
  • Li Zhang
  • Chew Lim Tan
  • *Corresponding author for this work
  • National University of Singapore

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

A great number of documents are scanned and archived as digital images in digital libraries to make them available and accessible on the Internet. Information retrieval in these imaged documents has become a growing and challenging problem. For this purpose, this paper proposes a word image coding technique and describes a web-based system for efficiently retrieving imaged documents from digital libraries. Image preprocessing is first carried out offline to extract word objects from the imaged documents stored in the digital library. Each word object is then represented by a string of feature codes. As a result, each document image is represented by a series of feature code strings, one per word, which are stored in a feature code file. Upon receiving a user's request, the server converts the query word into a feature code string using the same conversion mechanism used to produce feature codes for the underlying imaged documents. Searching is then performed among the feature code files generated offline. An inexact string matching technique, capable of matching a portion of a word, is applied to match the query word against the words in the documents, and the occurrence frequency of the query word in each document is then calculated for relevance ranking. Preliminary experimental results on imaged documents of students' theses in the digital library of our university show that the proposed approach is efficient and promising for retrieving imaged documents, with potential applications to digital libraries.
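
The retrieval flow described in the abstract (encode words offline, encode the query the same way, match inexactly on word portions, rank by occurrence frequency) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not specify the paper's feature codes, so `encode_word` uses a hypothetical three-letter shape alphabet (ascender, descender, other) applied to plain text rather than to word images, and the matcher is a standard approximate substring search swapped in for the unspecified inexact matching technique. The names `encode_word`, `min_substring_distance`, `rank_documents`, and `LIBRARY` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical shape classes; NOT the feature codes used in the paper.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")


def encode_word(word: str) -> str:
    """Map each letter to a toy code: 'A' ascender, 'D' descender, 'x' other."""
    codes = []
    for ch in word.lower():
        if ch in ASCENDERS:
            codes.append("A")
        elif ch in DESCENDERS:
            codes.append("D")
        else:
            codes.append("x")
    return "".join(codes)


def min_substring_distance(pattern: str, text: str) -> int:
    """Smallest edit distance between pattern and any substring of text.

    Standard approximate substring matching: the first row is all zeros,
    so a match may start at any position in text (word-portion matching).
    """
    prev = [0] * (len(text) + 1)
    for i, p in enumerate(pattern, start=1):
        cur = [i]
        for j, t in enumerate(text, start=1):
            cur.append(min(prev[j] + 1,              # skip a pattern code
                           cur[j - 1] + 1,           # skip a text code
                           prev[j - 1] + (p != t)))  # match or substitute
        prev = cur
    return min(prev)


def rank_documents(query: str,
                   code_files: Dict[str, List[str]],
                   max_errors: int = 0) -> List[Tuple[str, int]]:
    """Rank documents by how many of their word codes match the query code."""
    q = encode_word(query)
    hits = []
    for doc, codes in code_files.items():
        count = sum(1 for c in codes
                    if min_substring_distance(q, c) <= max_errors)
        if count > 0:
            hits.append((doc, count))
    # Most frequent first; ties broken by document name for determinism.
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Offline step: one "feature code file" per document (toy data, not from the paper).
LIBRARY = {
    "thesis_a": ["image", "retrieval", "retrieval", "systems"],
    "thesis_b": ["document", "retrievals", "coding"],
    "thesis_c": ["word", "coding"],
}
CODE_FILES = {doc: [encode_word(w) for w in words]
              for doc, words in LIBRARY.items()}

RESULTS = rank_documents("retrieval", CODE_FILES)
```

In this toy data, "retrievals" in `thesis_b` is found by the query "retrieval" because the query code occurs as a portion of the longer word's code. Raising `max_errors` tolerates noisy codes at the cost of more false matches, which is especially visible with an alphabet this coarse.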

Original language: English
Title of host publication: Proceedings First International Workshop on Document Image Analysis for Libraries - DIAL 2004
Pages: 174-187
Number of pages: 14
State: Published - 2004
Externally published: Yes
Event: Proceedings First International Workshop on Document Image Analysis for Libraries DIAL 2004 - Palo Alto, CA, United States
Duration: 23 Jan 2004 to 24 Jan 2004

Publication series

Name: Proceedings - First International Workshop on Document Image Analysis for Libraries - DIAL 2004

Conference

Conference: Proceedings First International Workshop on Document Image Analysis for Libraries DIAL 2004
Country/Territory: United States
City: Palo Alto, CA
Period: 23/01/04 to 24/01/04
