How to use dtSearch or dtSearch Web with OCR
Last Reviewed: January 29, 2009
Article: DTS0167
Applies to: dtSearch 6, 7
Scanned documents are usually stored as
TIFF images, which are then converted using OCR (optical
character recognition) into a text format such as HTML or
Microsoft Word. Using dtSearch, dtSearch Desktop, or dtSearch
Web, these text documents can then be indexed and made
searchable.
In many cases, it is necessary to provide
access both to the searchable text and the original image file,
so that users can see exactly what the original document looked
like. Because web browsers cannot display TIFF images without
an image-viewing plug-in, the image files must be converted
into another format if web access is needed.
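Before converting, it can help to know which scanned images still lack a
web-viewable counterpart. The following is a minimal Python sketch, not part
of dtSearch. It assumes a hypothetical layout in which each converted PDF sits
next to its TIFF and shares its base name; the function name and folder layout
are illustrative only.

```python
from pathlib import Path

# File extensions treated as scanned images (compared case-insensitively).
TIFF_SUFFIXES = {".tif", ".tiff"}


def tiffs_missing_pdf(folder):
    """Return sorted TIFF paths in `folder` with no same-named .pdf beside them.

    Assumption: a converted file is named <base>.pdf (lowercase extension)
    and lives in the same folder as <base>.tif or <base>.tiff.
    """
    folder = Path(folder)
    missing = []
    for path in folder.iterdir():
        if path.is_file() and path.suffix.lower() in TIFF_SUFFIXES:
            if not path.with_suffix(".pdf").exists():
                missing.append(path)
    return sorted(missing)


if __name__ == "__main__":
    import sys

    # Scan the folder given on the command line, or the current folder.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for tiff in tiffs_missing_pdf(target):
        print(tiff)
```

Running the script against a folder of scans prints the TIFF files that still
need to be converted before they can be served to web users.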
Using PDF to combine images and text
The PDF file format provides two ways to
combine images and text in a single file. First, the "image
with hidden text" format stores the complete original TIFF
images, along with the text obtained through OCR. The text is
"hidden" because, when a user opens the PDF, the user only sees
the scanned image, not the text. Because the text is also in
the file, dtSearch can index and search it. After a search,
dtSearch or dtSearch Web can highlight hits directly on the
scanned image.
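As an illustration of producing an "image with hidden text" PDF, here is a
hedged Python sketch that drives the open-source Tesseract OCR command-line
tool. Tesseract is an assumption made for this example only; it is not one of
the OCR products this article lists, and any OCR product with an equivalent
PDF output mode could be substituted. The function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(tiff_path):
    """Build the Tesseract command that turns a TIFF into a searchable PDF.

    Tesseract is invoked as: tesseract <image> <output base> pdf
    It appends ".pdf" to the output base itself.
    """
    tiff = Path(tiff_path)
    output_base = tiff.with_suffix("")  # e.g. scans/page1.tif -> scans/page1
    return ["tesseract", str(tiff), str(output_base), "pdf"]


def ocr_to_pdf(tiff_path):
    """Run Tesseract on one TIFF and return the expected PDF path."""
    # Assumption: the tesseract executable is installed and on PATH.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_ocr_command(tiff_path), check=True)
    return Path(tiff_path).with_suffix(".pdf")
```

The resulting PDF keeps the page image and carries the recognized text in an
invisible layer, so it can be indexed like any other PDF file.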
Another option for combining scanned images
and text in a single PDF file uses small images for the parts
of each scanned page that do not appear to be text. For
example, a picture or a signature would be stored as a small
image embedded in the page, while the rest of the page would be
converted to text. This format produces much smaller files than
the first alternative, because only a few small images are
stored for each page, instead of a complete image of the whole
page. Additionally, the text detected through OCR often becomes
more readable, because it is stored as text with font
information rather than as an image.
The PDF format is ideal for use on the web
because multiple pages of images and text can be combined into
a single, compressed file; anyone with a web browser and the
free Adobe Reader viewer can view the files; and the text can
be searched using dtSearch Web.
Some OCR products that can generate PDF files
from scanned images include:
DocuLex, www.doculex.com
Ligature, http://www.ligatureltd.com
Adobe Capture, www.adobe.com
Nuance OmniPage, http://www.nuance.com
Once you have created PDF files with one of
the OCR products listed above, you can index and search them
with dtSearch Web just like any other documents. After a
search, dtSearch Web will display a list of retrieved
documents. When a user clicks on one of the documents, dtSearch
Web will display the PDF file, with hits highlighted. If the
PDF file is in the "image with hidden text" format, dtSearch
Web will highlight hits directly on the image.
For a demo, see http://support.dtsearch.com/ocr/