OCR and Imaging

dtSearch supports the PDF "image with hidden text" format, and can highlight right on the scanned image in this format.

dtSearch also supports combined text and image displays in HTML.

dtSearch Desktop and Network include a built-in image viewer.

dtSearch recommends using fuzzy searching for sifting through possible OCR errors.
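
To illustrate why fuzzy searching helps with OCR errors, here is a minimal Python sketch. It is not dtSearch's algorithm or API; it assumes a simple edit-distance (Levenshtein) match, and the names `edit_distance`, `fuzzy_find`, `max_errors` and the sample words are hypothetical, chosen only for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def fuzzy_find(term: str, words: list[str], max_errors: int = 1) -> list[str]:
    """Return the words within max_errors edits of term, ignoring case."""
    term = term.lower()
    return [w for w in words if edit_distance(term, w.lower()) <= max_errors]


# Hypothetical OCR output in which "contract" and "March" were misread.
ocr_words = ["The", "c0ntract", "was", "signed", "in", "Mareh"]
print(fuzzy_find("contract", ocr_words))  # ['c0ntract']
print(fuzzy_find("March", ocr_words))     # ['Mareh']
```

An exact search for "contract" would miss the misread word "c0ntract"; allowing one character of difference recovers it.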
OCR and PDF
The Adobe PDF file format provides two ways to combine, in a single
file, scanned images and OCR'ed text, that is, text obtained from
those images through Optical Character Recognition (OCR) software.
(1) The "image with hidden text" format stores
the complete original image of a scanned
document, along with the text obtained through
OCR. The text is "hidden" in the sense that
simply opening the PDF file displays only the
scanned image, not the underlying OCR'ed text.
Because the OCR'ed text is still stored in the
file, however, dtSearch can index and search
it.

After a search, when a user clicks on an "image
with hidden text" PDF document, the
dtSearch product will display the scanned
image. Because the actual OCR'ed text is
"hidden," the display will appear to highlight
hits directly on the image. A dtSearch Web
demo shows this hidden text highlighting.
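
The idea of highlighting hits "on the image" can be sketched in Python as follows. This is an illustration only, not how dtSearch or the PDF format represents the data internally: it assumes each OCR'ed word carries a bounding box in image pixel coordinates, and `OcrWord`, `hit_boxes` and the sample page are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str
    # Assumed layout: (left, top, right, bottom) in image pixels.
    box: tuple[int, int, int, int]


def hit_boxes(words: list[OcrWord], term: str) -> list[tuple[int, int, int, int]]:
    """Return the boxes of words matching term, ignoring case and punctuation."""
    term = term.lower()
    return [w.box for w in words if w.text.lower().strip('.,;:"') == term]


# Hypothetical hidden-text layer for one scanned page.
page = [
    OcrWord("Lease", (100, 80, 220, 110)),
    OcrWord("agreement", (230, 80, 420, 110)),
    OcrWord("renews", (100, 130, 230, 160)),
    OcrWord("the", (240, 130, 290, 160)),
    OcrWord("lease.", (300, 130, 400, 160)),
]

# A viewer would draw a highlight rectangle over the image at each box.
print(hit_boxes(page, "lease"))
# [(100, 80, 220, 110), (300, 130, 400, 160)]
```

The search runs against the hidden text, while the rectangles are drawn over the visible scan, which is why the hits appear to be highlighted on the image itself.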
(2) Another option for combining scanned images
and OCR’ed text in a single PDF file uses
"small images" for the parts of each scanned
page that do not appear to be text. For
example, the format would store a picture or a
signature as a small image embedded in the
page. The format would store the non-picture
portion of the page only as OCR’ed text.
While the "small images" alternative does not
preserve the true image of the original
document, it does produce much more compact
files than the "image with hidden text" option.
The "small images" PDF file usually stores only
a few images for each page, instead of a
complete image of the whole document. The text
detected through OCR in the "small images"
format can also be more readable because the
resulting PDF file stores it as text with font
information rather than as an image.
For more information on both PDF/OCR
options, including a list of some
additional third-party products that OCR
into the PDF format, click here.
The dtSearch product line can instantly search terabytes of text across a desktop, network, Internet or Intranet site.

dtSearch products also serve as tools for publishing large document collections, with instant text searching, to Web sites or CD/DVDs.

over two dozen indexed, unindexed, fielded and full-text search options

highlights hits in HTML, XML and PDF, while displaying embedded links, formatting and images

converts other file types (word processor, database, spreadsheet, email and full-text of email attachments, ZIP, Unicode, etc.) to HTML for display with highlighted hits

built-in Spider adds a third-party or other Web site (public, secure content, password accessible, etc.) to your searchable database

Spider supports Web-based content (HTML, PDF, XML, etc.) as well as dynamically-generated content (ASP.NET, MS CMS, SharePoint, etc.)

General supported file types

SQL and similar data sources