|
Optimizing
Indexing of Large Collections of
Data
This article acts as a forensics
supplement to the article on tips for
optimizing indexing of large collections
of data. Topics in that
article include: document storage and the
NTFS file system, general indexing
strategy, index and document location,
indexing resources and efficient text
processing.
dtSearch can index over a
terabyte of text in a single index, with
search time typically less than a second.
There are no limits on the number of
indexes dtSearch can build and
simultaneously search. Please see Optimizing Indexing of
Large Collections of Data
for additional information on using the
terabyte indexer.
Distributed/Federated
Searching
A single terabyte-data index can
span multiple local and remote locations.
For example, a single index can include
data from hard drives, local area
networks, Exchange servers (see
Outlook/Exchange topic below), Intranet
servers and public Web sites (see Spider
topic below).
dtSearch can rank federated or
distributed indexed search results
collectively by relevance, displaying all
local and remote files with highlighted
hits. A scrolling "word wheel" display in
dtSearch Desktop includes all words in an
index covering local and remote
locations. dtSearch can also output all
indexed words to a file.
 |
dtSearch
Desktop: Click Index
> List Index
Contents |
 |
dtSearch
Developer
API: Use ListIndexJob
(.NET) or DListIndexJob
(C++) |
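As an illustration of ranking federated results collectively, the following minimal Python sketch merges per-index result lists into one list ordered by relevance. It is not dtSearch code: the `Hit` type, the `merge_ranked` function, and the assumption that scores from different indexes are directly comparable are all hypothetical and used only for this example.

```python
import heapq
from typing import Iterable, List, NamedTuple, Sequence


class Hit(NamedTuple):
    score: float     # relevance score, assumed comparable across indexes
    source: str      # hypothetical label for the index that produced the hit
    document: str


def merge_ranked(result_lists: Iterable[Sequence[Hit]], limit: int = 10) -> List[Hit]:
    """Merge result lists that are each already sorted by descending score."""
    # heapq.merge keeps the combined stream sorted without re-sorting everything.
    merged = heapq.merge(*result_lists, key=lambda hit: -hit.score)
    return [hit for _, hit in zip(range(limit), merged)]


local = [Hit(0.92, "local", "report.docx"), Hit(0.40, "local", "notes.txt")]
remote = [Hit(0.75, "intranet", "policy.pdf"), Hit(0.10, "intranet", "index.html")]

for hit in merge_ranked([local, remote], limit=3):
    print(f"{hit.score:.2f}  {hit.source:<9} {hit.document}")
```

The merge is lazy, so only the top `limit` hits are pulled from each list, however long the individual result lists are.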
Spider-Assisted
Searching
The dtSearch Spider supports
searching of static browser-ready content
(HTML, PDF, XML/XSL); dynamic
browser-ready content (MS CMS,
SharePoint, ASP.NET, etc.); and
browser-incompatible content (MS Office
files, OpenOffice files, etc.). The Spider
can even index and search web-accessible
data on platforms that dtSearch does not
directly support, such as Mac and
Unix.
The Spider supports public sites
as well as secure content, including
password-protected pages and forms-based
authentication. Indexing with the
Spider involves simply selecting a URL or
URLs and indicating how many vertical or
horizontal links to follow. The Spider
automatically figures out the format of
the data, so there is no need to tell the
Spider whether a retrieved web page
contains, for example, an MS Office
document or a PDF file.
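How the Spider detects formats is not described here. One common general technique is to inspect the first bytes of the retrieved data for well-known file signatures, and the Python sketch below shows that idea under stated assumptions. The function name `sniff_format` and the coarse labels it returns are hypothetical; a real detector would look much deeper than this.

```python
def sniff_format(data: bytes) -> str:
    """Guess a coarse format label from the leading bytes of a download."""
    head = data[:1024]
    if head.startswith(b"%PDF-"):
        return "pdf"
    # OLE2 compound file header, used by legacy .doc/.xls/.ppt files.
    if head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "ole2"
    # ZIP local file header, used by .docx/.xlsx and OpenDocument containers.
    if head.startswith(b"PK\x03\x04"):
        return "zip"
    text = head.lstrip().lower()
    if text.startswith((b"<!doctype html", b"<html")):
        return "html"
    if text.startswith(b"<?xml"):
        return "xml"
    return "unknown"


print(sniff_format(b"%PDF-1.7\n..."))
print(sniff_format(b"  <!DOCTYPE html><html></html>"))
```

Because the check uses content rather than the URL or file extension, a page served from an address ending in `.aspx` can still be recognized as a PDF or an Office document.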
The dtSearch Spider displays
static and dynamic browser-ready content
WYSIWYG, including display of images,
formatting and links, with the sole
addition of highlighted hits. The Spider
converts browser-incompatible content
(such as MS Office or OpenOffice) "on the
fly" to HTML for browser display with
highlighted hits.
More information (basic
article);
more information (advanced
article)
For convenient offline access,
the dtSearch Spider also includes a
caching option, to store the full
spidered content along with the index.
(Without caching, the Spider has to
return to the relevant URL to display the
full content with highlighted
hits.)
 |
dtSearch
Desktop: To enable
caching, use the Create
Index (Advanced) dialog
box. |
 |
dtSearch
developer API: To enable
caching, set the caching
flags in
IndexJob.IndexingFlags. |
Automatic
Recognition of Dates, Email Addresses, and
Credit Card Numbers
dtSearch can automatically
recognize dates, email addresses, and
credit card numbers, and search for these
items by type. Through this feature,
dtSearch can, for example, search for a
credit card number regardless of how it
may be formatted, or search for a range
of dates even if the dates are expressed
in different text formats (January 15,
2005, through 2/19/07). dtSearch can also
extract all dates, emails and credit card
numbers from a collection of documents.
More information
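To make the idea of searching by type concrete, here is a minimal Python sketch that finds email addresses, credit card numbers (validated with the standard Luhn checksum), and dates written in two of the formats mentioned above. It is only an illustration, not dtSearch's recognition logic: the regular expressions, the two date formats, and all function names are assumptions chosen for this example.

```python
import re
from datetime import date, datetime

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
# Only two date styles are handled here: "January 15, 2005" and "2/19/07".
DATE_PATTERNS = [
    (re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"), "%B %d, %Y"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2}\b"), "%m/%d/%y"),
]


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0


def find_cards(text: str) -> list:
    """Return card-like numbers, normalized to digits only."""
    candidates = (re.sub(r"[ -]", "", m.group()) for m in CARD_RE.finditer(text))
    return [digits for digits in candidates if luhn_ok(digits)]


def find_dates(text: str) -> list:
    """Return recognized dates as date objects, sorted chronologically."""
    found = []
    for pattern, fmt in DATE_PATTERNS:
        for match in pattern.finditer(text):
            try:
                found.append(datetime.strptime(match.group(), fmt).date())
            except ValueError:
                continue        # looked like a date but is not a valid one
    return sorted(found)


sample = ("Charged 4111-1111-1111-1111 on January 15, 2005; "
          "refunded 2/19/07. Questions: billing@example.com")
print(find_cards(sample), EMAIL_RE.findall(sample))
print([d for d in find_dates(sample) if date(2005, 1, 1) <= d <= date(2007, 12, 31)])
```

Normalizing every match to a common form (digits only, or a `date` object) is what allows a single query to match items regardless of how they were written in the source text.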
Forensics
Filtering Features
dtSearch offers a Unicode
filtering feature for automatic recovery
of text from corrupt
forensically-recovered documents and
large data blocks, such as those
recovered through an "undelete" process,
from unallocated computer space, or from
partially recovered file fragments. The
filtering algorithm can scan recovered
data blocks using multiple Unicode and
other text encoding detection methods.
More information
 |
dtSearch
Desktop: Click Options
> Preferences >
Filtering Options, and check
the "Filter text" option
under "Binary files" to
enable filtering of binary
files. |
 |
dtSearch
developer API: Set
Options.BinaryFiles =
dtsoFilterBinaryUnicode. |
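The general idea of recovering text from a raw data block can be sketched in a few lines of Python. This is a simplified stand-in, not the dtSearch filtering algorithm: it looks only for runs of printable ASCII and of ASCII-range characters stored as UTF-16LE, and the minimum run length of four characters and the name `recover_text` are assumptions made for the example.

```python
import re
from typing import List

# Runs of at least four printable ASCII bytes.
ASCII_RUN = re.compile(rb"[\x20-\x7e]{4,}")
# The same characters stored as UTF-16LE (each followed by a zero byte).
UTF16LE_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){4,}")


def recover_text(blob: bytes) -> List[str]:
    """Return readable text runs found in an arbitrary block of bytes."""
    runs = [m.group().decode("ascii") for m in ASCII_RUN.finditer(blob)]
    runs += [m.group().decode("utf-16-le") for m in UTF16LE_RUN.finditer(blob)]
    return runs


# A fake "recovered fragment": binary noise around two pieces of text.
fragment = (b"\x00\x01\xffInvoice 2041\x00\x02"
            + "secret memo".encode("utf-16-le")
            + b"\xfe\xfd")
print(recover_text(fragment))
```

Scanning the same block under more than one encoding matters because a fragment carved from unallocated space carries no header saying how its text was stored.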
Outlook/Exchange
Support
dtSearch includes two ways to
index Outlook or Exchange messages,
contacts, tasks, and notes. Both methods
include indexing and searching of the
underlying messages as well as the full
text of all email attachments. dtSearch
will highlight hits in both messages and
attachments.
In the first approach, dtSearch
indexes "live" content in an Outlook
profile. In addition to display of search
results in dtSearch with highlighted
hits, dtSearch supports launching a
message, contact, task, or note in the
native application. For example, you can
search for a message in dtSearch, launch
the message in Outlook, and then reply to
the message using Outlook.
For archiving and forensic
applications, dtSearch recommends
extracting Outlook and Exchange data to
.msg files. The .msg conversion approach
in dtSearch works through a command-line
tool to extract Outlook items in bulk
from larger volumes of PST or Exchange
data. The converted .msg files will
include all properties of the original
Outlook item, including any attachments.
Following conversion, dtSearch can index
the resulting .msg files, including
highlighting hits in messages and
attachments.
More information
(Note: the above discussion
applies to Outlook and Exchange data.
dtSearch can index Outlook Express .dbx
files just like any other supported file
type.)
Fuzzy
Searching
Fuzzy searching uses a
proprietary algorithm to find search
terms even if they are misspelled.
dtSearch recommends fuzzy searching for
searching emails, OCR’ed text, or any
other text that may contain
misspellings.
Search fuzziness adjusts from 0
to 10 so you can fine-tune fuzziness to
the level of OCR or typographical errors
in your files. A search for
alphabet with a fuzziness of 1
would find alphaqet; with a
fuzziness of 3, it would find both
alphaqet and alpkaqet.
Fuzziness is not built into the index, so
you can vary fuzziness at the time of
each search. More information on fuzzy and
other search
options
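The dtSearch fuzzy algorithm is proprietary, and how its 0 to 10 scale maps onto any particular distance measure is not stated here. As a rough illustration of why fuzziness can be chosen at search time, the Python sketch below uses plain Levenshtein edit distance as a stand-in and compares the search term against each candidate word when the query runs. The function names and the `max_edits` parameter are hypothetical.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def fuzzy_matches(term: str, words: Iterable[str], max_edits: int) -> List[str]:
    """Return the words within max_edits edits of the search term."""
    return [w for w in words if edit_distance(term, w) <= max_edits]


indexed_words = ["alphabet", "alphaqet", "alpkaqet", "elephant"]
print(fuzzy_matches("alphabet", indexed_words, max_edits=1))
print(fuzzy_matches("alphabet", indexed_words, max_edits=2))
```

Since the word list is stored as-is and the tolerance is applied only during matching, the same index serves both strict and loose searches.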
International
Language Support
dtSearch includes
Unicode-compatible file parsing, to
convert input data to Unicode. dtSearch
automatically recognizes all
Unicode-supported encodings, representing
hundreds of international languages.
The following dtSearch search options work
automatically on text in any international
language: phrase; Boolean; proximity and
directed proximity; wildcard; macro; numeric
range; fielded data / metadata search options;
fuzzy searching (adjustable from 0 to 10 to
account for typographical or OCR errors); and
relevancy-ranked searching (including natural
language vector-space ranking, positional
scoring options, general variable term
weighting, variable term weighting in fields,
and other API-based document classification and
sorting options).
More information
Chinese, Japanese and
Korean Text With No Word
Breaks
Some Chinese, Japanese, and
Korean text does not include word breaks.
Instead, the text appears as lines of
characters with no spaces between the
words. Because there are no spaces
separating the words on each line,
dtSearch sees each line of text as a
single long word. To make this type of
text searchable, enable automatic
insertion of word breaks around Chinese,
Japanese, and Korean characters, so each
character will be treated as a single
word.
 |
dtSearch
Desktop: In Options >
Preferences > Letters and
Words, check the box to
“Insert word breaks between
Chinese, Japanese, and Korean
characters in
text.” |
 |
dtSearch
Developer API: set
dtsoTfAutoBreakCJK in
Options.TextFlags. |
Language
Group
Identification
For documents in certain formats
that do not include encoding information,
such as single-byte text files, dtSearch
provides a proprietary language
recognition algorithm for detecting text
in a large variety of languages (Western
European, other European, Middle-Eastern,
etc.). This algorithm is enabled by
default.
Hidden
Content
A search in dtSearch will always
include white-on-white text and similar
"invisible" text in files. dtSearch also
includes options for searching embedded
objects in Microsoft Office documents,
and normally hidden content in
HTML.
While HTML comments, scripts,
links, and styles are not by default
included in indexing, dtSearch has an
option to include these.
 |
dtSearch
Desktop: Click Options
> Preferences >
Indexing Options, and check
the box to "Index HTML
scripts, styles, links and
comments." |
 |
dtSearch
developer API: Set
Options.FieldFlags to a
combination of these flags:
dtsoFfHtmlShowLinks,
dtsoFfHtmlShowImgSrc,
dtsoFfHtmlShowComments,
dtsoFfHtmlShowScripts,
dtsoFfHtmlShowStylesheets,
and
dtsoFfHtmlShowMetatags. |
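To show what kind of content these options expose, the following Python sketch uses the standard library HTML parser to pull comments, link targets, and script text out of a page. It is an independent illustration rather than dtSearch code, and the `HiddenContentCollector` class and its attribute names are hypothetical.

```python
from html.parser import HTMLParser


class HiddenContentCollector(HTMLParser):
    """Collect HTML content that a browser does not normally display."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.links = []
        self.scripts = []
        self._in_script = False

    def handle_comment(self, data):
        self.comments.append(data.strip())

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Text between <script> tags arrives here as ordinary data.
        if self._in_script and data.strip():
            self.scripts.append(data.strip())


page = ('<html><!-- draft: remove before launch --><body>'
        '<a href="http://example.com/hidden">x</a>'
        '<script>var key = "abc123";</script><p>Visible</p></body></html>')

collector = HiddenContentCollector()
collector.feed(page)
collector.close()
print(collector.comments, collector.links, collector.scripts)
```

In the sample page, the word "draft" and the string "abc123" never appear on screen, which is why an investigator may want them included in the index.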
A similar option searches hidden
content (such as Macros or other embedded
objects) in Microsoft Office
files.
dtSearch Desktop: Click
Options > Preferences > Indexing
Options, and check the box to "Index
Hidden content in Office documents."
dtSearch developer API: This option is
set by default. To disable it, set
dtsoFfOfficeSkipHiddenContent in
Options.FieldFlags.
Search for List of
Words or
Concepts
dtSearch provides an option to
search for a list of words. Under this
option, a special dialog box provides a
way to search for a long list of words,
and create a list of matching files, in a
single step. This option can work with
the full range of dtSearch search
features (Boolean, fuzzy, natural
language, etc.).
More information
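The outcome of a word-list search, a list of matching files with the listed words each one contains, can be sketched in Python as follows. This is a plain whole-word match over in-memory text and does not reflect how the dtSearch dialog box works internally. The function name `search_word_list` and the sample documents are invented for the example.

```python
import re
from typing import Dict, Iterable, List


def search_word_list(words: Iterable[str], documents: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each matching document name to the listed words it contains."""
    wanted = {w.lower() for w in words}
    report = {}
    for name, text in documents.items():
        tokens = set(re.findall(r"\w+", text.lower()))
        hits = sorted(wanted & tokens)
        if hits:
            report[name] = hits
    return report


docs = {
    "a.txt": "Wire transfer to offshore account",
    "b.txt": "Lunch menu",
    "c.txt": "The OFFSHORE ledger",
}
for name, hits in search_word_list(["offshore", "transfer", "ledger"], docs).items():
    print(name, hits)
```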
For expanding a search for a
specific word or set of words to a
user-defined list of concepts or
synonyms, dtSearch also offers a
user-defined thesaurus add-on to the
comprehensive English-language thesaurus
included with dtSearch.
 |
dtSearch
Desktop: Click Options
> Preferences > Search
Options > User Thesaurus
to add a list of synonym
rings for specific
terms. |
View Log of
Encrypted Files; Index Encrypted
PDFs
After an index update completes,
click "View Log" to see a report that
will include information on any encrypted
or unreadable files that the indexer
could not process. This report can be
accessed at any time in the index folder
in the file Index_LastUpdateErrors.html.
The report indicates which files were (a)
encrypted, (b) corrupt, (c) partially
encrypted, and (d) partially corrupt.
Partially encrypted or corrupt files are
files that could be indexed in part but
that included some encrypted or corrupt
data (for example, an email with an
encrypted attachment).
To index encrypted PDFs, make a
temporary, decrypted copy of the
encrypted files, index the decrypted
copy, and then replace the temporary
decrypted copy with the encrypted
originals. This one-time
decryption is sufficient for dtSearch
operation: once the initial index is
complete, dtSearch does not need
to decrypt the PDF files to search and
display them with highlighted hits.
Making Available
Retrieved Files on CD/DVD or Other Portable
Media
The dtSearch Publish product can
quickly publish forensically retrieved
(or e-discovery retrieved) documents to
CD, DVD or other portable media.
The resulting product provides instant
search and display access to the document
set. The CD, DVD or other portable
media can run with zero footprint,
requiring no installation on the
end-user's computer.
Please see the article
Mirroring Searchable Web
Content on Portable Media
for an overview of how dtSearch Publish
works.
|