Optimizing indexing of large document collections
Last Reviewed: February 6, 2009
Article: DTS0206
Applies to: dtSearch 7
Document Storage and the NTFS File System
General Indexing Strategy
Index and Document Location
Other Software
Indexing Resources
Efficient Text Processing
Why filtering improves accuracy when searching forensic data
Document Storage and the NTFS File System
• Distribute large numbers of files in a
folder tree, so individual folders do not have more than a few
thousand files.
• Disable Microsoft “8.3” short filename
creation on NTFS partitions that contain a very large number of
files.
• Use ZIP files to aggregate large numbers
of files into a smaller number of archives.
• Use a disk defragmentation utility.
Use a folder tree. In our experience and that of some of our customers, NTFS can become slow or unstable when a single folder holds a very large number of files. To avoid this problem, we recommend distributing documents across a folder tree, or aggregating documents into ZIP files, so that no individual NTFS folder holds more than a few thousand files. (Windows Vista contains an improved file system that may eliminate this issue for Vista users.)
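One way to spread files across a folder tree is to derive the subfolder names from a hash of each filename. The following is a minimal sketch under that assumption; the function name sharded_path and the two-level, two-character layout are illustrative choices, not part of dtSearch.

```python
import hashlib
from pathlib import PurePosixPath

def sharded_path(root, filename, levels=2, width=2):
    """Return root/<aa>/<bb>/filename, where aa and bb come from a hash.

    Hypothetical helper: the hash only spreads files evenly across
    folders; it has no meaning to dtSearch or NTFS.
    """
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    # Take `levels` consecutive slices of `width` hex characters each.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root, *parts, filename)

# The same filename always maps to the same folder.
print(sharded_path("docs", "report.pdf"))
```

With two levels of two hex characters there are 65,536 leaf folders, so a collection of 100 million files would average roughly 1,500 files per folder.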
Disable “8.3” short filenames. Disabling the file system setting that creates short “8.3” filenames can also help with NTFS problems caused by large numbers of files in a single folder. For information on disabling short filenames in Windows, please see http://support.microsoft.com/kb/121007. While making this change can improve NTFS performance, programs that rely on 8.3 filenames will not be able to access the data in these partitions.
Use ZIP archives. Aggregating documents into ZIP
archives greatly reduces the number of files that NTFS must
manage and will also reduce storage requirements.
dtSearch can automatically index, search and display
documents inside ZIP archives, and the effect on indexing speed
is generally minor. To work with the dtSearch ZIP file
parser, each ZIP archive must be smaller than 2 GB.
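The 2 GB limit can be respected by planning the archives before creating them. The sketch below groups files into batches whose combined uncompressed size stays under a cap. It assumes that the uncompressed total is a reasonable stand-in for the archive size and leaves a margin below 2 GB for ZIP headers; the function name plan_zip_batches and the 1.9 GB default are illustrative assumptions.

```python
def plan_zip_batches(file_sizes, max_bytes=1_900_000_000):
    """Group (name, size_in_bytes) pairs into batches, one per ZIP archive.

    Assumption: uncompressed size approximates archive size, so the
    default cap sits somewhat below the 2 GB limit.
    """
    batches, current, current_size = [], [], 0
    for name, size in file_sizes:
        if size > max_bytes:
            raise ValueError(f"{name} is too large for a single archive")
        # Start a new batch when adding this file would exceed the cap.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

print(plan_zip_batches([("a", 6), ("b", 5), ("c", 4)], max_bytes=10))
```

Each resulting batch can then be written to its own archive with any ZIP tool.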
Use a defragmentation utility. Use a defragmentation
utility such as Diskeeper to keep your hard disk defragmented.
Keep the hard disk at least 25% empty so the
defragmentation utility can run efficiently.
General Indexing Strategy
• Index in larger batches.
• Use the index compress function after
multiple index updates.
• For very large indexing jobs, index on
multiple machines running simultaneously, and then merge the
indexes.
• When merging, merge indexes into a new
empty index, rather than merging into an index that already
contains data.
• Do not require the indexer to “commit”
index updates too often (dtSearch Engine users only).
Index in larger batches. The dtSearch indexer is optimized for indexing large volumes of text at once. Indexing in small batches makes each update relatively slow and fragments the index structure.
Use the compress function after multiple index updates.
For optimal search speed, after many index updates, use
the compress function to defragment the index.
Index on multiple machines running
simultaneously. For very large indexing jobs, using
multiple machines to simultaneously build indexes on different
portions of a data collection is generally much faster than
indexing on a single machine. Splitting up the indexing
job is also a good strategy if disk space is insufficient to
index all data at once.
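When splitting a job across machines, it helps to give each machine a similar amount of data. The sketch below assigns top-level folders to machines using a simple greedy rule (largest folder first, always to the least-loaded machine). The function name partition_by_size and the folder sizes are hypothetical, and this is only one of several reasonable ways to divide the work.

```python
import heapq

def partition_by_size(folders, machines):
    """Assign folders (name -> size) to machines, balancing total size.

    Returns one (total_size, [folder names]) pair per machine.
    """
    # Heap entries are (current total, machine index, assigned folders).
    heap = [(0, i, []) for i in range(machines)]
    heapq.heapify(heap)
    by_size = sorted(folders.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size:
        total, i, names = heapq.heappop(heap)  # least-loaded machine
        names.append(name)
        heapq.heappush(heap, (total + size, i, names))
    return [(total, names) for total, _, names in sorted(heap, key=lambda t: t[1])]

# Hypothetical folder sizes in GB.
print(partition_by_size({"a": 50, "b": 30, "c": 20, "d": 10}, machines=2))
```

Each machine then builds its own index from its share of the folders, and the resulting indexes can be searched together or merged.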
Merge multiple indexes into a new, empty index.
After creation of multiple individual indexes, you can
run searches across all indexes at once. (A single
dtSearch query can search any number of indexes.) Or, for
optimal index structure and search efficiency, merge the
multiple indexes into a single index.
Merging indexes into a new, empty
index—rather than merging into an index that already contains
data—results in a substantially faster and more efficient merge
process. Make sure, however, that the final index holds
no more than about a terabyte of text.
“Commit” index updates
infrequently. The dtSearch Engine API provides
a setting, IndexJob.AutoCommitIntervalMB, that determines how
often dtSearch must commit index updates. Higher
values improve indexing performance. For best
performance, set AutoCommitIntervalMB to a value greater than
16,000. Alternatively, you can set
AutoCommitIntervalMB to zero, which requires dtSearch to commit
only once at the end of an indexing job.
dtSearch Desktop:
This setting is not currently available in dtSearch
Desktop.
dtSearch developer API: Set
AutoCommitIntervalMB to a value of either 0 or greater than
16,000.
Index and Document Location
• Keep the indexes as close as possible to
the machine where the indexer is executing—even if the data is
remote.
• Avoid generating an index on an external
drive such as Firewire, USB, or NAS.
• Do not generate or store an index on a
compressed or encrypted NTFS folder.
• If you are accessing data across
potentially unreliable network connections (for example,
crawling a large variety of web sites), download the data prior
to indexing.
• Consider using the dtSearch caching
feature, particularly for use with web-based data that changes
frequently or that may not be available in the future.
Keep the indexes close to the indexer. Generating
the index requires a high volume of read/write activity to and
from the index. Therefore, it is better to build the
index on an internal drive on the machine where the indexer is
running, rather than generating an index on a remote drive or
external drive. If the indexes must be located on a
network drive, redirect temporary sorting buffers to a local
drive, as described below.
While the index files should remain close
to the indexing engine, it does not matter as much where the
target data resides. Unlike the index building process,
which requires a large amount of read/write activity, the
indexer must read the target data just once, making the
location of the target data far less critical.
Avoid generating an index on an external drive such as Firewire, USB, or NAS. In our experience, generating an index on an external drive such as Firewire, USB, or NAS causes a substantial reduction in indexing performance. If
you want to ultimately store an index on an external drive,
build it on an internal drive and use a copy program such as
Robocopy to copy it over to the external drive after completing
the index.
If you do copy an index to an external
drive, and if the documents are located on the same drive as
the index, ensure that the documents move in tandem with the
indexes so relative path references from the index to the
documents will remain valid in the new index location.
Alternatively, you can also use the dtSearch caching
feature to accomplish a similar result (see discussion below).
If it is necessary to build an index on a network storage device, please see the article "Using dtSearch with network storage devices" for additional information.
Do not use compressed or encrypted NTFS folders to store or build indexes. Compressed or encrypted NTFS folders
impose a severe performance penalty on both indexing and
searching.
Avoid accessing data through unreliable network connections. If a web-based or other network connection to the data is unreliable, download the data first. Downloading the data instead of using an unreliable network connection makes both the initial indexing and the later display of retrieved data with highlighted hits more efficient.
Products such as WinHTTrack or Offline
Explorer Pro can download web sites to local folders. The
download approach also has the advantage that it separates the
web site crawl from the indexing work. Separating these
two tasks can result in efficiencies in the performance of
both.
Consider the caching feature for use with rapidly changing or sporadically available web-based or other remote data. When dtSearch displays a
retrieved document or web page, it refers back to the original
document or web page to display highlighted hits using hit
offset information in the index. If the document or web
page has changed since the last update, then the hit highlights
will not be in the correct place.
With caching enabled, dtSearch stores a copy of the document text along with the index. Since dtSearch can use the cached text to display hit highlights correctly, hit highlighting in cached pages will always be consistent, even if the original page has changed since the last index update. The caching feature also ensures that dtSearch can display retrieved pages, even if the original pages are removed, offline, or otherwise inaccessible through an erratic connection.
dtSearch Desktop: Specify the index location in the Create Index or Create Index (Advanced) dialog box. To enable caching, use the Create Index (Advanced) dialog box.
dtSearch developer API: Specify the
index location in IndexJob.IndexPath. To enable caching,
set the caching flags in IndexJob.IndexingFlags.
Other Software
• Disable on-access virus scanning of the
folder containing the index.
• If possible, avoid indexing with
IFilters, as some can result in speed and stability
problems.
Disable virus scanning of the folder containing the index. To prevent on-access scans by antivirus software from affecting indexing performance, configure your antivirus software not to scan files in the folder containing your index. On-access scans of document folders have a much smaller effect on indexing performance and provide an important security benefit, so we do not recommend disabling on-access scans of document folders.
Avoid IFilters, if possible. While dtSearch supports
using IFilters, they may be slower and less stable than
dtSearch’s built-in file parsers. We recommend that you
do not use IFilters for large indexing jobs unless for some
reason a particular IFilter is absolutely necessary.
dtSearch Desktop:
dtSearch does not use IFilters by default; IFilter
integration is disabled unless you enable IFilter support in
the Options > Preferences > File Types dialog
box.
dtSearch developer API: The dtSearch Engine does not
use IFilters by default. IFilter integration is
controlled using Options.FileTypeTableFile, which specifies the
location of a file type table file in the format generated by
dtSearch Desktop (filetype.xml).
Indexing Resources
For small index updates (less than 20 GB of data), dtSearch can work efficiently with limited memory (512 MB to 1 GB) and disk space of at least 60% of the size of the data to be indexed. For larger index updates:
• The machine building the index should have 2 GB RAM or more.
• The hard disk where the index will reside
should have free space of at least 15% of the size of the
original data, plus 16-32 GB for temporary workspace.
• Avoid running many indexers at the same
time on the same computer.
• If possible, redirect temporary sort
buffer creation to an internal drive other than the one
containing the index.
Use 2 GB RAM or more for larger indexing jobs. While you can limit the amount of memory the indexer will use for in-memory sort buffers, for best indexing performance, let the dtSearch indexer decide how much memory to use based on available system resources, rather than specifying a limit. The dtSearch indexer does this by default in all dtSearch products except the dtSearch Engine.
dtSearch Desktop: Click
Options > Preferences > Indexing Resources to control
the amount of memory dtSearch uses during
indexing.
dtSearch developer API: Set
IndexJob.MaxMemToUseMB.
Ensure sufficient disk space. The final index will be
about 15% of the size of the original documents (for smaller
indexes, the ratio will usually be higher). In addition,
for large index updates, at least 16 GB, and preferably 32 GB,
of disk space should be available during indexing.
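The disk space guidance in this section can be turned into a quick pre-flight check. The sketch below is a rough estimate only: it applies the 60% figure to updates under 20 GB and the 15% figure plus temporary workspace to larger ones. The function names and the 32 GB default workspace are illustrative assumptions, and actual index sizes vary with the data.

```python
import shutil

GB = 1024 ** 3

def estimate_free_space_gb(data_gb, workspace_gb=32):
    """Rough free-space target (in GB) for indexing data_gb of documents."""
    if data_gb < 20:
        # Small updates: at least 60% of the size of the data.
        return data_gb * 60 / 100
    # Larger updates: about 15% for the index plus temporary workspace.
    return data_gb * 15 / 100 + workspace_gb

def has_enough_space(index_folder, data_gb):
    """Compare the estimate with the free space on the index drive."""
    free_gb = shutil.disk_usage(index_folder).free / GB
    return free_gb >= estimate_free_space_gb(data_gb)

print(estimate_free_space_gb(1000))  # 1 TB of documents
```

Running has_enough_space against the intended index folder before starting a long job can catch a space shortfall early.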
Avoid running many indexers at the same time on the same computer. Indexing uses system resources (CPU, memory, and disk) very heavily. As a result, there
is generally little benefit to running many indexers at the
same time on the same computer, even on multi-CPU machines,
because resource contention will reduce indexing
performance. Additionally, when indexes are
located on a network share, running multiple indexers at the
same time can cause unpredictable spikes in network I/O
(because all of the indexers may be writing to the index
folders on the network at the same time), which can lead to
network write errors and corrupt indexes.
Redirect temporary sort buffer files
to a different internal drive. By
default, dtSearch creates temporary sort buffers in the
index folder. If you have multiple internal
drives, you can redirect such files to a different internal
drive from the one holding the index.
If you have to build an index on a network
drive, redirect temporary files to a local folder to minimize
network traffic. This location should have free space of
32 GB or more.
dtSearch Desktop: Use Options > Preferences > Indexing Resources > Temporary files to specify the folder to use for temporary file buffers.
dtSearch developer API: Set
IndexJob.TempFileDir.
Efficient Text Processing
• Do not use case and accent sensitive
indexing.
• Enable Unicode filtering for binary
files.
• Disable numeric range searching, if your
application does not require searches for numeric ranges.
• If possible, set the hyphenation option
to treat hyphens as spaces.
• Use the noise word list to skip common
words such as the.
It may seem that increasing the number of
unique words in an index will increase the accuracy of a
search. However, in many cases, increasing the number of
unique words in an index can reduce the accuracy of a search by
defining each incidence of a text occurrence too narrowly.
In addition, increasing the number of unique words can
also dramatically increase the index size and the index
building time.
Avoid case and accent sensitive indexing. With
case-sensitive indexing on, the indexer would consider World,
world, and WORLD as completely different words. Storing
each of these words separately increases the size of the index
and makes indexing slower. Even more importantly, it also
increases the chance that a user searching for world would miss
World and WORLD.
Accordingly, dtSearch does not recommend
using case- or accent-sensitive indexing except in highly
unusual situations where case- or accent-sensitive searching is
absolutely necessary. By default, dtSearch indexes are
not case- or accent-sensitive.
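The effect of case sensitivity on the number of unique words can be seen in a small illustration. The sketch below simply splits text on whitespace and optionally folds case; it is not dtSearch's word-breaking logic, and the function name unique_terms is hypothetical.

```python
def unique_terms(text, case_sensitive):
    """Return the set of distinct words, optionally folding case."""
    words = text.split()
    if not case_sensitive:
        # Fold case so World, world, and WORLD become one term.
        words = [w.casefold() for w in words]
    return set(words)

sample = "World world WORLD"
print(len(unique_terms(sample, case_sensitive=True)))
print(len(unique_terms(sample, case_sensitive=False)))
```

In the case-sensitive run each spelling is stored as its own term, so a search for world alone would match only one of the three occurrences.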
Enable Unicode filtering for binary files. Another
text-related indexing setting affecting both search accuracy
and index efficiency is the dtSearch Unicode filter for binary
files. Without this filter, massive amounts of useless
random data will clog indexes of binary files, and the text
indexing process may miss critical data that does not appear in
consecutive form in the binary file. For more details on why filtering improves both efficiency and accuracy, see "Why filtering improves accuracy when searching forensic data" at the end of this article.
dtSearch Desktop: Click
Options > Preferences > Filtering Options, and check
the “Filter text” option under “Binary files” to enable
filtering of binary files.
dtSearch developer API: Set
Options.BinaryFiles = dtsoFilterBinaryUnicode.
Disable numeric range searching, if possible. Numeric
range searching requires dtSearch to index each number twice,
once in its text form and once in its numeric value form.
This feature adds about 10-20% to the size of a typical
index. Accordingly, if an application does not require
numeric range searching, disabling indexing of numeric values
will result in better indexing efficiency. Note that
disabling numeric range searching continues to allow searching
of numbers as text.
dtSearch Desktop: Click
Options > Preferences > Indexing Options, and un-check
the box to “Index numeric values” to disable numeric range
searches.
dtSearch developer API: Set Options.TextFlags
= dtsoTfSkipNumericValues.
Keep the default treatment of hyphens as spaces.
Through alphabet customization, dtSearch can index
hyphenated words in multiple permutations. (For example,
dtSearch can index world-class as world class, worldclass and
world-class to ensure retrieval no matter which of these
variants a user types in.) Treating hyphens as spaces,
however, results in more efficient indexing.
dtSearch Desktop:
Treating hyphens as spaces is now the default.
To change the hyphens setting, click Options >
Preferences > Letters and Words.
dtSearch developer API: Set Options.Hyphens = dtsoHyphenAsSpace.
Use the noise word list. A search
engine can reduce index size and make searching faster by
ignoring a few dozen words that are so common as to be, for
purposes of searching, mere “noise.” For example, the, of
and for are all in the dtSearch noise word list.
For information on non-English noise word lists,
see http://www.dtsearch.co.uk/.
dtSearch Desktop:
To edit the noise word list, click Options >
Preferences > Letters and Words.
dtSearch developer API: Set
Options.NoiseWordFile to the name of a text file to use as the
noise word list before you create an index.
Why filtering improves accuracy when searching forensic data
Binary files are files that dtSearch does
not recognize as documents. Examples of binary
files include executable programs, fragments of documents
recovered through an “undelete” process, or blocks of
unallocated or recovered data obtained through computer
forensics. Content in these files may appear in a variety
of formats, such as plain text, Unicode text, or fragments of
.doc or .xls files. Many different fragments with
different encodings may be present in the same binary file.
Indexing such a file as if it were a simple
text file would miss most of the content. In contrast to
a simple text scan, the dtSearch filtering algorithm scans a
binary file for anything that looks like text using multiple
encoding detection methods. The algorithm can detect
sequences of text with different encodings or formats in the
same file, so as to better extract text from recovered or
corrupt data.
In forensic applications, when complete and
accurate results are critical, investigators may be reluctant
to enable a “filtering” feature out of concern that they will
miss something, even if disabling filtering makes indexing
slower. In reality, filtering improves completeness and
accuracy, and without it investigators will probably miss much
of the useful data in the files they are searching.
For example, this is a hex view of how some
text from this article might appear in a fragment of a
recovered Word document:
Offset    0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
00009C00  FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF  ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
00009C10  FF FF FF FF 73 65 63 72 65 74 31 FF FF FF FF FF  ÿÿÿÿsecret1ÿÿÿÿÿ
00009C20  FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF 7F  ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
00009C30  FF FF FF 7F EC 37 93 00 00 00 00 00 B2 00 00 00  ÿÿÿì7“.....²...
00009C40  00 00 FF FF FF 4A 6F 68 6E 53 6D 69 74 68 FF FF  ..ÿÿÿJohnSmithÿÿ
00009C50  FF FF 00 00 00 00 00 00 28 00 4D 00 61 00 6E 00  ÿÿ......(.M.a.n.
00009C60  61 00 67 00 69 00 6E 00 67 00 20 00 61 00 6E 00  a.g.i.n.g. .a.n.
00009C70  64 00 20 00 53 00 65 00 61 00 72 00 63 00 68 00  d. .S.e.a.r.c.h.
00009C80  69 00 6E 00 67 00 20 00 54 00 65 00 72 00 61 00  i.n.g. .T.e.r.a.
00009C90  62 00 79 00 74 00 65 00 73 00 20 00 6F 00 66 00  b.y.t.e.s. .o.f.
00009CA0  20 00 54 00 65 00 78 00 74 00 00 00 00 00 00 00   .T.e.x.t.......
All of the useful text actually present is
broken up or embedded in garbage data, effectively making it
unsearchable. A naïve, unfiltered attempt to index this
data would find the following words:
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿsecret1ÿÿÿÿÿÿÿÿ, ì7, ÿÿÿJohnSmithÿÿÿÿ,
M, a, n, a, …
The dtSearch filtering algorithm would
analyze the data more intelligently, enabling it to
• extract the word secret1 embedded in a
long sequence of non-text characters,
• extract and separate the names John and
Smith, and
• recognize that the data starting at
offset 9C58 looks like Unicode, enabling it to identify the
words Managing, Search, etc.
The dtSearch filtering algorithm works by
analyzing the patterns of characters in the data. The
dtSearch filtering algorithm makes no attempt to analyze the
meaning of the language present, so the algorithm works with
Arabic or Russian text, for example, as well as English.
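The general idea of scanning binary data for text in more than one encoding can be illustrated with a short sketch. This is not the dtSearch filtering algorithm, which is not described in detail here. It is a simplified stand-in that looks only for runs of printable ASCII and for the same characters stored as UTF-16 (each followed by a zero byte), and it does not attempt to split joined words such as JohnSmith. The function name and the minimum run length of four characters are assumptions.

```python
import re

# Runs of at least four printable ASCII bytes.
ASCII_RUN = re.compile(rb"[\x20-\x7e]{4,}")
# Runs of at least four printable characters stored as UTF-16LE.
UTF16_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){4,}")

def extract_text_runs(data):
    """Return text runs found in binary data, ordered by offset."""
    found = []
    for m in ASCII_RUN.finditer(data):
        found.append((m.start(), m.group().decode("ascii")))
    for m in UTF16_RUN.finditer(data):
        found.append((m.start(), m.group().decode("utf-16-le")))
    found.sort()
    return [text for _, text in found]

# A fragment resembling the hex view above.
fragment = (
    b"\xff" * 4 + b"secret1" + b"\xff" * 5 + b"\x00" * 3
    + "Managing Text".encode("utf-16-le") + b"\x00\x00"
)
print(extract_text_runs(fragment))
```

Even this simple pattern-based scan recovers words that a plain text reader would treat as garbage, which is the effect the filtering option is designed to achieve on a much larger scale.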
Therefore, to retrieve as much as possible
of the text present in fragments of recovered word processing
files, spreadsheets, database data, and the like, enable the
dtSearch filtering algorithm.
dtSearch Desktop:
Click Options > Preferences > Filtering Options,
and check the “Filter text” option under “Binary files” to
enable filtering of binary files.
dtSearch developer API: Set
Options.BinaryFiles = dtsoFilterBinaryUnicode.