How to index databases with the dtSearch Engine
Last Reviewed: April 3, 2009
Article: DTS0111
Applies
to: dtSearch Engine versions 6.x or 7.x
1. Indexing SQL Data and other data
sources
a. Sample code and documentation
b. Incremental updates
c. Binary documents in fields (BLOB data)
d. Selecting which columns and fields
to index
e. Highlighting hits
f. Third-party database indexing
tools
2. Indexing desktop databases
1. Indexing SQL Data and
other data sources
Sample code and documentation
This dtSearch Engine has an API that can
index any data you can access from .NET, Visual Basic, Java, or
C++. A sample database indexing application included with the
dtSearch Engine uses Active Data Objects (ADO.NET) to index any
database that can be accessed through ADO. The same
approach will work with other types of non-file data.Source
code for the database indexing sample application is installed
to these locations:
C:\Program Files\dtSearch
Developer\examples\vb.net2\ado_demo (VB.NET 2.0)
C:\Program Files\dtSearch Developer\examples\cs2\ado.net (C#
2.0)
C:\Program Files\dtSearch Developer\examples\vb.net\ado_demo
(VB.NET)
C:\Program Files\dtSearch Developer\examples\cs\ado.net
(C#)
C:\Program Files\dtSearch Developer\examples\vbasic\ado_demo
(Visual Basic 6)
The API's used to implement this feature
are documented here:
.NET 2.0:
DataSource class, documented in dtSearchNetApi2.chm
.NET 1.0:
DataSource class, documented in dtSearchNetApi.chm
C/C++: dtsDataSource in the dtSearch
Engine help file.
COM: Indexing COM Data Sources in
the dtSearch Engine help file.
Java: DataSource2 class
The ado_demo sample application uses the
database schema to determine the tables and fields that are
present in the database, and then iterates over every row of
every table, indexing the entire database. The source code can
be modified to exclude certain tables or fields from the
index.
The basic mechanism used to implement data
source indexing is the same in all of the APIs:
1. A data source object is created that
iterates over the content to be indexed. The object
implements a "GetNextDoc" method that returns the next document
to be indexed.
2. This object is attached to the
IndexJob, and when the IndexJob is executed, the dtSearch
Engine calls the GetNextDoc method repeatedly to get data to be
added to the index.
3. For each item returned by GetNextDoc,
the data source can return plain text, fielded data, and/or a
BLOB containing binary document data (such as a PDF file).
The data source can choose whatever name it wants for the
row, as long as the name has the form of a valid Win32
filename. The name will be returned as the document filename in
search results.
Usually the name includes a row ID for the
row, making it easier to retrieve a row from the database using
the filename returned in search results.
Incremental
Updates
An index
of a database that was created using the data source API can be
updated incrementally. To do this,
1. Set up the IndexJob exactly as
it was set up when the index was initially created, but set all
of the Action flags (ActionCreate, ActionRemoveDeleted, etc.)
to false), except the ActionAdd flag, which must be set to
True. If ActionCreate is true, the index will be cleared
at the start of the index update.
2. Set up your data source so it will
only iterate over database rows that are either (a) new, or (b)
have been modified since they were last indexed. If the
data source iterates over the whole database, dtSearch will
still only index new or modified rows (it checks the
modification date and size of each item to see if it has
changed before reindexing). However, incremental
updates will be much faster if you just pass dtSearch the rows
that need to be indexed.
To support incremental updates, it is often
helpful to have a boolean "NeedsIndexing" field in the database
indicating that a row requires reindexing. The field can
be set to true when a row is added or modified, and false when
a row is indexed. To perform an incremental update, the
data source would select only those rows where
NeedsIndexing=true.
To remove deleted rows from an index, set
IndexJob.ActionRemoveListed = true and set
IndexJob.ToRemoveListName to the name of a text file with a
list of the rows to delete. For more information on
removing items from an index, see "
Removing documents from an index."
Binary
documents in fields (BLOB data)
If data in a column consists of binary
files such as Word documents or Excel spreadsheets, you can
index the contents of these columns along with other fields.
The data source API provides a way to combine (field,
value) pairs from other columns of the same row with a binary
document such as a Word file into a single logical document for
indexing and searching purposes. dtSearch will index the
document just as it would if found on disk, including any
fields in the document itself (such as Document Summary
Information fields). For more information on how this is
done, see the article, "How to
add fields to documents during indexing."
Highlighting hits
To highlight hits in a retrieved document,
use the FileConverter object to convert the document to HTML,
RTF, or text, with hit highlight markings around the hits.
For documentation on FileConverter, see:
Highlighting Hits - Overview in the dtSearch Engine
help file, dtSearchApiRef.chm.
.NET:
FileConverter class, documented in dtSearchNetApi.chm
C/C++:
DFileConvertJob in dtSearchApiRef.chm.
COM:
FileConverter in dtSearchApiRef.chm.
Java:
FileConverter in dtSearchApiRef.chm.
When data has been indexed using a data
source, there is often no disk file to use as input so the
original document must be reconstructed. There are two
ways this can be done:
(1) Set up the FileConverter with the same data that was
supplied to the IndexJob in the DataSource object. For
example, in the .NET API, set FileConverter.InputText,
InputFields, and InputBytes to the values returned through the
data source as DocText, DocFields, and DocBytes.
(2) Using the version 7 index
format, you can set up the index so it will cache each
document as it is indexed. Then instead of reconstructing
the document to highlight hits, you can set FileConverter.Flags
= dtsConvertGetFromCache, so the cached copy of the document
will be used to highlight hits.
Selecting which tables and
columns to index
The ado_demo samples all implement a
SampleDataSource class that traverses the entire database,
using schema information to get the list of tables and columns
in the database. To modify the sample code so it
indexes only certain tables or only certain fields, modify the
SampleDataSource class, changing the code that gets the list of
tables and columns from the schema to instead use a specific
list of tables and columns.
When indexing very
large tables, your indexing application may run out of memory
if you attempt to select the entire table at once. If
this occurs, you can modify your data source implementation to
select batches of rows (i.e., 1-100,000, 100,001-200,000, etc.)
to index.
Third-party database-indexing
tools
Cybergroup's dbConnector is
available from Cybergroup and can be used to index any
ODBC-compliant database using a GUI. For more
information, see:
http://www.dtsearch.com/CS_CybergroupdbConnec.html
2. Indexing desktop databases
with the dtSearch Engine
dtSearch indexes desktop database formats
directly, without the need for ADO or OBDC drivers.
These include: Microsoft Access (*.mdb,
*.accdb), XML, CSV (comma-separated values), and XBase (FoxPro,
dBASE, and other .DBF-compatible formats).
Additional Information
For more information about indexing
databases with dtSearch, please see the following topics:
Field Searching in the dtSearch Text Retrieval Engine
Programmer's Reference (dtSearchApiRef.chm)
Indexing Databases in the dtSearch Text Retrieval Engine
Programmer's Reference (dtSearchApiRef.chm)
How to get
field data in search results
How to add
fields to documents
Troubleshooting database
indexing
How to use
dtSearch Web with dynamically-generated
content
|