Last Reviewed: March 9, 2006
Article: DTS0180
Applies to: dtSearch Web 7 and later
To use dtSearch Web to search a web site, you must first
create an index of the web site with the dtSearch Indexer.
If the web site consists of static web pages (HTML files,
PDF files, etc.), you can click Add folder in the dtSearch
Indexer to add the folders containing the web pages to the
index. However, if the web site is dynamically generated
by a database, an ASP or ASP.NET program, or a content
manager such as Microsoft Content Management Server (MCMS)
or SharePoint, there is no folder of web pages to index.
Instead, you can use the dtSearch Spider to crawl the web
site.
To index your web site with the dtSearch Spider, click
Add Web in the dtSearch Indexer and provide the starting
address for the crawl (usually your site's home page).
The dtSearch Indexer will traverse the web site by
following the links connecting the pages. Because the Spider
follows the same links that a web browser would use to navigate
your site, it will be able to index the dynamically-generated
content just as it is presented on your web site.
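The link-following traversal described above can be illustrated with a short Python sketch. This is not the dtSearch Spider's implementation; it is a generic breadth-first crawl that stays on the starting host. The `crawl` and `LinkCollector` names, the in-memory `SITE` dictionary, and the example.com URLs are all assumptions made for illustration, and `SITE.get` stands in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href targets of <a> tags found in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first traversal from start_url, staying on the same host.

    `fetch` is a caller-supplied function that maps a URL to its HTML
    text, or returns None if the page is unavailable.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target = urljoin(url, href)  # resolve relative links
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical in-memory "site" standing in for real HTTP requests.
SITE = {
    "http://www.example.com/":
        '<a href="/products.asp?id=1">One</a> <a href="about.html">About</a>',
    "http://www.example.com/products.asp?id=1": '<a href="/">Home</a>',
    "http://www.example.com/about.html":
        '<a href="http://other.example.org/">Elsewhere</a>',
}

pages = crawl("http://www.example.com/", SITE.get)
```

Because the traversal only sees what each URL returns, a dynamically generated page such as `products.asp?id=1` is handled the same way as a static file.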
For programmers, there is a .NET API for the Spider in the
dtSearch Text Retrieval Engine. For API documentation, see
the dtSearchNetApi.chm help file.
To ensure that you can highlight hits in documents
retrieved from a dynamically-generated site, create the
index with the "Cache documents" and "Cache original files"
options enabled. These options are set in the
Index > Create (Advanced) dialog box. When content is
cached in the index, dtSearch and dtSearch Web can
highlight hits from the cached data without downloading
the pages again from the site, which makes hit
highlighting faster and more reliable. For more
information on these options, see "Caching Documents and
Text in an Index" in the dtSearch help file.
dtSearch 7.0 or later is needed to create indexes with
caching. If you have an earlier version, see "dtSearch 7
Upgrade Information". For more information on the
version 7 index format, see:
http://www.dtsearch.com/index7.html
The option that controls whether hits are highlighted in
content indexed using the Spider is in the Search Results
tab of the Form Builder dialog box in dtSearch Web Setup.
You can also change this setting after a search form is
created by editing the dtSearch_options.html file, where
it is controlled by this item:
<BR><HR><I>Highlight documents indexed via HTTP: </I>
<!-- $Begin HighlightHttpDocs -->
1
<!-- $End -->
If the option is set to 0 (off), dtSearch Web returns
direct links to any pages indexed by the Spider, so each
page is displayed just as it normally appears. If the
option is set to 1 (on), dtSearch Web requests the page
itself, adds hit highlight markings, and then displays the
page with hits highlighted.
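If you maintain several search forms, the manual edit described above could be scripted. The following Python sketch is one possible approach, not a dtSearch tool: it rewrites the value between the `$Begin HighlightHttpDocs` and `$End` markers in the text of a dtSearch_options.html file. The function name `set_highlight_http_docs` and the sample text are assumptions for illustration, and the sketch assumes the item is laid out as in the excerpt above. Back up the real file before changing it.

```python
import re

# Matches the HighlightHttpDocs item: opening marker, value, closing marker.
_OPTION_RE = re.compile(
    r"(<!--\s*\$Begin\s+HighlightHttpDocs\s*-->)(.*?)(<!--\s*\$End\s*-->)",
    re.DOTALL,
)


def set_highlight_http_docs(options_html, enabled):
    """Return options_html with the HighlightHttpDocs value set to 1 or 0."""
    value = "1" if enabled else "0"
    new_html, count = _OPTION_RE.subn(
        lambda m: f"{m.group(1)}\n{value}\n{m.group(3)}",
        options_html,
        count=1,
    )
    if count == 0:
        raise ValueError("HighlightHttpDocs item not found")
    return new_html


# Sample text mirroring the excerpt above (not a complete options file).
sample = (
    "<BR><HR><I>Highlight documents indexed via HTTP: </I>\n"
    "<!-- $Begin HighlightHttpDocs -->\n"
    "1\n"
    "<!-- $End -->\n"
)
updated = set_highlight_http_docs(sample, enabled=False)
```

To apply this to a real form, read the file's text, pass it through the function, and write the result back.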
Often pages generated by a content manager will
contain sections of HTML that you would not want to be indexed,
such as the table of contents and navigation menus. To
tell dtSearch not to index parts of an HTML file, add HTML
comments around the text to be excluded, like this:
<!--BeginNoIndex-->
... nothing here will be searchable...
<!--EndNoIndex-->
The BeginNoIndex and EndNoIndex tags must look exactly as
they do in this example. dtSearch will skip everything
between the two markers when it indexes web pages.
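To check that the markers are placed where you intend, it can help to preview roughly what remains of a page once the marked regions are dropped. The Python sketch below approximates that effect with a regular expression. It is not dtSearch's parser, and the `indexable_html` name and the sample page are assumptions for illustration.

```python
import re

# Matches one marked region, including the markers themselves.
NOINDEX_RE = re.compile(
    r"<!--BeginNoIndex-->.*?<!--EndNoIndex-->",
    re.DOTALL,
)


def indexable_html(html):
    """Return html with every BeginNoIndex/EndNoIndex region removed."""
    return NOINDEX_RE.sub("", html)


# Hypothetical page: a navigation menu wrapped in the markers,
# followed by the content that should stay searchable.
page = (
    "<html><body>\n"
    "<!--BeginNoIndex-->\n"
    "<ul><li>Home</li><li>Products</li></ul>\n"
    "<!--EndNoIndex-->\n"
    "<p>Article text that should be searchable.</p>\n"
    "</body></html>"
)
preview = indexable_html(page)
```

In this sample, the navigation list disappears from `preview` while the article paragraph remains.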
In the dtSearch Indexer, you can use filename
filters and exclude filters to limit indexing by filename or
folder name. For example, you could use a filter of
*/OnlyThisFolder/* to limit indexing to documents in a folder
named OnlyThisFolder, or you could use an exclude filter of
*/NotThisFolder/* to prevent anything in the folder named
NotThisFolder (or its subfolders) from being indexed. For
more information on filename filters, see "How to exclude
folders from an index".
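The effect of the two example filters can be illustrated with Python's standard `fnmatch` wildcards as a stand-in. dtSearch's own matching rules (for example, case sensitivity or how include and exclude filters interact) may differ, so treat this as an approximation. The `would_index` function, the rule that exclude filters take precedence, and the example.com URLs are assumptions for illustration.

```python
from fnmatch import fnmatchcase


def would_index(path, include_filters=(), exclude_filters=()):
    """Approximate filter check using shell-style wildcards.

    Assumption: an exclude match always wins, and an empty include
    list means "include everything".
    """
    if any(fnmatchcase(path, pattern) for pattern in exclude_filters):
        return False
    if include_filters:
        return any(fnmatchcase(path, pattern) for pattern in include_filters)
    return True


only = ["*/OnlyThisFolder/*"]
skip = ["*/NotThisFolder/*"]

in_only = would_index(
    "http://www.example.com/OnlyThisFolder/a.html", include_filters=only)
outside_only = would_index(
    "http://www.example.com/Other/a.html", include_filters=only)
in_skipped_subfolder = would_index(
    "http://www.example.com/NotThisFolder/sub/a.html", exclude_filters=skip)
```

Here `in_only` is true, while `outside_only` and `in_skipped_subfolder` are false, because `*` in these patterns also matches the `/` separators of subfolders.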
Additionally, the dtSearch Spider checks for
robots.txt and robots META tags in web pages, so you can use a
robots.txt file or embedded tags in web pages to specify
whether they should be indexed, and whether the Spider should
check them for links when indexing the site. For
more information on robots.txt and the robots META tag
standard, see:
http://www.robotstxt.org/wc/meta-user.html
http://www.robotstxt.org/wc/exclusion.html
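Before crawling, you can check what a robots.txt file permits by using Python's standard `urllib.robotparser` module. The sketch below parses an in-memory robots.txt rather than fetching one. The rules, the example.com URLs, and the `ExampleSpider` user-agent name are assumptions for illustration; this article does not state which user-agent name the dtSearch Spider sends.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: keep all crawlers out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleSpider" is a placeholder user-agent name.
home_allowed = parser.can_fetch(
    "ExampleSpider", "http://www.example.com/index.html")
private_allowed = parser.can_fetch(
    "ExampleSpider", "http://www.example.com/private/report.html")
```

With these rules, `home_allowed` is true and `private_allowed` is false. Page-level robots META tags are separate from robots.txt and are not covered by this check.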
For more information on creating indexes, see "dtSearch
Quick Start".
For more information on setting up dtSearch Web, see
"dtSearch Web Quick Start".
For more information on using the dtSearch Spider to index
web sites, see "How to index a web site with the dtSearch
Spider".