Automatic Recognition of Dates, Email Addresses, and Credit
Card Numbers
dtSearch 7.40 includes an option to automatically recognize
dates, email addresses, and credit card numbers in text during
indexing.
Dates
Date recognition looks for anything that appears to be a
date, using English-language months (including common
abbreviations) and numerical formats. Examples of date
formats that are recognized include:
January 15, 2006
15 Jan 06
2006/01/15
1/15/06
1-15-06
The fifteenth of January, two thousand six
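The idea of mapping several written forms to one date can be sketched in Python. This is only an illustration and not dtSearch's implementation: the function name parse_date, the two-digit-year pivot, and the month-first reading of numeric dates are all assumptions, and the spelled-out form ("The fifteenth of January, two thousand six") is deliberately not handled.

```python
import re
from datetime import date

MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

def _full_year(year):
    # Assumption: two-digit years are read as 2000-2099.
    return year + 2000 if year < 100 else year

def _make(year, month, day):
    # Return None instead of raising for impossible dates such as 13/45/06.
    try:
        return date(_full_year(year), month, day)
    except ValueError:
        return None

def parse_date(text):
    """Hypothetical helper: parse a few common date formats, else return None."""
    s = text.strip().lower()
    # Month name first, e.g. "January 15, 2006"
    m = re.fullmatch(r"([a-z]+)\.?\s+(\d{1,2}),?\s+(\d{2,4})", s)
    if m and m.group(1)[:3] in MONTHS:
        return _make(int(m.group(3)), MONTHS[m.group(1)[:3]], int(m.group(2)))
    # Day first with a month name, e.g. "15 Jan 06"
    m = re.fullmatch(r"(\d{1,2})\s+([a-z]+)\.?\s+(\d{2,4})", s)
    if m and m.group(2)[:3] in MONTHS:
        return _make(int(m.group(3)), MONTHS[m.group(2)[:3]], int(m.group(1)))
    # Year first, e.g. "2006/01/15"
    m = re.fullmatch(r"(\d{4})[/-](\d{1,2})[/-](\d{1,2})", s)
    if m:
        return _make(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    # Numeric, month first (assumed US order), e.g. "1/15/06" or "1-15-06"
    m = re.fullmatch(r"(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})", s)
    if m:
        return _make(int(m.group(3)), int(m.group(1)), int(m.group(2)))
    return None
```

Under these assumptions, the first five examples in the list all map to the same date value, which is why a single date() expression can stand in for all of them.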
To search for a date, put "date()" around the date
expression or range. For example, to find any of the
expressions above near the word "apple", search for:
date(jan 15 2006) w/10 apple
To search for a range of dates near the word "apple", search
for:
date(jan 10 2006 to jan 20 2006) w/10 apple
A field search for a date expression is written the same way
as a field search for a word:
DateField contains date(jan 10 2006 to jan 20 2006)
Open-ended ranges are not supported. To search for any
date after or before a particular date, enter a bounded range
that uses the maximum or minimum supported year as the other
bound. The maximum value for a year is 2900, and the minimum
value is 1000. For example, to find any date after
January 10, 2006:
DateField contains date(jan 10 2006 to jan 1 2900)
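An application that builds search requests could fill in the missing bound automatically. The following Python sketch shows one way to do that. The helper name date_range is hypothetical, and the choice of January 1 as the day for both substituted bounds is an assumption modeled on the example above.

```python
from datetime import date

# Year limits stated in the documentation above.
MIN_YEAR, MAX_YEAR = 1000, 2900

_MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
           "jul", "aug", "sep", "oct", "nov", "dec"]

def _fmt(d):
    # Render a date in the "jan 10 2006" style used in the examples.
    return f"{_MONTHS[d.month - 1]} {d.day} {d.year}"

def date_range(start=None, end=None):
    """Hypothetical helper: build a date() range, substituting a limit for a missing bound."""
    lo = start if start is not None else date(MIN_YEAR, 1, 1)
    hi = end if end is not None else date(MAX_YEAR, 1, 1)
    if lo > hi:
        raise ValueError("start date must not be later than end date")
    return f"date({_fmt(lo)} to {_fmt(hi)})"
```

For instance, date_range(start=date(2006, 1, 10)) produces the same expression as the example above, and passing only end produces a range that starts at January 1, 1000.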
Email Addresses
Email address recognition looks for text that follows the
syntax for a valid email address (example:
sales@dtsearch.com). This makes it possible to search for
a specific email address regardless of the alphabet settings
for the @ and . characters, as well as any other punctuation
that may be present in an email address. Also, this makes
it possible to use the word listing functions in dtSearch to
enumerate all email addresses in a document collection.
To search for an email address, put "mail()" around the
address. The * and ? wildcard expressions are supported
inside the () marks. Examples:
mail(sales@dtsearch.com)
mail(s*@dtsearch.com)
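To make the wildcard behavior concrete, here is a small Python sketch that filters a list of addresses with a pattern such as s*@dtsearch.com. It assumes the usual wildcard meaning (* matches any run of characters, ? matches exactly one) and case-insensitive comparison. Neither the helper names nor the matching rules are taken from dtSearch itself, and every address other than sales@dtsearch.com is made up for illustration.

```python
import re

def wildcard_to_regex(pattern):
    """Hypothetical helper: translate * and ? wildcards into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # assumed: any run of characters, possibly empty
        elif ch == "?":
            parts.append(".")    # assumed: exactly one character
        else:
            parts.append(re.escape(ch))  # @ and . are matched literally
    return re.compile("".join(parts), re.IGNORECASE)

def matching_addresses(pattern, addresses):
    """Return the addresses that match the whole wildcard pattern, in input order."""
    rx = wildcard_to_regex(pattern)
    return [a for a in addresses if rx.fullmatch(a)]

# Illustrative data; only the first address appears in the documentation.
ADDRESSES = [
    "sales@dtsearch.com",
    "support@dtsearch.com",
    "info@dtsearch.com",
    "sales@example.com",
]
```

With this data, the pattern s*@dtsearch.com selects the sales and support addresses at dtsearch.com and skips the other two.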
Credit Card Numbers
Credit card number recognition looks for any sequence of
digits that appears to satisfy the criteria for a valid
credit card number issued by one of the major credit card
issuers. Credit card numbers are recognized regardless of
the pattern of spaces or punctuation embedded in the
number. Examples:
1234-5678-1234-5678
1234567812345678
1234 5678 1234 5678
Numerical tests used by the credit card issuers for card
validity are used to exclude sequences of numbers that are not
credit card numbers. However, these tests are not perfect
and so the credit card number recognition feature may pick up
some numbers that are not really credit card numbers.
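The best-known numerical test of this kind is the Luhn checksum, which most major card issuers use. The text above does not say which tests dtSearch applies, so the Python sketch below is only an illustration of the general approach, with hypothetical function names: strip spaces and punctuation, then verify the checksum.

```python
def digits_only(text):
    """Drop spaces, dashes, and any other non-digit characters."""
    return "".join(ch for ch in text if ch in "0123456789")

def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in digits_only(number)]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0
```

The widely used test number 4111 1111 1111 1111 passes this check in any of the three layouts shown above, while the placeholder 1234 5678 1234 5678 does not. Because roughly one in ten arbitrary digit sequences passes by chance, a checksum alone cannot rule out every false positive. A fuller recognizer would presumably also check length and issuer prefix.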
To search for a credit card number, put "creditcard()"
around the number. Example:
creditcard(1234*)
Enabling automatic recognition of dates, email addresses,
and credit card numbers
In dtSearch Desktop, click Options > Preferences >
Indexing Options, and check the box to "Automatically recognize
dates in text."
In the dtSearch Engine API, set the flag
dtsoTfRecognizeDates in Options.TextFlags.
Currently there is no option to separately control whether
dates, email addresses, and credit card numbers are
recognized.
Word lists
To list all dates, credit card numbers or email addresses in
an index, you can use the word listing functions in dtSearch
Desktop (Index > List Index Contents...). In the
dtSearch Engine API, you can use ListIndexJob (.NET) or
DListIndexJob (C++).
The same syntax used in search requests works in the listing
functions, so if you generate a list using "creditcard(*)", you
will get a list of all credit card numbers in the index.
Effect on performance
Indexing will be slower with the recognition feature
enabled.
Searching for dates, email addresses, and credit card
numbers can be substantially faster because you can search for
a single unique expression instead of having to search for many
different variations. For example, a single search
for:
creditcard(1234123412341234)
will find that credit card number regardless of the presence
of spaces or punctuation between the digits. Covering even
the most common variations on credit card number formats
without this feature would require a much more complex search
request that would take more processing time. Similarly, it
will be much faster to search for:
date(January 15, 2005)
than to search for the many ways this date could be
expressed in text.
What about phone numbers, social security numbers,
etc.?
Currently these are not recognized, although we may add this
in a future version. There is a trade-off between
completeness and false positives that gets worse as more types
of numerical data are recognized. Credit card numbers can
be verified to some extent, while telephone numbers and social
security numbers cannot, so adding support for these types of
numbers would generate many more false positives.