Unicode support
Last Reviewed: February 6, 2009
Article: DTS0140
Applies to: dtSearch 7
dtSearch supports indexing and searching
Unicode text. This article will describe what is and is not
covered in this support, and will provide additional
information about how dtSearch Unicode support works with
different operating systems and document types.
Contents
Background
dtSearch Support for Unicode
File Formats
Language Issues
Alphabet Customization
Troubleshooting Encoding Problems
See also: International Language Features in dtSearch
Background
Unicode. Unicode is a
specification that allows text in any language to be encoded in
a consistent way. Detailed information on the Unicode
specification is available at www.unicode.org.
UTF-8. UTF-8 is a
widely-used, compact encoding of Unicode text that preserves
all information in a Unicode string. For example, Java uses a
variant of UTF-8 to store strings in class files and serialized
data. In UTF-8, characters 0 through 127 are encoded as the
single-byte ASCII values 0 through 127. All other characters
are encoded as sequences of two to four bytes, each with a
value of 128 or greater. As a result, UTF-8 encoded strings do
not contain embedded NULL bytes unless the text itself contains
a NULL character.
Additional information on UTF-8 is available at www.unicode.org.
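As a rough illustration of the byte ranges described above, the
following Python sketch encodes a few arbitrary sample
characters as UTF-8 and prints the resulting byte values. It is
not part of dtSearch, and the helper name utf8_bytes is made up
for this example.

```python
# Illustration only: how UTF-8 represents ASCII and non-ASCII characters.
samples = ["A", "\u00e9", "\u4e2d", "\U0001f600"]

def utf8_bytes(text):
    """Return the UTF-8 byte values of text as a list of integers."""
    return list(text.encode("utf-8"))

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_bytes(ch)}")

# An ASCII character occupies one byte below 128. Every byte of a
# multi-byte sequence is 128 or greater, so no zero byte can appear.
assert utf8_bytes("A") == [65]
assert all(b >= 128 for b in utf8_bytes("\u00e9\u4e2d\U0001f600"))
```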
Fonts. If characters are appearing as small
rectangles, your system font may not support display of
characters in the language you are searching. Microsoft Office
contains a useful "Arial Unicode MS" font with coverage of
nearly every character in every language included in the
Unicode standard. Use Windows "Display Options" to select this
font for use in menus and message boxes. In dtSearch Desktop,
use Options > Preferences > Display Options to select the
font used to display documents in the dtSearch viewer
window.
Keyboard and
Character Sets. To add support for additional
languages and keyboards to your Windows system, use the
Regional Options tool in Control Panel.
The Windows charmap.exe program provides
another way to enter non-English text. To access it, click
Start > Programs > Accessories > System Tools >
Character Map.
dtSearch Support for Unicode
dtSearch Unicode support means that dtSearch
can index and search documents containing Unicode-encoded data.
dtSearch Unicode support is built into the dtSearch Engine and
works on all 32-bit and 64-bit versions of Windows, including
Windows 95, Windows 98, and Windows ME (which do not themselves
have Unicode support). dtSearch can support Unicode even under
non-Unicode versions of Windows because the necessary data is
built into the dtSearch Text Retrieval Engine.
dtSearch supports 8-bit (UTF-8) and 16-bit
(UCS-2) encodings of Unicode. UTF-32 (also called UCS-4), a
32-bit encoding of Unicode that can express characters beyond
the original 65,536-character limit in Unicode, is not yet
supported.
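To make the 16-bit limit concrete, here is a small Python
sketch that compares a character inside the 16-bit range with
one outside it. It is illustrative only and says nothing about
dtSearch internals; the helper name code_unit_counts is an
assumption made for this example.

```python
def code_unit_counts(ch):
    """Return (utf8_bytes, utf16_units, utf32_units) for one character."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")) // 2,  # number of 16-bit code units
        len(ch.encode("utf-32-le")) // 4,  # number of 32-bit code units
    )

# U+20AC (euro sign) fits in a single 16-bit value.
print(code_unit_counts("\u20ac"))

# U+1D11E (musical G clef) lies beyond U+FFFF, so a 16-bit encoding
# needs two units (a surrogate pair), while a 32-bit encoding needs one.
print(code_unit_counts("\U0001d11e"))
```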
File Formats
Microsoft Office
dtSearch can automatically recognize Unicode
data in Microsoft Word, Excel and PowerPoint files.
HTML and XML
An HTML or XML file can include Unicode data if
the file uses the UTF-8 encoding. HTML files that are
stored with the UTF-8 encoding contain a META tag in the
beginning of the file that looks like this:
<meta http-equiv="Content-Type"
content="text/html; charset=utf-8">
If the file uses a different encoding, the META
tag will contain a different charset= value, like this:
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1">
HTML editors such as Microsoft FrontPage
generally have an option that lets you control the encoding
used to store HTML files.
dtSearch can index and search Unicode data in
UTF-8 encoded HTML files and can also recognize many other HTML
encodings.
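The META tag lookup described above can be sketched in Python
as follows. This is a simplified illustration, not the method
dtSearch uses: the function name declared_charset, the 4096-byte
search window, and the windows-1252 fallback are all assumptions
made for the example.

```python
import re

# Matches charset=... inside a META tag, for example:
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
_META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)

def declared_charset(html_bytes, default="windows-1252"):
    """Return the charset named in the first META tag, or default."""
    match = _META_CHARSET.search(html_bytes[:4096])  # tag sits near the top
    return match.group(1).decode("ascii").lower() if match else default

page = (
    b'<html><head><meta http-equiv="Content-Type"\n'
    b'content="text/html; charset=utf-8"></head>'
    b"<body>caf\xc3\xa9</body></html>"
)
encoding = declared_charset(page)
print(encoding)
print(page.decode(encoding))
```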
Text Files
Text files do not contain any encoding
information, so dtSearch has to infer the encoding. Unless
otherwise specified, dtSearch assumes that a text file uses the
Windows ANSI encoding. Using the Options > File Types dialog
box in dtSearch Desktop, you can set up rules to tell dtSearch
to treat certain text files as UTF-8 or as DOS Text files
(using the old "OEM" character set from DOS programs).
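The effect of such rules can be pictured with a short Python
sketch. The filename patterns, the rule table, and the choice of
code pages 1252 (ANSI) and 437 (OEM) are hypothetical values
picked for illustration; they do not reflect dtSearch's actual
configuration format.

```python
from fnmatch import fnmatch

# Hypothetical rules: filename pattern -> encoding. A file that
# matches no rule is treated as Windows ANSI (code page 1252 here).
RULES = [("*.utf8.txt", "utf-8"), ("*.dos.txt", "cp437")]
DEFAULT_ENCODING = "cp1252"

def encoding_for(filename):
    """Return the encoding assigned to filename by the first matching rule."""
    for pattern, encoding in RULES:
        if fnmatch(filename.lower(), pattern):
            return encoding
    return DEFAULT_ENCODING

def decode_file_bytes(filename, raw):
    """Decode raw bytes using the encoding chosen for filename."""
    return raw.decode(encoding_for(filename), errors="replace")

raw = "caf\u00e9".encode("cp437")  # bytes as an old DOS program would write them
print(decode_file_bytes("menu.dos.txt", raw))  # decoded with the OEM code page
print(decode_file_bytes("menu.txt", raw))      # same bytes misread as ANSI
```

The second call shows why the rule matters: the same bytes
produce a different, wrong character when the default encoding
is applied to a DOS text file.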
WordPerfect
WordPerfect files use the WordPerfect Character
Set to express non-English text. dtSearch converts WordPerfect
Character Set data to Unicode for indexing, so non-English text
in WordPerfect files is supported.
PDF
dtSearch can index and search Unicode
characters in some, but not all, PDF files. Unlike other
document formats, which usually contain text in some form, PDF
files are essentially drawing instructions that provide
information necessary to print a document on a printer or to
draw it on the screen. Many PDF files contain character
encoding information in addition to the drawing instructions,
so the content of the PDF file can be converted back to text.
In these types of PDF files, you can use the Text Select tool
in Adobe Reader to select a block of text, copy the text to the
clipboard, and paste it into another program like Notepad or
Microsoft Word. If you can use the Text Select tool in
Adobe Reader to copy and paste text from a PDF file, it means
that the file does contain meaningful character encoding
information, and so dtSearch will probably be able to index and
search the file correctly.
In some PDF files, however, only the drawing
instructions are present, and the encoding information is
either absent or random. As a result, there is no way to
convert the file back to text. In these types of PDF files,
Adobe Reader's Text Select tool will either (a) fail to work
entirely, or (b) copy meaningless text to the clipboard.
dtSearch cannot index or search this type of PDF file, because
the file is really just a picture of text and does not contain
any words.
Language Issues
Chinese, Japanese, Korean
Text in Chinese, Japanese, and Korean can be
stored in, or converted to, Unicode, so dtSearch can search for
words in these languages just as it can search for words in
other languages. However, while dtSearch can search for literal
word matches (or wildcard or fuzzy matches), there are some
limitations on the support in dtSearch for Chinese, Japanese,
and Korean text, described below.
(1) Dictionary-based word breaking
Some documents store text in a way that does
not separate the words with spaces. Instead, all of the text in
a document is run together and a language-specific dictionary
is needed to find word breaks. dtSearch does not have the
ability to identify word breaks in these documents, because it
does not include any language-specific dictionaries. To
make this type of text searchable, you can enable an option in
dtSearch to automatically insert word breaks around Chinese,
Japanese, and Korean characters. With this option enabled, each
character will be treated as a single word. In dtSearch
Desktop, this option is in Options > Preferences
> Letters and Words. In the developer API, the flag to
enable this feature is dtsoTfAutoBreakCJK in Options.TextFlags.
(2) Variations in character forms and scripts
In these languages, the same text can be
presented in different ways depending on the context. dtSearch
will search for a word as it is provided in the search request
and does not generate additional grammatical or script
variations for words in Chinese, Japanese, and Korean.
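The automatic word-break option described in item (1) can be
pictured with the following Python sketch. It is an independent
illustration of the idea, not dtSearch code: the function name
break_cjk and the short list of character ranges are assumptions,
and real coverage would need more ranges.

```python
import re

# A few common blocks: CJK Unified Ideographs, Hiragana and Katakana,
# and Hangul Syllables. This list is deliberately incomplete.
_CJK = re.compile(r"([\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7a3])")

def break_cjk(text):
    """Surround each CJK character with spaces, then split into words."""
    return _CJK.sub(r" \1 ", text).split()

# Latin text keeps its normal word boundaries; each CJK character
# becomes a separate one-character word.
print(break_cjk("dtSearch\u652f\u6301\u4e2d\u6587 search"))
```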
For background information on handling text in
these languages, and resources for software developers, see the
CJK Institute site at www.cjk.org.
The dtSearch Engine has an API that can be used
to integrate with dictionary-based language analyzers from
companies such as Basis Technologies. For more
information, see How to integrate the
dtSearch Engine with a language analyzer.
Word Prefixes and Suffixes (Arabic)
In some languages such as Arabic, the
surrounding context for a word (my, your, the, a,
masculine/feminine, etc.) can be expressed as characters added
in front of or behind the word. For example, "the apple" or "my
apple" would not be two words but would be different prefixes
or suffixes added to "apple". To search for text in these
languages, adding a * in the front and back of the word will
pick up most of the variants, like this: *apple*.
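A short Python sketch shows why the surrounding wildcards help.
The word list uses made-up Latin-letter stand-ins for a stem
with attached prefixes and suffixes, and fnmatchcase from the
Python standard library stands in for dtSearch's own wildcard
matching, which may behave differently in detail.

```python
from fnmatch import fnmatchcase

# Hypothetical indexed words: a bare stem, the stem with a prefix,
# the stem with a suffix, and an unrelated word.
indexed_words = ["tuffah", "altuffah", "tuffahi", "bitaqa"]

def wildcard_search(pattern, words):
    """Return the words that match a *-style wildcard pattern."""
    return [w for w in words if fnmatchcase(w, pattern)]

print(wildcard_search("tuffah", indexed_words))    # bare stem only
print(wildcard_search("*tuffah*", indexed_words))  # prefixed and suffixed forms too
```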
Arabic and Hebrew PDF Files
Some PDF files store Arabic and Hebrew
text in reversed order, from left to right, instead of the
logical order in which the characters occur in the text
(right to left). In these files, every word is stored
spelled backward, and every line of text has its words in
reversed order. dtSearch
checks for this condition when it indexes PDF files and
inverts the order of the characters within reversed Hebrew
and Arabic words, so these words will still be searchable.
However, to enable hit highlighting to work, dtSearch
does not reverse the order of words on each line, so words
within a line will be indexed in the actual order they occur
in the PDF file.
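The behavior described above (characters within each word are
reversed, but the order of words on the line is left alone) can
be sketched in Python. This is a simplified illustration, not
dtSearch's implementation; the function names are assumptions,
and the sketch ignores details such as combining marks.

```python
import unicodedata

def is_rtl_word(word):
    """True if the word contains a Hebrew or Arabic letter."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in word)

def fix_reversed_words(stored_line):
    """Reverse the characters of each right-to-left word, keeping word order."""
    return " ".join(
        word[::-1] if is_rtl_word(word) else word
        for word in stored_line.split(" ")
    )

# Two Hebrew words written in logical (reading) order.
logical = "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd"
# How a visual-order PDF would hold the same line: fully reversed.
stored = logical[::-1]

fixed = fix_reversed_words(stored)
# Each word is now spelled correctly, but the words are still in
# the order in which they occur in the stored line.
print(fixed.split(" ") == list(reversed(logical.split(" "))))
```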
Accent-insensitive indexing
dtSearch 6 and later can create indexes that are either
"accent-sensitive" or "accent-insensitive." An
accent-insensitive index converts characters, wherever
possible, to a "Base" character which is either one of the
letters A-Z or one of the digits 0-9. Accent-insensitive
indexes are generally easier to use because they ensure that a
document will be found even if the author omitted an accent, or
if the user entering a search request omitted an accent, in
typing a word. The following are examples of the character
conversions done in an accent-insensitive index:
| Character | Unicode value | "Base" Character |
| Å | U+00C5 | A |
| ç | U+00E7 | c |
| Superscript 8 | U+2078 | 8 |
| Arabic-Indic digit 8 | U+0668 | 8 |
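The conversions in the table can be approximated with Python's
standard unicodedata module. This is a sketch of one possible
folding approach, assumed for illustration; dtSearch uses its
own built-in character tables, and the function name
base_character is hypothetical.

```python
import unicodedata

def base_character(ch):
    """Map one character to an unaccented letter or ASCII digit if possible."""
    # Digit values cover superscripts and digits from other scripts.
    digit = unicodedata.digit(ch, None)
    if digit is not None:
        return str(digit)
    # Decompose the character and drop any combining accent marks.
    decomposed = unicodedata.normalize("NFKD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped or ch

# The four characters from the table above.
for ch in ["\u00c5", "\u00e7", "\u2078", "\u0668"]:
    print(f"U+{ord(ch):04X} -> {base_character(ch)}")
```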
In an accent-sensitive index, each letter
is converted to lower case where possible but otherwise
characters are indexed using their Unicode values. In an
accent-sensitive index, ç and c would be considered different
letters, and a search for one would not find the other.
Alphabet Customization
dtSearch versions 5 and earlier used "alphabet"
files with a .ABC extension to provide for customization of the
handling of 8-bit characters. This made it possible to define,
for each character in the range from 33 to 255, whether it was
a letter or not and the rules for capitalization and accents.
dtSearch still uses .ABC files, but only for characters in the
range from 33 to 127. All other characters are handled
according to the definitions in the Unicode character
tables.
Troubleshooting Encoding Problems
No accented letters appear in the indexed word list
Indexes are created accent-insensitive by
default. This means that all letters are converted to a-z
whenever possible, and a search for é is considered equivalent
to a search for e. Therefore, no accented letters will appear
in the indexed word list. To make an accent-sensitive index,
check the "accent sensitive" option in the Create Index dialog
box when you create the index.
Text files appear incorrectly in dtSearch, and the words in the
indexed word list have missing or scrambled accented characters
Please see this article for troubleshooting
steps: Troubleshooting encoding
detection