Character sets
==============


Supported character sets
------------------------
UdmSearch supports the following character sets:

Cyrillic group:
	koi8-r, windows-1251, iso-8859-5, cp866, x-mac-cyrillic

Western group:
	iso-8859-1

Central Europe group:
	windows-1250, iso-8859-2

Arabic group:
	windows-1256 


Recoding
--------
indexer recodes all documents to the character set specified by the
"LocalCharset" command in indexer.conf. Recoding is possible only
within a single character set group, and is currently implemented for
the "Cyrillic" and "Central Europe" groups. indexer never recodes
between character sets from different groups, for example from
Cyrillic koi8-r to Western iso-8859-1.
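
The group rule can be illustrated with a short Python sketch. It is not
UdmSearch code: the function names, the mapping to Python codec names, and
the choice to return cross-group input unchanged are all assumptions made
for illustration.

```python
# Charset groups as listed in this document. The Python codec names on the
# right are an assumption of this sketch, not UdmSearch identifiers.
GROUPS = {
    "Cyrillic": {
        "koi8-r": "koi8_r",
        "windows-1251": "cp1251",
        "iso-8859-5": "iso8859_5",
        "cp866": "cp866",
        "x-mac-cyrillic": "mac_cyrillic",
    },
    "Western": {"iso-8859-1": "latin_1"},
    "Central Europe": {"windows-1250": "cp1250", "iso-8859-2": "iso8859_2"},
    "Arabic": {"windows-1256": "cp1256"},
}


def group_of(charset):
    """Return the group name for a charset, or None if it is not listed."""
    name = charset.lower()
    for group, members in GROUPS.items():
        if name in members:
            return group
    return None


def recode(data, source, local):
    """Recode bytes from `source` to `local` only when both share a group.

    Hypothetical behaviour: input from another group (or an unknown
    charset) is returned unchanged.
    """
    group = group_of(source)
    if group is None or group != group_of(local):
        return data
    codecs = GROUPS[group]
    text = data.decode(codecs[source.lower()], errors="replace")
    return text.encode(codecs[local.lower()], errors="replace")
```

For example, `recode(page, "koi8-r", "windows-1251")` converts the bytes,
while `recode(page, "koi8-r", "iso-8859-1")` leaves them as they are.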


Document charset detection
--------------------------
indexer determines a document's character set in this order:

1) The "Content-type: text/html; charset=xxx" HTTP header
2) The <META NAME="Content" CONTENT="text/html; charset=xxx"> tag
3) The default from the "Charset" indexer.conf command
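
The lookup order above can be sketched in Python as follows. This is only
an illustration under stated assumptions: the function names and the
regular expressions are hypothetical and do not reproduce how indexer
actually parses headers or HTML.

```python
import re

# Matches "charset=xxx" inside a header value or a META tag (assumed syntax).
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)
META_RE = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)


def charset_from_header(content_type):
    """Step 1: charset parameter of the HTTP Content-type header, if any."""
    if not content_type:
        return None
    match = CHARSET_RE.search(content_type)
    return match.group(1).lower() if match else None


def charset_from_meta(html):
    """Step 2: charset declared in the first META tag that carries one."""
    for tag in META_RE.findall(html):
        match = CHARSET_RE.search(tag)
        if match:
            return match.group(1).lower()
    return None


def detect_charset(content_type, html, default):
    """Apply the three steps in order; `default` stands in for "Charset"."""
    return charset_from_header(content_type) or charset_from_meta(html) or default
```

A header charset wins over a META tag, and the configured default is used
only when neither is present.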


Automatic charset guesser
-------------------------
There is also an automatic Cyrillic charset guesser, which is not
compiled by default. You can enable it with the "--with-charset-guesser"
configure argument. If the automatic character set guesser was built at
installation time, the three detection methods above are used only when
automatic guessing fails.
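
The resulting precedence can be sketched in Python. The guesser itself is
not shown: `guesser` is a hypothetical callable standing in for it, and
`declared` is whatever charset the header, META tag, or "Charset" default
produced.

```python
from typing import Callable, Optional


def resolve_charset(
    body: bytes,
    declared: str,
    guesser: Optional[Callable[[bytes], Optional[str]]] = None,
) -> str:
    """Prefer the guesser's answer; fall back to the declared charset.

    `guesser` is None when the guesser was not built in. A guesser that
    returns None models a failed guess.
    """
    if guesser is not None:
        guessed = guesser(body)
        if guessed:
            return guessed
    return declared
```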