Character sets
==============

Supported character sets
------------------------

UdmSearch supports the following character sets:

Cyrillic group:        koi8-r, windows-1251, iso-8859-5, cp866, x-mac-cyrillic
Western group:         iso-8859-1
Central Europe group:  windows-1250, iso-8859-2
Arabic group:          windows-1256

Recoding
--------

indexer recodes all documents to the character set specified by the
"LocalCharset" command in indexer.conf. Recoding is possible only within
a single character set group, and is currently implemented for the
"Cyrillic" and "Central Europe" groups. indexer never recodes between
character sets from different groups, for example from Cyrillic koi8-r
into Western iso-8859-1.

Document charset detection
--------------------------

indexer determines a document's character set by checking the following
sources in this order:

1) The HTTP header "Content-type: text/html; charset=xxx"
2) <META NAME="Content-Type" CONTENT="text/html; charset=xxx">
3) The default from the "Charset" command in indexer.conf

Automatic charset guesser
-------------------------

There is also an automatic Cyrillic charset guesser, which is not compiled
by default. You can enable it with the "--with-charset-guesser" configure
argument. If the guesser was built at installation time, the three methods
above are used only when automatic guessing fails.
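
The rules in this section can be illustrated with a small sketch. It is not UdmSearch code: the function names (group_of, can_recode, charset_param, detect_charset) are hypothetical and only model the behaviour described in the text. It assumes that the second source in the detection order is a charset declared in an HTML META tag, and that a successful guess from the optional guesser takes precedence over the other three sources.

```python
import re

# Character set groups as listed in this document.
CHARSET_GROUPS = {
    "Cyrillic": {"koi8-r", "windows-1251", "iso-8859-5", "cp866", "x-mac-cyrillic"},
    "Western": {"iso-8859-1"},
    "Central Europe": {"windows-1250", "iso-8859-2"},
    "Arabic": {"windows-1256"},
}

# Groups for which recoding is currently implemented, per the text.
RECODABLE_GROUPS = {"Cyrillic", "Central Europe"}

# Matches the "charset=xxx" parameter of a Content-Type style string.
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def group_of(charset):
    """Return the group name for a charset, or None if it is not listed."""
    name = charset.lower()
    for group, members in CHARSET_GROUPS.items():
        if name in members:
            return group
    return None


def can_recode(source, local):
    """True if both charsets share a group for which recoding is implemented."""
    group = group_of(source)
    return group is not None and group == group_of(local) and group in RECODABLE_GROUPS


def charset_param(value):
    """Extract the charset from a Content-Type style string, or return None."""
    match = _CHARSET_RE.search(value or "")
    return match.group(1).lower() if match else None


def detect_charset(http_content_type, meta_content_type, default, guessed=None):
    """Pick a charset: guesser result (if any), then HTTP header, META tag, default."""
    return (guessed
            or charset_param(http_content_type)
            or charset_param(meta_content_type)
            or default)
```

For example, detect_charset("text/html", "text/html; charset=cp866", "windows-1251") falls through the HTTP header, which carries no charset, and picks up cp866 from the second source. can_recode("koi8-r", "iso-8859-1") is false because the two character sets belong to different groups.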