Run-time indexer configuration and usage
========================================

Configuration
-------------

First, configure UdmSearch. Most of the indexer configuration is covered
by the indexer.conf-dist file, which you can find in the etc directory of
the UdmSearch distribution. You may also take a look at the other *.conf
samples in the doc/samples directory.

To set up the indexer.conf file, cd to the etc directory of the UdmSearch
installation, copy indexer.conf-dist to indexer.conf and edit it.

To configure the search frontends (search.cgi and/or search.php3), edit
the search.htm file in the etc directory of the UdmSearch installation.
See doc/templates.txt for a detailed description.

Running indexer
---------------

Run indexer periodically (once a week, a day, an hour ...) to pick up the
latest modifications on your web sites. You may also add indexer to your
crontab.

Built-in database support notes
-------------------------------

indexer with built-in database support cannot do reindexing; it indexes
the whole site every time it is started.

SQL backend notes
-----------------

By default, indexer called without any command line arguments reindexes
only expired documents. You can change the expiration period with the
'Period' indexer.conf command. If you want to reindex all documents
regardless of whether they have expired, use the -a option. indexer will
then mark all documents as expired at startup.

When retrieving documents, indexer sends the 'If-Modified-Since' HTTP
header for documents that are already stored in the database. When indexer
gets a document, it computes the document's checksum. If the checksum is
the same as the old checksum stored in the database, indexer does not
parse the document again. The '-m' command line option prevents indexer
from sending 'If-Modified-Since' headers and makes it parse a document
even if its checksum is unchanged. This is useful, for example, when you
have changed your Allow/Disallow rules in indexer.conf and need to add
pages that were disallowed earlier.
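
The checksum and 'If-Modified-Since' behaviour described above can be
illustrated with a short Python sketch. This is not indexer's actual code:
the function names, the 'force' flag (standing in for the -m option) and
the use of CRC32 as the checksum are assumptions made for illustration
only, since the checksum algorithm is not specified here.

```python
import zlib


def checksum(body: bytes) -> int:
    # Hypothetical stand-in for indexer's checksum (the real algorithm
    # is not documented in this text).
    return zlib.crc32(body)


def request_headers(last_modified, force=False):
    """Build request headers for a document.

    'If-Modified-Since' is sent only for documents already stored in the
    database (last_modified is known) and only when -m (force) is not set.
    """
    headers = {}
    if last_modified is not None and not force:
        headers["If-Modified-Since"] = last_modified
    return headers


def should_parse(body: bytes, stored_sum, force=False) -> bool:
    """Decide whether a fetched document has to be parsed again."""
    if force:               # models the -m option: always parse
        return True
    if stored_sum is None:  # document is not in the database yet
        return True
    return checksum(body) != stored_sum


old_sum = checksum(b"<html>unchanged</html>")
print(should_parse(b"<html>unchanged</html>", old_sum))              # False
print(should_parse(b"<html>unchanged</html>", old_sum, force=True))  # True
print(should_parse(b"<html>edited</html>", old_sum))                 # True
```

With 'force' set, both shortcuts are bypassed, which matches the case of
changed Allow/Disallow rules: unchanged pages are fetched and parsed again
so that their links are re-evaluated.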

If UdmSearch retrieves a URL that answers with a redirect (HTTP status
301, 302 or 303), it indexes the URL given in the "Location:" field of the
HTTP header instead.

Subsection control with SQL backend
-----------------------------------

indexer has the -t, -u and -s options to limit its action to only a part
of the database. -t corresponds to the 'Tag' limitation, -u is a URL
substring limitation (SQL LIKE wildcards), and -s limits URLs to those
with the given HTTP status. All limit options in the same group are ORed,
and options in different groups are ANDed.

UdmSearch with the built-in database does not support subsection control.

How to clear database (SQL only)
--------------------------------

To clear the whole database, use 'indexer -C'. You may also delete only a
part of the database by using the -t, -u, -s subsection control options.

Database Statistics with SQL backend
------------------------------------

If you run 'indexer -S', it shows database statistics, including the count
of total and expired documents for each status. The -t, -u, -s filters can
be used in this mode too.

The meaning of the status is:

  0 - new (not indexed yet) URL

If the status is not 0, it is an HTTP response code. Some of the HTTP
codes are:

  200 - "OK" (URL was successfully indexed)
  301 - "Moved Permanently" (redirect to another URL)
  302 - "Moved Temporarily" (redirect to another URL)
  303 - "See Other" (redirect to another URL)
  304 - "Not modified" (URL has not been modified since last indexing)
  401 - "Authorization required" (use login/password for given URL)
  403 - "Forbidden" (you have no access to this URL)
  404 - "Not found" (there were references to URLs that do not exist)
  500 - "Internal Server Error" (error in cgi, etc.)
  503 - "Service Unavailable" (host is down, connect timeout)
  504 - "Gateway Timeout" (read timeout when retrieving document)

HTTP 401 means that the URL is password protected. You can use the
AuthBasic command in indexer.conf to set login:password for such URL(s).
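
The way the -t, -u and -s limits combine (ORed within a group, ANDed
across groups) applies to statistics mode as well, and can be illustrated
with a short Python sketch. It is only a model of the selection rule, not
indexer's implementation: the record layout, the function names and the
sample URLs are hypothetical, and the sketch assumes that a -u value is
matched as a complete SQL LIKE pattern ('%' for any run of characters,
'_' for one character).

```python
import re


def like_match(pattern: str, value: str) -> bool:
    """Match value against an SQL LIKE pattern (% = any run, _ = one char)."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def selected(doc, tags=(), urls=(), statuses=()) -> bool:
    """Return True if doc passes every non-empty limit group.

    Within one group any matching value is enough (OR); all groups that
    were given must match (AND). With no limits, every document passes.
    """
    groups = [
        (tags, lambda t: doc["tag"] == t),
        (urls, lambda p: like_match(p, doc["url"])),
        (statuses, lambda s: doc["status"] == s),
    ]
    return all(
        any(test(value) for value in values)
        for values, test in groups
        if values
    )


# Hypothetical records standing in for rows of the URL table.
docs = [
    {"url": "http://example.com/a.html", "tag": "1", "status": 200},
    {"url": "http://example.com/b.html", "tag": "2", "status": 404},
    {"url": "http://example.org/c.html", "tag": "1", "status": 404},
]

# Models: -t 1 -t 2 -s 404  ->  (tag 1 OR tag 2) AND status 404
hits = [d["url"] for d in docs if selected(d, tags=["1", "2"], statuses=[404])]
print(hits)
```

Here only the two documents with status 404 are selected, because the tag
group matches all three records while the status group narrows the result.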

HTTP 404 means that one of your documents contains an incorrect reference
(a reference to a resource that does not exist).

Take a look at HTTP-specific documentation for further explanation of the
different HTTP status codes.

Link validation (SQL only)
--------------------------

When started with the -I command line argument, indexer displays pairs of
a URL and its referrer. This is very useful for finding bad links on your
site. Don't forget to use the 'DeleteBad no' indexer.conf command for this
mode. You may use the subsection control options -t, -u, -s in this mode.
For example, 'indexer -I -s 404' will display all 'Not found' URLs
together with the referrers where links to those bad documents were found.
By setting the relevant indexer.conf commands and command line options,
you may use UdmSearch specifically for site validation purposes. Take a
look at the 'url-checker.conf' example for this mode in the doc/samples
directory of the UdmSearch distribution.

Parallel indexing (SQL only)
----------------------------

MySQL and PostgreSQL users may run several indexers simultaneously with
the same indexer.conf file. We have successfully tested 30 simultaneous
indexers with a MySQL database. indexer uses the MySQL and PostgreSQL
locking mechanisms to avoid double indexing of the same URL by different
indexer copies. Parallel indexing in the same database is not implemented
for other backends yet. You may, however, use the multi-threaded version
of indexer with any SQL backend that supports several simultaneous
connections. The multi-threaded indexer version uses its own locking
mechanism.

It is not recommended to use the same database with different indexer.conf
files! The first process could add something that the second one deletes,
and this may never stop.

On the other hand, you may run several indexer processes with different
databases with ANY supported SQL backend.