MOMspider -- Avoiding and Leafing Specific URLs

Not all URLs can be safely traversed by a spider, and there are many URLs for which it makes no sense to collect maintenance information. Furthermore, many URLs (and a few entire sites) are simply not intended for non-human traversal. For these reasons, MOMspider employs a mechanism of system-wide and per-user Avoid and Sites files, in combination with the Robot Exclusion Protocol, to allow both the user and any information providers to maintain control over which URLs can be accessed by a compliant spider.

Before any link is tested, the destination site is looked up in a table of recently accessed sites (the definition of "recently" is set via the configuration default CheckInterval or via the instruction directive SitesCheck). If it is not found, that site's /robots.txt document is requested and parsed for restrictions to be placed on MOMspider robots. Any such restrictions are added to the user's avoid list and the site is added to the site table, both with expiration dates indicating when the site must be checked again. Although this opens the possibility of a discrepancy between the restrictions applied and the contents of a recently changed /robots.txt document, it is necessary to avoid a situation in which the site checks place a greater load on the server than the maintenance requests alone would.
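
The following is a minimal sketch of that caching behavior. It is written in Python rather than MOMspider's own Perl, the interval constant is an assumed stand-in for the CheckInterval default, and none of the names correspond to actual MOMspider code; it only illustrates the logic described above.

    # Illustrative sketch only -- not MOMspider's actual implementation.
    # A site is asked for /robots.txt only when its entry in the site
    # table is missing or expired; the parsed rules are then cached
    # until the expiration date passes.
    import time
    import urllib.robotparser

    CHECK_INTERVAL = 15 * 24 * 3600      # assumed value; cf. CheckInterval
    site_table = {}                      # hostname -> (expires, parsed rules)

    def rules_for(host):
        entry = site_table.get(host)
        if entry is None or entry[0] < time.time():
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("http://%s/robots.txt" % host)
            rp.read()                                    # fetch and parse
            entry = (time.time() + CHECK_INTERVAL, rp)
            site_table[host] = entry
        return entry[1]

    # A link would be tested only if the cached rules permit it, e.g.
    #   rules_for("www.ics.uci.edu").can_fetch("MOMspider", url)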

One example of a /robots.txt can be seen at my site. Note that I place fewer restrictions on MOMspider than I do on other robots (a user-agent of * represents the default for all robots other than those specifically named elsewhere in the file). URLs that almost all information providers should be encouraged to "Disallow" (what MOMspider refers to as "Avoid") are those that point to dumb scripts (scripts that do not understand the difference between HEAD and GET requests).
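
For illustration, a /robots.txt along the following lines (the paths are hypothetical, not the actual contents of my file) places more restrictions on unnamed robots than on MOMspider:

    # The "*" record is the default for any robot not named elsewhere.
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /pub/tmp/

    # MOMspider is given a shorter list of restrictions.
    User-agent: MOMspider
    Disallow: /cgi-bin/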

It is important to note that avoid and sites files must always be used in pairs: the entries added to the avoid file should expire at the same time as their corresponding site entry in the sites file. Otherwise, the restrictions will fall out of step with the record of when each site was last checked, and the spider will fail to act properly.

In addition to the robot exclusion standard, the avoid files can be edited by hand (when the MOMspider process is not running) so that the user can specify particular URL prefixes to be avoided or leafed. Avoided means that no request can be made on URLs matching that prefix. Leafed means that only HEAD requests can be made on those URLs, which has the effect of preventing their traversal.
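
Conceptually, a hand-added entry pairs one of those two keywords with a URL prefix. The hypothetical entries below show the idea only; the exact column layout and expiration field should be copied from the avoid files shipped with the release:

    # Hypothetical entries -- copy the real layout from the release files.
    Avoid   http://www.example.edu/cgi-bin/waisgate
    Leaf    http://www.example.edu/archive/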

Finally, URL prefixes can be temporarily leafed (for the duration of one task) by including an Exclude directive to that effect in the task instructions.
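
For example, a task entry of roughly the following form would leaf the given prefix for that one traversal. Only the Exclude directive is taken from this document; the surrounding directives merely suggest a typical task entry and should be checked against the instruction file distributed with MOMspider:

    <Site
        Name        MyWeb
        TopURL      http://www.example.edu/
        IndexURL    http://www.example.edu/MOMspider/MyWeb.html
        Exclude     http://www.example.edu/archive/
    >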

Examples of some real avoid and sites files are included. The system avoid file provided with the release lists many URLs that are known to cause trouble for spiders. It is particularly important to avoid gateways to other protocols (e.g. wais and finger), which would cause a great deal of unnecessary computation if traversed by a spider.

Before running MOMspider on your site, you should create a /robots.txt on your server so that the spider can read it on its first traversal. You can then look at the resulting avoid and sites files to see how well MOMspider parsed the file. To force MOMspider to re-read a site's /robots.txt, delete that site's entry from the sites file before running the spider.


This documentation will be expanded in a patch to be released soon.
Roy Fielding <fielding@ics.uci.edu>
Department of Information and Computer Science,
University of California, Irvine, CA 92717-3425
Last modified: Wed Aug 10 02:07:57 1994