by Beaten Rice » Mon Dec 21, 2009 5:52 am
A robots.txt file, when present in the root directory, indicates those areas of your site which should not be accessed or indexed by automated site crawlers (also called spiders) such as those used by search engines.
While spiders are supposed to follow the instructions contained within the robots.txt file, none are compelled to do so. Major search engines usually honor those instructions. Scores of other spiders, such as those used by spammers to collect email addresses, do not.
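To make the "supposed to follow" part concrete, here is a minimal sketch of what a well-behaved spider does before requesting a page, using Python's standard urllib.robotparser module. The file contents, the bot name "ExampleBot" and the example.com URLs are made up for illustration; they do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /tmp/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/index.html"))   # True
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/admin/login"))  # False
```

Note that the check is entirely voluntary: a spider that never calls anything like is_allowed() simply fetches the page, which is why the file offers no protection against the spiders that ignore it.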
Robots.txt File Discussions
Searching for robots.txt on Google returns scores of results. It is a regularly discussed topic on many discussion forums, including our own.
Opinions on what the file should contain range from something short and to the point to endless lists of disallows. There are three points to keep in mind:
* An improperly written robots.txt file can do more harm than good, and disallow the indexing of content you'd like to see in a search engine.
* Because the robots.txt file is itself publicly accessible, it provides a roadmap to all of the content you might want to keep private. Never try to hide sensitive material by listing it in the robots.txt file. Any human visitor can read the file and will have ready access to what it lists.
* Scores of spiders ignore robots.txt files altogether. These include the spiders used by spammers, but they are not the only ones. Spiders built into desktop applications may ignore the file as well, in order to give their users faster browsing or a search-within-bookmarks feature.
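The first two points can be illustrated with a short sketch, again using Python's standard urllib.robotparser. Everything named here is an assumption for the sake of the example: the two file variants, the /private/ path, the bot name and the example.com URL are hypothetical. The first check shows how a one-character slip ("Disallow: /" instead of "Disallow: /private/") shuts compliant spiders out of the whole site; the second shows how easily anyone can read the disallowed paths back out of the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical files: what the site owner meant, and what was actually typed.
INTENDED = "User-agent: *\nDisallow: /private/\n"
MISTAKEN = "User-agent: *\nDisallow: /\n"


def allowed(robots_txt, url, agent="ExampleBot"):
    """Return True if a compliant spider may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def disallowed_paths(robots_txt):
    """List every Disallow path, exactly as any visitor could read them."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        key, _, value = line.partition(":")
        if key.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


if __name__ == "__main__":
    url = "http://example.com/articles/"
    print(allowed(INTENDED, url))      # True: only /private/ is off limits
    print(allowed(MISTAKEN, url))      # False: the whole site is blocked
    print(disallowed_paths(INTENDED))  # ['/private/']
```

The last line is the "roadmap" problem in miniature: the list of places you asked spiders to avoid is itself public.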