Robots.txt – Keeping web crawlers under control and off your site

Posted on January 19, 2008

In the process of promoting your website, you have undoubtedly invited some electronic guests by submitting your site to search engines, directories, and similar services. These guests, called spiders or crawlers, collect information from your site much as human visitors do. As with inviting strangers into your home, you will get both good guests and bad guests.

A robots.txt file specifies where you would not like these automated visitors to go. It implements what is known as the robots exclusion standard: a simple text file named “robots.txt” placed in the root directory of your site (www.yourdomain.com/robots.txt). You do not have to tell the search engines where this file is; crawlers look for it at that location automatically.
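As a small illustration of where crawlers look for the file, the sketch below derives the robots.txt address from any page URL using Python's standard library. The function name robots_txt_url is made up for this example, and www.yourdomain.com is the same placeholder domain used above.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    Hypothetical helper for illustration: it keeps the scheme and host
    and replaces the path with /robots.txt, dropping any query or fragment.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Any page on the site maps to the same robots.txt location.
location = robots_txt_url("http://www.yourdomain.com/folder1/page.html?id=7")
print(location)
```

Whatever page you start from, the result points at the root of the site, which is why the file must live there and nowhere else.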

To guide web crawlers, start by deciding which folders and files you don’t want any robots to visit: for example, directories containing dynamically generated files, web server documentation, or information you don’t want showing up on Google. Make a list of those directories and files before you create the robots.txt file.

Now, just like real-life house guests who are impolite and wander off into private rooms, some spiders ignore your robots.txt and crawl into the very folders you list in it. The file is a request, not an access control. These naughty robots are referred to as “mal-formed spiders,” and they collect everything from private information to email addresses to add to spam lists. This is one reason I recommend against listing individual files in your robots.txt: anyone who pulls up your robots file will know exactly where you don’t want them to look. Instead, place those files inside directories that are protected against file listing, and block the whole directory.

To create a robots.txt, start a new text file in Notepad or TextEdit (making sure it is saved as plain text) and paste in the following:

User-agent: *
Disallow: /folder1/
Disallow: /folder2/
Disallow: /folder3/
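To see how a well-behaved crawler would read these rules, here is a minimal sketch using Python's standard urllib.robotparser module. It parses the exact lines above and asks whether two sample URLs may be fetched. The page names (page.html, index.html) are invented for the example, and real crawlers may interpret rules slightly differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt shown above.
RULES = [
    "User-agent: *",
    "Disallow: /folder1/",
    "Disallow: /folder2/",
    "Disallow: /folder3/",
]

parser = RobotFileParser()
parser.parse(RULES)  # parse() takes an iterable of lines

# Hypothetical URLs on the placeholder domain, for illustration only.
blocked = parser.can_fetch("*", "http://www.yourdomain.com/folder1/page.html")
allowed = parser.can_fetch("*", "http://www.yourdomain.com/index.html")

print("folder1 page may be fetched:", blocked)
print("home page may be fetched:", allowed)
```

Note that a Disallow line for a directory covers everything beneath it, so individual files never need to be named.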

Some common examples of folders listed are:
/cgi-bin/
/images/
/tmp/
/private/
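If you keep your list of folders somewhere else, the file can also be generated rather than typed by hand. The following is only a sketch: build_robots_txt is a hypothetical helper, and it assumes every entry is a directory name other than the site root.

```python
def build_robots_txt(directories, user_agent="*"):
    """Build robots.txt content that disallows each listed directory.

    Hypothetical helper: entries may be written with or without
    leading/trailing slashes; each is normalized to the /name/ form.
    """
    lines = [f"User-agent: {user_agent}"]
    for directory in directories:
        name = directory.strip("/")
        if not name:
            # Assumption: blocking the whole site is out of scope here.
            raise ValueError("empty directory name")
        lines.append(f"Disallow: /{name}/")
    return "\n".join(lines) + "\n"


# The common folders listed above, written in slightly different styles.
content = build_robots_txt(["/cgi-bin/", "images", "/tmp", "private/"])
print(content, end="")
```

Writing the returned string to a file named robots.txt gives the same kind of file as the hand-written one.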

The * after “User-agent:” means the rules apply to all robots. Once you have saved this file as “robots.txt”, upload it to the root folder of your web server and you’ll be good to go. More information can be found at: http://www.robotstxt.org/
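The User-agent line can also name a single robot instead of *. As a hedged sketch of the difference, the example below uses Python's standard urllib.robotparser with two groups of rules: one for a made-up crawler called ExampleBot (not a real robot) and one for everyone else. The URLs are placeholders as well.

```python
from urllib.robotparser import RobotFileParser

# Two groups separated by a blank line: a named robot, then all others.
RULES = [
    "User-agent: ExampleBot",  # hypothetical crawler name
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(RULES)

home = "http://www.yourdomain.com/index.html"
secret = "http://www.yourdomain.com/private/notes.html"

# ExampleBot matches its own group and is kept off the whole site.
example_bot_home = parser.can_fetch("ExampleBot", home)
# Any other robot falls back to the * group.
other_bot_home = parser.can_fetch("OtherBot", home)
other_bot_secret = parser.can_fetch("OtherBot", secret)

print(example_bot_home, other_bot_home, other_bot_secret)
```

In this parser, a robot that matches a named group uses only that group, while robots with no matching group use the * rules.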

