What’s the purpose of a robots.txt file?
First off, let’s clarify why everyone should use a robots.txt file. When search engine crawlers (Googlebot, MSNbot, Yahoo! Web Crawler) arrive at your website, they first check whether a robots.txt file exists at the root of your domain (e.g. www.yourdomain.com/robots.txt). If it does, they use it as a guideline for which parts of your website to crawl. If it doesn’t, they simply try their best to crawl everything they can find.
But that’s where the problem begins. Without guidance, crawlers spend their time on pages that are dead ends. For instance, if you link to a small pop-up page that contains nothing but text, a crawler will go to that page and find no links leading back into the rest of your site, and the page can still show up in search results as a landing page with no navigation. The solution is to block the page from being crawled in the robots.txt file.
That’s only one example of using a robots.txt file. Here are some other things you may want to block:
- Search result pages (e.g. www.yourdomain.com/search?q=)
- Folders with irrelevant pages and data (e.g. www.yourdomain.com/ajax/)
- Session IDs (e.g. www.yourdomain.com/page.php?session=)
- Print pages (e.g. www.yourdomain.com/article/print.php)
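Since crawlers only look for the file at the root of the domain, it can help to see how that location is derived from any page URL. The following is a minimal Python sketch; the function name robots_txt_url is made up for illustration, and the domain is the placeholder used in this article.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt at the root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.yourdomain.com/article/print.php?id=3"))
# http://www.yourdomain.com/robots.txt
```

Whatever page a crawler starts from, it ends up requesting the same single file, which is why a robots.txt placed in a subfolder is never read.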
Writing the robots.txt file
Before you begin, look through your website and its indexed pages in the search engines (search for “site:www.yourdomain.com”) and look for pages that really shouldn’t be there and have no benefit to users who land on them (if you use WordPress, the /wp-login.php and /wp-admin/ pages are perfect examples). Once you have a few pages in mind, open up Notepad and begin writing your robots.txt file using the following syntax:
User-agent: *
This should be the first line in your robots.txt file. It defines which search engine crawler should obey the rules that follow. The asterisk (*) means every bot should follow them.
Disallow:
Allow:
These make up the basis of your robots.txt file. Disallow tells the bots not to crawl a certain path, while Allow tells them it’s okay to crawl it. Both match from the start of the URL path, so Disallow: /ajax/ covers everything inside that folder. You don’t need an Allow line for every page you want crawled; it’s only needed when you want something inside a disallowed directory to be crawled. I’ll show you in the example below.
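To make the interplay between Disallow and Allow concrete, here is a rough Python sketch of the "most specific rule wins" precedence described in RFC 9309 and in Google's documentation. It handles plain path prefixes only (no wildcards), the name is_allowed and the sample paths are invented for illustration, and older crawlers may instead apply the first matching rule.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (directive, path prefix), e.g. ("disallow", "/ajax/")


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Apply the longest-match rule: the most specific matching prefix wins."""
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rules are ignored
        is_allow = directive.lower() == "allow"
        # A longer prefix wins; on a tie, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("disallow", "/ajax/"), ("allow", "/ajax/help.html")]
print(is_allowed(rules, "/ajax/data.json"))  # False
print(is_allowed(rules, "/ajax/help.html"))  # True
print(is_allowed(rules, "/about.html"))      # True
```

The Allow line only matters because it is longer, and therefore more specific, than the Disallow line that covers the same folder.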
Sitemap:
Another great thing about robots.txt is the ability to tell the bots where your XML sitemap is located. Give the full URL of the sitemap after Sitemap:, including the http:// part.
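If you want to confirm that a Sitemap line is actually picked up, Python's standard library can read it back. This is a small sketch using urllib.robotparser (its site_maps() method needs Python 3.8 or later); the file content below is a made-up sample using this article's placeholder domain.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, reusing the placeholder domain from this article.
ROBOTS_TXT = """\
User-agent: *
Disallow: /ajax/
Sitemap: http://www.mydomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of Sitemap URLs, or None when the file has none.
sitemaps = parser.site_maps()
print(sitemaps)  # ['http://www.mydomain.com/sitemap.xml']
```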
Writing guidelines:
- Make sure there’s only one rule per line
- Make sure each group of rules starts with a User-agent: line
- Use # to write comments for anyone else who works on your website
- Upload the robots.txt file to the root of your domain
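The first three guidelines are easy to check automatically. Below is a rough Python sketch of such a check; lint_robots_txt and its messages are invented for this example, it only knows the four fields covered in this article, and it is not a full validator.

```python
from typing import List

KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap"}


def lint_robots_txt(text: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, sep, value = line.partition(":")
        field = field.strip().lower()
        if not sep or field not in KNOWN_FIELDS:
            problems.append(f"line {number}: not a recognised 'Field: value' rule")
            continue
        if field == "user-agent":
            seen_user_agent = True
        elif field in ("allow", "disallow"):
            if not seen_user_agent:
                problems.append(f"line {number}: rule appears before any User-agent")
            if len(value.split()) > 1:
                problems.append(f"line {number}: more than one rule on a line")
    return problems


sample = "Disallow: /tmp/\nUser-agent: *\nDisallow: /ajax/ Disallow: /print.php\n"
for problem in lint_robots_txt(sample):
    print(problem)
# line 1: rule appears before any User-agent
# line 3: more than one rule on a line
```

The fourth guideline, uploading to the root of the domain, cannot be checked from the file's content alone.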
Example robots.txt file
# Every bot should listen to these rules
User-agent: *
Disallow: /ajax/
Allow: /ajax/*.html
Disallow: /*print.php
Disallow: /search?q=
Sitemap: http://www.mydomain.com/sitemap.xml
# Here are the rules explained, in the order they were written:
# Don't crawl the /ajax/ folder or anything inside it
# But it's okay to crawl the HTML pages in there
# Don't crawl any URL that contains print.php, such as /article/print.php
# Don't crawl anything that starts with /search?q=
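The * wildcard in rules like these is an extension that the major crawlers support and that RFC 9309 also describes, along with a trailing $ that pins the end of the URL. As an illustration of how such a pattern can be interpreted, here is a small Python sketch that turns a pattern into a regular expression; pattern_to_regex and matches are hypothetical helpers, not part of any crawler.

```python
import re


def pattern_to_regex(pattern: str):
    """Translate a robots.txt path pattern into an anchored regular expression.

    '*' matches any run of characters; a trailing '$' pins the end of the URL.
    """
    anchored_end = pattern.endswith("$")
    if anchored_end:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored_end else ""))


def matches(pattern: str, path: str) -> bool:
    """Return True when the pattern matches from the start of the path."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/ajax/*.html", "/ajax/widgets/menu.html"))  # True
print(matches("/ajax/*.html", "/ajax/data.json"))          # False
print(matches("/search?q=*", "/search?q=shoes"))           # True
print(matches("/print.php", "/article/print.php"))         # False
print(matches("/*print.php", "/article/print.php"))        # True
```

The last two lines show why a leading wildcard is needed to catch print.php inside a subfolder: a plain rule only matches from the start of the path.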
You can get more advanced and picky, as in the example below.
User-agent: *
Disallow: /browse/tmp/
Disallow: /*.asp
User-agent: msnbot/2.0b
User-agent: msnbot/1.1
Disallow: /browse/tmp/
Disallow: /*.asp
Disallow: /*?session=
Disallow: /*?id=
Sitemap: http://www.mydomain.com/sitemap.xml
# This robots.txt gives every bot the first two rules, and gives the msnbots two additional rules. A bot obeys only the group that matches it most specifically, so the shared rules are repeated in the msnbot group.
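One detail worth checking with groups like these is which group a given bot actually ends up obeying. The sketch below uses Python's urllib.robotparser, which, in line with RFC 9309, applies the group that names the bot and falls back to the * group only when no named group matches. The file content, the /private/ path and the bot name SomeOtherBot are invented for illustration, and real crawlers may differ in the details.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one group for every bot, one specific to msnbot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /browse/tmp/

User-agent: msnbot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "http://www.mydomain.com"
# A bot without its own group falls back to the '*' group.
print(parser.can_fetch("SomeOtherBot", base + "/browse/tmp/a.html"))  # False
# msnbot has its own group, so only that group's rules apply to it.
print(parser.can_fetch("msnbot", base + "/browse/tmp/a.html"))        # True
print(parser.can_fetch("msnbot", base + "/private/a.html"))           # False
```

In other words, rules in the * group are not inherited by a bot that has its own group; anything you want that bot to skip has to be listed in its group too.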
You might be wondering when this is ever practical. Ever since Google (and Yahoo!) began letting webmasters handle URL parameters in their webmaster tools, you no longer need to put those “?parameter=*” rules into robots.txt for those search engines. Read this article on parameter handling in Google Webmaster Tools.

