The dangerous consequences of a robots.txt file

May 14, 2014 in Tips and Tricks by David Zimmerman

The robots.txt file is an important way to control spiders and bots as they visit your website. But if you do it wrong, you can cause some serious problems.

One time a prospective client called my boss. They were desperate. They had just launched a new website and suddenly their traffic from Google dropped off a cliff. Now, this wasn’t anything we hadn’t seen before. Developers focus on making a website work, and when it does, they believe their job is done. This often results in things like completely changing the site structure and URLs (even making them “more SEO” by converting them to words) without 301 redirecting the old URLs to the new pages. Even when they do remember to 301 all the URLs, we often see Google traffic dip a little while Google re-indexes the new website.

When the client called, this is what we first thought was happening (it is so very common). But when we looked, nope: the dev had actually taken care of all the redirects (and even used 301s).

What was going on?

The dev launched the website with the following code in the robots.txt file:

User-agent: *
Disallow: /

(BTW: This is how you tell a spider not to crawl your website)

Now this is an important step to keep spiders out of the dev site, but when this rule was carried over to the live site, Google took the suggestion and kicked the site out of the index. This is the first danger of a robots.txt file:

A robots.txt file can kick a website out of Google’s index.
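One way to catch this before launch is to test the robots.txt rules automatically. The following is a minimal sketch using Python's standard `urllib.robotparser` module; the function name `blocks_entire_site` and the two sample rule sets are hypothetical, made up for illustration, and the check only looks at whether the homepage path is crawlable.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt, user_agent="Googlebot"):
    """Return True if these rules forbid the given agent from crawling "/"."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


# Hypothetical examples: the dev-site rules from the story, and a live-site
# rule set that only keeps bots out of one directory.
DEV_RULES = "User-agent: *\nDisallow: /\n"
LIVE_RULES = "User-agent: *\nDisallow: /tmp/\n"

if __name__ == "__main__":
    for name, rules in (("dev", DEV_RULES), ("live", LIVE_RULES)):
        print(name, "blocks everything:", blocks_entire_site(rules))
```

A check like this could run as part of a launch checklist, so that a dev-site robots.txt gets flagged before it reaches the live server.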

Speaking of suggestions: it’s important to remember that the robots.txt file is only that, a suggestion. If you have files on your website that you really don’t want Google to index, the only sure way to prevent this is by password protecting that directory using a server-side password.

It’s scary to think about this for a moment. Remember that Google isn’t the only bot out there that is crawling the web. Sometimes devs try to hide important or secret files from bots by listing them in the robots.txt file. I’ve seen devs provide direct access to installation files, administrative back ends and even secure documents, all in the public robots.txt file. It’s like telling a kid: “Don’t go in the pantry, there’s candy in there.”

That’s the second danger of the robots.txt file:

The robots.txt file is a public list of things you don’t want people to see.

Consequently, if I really need to hide something on my web server, I don’t list it in my robots.txt file. So long as my server is set up correctly (preventing people from viewing directories directly), it is much harder for them to find that information if I don’t list it.
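To audit an existing robots.txt for this kind of leak, a short script can flag Disallow lines that look like they point at something sensitive. This is only a rough sketch: the function name `revealing_disallows` and the list of hint words are assumptions chosen for illustration, and a real audit would need a list suited to your own site.

```python
# Hypothetical hint words; adjust them to match what counts as sensitive for you.
SENSITIVE_HINTS = ("admin", "install", "backup", "private", "secret", "secure")


def revealing_disallows(robots_txt, hints=SENSITIVE_HINTS):
    """Return the Disallow paths that contain any of the hint words."""
    flagged = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, _, value = line.partition(":")
        if field.strip().lower() != "disallow":
            continue
        path = value.strip()
        if any(hint in path.lower() for hint in hints):
            flagged.append(path)
    return flagged


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /admin/\nDisallow: /images/\n"
    print(revealing_disallows(sample))
```

Anything the script flags is a candidate for removal from robots.txt and for protection with a server-side password instead.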

If I’ve made you a little scared about using the robots.txt file, I’m sorry. I’m not trying to tell you NOT to use one on your website. There are a lot of good uses of a robots.txt file (even some SEO benefits). Just be sure you are doing it right: http://www.robotstxt.org/robotstxt.html

Have you ever felt the consequences of a robots.txt error? What are some other dangers you’ve experienced from a robots.txt file?

Leave them in the comments, below.

Tags: Google
