Bug in msnbot (Microsoft / Bing web crawler) exceeding the bandwidth for my website
Unfortunately my website watkissonline.co.uk went offline for a few hours on Sunday due to a bandwidth exceeded issue. I’ve already written a post about the problems I’ve had with low cost hosting and exceeded bandwidth. Whilst I still think it’s wrong to block a website in these circumstances, I believe I’ve now found the reason the bandwidth has been so high over the last two months.
The problem appears to be the result of a software bug from Microsoft, which is a little ironic for a website hosted on a Linux server, built with open source software and even developed on a Linux laptop / netbook. The software at fault is not running on my computer, but on one working for Microsoft.
What’s a bot / webbot / web robot / web spider / web crawler?
These are all different words that refer to the same thing – if you already know what these are you can skip to the explanation of the problem.
To understand why this is happening you’ll need to know a little about how search engines work, but don’t worry, I’ll keep it simple.
Search engines work by finding pages on the Internet and adding them to their web index databases. This is done using web robots, which act like a web browser by downloading the various pages of a site. Assuming this is working correctly, most websites are not really impacted, as the page downloads are spaced out over a period of time and the traffic should be fairly insignificant. Of course, when I say “assuming”, “working correctly” and “should”, I’m implying that things don’t always go according to plan.
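To give a rough idea of what a web robot actually does, here’s a tiny sketch in Python (my own illustration, not anything from Microsoft); the bot name and URL are just placeholders:

# Minimal sketch of a well-behaved web robot: download a page while
# identifying itself in the User-Agent header so sites know who is crawling.
import urllib.request

def fetch(url, user_agent="ExampleBot/1.0 (+http://example.com/bot)"):
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read()

page = fetch("http://example.com/")
print(len(page), "bytes downloaded")

A real search engine robot repeats this for every link it discovers, which is why the total bandwidth can add up quickly if the robot isn’t throttled.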
What awstats says
The first thing I did after the website was restored was to check the bandwidth figure shown in the cPanel provided by the hosting provider. It showed over 8GB of downloads this month, which is very high for my website, especially as we were only half way through the month.
I then went to awstats, the website log analysis tool that I usually use. This showed the bandwidth used as only 1.6GB.
At first I thought it may have been the bandwidth reporting that was at fault, but then I saw that the main overview in the awstats report only shows viewed traffic:
* Not viewed traffic includes traffic generated by robots, worms, or replies with special HTTP status codes.
Scrolling down into the Robots/Spiders visitors section, I found that msnbot had downloaded over 6.1GB, plus a little more from msnbot-media (the bot Microsoft uses to crawl images and other media). That is almost 4 times the bandwidth used by people actually viewing the pages.
The msnbot is the web robot used by Microsoft’s Bing search engine, and it appears to be the cause of the problems with the website.
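If you want to double-check figures like these without relying on awstats, the raw web server access logs can be totalled up by user agent. Here’s a rough sketch of my own (the log file name is an assumption) that works with the standard Apache “combined” log format used by most cPanel hosts:

# Sum the bytes served per user agent from an Apache combined format access
# log, to cross-check the robot bandwidth figures reported by awstats.
import re
from collections import defaultdict

LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?P<bytes>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)

totals = defaultdict(int)
with open("access_log") as log:          # assumed log file name
    for line in log:
        match = LOG_LINE.match(line)
        if match and match.group("bytes") != "-":
            totals[match.group("agent")] += int(match.group("bytes"))

# Print the ten heaviest user agents, converted to megabytes
for agent, total in sorted(totals.items(), key=lambda item: item[1], reverse=True)[:10]:
    print(f"{total / (1024 * 1024):8.1f} MB  {agent}")

Anything identifying itself as msnbot should show up near the top of that list if it really is responsible for the bulk of the downloads.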
Using robots.txt to restrict search engine robots
Fortunately webmasters do have a tool to help keep these robots at bay, known as robots.txt. This is a text file, normally in the top level of the website, which tells search engines how they are allowed to crawl its pages. It is a convention that most search engines follow, although some web robots may ignore it.
Usually the robots.txt file is used to exclude a search engine from part of a website, but in this case I’m using it to restrict how frequently the msnbot can access my site.
I added the following lines to my robots.txt file:
User-agent: msnbot
Crawl-delay: 120
This tells the msnbot (matched by the User-agent line) to leave a delay of at least 2 minutes (120 seconds) between each page it fetches to index.
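For what it’s worth, a crawler that respects this convention can read the directive programmatically. Here’s a small sketch using Python’s standard library robots.txt parser (the page URLs are just examples of my own):

# Sketch of how a well-behaved crawler could honour the Crawl-delay directive
# using Python's urllib.robotparser (crawl_delay() needs Python 3.6 or later).
import time
import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("http://www.watkissonline.co.uk/robots.txt")
parser.read()

delay = parser.crawl_delay("msnbot") or 0          # 120 seconds with the entry above
for url in ["http://www.watkissonline.co.uk/", "http://www.watkissonline.co.uk/about/"]:
    if parser.can_fetch("msnbot", url):
        # ... download and index the page here ...
        time.sleep(delay)                           # wait before the next request

Whether msnbot actually honours Crawl-delay is of course up to Microsoft, but it’s the standard way of asking a robot to slow down.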
Based on the number of hits it took to generate 6.1GB of data, it will take approximately 180 days to crawl the same amount if the msnbot follows the new entry in the robots.txt file.
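As a rough sanity check of that figure (the exact hit count isn’t shown above, so the 130,000 requests here is an assumed round number):

# Back-of-the-envelope check: at one request every 120 seconds, an assumed
# 130,000 msnbot hits would be spread over roughly 180 days.
requests = 130_000                      # assumed number of msnbot hits
crawl_delay = 120                       # seconds, from the robots.txt entry above
days = requests * crawl_delay / (60 * 60 * 24)
print(f"{days:.0f} days")               # about 180 days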
Wait and see
It’s now a case of waiting and watching to see whether the amount of traffic generated by the msnbot reduces significantly. Hopefully this will be enough to stop it from hogging the bandwidth; otherwise more drastic measures may be required. I would certainly like to keep (and perhaps improve) my rankings in Bing, but I’d rather have a website that can serve my visitors than one that is unavailable due to msnbot traffic.
Note that this is based on the information provided by the tools I use for analysing the website’s traffic. These are not infallible, and it’s possible the traffic could be from a different search engine incorrectly identified, or even from a malicious attack, but from the information available it looks like an overactive web robot.