Looking through the web access logs for my Penguin Tutor (Linux Tutorials / Certification information) web site, I saw a particularly large amount of traffic on Friday. The logs show that this is because someone has downloaded my entire web site using the WebReaper (web crawler / web spider) program. The program downloads all the pages of a web site by following all the internal links. You can then view the entire web site offline.
The problem with this is that my web site includes a number of large collections of files including all the RFC (request for comments) documents and the main Linux man pages. If anyone really wants all these collections then there are far better ways of downloading these as compressed archives.
Fortunately there are ways of blocking these web crawlers / spiders. WebReaper obeys the robots.txt file, which should be a basic function of every web spider / crawler. By adding entries to my existing robots.txt file I can block this software from downloading those parts of the web site. The following shows the relevant parts of my robots.txt file.
User-agent: *
Disallow: /man/html

User-agent: WebReaper
Disallow: /blog/
Disallow: /cgi-bin/
Disallow: /forum/
Disallow: /man/
Disallow: /rfc/
The first line says that the rule that follows applies to all User-agents (i.e. any web robot / crawler / spider). The second line then says that none of the web robots / crawlers / spiders should download files in the /man/html directory, which holds the raw man pages (although not the formatted pages). The second section I’ve added for the WebReaper program. It prevents WebReaper from downloading the blog entries, the cgi-bin entries (which won’t work because they are dynamically created using a session), the forum pages, the man pages and the RFCs.
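If you want to check how a set of rules like these will be read before putting them live, one option is Python's standard urllib.robotparser module. The following is only a rough sketch: the file names in the example paths (rfc2616.txt, ls.html, index.html) and the SomeOtherBot name are made up for illustration, and a real crawler may interpret the file slightly differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt entries discussed above, held as a string for testing.
ROBOTS_TXT = """\
User-agent: *
Disallow: /man/html

User-agent: WebReaper
Disallow: /blog/
Disallow: /cgi-bin/
Disallow: /forum/
Disallow: /man/
Disallow: /rfc/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if the rules let the named user agent fetch the path."""
    return parser.can_fetch(agent, path)


# Hypothetical paths and agent names, used only to exercise the rules.
for agent, path in [
    ("WebReaper", "/rfc/rfc2616.txt"),
    ("WebReaper", "/index.html"),
    ("SomeOtherBot", "/man/html/ls.html"),
    ("SomeOtherBot", "/rfc/rfc2616.txt"),
]:
    print(agent, path, "allowed" if allowed(agent, path) else "blocked")
```

With Python's parser, a crawler that has its own named section is checked against that section only, not against the * section as well. That makes no difference here because the /man/ rule for WebReaper already covers /man/html.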
This could be expanded to include other User Agents which obey the robots.txt file.
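If the list of crawlers grows, the repeated sections could be generated rather than typed by hand. This is a minimal sketch, not how my own file is maintained: ExampleCrawler is an invented name standing in for whatever other User Agents you want to block, and the helper build_sections is purely illustrative.

```python
# Only WebReaper comes from the log files; ExampleCrawler is a made-up name.
AGENTS = ["WebReaper", "ExampleCrawler"]

# The directories blocked for WebReaper in the robots.txt shown earlier.
BLOCKED_PATHS = ["/blog/", "/cgi-bin/", "/forum/", "/man/", "/rfc/"]


def build_sections(agents, paths):
    """Return robots.txt text with one section per user agent."""
    sections = []
    for agent in agents:
        lines = [f"User-agent: {agent}"]
        lines += [f"Disallow: {path}" for path in paths]
        sections.append("\n".join(lines))
    # Sections are separated by a blank line, as in a hand-written file.
    return "\n\n".join(sections) + "\n"


robots_text = build_sections(AGENTS, BLOCKED_PATHS)
print(robots_text)
```

The output would still need to be merged with the existing "User-agent: *" section before being saved as robots.txt.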
Some programs may not be well behaved and may ignore the robots.txt file, and others, such as wget, have an option to override it. Wget has several genuine uses where you may want to override the robots.txt file. Obviously it should only be used in this way if you take great care to download only the relevant information, or have permission from the site owner.
For example, if I want to download a file from the Internet direct to my web server, I often find the file using Firefox and then copy and paste the URL of the actual file I want into a wget command on my web server (which does not normally run X Windows).