Stop AboutUs.org from scraping your blog

2006 August 27
by Paul O'Flaherty

Tom’s posted some good tips for preventing AboutUs.org from scraping your site:

Block their robot
Add the following lines to your robots.txt file (in the “root” folder of your website):
User-agent: AboutUsBot
Disallow: /
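
If you want to double-check that the rule does what you expect before uploading it, here is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `is_allowed` and the `example.com` URL are placeholders of mine, not anything from Tom's post. It only shows how a well-behaved parser reads the rule; it says nothing about whether a given bot actually obeys it.

```python
from urllib.robotparser import RobotFileParser

# The two rules from the post, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: AboutUsBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper: pass the bot's short name (e.g. "AboutUsBot"),
    not the full User-Agent header, because the parser matches on the
    part of the string before the first "/".
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder; substitute your own site.
    print(is_allowed(ROBOTS_TXT, "AboutUsBot", "http://example.com/"))    # False
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "http://example.com/"))  # True
```

Other crawlers are unaffected because the rule names only AboutUsBot.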

Block their IP range
In your .htaccess file (if you’re on Apache) add the following line:

deny from 66.249.16.

Block the bot’s user agent
If you do user agent blocking, block the bot’s user agent:
(currently Mozilla/5.0 (compatible; AboutUsBot/0.9; +http://www.aboutus.org/AboutUsBot))
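
The post doesn't say how you do your user agent blocking, so treat the following as one possible sketch for a site served by a Python WSGI application rather than as the method Tom had in mind. The names `BLOCKED_AGENT_TOKENS`, `is_blocked_agent` and `block_bots` are hypothetical. It matches on the `AboutUsBot` token instead of the whole string, on the assumption that the version number may change.

```python
# Substrings to look for in the User-Agent header (assumption: matching on
# the bot's name alone is enough, since "0.9" may change in later versions).
BLOCKED_AGENT_TOKENS = ("AboutUsBot",)


def is_blocked_agent(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a blocked token."""
    lowered = user_agent.lower()
    return any(token.lower() in lowered for token in BLOCKED_AGENT_TOKENS)


def block_bots(app):
    """Wrap a WSGI app so requests from blocked user agents get a 403."""

    def middleware(environ, start_response):
        if is_blocked_agent(environ.get("HTTP_USER_AGENT", "")):
            body = b"Forbidden\n"
            start_response(
                "403 Forbidden",
                [
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        # Everyone else is passed through to the wrapped application.
        return app(environ, start_response)

    return middleware
```

Keep in mind that a user agent string is trivially changed by whoever runs the bot, so this only stops a crawler that identifies itself honestly.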

Block the DomainTools.com IP Range
AboutUs.org uses DomainTools’ services to generate thumbnail images of site content, so block their IP range too:

deny from 66.249.4.
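
A partial address in a `deny from` line covers every address that begins with those octets, so `66.249.16.` amounts to the 66.249.16.0/24 subnet. As a rough sketch of what the two lines above cover, here is a small check using Python's standard `ipaddress` module. `DENIED_PREFIXES`, `prefix_to_network` and `is_denied` are names I made up for the illustration.

```python
import ipaddress

# Partial addresses copied from the "deny from" lines in the post.
DENIED_PREFIXES = ["66.249.16.", "66.249.4."]


def prefix_to_network(prefix: str):
    """Turn a partial address such as '66.249.16.' into its CIDR network."""
    octets = [part for part in prefix.split(".") if part]
    if not 1 <= len(octets) <= 3:
        raise ValueError(f"expected 1 to 3 octets, got {prefix!r}")
    # Pad the missing octets with zeros; each given octet fixes 8 bits.
    padded = octets + ["0"] * (4 - len(octets))
    return ipaddress.ip_network(f"{'.'.join(padded)}/{8 * len(octets)}")


def is_denied(address: str) -> bool:
    """Return True if address falls inside any of the denied ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in prefix_to_network(p) for p in DENIED_PREFIXES)


if __name__ == "__main__":
    print(prefix_to_network("66.249.16."))  # 66.249.16.0/24
    print(is_denied("66.249.16.7"))         # True
    print(is_denied("66.249.160.1"))        # False
```

The last case is the one worth noticing: the match is on whole octets, so 66.249.160.1 is not caught even though it starts with the same characters.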

Update: Tom appeared to have deleted the post from his site (it’s back now). Thankfully, AboutUs.org has a page explaining how to block their bot from your site, and it’s a bit simpler than what’s posted above:

To prevent the AboutUsBot from collecting your site content in the future, please include the following lines in your /robots.txt file.

User-agent: AboutUsBot
Disallow: /

The AboutUsBot will include the following in its User-Agent string:

Mozilla/5.0 (compatible; AboutUsBot/0.9; +http://www.aboutus.org/AboutUsBot)

Please note that the current AboutUsBot behavior is to visit each site only once to initialize the AboutUs.org page.

Also, they’ve posted a page for people to voice their concerns! So, go voice them…

2 Responses
  1. August 28, 2006

    Well, the morning brings a change to the “concerns” page, a promise of manual deletion (once you prove you own the domain) – shame this “proof” isn’t required to create and edit a record really…

    He still hasn’t explained the purpose of the site, or what “valuable service” it’s providing…

Trackbacks and Pingbacks

  1. AboutUs.Org starting to cave in at Tom’s View of the World
