How to Stop Unknown Robots from Crawling My Website?

Discussion in 'Site Security & Legal Issues' started by Anonymous, Dec 29, 2009.

  1. Anonymous

    Anonymous Habitué

    1,131
    613
    +272
    Unknown robots are drastically eating my website's bandwidth for no good reason.
    I haven't been able to identify them, because they crawl the site without exposing any details like a name or IP address; the only thing my stats report is "Unknown robot crawling". If I knew the IP address, I could stop them with an IP block.
    These unknown robots also aren't obliged to obey the robots.txt standard, or honor other techniques like the "nofollow" attribute.
    They just crawl websites for every link available on the web, without any user authentication. They have no reason to crawl my site and serve no useful purpose.
    How do I stop them efficiently?
     
  2. EthanJ

    EthanJ La Belle Époque

    771
    285
    +30
    Most robots identify themselves with a custom user agent in the request headers, which can easily be blocked with .htaccess.

    There are a number of good articles on this, like this one or this one. Let me know if you have any problems; it's just a matter of identifying the offending bots/crawlers and banning them as your needs dictate.
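    A minimal sketch of the idea (the user-agent tokens below are illustrative stand-ins, not a vetted blacklist; substitute whatever actually shows up in your logs):

    # Refuse requests whose User-Agent matches a token you've seen misbehaving.
    # NC = case-insensitive match, OR = any one matching condition is enough.
    <IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (EmailCollector|WebCopier|HTTrack) [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^$
    RewriteRule .* - [F,L]
    </IfModule>

    The empty-pattern condition also refuses clients that send no user agent at all, which accounts for a fair share of "unknown robot" traffic.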
     
  3. Ouisri

    Ouisri Neophyte

    2
    1
    +0
    You have a choice: block them, or improve your site engine. Both need modifications to your code. EthanJ can guide you on blocking them; I will guide you on making your site engine run faster, for example by replacing deprecated functions like ereg() with preg_match().
     
  4. blahblahblah2

    blahblahblah2 Adherent

    287
    0
    +4
    Last edited: Jun 15, 2010
  5. EthanJ

    EthanJ La Belle Époque

    771
    285
    +30
    I think they're a tiny bit unnecessary and flat out useless in most cases. The central idea behind a bot trap is that you trust self-declared user-agent strings, and no self-respecting malicious bot is going to declare itself as anything other than a trusted user agent (like Googlebot, or even a regular browser). In other words, a shoplifter isn't likely to walk into a store carrying a huge 'I'M HERE TO STEAL THINGS' sign, is he? That undermines the entire point of the exercise.

    So in practice all you're doing is banning lesser-known, generally non-malicious bots. If you want to get rid of 90%+ of scrapers (which is what the vast majority of bots detectable by user-agent string are doing), a static list in your .htaccess file will do a good enough job.

    To be entirely honest, though, doing any kind of security work based on user-agent strings is a fool's errand.
     
  6. blahblahblah2

    blahblahblah2 Adherent

    287
    0
    +4
    'Kay, was just wondering... you seem a lot more up on these matters than I am, so I was curious about your opinion on the topic.
     
  7. inenigma

    inenigma Aspirant

    26
    0
    +0
    Been trying to block Yandex

    Hi,

    I've got this bot "spider88.yandex.ru" chewing through my bandwidth. I've run the DNS through whois on cqcounter.com and picked out the following IPs:

    77.88.19.60, 77.88.21.11, 87.250.251.11, 93.158.134.11, 213.180.204.11, 213.180.204.211, 213.180.193.1, 213.180.199.34, 213.180.204.1

    I've updated my .htaccess to include the following, which I thought would've blocked any access from the Yandex spider, but the sod's still getting through:

    # block Yandex
    <IfModule mod_rewrite.c>
    RewriteEngine On

    RewriteCond %{REMOTE_ADDR} ^77\.88\.19\.$
    RewriteCond %{REMOTE_ADDR} ^77\.88\.21\.$
    RewriteCond %{REMOTE_ADDR} ^87\.250\.251\.$
    RewriteCond %{REMOTE_ADDR} ^87\.250\.252\.$
    RewriteCond %{REMOTE_ADDR} ^93\.158\.134\.$
    RewriteCond %{REMOTE_ADDR} ^213\.180\.193\.$
    RewriteCond %{REMOTE_ADDR} ^213\.180\.199\.$
    RewriteCond %{REMOTE_ADDR} ^213\.180\.204\.$
    RewriteRule ^(.*)$ - [F,L]
    </IfModule>


    First, are my rules in the .htaccess correct? Second, if they are correct (which I think they are), how do I identify what IP address it's coming through on if whois isn't reporting the correct one? Also, is there any way I can find out what requests Yandex is actually submitting, so I can try one of the other ways to blacklist it?

    Any help is greatly appreciated.

    Thanks,
    David


    Uh, got hold of my raw access logs and picked out the IP address the bot was running from. I'll see tomorrow if it gets blocked...
     
    Last edited: Jun 20, 2010
  8. inenigma

    inenigma Aspirant

    26
    0
    +0
    Hi,

    It looks like I've pissed them off, as they're now hitting my site even more.

    These are only some of the messages from my raw access log:

    95.108.247.252 - - [21/Jun/2010:19:18:37 +1000] "GET /USM/calendar.php?do=getinfo&day=2010-4-28&c=1 HTTP/1.1" 200 4541 "-" "Yandex/1.01.001 (compatible; Win16; I)"
    95.108.247.252 - - [21/Jun/2010:19:19:20 +1000] "GET /USM/calendar.php?do=getinfo&day=2010-3-16&c=1 HTTP/1.1" 200 4540 "-" "Yandex/1.01.001 (compatible; Win16; I)"
    95.108.247.252 - - [21/Jun/2010:19:20:48 +1000] "GET /USM/calendar.php?do=getinfo&day=2013-4-8&c=1 HTTP/1.1" 200 4540 "-" "Yandex/1.01.001 (compatible; Win16; I)"
    95.108.247.252 - - [21/Jun/2010:19:21:31 +1000] "GET /USM/calendar.php?do=getinfo&day=2007-10-11&c=1 HTTP/1.1" 200 4544 "-" "Yandex/1.01.001 (compatible; Win16; I)"


    This is the part of my .htaccess file that's supposed to block Yandex spiders

    # block Yandex ([F] answers the request with 403 Forbidden)
    <IfModule mod_rewrite.c>
    RewriteEngine On

    RewriteCond %{REMOTE_ADDR} ^77\.88\.19\.$
    RewriteCond %{REMOTE_ADDR} ^77\.88\.21\.$
    RewriteCond %{REMOTE_ADDR} ^87\.250\.251\.$
    RewriteCond %{REMOTE_ADDR} ^87\.250\.252\.$
    RewriteCond %{REMOTE_ADDR} ^93\.158\.134\.$
    RewriteCond %{REMOTE_ADDR} ^95\.108\.247\.$
    RewriteCond %{REMOTE_ADDR} ^95\.108\.247\.252$
    RewriteCond %{REMOTE_ADDR} ^213\.180\.193\.$
    RewriteCond %{REMOTE_ADDR} ^213\.180\.199\.$
    RewriteCond %{REMOTE_ADDR} ^213\.180\.204\.$
    RewriteRule ^(.*)$ - [F,L]
    </IfModule>


    Help !!!!

    Sorry for the thread hijack... I didn't think anyone would mind, seeing as it had been dead since last Dec.
     
  9. nick47274

    nick47274 Neophyte

    5
    1
    +0
    I'm not good at server-side stuff, but I think the plain directive is:

    Deny from <IP address>

    I'm sure if you search this in Google, you can find some good info to help you.
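    That said, one thing stands out even to me: every one of your RewriteCond lines ends in \.$, which anchors the end of the address right after the third octet, so none of them can ever match a full IP; and stacked RewriteCond lines are ANDed together unless you add [OR]. An untested sketch with both fixed, using a couple of your ranges:

    # Any one matching prefix is enough ([OR]); F returns 403 Forbidden.
    <IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{REMOTE_ADDR} ^77\.88\.19\. [OR]
    RewriteCond %{REMOTE_ADDR} ^95\.108\.247\. [OR]
    RewriteCond %{REMOTE_ADDR} ^213\.180\.204\.
    RewriteRule .* - [F,L]
    </IfModule>

    And since your logs show the bot announcing itself as "Yandex/1.01.001", matching the user agent may be simpler than chasing IPs:

    RewriteCond %{HTTP_USER_AGENT} Yandex [NC]
    RewriteRule .* - [F,L]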
     
  10. inenigma

    inenigma Aspirant

    26
    0
    +0
    Yeah, I put that in as well, but the sods kept getting through. Eventually the ops guys placed a server-wide block (anyone know what this is?) on the IP address, and that finally stopped them.

    Cheers,
    David
     
  11. mongo50

    mongo50 Neophyte

    5
    0
    +0
    Unknown robot (identified by 'crawl')

    Three of the top five bots in one of my domains show up in AWStats as "Unknown robot (identified by 'bot/' or 'bot-')", "Unknown robot (identified by 'crawl')", and "Unknown robot (identified by 'robot')". Interestingly, they don't show up at all in WebLog Expert's parsing of the same logs.

    I'd like to block these spiders. Is there any way to do this? Thanks for your help...
     
  12. free1proxy

    free1proxy Aspirant

    13
    1
    +0
    You need to make use of .htaccess to keep the unknown robots away from the website.
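    For the AWStats entries above, the tokens it keys on ('bot/', 'bot-', 'crawl', 'robot') are right there in the label, so a sketch along these lines would catch them. Be careful, though: those tokens also appear in legitimate crawlers like Googlebot, so exempt the ones you want first.

    <IfModule mod_rewrite.c>
    RewriteEngine On
    # Let the big search engines through before the generic match below.
    RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|msnbot) [NC]
    RewriteCond %{HTTP_USER_AGENT} (bot/|bot-|crawl|robot) [NC]
    RewriteRule .* - [F,L]
    </IfModule>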
     
  13. Adam H

    Adam H Think before you speak.

    1,390
    547
    +1,964
    I currently use this on all my sites. Basically it blocks all bad user agents, bad bots and scrapers. Not only can it save your content from being mass harvested, it will also save you a little bandwidth, because there are fewer bots running around your site. Hope it helps.

     
  14. mongo50

    mongo50 Neophyte

    5
    0
    +0
    OK, so what I ended up doing was spending a lot of quality time with my error logs. It turned out the bots in question weren't sending a referrer. Two were overseas, and one was amazonaws.com, which was sucking up an immense amount of bandwidth for no apparent benefit to me. I think I've stopped them for the moment...
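    For anyone else chasing this: plain Apache access control handles IP blocking without mod_rewrite. A sketch (the ranges below are the RFC 5737 documentation addresses, not real ones; substitute the offending ranges from whois):

    # Apache 2.2 style; needs AllowOverride Limit to work from .htaccess.
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.0/24
    Deny from 198.51.100.0/24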
     
    Last edited: Dec 17, 2010
  15. BirdOPrey5

    BirdOPrey5 #Awesome

    2,447
    547
    +692
  16. mongo50

    mongo50 Neophyte

    5
    0
    +0
    Very interesting, BirdOPrey5. Just did a quick cruise through http://www.spambotsecurity.com/zbblock.php and the only drawback I see is that it doesn't do anything to protect non-PHP files. Is there anything else like this out there? Thanks again...
     
  17. BirdOPrey5

    BirdOPrey5 #Awesome

    2,447
    547
    +692
    I ended up experimenting with ZBBlock today... Its default setup is very strict IMO, and I found it would block the occasional legitimate user, so I personally ended up disabling it again for now while I edit the settings more to my liking. For example, on the ZBBlock forums there is a thread on how to allow AOL proxies, which are blocked by default. It also blocked people following links to my site because the referring page had what ZBBlock considered "spam" words in the URL...

    It is a great system, but the out-of-the-box settings are too strict IMO... It also kills Tapatalk, so I had to remove it for now.

    As for not protecting non-PHP pages: there isn't much that can be done to exploit non-PHP pages, so I don't really think it's that big of a concern, assuming you're not running ASP or some other server-side language.
     
  18. Raymond

    Raymond Enthusiast

    174
    33
    +4
    Just a heads up: some bots change their IP on every visit, and those can't be blocked unless you block the actual service sending them. I never recommend banning people by IP anyway, since the offender can simply change their IP and an innocent visitor may later be assigned the old one.
     
  19. Anonymous

    Anonymous Habitué

    1,131
    613
    +272
    I've had a tremendous amount of success utilizing ZBBlock on my IPB 2.3 forum and PHP based website. In addition to stopping spam bot registrations, it's protecting my site from other malicious bots like hackbots poking around for exploits and site content scrapers. My first 24 hours using it saw over 1000 malicious site requests blocked, and they've been declining ever since.

    Like BirdOPrey5 said, the default blocks may be a bit too harsh if you have international users. I created custom signature bypasses for some lousy ISPs in Italy, Poland, Brazil, Thailand, and Colombia that foster spammers, because I noticed the occasional legitimate user being blocked.

    It's very easy to install and is compatible with virtually everything. It works 'out of the box' but it's not exactly intended to. You will want to monitor your block log files for legitimate traffic to make sure everything is going through. After a few days of monitoring my logs and making a few signature customizations, I'm confident that only malicious traffic is being blocked.

    It checks that URL queries against your website are not malicious or exploitative, looking for things like SQL injection attempts, keyword spamming, and directory traversal, and shuts them down before your own PHP scripts have even been initialized. It also blocks known bad ISPs, IP ranges, hostnames, spiders, and user agents. It additionally checks visitor IP addresses against the stopforumspam.com database on critical pages like registration and login. I went from 50+ bogus users a day to a manageable 3 or 4 (I now manually approve new accounts, since spambots started hammering forums a few weeks ago).

    It saves on bandwidth and resources because it runs before any of your site's content can be loaded. I'm currently using it on every page, serving over 100,000 page views a day, with no slowdown in page load times. From one forum owner to another, I highly recommend giving it a try on your website. It takes a little investment of time to get it set up just right, but it's worth it. The developer is also very active on the support forums and has been able to answer every question I've had about creating custom bypasses.
     
  20. zylstra

    zylstra Aspirant

    15
    51
    +0
    Haha! I just tried going to the ZBBlock link but got an error from ZBBlock blocking me.
     