twiceler Robot

Discussion in 'Site Security & Legal Issues' started by blackjack, May 27, 2007.

  1. blackjack

    blackjack Enthusiast

    I've seen this robot around my forum quite regularly for the past week. Is it anything to worry about? A quick Google search showed it as an experimental bot... what does that mean exactly? :shifty:
     
  2. Big_doug

    Big_doug Neophyte

    Blocking Twiceler

    Twiceler is a badbot

    Twiceler has been rampaging on one of the sites I administer. It leeched an enormous amount of bandwidth, nearly 2 GB this month, until it was blocked. (It visited nearly 70,000 times!)

    Twiceler does not obey normal robots.txt directives and can only be blocked by denying access in the .htaccess file. (Be careful with this file, and back up its existing contents first.)

    This site has a good tutorial for blocking robots

    http://www.clockwatchers.com/htaccess_block.html

    Twiceler was using various IP addresses starting with 38.99.

    Inserting this code in the .htaccess file blocked it from leeching:

    order allow,deny
    deny from 38.99
    allow from all

    Other IP addresses can be blocked the same way.


    Twiceler caused major bandwidth leeching on numerous sites one of my sons administers. In one case it made a site unreachable by using up all its bandwidth. He had to insert the above code in 40+ other sites to prevent bandwidth leeching.

    I hope this is of help.

    Doug
     
  3. blackjack

    blackjack Enthusiast

    Oh boy!
    Thanks for the heads up. I will continue to monitor the 'visits' and my bandwidth carefully, and let my host know too.
    I will try to work out how to block them in the .htaccess file. A little scary for me!
    Many thanks.
     
  4. Big_doug

    Big_doug Neophyte

    Twiceler has come back this morning after two days' rest with an alternative IP address starting with 64.1
    ------------------------------
    IP address record info for the current attack -
    http://whois.domaintools.com/64.1.215.164

    -------------------------------

    This code in the .htaccess file will block it, but it will also block any legitimate visitor whose IP address starts with the same two numbers. I don't think there will be that many ;-) You just need to change the first two octets of the IP address if it changes.

    order allow,deny
    deny from 38.99
    deny from 64.1
    allow from all

    It does mean that you have to periodically check your site visitors.
    So far I only know of Twiceler using IP addresses starting with 38 and 64
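
    To get a feel for how wide those two-number blocks are, here is a rough Python sketch. It assumes that a partial address such as 38.99 in a "deny from" line matches on the leading octets, which is the same set of hosts as 38.99.0.0/16. The function names and the 192.0.2.1 test address are made up for illustration.

```python
import ipaddress

# Assumption: "deny from 38.99" and "deny from 64.1" behave like these /16 networks.
BLOCKED = [
    ipaddress.ip_network("38.99.0.0/16"),
    ipaddress.ip_network("64.1.0.0/16"),
]


def is_blocked(addr: str) -> bool:
    """Return True if addr falls inside one of the blocked prefixes."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED)


def addresses_covered() -> int:
    """Total number of addresses shut out by the two prefixes."""
    return sum(net.num_addresses for net in BLOCKED)


if __name__ == "__main__":
    print(is_blocked("64.1.215.164"))  # the address from the whois link above
    print(is_blocked("192.0.2.1"))     # hypothetical visitor outside both prefixes
    print(addresses_covered())         # two /16 blocks, 65,536 addresses each
```

    So each two-number rule shuts out 65,536 addresses, which is why a few innocent visitors could be caught by it.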

    If your server allows it, the following code in .htaccess will block it by user agent. (It will also stop EmailSiphon and Exabot, as shown.) You can add conditions for other unwanted agents that are indexing your site!
    It is not IP address dependent.

    # Options and the engine switch go first, as is conventional
    Options +FollowSymLinks
    RewriteEngine On
    RewriteBase /

    # Answer 403 Forbidden when the User-Agent contains any of these names
    RewriteCond %{HTTP_USER_AGENT} EmailSiphon [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} Exabot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} Twiceler [NC]
    RewriteRule .* - [F,L]


    Doug
     
  5. lcx

    lcx Neophyte

    These are the IPs I found in my logs:

    208.36.144.10
    208.36.144.7
    208.36.144.8
    38.99.13.123
    38.99.13.124
    38.99.13.125
    38.99.13.126
    64.1.215.163
    64.1.215.164
    64.1.215.165

    I blocked them all at the firewall :)
     
  6. julia44

    julia44 Adherent

    I'm going to sound like an idiot, but how do you learn anything if you don't ask? Where do I go to check my logs of who has visited, so I can start seeing if this bot has visited my sites?
     
  7. lcx

    lcx Neophyte

    Well, this depends on your server and OS.
    Give us some more information and then we can help :)
     
  8. minstrel

    minstrel Tazmanian Veteran

    Julia, look to see what statistics package your host offers. Common ones are AWStats and Webalizer.

    If you access those packages (perhaps through cPanel), you should be able to see the "robots" who have visited your site in order of frequency and their IP addresses.
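
    If your host only gives you the raw access log rather than a stats package, a short script can produce a similar per-IP count. This is a minimal Python sketch, assuming the log is in Apache's combined format; the sample lines and their 192.0.2.x addresses are invented for illustration, and count_hits is just a name I picked.

```python
import re
from collections import Counter

# Combined Log Format: ip ident user [time] "request" status size "referer" "agent"
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def count_hits(lines, needle="twiceler"):
    """Count requests per IP whose User-Agent contains needle (case-insensitive)."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and needle.lower() in m.group("agent").lower():
            hits[m.group("ip")] += 1
    return hits


# Invented sample lines; real ones would come from your own access log.
SAMPLE = [
    '192.0.2.10 - - [01/Jan/2008:12:00:00 +0000] "GET / HTTP/1.0" 200 512 "-" '
    '"Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)"',
    '192.0.2.10 - - [01/Jan/2008:12:00:05 +0000] "GET /forum/ HTTP/1.0" 200 2048 "-" '
    '"Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)"',
    '192.0.2.20 - - [01/Jan/2008:12:01:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1)"',
]

if __name__ == "__main__":
    for ip, n in count_hits(SAMPLE).most_common():
        print(ip, n)
```

    To run it against a real file, pass the open file object instead of SAMPLE, for example count_hits(open(path)) with path pointing at wherever your host keeps the access log.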
     
  9. julia44

    julia44 Adherent

    Took a few, but I finally did see the logs in my AWStats ;) Thanks, guys!
     
  10. pingu

    pingu Stealin ur Bandwidth

    The .htaccess method seems to stop it. My forum had several thousand connections from this piece of crap last night and it killed my database server...

    For those interested, it's allegedly pioneering a new approach to search... yeah, by simulating a DoS attack.

    Contact details for the muppets that designed it are:

    Cuill, Inc.
    66 Willow Place
    Menlo Park, Ca 94025
    (650) 325 1701 Office
    (650) 325 1702 Fax


    and the whois entry is

    Registrant:
    Tom Costello
    Tom Costello
    1127 Thorntree CT
    San Jose CA 95120
    US
    Email: costello@cs.stanford.edu
    Registrar Name....: REGISTER.COM INC.
    Registrar Whois...: whois.register.com
    Registrar Homepage: www.register.com
    Domain Name: cuill.com
    Created on..............: Thu Apr 07 2005
    Expires on..............: Mon Apr 07 2008
    Record last updated on..: Mon Feb 27 2006
    Administrative Contact:
    Tom Costello
    Tom Costello
    1127 Thorntree CT
    San Jose CA 95120
    US
    Phone: (408) 323-1065
    Email: costello@cs.stanford.edu
    Technical Contact:
    Register.Com
    Domain Registrar
    575 8th Avenue 11th Floor
    New York NY 10018
    US
    Phone: 1-902-7492701
    Email: domain-registrar@register.com
    DNS Servers:
    dns23.register.com
    dns24.register.com
    Visit AboutUs.org for more information about cuill.com
    http://www.aboutus.org/cuill.com
    Register your domain name at http://www.register.com


    So complaining to Stanford University may be in order too...
     
  11. Hawke

    Hawke Aspirant

    Twiceler is an experimental bot designed by Cuill, Inc., and many webmasters will see it skimming through their sites on a frequent basis. I think it was designed for the Cuill search engine (which is claimed to be better than Google because it takes a different approach).

    It was also designed by ex-employees of Google (I think in management positions). That's about all I know about it. I was not worried about it, as it has been at my site for two months now, but I guess I should do some more research on this.
     
  12. pingu

    pingu Stealin ur Bandwidth

    Interestingly, I had a response from the company.

    Fair dos for replying, but I wish they would do their testing in a lab rather than on live sites.

     
  13. mc1457

    mc1457 Neophyte

    That's BS... I have no issues with any other crawler. Twiceler has DoS'd my server SEVERAL times in the past two weeks. So I need to block ALL bots from my site to keep THEM from bringing it to its knees? Or ask them to please not DoS my server every morning? Bull @#$#.

    The 38.99... addresses belong to Cogent. The abuse address is abuse@cogentco.com.

    64.1... and 208.36... belong to XO Communications. Abuse address is abuse@xo.com.

    Suggest you all send an email to all of these abuse addresses detailing the abusive traffic and cc contact@cuill.com.

    It makes me furious that this ass-clown makes his bot's poor behavior OUR problem. Maybe he should throttle the requests so he doesn't slam our servers. But that would take actual software architecture and programming skills...

    I emailed Cuill this morning. As soon as I hear back I plan to complain to their ISPs and include the response if I don't get some assurance that they acknowledge this is their problem and needs to be fixed.
     
  14. minstrel

    minstrel Tazmanian Veteran

    I get occasional visits from Twiceler showing up in my logs but I'm certainly not being pounded. Yahoo Slurp is much greedier in spidering my sites.

    That said, if I had your problem, I'd be complaining to any and all I could find as well. Start by taking them up on this statement, with copies to the ISPs you see sending the bot:

    Then you have some documentation that your "cease and desist" request has been ignored if it continues.
     
  15. pingu

    pingu Stealin ur Bandwidth


    No need to block all bots. The .htaccess method stops it (for now).
     
  16. prepress_forums

    prepress_forums Aspirant

    Experimental robot? Experimenting in what? The bot was logged going around IP blocks in .htaccess by just rotating to another IP address. They list 22 different IP addresses on their site.

    These guys are saying on another site that they will obey robots.txt after 7 days? WTF? The site says the owners are ex-Google folks, so they know very well what they are doing. They have no search page where a human visitor can search their results, so they send the webmaster no legitimate visitor traffic at all. Why are they beating the heck out of our sites from 22 servers that automatically hop around an IP block and ignore robots.txt?

    If you email the admin, they will stop visiting. But what kind of protocol is that?? This should be shut down.
     
  17. islandgirl

    islandgirl Neophyte

    First post around here. I googled cuill.com and ended up here. Thanks for the .htaccess assistance, Big_doug.

    I have noticed this bot around my various logs lately.

    Since when does a search engine harvest email addresses or try to do injections on a database-driven site? My suspicions now seem warranted that this is spam activity, not a legitimate startup:

    /var/log/httpd/access_log:64.1.215.166 - - [20/Feb/2008:12:38:40 -0600] "GET /directory/weblist.php?cat=1 HTTP/1.0" 200 8223 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)"
    /var/log/httpd/access_log:64.1.215.166 - - [20/Feb/2008:12:48:45 -0600] "GET /directory/weblist.php?cat=2 HTTP/1.0" 200 9459 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)"
    /var/log/httpd/access_log:64.1.215.166 - - [20/Feb/2008:14:45:09 -0600] "GET /directory/weblist.php?cat=3 HTTP/1.0" 200 8443 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)"
    /var/log/httpd/access_log:64.1.215.166 - - [20/Feb/2008:14:54:42 -0600] "GET /directory/'mailto:someemailaddy@aol.com' HTTP/1.0" 404 8519 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)"
    /var/log/httpd/access_log:64.1.215.166 - - [20/Feb/2008:21:00:35 -0600] "GET /directory/'list.php?id=34' HTTP/1.0" 404 8519 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)"
     
  18. amoona

    amoona Aspirant

    Actually, I don't read my logs very often. Hmm~~ seems I need to take some time and check them from now on.
     
  19. Big_doug

    Big_doug Neophyte

    Twiceler update

    For all of you, including me, who have suffered from the 'rampaging' twiceler robot, an interesting bit of news! ;-)

    Big_doug

    Cuil, the search engine responsible for the Twiceler robot, is now available to use. I typed Twiceler into the search box on Cuil, which gave 34,127,017 results for twiceler. After a few pages of results, this message was displayed!

    "No results because of high load...

    Due to excessive load, our servers didn't return results. Please try your search again."

    Webmaster Info for Cuil -

    http://www.cuil.com/info/webmaster_info/

    This gives the following information -

    Webmaster Info

    Cuil is the biggest search engine on the planet. In our quest to let users search as much of the Internet as possible, Cuil has indexed more than 120 billion pages so far.

    If you would like Cuil to crawl your site and have it included in our index, please let us know.

    Twiceler is the name of our robot Web crawler. The user-agent is “twiceler.” We understand that many small sites are bandwidth-limited, so we support the robots.txt Crawl-delay directive. You can read about robots.txt at robotstxt.org and there is a simple generator of the file at mcanerin.com.

    If you have modified your robots.txt file for Twiceler, it may take several days for us to re-read the file. If you need something blocked right away, please let us know.

    Got a Twiceler question? If you have questions or concerns about Twiceler you can contact Jim. Jim’s the guy who keeps track of Twiceler, when he’s not busy with his horses.

    If you would prefer that we not crawl your site at all we are happy to oblige. Just drop Jim a note to that effect and he will place your site or IP address on our do-not-crawl list. Be sure to be explicit about the site to block as email address domains frequently differ from the site in question.

    Occasionally, we have seen other Web crawling robots masquerading as Twiceler. You can be sure it’s Cuil crawling your site if the robot has one of the following IP addresses:

    38.99.13.121 38.99.44.101 64.1.215.166 208.36.144.6
    38.99.13.122 38.99.44.102 64.1.215.162 208.36.144.7
    38.99.13.123 38.99.44.103 64.1.215.163 208.36.144.8
    38.99.13.124 38.99.44.104 64.1.215.164 208.36.144.9
    38.99.13.125 38.99.44.105 64.1.215.165 208.36.144.10
    38.99.13.126 38.99.44.106

    To all those who have contacted us to let us know that they are happy to have their site included in a Web index for the first time, thank you for being a part of the biggest search engine on the Web—Cuil!
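
    For anyone who wants to try the robots.txt route that page describes, here is a rough Python sketch of what such a file could look like and how a standards-following parser reads it. The 30-second delay and the variable names are only example choices of mine, and it uses Python's own urllib.robotparser, so it shows how a well-behaved crawler would interpret the file; whether Twiceler actually honors it is Cuil's claim, not something this checks.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt that asks "twiceler" to wait between requests.
# The 30-second value is an arbitrary illustration.
THROTTLE = """\
User-agent: twiceler
Crawl-delay: 30

User-agent: *
Disallow:
"""

# Example robots.txt that asks "twiceler" to stay away entirely.
BLOCK = """\
User-agent: twiceler
Disallow: /
"""


def crawl_delay_for(agent, robots_txt):
    """Return the Crawl-delay a compliant parser would apply to agent, or None."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.crawl_delay(agent)


def may_fetch(agent, path, robots_txt):
    """Return True if a compliant parser would let agent request path."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, path)


if __name__ == "__main__":
    print(crawl_delay_for("twiceler", THROTTLE))
    print(may_fetch("twiceler", "/forum/", BLOCK))
    print(may_fetch("Googlebot", "/forum/", BLOCK))
```

    As the page says, a changed robots.txt may take several days to be re-read, so the .htaccess blocks earlier in the thread remain the quicker fix.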
     
  20. minstrel

    minstrel Tazmanian Veteran

    Isn't Cuil associated with Google?
     