[opensuse] What is the rapid-fire googlebot traffic going to my webserver? Legit or DSA?
Listmates:
On a couple of occasions today, I have had machine gun speed googlebot
crawling all over my site. tcpdump shows the culprit as:
17:21:51.123736 IP nirvana.3111skyline.com.http >
crawl-66-249-71-154.googlebot.com.46612: . ack 260 win 54
On Monday 16 February 2009 00:35:06 David C. Rankin wrote:
The IP 66.249.71.154 is google alright, but who knows if it is spoofed. I killed it with everything I could think of:
Google crawls the net to index web servers so people can do web searches. If you want your site to show up when people do web searches, I suggest you stop blocking it.
It's easy to spoof IP addresses in DoS attacks, because then it's one-way traffic: it doesn't have to get a response. But it's much harder to do in a TCP connection, where the packets have to be responded to.
If you want to be sure, you can verify the MAC address of the packets. If it's the same as that of your default gateway, it comes from outside your provider's "local" LAN, and is almost guaranteed to be genuine.
But considering that it's something Google does in order to maintain their database, I see no reason to suspect foul play.
Be paranoid when they start trying to execute CGI scripts or similar on your site. Not for simple reads.
Anders
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse+help@opensuse.org
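A commonly recommended way to check whether a client really is Googlebot is a reverse DNS lookup followed by a forward lookup: the IP should resolve to a name under googlebot.com or google.com, and that name should resolve back to the same IP. The following is a minimal sketch of that check, not a definitive implementation. The function name `is_googlebot` and its injectable `reverse`/`forward` parameters are illustrative choices, not anything from the thread.

```python
import socket


def is_googlebot(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves to a Google crawler host name
    and that host name forward-resolves back to the same ip."""
    try:
        hostname = reverse(ip)[0]  # (hostname, aliases, addresses)
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses
```

Calling `is_googlebot("66.249.71.154")` performs live DNS lookups, so the answer depends on the resolver at the time it runs. A spoofer can control the reverse record for an address block they own, which is why the forward lookup is part of the check.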
The IP 66.249.71.154 is google alright, but who knows if it is spoofed. I killed it with everything I could think of:
google crawls the net to index web servers so people can do web searches. If you want your site to show up when people do web searches, I suggest you stop blocking it
Also, IIRC, Google are pretty good at obeying robots.txt, so if you don't want them crawling, you just need to say so in robots.txt.
Anders Johansson wrote:
On Monday 16 February 2009 00:35:06 David C. Rankin wrote:
The IP 66.249.71.154 is google alright, but who knows if it is spoofed. I killed it with everything I could think of:
google crawls the net to index web servers so people can do web searches. If you want your site to show up when people do web searches, I suggest you stop blocking it
It's easy to spoof IP addresses in DoS attacks, because then it's one way traffic, it doesn't have to get a response. But it's much harder to do in a TCP connection, when the packets have to be responded to.
If you want to be sure, you can verify the MAC address of the packets. If it's the same as that of your default gateway, it comes from outside your provider's "local" LAN, and is almost guaranteed to be genuine.
But considering that it's something google does in order to maintain their database, I see no reason to suspect foul play.
Be paranoid when they start trying to execute CGI scripts or similar on your site. Not for simple reads.
Anders
That's what has me worried about this one. Even with my robots.txt configured to block crawling, not 5 minutes ago I got another flood of googlebot requests that literally brought my internet connection to its knees. I don't mind Google indexing, but I would kind of like to be able to use my internet connection while they are doing it.
Case in point: just five minutes ago, while I was ssh'ed into the office, all of a sudden I noticed my cursor lagging far behind my typing (and I *don't* type fast at all). I checked rbpllc.com - no traffic there. I checked my home site 3111skyline.com - WTF?? The xterm I was running tcpdump in was just screaming lines of text. I just stopped the web server and -- thought for a minute. I can't imagine a legitimate bot causing enough traffic to drown your site out.
I admit, I draw blanks trying to interpret whether what I'm seeing is legitimate or not. Whatever this bot is, it caused enough traffic to completely fill the scroll buffer of my xterm in less than 1 second?? I have made the scroll buffer lines that I was able to catch available in hopes somebody can take a peek and let me know if it looks normal:
http://www.3111skyline.com/download/webdev/dos/access_log-20090215
Also, I raised the firewall and added a custom rule to take all traffic from 66.249.0.0/24 and redirect it to port 9345 (unassigned). Hopefully this works like an iptables drop. If not, or if it will cause problems, somebody please let me know.
Any knowledge you can give to help me sort through this will be appreciated. Thanks.
--
David C. Rankin, J.D.,P.E.
Rankin Law Firm, PLLC
510 Ochiltree Street
Nacogdoches, Texas 75961
Telephone: (936) 715-9333
Facsimile: (936) 715-9339
www.rankinlawfirm.com
Also, since I have always hated the firewall, I don't know how well it is ^\(mis\)*configured. So if you have problems getting to the site, let me know and I'll try again ;-)
--
David C. Rankin
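A quick sanity check on the firewall rule described above: the crawler address seen in tcpdump was 66.249.71.154, and a /24 network starting at 66.249.0.0 only spans 66.249.0.0 through 66.249.0.255, so the rule as written does not appear to cover the crawler. The sketch below uses Python's standard `ipaddress` module to show this. The helper name `rule_matches` and the wider 66.249.0.0/16 network are illustrative assumptions, not a recommendation of what to block.

```python
import ipaddress


def rule_matches(client_ip, network):
    """Return True if client_ip falls inside the given CIDR network."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(network)


crawler = "66.249.71.154"  # address from the tcpdump output in the thread

narrow = rule_matches(crawler, "66.249.0.0/24")  # the rule as posted
wide = rule_matches(crawler, "66.249.0.0/16")    # hypothetical wider rule

print(narrow)  # False
print(wide)    # True
```

If the flood continued after the rule was added, this mismatch is one possible explanation.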
On Sun, Feb 15, 2009 at 3:35 PM, David C. Rankin
Listmates:
On a couple of occasions today, I have had machine gun speed googlebot crawling all over my site. tcpdump shows the
Well, as Anders says (and I'm sure you know), Google spiders web sites. However, they honor robots.txt, and they spider in such a way as to cause very little load.
There is no reason to undertake heroic measures to block them, because they can spider you from any one of several hundred thousand addresses.
They do honor robots.txt, but if they follow a link into your subdirectory they will not always check the root of the web, so a robots.txt in each directory keeps them at bay.
I doubt it's spoofed. You really can't spoof an address off your local subnet for anything but denial of service attacks, unless you have control of upstream routers. But it's not their practice to swamp a server.
--
----------JSA---------
Someone stole my tag line, so now I have this rental.
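One way to put a number on "swamping" is to count how many requests each client made in any single second of the access log. The sketch below assumes the Apache common/combined log layout (client address first, timestamp in square brackets with one-second resolution). The function name and the sample lines are made up for illustration and are not taken from the posted log.

```python
import re
from collections import Counter

# Client address, two ignored fields, then the bracketed timestamp.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')


def peak_requests_per_second(lines):
    """Map each client to the most requests it made within any one second."""
    per_second = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            per_second[(match.group(1), match.group(2))] += 1
    peaks = {}
    for (client, _timestamp), count in per_second.items():
        peaks[client] = max(peaks.get(client, 0), count)
    return peaks


# Hypothetical sample lines, for illustration only.
sample = [
    '66.249.71.154 - - [15/Feb/2009:17:21:51 -0600] "GET /a.html HTTP/1.1" 200 512',
    '66.249.71.154 - - [15/Feb/2009:17:21:51 -0600] "GET /b.html HTTP/1.1" 200 512',
    '66.249.71.154 - - [15/Feb/2009:17:21:52 -0600] "GET /c.html HTTP/1.1" 200 512',
    '192.0.2.7 - - [15/Feb/2009:17:21:52 -0600] "GET / HTTP/1.1" 200 1024',
]
peaks = peak_requests_per_second(sample)
print(peaks)
```

On a real log, pass an open file object instead of `sample`. A peak of a few requests per second from one crawler is a very different picture from hundreds.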
On Sun, 15 Feb 2009, David C. Rankin wrote:-
I get 2-3 packets a minute here which is no big deal. Earlier, the screen would just scroll continually and internet speed was taking a *big* hit.
Looks like the spidering bot has finally got around to going through your site. My site had the same issue when Google, and the other search engines, initially spider'd it. That was a while ago and now I only see Google, MSN and Yahoo recheck links every few days. Also, after the initial spidering, they don't go through all the links at once. Normally I see them checking about 5-10 links per day, at least for the normal pages. My gallery pages seem to be hit a little more frequently, but not enough to be a big issue.
So what is the consensus of the brain trust as to what I'm looking at and what this is? Legit or something I should be concerned with? Any other defensive measures I should take?
There is a "crawl-delay" option that you can add to the robots.txt file, however Google doesn't support it. An alternative is to create an account with Google and then use their "Webmaster Tools" to slow down the crawl rate.
Regards, David Bolt
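To see how a robots.txt with these directives is interpreted, Python's standard `urllib.robotparser` can parse one locally. This is only a sketch: the file contents, the 10-second delay, and the example.com URLs are assumptions for illustration. The parser below does read Crawl-delay, but as noted above, Googlebot itself ignores that directive, so only the Disallow rule would be expected to affect it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /download/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

blocked = parser.can_fetch("Googlebot", "http://www.example.com/download/file")
allowed = parser.can_fetch("Googlebot", "http://www.example.com/index.html")
delay = parser.crawl_delay("Googlebot")

print(blocked)  # False: the path is under /download/
print(allowed)  # True
print(delay)    # 10
```

Crawlers fetch robots.txt from the root of the site, so that is the copy this kind of check should be run against.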
Alright, on the words of all the people that are a whole lot smarter than I am on this stuff, I reset the config:
    rcSuSEfirewall2 stop
    vi /etc/hosts.deny; (:3,6d)
    vi /etc/apache2/default-server.conf (:.d)
    rcapache2 reload
    mv /srv/www/robots.txt /tmp
The bot is back at it....
--
David C. Rankin
participants (5)
- Anders Johansson
- David Bolt
- David C. Rankin
- John Andersen
- Philip Dowie