Blocking junk spider UAs on IIS6/IIS7+, Nginx and Apache to cut server load, and how to deny a given User-Agent on IIS 7.5
Recently my site got very slow: CPU usage was high and overall server load was high as well. Opening the logs, I found a crowd of obscure spiders crawling the site non-stop. From experience this was clearly the cause, so I wrote blocking rules for my situation, and after applying them the load came back down. Below is a write-up of how to block obscure spider UAs under IIS, nginx and Apache.
Note: adjust the UA list to your own situation by deleting or adding entries. The rules below include some rarely seen spider UAs that you will almost never need; if your site is special and needs certain spiders to crawl it, go through the rules carefully and remove the corresponding UA entries.
Tested OK on IIS 7.5.

To deny access to UAs matching given patterns and return a 403:
<rule name="NoUserAgent" stopProcessing="true"> <match url=".*" /> <conditions> <add input="{HTTP_USER_AGENT}" pattern="|特征1|特征2|特征3" /> </conditions> <action type="CustomResponse" statusCode="403" statusReason="Forbidden: Access is denied." statusDescription="You did not present a User-Agent header which is required for this site" /> </rule>例如只禁止空UA
<add input="{HTTP_USER_AGENT}" pattern="|^$|特征2|特征3" />例如禁止其他UA+空UA
<add input="{HTTP_USER_AGENT}" pattern="^$|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot" />禁止特定蜘蛛
<rewrite>
  <rules>
    <rule name="Block Some IP Addresses OR Bots" stopProcessing="true">
      <match url="(.*)" />
      <conditions logicalGrouping="MatchAny">
        <!-- Block a specific spider -->
        <add input="{HTTP_USER_AGENT}" pattern="spider name" ignoreCase="true" />
        <!-- Block requests with an empty UA -->
        <add input="{HTTP_USER_AGENT}" pattern="^$" />
        <!-- Block a single IP, or an IP range expressed as a regex -->
        <add input="{REMOTE_ADDR}" pattern="single IP or regex for IP addresses" />
      </conditions>
      <!-- You can also use <action type="AbortRequest" /> in place of the response below -->
      <action type="CustomResponse" statusCode="403" statusReason="Access is forbidden." statusDescription="Access is forbidden." />
    </rule>
  </rules>
</rewrite>

To return 403 for everything except one file:
<rule name="Block spider"> <match url="(^robotssss.txt$)" ignoreCase="false" negate="true" /> <!-- 禁止瀏覽某文件 --> <action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Forbidden" /> </rule>1、nginx禁止垃圾蜘蛛訪問,把下列代碼放到你的nginx配置文件里面。
# Block crawling by Scrapy and similar tools
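# (The snippet that originally followed this comment was lost when the article
#  was copied around; the block below is a minimal reconstruction that reuses
#  the same UA list as the web.config and .htaccess rules further down.
#  Place these directives inside the server {} block and trim the list to taste.)
if ($http_user_agent ~* (Scrapy|HttpClient)) {
    return 403;
}

# Block the listed junk spiders as well as requests with an empty User-Agent (^$)
if ($http_user_agent ~ "MegaIndex|MegaIndex.ru|BLEXBot|Qwantify|qwantify|semrush|Semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|SemrushBot|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Python|Wget|Xenu|ZmEu|^$") {
    return 403;
}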
2. For IIS7/IIS8/IIS10 and later, create a web.config file in the site root and put the following code into it (the rewrite rules require the IIS URL Rewrite module to be installed):
<?xml version="1.0" encoding="UTF-8"?> <configuration> <system.webServer> <rewrite> <rules> <rule name="Block spider"> <match url="(^robots.txt$)" ignoreCase="false" negate="true" /> <conditions> <add input="{HTTP_USER_AGENT}" pattern="MegaIndex|MegaIndex.ru|BLEXBot|Qwantify|qwantify|semrush|Semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|SemrushBot|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Python|Wget|Xenu|ZmEu|^$" ignoreCase="true" /> </conditions> <action type="AbortRequest" /> </rule> </rules> </rewrite> </system.webServer> </configuration>3、apache請在.htaccess文件中添加如下規(guī)則即可:
<IfModule mod_rewrite.c>
  RewriteEngine On
  # Block spider
  RewriteCond %{HTTP_USER_AGENT} "MegaIndex|MegaIndex.ru|BLEXBot|Qwantify|qwantify|semrush|Semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|SemrushBot|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Python|Wget|Xenu|ZmEu|^$" [NC]
  RewriteRule !(^robots\.txt$) - [F]
</IfModule>

Note: the rules above block a default set of obscure spiders; to block other spiders, add them to the pattern in the same way.
For reference, the major search-engine spiders' names (a whitelist sketch follows the list):
Google spider: googlebot
Baidu spider: baiduspider
Baidu mobile spider: baiduboxapp
Yahoo spider: slurp
Alexa spider: ia_archiver
MSN spider: msnbot
Bing spider: bingbot
AltaVista spider: scooter
Lycos spider: lycos_spider_(t-rex)
AllTheWeb spider: fast-webcrawler
Inktomi spider: slurp
Youdao spiders: YodaoBot and OutfoxBot
热土 spider: Adminrtspider
Sogou spider: sogou spider
SOSO spider: sosospider
360 Search spider: 360spider
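These names matter mostly so your blocking rules do not catch them by accident. As one hedged safeguard (a sketch, not from the original article; $known_spider and $maybe_bad are names I made up), you can tag the major engines with an nginx map and only 403 unrecognized UAs that still claim to be a bot:

# map must sit at the http level, outside any server block
map $http_user_agent $known_spider {
    default        0;
    ~*googlebot    1;
    ~*baiduspider  1;
    ~*baiduboxapp  1;
    ~*bingbot      1;
    ~*slurp        1;  # Yahoo / Inktomi
    ~*sogou        1;
    ~*sosospider   1;
    ~*360spider    1;
}

server {
    # flag anything that calls itself a spider/bot/crawler ...
    set $maybe_bad 0;
    if ($http_user_agent ~* "(spider|bot|crawl)") { set $maybe_bad 1; }
    # ... then clear the flag for the engines tagged above
    if ($known_spider) { set $maybe_bad 0; }
    if ($maybe_bad) { return 403; }
}

Keep in mind that a UA string can be forged; truly verifying Googlebot or Baiduspider requires a reverse-DNS check, which plain UA matching cannot provide.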
Common junk UAs seen on the web (a blocking sketch follows the list):

Content scraping:
FeedDemon
Java
Jullo
Feedly
UniversalFeedParser

SQL injection:
BOT/0.1 (BOT for JCE)
CrawlDaddy

Useless crawlers:
EasouSpider
Swiftbot
YandexBot
AhrefsBot
jikeSpider
MJ12bot
YYSpider
oBot

CC attack tools:
ApacheBench
WinHttp

TCP attacks:
HttpClient

Scanners:
Microsoft URL Control
ZmEu phpmyadmin
jaunty
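To turn the list above into an actual block, one option (again a sketch; the $junk_ua variable is my own naming, and you should trim the list to your traffic) is an nginx map plus a single if:

map $http_user_agent $junk_ua {
    default 0;
    # content scraping (matching bare "Java" is aggressive; drop it if it over-blocks)
    "~*(FeedDemon|Jullo|Feedly|UniversalFeedParser|Java)"  1;
    # SQL injection probes
    "~*(BOT/0\.1|CrawlDaddy)"                              1;
    # useless crawlers (\b around oBot so it does not also match e.g. YodaoBot)
    "~*(EasouSpider|Swiftbot|YandexBot|AhrefsBot|jikeSpider|MJ12bot|YYSpider|\boBot\b)" 1;
    # CC / TCP attack tools
    "~*(ApacheBench|WinHttp|HttpClient)"                   1;
    # scanners
    "~*(Microsoft URL Control|ZmEu|jaunty)"                1;
}

server {
    # ... your usual server config ...
    if ($junk_ua) { return 403; }
}

Compared with a long chain of if blocks, a map keeps the whole list in one place and is evaluated lazily, only when $junk_ua is actually read.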
Summary

Whichever server you run, the method is the same: match the junk spiders' User-Agent strings (plus the empty UA) and refuse the request with a 403 or an aborted connection, while making sure the major search-engine spiders listed above are not caught in the process.