How to block crawlers from scraping a website with PHP

Published: 2024-05-03

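How to block scraping in PHP: first read the UA string from $_SERVER['HTTP_USER_AGENT']; then store the malicious user agents in an array; finally reject requests with an empty user agent (which is what most mainstream scraping programs send) along with any UA found in the array.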
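As we all know, the web is full of crawlers. Some help a site get indexed, such as the Baidu spider (baiduspider). Others ignore robots rules, put load on the server, and bring no traffic in return, such as the Yisou spider (yisouspider). (Update: the Yisou spider has been acquired by UC Shenma Search, so this article no longer bans it.) Zhang Ge recently noticed many fetches by Yisou and other junk crawlers in his nginx logs, so he collected the various methods found online for keeping junk spiders off a site. He applied them to his own site and shares them here as a reference for other webmasters.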
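1. Apache
(1) By modifying the .htaccess file
Edit the .htaccess file in the site directory and add the blocking rules there.
2. Nginx configuration
Go to the conf directory under the nginx installation directory and save the following configuration as agent_deny.conf: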
cd /usr/local/nginx/conf
vim agent_deny.conf
# Block scraping by tools such as Scrapy
if ($http_user_agent ~* (scrapy|curl|httpclient)) {
    return 403;
}
# Block the listed UAs and requests with an empty UA
if ($http_user_agent ~* "feeddemon|indy library|alexa toolbar|asktbfxtv|ahrefsbot|crawldaddy|coolpadwebkit|java|feedly|universalfeedparser|apachebench|microsoft url control|swiftbot|zmeu|obot|jaunty|python-urllib|lightdeckreports bot|yyspider|digext|httpclient|mj12bot|heritrix|easouspider|ezooms|^$" ) {
    return 403;
}
# Block request methods other than GET, HEAD and POST
# ($request_method is uppercase and !~ is case-sensitive)
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
Then, in the site's configuration, insert the following line after location / {:
include agent_deny.conf;
For example, the configuration of Zhang Ge's blog:
[marsge@mars_server ~]$ cat /usr/local/nginx/conf/zhangge.conf
location / {
    try_files $uri $uri/ /index.php?$args;
    # Add one line at this position:
    include agent_deny.conf;
    rewrite ^/sitemap_360_sp.txt$ /sitemap_360_sp.php last;
    rewrite ^/sitemap_baidu_sp.xml$ /sitemap_baidu_sp.php last;
    rewrite ^/sitemap_m.xml$ /sitemap_m.php last;
After saving, run the following command to reload nginx gracefully:
/usr/local/nginx/sbin/nginx -s reload
3. PHP code
Paste the following code into the site's entry file index.php, right after the first <?php:
// Get the UA string (it may be missing entirely, so check with isset)
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
// Store the malicious user agents in an array
$now_ua = array('feeddemon ','bot/0.1 (bot for jce)','crawldaddy ','java','feedly','universalfeedparser','apachebench','swiftbot','zmeu','indy library','obot','jaunty','yandexbot','ahrefsbot','mj12bot','winhttp','easouspider','httpclient','microsoft url control','yyspider','python-urllib','lightdeckreports bot');
// Block empty user agents: mainstream scrapers such as dedecms send an empty
// user agent, and so do some SQL injection tools
if (!$ua) {
    header("Content-Type: text/html; charset=utf-8");
    die('Please do not scrape this site!');
} else {
    foreach ($now_ua as $value) {
        // Check whether the UA contains an entry from the array.
        // stripos() replaces eregi(), which was removed in PHP 7.
        if (stripos($ua, $value) !== false) {
            header("Content-Type: text/html; charset=utf-8");
            die('Please do not scrape this site!');
        }
    }
}
4. Testing the result
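On a VPS this is simple: use curl -A to send a spoofed user agent, for example:
Simulate a fetch by the Yisou spider:
curl -I -A 'yisouspider' zhang.ge
Simulate a fetch with an empty UA:
curl -I -A '' zhang.ge
Simulate a fetch by the Baidu spider:
curl -I -A 'baiduspider' zhang.ge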
In the three fetches, the Yisou spider and the empty-UA request both received 403 Forbidden, while the Baidu spider received 200, which shows that the rules are in effect.
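Where shell access is not available, the matching logic of the PHP variant can also be checked offline. The following is only a minimal sketch: is_blocked_ua() is a hypothetical helper that does not appear in the original snippet, the deny list is shortened for illustration, and the sample UA strings are made up for the test.

```php
<?php
// Hypothetical helper (not part of the original snippet): returns true when
// the UA is empty or contains any deny-list entry, ignoring case.
function is_blocked_ua($ua, array $deny_list)
{
    if ($ua === null || trim($ua) === '') {
        return true; // empty UA is always rejected
    }
    foreach ($deny_list as $needle) {
        $needle = trim($needle);
        if ($needle === '') {
            continue; // an empty needle would match everything
        }
        if (stripos($ua, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// Shortened deny list for illustration; use the full array in practice.
$deny_list = array('feeddemon', 'ahrefsbot', 'mj12bot', 'easouspider', 'python-urllib');

// Sample UA strings, invented for this check.
$samples = array('', 'Mozilla/5.0 (compatible; AhrefsBot/7.0)', 'baiduspider');
foreach ($samples as $ua) {
    echo ($ua === '' ? '(empty)' : $ua), ' => ',
        is_blocked_ua($ua, $deny_list) ? 'blocked' : 'allowed', "\n";
}
```

Keeping the check in a function makes it easy to rerun against new UA strings whenever the deny list changes, before deploying the change to index.php.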
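Addendum: the next day, the nginx logs showed the rules working:
(1) Junk scrapers with an empty UA were blocked.
(2) Requests from the banned UAs were blocked.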
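So, to collect junk spiders, we can analyze the site's access logs, pick out spider names we have not seen before, and, once a lookup confirms they are unwanted, add them to the deny lists in the code above so that they can no longer fetch the site.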
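As a rough illustration of that log analysis, the sketch below tallies user agents from access-log lines. It assumes nginx's default "combined" log format, in which the UA is the last double-quoted field and an empty UA is written as "-". The function name count_user_agents() and the sample log lines are hypothetical.

```php
<?php
// Hypothetical helper: count how often each user agent appears in log lines
// written in nginx's "combined" format (UA = last double-quoted field).
function count_user_agents(array $lines)
{
    $counts = array();
    foreach ($lines as $line) {
        // Capture the final quoted field on the line.
        if (preg_match('/"([^"]*)"\s*$/', $line, $m)) {
            $ua = ($m[1] === '' || $m[1] === '-') ? '(empty)' : $m[1];
            $counts[$ua] = isset($counts[$ua]) ? $counts[$ua] + 1 : 1;
        }
    }
    arsort($counts); // most frequent user agents first
    return $counts;
}

// Invented sample lines; for a real log, read the file instead, e.g.
// file($path, FILE_IGNORE_NEW_LINES) with $path set to your access log.
$sample = array(
    '192.0.2.10 - - [03/May/2024:10:00:00 +0800] "GET / HTTP/1.1" 403 162 "-" "-"',
    '192.0.2.11 - - [03/May/2024:10:00:05 +0800] "GET /a.html HTTP/1.1" 403 162 "-" "mj12bot"',
    '192.0.2.12 - - [03/May/2024:10:00:09 +0800] "GET /b.html HTTP/1.1" 403 162 "-" "-"',
);

print_r(count_user_agents($sample));
```

Names near the top of the tally that are not known search engines are the candidates worth looking up before adding them to the deny list.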
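5. Appendix: UA collection
Below is a list of junk UAs commonly seen on the web. It is for reference only, and additions are welcome.
feeddemon               content scraping
bot/0.1 (bot for jce)   SQL injection
crawldaddy              SQL injection
java                    content scraping
jullo                   content scraping
feedly                  content scraping
universalfeedparser     content scraping
apachebench             CC attack tool
swiftbot                useless crawler
yandexbot               useless crawler
ahrefsbot               useless crawler
yisouspider             useless crawler (acquired by UC Shenma Search; this spider can be unblocked)
mj12bot                 useless crawler
zmeu                    phpMyAdmin vulnerability scanning
winhttp                 scraping, CC attacks
easouspider             useless crawler
httpclient              TCP attack
microsoft url control   scanning
yyspider                useless crawler
jaunty                  WordPress brute-force scanner
obot                    useless crawler
python-urllib           content scraping
indy library            scanning
flightdeckreports bot   useless crawler
linguee bot             useless crawler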
