How to Talk to Search Engine Crawlers

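Crawl strategy: which pages we need to download, which we do not, and which we should download first. Defining this clearly saves a lot of pointless crawling. Update strategy: monitor list pages to discover new pages, periodically check whether pages have expired, and so on. Extraction strategy: how to extract the content we want from a page, which includes not only the final target content but also the URLs to crawl next. Crawl frequency: we need to download a site at a reasonable rate without sacrificing efficiency.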
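The crawl strategy described above (what to skip, what to fetch first) can be sketched as a small priority queue. This is only an illustration under stated assumptions: the path prefixes, the example.com URLs and the function names are hypothetical and do not come from any particular crawler.

```python
import heapq
from urllib.parse import urlparse

# Hypothetical rules: paths to skip entirely and paths to fetch first.
SKIP_PREFIXES = ("/admin/", "/login")
PRIORITY_PREFIXES = ("/list/",)


def priority(url):
    """Return None for URLs to skip, 0 for high priority, 1 otherwise."""
    path = urlparse(url).path
    if any(path.startswith(prefix) for prefix in SKIP_PREFIXES):
        return None
    if any(path.startswith(prefix) for prefix in PRIORITY_PREFIXES):
        return 0
    return 1


def crawl_order(urls):
    """Order URLs by priority, keeping discovery order within a priority."""
    heap = []
    for order, url in enumerate(urls):
        rank = priority(url)
        if rank is not None:
            heapq.heappush(heap, (rank, order, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


if __name__ == "__main__":
    discovered = [
        "https://www.example.com/about.html",
        "https://www.example.com/admin/settings",
        "https://www.example.com/list/page-1",
    ]
    for url in crawl_order(discovered):
        print(url)
```

List pages are ranked first here because, as the update strategy notes, they are where new pages tend to be discovered; a real crawler would also add a per-site delay between requests to respect the crawl-frequency concern.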

This got me thinking about the question of "how to talk to crawlers". The points summarized below mainly address the crawler "crawl strategy" mentioned above.

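1. Talking to crawlers through robots.txt: when a search engine discovers a new site, the first thing it visits is, in principle, the robots.txt file. The allow/disallow syntax tells the search engine which files and directories may be crawled and which may not.

For a detailed introduction to robots.txt, see: about /robots.txt. Also note that the order of the allow/disallow rules makes a difference.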

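2. Talking to crawlers through meta tags: for example, sometimes we want a site's list pages to stay out of the search engine's index while still being crawled. We can tell the crawler this with <meta name="robots" content="noindex,follow">. Other common values include noarchive, nosnippet and noodp.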

3. Talking to crawlers through rel="nofollow": on this topic, Guoping recently wrote an article, "How to Make Good Use of nofollow", that is well worth reading. I believe you will find it very instructive.

4. Talking to crawlers through rel="canonical": the Google Webmaster Tools help has a detailed introduction to rel="canonical": "Learn more about rel="canonical"".

5. Talking to crawlers through sitemaps: the most common forms are XML sitemaps and HTML sitemaps. An XML sitemap can be split into several files or compressed, and its address can be written into the robots.txt file.

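6. Talking to search engines through webmaster tools: the one we use most is Google Webmaster Tools, which lets us set Googlebot's crawl rate, block links we do not want crawled, control sitelinks, and so on. Bing and Yahoo also have webmaster tools. Baidu has the Baidu Webmaster Platform, which has been in closed beta for more than a year and still is; without an invitation code you cannot register.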
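Of the channels listed above, robots.txt is the easiest to check programmatically. The sketch below uses Python's standard urllib.robotparser to show how allow/disallow rules and a Sitemap line are read, and why rule order can matter. The rules, the example.com URLs and the "ExampleBot" user agent are hypothetical. Python's parser applies the first rule that matches; other crawlers may resolve conflicts differently (for example by the most specific matching path), so treat this as one parser's behaviour rather than a universal rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the Allow rule comes before the broader Disallow.
ALLOW_FIRST = """\
User-agent: *
Allow: /list/public/
Disallow: /list/
Sitemap: https://www.example.com/sitemap.xml
"""

# Same rules with the order reversed.
DISALLOW_FIRST = """\
User-agent: *
Disallow: /list/
Allow: /list/public/
"""


def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


URL = "https://www.example.com/list/public/1.html"
allow_first = build_parser(ALLOW_FIRST)
disallow_first = build_parser(DISALLOW_FIRST)

if __name__ == "__main__":
    # First-match semantics: the result depends on which rule is listed first.
    print(allow_first.can_fetch("ExampleBot", URL))
    print(disallow_first.can_fetch("ExampleBot", URL))
    # site_maps() requires Python 3.8 or later.
    print(allow_first.site_maps())
```

With this parser the first call is expected to allow the URL and the second to block it, which is one concrete way the order of allow/disallow rules changes the outcome.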

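This also leads to a concept I have always paid close attention to: the site index ratio. The index ratio = number of pages the search engine has indexed for the site / actual number of pages on the site. The higher the index ratio, the more smoothly the search engine is crawling the site.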
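The index ratio defined above is a simple division. A minimal sketch follows; the function name and the sample figures are made up for illustration, and in practice the indexed count would come from a search engine's reporting while the total would come from the site's own database.

```python
def index_ratio(indexed_pages, total_pages):
    """Indexed pages divided by the site's actual number of pages."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    if indexed_pages < 0:
        raise ValueError("indexed_pages must not be negative")
    return indexed_pages / total_pages


if __name__ == "__main__":
    # Hypothetical figures: 6,000 pages indexed out of 8,000 real pages.
    ratio = index_ratio(6000, 8000)
    print(f"{ratio:.0%}")
```

Tracking this number per section of a site (for example list pages versus detail pages) can show where crawling is going badly, though the article itself only defines the site-wide ratio.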

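That is all I can think of for now. The aim is a tentative exploration of how to raise a site's index count in search engines more effectively.

Consider this a modest starting point; additions from everyone are welcome!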

Note:

A web crawler, also called a web spider, is a computer program that fetches and downloads web pages from the internet according to a defined logic and algorithm. It is an important component of a search engine.

Author: Bruce (original link).

Source: 月光博客

Public @ 2017-01-01 16:22:28


