Nutch Development (Part 6)
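Table of Contents
- Nutch Development (Part 6)
- 1. Integrating Nutch 1.18 with Solr 8.11.0
- 1.1 Configuring index-writers.xml
- 1.2 Configuring the Solr core fields
- 1.3 Configuring the IK analyzer in Solr
- 1.4 Adjusting the Nutch metatags plugin configuration
- 2. Testing whether the custom plugin works
- Running ParserChecker
- 2.1 Creating a run configuration in IDEA
- 2.2 Equivalent command
- 2.3 Analyzing the parse output
- Running IndexChecker
- 2.4 Creating a run configuration in IDEA
- 2.5 Equivalent command
- 2.6 Analyzing the indexing filter output
- 3. Adjusting the fetch delay
- 4. Viewing the crawl results in Solr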
1. Integrating Nutch 1.18 with Solr 8.11.0
1.1 Configuring index-writers.xml
The conf directory of Nutch 1.18 contains an index-writers.xml file; its settings are passed to the indexer-solr plugin.
<writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
  <parameters>
    <param name="type" value="http"/>
    <param name="url" value="http://localhost:8983/solr/nutch"/>
    <param name="collection" value=""/>
    <param name="weight.field" value=""/>
    <param name="commitSize" value="1000"/>
    <param name="auth" value="false"/>
    <param name="username" value="username"/>
    <param name="password" value="password"/>
  </parameters>
  <mapping>
    <copy>
      <!-- <field source="content" dest="search"/> -->
      <!-- <field source="title" dest="title,search"/> -->
    </copy>
    <rename>
      <!-- field names are renamed here -->
      <field source="metatag.description" dest="description"/>
      <field source="metatag.keywords" dest="keywords"/>
      <field source="metatag.icon" dest="icon"/>
      <field source="metatag.good" dest="good"/>
      <field source="metatag.collections" dest="collections"/>
    </rename>
    <remove>
      <field source="segment"/>
    </remove>
  </mapping>
</writer>
1.2 Configuring the Solr core fields
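I configured the fields listed below in the Solr core:
- good: number of likes
- keywords: article keywords
- pulishedTime: publication date of the article
- title: title
- description: article description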
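One possible way to create these fields is Solr's Schema API. The sketch below only builds the JSON payload for an `add-field` command and does not contact Solr. The field names come from the list above; every field type (`pint`, `string`, `text_general`) is an assumption based on Solr's default configset, not something stated in this post, and should be replaced with whatever the core really uses (for example an IK-analyzed text type).

```python
import json

# Core URL as configured in index-writers.xml; adjust if yours differs.
SCHEMA_URL = "http://localhost:8983/solr/nutch/schema"

# Field names come from the list above; every type here is an assumption.
FIELD_TYPES = {
    "good": "pint",
    "keywords": "string",
    "pulishedTime": "string",   # name spelled exactly as in the crawler config
    "title": "text_general",
    "description": "text_general",
}


def build_add_field_payload(field_types):
    """Return a Schema API 'add-field' command for a name -> type mapping."""
    return {
        "add-field": [
            {"name": name, "type": field_type, "indexed": True, "stored": True}
            for name, field_type in field_types.items()
        ]
    }


payload = build_add_field_payload(FIELD_TYPES)
print(json.dumps(payload, indent=2))
```

The printed JSON can then be POSTed to `SCHEMA_URL` with the header `Content-Type: application/json`. If a field already exists in the core, Solr rejects `add-field`; `replace-field` is the counterpart for that case.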
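1.3 Configuring the IK analyzer in Solr
See the post "Using the IK analyzer with Solr" on 鴨梨的藥丸哥's CSDN blog.
1.4 Adjusting the Nutch metatags plugin configuration
Because we added the custom parse-blog plugin, the parse step produces quite a lot of Metadata. We can adjust the configuration of the index-metadata plugin (the index.parse.md property) so that index fields are generated from the parsed data.
Note: don't forget to add parse-blog to plugin.includes as well.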
<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords,metatag.good,metatag.collections,metatag.icon,pulishedTime,webName</value>
  <description>Comma-separated list of keys to be taken from the parse metadata to generate fields.
  Can be used e.g. for 'description' or 'keywords' provided that these values are generated
  by a parser (see parse-metatags plugin)</description>
</property>
2. Testing whether the custom plugin works
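Running ParserChecker
Before running the full crawl, first check that the configuration is correct and that the custom plugin actually takes effect.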
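Part of that check can be done statically before starting any Nutch tool. As a minimal sketch, the script below verifies that every `source` in the rename mapping of index-writers.xml is also listed in `index.parse.md`, on the assumption that the indexing filter emits each listed key under its own name. The two snippets are trimmed copies of the configuration from section 1; `unmapped_sources` is a hypothetical helper, not part of Nutch.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the configuration from section 1. In practice, read
# conf/index-writers.xml and conf/nutch-site.xml instead.
RENAME_XML = """
<rename>
  <field source="metatag.description" dest="description"/>
  <field source="metatag.keywords" dest="keywords"/>
  <field source="metatag.icon" dest="icon"/>
  <field source="metatag.good" dest="good"/>
  <field source="metatag.collections" dest="collections"/>
</rename>
"""
INDEX_PARSE_MD = (
    "metatag.description,metatag.keywords,metatag.good,"
    "metatag.collections,metatag.icon,pulishedTime,webName"
)


def unmapped_sources(rename_xml, index_parse_md):
    """Return rename sources that are not listed in index.parse.md."""
    exported = {key.strip() for key in index_parse_md.split(",")}
    sources = [f.get("source") for f in ET.fromstring(rename_xml).iter("field")]
    return [source for source in sources if source not in exported]


missing = unmapped_sources(RENAME_XML, INDEX_PARSE_MD)
print("rename sources missing from index.parse.md:", missing)
```

An empty list means the rename rules and the exported metadata keys agree; any name that shows up would be renamed in the mapping but never produced by the indexing step.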
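2.1 Creating a run configuration in IDEA
From the main menu choose Run -> Edit Configurations, click the + button and create an Application configuration:
- Name: ParserChecker
- Main class: org.apache.nutch.parse.ParserChecker
- Program arguments: https://blog.csdn.net/qq_43203949/article/details/122626378
Note: Program arguments takes the same arguments you would pass to the script that ships with Nutch.
2.2 Equivalent command
Running this configuration is equivalent to running the nutch command in the bin/ directory of the Nutch binary distribution.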
./nutch parsechecker https://blog.csdn.net/qq_43203949/article/details/122626378
2.3 Analyzing the parse output
Because I use other plugins in addition to the custom parse-blog plugin, the parsed Metadata contains quite a few entries.
Parse Metadata:
OriginalCharEncoding = utf-8
renderer = webkit
keywords = Nutch開發(一)
metatag.good = 0
force-rendering = webkit
metatag.description = Nutch開發和使用教程
metatag.icon = https://g.csdnimg.cn/static/logo/favicon32.ico
applicable-device = pc
shenma-site-verification = 5a59773ab8077d4a62bf469ab966a63b_1497598848
description = Nutch開發和使用教程
csdn-baidu-search = {"autorun":true,"install":true,"keyword":"Nutch開發(一)"}
metatag.collections = 0
CharEncodingForConversion = utf-8
referrer = always
viewport = width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no
webName = csdn
report = {"pid": "blog", "spm":"1001.2101"}
metatag.keywords = Nutch開發(一)
pulishedTime = 2022-01-21
Running IndexChecker
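2.4 Creating a run configuration in IDEA
From the main menu choose Run -> Edit Configurations, click the + button and create an Application configuration:
- Name: IndexChecker
- Main class: org.apache.nutch.indexer.IndexingFiltersChecker
- Program arguments: https://blog.csdn.net/qq_43203949/article/details/122626378
Note: Program arguments takes the same arguments you would pass to the script that ships with Nutch.
2.5 Equivalent command
Running this configuration is equivalent to running the nutch command in the bin/ directory of the Nutch binary distribution.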
./nutch indexchecker https://blog.csdn.net/qq_43203949/article/details/122626378
2.6 Analyzing the indexing filter output
parsing: https://blog.csdn.net/qq_43203949/article/details/122626378
contentType: text/html
metatag.good : 0
metatag.description : Nutch開發和使用教程
metatag.icon : https://g.csdnimg.cn/static/logo/favicon32.ico
title : Nutch開發(一)_鴨梨的藥丸哥的博客-CSDN博客
metatag.collections : 0
url : https://blog.csdn.net/qq_43203949/article/details/122626378
content : Nutch開發(一)_鴨梨的藥丸哥的博客-CSDN博客 Nutch開發(一) 鴨梨的藥丸哥 于 2022-01-21 17:47:03 發布 533 收藏 分類專欄: 搜索技術 文章標簽: intel
tstamp : Thu Feb 17 17:13:53 CST 2022
webName : csdn
digest : bc16673003db2f6216d16bdaf092ee61
host : blog.csdn.net
id : https://blog.csdn.net/qq_43203949/article/details/122626378
metatag.keywords : Nutch開發(一)
pulishedTime : 2022-01-21
3. Adjusting the fetch delay
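If you start crawling right away, you will find that fetching is still very slow even with many fetcher threads. The reason is that Nutch delays successive requests to pages of the same domain: by default, after fetching one page it waits 5 s before fetching the next resource from that domain.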
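To make the interaction of the fetcher properties in this section concrete, here is a rough sketch of how they combine according to their descriptions. It models the documented rules only and is not Nutch's actual fetcher code; `effective_delay` and `pages_per_minute` are hypothetical helper names, and the defaults mirror the property values shown in this section.

```python
def effective_delay(threads_per_queue=1, server_delay=5.0, server_min_delay=0.0,
                    robots_crawl_delay=None, max_crawl_delay=30.0,
                    min_crawl_delay=None):
    """Seconds between requests to one server, or None if the page is skipped."""
    if min_crawl_delay is None:
        # fetcher.min.crawl.delay defaults to ${fetcher.server.delay}
        min_crawl_delay = server_delay
    if threads_per_queue > 1:
        # Host blocking is off: robots.txt Crawl-Delay is ignored and
        # fetcher.server.min.delay replaces fetcher.server.delay.
        return server_min_delay
    if robots_crawl_delay is None:
        return server_delay
    if max_crawl_delay != -1 and robots_crawl_delay > max_crawl_delay:
        return None  # delay demanded by robots.txt is too long: skip the page
    # robots.txt may slow the crawler down but not speed it up
    return max(robots_crawl_delay, min_crawl_delay)


def pages_per_minute(delay_seconds):
    """Upper bound for one server with one thread per queue, ignoring download time."""
    return float("inf") if delay_seconds <= 0 else 60.0 / delay_seconds


default_delay = effective_delay()
print(default_delay, pages_per_minute(default_delay))
```

With the defaults, a single site yields at most 12 pages per minute no matter how many fetcher threads run, which is why a crawl of one domain feels slow. Lowering the delay speeds things up at the cost of putting more load on the target server.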
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <!-- Per-server fetch delay; only effective when fetcher.threads.per.queue=1, and may be overridden by the Crawl-Delay in robots.txt -->
  <!-- Note: this is the delay between requests to the same server -->
  <description>The number of seconds the fetcher will delay between
  successive requests to the same server. Note that this might get
  overridden by a Crawl-Delay from a robots.txt and is used ONLY if
  fetcher.threads.per.queue is set to 1.</description>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>0.0</value>
  <!-- Per-server fetch delay used instead of fetcher.server.delay when fetcher.threads.per.queue is greater than 1 -->
  <description>The minimum number of seconds the fetcher will delay between
  successive requests to the same server. This value is applicable ONLY
  if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking
  is turned off).</description>
</property>
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <!-- Maximum crawl delay: if robots.txt asks for a delay greater than this value, the page is skipped -->
  <description>If the Crawl-Delay in robots.txt is set to greater than this value (in
  seconds) then the fetcher will skip this page, generating an error report.
  If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.</description>
</property>
<property>
  <name>fetcher.min.crawl.delay</name>
  <value>${fetcher.server.delay}</value>
  <!-- Minimum crawl delay: if robots.txt asks for a delay smaller than this value, this value is used instead -->
  <description>Minimum Crawl-Delay (in seconds) accepted in robots.txt, even if the
  robots.txt specifies a shorter delay. By default the minimum Crawl-Delay
  is set to the value of `fetcher.server.delay` which guarantees that
  a value set in the robots.txt cannot make the crawler more aggressive
  than the default configuration.</description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <!-- Number of fetcher threads -->
  <description>The number of FetcherThreads the fetcher should use.
  This also determines the maximum number of requests that are
  made at once (each FetcherThread handles one connection). The total
  number of threads running in distributed mode will be the number of
  fetcher threads * number of nodes as fetcher has one map task per node.</description>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <!-- Maximum number of threads allowed to access one queue at a time; when greater than 1, the Crawl-Delay from robots.txt is ignored and fetcher.server.min.delay is used as the fetch delay -->
  <description>This number is the maximum number of threads that
  should be allowed to access a queue at one time. Setting it to
  a value > 1 will cause the Crawl-Delay value from robots.txt to
  be ignored and the value of fetcher.server.min.delay to be used
  as a delay between successive requests to the same server instead
  of fetcher.server.delay.</description>
</property>
4. Viewing the crawl results in Solr
After the complete crawl and index process has run, the documents that the indexer-solr plugin created can be viewed in Solr's nutch core.
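Besides the Solr admin UI, the documents can be inspected through the core's standard `/select` handler. The following sketch only builds the query URL and does not send a request. The core URL is the one from index-writers.xml; the field list is an assumption based on the fields configured in section 1.2, and `build_select_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Core URL as configured in index-writers.xml.
SOLR_CORE_URL = "http://localhost:8983/solr/nutch"


def build_select_url(core_url, query="*:*",
                     fields=("id", "title", "keywords", "good", "pulishedTime"),
                     rows=5):
    """Build a /select URL that returns a few indexed documents as JSON."""
    params = {"q": query, "fl": ",".join(fields), "rows": rows, "wt": "json"}
    return f"{core_url}/select?{urlencode(params)}"


url = build_select_url(SOLR_CORE_URL)
print(url)
```

Opening the printed URL in a browser, or fetching it with curl, should list the first few crawled pages once indexing has finished.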