Nutch Development (Part 6)

Published: 2024/9/19


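Contents

- Nutch Development (Part 6)
  - 1. Integrating Nutch 1.18 with Solr 8.11.0
    - 1.1 Configuring index-writers.xml
    - 1.2 Configuring the Solr core fields
    - 1.3 Configuring the IK analyzer in Solr
    - 1.4 Adjusting the Nutch metatags plugin configuration
  - 2. Testing whether the custom plugin works
    - Running ParserChecker
      - 2.1 Creating the run configuration in IDEA
      - 2.2 Equivalent command line
      - 2.3 Analyzing the parse result
    - Running IndexChecker
      - 2.4 Creating the run configuration in IDEA
      - 2.5 Equivalent command line
      - 2.6 Analyzing the index filter result
  - 3. Changing the fetch delay
  - 4. Viewing the crawl results in Solr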

1. Integrating Nutch 1.18 with Solr 8.11.0

1.1 Configuring index-writers.xml

The conf directory of Nutch 1.18 contains an index-writers.xml file. Its settings are passed to the indexer-solr plugin.
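To get a feel for what this file carries, here is a minimal Python sketch that parses an abbreviated index-writers.xml fragment and lists the parameters of the Solr writer. The fragment is modeled on the layout of the file shipped with Nutch 1.x, but it is heavily shortened, and the values (the localhost URL, the core name nutch, the commitSize) are assumptions; compare them with your own conf/index-writers.xml.

```python
import xml.etree.ElementTree as ET

# Abbreviated, illustrative fragment; the URL and commitSize are assumptions.
SAMPLE = """
<writers>
  <writer id="indexer_solr_1"
          class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
    <parameters>
      <param name="type" value="http"/>
      <param name="url" value="http://localhost:8983/solr/nutch"/>
      <param name="commitSize" value="1000"/>
    </parameters>
  </writer>
</writers>
"""


def local_name(tag):
    """Strip an XML namespace prefix, e.g. '{ns}writer' -> 'writer'."""
    return tag.rsplit("}", 1)[-1]


def writer_params(xml_text):
    """Map each writer id to a dict of its <param name=... value=.../> pairs."""
    root = ET.fromstring(xml_text)
    result = {}
    for elem in root.iter():
        if local_name(elem.tag) != "writer":
            continue
        result[elem.get("id")] = {
            p.get("name"): p.get("value")
            for p in elem.iter()
            if local_name(p.tag) == "param"
        }
    return result


if __name__ == "__main__":
    for writer_id, params in writer_params(SAMPLE).items():
        print(writer_id, "->", params.get("url"))
```

The url parameter is the one to check first: it has to point at the Solr core that should receive the documents.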

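1.2 Configuring the Solr core fields

I configured the following fields in the Solr core:

good: number of likes
keywords: article keywords
pulishedTime: article publication time
title: title
description: article description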
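One possible way to create such fields is Solr's Schema API, which accepts an add-field command as JSON. The sketch below only builds the request body. The field names come from the list above; the field types are my assumptions (text_ik is a hypothetical name for a field type backed by the IK analyzer, and pulishedTime is kept as a plain string because the parsed value looks like 2022-01-21). It also assumes the core uses a managed schema; fields that already exist would need replace-field instead.

```python
import json

# Field names come from the list above; the types are assumptions.
# "text_ik" is a hypothetical field type name for the IK analyzer.
FIELDS = {
    "good": "pint",
    "keywords": "text_ik",
    "pulishedTime": "string",
    "title": "text_ik",
    "description": "text_ik",
}


def add_field_payload(fields):
    """Build the JSON body for a Schema API 'add-field' request."""
    return {
        "add-field": [
            {"name": name, "type": ftype, "indexed": True, "stored": True}
            for name, ftype in fields.items()
        ]
    }


if __name__ == "__main__":
    # The body would be POSTed to <solr-base>/solr/<core>/schema.
    print(json.dumps(add_field_payload(FIELDS), indent=2))
```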

1.3 Configuring the IK analyzer in Solr

See the post "solr 使用IK分詞器" (Using the IK analyzer with Solr) on 鴨梨的藥丸哥's CSDN blog.

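1.4 Adjusting the Nutch metatags plugin configuration

Because we added the custom parse-blog plugin, the parser now produces quite a lot of Metadata. To turn the parsed values into index fields, list their keys in the index.parse.md property, which is read by the index-metadata plugin.

Note: don't forget to add parse-blog to plugin.includes as well.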

<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords,metatag.good,metatag.collections,metatag.icon,pulishedTime,webName</value>
  <description>Comma-separated list of keys to be taken from the parse metadata to generate fields.
  Can be used e.g. for 'description' or 'keywords' provided that these values are generated
  by a parser (see parse-metatags plugin)</description>
</property>
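The effect of this property can be pictured with a small Python model: only the parse metadata keys named in the comma-separated list become index fields, and everything else is dropped. This is a simplified sketch of the idea, not the plugin's actual code; the function name select_fields and the shortened sample metadata are made up for illustration.

```python
# Value of index.parse.md as configured above.
INDEX_PARSE_MD = (
    "metatag.description,metatag.keywords,metatag.good,"
    "metatag.collections,metatag.icon,pulishedTime,webName"
)

# A few sample parse metadata entries (shortened for illustration).
PARSE_METADATA = {
    "metatag.description": "Nutch開發和使用教程",
    "metatag.good": "0",
    "pulishedTime": "2022-01-21",
    "webName": "csdn",
    "viewport": "width=device-width, initial-scale=1.0",
    "renderer": "webkit",
}


def select_fields(parse_md, key_list):
    """Keep only the metadata entries whose key is named in key_list."""
    keys = [k.strip() for k in key_list.split(",") if k.strip()]
    return {k: parse_md[k] for k in keys if k in parse_md}


if __name__ == "__main__":
    for name, value in select_fields(PARSE_METADATA, INDEX_PARSE_MD).items():
        print(name, ":", value)
```

Keys such as viewport and renderer are present in the parse metadata but not in the list, so they never reach the index.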

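2. Testing whether the custom plugin works

Running ParserChecker

Before running the whole crawl, first check that the configuration is correct and that the custom plugin actually takes effect.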

2.1 Creating the run configuration in IDEA

From the main menu choose Run -> Edit Configurations, click the + button, and create an Application configuration:

Name: ParserChecker
Main Class: org.apache.nutch.parse.ParserChecker
Program arguments: https://blog.csdn.net/qq_43203949/article/details/122626378

Note: Program arguments takes the same arguments you would pass to the script shipped with Nutch.

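2.2 Equivalent command line

Running this configuration is equivalent to using the nutch command in the bin/ directory of the Nutch binary distribution:

./nutch parsechecker https://blog.csdn.net/qq_43203949/article/details/122626378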

2.3 Analyzing the parse result

Besides the custom parse-blog plugin I also enabled other plugins, so the parsed Metadata contains quite a few entries.

Parse Metadata:
OriginalCharEncoding = utf-8
renderer = webkit
keywords = Nutch開發(一)
metatag.good = 0
force-rendering = webkit
metatag.description = Nutch開發和使用教程
metatag.icon = https://g.csdnimg.cn/static/logo/favicon32.ico
applicable-device = pc
shenma-site-verification = 5a59773ab8077d4a62bf469ab966a63b_1497598848
description = Nutch開發和使用教程
csdn-baidu-search = {"autorun":true,"install":true,"keyword":"Nutch開發(一)"}
metatag.collections = 0
CharEncodingForConversion = utf-8
referrer = always
viewport = width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no
webName = csdn
report = {"pid": "blog", "spm":"1001.2101"}
metatag.keywords = Nutch開發(一)
pulishedTime = 2022-01-21

Running IndexChecker

2.4 Creating the run configuration in IDEA

From the main menu choose Run -> Edit Configurations, click the + button, and create an Application configuration:

Name: IndexChecker
Main Class: org.apache.nutch.indexer.IndexingFiltersChecker
Program arguments: https://blog.csdn.net/qq_43203949/article/details/122626378

Note: Program arguments takes the same arguments you would pass to the script shipped with Nutch.

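2.5 Equivalent command line

Running this configuration is equivalent to using the nutch command in the bin/ directory of the Nutch binary distribution:

./nutch indexchecker https://blog.csdn.net/qq_43203949/article/details/122626378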

2.6 Analyzing the index filter result

parsing: https://blog.csdn.net/qq_43203949/article/details/122626378
contentType: text/html
metatag.good : 0
metatag.description : Nutch開發和使用教程
metatag.icon : https://g.csdnimg.cn/static/logo/favicon32.ico
title : Nutch開發(一)_鴨梨的藥丸哥的博客-CSDN博客
metatag.collections : 0
url : https://blog.csdn.net/qq_43203949/article/details/122626378
content : Nutch開發(一)_鴨梨的藥丸哥的博客-CSDN博客 Nutch開發(一) 鴨梨的藥丸哥于 2022-01-21 17:47:03 發布 533 收藏分類專欄：搜索技術文章標簽： intel
tstamp : Thu Feb 17 17:13:53 CST 2022
webName : csdn
digest : bc16673003db2f6216d16bdaf092ee61
host : blog.csdn.net
id : https://blog.csdn.net/qq_43203949/article/details/122626378
metatag.keywords : Nutch開發(一)
pulishedTime : 2022-01-21

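3. Changing the fetch delay

If you start crawling right away, you will find that fetching is still very slow even with many fetcher threads. The reason is that Nutch waits between successive requests to the same server: by default, after fetching one page it delays 5 s before fetching the next resource from that server. To speed things up, override fetcher.server.delay in conf/nutch-site.xml. The relevant properties and their defaults are: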

<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between
  successive requests to the same server. Note that this might get
  overridden by a Crawl-Delay from a robots.txt and is used ONLY if
  fetcher.threads.per.queue is set to 1.</description>
</property>

<property>
  <name>fetcher.server.min.delay</name>
  <value>0.0</value>
  <description>The minimum number of seconds the fetcher will delay between
  successive requests to the same server. This value is applicable ONLY
  if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking
  is turned off).</description>
</property>

<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>If the Crawl-Delay in robots.txt is set to greater than this value (in
  seconds) then the fetcher will skip this page, generating an error report.
  If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.</description>
</property>

<property>
  <name>fetcher.min.crawl.delay</name>
  <value>${fetcher.server.delay}</value>
  <description>Minimum Crawl-Delay (in seconds) accepted in robots.txt, even if the
  robots.txt specifies a shorter delay. By default the minimum Crawl-Delay
  is set to the value of `fetcher.server.delay` which guarantees that
  a value set in the robots.txt cannot make the crawler more aggressive
  than the default configuration.</description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>The number of FetcherThreads the fetcher should use.
  This is also determines the maximum number of requests that are
  made at once (each FetcherThread handles one connection). The total
  number of threads running in distributed mode will be the number of
  fetcher threads * number of nodes as fetcher has one map task per node.</description>
</property>

<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <description>This number is the maximum number of threads that
  should be allowed to access a queue at one time. Setting it to
  a value > 1 will cause the Crawl-Delay value from robots.txt to
  be ignored and the value of fetcher.server.min.delay to be used
  as a delay between successive requests to the same server instead
  of fetcher.server.delay.</description>
</property>
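A quick back-of-the-envelope calculation shows why extra threads do not help when all pages live on one server. The Python sketch below is only a rough model under my own assumptions: fetcher.threads.per.queue is 1, robots.txt sets no Crawl-Delay, and download time is ignored, so the result is a lower bound. The function name min_fetch_seconds is made up for this example.

```python
def min_fetch_seconds(pages, delay):
    """Lower bound for fetching `pages` URLs from a single server.

    Models one request at a time with `delay` seconds between
    successive requests; download time itself is ignored.
    """
    if pages <= 0:
        return 0.0
    return (pages - 1) * delay


if __name__ == "__main__":
    for delay in (5.0, 1.0):
        seconds = min_fetch_seconds(1000, delay)
        print(f"delay={delay}s -> at least {seconds / 60:.1f} min for 1000 pages")
```

The thread count does not appear in the formula at all: with a single queue per server, only the delay determines the pace, which is why lowering fetcher.server.delay has a much larger effect than raising fetcher.threads.fetch.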

4. Viewing the crawl results in Solr

Once the full crawl and indexing run has finished, the documents that the indexer-solr plugin created can be viewed in Solr's nutch core.
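Besides the Solr admin UI, the documents can also be checked through the core's select handler. The following Python sketch builds such a query URL and, optionally, fetches it. The core name nutch comes from the text above; the base URL http://localhost:8983 is Solr's usual default and is an assumption here, as are the helper names solr_select_url and fetch_docs.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def solr_select_url(base_url, core, query="*:*", rows=10, fields=None):
    """Build a URL for the /select handler of a Solr core."""
    params = {"q": query, "rows": rows}
    if fields:
        params["fl"] = ",".join(fields)
    return f"{base_url.rstrip('/')}/solr/{core}/select?{urlencode(params)}"


def fetch_docs(url):
    """Run the query and return the list of matching documents."""
    with urlopen(url) as response:
        return json.load(response)["response"]["docs"]


if __name__ == "__main__":
    # Assumed base URL; adjust it to your Solr installation.
    url = solr_select_url(
        "http://localhost:8983", "nutch",
        query="host:blog.csdn.net", fields=["title", "url"],
    )
    print(url)
    # for doc in fetch_docs(url):   # needs a running Solr instance
    #     print(doc)
```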
