Getting Started with Nutch
http://runtool.blog.163.com/blog/static/183144445201251625612309/
References:
1. http://blog.csdn.net/forwen/article/details/4804733
2. SequenceFile introduction: http://blog.163.com/jiayouweijiewj@126/blog/static/17123217720101121103928847/
3. http://blog.163.com/bit_runner/blog/static/53242218201141393943980/
4. http://blog.163.com/jiayouweijiewj@126/blog/static/171232177201011475716354/
1. What is Nutch?
Nutch is an open-source web crawler. It collects web page data, analyzes it, builds an index, and provides interfaces for querying the collected pages. Under the hood it uses Hadoop for distributed computation and storage. Indexing is handled by Solr, an open-source full-text indexing framework, which Nutch has integrated since version 1.3.
2. Where can I download the latest Nutch?
The latest Nutch 1.3 binary package and source code can be downloaded from:
http://mirror.bjtu.edu.cn/apache//nutch/
3. How do I configure Nutch?
   3.1 Unpack the downloaded archive, then cd $HOME/nutch-1.3/runtime/local
   3.2 Make the bin/nutch script executable: chmod +x bin/nutch
   3.3 Set JAVA_HOME to your JDK installation directory (not to $PATH), for example export JAVA_HOME=/usr/lib/jvm/java-6-sun, adjusted to wherever your JDK actually lives
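These three steps can be run in sequence from a shell. A minimal sketch, assuming the archive was unpacked under $HOME and that the JDK lives somewhere under /usr/lib/jvm (both paths are assumptions, adjust them to your machine):

    cd $HOME/nutch-1.3/runtime/local
    chmod +x bin/nutch
    export JAVA_HOME=/usr/lib/jvm/java-6-sun    # assumption: point this at your actual JDK directory
    bin/nutch                                   # with no arguments it should print the list of available commands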
4. What do I need to prepare before crawling?
4.1 Configure the http.agent.name property, as follows:
<property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
</property>
4.2 Create a directory for seed URLs: mkdir -p urls
   In that directory create a url file and write a few URLs into it, e.g.
http://nutch.apache.org/
4.3 Then run the following command:
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
Note that this does not build an index. If you want to index the crawled data, run:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
5. What does the Nutch crawl workflow look like?
5.1 Initialize the crawlDb and inject the seed URLs
bin/nutch inject
Usage: Injector <crawldb> <url_dir>
Running this command locally produced the following output:
lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch inject db/crawldb urls/
        Injector: starting at 2011-08-22 10:50:01
        Injector: crawlDb: db/crawldb
        Injector: urlDir: urls
        Injector: Converting injected urls to crawl db entries.
        Injector: Merging injected urls into crawl db.
        Injector: finished at 2011-08-22 10:50:05, elapsed: 00:00:03
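Once the seed URLs have been injected, the crawldb can be checked with Nutch's crawldb reader. A minimal sketch, reusing the db/crawldb path from the run above:

    bin/nutch readdb db/crawldb -stats    # prints the number of URLs grouped by fetch status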
5.2 Generate a new list of URLs to fetch
bin/nutch generate
Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
Local output:
lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch generate db/crawldb/ db/segments
        Generator: starting at 2011-08-22 10:52:41
        Generator: Selecting best-scoring urls due for fetch.
        Generator: filtering: true
        Generator: normalizing: true
        Generator: jobtracker is 'local', generating exactly one partition.
        Generator: Partitioning selected urls for politeness.
        Generator: segment: db/segments/20110822105243   // a new segment is created here
        Generator: finished at 2011-08-22 10:52:44, elapsed: 00:00:03
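Because the segment name is a timestamp, in scripts it is convenient to capture the newest segment directory in a shell variable instead of copying it by hand. A minimal sketch, assuming the db/segments layout used above:

    SEGMENT=`ls -d db/segments/2* | tail -1`    # newest segment directory
    echo $SEGMENT                               # e.g. db/segments/20110822105243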
5.3 Fetch the URLs generated above
bin/nutch fetch
Usage: Fetcher <segment> [-threads n] [-noParsing]
Local output:
lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch fetch db/segments/20110822105243/
        Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
        Fetcher: starting at 2011-08-22 10:56:07
        Fetcher: segment: db/segments/20110822105243
        Fetcher: threads: 10
        QueueFeeder finished: total 1 records + hit by time limit :0
        fetching http://www.baidu.com/
        -finishing thread FetcherThread, activeThreads=1
        -finishing thread FetcherThread, activeThreads=
        -finishing thread FetcherThread, activeThreads=1
        -finishing thread FetcherThread, activeThreads=1
        -finishing thread FetcherThread, activeThreads=0
        -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
        -activeThreads=0
        Fetcher: finished at 2011-08-22 10:56:09, elapsed: 00:00:02
Let's look at the segment directory structure at this point:
lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ ls db/segments/20110822105243/
content  crawl_fetch  crawl_generate
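The raw fetched content can be examined with the segment reader. A minimal sketch, where segdump is a hypothetical output directory name of your choosing:

    bin/nutch readseg -dump db/segments/20110822105243 segdump    # writes a plain-text dump of the segment
    less segdump/dump                                             # the dump file name may differ by Nutch version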
5.4 Parse the fetched content
bin/nutch parse
Usage: ParseSegment segment
Local output:
lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch parse db/segments/20110822105243/
ParseSegment: starting at 2011-08-22 10:58:19
ParseSegment: segment: db/segments/20110822105243
ParseSegment: finished at 2011-08-22 10:58:22, elapsed: 00:00:02
Now look at the segment directory again after parsing:
lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ ls db/segments/20110822105243/
content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text
Three new directories produced by parsing have appeared: crawl_parse, parse_data, and parse_text.
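The segment reader can also report how many records each part of the segment holds. A minimal sketch, reusing the segment from above (the exact columns printed may vary by Nutch version):

    bin/nutch readseg -list db/segments/20110822105243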
5.5 Update the link database (crawldb) with the newly discovered links
bin/nutch updatedb
Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]
Local output:
lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch updatedb db/crawldb/ -dir db/segments/
CrawlDb update: starting at 2011-08-22 11:00:09
CrawlDb update: db: db/crawldb
CrawlDb update: segments: [file:/home/lemo/Workspace/java/Apache/Nutch/nutch-1.3/db/segments/20110822105243]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-08-22 11:00:10, elapsed: 00:00:01
This step merges the newly discovered links into the crawldb, which here is stored on the local filesystem. By comparison, Taobao's crawler keeps its link database in Redis, a key-value NoSQL store.
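To see the merged entries, the crawldb can be dumped to plain text. A minimal sketch, where crawldb_dump is a hypothetical output directory name:

    bin/nutch readdb db/crawldb -dump crawldb_dump    # writes the crawldb entries as text
    cat crawldb_dump/part-*                           # output files follow the usual Hadoop part-* naming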
5.6 Compute the inverted links
bin/nutch invertlinks
Usage: LinkDb <linkdb> (-dir <segmentsDir> | <seg1> <seg2> ...) [-force] [-noNormalize] [-noFilter]
Local output:
lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch invertlinks db/linkdb -dir db/segments/
LinkDb: starting at 2011-08-22 11:02:49
LinkDb: linkdb: db/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/lemo/Workspace/java/Apache/Nutch/nutch-1.3/db/segments/20110822105243
LinkDb: finished at 2011-08-22 11:02:50, elapsed: 00:00:01
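The linkdb now records, for each URL, its incoming links and their anchor text; it can be dumped in the same way as the crawldb. A minimal sketch, where linkdb_dump is a hypothetical output directory name:

    bin/nutch readlinkdb db/linkdb -dump linkdb_dump
    cat linkdb_dump/part-*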
5.7 Index the crawled content with Solr
bin/nutch solrindex
Usage: SolrIndexer <solr url> <crawldb> <linkdb> (<segment> ... | -dir <segments>)
Output on the Nutch side:
lemo@debian:~/Workspace/java/Apache/Nutch/nutch-1.3$ bin/nutch solrindex http://127.0.0.1:8983/solr/ db/crawldb/ db/linkdb/ db/segments/*
SolrIndexer: starting at 2011-08-22 11:05:33
SolrIndexer: finished at 2011-08-22 11:05:35, elapsed: 00:00:02
Part of the output on the Solr side:
INFO: SolrDeletionPolicy.onInit: commits:num=1
       commit{dir=/home/lemo/Workspace/java/Apache/Solr/apache-solr-3.3.0/example/solr/data/index,segFN=segments_1,version=1314024228223,generation=1,filenames=[segments_1]
Aug 22, 2011 11:05:35 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1314024228223
Aug 22, 2011 11:05:35 AM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {add=[http://www.baidu.com/]} 0 183
Aug 22, 2011 11:05:35 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=0 QTime=183
Aug 22, 2011 11:05:35 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
5.8 Query from the Solr client
Enter the following in a browser:
http://localhost:8983/solr/admin/
Use "baidu" as the query term.
The XML response returned is shown below.
If you also want the raw page content returned in the results, change the content field in Solr's schema.xml so that it is stored, like this:
<field name="content" type="text" stored="true" indexed="true"/>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">baidu</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <float name="boost">1.0660036</float>
      <str name="digest">7be5cfd6da4a058001300b21d7d96b0f</str>
      <str name="id">http://www.baidu.com/</str>
      <str name="segment">20110822105243</str>
      <str name="title">百度一下,你就知道</str>
      <date name="tstamp">2011-08-22T14:56:09.194Z</date>
      <str name="url">http://www.baidu.com/</str>
    </doc>
  </result>
</response>
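The same query can also be sent straight to Solr's select handler instead of going through the admin page. A minimal sketch, assuming the default example server on port 8983:

    curl "http://localhost:8983/solr/select?q=baidu&start=0&rows=10&indent=on&wt=xml"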
6. References
http://wiki.apache.org/nutch/RunningNutchAndSolr
Author: http://blog.csdn.net/amuseme_lu
=====================
http://haomou.net/?p=1212
Getting Started with Nutch 1.6: Installation and Configuration (Integrated with Solr)
Test environment: Kubuntu 12.04, JDK 1.7.0_15, Nutch 1.6, Solr 3.6.2
Introduction
Apache Nutch is an open-source web crawler written in Java. With it we can automatically discover hyperlinks and save a lot of maintenance work, for example detecting broken links or copying crawled sites locally. Solr is an open-source full-text search framework; with it we can search the pages Nutch has crawled. Integrating Nutch with Solr is quite simple.
Apache Nutch supports out-of-the-box integration with Solr. Nutch no longer runs its old web application in Tomcat, and no longer uses Lucene directly for search.
Steps:
1. Install Nutch (binary release)
First download the binary package (apache-nutch-1.6-bin.zip) from the official site and unpack it. This produces the apache-nutch-1.6 folder; enter it with cd apache-nutch-1.6. From now on ${NUTCH_RUNTIME_HOME} refers to apache-nutch-1.6.
2. Verify the installation
Run bin/nutch. If you see the following text, the installation is correct:
Usage: nutch [-core] COMMAND
If you see "permission denied", the script is not executable; add the execute permission with chmod +x bin/nutch.
If you see "JAVA_HOME not set", either no JDK is installed or JAVA_HOME has not been set. Installing a JDK is straightforward and not covered here.
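A minimal sketch of fixing both problems, assuming a JDK installed somewhere under /usr/lib/jvm (the exact directory name is an assumption and varies by system):

    chmod +x bin/nutch
    export JAVA_HOME=/usr/lib/jvm/java-7-oracle    # assumption: replace with your actual JDK directory
    bin/nutch                                      # should now print: Usage: nutch [-core] COMMAND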
3. Crawl your first website
Add your agent name in conf/nutch-site.xml:
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
-------------------------------------------- Example --------------------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>oscar</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
</configuration>
--------------------------------------------------------------------------------------------------
Create a urls folder with mkdir -p urls, enter it, and create a text file seed.txt (touch seed.txt). In that file write
http://nutch.apache.org/
which is the site we want to crawl. Then edit conf/regex-urlfilter.txt and replace
# accept anything else
+.
with
+^http://([a-z0-9]*\.)*nutch.apache.org/
so that only pages under the nutch.apache.org domain are crawled (these steps are summarized in the sketch below).
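A minimal sketch of the seed setup described above, run from ${NUTCH_RUNTIME_HOME}; the regex-urlfilter.txt change is still made by hand in an editor:

    mkdir -p urls
    echo "http://nutch.apache.org/" > urls/seed.txt
    # then edit conf/regex-urlfilter.txt and replace the final "+." line with:
    #   +^http://([a-z0-9]*\.)*nutch.apache.org/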
3.1 Using the crawl command
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
where:
urlDir is the directory containing the seed URLs
-solr <solrUrl> is the Solr address (leave it out if you have none)
-dir is the directory where the crawl data is stored
-threads is the number of fetch threads (default 10)
-depth is the crawl depth (default 5)
-topN is the breadth, i.e. the maximum number of pages fetched per level (default Long.MAX_VALUE)
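Several of these options can be combined in a single run. A minimal sketch, assuming a local Solr instance at http://localhost:8983/solr/ (the Solr URL is an assumption, not part of the original example):

    bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir crawl -threads 10 -depth 3 -topN 5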
After the command finishes, you will see that these directories have been created:
crawl/crawldb
crawl/linkdb
crawl/segments
4. Set up Solr for searching
Download the binary package from the Solr website and unpack it. This produces the apache-solr-3.6 folder; from now on ${APACHE_SOLR_HOME} refers to that directory. Enter ${APACHE_SOLR_HOME}/example and run: java -jar start.jar
5. Verify the installation
Open a browser and go to
http://localhost:8983/solr/admin/
http://localhost:8983/solr/admin/stats.jsp
If the admin pages load, the installation succeeded.
6. Integrate Nutch and Solr
Now both Solr and Nutch are installed, and Nutch has already crawled some data. Next we will use Solr to search the crawled pages.
Run the following command:
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
Restart Solr.
Run the Solr indexing command:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
This command sends the crawled data to Solr for indexing.
If everything went well, you can now search at http://localhost:8983/solr/admin/.
If you want to see the raw HTML content, change the content field in schema.xml:
<field name="content" type="text" stored="true" indexed="true"/>