[Nutch 2.2.1 Basic Tutorial, Part 2.1] Integrating Nutch/HBase/Solr to Build a Search Engine, Part 1: Installation and Running (Single-Machine Environment)
1. Download the required software and extract it
The versions used are:
(1) apache-nutch-2.2.1
(2) hbase-0.90.4
(3) solr-4.9.0
Extract all three into /usr/search
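The extraction step could be scripted roughly as follows. This is a minimal sketch: the function name extract_all and the archive file names in the usage comment are assumptions, so substitute the files you actually downloaded.

```shell
# Extract each given gzip-compressed tar archive into a destination directory.
# Usage: extract_all DEST ARCHIVE...
extract_all() {
    dest=$1
    shift
    mkdir -p "$dest" || return 1
    for archive in "$@"; do
        # -x extract, -z gunzip, -f archive file, -C target directory
        tar -xzf "$archive" -C "$dest" || return 1
    done
}

# Example call (archive names are assumptions, adjust to your downloads):
# extract_all /usr/search apache-nutch-2.2.1-src.tar.gz hbase-0.90.4.tar.gz solr-4.9.0.tgz
```

The function stops at the first archive that fails to extract, so a partial download is noticed before the later configuration steps.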
2. Configuring Nutch
(1) vi /usr/search/apache-nutch-2.2.1/conf/nutch-site.xml
Add the following property:
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
(2) vi /usr/search/apache-nutch-2.2.1/ivy/ivy.xml
By default the following statement is commented out. Remove the comment markers so that it takes effect:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
(3) vi /usr/search/apache-nutch-2.2.1/conf/gora.properties
Add the following line:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
The three steps above tell Nutch to use HBase for storage.
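A quick way to confirm the three settings before building is to grep for them. The sketch below is only a sanity check under stated assumptions: the helper names has_line and check_hbase_backend are made up for illustration, and the ivy.xml test only recognises a dependency that starts its own line, so it cannot tell whether that line sits inside a multi-line comment.

```shell
# Report whether FILE contains a line matching the extended regex PATTERN.
# Usage: has_line FILE PATTERN LABEL
has_line() {
    if grep -Eq -- "$2" "$1" 2>/dev/null; then
        echo "OK      $3"
    else
        echo "MISSING $3"
        return 1
    fi
}

# Check the three HBase-related settings under a Nutch source directory.
# Returns non-zero if any of them is missing.
# Usage: check_hbase_backend NUTCH_HOME
check_hbase_backend() {
    nutch_home=$1
    rc=0
    has_line "$nutch_home/conf/nutch-site.xml" \
        '<value>org\.apache\.gora\.hbase\.store\.HBaseStore</value>' \
        'nutch-site.xml: storage.data.store.class' || rc=1
    # The line must begin with <dependency, so "<!-- <dependency ..." is rejected.
    has_line "$nutch_home/ivy/ivy.xml" \
        '^[[:space:]]*<dependency [^>]*name="gora-hbase"' \
        'ivy.xml: gora-hbase dependency' || rc=1
    # Anchored at line start so a "#"-commented entry does not count.
    has_line "$nutch_home/conf/gora.properties" \
        '^[[:space:]]*gora\.datastore\.default=org\.apache\.gora\.hbase\.store\.HBaseStore' \
        'gora.properties: gora.datastore.default' || rc=1
    return $rc
}

# Example call:
# check_hbase_backend /usr/search/apache-nutch-2.2.1
```

Each setting gets one OK or MISSING line, which makes it easy to see which of the three files still needs editing.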
The steps below are the ones actually required to build a basic Nutch installation.
(4) Build the runtime
cd /usr/search/apache-nutch-2.2.1/
ant runtime
(5) Verify that Nutch is installed correctly
[root@jediael44 apache-nutch-2.2.1]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[root@jediael44 bin]# ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
(6) vi /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml and add the agent name used by the crawl job:
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
(7) Create seed.txt containing the seed URL
cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
vi seed.txt
http://nutch.apache.org/
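(8) Modify the URL filter
vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt
Change
# accept anything else
+.
to
# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
Because the runtime was already built in step (4), either make the same change in runtime/local/conf/regex-urlfilter.txt or re-run ant runtime so that the change takes effect.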
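To see which URLs the new rule lets through, the pattern can be tried with grep -E, whose extended regular expressions behave the same as Java's for this particular pattern. This is a sketch, not Nutch's own filter code: it models only this one accept rule and ignores the skip rules earlier in regex-urlfilter.txt, and the sample URLs other than the seed are purely illustrative.

```shell
# Return success if the URL matches the accept rule from step (8).
# The leading "+" of the rule is Nutch's accept marker, not part of the regex.
accepts() {
    printf '%s\n' "$1" | grep -Eq '^http://([a-z0-9]*\.)*nutch.apache.org/'
}

# Try the rule against a few sample URLs (illustrative values only).
for url in \
    'http://nutch.apache.org/' \
    'http://wiki.nutch.apache.org/page' \
    'http://lucene.apache.org/' \
    'https://nutch.apache.org/'
do
    if accepts "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

The first two URLs are accepted and the last two are rejected. Note that https:// URLs do not match, because the rule begins with ^http://.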
3. Configuring HBase
(1) vi /usr/search/hbase-0.90.4/conf/hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value><Your path></value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value><Your path></value>
  </property>
</configuration>
Note: this step is optional. If you skip it, the defaults from hbase-default.xml (/usr/search/hbase-0.90.4/src/main/resources/hbase-default.xml) are used.
The default value is:
<property>
  <name>hbase.rootdir</name>
  <value>file:///tmp/hbase-${user.name}/hbase</value>
  <description>The directory shared by region servers and into
  which HBase persists. The URL should be 'fully-qualified'
  to include the filesystem scheme. For example, to specify the
  HDFS directory '/hbase' where the HDFS instance's namenode is
  running at namenode.example.org on port 9000, set this value to:
  hdfs://namenode.example.org:9000/hbase. By default HBase writes
  into /tmp. Change this configuration else all data will be lost
  on machine restart.
  </description>
</property>
In other words, data is stored under /tmp by default and may be lost when the machine restarts. It is still advisable to set these properties explicitly, especially the second one, for ZooKeeper; otherwise various problems can arise. The following places both directories on the local file system:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/jediael/hbaserootdir</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>file:///home/jediael/hbasezookeeperdataDir</value>
  </property>
</configuration>
Note: without the file:// prefix, the path is treated as hdfs:// by default. In version 0.90.4, however, the default is still the local file system.
ZooKeeper's dataDir is a plain local directory path rather than a URL, so a value such as /home/jediael/hbasezookeeperdataDir (without file://) is probably what is intended for the second property.
4. Configuring Solr
(1) Overwrite Solr's schema.xml with the one shipped with Nutch. (For Solr 4, schema-solr4.xml should be used.)
cp /usr/search/apache-nutch-2.2.1/conf/schema.xml /usr/search/solr-4.9.0/example/solr/collection1/conf/
(2) With Solr 3.6 the configuration is complete at this point, but with 4.9 the following changes are needed.
Edit the schema.xml file copied above:
Delete: <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
Add: <field name="_version_" type="long" indexed="true" stored="true"/>
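The two schema edits can also be applied non-interactively. The sketch below uses awk and assumes that the filter element and the opening <fields> tag each sit on their own line in the copied schema.xml; patch_schema is a hypothetical helper name, and the function is not idempotent, so run it once on a fresh copy.

```shell
# Print a patched copy of a Nutch schema.xml to stdout:
#  - drop every line that mentions solr.EnglishPorterFilterFactory
#  - add the _version_ field right after the opening <fields> tag
# Usage: patch_schema FILE
patch_schema() {
    awk '
        /solr\.EnglishPorterFilterFactory/ { next }
        { print }
        /<fields>/ {
            print "    <field name=\"_version_\" type=\"long\" indexed=\"true\" stored=\"true\"/>"
        }
    ' "$1"
}

# Example call, run inside /usr/search/solr-4.9.0/example/solr/collection1/conf/:
# patch_schema schema.xml > schema.xml.new && mv schema.xml.new schema.xml
```

Writing to a new file first and renaming it afterwards keeps the original intact if awk fails part-way through.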
5. Starting the crawl
(1) Start HBase
[root@jediael44 bin]# cd /usr/search/hbase-0.90.4/bin/
[root@jediael44 bin]# ./start-hbase.sh
(2) Start Solr
[root@jediael44 bin]# cd /usr/search/solr-4.9.0/example/
[root@jediael44 example]# java -jar start.jar
(3) Start Nutch and begin the crawl
[root@jediael44 example]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[root@jediael44 bin]# ./crawl seed.txt TestCrawl http://localhost:8983/solr 2
The four arguments are the seed location, the crawl ID, the Solr URL, and the number of crawl rounds.
That is all: the crawl job now starts running.
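Once the crawl has finished a round, one way to check that documents reached Solr is to ask for the hit count. The sketch below only builds the query URL; the helper name solr_count_url is hypothetical, and the collection1 core name is assumed from the Solr example directory used in section 4.

```shell
# Build a Solr select URL that matches all documents but returns no rows,
# so the response carries only the total hit count (numFound).
# Usage: solr_count_url CORE_BASE_URL
solr_count_url() {
    printf '%s/select?q=*:*&rows=0&wt=json\n' "$1"
}

# Example call (requires curl and a running Solr instance):
# curl -s "$(solr_count_url http://localhost:8983/solr/collection1)"
```

A numFound value greater than zero in the JSON response indicates that the indexing step delivered documents to Solr.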
For some analysis of the process above, see:
Integrating Nutch/HBase/Solr to Build a Search Engine, Part 2: Content Analysis
http://blog.csdn.net/jediael_lu/article/details/37738569
Reposted from: https://www.cnblogs.com/eaglegeek/p/4557894.html