Nutch 2.1 + MySQL + Solr 3.6.1: Crawling Chinese Websites

1. MySQL database configuration

The MySQL installation steps on Linux are omitted here. First open /etc/my.cnf. (If your MySQL version is 5.1, do not modify my.cnf; the settings below will prevent MySQL from starting.)

Under [mysqld], add:

innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci

Create the database and the table:

CREATE DATABASE nutch5 DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
use nutch5;
CREATE TABLE `webpage` (
  `id` varchar(767) CHARACTER SET latin1 NOT NULL,
  `headers` blob,
  `text` mediumtext DEFAULT NULL,
  `status` int(11) DEFAULT NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) DEFAULT NULL,
  `score` float DEFAULT NULL,
  `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `baseUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
  `content` mediumblob,
  `title` varchar(2048) DEFAULT NULL,
  `reprUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
  `fetchInterval` int(11) DEFAULT NULL,
  `prevFetchTime` bigint(20) DEFAULT NULL,
  `inlinks` mediumblob,
  `prevSignature` blob,
  `outlinks` mediumblob,
  `fetchTime` bigint(20) DEFAULT NULL,
  `retriesSinceFetch` int(11) DEFAULT NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
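
The latin1 declaration on the id column is probably related to InnoDB's index key limit: 767 bytes by default, and 3072 bytes when innodb_large_prefix is used with a DYNAMIC or COMPRESSED row format. The sketch below is only arithmetic, not part of the original setup; the helper name index_key_bytes is made up for illustration. It shows how many bytes a varchar(767) primary key would need under each character set.

```shell
#!/bin/sh
# Bytes needed to index a VARCHAR(n) column: n characters times the
# maximum bytes per character of its character set.
index_key_bytes() {
  # $1: declared length in characters, $2: max bytes per character
  echo $(( $1 * $2 ))
}

# InnoDB index key limits in bytes: 767 by default, 3072 with
# innodb_large_prefix on a DYNAMIC or COMPRESSED row format.
DEFAULT_LIMIT=767
LARGE_PREFIX_LIMIT=3072

for spec in "latin1 1" "utf8 3" "utf8mb4 4"; do
  set -- $spec   # $1 = charset name, $2 = max bytes per character
  bytes=$(index_key_bytes 767 "$2")
  if [ "$bytes" -le "$DEFAULT_LIMIT" ]; then
    verdict="fits the default limit"
  elif [ "$bytes" -le "$LARGE_PREFIX_LIMIT" ]; then
    verdict="needs innodb_large_prefix"
  else
    verdict="exceeds both limits"
  fi
  echo "varchar(767) $1: $bytes bytes, $verdict"
done
```

With latin1 the key stays at 767 bytes, so the primary key on id fits even without the large-prefix settings.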

Alternatively, restore the table from a database backup. First create the database:

CREATE DATABASE nutch2 DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;

Then load the following dump:

-- MySQL dump 10.13  Distrib 5.6.10, for Win64 (x86_64)
--
-- Host: yqxt    Database: nutch
-- ------------------------------------------------------
-- Server version 5.6.10

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;

--
-- Table structure for table `webpage`
--

DROP TABLE IF EXISTS `webpage`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `webpage` (
  `id` varchar(767) CHARACTER SET latin1 NOT NULL,
  `headers` blob,
  `text` mediumtext,
  `status` int(11) DEFAULT NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) DEFAULT NULL,
  `score` float DEFAULT NULL,
  `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `baseUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
  `content` mediumblob,
  `title` varchar(2048) DEFAULT NULL,
  `reprUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
  `fetchInterval` int(11) DEFAULT NULL,
  `prevFetchTime` bigint(20) DEFAULT NULL,
  `inlinks` mediumblob,
  `prevSignature` blob,
  `outlinks` mediumblob,
  `fetchTime` bigint(20) DEFAULT NULL,
  `retriesSinceFetch` int(11) DEFAULT NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;
/*!40103 SET TIME_ZONE=@OLD_TIME_ZONE */;

/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;

-- Dump completed on 2013-07-01 9:36:44

2. Install Nutch 2.1

A. Download Nutch: http://apache.etoak.com/nutch/2.1/apache-nutch-2.1-src.zip

In the Gora SQL store settings (${APACHE_NUTCH_HOME}/conf/gora.properties), the default SqlStore properties should be commented out:

###############################
# Default SqlStore properties #
###############################
#gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
#gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
#gora.sqlstore.jdbc.user=sa
#gora.sqlstore.jdbc.password=

Uncomment the following lines, replacing xxxxx with your MySQL user name and password:

###############################
# MySQL properties            #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=xxxxx
gora.sqlstore.jdbc.password=xxxxx

Note that this URL points at a database named nutch, while step 1 created nutch5 (or nutch2 for the backup route). The name in the URL should be the database that holds the webpage table.

D. Edit ${APACHE_NUTCH_HOME}/conf/nutch-site.xml and add the following:

<property>
<name>http.agent.name</name>
<value>Your Nutch Spider</value>
</property>
<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>
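
Because the JDBC URL in the Gora settings names a database, it is easy to end up with Nutch writing to a different database than the one prepared in step 1. The sketch below prints the database name that the active (uncommented) gora.sqlstore.jdbc.url line points at. The helper name jdbc_db_name and the file path in the usage comment are assumptions for illustration, not part of Nutch.

```shell
#!/bin/sh
# Print the database name from the active gora.sqlstore.jdbc.url line
# of a properties file. Commented-out lines (starting with #) are skipped
# because the pattern must match from the start of the line.
jdbc_db_name() {
  # $1: path to the properties file
  sed -n 's|^[[:space:]]*gora\.sqlstore\.jdbc\.url=jdbc:mysql://[^/]*/\([^?]*\).*|\1|p' "$1"
}

# Example, assuming the usual location of the file in a Nutch checkout:
# jdbc_db_name "${APACHE_NUTCH_HOME}/conf/gora.properties"
```

If the printed name differs from the database created earlier, either change the URL or create the webpage table in the database the URL names.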

E. Build ${APACHE_NUTCH_HOME} with ant.

F. Set up the sites to crawl:

cd ${APACHE_NUTCH_HOME}/runtime/local
mkdir -p urls
echo 'http://nutch.apache.org/' > urls/seed.txt

G. Run the crawl:

bin/nutch crawl urls -depth 3 -topN 5

When it finishes, the crawled content can be viewed in MySQL.

3. Install Solr and index the content crawled by Nutch

(Note: the reference material recommends Solr 4.0. I tried both 4.0 releases and neither worked, so I switched to 3.6.1.)

Solr download: http://www.fayea.com/apache-mirror/lucene/solr/3.6.1/apache-solr-3.6.1.zip

A. Unzip the downloaded package.

B. Download http://nlp.solutions.asia/wp-content/uploads/2012/08/schema.xml and use it to replace ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml.

C. Start Solr:

cd ${APACHE_SOLR_HOME}/example
java -jar start.jar

D. Open http://localhost:8983/solr in a browser to check that Solr started successfully.
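
The same check can be done from a terminal, and it may also help to confirm that the crawl actually wrote rows to MySQL before indexing. The following is a minimal sketch, assuming curl and the mysql client are installed. The function names are made up for illustration, and nutch5 and myuser in the usage comments are placeholders.

```shell
#!/bin/sh
# Build the URL of a match-all query that returns no documents; an
# HTTP 2xx answer means Solr is reachable and can serve queries.
solr_probe_url() {
  # $1: Solr base URL, e.g. http://localhost:8983/solr
  echo "${1%/}/select?q=*:*&rows=0"
}

# Succeeds (exit status 0) when Solr answers the probe query.
solr_is_up() {
  curl -sf -o /dev/null "$(solr_probe_url "$1")"
}

# Print the number of rows in the webpage table.
# $1: database name, $2: MySQL user (the client prompts for the password).
count_webpages() {
  mysql -u "$2" -p -N -e 'SELECT COUNT(*) FROM webpage' "$1"
}

# Examples (they need the running services, so they are left commented out):
# solr_is_up http://localhost:8983/solr && echo "Solr is up"
# count_webpages nutch5 myuser
```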

E. Open another Linux terminal and run the following commands so that Solr indexes the content crawled by Nutch:
cd ${APACHE_NUTCH_HOME}/runtime/local/
bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex
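
Once indexing has finished, the index can be queried from the shell as well. The sketch below assumes that the replaced schema.xml defines url and title fields and that curl is installed. The function name search_solr and the CURL override variable are illustrative only. curl -G with --data-urlencode percent-encodes the query, which matters for Chinese search terms.

```shell
#!/bin/sh
# Query the Solr select handler and print matching documents as JSON.
# $1: Solr base URL, $2: query text, $3: number of rows (default 5).
# CURL can be overridden (for example CURL=echo) to preview the request.
search_solr() {
  "${CURL:-curl}" -sG "${1%/}/select" \
    --data-urlencode "q=$2" \
    --data-urlencode "fl=url,title" \
    --data-urlencode "rows=${3:-5}" \
    --data-urlencode "wt=json" \
    --data-urlencode "indent=true"
}

# Examples (they need a running Solr, so they are left commented out):
# search_solr http://127.0.0.1:8983/solr '*:*'
# search_solr http://127.0.0.1:8983/solr 'title:nutch' 10
```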

4. Test

Open http://localhost:8983/solr in a browser; the Solr page should load.

Reposted from: https://www.cnblogs.com/fengfengqingqingyangyang/p/3156696.html
