Java Crawler Series (2): Crawling Dynamic Web Pages
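- Preparation
- Project repository
- Page rendering tool repository
- Starting the page renderer
- Choosing the file for your operating system
- Starting the tool on a specific port
- Project configuration
- seimi.properties
- SeimiAgentDemo.java
- Analyzing the original page source
- Boot.java
- Other articles in this series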
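Preparation
For beginners, I recommend crawling with seimiagent plus seimicrawler. The combination is very easy to pick up and crawls dynamic pages with little effort; I would estimate under 10 minutes to get a first crawl running.
Project repository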
https://github.com/a252937166/seimicrawler
Page rendering tool repository
https://github.com/a252937166/seimiagent
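Starting the page renderer
Choosing the file for your operating system
After downloading seimiagent, pick the file that matches your operating system: use seimiagent.exe on Windows and seimiagent on Linux. There is no Mac version yet. I usually run seimiagent on my own Linux server.
Starting the tool on a specific port
Taking Linux as an example, change into the directory that contains the file and start it on port 8000 with ./seimiagent -p 8000.
Figure (1)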
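Before pointing the crawler at seimiagent, it can help to confirm that something is actually listening on the chosen port. The following is a minimal sketch using only the JDK. The class name AgentPortCheck, the method isListening, and the 2-second timeout are illustrative assumptions and are not part of seimiagent or seimicrawler. It only checks that a TCP connection succeeds, not that the listener really is seimiagent.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper (not part of seimiagent or seimicrawler): checks whether
// a TCP listener is reachable at host:port.
class AgentPortCheck {

    // Returns true if a TCP connection to host:port succeeds within timeoutMillis.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        // 127.0.0.1 and 8000 are the example values used in this article.
        boolean up = isListening("127.0.0.1", 8000, 2000);
        System.out.println(up
                ? "port 8000 is accepting connections"
                : "nothing is listening on port 8000");
    }
}
```

If seimiagent runs on a remote Linux server, replace 127.0.0.1 with that server's address; a false result may also mean a firewall is blocking the port.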
Project configuration
seimi.properties
redis.host=127.0.0.1
redis.port=6379
redis.password=
database.driverClassName=com.mysql.jdbc.Driver
database.url=
database.username=
database.password=
seimiAgentHost=127.0.0.1
seimiAgentPort=8000
Find this configuration file and change seimiAgentHost and seimiAgentPort to the address of your own seimiagent instance.
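To make the role of these two keys concrete, here is a minimal sketch that reads seimiAgentHost and seimiAgentPort with the JDK's java.util.Properties. In the actual project the values are injected through Spring's @Value, so this standalone class (AgentConfig, a hypothetical name) only illustrates what the keys contain and how a malformed port could be caught early.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical illustration: parses the two seimiagent keys from properties text.
class AgentConfig {
    final String host;
    final int port;

    AgentConfig(String host, int port) {
        this.host = host;
        this.port = port;
    }

    // Parses properties-formatted text and validates the seimiagent settings.
    static AgentConfig parse(String propertiesText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText));
        String host = props.getProperty("seimiAgentHost", "").trim();
        String portText = props.getProperty("seimiAgentPort", "").trim();
        if (host.isEmpty()) {
            throw new IllegalArgumentException("seimiAgentHost is not set");
        }
        int port;
        try {
            port = Integer.parseInt(portText);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("seimiAgentPort is not a number: '" + portText + "'");
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("seimiAgentPort out of range: " + port);
        }
        return new AgentConfig(host, port);
    }

    public static void main(String[] args) throws IOException {
        AgentConfig cfg = parse("seimiAgentHost=127.0.0.1\nseimiAgentPort=8000\n");
        System.out.println(cfg.host + ":" + cfg.port); // prints 127.0.0.1:8000
    }
}
```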
SeimiAgentDemo.java
package com.ouyang.crawlers;

import cn.wanghaomiao.seimi.annotation.Crawler;
import cn.wanghaomiao.seimi.def.BaseSeimiCrawler;
import cn.wanghaomiao.seimi.struct.Request;
import cn.wanghaomiao.seimi.struct.Response;
import cn.wanghaomiao.xpath.model.JXDocument;
import org.apache.commons.lang3.StringUtils;
import org.springframework.beans.factory.annotation.Value;

/**
 * This example demonstrates how to use SeimiAgent to scrape complex dynamic pages.
 * @author 汪浩淼 et.tw@163.com
 * @since 2016/4/14.
 */
@Crawler(name = "seimiagent")
public class SeimiAgentDemo extends BaseSeimiCrawler {
    /**
     * Configured in resource/config/seimi.properties so it is easy to change;
     * you can of course use your own centralized configuration service instead.
     */
    @Value("${seimiAgentHost}")
    private String seimiAgentHost;

    @Value("${seimiAgentPort}")
    private int seimiAgentPort;

    @Override
    public String[] startUrls() {
        return new String[]{"https://www.baidu.com"};
    }

    @Override
    public String seimiAgentHost() {
        return this.seimiAgentHost;
    }

    @Override
    public int seimiAgentPort() {
        return this.seimiAgentPort;
    }

    @Override
    public void start(Response response) {
        // Ask SeimiAgent to render the page and allow 5000 ms of render time.
        Request seimiAgentReq = Request.build("http://manhua.fzdm.com/2/889/", "getHtml")
                .useSeimiAgent()
                .setSeimiAgentRenderTime(5000);
        push(seimiAgentReq);
    }

    /**
     * Prints the page content.
     * @param response the response for the rendered page
     */
    public void getHtml(Response response) {
        try {
            System.out.println(response.getContent());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
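Find this demo file and note the .useSeimiAgent() call: it tells the crawler to have seimiagent render the dynamic page. Beyond that, you can also set parameters such as cookie, param, and meta on the request.
We use the getHtml() callback to print the page content, so we can compare it with the raw source and see whether the page was rendered successfully.
Analyzing the original page source
We use the Network panel in Chrome to view the page's original source.
Figure (2)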
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="utf-8">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Language" content="utf-8" />
<meta content="all" name="robots" />
<title>海賊王889話 風之動漫
</title>
<meta name="keywords" content="海賊王889話 " />
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta http-equiv="Cache-Control" content="no-transform" />
<meta http-equiv="Cache-Control" content="no-siteapp" />
<meta name="applicable-device" content="pc,mobile" />
<meta name="HandheldFriendly" content="true" />
<meta property="og:title" content="海賊王889話"/>
<meta property="og:type" content="book"/>
<meta property="og:url" id="readurl" content="http://manhua.fzdm.com/2/889/" /><link rel="stylesheet" href="//static.fzdm.com/pure/pure-min.css">
<link rel="stylesheet" href="//static.fzdm.com/pure/grids-responsive-min.css">
<link rel="stylesheet" href="//static.fzdm.com/pure/fzdm.css">
<link rel="icon" href="//static.fzdm.com/favicon.ico" mce_href="//static.fzdm.com/favicon.ico" type="image/x-icon">
<meta name="renderer" content="webkit">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<link rel="apple-touch-icon" href="//static.fzdm.com/apple-touch-icon-144x144.png" /><style>
.logo {top: -2px;height: 70px;overflow: hidden;}.logo img{height:77px}
#header {height: 70px;
}
#header ul {top: 8px;}.pure-menu.pure-menu-open, .pure-menu.pure-menu-horizontal li .pure-menu-children {text-align: left;height: 70px;background: none;}</style><script>
var _hmt = _hmt || [];
(function() {
var hm = document.createElement("script");
hm.src = "//hm.baidu.com/hm.js?cb51090e9c10cda176f81a7fa92c3dfc";
var s = document.getElementsByTagName("script")[0];
s.parentNode.insertBefore(hm, s);
})();
</script></head>
<body>
<script src="//static.fzdm.com/jquery-1.9.1.min.js?v=1"></script>
<script src="//static.fzdm.com/fzdm.js?v=1"></script>
<script src="//static.fzdm.com/u.js"></script><script src="//dup.baidustatic.com/js/dm.js"></script>
<div id="header">
<div class="pure-g">
<div class="pure-menu pure-menu-open pure-menu-horizontal">
<div class="logo">
<a href="//www.fzdm.com"><img src="//static.fzdm.com/css/logo.png" alt="風之動漫" /></a>
</div>
<ul>
<li><a href="//www.fzdm.com/"> 首頁
</a></li>
<li><a href="//news.fzdm.com/">動漫新聞
</a></li>
<li><a href="//manhua.fzdm.com/">在線漫畫
</a></li>
<li><a href="//flash.fzdm.com/">動漫flash
</a></li></ul></div></div></div></div><center>
</center><br><br>
<div id="weizhi">位置:
<a href="//www.fzdm.com">首頁
</a> -
<a href="../../">在線漫畫
</a> -
<a href="../">海賊王
</a> - 海賊王889話
<h4 style="float:right;margin-right: 100px;"><a href="#comments">海賊王889話討論區
</a></h4></div>
<div id="mh">
<h1>海賊王889話
</h1><div id="mhimg0"><h2><a href="//manhua.fzdm.com/2/889/">《無法觀看》請點擊此處~
</a></h2></div><center><div id="share">
<div class="bdsharebuttonbox"><a href="#" class="bds_more" data-cmd="more">分享
<strong>海賊王889話漫畫
</strong>到:
</a><a href="#" class="bds_qzone" data-cmd="qzone" title="分享到QQ空間">QQ空間
</a><a href="#" class="bds_weixin" data-cmd="weixin" title="分享到微信">微信
</a><a href="#" class="bds_sqq" data-cmd="sqq" title="分享到QQ好友">QQ好友
</a><a href="#" class="bds_tsina" data-cmd="tsina" title="分享到新浪微博">微博
</a><a href="#" class="bds_tqq" data-cmd="tqq" title="分享到騰訊微博">騰訊
</a><a href="#" class="bds_renren" data-cmd="renren" title="分享到人人網">人人網
</a><a href="#" class="bds_fbook" data-cmd="fbook" title="分享到Facebook">Facebook
</a><a href="#" class="bds_baidu" data-cmd="baidu" title="分享到百度搜藏">百度搜藏
</a><a href="#" class="bds_bdhome" data-cmd="bdhome" title="分享到百度新首頁">百度首頁
</a><a class="bds_count" data-cmd="count"></a></div>
</div><div id="ad">
<script src='//m.xmshqh.com/fz2.js'></script>
</div></center>
<div class="navigation">
<a href="index_0.html" id="mhona">第1頁
</a><a href="index_1.html">2
</a><a href="index_2.html">3
</a><a href="index_3.html">4
</a><a href="index_4.html">5
</a><a href="index_5.html">6
</a><a href="index_6.html">7
</a><a href="index_7.html">8
</a><a href="index_8.html">9
</a><a href="index_9.html">10
</a><a href="index_10.html">11
</a><a href="index_11.html">12
</a><a href="index_12.html">13
</a><a href="index_13.html">14
</a><a href="index_14.html">15
</a><a href="index_15.html">16
</a><a href="index_16.html">17
</a><a href='index_1.html' id="mhona">下一頁
</a></div><br />
<br />
<script type="text/javascript">document.write('<a style="display:none!important" id="tanx-a-mm_10028503_120355_28042038"></a>');tanx_s = document.createElement("script");tanx_s.type = "text/javascript";tanx_s.charset = "gbk";tanx_s.id = "tanx-s-mm_10028503_120355_28042038";tanx_s.async = true;tanx_s.src = "//p.tanx.com/ex?i=mm_10028503_120355_28042038";tanx_h = document.getElementsByTagName("head")[0];if(tanx_h)tanx_h.insertBefore(tanx_s,tanx_h.firstChild);
</script><script type="text/javascript">document.write('<a style="display:none!important" id="tanx-a-mm_10028503_120355_28058018"></a>');tanx_s = document.createElement("script");tanx_s.type = "text/javascript";tanx_s.charset = "gbk";tanx_s.id = "tanx-s-mm_10028503_120355_28058018";tanx_s.async = true;tanx_s.src = "//p.tanx.com/ex?i=mm_10028503_120355_28058018";tanx_h = document.getElementsByTagName("head")[0];if(tanx_h)tanx_h.insertBefore(tanx_s,tanx_h.firstChild);
</script>
<script type="text/javascript">document.write('<a style="display:none!important" id="tanx-a-mm_10028503_120355_28066012"></a>');tanx_s = document.createElement("script");tanx_s.type = "text/javascript";tanx_s.charset = "gbk";tanx_s.id = "tanx-s-mm_10028503_120355_28066012";tanx_s.async = true;tanx_s.src = "//p.tanx.com/ex?i=mm_10028503_120355_28066012";tanx_h = document.getElementsByTagName("head")[0];if(tanx_h)tanx_h.insertBefore(tanx_s,tanx_h.firstChild);
</script>
<br /><br><br>
<script charset="gbk" src="//p.tanx.com/ex?i=mm_10028503_120355_41360495"></script><br /><br />
<br /><br /><div id="weizhi">熱門漫畫導航:
<a href='//manhua.fzdm.com/91/' target=_blank>美食的俘虜漫畫
</a> -
<a href='//manhua.fzdm.com/7/ 'target=_blank>死神漫畫
</a> -
<a href='//manhua.fzdm.com/39/' target=_blank>進擊的巨人漫畫
</a> -
<a href='//manhua.fzdm.com/35/' target=_blank>家庭教師漫畫
</a> -
<a href="//manhua.fzdm.com/27/" target=_blank>妖精的尾巴漫畫
</a> -
<a href="//manhua.fzdm.com/1/" target=_blank>火影忍者漫畫
</a> -
<a href='//manhua.fzdm.com/53/' target=_blank>黑子的籃球漫畫
</a> -
<a href='//manhua.fzdm.com/45/' target=_blank>惡魔奶爸漫畫
</a> -
<a href='//manhua.fzdm.com/51/' target=_blank>史上最強弟子兼一漫畫
</a> -
<a href='//manhua.fzdm.com/74/' target=_blank>王者天下漫畫
</a> -
<a href='//manhua.fzdm.com/56/' target=_blank>七原罪漫畫
</a> -
<a href='//manhua.fzdm.com/141/' target=_blank>暗殺教室漫畫
</a></div>
<div id="mhimg1"></div>
<script type="text/javascript">var mhurl = "2017/12/22064917941533.jpg";
var mhss = getCookie("picHost");
if (mhss == "") {mhss = "p1.xiaoshidi.net";
}
if (mhurl.indexOf("2015") != -1 || mhurl.indexOf("2016") != -1|| mhurl.indexOf("2017") != -1 || mhurl.indexOf("2018") != -1){}else{mhss = mhss.replace(/p1/,"p0");
};var mhpicurl = mhss+"/"+mhurl;
if (mhurl.indexOf("http") != -1){mhpicurl = mhurl;
};
function nofind(){var img=event.srcElement;img.src="http://p1.xiaoshidi.net/"+mhurl;
var exp = new Date();
exp.setTime(exp.getTime() - 1);
document.cookie = "picHost=0;path=/;domain=fzdm.com;expires="+exp.toGMTString();
img.onerror=null;
};
$("#mhimg0").html('<a href="index_1.html"><img src="http://'+mhpicurl+'" id="mhpic" alt="海賊王889話" onerror="nofind();" /></a>');var mhurl1 = "2017/12/22064917942026.jpg";
mhpicurl = mhss+"/"+mhurl1;
$("#mhimg1").html('<img src="http://'+mhpicurl+'" width="0" height="0" id="mhpic1" />');</script><br />
<br />
<br />
<script>
if (document.location.protocol == "http:"){
window._bd_share_config={"common":{"bdSnsKey":{},"bdText":"海賊王889話 風之動漫","bdUrl":"//manhua.fzdm.com/2/889/","bdDesc":"海賊王889話","bdMini":"2","bdMiniList":false,"bdSign":"","bdPic":"","bdStyle":"0","bdSize":"16"},"share":{"bdSize":16}};with(document)0[(getElementsByTagName('head')[0]||body).appendChild(createElement('script')).src='http://bdimg.share.baidu.com/static/api/js/share.js?v=89860593.js?cdnversion='+~(-new Date()/36e5)];
};
</script>
</div><div class="clear"></div>
<div id="footer">
<div id="hd">
<div class="bg"></div>
<br><a href="//www.fzdm.com/about">關于我們
</a> |
<a href="//www.fzdm.com/lianxi">聯系我們
</a> |
<a href="//www.fzdm.com/map">網站地圖
</a><br />
Copyright ⓒ 2014-2015 風之動漫 版本beta 0.3
<br />
</div>
</div><div style="display:none;" ><script src="//static.fzdm.com/stat.js"></script></div></body>
</html>
請(qǐng)注意這段代碼
<div id="mhimg0"><h2><a href="//manhua.fzdm.com/2/889/">《無(wú)法觀看》請(qǐng)點(diǎn)擊此處~
</a></h2></div>
如果直接爬取原網(wǎng)頁(yè),肯定沒(méi)法獲取圖片的,那么圖片從哪來(lái)呢?
var mhurl = "2017/12/22064917941533.jpg";
var mhss = getCookie("picHost");
if (mhss == "") {
    mhss = "p1.xiaoshidi.net";
}
if (mhurl.indexOf("2015") != -1 || mhurl.indexOf("2016") != -1 || mhurl.indexOf("2017") != -1 || mhurl.indexOf("2018") != -1) {
} else {
    mhss = mhss.replace(/p1/, "p0");
}
var mhpicurl = mhss + "/" + mhurl;
if (mhurl.indexOf("http") != -1) {
    mhpicurl = mhurl;
}
function nofind() {
    var img = event.srcElement;
    img.src = "http://p1.xiaoshidi.net/" + mhurl;
    var exp = new Date();
    exp.setTime(exp.getTime() - 1);
    document.cookie = "picHost=0;path=/;domain=fzdm.com;expires=" + exp.toGMTString();
    img.onerror = null;
}
$("#mhimg0").html('<a href="index_1.html"><img src="http://' + mhpicurl + '" id="mhpic" alt="海賊王889話" onerror="nofind();" /></a>');
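This JavaScript runs automatically once the page has loaded and rewrites the content of <div id="mhimg0"></div>; only then does the image appear.
The script makes no extra requests; it only modifies the page content. To extract the image directly in Java in this situation, the only option is a regular expression, and if the JS code changes even slightly, the regex stops working. For pages like this, the best approach is to let SeimiAgent return the page after the JS has been rendered.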
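To illustrate why the regex route is brittle, here is a minimal sketch that pulls mhurl out of the raw page source and rebuilds the image URL by mirroring the page script shown above. The class and method names (RawPageImageUrl, extractMhurl, buildImageUrl) are hypothetical, and the picHost cookie lookup is replaced by the script's default host p1.xiaoshidi.net. The pattern depends on the exact var mhurl = "..." form, so a change to the variable name or quoting on the site would break it.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the regex-only approach; not part of seimicrawler.
class RawPageImageUrl {

    // Matches: var mhurl = "2017/12/22064917941533.jpg";
    private static final Pattern MHURL =
            Pattern.compile("var\\s+mhurl\\s*=\\s*\"([^\"]+)\"");

    // Returns the value assigned to mhurl in the raw HTML, if present.
    static Optional<String> extractMhurl(String rawHtml) {
        Matcher m = MHURL.matcher(rawHtml);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Mirrors the page script, assuming the picHost cookie is empty.
    static String buildImageUrl(String mhurl) {
        if (mhurl.contains("http")) {
            // Deliberate deviation: the page script would prepend "http://"
            // once more; this sketch returns the absolute URL unchanged.
            return mhurl;
        }
        String host = "p1.xiaoshidi.net";
        boolean recent = mhurl.contains("2015") || mhurl.contains("2016")
                || mhurl.contains("2017") || mhurl.contains("2018");
        if (!recent) {
            // Older images are served from p0 instead of p1.
            host = host.replaceFirst("p1", "p0");
        }
        return "http://" + host + "/" + mhurl;
    }

    public static void main(String[] args) {
        String rawHtml = "<script type=\"text/javascript\">"
                + "var mhurl = \"2017/12/22064917941533.jpg\";</script>";
        String url = extractMhurl(rawHtml)
                .map(RawPageImageUrl::buildImageUrl)
                .orElse("mhurl not found");
        System.out.println(url); // prints http://p1.xiaoshidi.net/2017/12/22064917941533.jpg
    }
}
```

Note that this sketch has to re-implement the site's host-selection rules in Java. Those rules are exactly what would drift out of sync when the page script changes, which is the maintenance cost the rendered-page approach avoids.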
Boot.java
package com.ouyang.main;

import cn.wanghaomiao.seimi.core.Seimi;

/**
 * @author 汪浩淼 [et.tw@163.com]
 * @since 2015/10/21.
 */
public class Boot {
    public static void main(String[] args) {
        Seimi s = new Seimi();
        s.goRun("seimiagent");
    }
}
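In goRun("seimiagent"), just pass the name of the corresponding crawler.
Run the main function:
Figure (3)
This is the rendering log from seimiagent; the Windows version runs in the background and shows no rendering log.
Console output:
Figure (4)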
<div id="mhimg0"><a href="index_1.html"><img src="http://p1.xiaoshidi.net/2017/12/22064917941533.jpg" id="mhpic" alt="海賊王889話" onerror="nofind();"></a></div>
This snippet clearly shows that the page we received has already been rendered successfully.
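The comparison between the raw source and the rendered output can also be done programmatically. The sketch below is a rough, string-based check under stated assumptions: it treats a page as rendered when the mhimg0 container holds an <img> with id="mhpic", which is specific to this example page. RenderCheck and looksRendered are hypothetical names, and a real crawler would more likely use an HTML or XPath parser than a regular expression.

```java
import java.util.regex.Pattern;

// Hypothetical helper for this example page only; not part of seimicrawler.
class RenderCheck {

    // Matches the mhimg0 container only when an <img ... id="mhpic"> appears
    // before the container's first closing </div>.
    private static final Pattern RENDERED = Pattern.compile(
            "<div id=\"mhimg0\">(?:(?!</div>).)*<img[^>]*id=\"mhpic\"",
            Pattern.DOTALL);

    static boolean looksRendered(String html) {
        return RENDERED.matcher(html).find();
    }

    public static void main(String[] args) {
        String raw = "<div id=\"mhimg0\"><h2><a href=\"//manhua.fzdm.com/2/889/\">"
                + "placeholder</a></h2></div>";
        String rendered = "<div id=\"mhimg0\"><a href=\"index_1.html\">"
                + "<img src=\"http://p1.xiaoshidi.net/2017/12/22064917941533.jpg\" id=\"mhpic\">"
                + "</a></div>";
        System.out.println(looksRendered(raw));      // prints false
        System.out.println(looksRendered(rendered)); // prints true
    }
}
```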
Overall, this crawler framework is quite simple. Readers who want to dig deeper into it can read the next article, which covers a hands-on project.
Other articles in this series
Java Crawler Series (1): Getting Started with Crawlers
Java Crawler Series (3): Crawling a Comic Site in Practice
Java Crawler Series (4): An Upgraded Dynamic Page Crawler
Java Crawler Series (5): Crawling Toutiao Articles in Practice