Java Web Scraping (2): Scraping JD.com iPhone Listings and Writing JSON Logs
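Preparation
Set up the Maven environment
Download the browser driver and reference it from the project
Download the browser driver
Download the Chrome driver (ChromeDriver) from the Huawei Cloud mirror site:
https://mirrors.huaweicloud.com/home
Pick the driver version that matches the Chrome version installed on your machine.
Add the dependencies to pom.xml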
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.141.59</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-api</artifactId>
    <version>3.141.59</version>
</dependency>
Start scraping
Target URL
https://search.jd.com/Search?keyword=iphone%2013&psort=3&wq=iphone%2013&psort=3&pvid=37f665a84459460eb713df08bdcd7799&page=1&s=1&click=0
The goal is to scrape the price, shop name, and product description of every item on all 47 result pages.
URL of the first page:
https://search.jd.com/Search?keyword=iphone%2013&psort=3&wq=iphone%2013&psort=3&pvid=37f665a84459460eb713df08bdcd7799&page=1&s=1&click=0
URL of the second page:
https://search.jd.com/Search?keyword=iphone%2013&psort=3&wq=iphone%2013&psort=3&pvid=37f665a84459460eb713df08bdcd7799&page=3&s=61&click=0
URL of the third page:
https://search.jd.com/Search?keyword=iphone%2013&psort=3&wq=iphone%2013&psort=3&pvid=37f665a84459460eb713df08bdcd7799&page=5&s=121&click=0
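Comparing the three URLs shows the pattern, where i is the result page number starting at 1:
page = 1, 3, 5, always odd: page = 2i - 1
s = 1, 61, 121, increasing by 60 each time: s = ((i - 1) * 60) + 1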
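The two formulas can be checked with a small helper before any browser is involved. The following is a minimal sketch: the class and method names (JdSearchUrls, pageParam, startParam, urlFor) are made up for illustration, and the base URL is the shortened query string used in the loop later in this article. Computing both values in their own methods also keeps the arithmetic out of the string concatenation, where a stray "+ 1" would be appended as text.

```java
/** Sketch: builds JD search URLs from a 1-based result page number. */
final class JdSearchUrls {
    // Assumption: the shortened query string used in this article's loop.
    static final String BASE =
            "https://search.jd.com/search?keyword=iphone%2013&wq=iphone%2013&cid3=655&psort=3";

    /** Value of the "page" parameter: 1, 3, 5, ... */
    static int pageParam(int i) {
        return 2 * i - 1;
    }

    /** Value of the "s" parameter: 1, 61, 121, ... */
    static int startParam(int i) {
        return (i - 1) * 60 + 1;
    }

    /** Full URL for result page i (i starts at 1). */
    static String urlFor(int i) {
        if (i < 1) {
            throw new IllegalArgumentException("page number must be >= 1");
        }
        return BASE + "&page=" + pageParam(i) + "&s=" + startParam(i) + "&click=0";
    }

    public static void main(String[] args) {
        // Print the URLs of the first three result pages
        for (int i = 1; i <= 3; i++) {
            System.out.println(urlFor(i));
        }
    }
}
```

For the last of the 47 pages the same formulas give page=93 and s=2761.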
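Main code and steps
Create a browser object
ChromeDriver webDriver = new ChromeDriver();
Loop over the result pages. Note that i < 45 visits only pages 1 to 44; change the bound to i <= 47 to cover all 47 pages.
for (int i = 1; i < 45; i++) {
    // Parenthesize the whole s expression; otherwise "+ 1" is appended as text and the URL gets s=01, s=601, ...
    webDriver.get("https://search.jd.com/search?keyword=iphone%2013&wq=iphone%2013&cid3=655&psort=3&page="
            + ((2 * i) - 1) + "&s=" + (((i - 1) * 60) + 1) + "&click=0");
Scroll to the bottom of the page, because JD does not load some of the items until the page has been scrolled all the way down
    ((JavascriptExecutor) webDriver).executeScript("window.scrollTo(0,document.body.scrollHeight)");
Get the page source
    String pageSource = webDriver.getPageSource();
Select one item cell in the browser's developer tools and copy its selector path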
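Look up the price path
Relative to each item inside the loop body:
div.p-price > strong > i
Likewise, the shop name path: div.p-shop > span > a
Product description path: div.p-name.p-name-type-2 > a > em
We also need a Product class, with one object created per item: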
package com.zygxy.shop;

// Holds the fields scraped for one search result item
public class Product {
    private String price;        // listed price
    private String shopname;     // shop name
    private String shopcontext;  // product description text

    public String getPrice() {
        return price;
    }

    public void setPrice(String price) {
        this.price = price;
    }

    public String getShopname() {
        return shopname;
    }

    public void setShopname(String shopname) {
        this.shopname = shopname;
    }

    public String getShopcontext() {
        return shopcontext;
    }

    public void setShopcontext(String shopcontext) {
        this.shopcontext = shopcontext;
    }
}
Full code
package com.zygxy.shop;

import com.alibaba.fastjson.JSONObject;
import lombok.extern.slf4j.Slf4j;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.chrome.ChromeDriver;

import java.io.IOException;
import java.util.Properties;

@Slf4j
public class JD {
    public static void main(String[] args) throws IOException {
        // Read the local chromedriver path from application.properties on the classpath
        Properties properties = new Properties();
        properties.load(JD.class.getClassLoader().getResourceAsStream("application.properties"));
        System.out.println(properties.getProperty("chromedriver"));
        System.setProperty("webdriver.chrome.driver", properties.getProperty("chromedriver"));

        ChromeDriver webDriver = new ChromeDriver();
        try {
            for (int i = 1; i < 45; i++) {
                // page = 2i - 1 and s = (i - 1) * 60 + 1; the outer parentheses keep "+ 1" arithmetic
                webDriver.get("https://search.jd.com/search?keyword=iphone%2013&psort=3&wq=iphone%2013&psort=3"
                        + "&pvid=aa23e9f58a714e9087f316ab6aa993bd&cid3=655&cid2=653&page="
                        + ((2 * i) - 1) + "&s=" + (((i - 1) * 60) + 1) + "&click=0");
                // Scroll to the bottom so that lazily loaded items are rendered
                ((JavascriptExecutor) webDriver).executeScript("window.scrollTo(0,document.body.scrollHeight)");
                // Parse the rendered HTML with jsoup
                String pageSource = webDriver.getPageSource();
                Document parse = Jsoup.parse(pageSource);
                Elements select = parse.select("#J_goodsList > ul > li > div");
                for (Element e : select) {
                    Product product = new Product();
                    product.setPrice(textOf(e, "div.p-price > strong > i"));
                    product.setShopname(textOf(e, "div.p-shop > span > a"));
                    product.setShopcontext(textOf(e, "div.p-name.p-name-type-2 > a > em"));
                    // Write one JSON object per log line
                    String json = JSONObject.toJSONString(product);
                    log.info(json);
                }
            }
        } finally {
            // Close the browser and stop the driver process
            webDriver.quit();
        }
    }

    // selectFirst returns null when nothing matches, so fall back to an empty string
    private static String textOf(Element parent, String cssQuery) {
        Element found = parent.selectFirst(cssQuery);
        return found == null ? "" : found.text();
    }
}
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.example</groupId>
    <artifactId>NovelParse</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.13.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-java -->
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-java</artifactId>
            <version>3.141.59</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-api -->
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-api</artifactId>
            <version>3.141.59</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.alibaba/fastjson -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.75</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/ch.qos.logback/logback-core -->
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-core</artifactId>
            <version>1.2.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/ch.qos.logback/logback-classic -->
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
            <version>1.2.3</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.25</version>
            <scope>compile</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.projectlombok/lombok -->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.12</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <!-- If the JDK compiler level is already set in the global Maven configuration, this plugin can be omitted -->
            <!-- <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin> -->
            <!-- Running maven-assembly-plugin in the package phase also bundles the bytecode from the project's
                 dependency JARs. The default maven-jar-plugin packages only your own code and assumes the
                 dependencies are already installed in the local repository. -->
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <!-- Must be the fully qualified name of the main class, which lives in package com.zygxy.shop -->
                            <mainClass>com.zygxy.shop.JD</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
resources configuration
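application.properties
Path to the local driver. A backslash is an escape character in a .properties file, so write the Windows path with forward slashes (or with doubled backslashes):
chromedriver =E:/shixun/pachong/src/main/resources/chromedriver.exe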
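How java.util.Properties treats backslashes in a path value can be seen without starting a browser. The sketch below is illustrative only: the class name PropertiesPathDemo and the path C:\drivers\chromedriver.exe are hypothetical, and the text is parsed from an in-memory string instead of a real application.properties file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Sketch: shows how Properties.load handles backslashes in a Windows path. */
final class PropertiesPathDemo {
    /** Parses one line of .properties text and returns the value stored under "chromedriver". */
    static String parse(String line) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(line));
        } catch (IOException e) {
            // Not expected for an in-memory reader
            throw new IllegalStateException(e);
        }
        return p.getProperty("chromedriver");
    }

    public static void main(String[] args) {
        // Java source doubles each backslash, so the file text here is C:\drivers\chromedriver.exe
        System.out.println(parse("chromedriver =C:\\drivers\\chromedriver.exe"));
        // Forward slashes pass through unchanged
        System.out.println(parse("chromedriver =C:/drivers/chromedriver.exe"));
        // File text with doubled backslashes: C:\\drivers\\chromedriver.exe
        System.out.println(parse("chromedriver =C:\\\\drivers\\\\chromedriver.exe"));
    }
}
```

With single backslashes the separators are silently dropped, and a segment that starts with t, n, r, or f would even turn into a control character. Forward slashes and doubled backslashes both produce a usable path.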
logback.xml defines where the log is written and its output format.