Java Crawler Techniques (1): Scraping Images from an Ordinary Website

Published: 2025/3/20


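A brief introduction to crawlers

Normally a user's machine sends a request to a website's server, and the server returns a response.
A crawler imitates that user machine: it sends a request to the server, receives the response data, and then parses that data to extract what we want.

The steps are roughly:

Prepare the URL to crawl
Define the crawler's parameters
Start the crawl
Receive the crawled data
Parse the data with a selector (this article uses CSS selectors through jsoup)
Extract the data we want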

Preparation

Create a new Maven project and configure pom.xml.

jsoup, the crawler library:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>

commons-io, used to write the downloaded bytes to a file through an I/O stream:

<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.4</version>
</dependency>

Exercise: scraping images from a website

We will scrape the images on this page:

https://dou.yuanmazg.com/doutu?page=1

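Main steps and code

Pick one image in the browser's developer tools and copy its selector:

#pic-detail > div > div.col-sm-9 > div.page-content > a:nth-child(2) > img

a:nth-child(2)

The :nth-child(2) suffix after a restricts the match to the link that is the second child of its parent, so it identifies which image on the page is selected.

Remove that suffix and the selector matches every image on the page:

Elements select = dom.select("#pic-detail > div > div.col-sm-9 > div.page-content > a > img");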
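
To see the effect of dropping :nth-child(2), here is a minimal sketch that runs both selectors against a small inline HTML fragment. The fragment, the class name SelectorDemo, and the helper countMatches are assumptions made up for illustration; the real page markup is more deeply nested.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    // Hypothetical stand-in for the page: three links, each wrapping one image.
    static final String HTML =
          "<div class=\"page-content\">"
        + "<a href=\"/1\"><img data-original=\"/img/a.jpg\"></a>"
        + "<a href=\"/2\"><img data-original=\"/img/b.jpg\"></a>"
        + "<a href=\"/3\"><img data-original=\"/img/c.jpg\"></a>"
        + "</div>";

    // Parses the fragment and returns how many elements the CSS query matches.
    static int countMatches(String cssQuery) {
        Document dom = Jsoup.parse(HTML);
        return dom.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Only the image inside the second link
        System.out.println(countMatches("div.page-content > a:nth-child(2) > img")); // prints 1
        // Every image whose link is a direct child of div.page-content
        System.out.println(countMatches("div.page-content > a > img"));              // prints 3
    }
}
```

Because the fragment is parsed from a string, the sketch runs without any network access, which makes it a convenient way to try out a selector before pointing it at the live page.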

Define the crawler's parameters

Connection.Response response = Jsoup.connect("https://dou.yuanmazg.com/doutu?page=1")
        .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0")
        .ignoreContentType(true)
        .timeout(10000)
        .execute();

The line

.header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0")

tells the target server that the request comes from a user's browser.

execute() is the equivalent of pressing Enter: it is the call that actually sends the request.
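
As a rough illustration that the parameters are only recorded until execute() is called, the sketch below builds the same connection and inspects the pending request without sending it. The class name RequestConfigDemo and the helper configure are hypothetical; the jsoup calls used are Jsoup.connect, Connection.header, Connection.ignoreContentType, Connection.timeout, and Connection.request.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class RequestConfigDemo {
    static final String USER_AGENT =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0";

    // Builds the connection with the article's parameters but does not send anything.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .header("User-Agent", USER_AGENT) // present ourselves as a browser
                .ignoreContentType(true)          // accept non-HTML responses such as images
                .timeout(10000);                  // give up after 10 seconds
    }

    public static void main(String[] args) {
        Connection conn = configure("https://dou.yuanmazg.com/doutu?page=1");

        // request() exposes what would be sent; no network traffic has happened yet.
        Connection.Request request = conn.request();
        System.out.println(request.method());
        System.out.println(request.header("User-Agent"));
        System.out.println(request.timeout());

        // Only conn.execute() would actually contact the server.
    }
}
```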

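Write the file to the target folder

byte[] bytes = imgResponse.bodyAsBytes();
try (FileOutputStream out = new FileOutputStream(new File("d://斗圖啦//" + filename))) {
    IOUtils.write(bytes, out);
}

String img_url = element.attr("data-original");

This reads the value of the element's data-original attribute, which holds the image path.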
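
The file-name and save steps can be tried on their own, without the network. The sketch below is an assumption-laden stand-in: it uses java.nio.file.Files from the standard library instead of commons-io, writes into a temporary directory instead of d://斗圖啦//, and the path /uploads/2021/cat.gif and the three placeholder bytes are invented for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SaveImageDemo {
    // Everything after the last '/' is treated as the file name.
    // If there is no '/', lastIndexOf returns -1 and the whole string is kept.
    static String fileNameOf(String imgPath) {
        return imgPath.substring(imgPath.lastIndexOf("/") + 1);
    }

    // Creates the folder if needed, then writes the bytes and returns the saved path.
    static Path save(byte[] bytes, Path dir, String fileName) throws IOException {
        Files.createDirectories(dir);
        return Files.write(dir.resolve(fileName), bytes);
    }

    public static void main(String[] args) throws IOException {
        String imgPath = "/uploads/2021/cat.gif"; // hypothetical data-original value
        byte[] bytes = {0x47, 0x49, 0x46};        // placeholder for imgResponse.bodyAsBytes()

        Path dir = Files.createTempDirectory("doutu");
        Path saved = save(bytes, dir, fileNameOf(imgPath));

        System.out.println(saved.getFileName() + " " + Files.size(saved)); // prints "cat.gif 3"
    }
}
```

Creating the directory before writing avoids the FileNotFoundException that a plain FileOutputStream throws when the target folder does not exist.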

Complete code

package com.zygxy.parse;

import org.apache.commons.io.IOUtils;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class Jsoup_Study {
    public static void main(String[] args) throws IOException {
        // Use jsoup to send the request the way a browser would
        String website = "http://dou.yuanmazg.com";
        Connection.Response response = Jsoup.connect("https://dou.yuanmazg.com/doutu?page=1")
                .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36")
                .ignoreContentType(true)
                .timeout(10000)
                .execute();
        // System.out.println(response.header("Content-Type")); // response header
        System.out.println(response.body()); // response body
        String html = response.body();

        // Parse the HTML with jsoup
        Document dom = Jsoup.parse(html);

        // Selector for a single image:
        // #pic-detail > div > div.col-sm-9 > div.page-content > a:nth-child(2) > img
        // Without :nth-child(2) it matches all of them
        Elements select = dom.select("#pic-detail > div > div.col-sm-9 > div.page-content > a > img");

        // Make sure the output folder exists before writing into it
        File dir = new File("d://斗圖啦//");
        dir.mkdirs();

        // Handle each image in turn
        for (Element element : select) {
            String img_url = element.attr("data-original");
            String realurl = website + img_url;
            int i = img_url.lastIndexOf("/");
            String filename = img_url.substring(i + 1);
            System.out.println(filename);
            System.out.println(realurl);

            Connection.Response imgResponse = Jsoup.connect(realurl)
                    .ignoreContentType(true)
                    .timeout(10000)
                    .maxBodySize(10 * 1024 * 1024) // allow response bodies up to 10 MB
                    .execute();

            // Images are binary data; audio, video, and images are all read as bytes
            byte[] bytes = imgResponse.bodyAsBytes();
            try (FileOutputStream out = new FileOutputStream(new File(dir, filename))) {
                IOUtils.write(bytes, out);
            }
        }
    }
}
