Parsing HTML with Jsoup

Published: 2024/1/1

Parse HTML from a URL, a file, or a string

Find and extract data using DOM methods or CSS selectors

Manipulate HTML elements, attributes, and text
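As a rough illustration of these three capabilities, the sketch below parses a string, reads a value with a selector, and then changes an element. The markup and the class name `JsoupFeaturesSketch` are made up for this example; only the jsoup calls themselves (`Jsoup.parse`, `select`, `getElementById`, `text`, `attr`, `outerHtml`) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupFeaturesSketch {
    // Hypothetical markup, used only for this illustration.
    static final String HTML =
            "<html><head><title>Demo</title></head>"
            + "<body><p id=\"greeting\" class=\"old\">Hello</p></body></html>";

    /** Parses the string and returns the text of the element with id "greeting". */
    static String readGreeting() {
        Document doc = Jsoup.parse(HTML);               // parse HTML from a string
        return doc.select("#greeting").first().text();  // find with a CSS selector, extract the text
    }

    /** Changes the element's text and class attribute, then returns its outer HTML. */
    static String rewriteGreeting() {
        Document doc = Jsoup.parse(HTML);
        Element p = doc.getElementById("greeting");     // find with a DOM method
        p.text("Hi there");                             // replace the text
        p.attr("class", "new");                         // replace an attribute value
        return p.outerHtml();
    }

    public static void main(String[] args) {
        System.out.println(readGreeting());
        System.out.println(rewriteGreeting());
    }
}
```

Parsing from a string like this needs no network access, which makes it convenient for trying out selectors before pointing the code at a real page.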

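2. Getting Started Example

We will use Jsoup to extract the page title and the slogan from the cnblogs (博客園) home page, http://www.cnblogs.com/.

Here we use HttpClient to fetch the page content: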

Gradle configuration:

// Add HttpClient support
// https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient
// (on Gradle 7 and later, write "implementation" in place of "compile")
compile group: 'org.apache.httpcomponents', name: 'httpclient', version: '4.5.7'

// Add jsoup support
// https://mvnrepository.com/artifact/org.jsoup/jsoup
compile group: 'org.jsoup', name: 'jsoup', version: '1.11.3'

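Maven project:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.7</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>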

/**
 * Fetches the given URL and returns the response body as a string.
 */
public String getHtml(String url) throws IOException {
    // try-with-resources closes the response and then the client, even if an exception is thrown
    try (CloseableHttpClient httpclient = HttpClients.createDefault(); // create the HttpClient instance
         CloseableHttpResponse response = httpclient.execute(new HttpGet(url))) { // execute the GET request
        HttpEntity entity = response.getEntity(); // get the response entity
        return EntityUtils.toString(entity, "utf-8"); // read the body as UTF-8 text
    }
}

/**
 * Scrape cnblogs:
 * 1. the page title
 * 2. the slogan
 */
@Test
public void test() throws IOException {
    Document doc = Jsoup.parse(getHtml("http://www.cnblogs.com/")); // parse the page into a Document
    Elements elements = doc.getElementsByTag("title"); // all elements whose tag is title
    Element element = elements.get(0); // the first of them
    String title = element.text(); // the element's text
    System.out.println("Page title: " + title);
    Element element2 = doc.getElementById("site_nav_top"); // the element with id=site_nav_top
    String navTop = element2.text(); // the element's text
    System.out.println("Slogan: " + navTop);
}

Output (the text comes from the site itself, so it is in Chinese):

Page title: 博客園 - 代碼改變世界

Slogan: 代碼改變世界
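The test above assumes that the page still contains an element with id site_nav_top. `getElementById` returns null when nothing matches, so a change in the site's markup would surface as a NullPointerException. A minimal sketch of a more defensive lookup follows; the helper name `textById`, the fallback string, and the inline markup are assumptions made for this example, not part of the original code.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SafeLookupSketch {
    /** Returns the text of the element with the given id, or the fallback if no such element exists. */
    static String textById(Document doc, String id, String fallback) {
        Element element = doc.getElementById(id); // null when no element has this id
        return element != null ? element.text() : fallback;
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for a downloaded page.
        Document doc = Jsoup.parse("<div id=\"site_nav_top\">slogan text</div>");
        System.out.println(textById(doc, "site_nav_top", "(not found)"));
        System.out.println(textById(doc, "no_such_id", "(not found)"));
    }
}
```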

/**
 * Get the URLs of the articles
 */
@Test
public void test5() throws IOException {
    Document doc = Jsoup.parse(getHtml("http://www.cnblogs.com/")); // parse the page into a Document
    Elements linkElements = doc.select("#post_list .post_item .post_item_body h3 a"); // select all blog post links
    for (Element e : linkElements) {
        System.out.println(e.attr("href")); // the link target as written in the page
    }
}
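`attr("href")` returns the attribute exactly as it appears in the markup, so relative links stay relative. If absolute URLs are needed, one option is to pass a base URI to `Jsoup.parse` and read the link with `absUrl`. The sketch below assumes made-up markup and a hypothetical helper named `absoluteLinks`.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AbsoluteUrlSketch {
    /** Returns every link target in the HTML, resolved against the given base URI. */
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri); // the base URI is used to resolve relative links
        List<String> urls = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            urls.add(a.absUrl("href"));            // absolute form of the href attribute
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical markup with one relative and one absolute link.
        String html = "<a href=\"/p/1.html\">one</a><a href=\"http://example.com/x\">two</a>";
        for (String url : absoluteLinks(html, "http://www.cnblogs.com/")) {
            System.out.println(url);
        }
    }
}
```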

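3. Finding DOM Elements with Jsoup

Jsoup provides a rich API for finding the DOM elements we need. The most commonly used methods are:

getElementById(String id) finds an element by its id

getElementsByTag(String tagName) finds elements by tag name

getElementsByClass(String className) finds elements by CSS class name

getElementsByAttribute(String key) finds elements that have the given attribute

getElementsByAttributeValue(String key, String value) finds elements whose attribute has the given value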
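To see what each of these methods matches without depending on a live site, they can be run against a small inline document. This is only a sketch: the markup and the class name `LookupMethodsSketch` are invented for the example, and the method simply reports how many elements each lookup finds.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LookupMethodsSketch {
    // Hypothetical markup, loosely modelled on a list of posts.
    static final String HTML =
            "<div id=\"post_list\">"
            + "<div class=\"post_item\"><a href=\"/a\" target=\"_blank\">A</a></div>"
            + "<div class=\"post_item\"><a href=\"/b\">B</a></div>"
            + "<img src=\"x.png\" width=\"10\">"
            + "</div>";

    /** Returns the match counts for: id, tag, class, attribute name, attribute value. */
    static int[] counts(Document doc) {
        return new int[] {
            doc.getElementById("post_list") != null ? 1 : 0,            // by id (at most one element)
            doc.getElementsByTag("a").size(),                           // by tag name
            doc.getElementsByClass("post_item").size(),                 // by class name
            doc.getElementsByAttribute("width").size(),                 // by attribute name
            doc.getElementsByAttributeValue("target", "_blank").size()  // by attribute name and value
        };
    }

    public static void main(String[] args) {
        for (int count : counts(Jsoup.parse(HTML))) {
            System.out.println(count);
        }
    }
}
```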

/**
 * Finding DOM elements with Jsoup
 */
@Test
public void test2() throws IOException {
    Document doc = Jsoup.parse(getHtml("http://www.cnblogs.com/")); // parse the page into a Document
    Elements itemElements = doc.getElementsByClass("post_item"); // find elements by class name
    System.out.println("=post_item elements========");
    for (Element e : itemElements) {
        System.out.println(e.html()); // the inner HTML, including text
        System.out.println("\n");
    }
    // find elements by attribute name (id, class, type, and so on);
    // rarely used, because it is usually too broad to pin down the elements you want
    Elements widthElements = doc.getElementsByAttribute("width");
    System.out.println("=elements with a width attribute========");
    for (Element e : widthElements) {
        System.out.println(e.toString()); // toString() rather than html(): we want the element itself
    }
    Elements targetElements = doc.getElementsByAttributeValue("target", "_blank"); // find by attribute name and value
    System.out.println("=elements with target=_blank========");
    for (Element e : targetElements) {
        System.out.println(e.toString());
    }
}

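4. Finding DOM Elements with Selector Syntax

So far we have searched the DOM by tag name, id, and class name. These lookups alone do not cover real-world needs: we often have to collect elements that follow a pattern, such as links nested several levels deep inside repeated containers. This is where selectors come in. Jsoup supports a selector syntax similar to that of CSS and jQuery.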

/**
 * Selecting through a hierarchy of elements
 */
@Test
public void test3() throws IOException {
    Document doc = Jsoup.parse(getHtml("http://www.cnblogs.com/")); // parse the page into a Document
    // select all blog post links; the selector narrows from the outer container to the inner link
    Elements linkElements = doc.select("#post_list .post_item .post_item_body h3 a");
    for (Element e : linkElements) {
        System.out.println("Post title: " + e.text()); // the text of the link
    }
    System.out.println("--------------------a elements with an href attribute--------------------------------");
    Elements hrefElements = doc.select("a[href]"); // a elements that have an href attribute
    for (Element e : hrefElements) {
        System.out.println(e.toString());
    }
    System.out.println("------------------------images with a .png extension----------------------------");
    Elements imgElements = doc.select("img[src$=.png]"); // img elements whose src ends with .png
    for (Element e : imgElements) {
        System.out.println(e.toString());
    }
    System.out.println("------------------------the first element----------------------------");
    Element element = doc.getElementsByTag("title").first(); // the first element whose tag is title
    String title = element.text(); // the element's text
    System.out.println("The page title is: " + title);
}
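The three selectors used in test3 can also be tried offline. The sketch below runs them against a small inline document whose structure imitates the selector path above; the markup and the class name `SelectorSketch` are assumptions for illustration and are not taken from the real site.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorSketch {
    // Hypothetical markup that mirrors the nesting the selector expects.
    static final String HTML =
            "<div id=\"post_list\">"
            + "<div class=\"post_item\"><div class=\"post_item_body\">"
            + "<h3><a href=\"/p/1.html\">First post</a></h3></div></div>"
            + "<div class=\"post_item\"><div class=\"post_item_body\">"
            + "<h3><a href=\"/p/2.html\">Second post</a></h3></div></div>"
            + "</div>"
            + "<a name=\"top\">anchor without href</a>"
            + "<img src=\"logo.png\"><img src=\"photo.jpg\">";

    /** Texts of the links reached through the nested descendant selector. */
    static List<String> postTitles(Document doc) {
        return doc.select("#post_list .post_item .post_item_body h3 a").eachText();
    }

    /** Number of a elements that carry an href attribute. */
    static int linksWithHref(Document doc) {
        return doc.select("a[href]").size();
    }

    /** Number of img elements whose src attribute ends with .png. */
    static int pngImages(Document doc) {
        return doc.select("img[src$=.png]").size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(postTitles(doc));
        System.out.println(linksWithHref(doc));
        System.out.println(pngImages(doc));
    }
}
```

Note that the anchor without an href is skipped by `a[href]`, and the .jpg image is skipped by `img[src$=.png]`.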

5. Getting DOM Element Attribute Values

/**
 * Getting DOM element attribute values
 */
@Test
public void test4() throws IOException {
    Document doc = Jsoup.parse(getHtml("http://www.cnblogs.com/")); // parse the page into a Document
    Elements linkElements = doc.select("#post_list .post_item .post_item_body h3 a"); // select all blog post links
    for (Element e : linkElements) {
        System.out.println("Post title: " + e.text()); // all the text inside the link
        System.out.println("Post URL: " + e.attr("href"));
        System.out.println("target: " + e.attr("target"));
    }
    System.out.println("------------------------friend links----------------------------");
    Element linkElement = doc.select("#friend_link").first();
    System.out.println("Plain text: " + linkElement.text()); // HTML tags removed
    System.out.println("------------------------Html----------------------------");
    System.out.println("Html: " + linkElement.html());
}
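One detail worth knowing when reading attributes: `attr` returns an empty string, not null, when the attribute is absent, and `hasAttr` tells the two cases apart. The following sketch shows this on two made-up links; the helper `describe` and its output format are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AttributeSketch {
    /** Formats a link as "text -> href [target=...]", marking a missing target explicitly. */
    static String describe(Element link) {
        // attr() would return "" for a missing attribute, so check with hasAttr() first
        String target = link.hasAttr("target") ? link.attr("target") : "(none)";
        return link.text() + " -> " + link.attr("href") + " [target=" + target + "]";
    }

    public static void main(String[] args) {
        // Hypothetical markup: the second link has no target attribute.
        Document doc = Jsoup.parse(
                "<a href=\"/p/1.html\" target=\"_blank\">First</a>"
                + "<a href=\"/p/2.html\">Second</a>");
        for (Element link : doc.select("a")) {
            System.out.println(describe(link));
        }
    }
}
```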

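Note: the Element methods that return content differ as follows

text()       returns only the text content, with all HTML tags removed

toString()   returns the element itself as HTML, including its own tag

html()       returns the element's inner HTML, including text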
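The difference is easiest to see on a single element. The sketch below applies all three methods, plus `outerHtml()`, to one link in an inline snippet; the markup and the class name `ContentMethodsSketch` are chosen only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ContentMethodsSketch {
    /** Parses a small fixed snippet and returns its only link element. */
    static Element sampleLink() {
        Document doc = Jsoup.parse("<p>An <a href='http://example.com/'><b>example</b></a> link.</p>");
        return doc.select("a").first();
    }

    public static void main(String[] args) {
        Element link = sampleLink();
        System.out.println(link.text());      // text only, tags removed
        System.out.println(link.html());      // inner HTML of the link
        System.out.println(link.outerHtml()); // the link itself, including its own tag
        System.out.println(link.toString());  // same result as outerHtml()
    }
}
```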

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;

import java.io.IOException;

public class Main {
    // getHtml(String) and the @Test methods test() through test5(),
    // as listed in the sections above, belong inside this class
}
