
Teaching Yourself Python Web Scraping, Part 2: Parsing Web Pages with BeautifulSoup

Published: 2023/12/10 · python · 豆豆


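Having learned requests and a few disguise techniques, we can finally fetch reasonably normal page source (HTML documents). But one last step, and the most important one, still stands between us and the result: filtering. It is like panning for gold in silt: without the right sieve, you either let the valuable bits slip through or waste effort sifting out the junk as well.
A gold panner studies the soil and builds a sieve to match. In scraping terms, that means inspecting the HTML and building a filter tailored to it.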

A Brief Look at HTML

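All the information is in the page source, and the browser renders what we see by parsing that source. So shouldn't we learn to read the source too? Yes.

But don't panic, this is not an HTML syntax class. For scraping, you only need a rough idea of how HTML works and how tags relate to each other. It is as simple as recognizing relatives: if you can read a family tree, you will have no trouble.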

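A sample HTML document is enough to explain all the node relationships.

Things in angle brackets, such as <head> and <title>, are called tags (or nodes), and they come in pairs. head and title are the tag names, and strings can be placed between an opening tag and its closing tag.
Tags can have attributes, which live inside the angle brackets. For example, the title tag has an attribute named lang whose value is "en".
If node A is wrapped by node B, then A is a child of B, and B is the parent of A. For example, book and title are both children of bookstore, but only book is a direct child of bookstore (a single level of nesting).
Tags that share the same direct parent are siblings. For example, title, author, year, and price are siblings of one another.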
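
The relationships described above can be checked directly in code. This is a minimal sketch, assuming a bookstore document shaped the way the text describes it. The tag names and the lang attribute come from the text; the text values and the variable names (sample, doc, title) are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: tag names follow the text, the values are invented.
sample = (
    '<bookstore><book>'
    '<title lang="en">Some Title</title>'
    '<author>Some Author</author>'
    '<year>2005</year>'
    '<price>29.99</price>'
    '</book></bookstore>'
)
doc = BeautifulSoup(sample, 'html.parser')

title = doc.find('title')
lang = title['lang']                                  # attribute value
parent_name = title.parent.name                       # direct parent
ancestors = [p.name for p in title.parents]           # every enclosing node
siblings = [s.name for s in title.next_siblings]      # siblings after title

print(lang, parent_name, ancestors, siblings)
```

The ancestor list also ends with the BeautifulSoup object itself, which sits above bookstore as the root of the tree.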

Now prepare the following code, which we will use to practice extracting content:

from bs4 import BeautifulSoup

# Sample markup used to practice extracting content
html = '''
<html>
<head><title>The Dormouse's story</title></head>
<body>
<h1><b>123456</b></h1>
<p class="title" name="dromouse"><b>The Dormouse's story</b>aaaaa </p>
<p class="title" name="dromouse" title='new'><b>The Dormouse's story</b>a</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;<a href="http://example.com/tillie" class="siterr" id="link4">Tillie</a>;
<a href="http://example.com/tillie" class="siterr" id="link5">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
<ul id="ulone"><li>01</li><li>02</li><li>03</li><li>04</li><li>05</li>
</ul>
<div class='div11'><ul id="ultwo"><li>0001</li><li>0002</li><li>0003</li><li>0004</li><li>0005</li></ul>
</div>
</body>
</html>
'''

1. Create the BeautifulSoup object

soup = BeautifulSoup(html, 'html.parser')  # choose the parser

2. Extract content

(1) Get the title object

print(soup.title)

Get the title text as a string:

print(soup.title.string)  # a string when the tag has exactly one child, otherwise None
print(soup.title.text)
print(soup.title.get_text())
title = soup.find('title').get_text()
print(title)

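Get objects through parent and child relationships

print(soup.title.parent)
print(soup.title.contents)        # list of direct children; tags have no .child attribute
print(list(soup.title.children))  # .children is an iterator, so wrap it in list() to print it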
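
To make the navigation attributes more concrete, here is a small sketch on a cut-down page. The markup and the names page, tree, and title_tag are illustrative assumptions, not part of the sample above.

```python
from bs4 import BeautifulSoup

# A cut-down page, just enough to show parent and child navigation.
page = "<html><head><title>The Dormouse's story</title></head><body><p>text</p></body></html>"
tree = BeautifulSoup(page, 'html.parser')

title_tag = tree.title
parent_name = title_tag.parent.name                    # the tag that wraps title
title_children = title_tag.contents                    # direct children as a list
html_children = [c.name for c in tree.html.children]   # iterate the children generator

print(parent_name, title_children, html_children)
```

.contents gives a list you can index, while .children yields the same nodes lazily, which is why printing it directly only shows an iterator object.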

(2) Get the first p tag

print(soup.p.get_text())
print(soup.find('p').text)
# Get the children of p (whitespace-only text nodes count as children too)
print(list(soup.p.children))
for i, child in enumerate(soup.p.children):
    print(i, child)

(3) Get a tag's attributes

# soup.a only finds the first a tag
print('1', soup.a)
print('2', soup.a.name)
# Read attributes through .attrs
print(soup.a.attrs)
print(soup.a.attrs['href'])
print(soup.a.attrs['id'])
print(soup.a.attrs['class'][0])  # class comes back as a list, so index into it
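
Indexing into .attrs raises KeyError when an attribute is absent, which is common on real pages. Here is a hedged sketch of the safer lookup. The second class value "big" and the link text are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical link; the "big" class and the text "Elsie" are made up.
link_html = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
a = BeautifulSoup(link_html, 'html.parser').a

href = a['href']            # shorthand for a.attrs['href']
missing = a.get('title')    # None instead of KeyError when the attribute is absent
classes = a['class']        # multi-valued attribute: a list of class names
link_id = a['id']           # single-valued attribute: a plain string

print(href, missing, classes, link_id)
```

Using .get() keeps a scraping loop from crashing on the one tag that lacks the attribute you expected.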

(4) Get multiple tags

print(soup.find('p'))      # returns the first match only
print(soup.find_all('p'))  # returns a list of every p tag in soup

(5) Nested queries
find_all returns a list, so index into it to reach the element you want, then query that element again.

print(soup.find_all('ul'))
print(soup.find_all('ul')[0].find_all('li'))
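
Nested queries are usually combined with a loop to pull the text out of each inner tag. The following is a sketch on a shortened version of the two lists. The names lists_html, lists, uls, first_items, and all_items are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened version of the two lists from the sample markup.
lists_html = '''
<ul id="ulone"><li>01</li><li>02</li></ul>
<div><ul id="ultwo"><li>0001</li><li>0002</li></ul></div>
'''
lists = BeautifulSoup(lists_html, 'html.parser')

uls = lists.find_all('ul')
# Query inside the first ul only
first_items = [li.get_text() for li in uls[0].find_all('li')]
# Query inside every ul, keyed by its id attribute
all_items = {ul['id']: [li.get_text() for li in ul.find_all('li')] for ul in uls}

print(first_items)
print(all_items)
```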

(6) Get objects by a given attribute

print(soup.find(id='ulone'))            # a single object
print(soup.find('ul', id='ulone'))
print(soup.find_all('ul', id='ulone'))  # a list, so you can index into it

class is a reserved word in Python, so write class_ instead:

print('class1', soup.find_all('p', class_='title'))
print('class2', soup.find_all('p', attrs={'class': 'title'}))                  # more general
print('class3', soup.find_all('p', attrs={'class': 'title', 'title': 'new'}))  # multiple conditions
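
To see how the three forms differ in what they match, here is a small sketch with three paragraphs. The markup and the one-letter texts are invented so the results are easy to read.

```python
from bs4 import BeautifulSoup

# Three hypothetical paragraphs labelled A, B, C for easy comparison.
snippet = '<p class="title">A</p><p class="title" title="new">B</p><p class="story">C</p>'
s = BeautifulSoup(snippet, 'html.parser')

by_keyword = [p.get_text() for p in s.find_all('p', class_='title')]
by_attrs = [p.get_text() for p in s.find_all('p', attrs={'class': 'title'})]
both = [p.get_text() for p in s.find_all('p', attrs={'class': 'title', 'title': 'new'})]

print(by_keyword, by_attrs, both)
```

The attrs dictionary is the more general form because it also works for attribute names that are not valid Python keyword arguments.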

Use a function as the argument to select elements

def judgeTilte1(t):
    # t is the value of the class attribute being tested
    return t == 'title'

print(soup.find_all(class_=judgeTilte1))

Check the length as well

import re  # regular expressions

reg = re.compile("sis")

def judgeTilte2(t):
    # True when t is 6 characters long and contains 'sis'
    return t is not None and len(t) == 6 and reg.search(t) is not None

print(soup.find_all(class_=judgeTilte2))
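
A function filter on an attribute receives the attribute value rather than the tag, and it receives None for tags that lack the attribute. This is a minimal sketch of a filter that guards against that case. The function name has_sis_class and the three links are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Three hypothetical links: one matching class, one near miss, one without a class.
links = '<a class="sister">x</a><a class="siterr">y</a><a>z</a>'
pattern = re.compile('sis')

def has_sis_class(value):
    # value is the class string, or None when the tag has no class attribute
    return value is not None and len(value) == 6 and pattern.search(value) is not None

found = BeautifulSoup(links, 'html.parser').find_all(class_=has_sis_class)
matches = [tag.get_text() for tag in found]
print(matches)
```

Without the None check, a call such as pattern.search(value) would raise TypeError on the first tag that has no class.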

Get the text content

# <p class="title" name="dromouse">
#     <b>The Dormouse's story</b>
#     aaaaa
# </p>
print(soup.find('p').text)
print(soup.find('p').string)  # None here: .string only works when the tag has a single child
print(soup.find('p').get_text())
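
The difference between .string and get_text() is easy to get wrong, so here is a short sketch on a paragraph with the same shape as the first p tag. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Same shape as the first p tag in the sample markup.
snippet = '<p class="title"><b>The Dormouse\'s story</b>aaaaa </p>'
p_tag = BeautifulSoup(snippet, 'html.parser').p

whole = p_tag.string                     # None: p has two children, a b tag and a text node
inner = p_tag.b.string                   # works: b has exactly one text child
joined = p_tag.get_text()                # all descendant text, concatenated
tidy = p_tag.get_text(' ', strip=True)   # separator between pieces, each piece stripped

print(whole, inner, joined, tidy, sep='\n')
```

In practice get_text() is the safer default, and .string is only useful when you know the tag holds a single piece of text.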

find_all also accepts limit, which caps the number of results returned

print(soup.find_all('a',limit=2))
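
A quick sketch of limit on three bare links follows. The markup is a stripped-down assumption based on the ids in the sample.

```python
from bs4 import BeautifulSoup

# Three empty links, identified only by id.
anchors = '<a id="link1"></a><a id="link2"></a><a id="link3"></a>'
s = BeautifulSoup(anchors, 'html.parser')

first_two = [a['id'] for a in s.find_all('a', limit=2)]   # stops after two matches
everything = [a['id'] for a in s.find_all('a')]           # no limit: every match

print(first_two, everything)
```

Because the search stops as soon as the limit is reached, this can save time on large documents when only the first few matches matter.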

recursive=True (the default) searches all descendants; recursive=False searches direct children only

print(soup.find_all('body')[0].find_all('ul', recursive=False))
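
The effect of recursive is clearest when both settings run on the same node. This is a sketch assuming a body with one direct ul and one ul nested inside a div, mirroring the sample's layout.

```python
from bs4 import BeautifulSoup

# One ul directly under body, one nested a level deeper inside a div.
body_html = '<body><ul id="ulone"></ul><div><ul id="ultwo"></ul></div></body>'
body = BeautifulSoup(body_html, 'html.parser').body

direct_only = [ul['id'] for ul in body.find_all('ul', recursive=False)]
all_levels = [ul['id'] for ul in body.find_all('ul')]

print(direct_only, all_levels)
```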
