beautifulsoup4
Environment:
- Python 3.6
- Windows
- PyCharm 2017.2.4
Installation:
    # Install beautifulsoup4
    pip install beautifulsoup4

    # Install a parser
    pip install lxml

    # Another parser option is html5lib, a pure-Python implementation
    # that parses pages the same way a browser does
    pip install html5lib
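To confirm which of the parsers installed above are actually usable, something like the following sketch can help. The helper name `available_parsers` is hypothetical (it is not part of bs4); it relies on `BeautifulSoup` raising `FeatureNotFound` when a requested parser is missing, and it also probes Python's built-in `html.parser`, which the original text does not mention.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def available_parsers(candidates=("lxml", "html5lib", "html.parser")):
    """Return the parser names from `candidates` that BeautifulSoup can load.

    Hypothetical helper for illustration; not part of the bs4 API.
    """
    found = []
    for name in candidates:
        try:
            BeautifulSoup("<p>probe</p>", name)
        except FeatureNotFound:
            # Raised when the requested parser library is not installed.
            continue
        found.append(name)
    return found


parsers = available_parsers()
print(parsers)
```

`html.parser` ships with Python, so it should always appear in the result; `lxml` and `html5lib` appear only after the `pip install` steps above have succeeded.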
Basic usage

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    # Basic usage: fault tolerance. Even when the HTML is incomplete
    # (the closing </body> and </html> tags are missing here), the module
    # can still recognize the problem and parse the document.
    # Parsing the markup above yields a BeautifulSoup object, which can be
    # printed as a structure with standard indentation.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_doc, 'lxml')  # tolerant of malformed markup
    res = soup.prettify()  # fix up the indentation for structured display
    print(res)
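The fault tolerance described above can be seen on a much smaller input. The sketch below is an illustration under assumptions, not the author's code: it uses Python's built-in `html.parser` instead of `lxml` so that it runs without extra packages, and the variable names are made up for the example. Different parsers repair broken markup differently (for instance, `lxml` additionally wraps the result in `html` and `body` tags).

```python
from bs4 import BeautifulSoup

# Neither <p> nor <b> is closed in this fragment.
broken = "<p>unclosed paragraph <b>bold text"

soup = BeautifulSoup(broken, "html.parser")

# Serializing the tree shows the closing tags that were added during parsing.
repaired = str(soup)
print(repaired)

# The repaired tree can be navigated like any other.
print(soup.b.string)
```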
Tag selectors
Select an element directly by its tag name. This is fast, but if several tags share the same name, only the first one is returned.
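A minimal sketch of the first-match rule, which also shows how attributes of the selected tag can be read. The markup and variable names are made up for illustration, and the built-in `html.parser` is used only so the snippet has no extra dependencies.

```python
from bs4 import BeautifulSoup

html = """
<p>first</p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute-style access returns only the first matching tag.
first_link = soup.a

print(first_link["id"])       # dictionary-style access to one attribute
print(first_link.get("href"))  # .get() returns None if the attribute is absent
print(first_link["class"])    # class can hold several values, so it is a list
print(first_link.attrs)       # all attributes as a dict

# A tag name that does not occur in the document yields None, not an error.
print(soup.table)
```

Because a missing tag yields `None`, chained access such as `soup.table.tr` fails with `AttributeError`; checking for `None` first avoids that.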
    from bs4 import BeautifulSoup

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body><p>first tag</p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie<i>this i tag</i></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, 'lxml')

    # Get a tag
    print(soup.head)  # <head><title>The Dormouse's story</title></head>

    # Get the tag's name
    print(soup.p.name)  # p

    # Access a tag directly; if several tags share the name, only the first is returned
    print(soup.p)  # <p>first tag</p>

    # Get the tag's content
    print(soup.p.string)  # first tag
    print(soup.a.string)  # None
    print(soup.p.text)  # first tag
    print(soup.a.text)  # Elsiethis i tag
    print(soup.a.contents)  # ['Elsie', <i>this i tag</i>]

    # Note:
    # .contents returns every direct child of the selected tag, including child tags.
    # .string returns the text only when the tag has a single child; when the tag
    #   has several children (as the first <a> does here), it returns None.
    # .text returns all the text, including the text of child tags.

    # Nested selection
    print(soup.head.title.string)  # The Dormouse's story
    print(soup.body.a.contents)  # ['Elsie', <i>this i tag</i>]
    print(soup.body.a.text)  # Elsiethis i tag
    print(soup.body.a.string)  # None
    print(soup.body.p.string)  # first tag

    # Children and descendants
    print(soup.contents)  # the top-level nodes of the whole HTML document
    print(soup.p.contents)  # ['first tag']
    print(soup.p.children)  # an iterator over all direct children of this tag
    print(list(soup.a.children))  # ['Elsie', <i>this i tag</i>]
    print(soup.p.descendants)  # a generator object
    print(list(soup.a.descendants))  # every node under the a tag: ['Elsie', <i>this i tag</i>, 'this i tag']
    for i, child in enumerate(soup.p.descendants):
        print(i, child)  # 0 first tag

    # Parent and ancestors
    print(soup.a.parent)  # the parent of the a tag (the enclosing p tag)
    print(soup.a.parents)  # a generator object
    print(list(soup.a.parents))  # every ancestor of the a tag, up through the html node

    # Siblings
    print(soup.a.next_siblings)  # a generator object
    print(list(soup.a.next_siblings))

Standard selectors
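As a short orientation for this section, the sketch below contrasts `find()` (first match, or `None`) with `find_all()` (a list, possibly empty). It also passes a compiled regular expression as a filter, which Beautiful Soup accepts wherever a string filter is allowed; this is useful when partial rather than exact matching is wanted. The markup and variable names are illustrative, `html.parser` is assumed only to avoid extra dependencies, and the `string=` argument assumes Beautiful Soup 4.4.0 or later (older versions call it `text=`).

```python
import re

from bs4 import BeautifulSoup

html = """
<p>first tag</p>
<p class="story">Once upon a time
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> and
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")              # a single Tag
all_links = soup.find_all("a")      # a list of Tags

missing_one = soup.find("table")      # None when nothing matches
missing_all = soup.find_all("table")  # empty list when nothing matches

# A compiled regular expression can filter attribute values...
ids = [a["id"] for a in soup.find_all("a", id=re.compile(r"^link\d+$"))]

# ...or text nodes, giving a partial ("contains") match instead of an exact one.
partial = soup.find_all(string=re.compile("first"))

print(first["id"], len(all_links), missing_one, missing_all, ids, partial)
```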
    from bs4 import BeautifulSoup

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body><p>first tag</p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie<i id="i1" class="i1">this i tag</i></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    soup = BeautifulSoup(html_doc, 'lxml')

    # Standard selectors

    # Find by tag name
    print(soup.find_all('a'))  # all a tags
    print(soup.find_all('a', id='link2'))  # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
    print(soup.find(id='link2'))  # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    print(soup.find_all(attrs={"class": "sister"}))  # all tags whose class is sister (here, the a tags)
    print(soup.find_all(class_='sister'))  # same result: all tags whose class is sister
    # Note: the keyword argument is class_ with a trailing underscore, because
    # class is a reserved word in Python. Inside attrs the plain key "class" is fine.

    # Nested search
    print(soup.find_all('a')[0].find('i'))  # the i tag inside the first a tag: <i class="i1" id="i1">this i tag</i>

    # Find by attribute
    print(soup.a.find_all(attrs={'id': 'i1'}))  # [<i class="i1" id="i1">this i tag</i>]
    print(soup.a.find_all(attrs={"class": 'i1'}))  # [<i class="i1" id="i1">this i tag</i>]
    print(soup.find_all(id='i1'))  # [<i class="i1" id="i1">this i tag</i>]

    # Find by text content. The match is exact, not fuzzy: it tests ==, not in.
    # (Newer Beautiful Soup versions name this argument string= instead of text=.)
    print(soup.p.find_all(text='first tag'))  # ['first tag']

CSS selectors
    # The module provides a select() method that supports CSS selectors
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title">
        <b>The Dormouse's story</b>
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <span>Elsie</span>
        </a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        <div class='panel-1'>
            <ul class='list' id='list-1'>
                <li class='element'>Foo</li>
                <li class='element'>Bar</li>
                <li class='element'>Jay</li>
            </ul>
            <ul class='list list-small' id='list-2'>
                <li class='element'><h1 class='yyyy'>Foo</h1></li>
                <li class='element xxx'>Bar</li>
                <li class='element'>Jay</li>
            </ul>
        </div>
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_doc, 'lxml')

    # 1. CSS selectors
    print(soup.p.select('.sister'))
    print(soup.select('.sister span'))

    print(soup.select('#link1'))
    print(soup.select('#link1 span'))

    print(soup.select('#list-2 .element.xxx'))

    # select() calls can be chained, but that is unnecessary; a single select() is enough
    print(soup.select('#list-2')[0].select('.element'))

    # 2. Get attributes
    print(soup.select('#list-2 h1')[0].attrs)

    # 3. Get content
    print(soup.select('#list-2 h1')[0].get_text())
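To tie the last two sections together, here is a sketch that extracts the same link data once with a CSS selector and once with `find_all()`. The helper `link_pairs` and the sample markup are hypothetical, written for this illustration; `html.parser` is assumed only to keep the snippet dependency-free, and `select()` assumes a Beautiful Soup version that supports these selectors (4.7 and later delegate them to the `soupsieve` package).

```python
from bs4 import BeautifulSoup

html = """
<div id="links">
  <a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  <a class="sister" id="link3">Tillie</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def link_pairs(tags):
    """Return (text, href) pairs; href is None when the attribute is missing.

    Hypothetical helper for illustration; not part of the bs4 API.
    """
    return [(tag.get_text(strip=True), tag.get("href")) for tag in tags]


# The same elements, selected in two different ways.
via_select = link_pairs(soup.select("#links a.sister"))
via_find_all = link_pairs(soup.find_all("a", class_="sister"))

# select_one() returns the first match, or None when nothing matches.
first_span = soup.select_one("#link1 span")

print(via_select)
print(via_select == via_find_all)
print(first_span.get_text())
```

Which style to use is mostly a matter of taste: `select()` is compact for nested conditions such as `#list-2 .element.xxx`, while `find_all()` takes Python-side filters such as regular expressions.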
Reposted from: https://www.cnblogs.com/q240756200/p/10671952.html
