Python: Modifying the Document Tree with Beautiful Soup

Published: 2023/12/20 python 30 豆豆

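Modifying the tree
Beautiful Soup's main strength is searching the parse tree, but it also makes modifying the tree straightforward.

Changing tag names and attributes
This was covered in the Attributes section, but it bears repeating. You can rename a tag, change the value of an attribute, and add or delete attributes:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>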
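
The renaming and attribute edits described above can be reproduced as a small self-contained script. This is a minimal sketch that assumes the built-in "html.parser" backend; the variable names renamed and stripped are only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

# Rename the tag and set attributes as if the tag were a dictionary.
tag.name = "blockquote"
tag["class"] = "verybold"
tag["id"] = "1"
renamed = str(tag)

# Deleting a key removes the attribute from the serialized tag.
del tag["class"]
del tag["id"]
stripped = str(tag)

print(renamed)
print(stripped)
```

Attributes behave like dictionary entries, so the usual `in`, `get()` and `del` idioms apply.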

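Modifying .string
If you assign to a tag's .string attribute, the tag's contents are replaced with the string you give:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')

tag = soup.a
tag.string = "New link text."
tag
# <a href="http://example.com/">New link text.</a>

Note: if the tag contains other tags, assigning to its .string attribute overwrites all of its existing contents, including those child tags.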
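
To make the warning concrete, the following sketch (assuming the "html.parser" backend) checks that the nested `<i>` tag is gone after the assignment.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

had_child = tag.find("i") is not None   # the <i> child exists before the assignment
tag.string = "New link text."
has_child = tag.find("i") is not None   # ...and is destroyed afterwards

print(had_child, has_child)
print(tag)
```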

append()
Tag.append() adds content to a tag. It works just like calling .append() on a Python list:

soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
soup.a.append("Bar")

soup
# <a>FooBar</a>
soup.a.contents
# ['Foo', 'Bar']

BeautifulSoup.new_string() and .new_tag()
If you need to add a string to a document, you can pass a Python string to append(), or call the factory method BeautifulSoup.new_string():

soup = BeautifulSoup("<b></b>", 'html.parser')
tag = soup.b
tag.append("Hello")
new_string = soup.new_string(" there")
tag.append(new_string)
tag
# <b>Hello there</b>
tag.contents
# ['Hello', ' there']

If you want to create a comment, or any other subclass of NavigableString, pass that class as the second argument to new_string():

from bs4 import Comment
new_comment = soup.new_string("Nice to see you.", Comment)
tag.append(new_comment)
tag
# <b>Hello there<!--Nice to see you.--></b>
tag.contents
# ['Hello', ' there', 'Nice to see you.']

This feature is new in Beautiful Soup 4.2.1.

The best way to create a whole new tag is to call the factory method BeautifulSoup.new_tag():

soup = BeautifulSoup("<b></b>", 'html.parser')
original_tag = soup.b

new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>

new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>

Only the first argument, the tag name, is required; the other arguments are optional.
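
The two factory methods can be combined in one pass. The sketch below assumes the "html.parser" backend and builds a text node, a comment and a new tag under a single parent; the name link is illustrative.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<b></b>", "html.parser")
tag = soup.b

tag.append("Hello")                                       # plain Python string
tag.append(soup.new_string(" there"))                     # NavigableString from the factory
tag.append(soup.new_string("Nice to see you.", Comment))  # Comment subclass

link = soup.new_tag("a", href="http://www.example.com")   # only the name is required
link.string = "Link text."
tag.append(link)

result = str(tag)
print(result)
```

Creating nodes through the soup object, rather than instantiating the classes directly, keeps them associated with the same document.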

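insert()
Tag.insert() is like Tag.append(), except that the new element does not necessarily go at the end of its parent's .contents. It is inserted at whatever numeric position you specify, just like .insert() on a Python list:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a

tag.insert(1, "but did not endorse ")
tag
# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
tag.contents
# ['I linked to ', 'but did not endorse ', <i>example.com</i>]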
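
As a quick check of the positional behaviour, this sketch (assuming the "html.parser" backend) inserts at index 1 and inspects where the new string lands in .contents.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# Index 1 sits between the leading text node and the <i> tag.
tag.insert(1, "but did not endorse ")

positions = [str(child) for child in tag.contents]
print(positions)
```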

insert_before() and insert_after()
insert_before() inserts a tag or string immediately before the current tag or text node:

soup = BeautifulSoup("<b>stop</b>", 'html.parser')
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
soup.b
# <b><i>Don't</i>stop</b>

insert_after() inserts a tag or string immediately after the current tag or text node:

soup.b.i.insert_after(soup.new_string(" ever "))
soup.b
# <b><i>Don't</i> ever stop</b>
soup.b.contents
# [<i>Don't</i>, ' ever ', 'stop']
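
Both calls are relative to a sibling rather than to a numeric index. A minimal sketch, assuming the "html.parser" backend:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>stop</b>", "html.parser")

emphasis = soup.new_tag("i")
emphasis.string = "Don't"

# Place the new <i> tag before the existing text node...
soup.b.string.insert_before(emphasis)
# ...then place a new text node right after the <i> tag.
soup.b.i.insert_after(soup.new_string(" ever "))

sentence = str(soup.b)
print(sentence)
```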

clear()
Tag.clear() removes the contents of a tag:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a

tag.clear()
tag
# <a href="http://example.com/"></a>

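extract()
PageElement.extract() removes a tag or string from the tree and returns the element that was removed:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a

i_tag = soup.i.extract()

a_tag
# <a href="http://example.com/">I linked to </a>

i_tag
# <i>example.com</i>

print(i_tag.parent)
# None

At this point you effectively have two document trees: one rooted at the BeautifulSoup object used to parse the original document, and one rooted at the tag that was extracted. You can go on to call extract() on a child of the extracted element:

my_string = i_tag.string.extract()
my_string
# 'example.com'

print(my_string.parent)
# None

i_tag
# <i></i>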
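
The "two trees" idea can be verified directly: after extraction the removed tag has no parent and the original tree no longer contains it. A minimal sketch, assuming the "html.parser" backend:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a

i_tag = soup.i.extract()          # detach <i> and keep a reference to it

remaining = str(a_tag)            # what is left in the original tree
detached = str(i_tag)             # the root of the second tree
is_orphan = i_tag.parent is None

print(remaining, detached, is_orphan)
```

Because extract() returns the element intact, it can be re-attached elsewhere with append() or insert().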

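decompose()
Tag.decompose() removes a tag from the tree, then completely destroys it and its contents:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a

soup.i.decompose()

a_tag
# <a href="http://example.com/">I linked to </a>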

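replace_with()
PageElement.replace_with() removes a tag or string from the tree and replaces it with the tag or string of your choice:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a

new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)

a_tag
# <a href="http://example.com/">I linked to <b>example.net</b></a>

replace_with() returns the tag or string that was replaced, so you can examine it or add it back to another part of the tree.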
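
The return value is easy to overlook, so this sketch (assuming the "html.parser" backend) keeps it and inspects it; the name old_tag is illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a

new_tag = soup.new_tag("b")
new_tag.string = "example.net"

# replace_with() hands back the element that was swapped out.
old_tag = a_tag.i.replace_with(new_tag)

updated = str(a_tag)
replaced = str(old_tag)
print(updated)
print(replaced)
```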

wrap()
PageElement.wrap() wraps an element in the tag you specify [8] and returns the new wrapper:

soup = BeautifulSoup("<p>I wish I was bold.</p>", 'html.parser')
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>

soup.p.wrap(soup.new_tag("div"))
# <div><p><b>I wish I was bold.</b></p></div>

This method is new in Beautiful Soup 4.0.5.

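unwrap()
Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever is inside that tag, which makes it useful for stripping out markup:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a

a_tag.i.unwrap()
a_tag
# <a href="http://example.com/">I linked to example.com</a>

Like replace_with(), unwrap() returns the tag that was removed.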
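
wrap() and unwrap() are inverses, which the following round trip illustrates. It is a sketch under the assumption that the "html.parser" backend is used.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")

# Wrap the text node of <p> in a new <b> tag.
soup.p.string.wrap(soup.new_tag("b"))
wrapped = str(soup.p)

# Unwrap the <b> again; the emptied tag is returned.
removed = soup.p.b.unwrap()
unwrapped = str(soup.p)

print(wrapped)
print(unwrapped)
print(removed.name)
```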

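Output
Pretty-printing
The prettify() method turns a Beautiful Soup parse tree into a nicely formatted Unicode string, with each HTML/XML tag on its own line:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
soup.prettify()
# '<a href="http://example.com/">\n I linked to\n <i>\n...'

print(soup.prettify())
# <a href="http://example.com/">
#  I linked to
#  <i>
#   example.com
#  </i>
# </a>

You can call prettify() on the top-level BeautifulSoup object or on any of its Tag objects:

print(soup.a.prettify())
# <a href="http://example.com/">
#  I linked to
#  <i>
#   example.com
#  </i>
# </a>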

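Non-pretty printing
If you just want a string and do not care about formatting, call str() on a BeautifulSoup object or on a Tag within it:

str(soup)
# '<a href="http://example.com/">I linked to <i>example.com</i></a>'

str(soup.a)
# '<a href="http://example.com/">I linked to <i>example.com</i></a>'

In Python 3, str() returns a Unicode string. (In Python 2, str() returned a UTF-8 encoded bytestring and unicode() returned the Unicode form.) See the encoding sections for other options.

You can also call encode() to get a bytestring, or decode() to get a Unicode string.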

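Output formatters
If you give Beautiful Soup a document that contains HTML entities such as “&ldquo;”, they are converted to Unicode characters:

soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", 'html.parser')
str(soup)
# '“Dammit!” he said.'

If you then convert the document to a bytestring, the Unicode characters are encoded as UTF-8. You do not get the HTML entities back:

soup.encode("utf8")
# b'\xe2\x80\x9cDammit!\xe2\x80\x9d he said.'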

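get_text()
If you only want the human-readable text inside a document or tag, call get_text(). It returns all the text in a document or beneath a tag, including text inside descendant tags, as a single Unicode string:

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

soup.get_text()
# '\nI linked to example.com\n'
soup.i.get_text()
# 'example.com'

You can specify a string to be used to join the bits of text together:

soup.get_text("|")
# '\nI linked to |example.com|\n'

You can also tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

soup.get_text("|", strip=True)
# 'I linked to|example.com'

Alternatively, use the .stripped_strings generator and process the list of strings yourself:

[text for text in soup.stripped_strings]
# ['I linked to', 'example.com']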
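
The separator and strip options interact in a way that is easiest to see side by side. A minimal sketch, assuming the "html.parser" backend:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

plain = soup.get_text()                    # text nodes concatenated as-is
joined = soup.get_text("|")                # same nodes, joined with a separator
clean = soup.get_text("|", strip=True)     # each node stripped, empty ones dropped
pieces = list(soup.stripped_strings)       # the stripped nodes as a list

print(repr(plain), repr(joined), repr(clean), pieces)
```

With strip=True the whitespace-only node at the end disappears entirely, which is why the trailing separator is absent from the stripped result.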

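Specifying the parser to use
If you just need to parse some HTML, you can pass the markup to the BeautifulSoup constructor and it will usually work: Beautiful Soup picks a parser for you. You can also pass an argument that says which parser should be used for the current document.

The first argument to the BeautifulSoup constructor is the document to parse, as a string or an open filehandle. The second argument says how you would like the markup parsed. If the second argument is omitted, Beautiful Soup picks the best parser among the libraries installed on the system, in this order of priority: lxml, html5lib, Python's built-in parser. The order can be overridden by specifying one of the following:

What type of markup you want to parse: currently “html”, “xml”, and “html5” are supported.
The name of the parser library you want to use: currently “lxml”, “html5lib”, and “html.parser” are supported.
The section on installing a parser describes which parsers are available and how to install them.

If the requested parser is not installed, Beautiful Soup picks a different one. Right now lxml is the only parser that supports XML. If lxml is not installed, you cannot get a parsed XML document, whether or not you ask for lxml explicitly.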

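Differences between parsers
Beautiful Soup presents the same interface to a number of different parsers, but the parsers themselves differ. The same document can produce different parse trees under different parsers. The biggest differences are between the HTML parsers and the XML parsers. Here is a short document parsed as HTML:

BeautifulSoup("<a><b /></a>", "lxml")
# <html><body><a><b></b></a></body></html>

Since an empty <b /> tag is not valid HTML, the parser turns it into a <b></b> tag pair.

Here is the same document parsed as XML (running this requires lxml to be installed). Note that the empty-element tag <b /> is left alone, and that the document is given an XML declaration instead of being put inside an <html> tag:

BeautifulSoup("<a><b /></a>", "xml")
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>

There are also differences between the HTML parsers. If you give Beautiful Soup a perfectly formed HTML document, these differences do not matter: one parser is faster than another, but they all produce a data structure that looks exactly like the original document.

If the document is not perfectly formed, however, different parsers give different results. Here is a short, invalid document parsed with lxml's HTML parser. Note that the dangling </p> tag is simply ignored:

BeautifulSoup("<a></p>", "lxml")
# <html><body><a></a></body></html>

Here is the same document parsed with html5lib:

BeautifulSoup("<a></p>", "html5lib")
# <html><head></head><body><a><p></p></a></body></html>

Instead of ignoring the dangling </p> tag, html5lib pairs it with an opening <p> tag. It also adds an empty <head> tag to the document.

Here is the same document parsed with Python's built-in HTML parser:

BeautifulSoup("<a></p>", "html.parser")
# <a></a>

Like lxml [7], the built-in parser ignores the closing </p> tag. Unlike html5lib, it makes no attempt to create a well-formed HTML document or to wrap the fragment in a <body> tag. Unlike lxml, it does not even add an <html> tag.

Since the document “<a></p>” is invalid, none of these techniques is the single “correct” way to handle it. The html5lib parser uses techniques that are part of the HTML5 standard, so it has the best claim to being “correct”, but all three results are legitimate.

Differences between parsers can affect your script. If you plan to distribute code that uses BeautifulSoup to other people, it is best to state which parser you used, which avoids unnecessary trouble.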
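
One way to follow that advice is to name the parser in every constructor call. The sketch below uses the built-in "html.parser" so that it runs without optional dependencies; swapping in "lxml" or "html5lib" would require those libraries to be installed and would yield the different trees shown earlier.

```python
from bs4 import BeautifulSoup

# Naming the parser makes the result independent of which optional
# parsers (lxml, html5lib) happen to be installed on the machine.
soup = BeautifulSoup("<a></p>", "html.parser")
fragment = str(soup)

print(fragment)
```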

Encodings
Any HTML or XML document is written in a specific encoding such as ASCII or UTF-8. When you load that document into Beautiful Soup, however, it is converted to Unicode:

markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"
soup = BeautifulSoup(markup, 'html.parser')
soup.h1
# <h1>Sacré bleu!</h1>
soup.h1.string
# 'Sacré bleu!'

This is not magic (though it may look like it). Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document's encoding and convert it to Unicode. The autodetected encoding is available as the .original_encoding attribute of the BeautifulSoup object:

soup.original_encoding
# 'utf-8'

Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes. Sometimes it guesses correctly only after a byte-by-byte search of the whole document, which is slow. If you know a document's encoding ahead of time, you can avoid mistakes and delays by passing it to the BeautifulSoup constructor as the from_encoding argument.

Here is a document written in ISO-8859-8. The document is so short that Beautiful Soup misidentifies it as ISO-8859-7:

markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, 'html.parser')
soup.h1
# <h1>νεμω</h1>
soup.original_encoding
# 'ISO-8859-7'

You can fix this by passing the correct encoding as from_encoding:

soup = BeautifulSoup(markup, 'html.parser', from_encoding="iso-8859-8")
soup.h1
# <h1>םולש</h1>
soup.original_encoding
# 'iso8859-8'

In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode is to replace some characters with the special Unicode character “REPLACEMENT CHARACTER” (U+FFFD, �) [9]. If Beautiful Soup had to do this while guessing the encoding, it sets the .contains_replacement_characters attribute to True on the UnicodeDammit or BeautifulSoup object. This tells you that the Unicode representation is not an exact copy of the original and that some data was lost. If a document contains � but .contains_replacement_characters is False, the � was in the original document and does not stand in for lost data.
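
The from_encoding override can be checked against Python's own decoder. This is a sketch that assumes the "html.parser" backend; it compares the parsed heading with the same bytes decoded directly as ISO-8859-8 rather than relying on any particular spelling of .original_encoding.

```python
from bs4 import BeautifulSoup

markup = b"<h1>\xed\xe5\xec\xf9</h1>"

# Tell Beautiful Soup the encoding instead of letting it guess.
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")
heading = soup.h1.string

# Reference value: the same four bytes decoded by Python's codec.
expected = b"\xed\xe5\xec\xf9".decode("iso-8859-8")

print(heading == expected)
```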

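Output encoding
When you write out a document from Beautiful Soup, you get a UTF-8 document, whatever the encoding of the input. Here is a document written in the Latin-1 encoding:

markup = b'''
 <html>
  <head>
   <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
  </head>
  <body>
   <p>Sacr\xe9 bleu!</p>
  </body>
 </html>
'''

soup = BeautifulSoup(markup, 'html.parser')
print(soup.prettify())
# <html>
#  <head>
#   <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
#  </head>
#  <body>
#   <p>
#    Sacré bleu!
#   </p>
#  </body>
# </html>

Note that the encoding declared in the <meta> tag has been rewritten to match the UTF-8 output.

If you do not want UTF-8, you can pass an encoding into prettify():

print(soup.prettify("latin-1"))
# ...

You can also call encode() on the BeautifulSoup object, or on any element in the soup, just as if it were a Python string:

soup.p.encode("latin-1")
# b'<p>Sacr\xe9 bleu!</p>'

soup.p.encode("utf-8")
# b'<p>Sacr\xc3\xa9 bleu!</p>'

Any characters that cannot be represented in the chosen encoding are converted into numeric XML character references. Here is a document that includes the Unicode character SNOWMAN:

markup = "<b>\N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup, 'html.parser')
tag = snowman_soup.b

The SNOWMAN character can be part of a UTF-8 document (it looks like ☃), but some encodings, such as ISO-Latin-1 and ASCII, have no representation for it, so in those encodings it is converted into “&#9731;”:

print(tag.encode("utf-8"))
# b'<b>\xe2\x98\x83</b>'

print(tag.encode("latin-1"))
# b'<b>&#9731;</b>'

print(tag.encode("ascii"))
# b'<b>&#9731;</b>'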
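
The fallback to numeric character references can be confirmed with a short script. A minimal sketch, assuming the "html.parser" backend:

```python
from bs4 import BeautifulSoup

snowman_soup = BeautifulSoup("<b>\N{SNOWMAN}</b>", "html.parser")
tag = snowman_soup.b

as_utf8 = tag.encode("utf-8")     # SNOWMAN is representable: raw UTF-8 bytes
as_ascii = tag.encode("ascii")    # not representable: numeric reference instead

print(as_utf8)
print(as_ascii)
```

The reference number 9731 is simply the decimal code point of U+2603, so the character survives a round trip through an HTML parser.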

Unicode, Dammit
You can use Unicode, Dammit without using Beautiful Soup. It is useful whenever you have data in an unknown encoding and just want it to become Unicode:

from bs4 import UnicodeDammit
dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
print(dammit.unicode_markup)
# Sacré bleu!
dammit.original_encoding
# 'utf-8'

Unicode, Dammit's guesses become much more accurate if the chardet or cchardet library is installed. The more data you give it, the more accurate the guess. If you have your own suspicions about what the encoding might be, you can pass them in as a list, and those encodings are tried first:

dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
print(dammit.unicode_markup)
# Sacré bleu!
dammit.original_encoding
# 'latin-1'

Unicode, Dammit has two special features that Beautiful Soup itself does not use.
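
Passing candidate encodings is a small change at the call site. The sketch below passes the list positionally, as in the text above, and checks only the decoded result, since the exact spelling reported by .original_encoding is not asserted here.

```python
from bs4 import UnicodeDammit

raw = b"Sacr\xe9 bleu!"   # Latin-1 bytes; \xe9 alone is not valid UTF-8

# The candidate encodings are tried before any automatic guessing.
dammit = UnicodeDammit(raw, ["latin-1", "iso-8859-1"])
decoded = dammit.unicode_markup

print(decoded)
```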

Smart quotes
You can use Unicode, Dammit to convert Microsoft smart quotes [10] to HTML or XML entities:

markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
# '<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'

UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
# '<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'

You can also convert Microsoft smart quotes to ASCII quotes:

UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
# '<p>I just "love" Microsoft Word\'s smart quotes</p>'

This is a useful feature, but Beautiful Soup does not use it. By default, Beautiful Soup converts Microsoft smart quotes to Unicode characters along with everything else:

UnicodeDammit(markup, ["windows-1252"]).unicode_markup
# '<p>I just “love” Microsoft Word’s smart quotes</p>'

Inconsistent encodings
Sometimes a document is mostly in UTF-8 but also contains Windows-1252 characters such as Microsoft smart quotes [10]. This can happen when a website includes data from multiple sources. UnicodeDammit.detwingle() turns such a document into pure UTF-8. Here is a simple example:

snowmen = ("\N{SNOWMAN}" * 3)
quote = ("\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

This document is a mess. The snowmen are in UTF-8 and the quotes are in Windows-1252. You can display the snowmen or the quotes, but not both, because they use different encodings:

print(doc.decode("utf8", "replace"))
# ☃☃☃�I like snowmen!�

print(doc.decode("windows-1252"))
# â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”

Decoding the document strictly as UTF-8 raises a UnicodeDecodeError, and decoding it as Windows-1252 gives gibberish. Fortunately, UnicodeDammit.detwingle() converts the string to pure UTF-8, so that decoding it displays the snowmen and the quotation marks together:

new_doc = UnicodeDammit.detwingle(doc)
print(new_doc.decode("utf8"))
# ☃☃☃“I like snowmen!”

UnicodeDammit.detwingle() only knows how to handle Windows-1252 embedded in UTF-8 (or the reverse), but this is the most common case.

Make sure to call UnicodeDammit.detwingle() on your data before passing it to the BeautifulSoup or UnicodeDammit constructor. If you try to parse a UTF-8 document that contains Windows-1252 characters, you are likely to get gibberish such as: â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”.

UnicodeDammit.detwingle() is new in Beautiful Soup 4.1.0.
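
The failure and the repair can be shown in one script. This sketch first confirms that strict UTF-8 decoding fails on the mixed bytes, then applies detwingle(); the flag name strict_failed is illustrative.

```python
from bs4 import UnicodeDammit

snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"

# Mixed-encoding bytes: UTF-8 snowmen followed by Windows-1252 quotes.
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

try:
    doc.decode("utf8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# detwingle() returns bytes that are valid UTF-8 throughout.
fixed = UnicodeDammit.detwingle(doc).decode("utf8")

print(strict_failed)
print(fixed)
```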

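Parsing only part of a document
Suppose you only want to look at the <a> tags in a document. Parsing the entire document first is a waste of memory and time. It is much faster to ignore everything that is not an <a> tag from the start. The SoupStrainer class lets you choose which parts of an incoming document are parsed, so that searching does not require parsing the whole document first. Create a SoupStrainer and pass it to the BeautifulSoup constructor as the parse_only argument.

SoupStrainer
The SoupStrainer class takes the same arguments as a typical search method: name, attrs, recursive, text, and **kwargs. (The text argument is named string from Beautiful Soup 4.4.0 onward.) Here are three SoupStrainer objects:

from bs4 import SoupStrainer

only_a_tags = SoupStrainer("a")

only_tags_with_id_link2 = SoupStrainer(id="link2")

def is_short_string(string):
    return len(string) < 10

only_short_strings = SoupStrainer(text=is_short_string)

Using the “three sisters” document once more, here is what the document looks like when it is parsed with each of the three SoupStrainer objects:

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
# <a href="http://example.com/elsie" class="sister" id="link1">
#  Elsie
# </a>
# <a href="http://example.com/lacie" class="sister" id="link2">
#  Lacie
# </a>
# <a href="http://example.com/tillie" class="sister" id="link3">
#  Tillie
# </a>

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
# <a href="http://example.com/lacie" class="sister" id="link2">
#  Lacie
# </a>

print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
# Elsie
# ,
# Lacie
# and
# Tillie
# ...

You can also pass a SoupStrainer into any of the methods covered in the section on searching the tree. This is probably not a common use, but it is worth mentioning:

soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all(only_short_strings)
# ['\n\n', '\n\n', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie',
#  '\n\n', '...', '\n']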
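
The effect of parse_only can be observed by checking which tags exist in the resulting soup. The sketch below uses a shortened, hypothetical document (not the full “three sisters” text) and the "html.parser" backend.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Shortened sample markup, invented for this illustration.
html_doc = """<p class="story">Three sisters:
<a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>.</p>"""

only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)

link_ids = [a["id"] for a in soup.find_all("a")]
has_paragraph = soup.find("p") is not None   # <p> was never built

print(link_ids, has_paragraph)
```

Because the enclosing `<p>` is never constructed, code that later navigates from a link to its parent paragraph would not work on a strained soup.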
