
Parsing HTML Pages with the HTMLParser Module

Published: 2023/12/10


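HTMLParser is the Python module for parsing HTML and XHTML documents. It can pick out the tags, data, and other constructs in an HTML document, which makes it a convenient way to process HTML. HTMLParser is event-driven: when it finds a particular piece of markup, it calls a user-defined method to notify the program. The main callbacks all have names beginning with handle_ and are all methods of the HTMLParser class. To use the module, derive a new class from HTMLParser and override the handle_ methods you need. Unlike the parser in htmllib, this parser is not based on the SGML parser in the sgmllib module.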

The htmllib and sgmllib modules have been deprecated since Python 2.6 and were removed in Python 3.0.

HTMLParser

class HTMLParser.HTMLParser

The HTMLParser class is instantiated without arguments.

An HTMLParser instance is fed HTML data and calls handler functions when tags begin and end. The HTMLParser class is meant to be overridden by the user to provide the desired behavior.

Unlike the parser in htmllib, this parser does not check that end tags match start tags, and it does not call the end-tag handler for elements which are closed implicitly by closing an outer element.
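The event-driven behavior described above can be illustrated with a minimal sketch. It assumes Python 3, where the same class lives in html.parser rather than in the Python 2 HTMLParser module; the EventLogger class and its events list are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser  # Python 3 home of the HTMLParser class


class EventLogger(HTMLParser):
    """Hypothetical subclass that records each callback in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
# <b> is never closed explicitly. The parser reports only the tags it
# actually sees, so no ("end", "b") event is produced.
logger.feed("<p><b>bold</p>")
logger.close()
print(logger.events)
```

The recorded events contain no end event for b, which shows that the parser neither checks tag matching nor synthesizes end tags for implicitly closed elements.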

An exception is defined as well:

exception HTMLParser.HTMLParseError

Exception raised by the HTMLParser class when it encounters an error while parsing. This exception provides three attributes: msg is a brief message explaining the error, lineno is the number of the line on which the broken construct was detected, and offset is the number of characters into the line at which the construct starts.

HTMLParser instances have the following methods:

HTMLParser.reset()

Reset the instance. Loses all unprocessed data. This is called implicitly at instantiation time.

HTMLParser.feed(data)

Feed some text to the parser. It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or close() is called.
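The buffering behavior of feed() can be seen by splitting a tag across two calls. This is a sketch under the assumption of Python 3's html.parser; TagCollector and the two snapshot variables are hypothetical names.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical subclass that collects the names of start tags."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()

# First chunk ends in the middle of a start tag, so nothing is reported yet.
collector.feed('<a hre')
tags_after_first_chunk = list(collector.tags)

# Second chunk completes the tag, so the buffered text is parsed now.
collector.feed('f="x.html">')
tags_after_second_chunk = list(collector.tags)

print(tags_after_first_chunk, tags_after_second_chunk)
```

This is why data read from a socket or file in arbitrary chunks can be passed to feed() directly, without first reassembling complete elements.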

HTMLParser.close()

Force processing of all buffered data as if it were followed by an end-of-file mark. This method may be redefined by a derived class to define additional processing at the end of the input, but the redefined version should always call the HTMLParser base class method close().
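A redefined close() that still calls the base class method might look like the following sketch. It assumes Python 3's html.parser; WordCounter, words, and finished are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class WordCounter(HTMLParser):
    """Hypothetical subclass that counts words and marks end of input."""

    def __init__(self):
        super().__init__()
        self.words = 0
        self.finished = False

    def handle_data(self, data):
        self.words += len(data.split())

    def close(self):
        # Always let the base class flush its buffer first.
        super().close()
        # Extra end-of-input processing goes after the base call.
        self.finished = True


counter = WordCounter()
counter.feed("<p>one two</p><p>three</p>")
counter.close()
print(counter.words, counter.finished)
```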

HTMLParser.getpos()

Return current line number and offset.
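As a rough illustration of getpos(), the sketch below records the position reported when each start tag is handled. It assumes Python 3's html.parser, where line numbers start at 1 and offsets start at 0; PositionRecorder is a hypothetical name.

```python
from html.parser import HTMLParser


class PositionRecorder(HTMLParser):
    """Hypothetical subclass that stores (tag, (lineno, offset)) pairs."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() points at the start of the construct being handled.
        self.positions.append((tag, self.getpos()))


rec = PositionRecorder()
rec.feed("<html>\n  <body>\n<p>")
rec.close()
print(rec.positions)
```

The same line and offset convention is what a parse error message refers to, which helps when locating a broken construct in the source text.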

HTMLParser.get_starttag_text()

Return the text of the most recently opened start tag. This should not normally be needed for structured processing, but may be useful in dealing with HTML “as deployed” or for re-generating input with minimal changes (whitespace between attributes can be preserved, etc.).
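The difference between the normalized tag name and the raw start-tag text can be sketched as follows, assuming Python 3's html.parser. RawTagRecorder is a hypothetical name.

```python
from html.parser import HTMLParser


class RawTagRecorder(HTMLParser):
    """Hypothetical subclass pairing each tag name with its raw source text."""

    def __init__(self):
        super().__init__()
        self.raw = []

    def handle_starttag(self, tag, attrs):
        # tag is lower-cased; get_starttag_text() returns the text as written.
        self.raw.append((tag, self.get_starttag_text()))


recorder = RawTagRecorder()
recorder.feed('<A   HREF="http://www.cwi.nl/">')
recorder.close()
print(recorder.raw)
```

The raw text keeps the original case and the extra whitespace between the tag name and the attribute.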

HTMLParser.handle_starttag(tag, attrs)

This method is called to handle the start of a tag. It is intended to be overridden by a derived class; the base class implementation does nothing.

The tag argument is the name of the tag converted to lower case. The attrs argument is a list of (name, value) pairs containing the attributes found inside the tag's <> brackets. The name is translated to lower case, quotes in the value are removed, and character and entity references are replaced. For instance, for the tag <A HREF="http://www.cwi.nl/">, this method would be called as handle_starttag('a', [('href', 'http://www.cwi.nl/')]).

Changed in version 2.6: All entity references from htmlentitydefs are now replaced in the attribute values.
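The attribute normalization described for handle_starttag() can be checked with a small sketch, assuming Python 3's html.parser. StartTagRecorder is a hypothetical name, and the TITLE attribute and the input tag are made-up sample markup.

```python
from html.parser import HTMLParser


class StartTagRecorder(HTMLParser):
    """Hypothetical subclass that stores the arguments of each call."""

    def __init__(self):
        super().__init__()
        self.calls = []

    def handle_starttag(self, tag, attrs):
        self.calls.append((tag, attrs))


rec = StartTagRecorder()
# Upper-case names are lower-cased, quotes are stripped, and the entity
# reference inside the attribute value is replaced by its character.
rec.feed("<A HREF=\"http://www.cwi.nl/\" TITLE='CWI &amp; friends'>")
# An attribute written without a value is reported with value None.
rec.feed("<input disabled>")
rec.close()
print(rec.calls)
```

Because a value can be None, code that calls string methods on attribute values should check for that case first.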

HTMLParser.handle_startendtag(tag, attrs)

Similar to handle_starttag(), but called when the parser encounters an XHTML-style empty tag (<a .../>). This method may be overridden by subclasses which require this particular lexical information; the default implementation simply calls handle_starttag() and handle_endtag().
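A subclass that needs to tell XHTML-style empty tags apart from ordinary start tags might override handle_startendtag() as in this sketch, which assumes Python 3's html.parser. EmptyTagRecorder and the "empty" label are hypothetical.

```python
from html.parser import HTMLParser


class EmptyTagRecorder(HTMLParser):
    """Hypothetical subclass that labels <tag/> separately from <tag>."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        # Overriding replaces the default start-then-end pair of calls.
        self.events.append(("empty", tag))


rec = EmptyTagRecorder()
rec.feed("<p>text<br/></p>")
rec.close()
print(rec.events)
```

Without the override, the same input would report br as a start event followed by an end event.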

HTMLParser.handle_endtag(tag)

This method is called to handle the end tag of an element. It is intended to be overridden by a derived class; the base class implementation does nothing. The tag argument is the name of the tag converted to lower case.

HTMLParser.handle_data(data)

This method is called to process arbitrary data. It is intended to be overridden by a derived class; the base class implementation does nothing.
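A common use of handle_data() is pulling the text out of a document. The sketch below assumes Python 3's html.parser; TextExtractor and chunks are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical subclass that collects the text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


extractor = TextExtractor()
extractor.feed("<h1>Title</h1><p>Hello <b>world</b></p>")
extractor.close()
print(extractor.chunks)
print("".join(extractor.chunks))
```

Text arrives in pieces delimited by markup, so callers usually join the pieces themselves.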

HTMLParser.handle_charref(name)

This method is called to process a character reference of the form &#ref;. It is intended to be overridden by a derived class; the base class implementation does nothing.

HTMLParser.handle_entityref(name)

This method is called to process a general entity reference of the form &name; where name is a general entity reference. It is intended to be overridden by a derived class; the base class implementation does nothing.
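The two reference callbacks can be observed with the sketch below. It assumes Python 3's html.parser, where the constructor argument convert_charrefs defaults to True since Python 3.5 and must be set to False for these callbacks to fire on text content; RefRecorder is a hypothetical name.

```python
from html.parser import HTMLParser


class RefRecorder(HTMLParser):
    """Hypothetical subclass that records character and entity references."""

    def __init__(self):
        # With convert_charrefs=True the references would be decoded and
        # delivered through handle_data() instead.
        super().__init__(convert_charrefs=False)
        self.refs = []

    def handle_charref(self, name):
        self.refs.append(("charref", name))

    def handle_entityref(self, name):
        self.refs.append(("entityref", name))


rec = RefRecorder()
rec.feed("<p>&#62; &gt; &#x3E;</p>")
rec.close()
print(rec.refs)
```

The name argument excludes the leading &# or & and the trailing semicolon; a hexadecimal reference keeps its x prefix.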

HTMLParser.handle_comment(data)

This method is called when a comment is encountered. The data argument is a string containing the text between the -- and -- delimiters, but not the delimiters themselves. For example, the comment <!--text--> will cause this method to be called with the argument 'text'. It is intended to be overridden by a derived class; the base class implementation does nothing.
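A brief sketch of comment handling, assuming Python 3's html.parser. CommentCollector is a hypothetical name.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical subclass that stores the body of each comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


collector = CommentCollector()
collector.feed("<!--text--><p>body</p><!-- spaced -->")
collector.close()
print(collector.comments)
```

Whitespace inside the delimiters is passed through unchanged, so callers that want a trimmed comment need to strip it themselves.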

HTMLParser.handle_decl(decl)

Method called when an SGML doctype declaration is read by the parser. The decl parameter will be the entire contents of the declaration inside the <!...> markup. It is intended to be overridden by a derived class; the base class implementation does nothing.

HTMLParser.unknown_decl(data)

Method called when an unrecognized SGML declaration is read by the parser. The data parameter will be the entire contents of the declaration inside the <!...> markup. It is sometimes useful to be overridden by a derived class; the base class implementation throws an HTMLParseError.

HTMLParser.handle_pi(data)

Method called when a processing instruction is encountered. The data parameter will contain the entire processing instruction. For example, for the processing instruction <?proc color='red'>, this method would be called as handle_pi("proc color='red'"). It is intended to be overridden by a derived class; the base class implementation does nothing.

Note

The HTMLParser class uses the SGML syntactic rules for processing instructions. An XHTML processing instruction using the trailing '?' will cause the '?' to be included in data.
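The note about the trailing '?' can be checked with the following sketch, assuming Python 3's html.parser. PICollector is a hypothetical name and the xml declaration is made-up sample input.

```python
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Hypothetical subclass that stores each processing instruction."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


collector = PICollector()
# SGML-style instruction, terminated by '>' alone.
collector.feed("<?proc color='red'>")
# XHTML-style instruction: the '?' before '>' is kept as part of data.
collector.feed("<?xml version='1.0'?>")
collector.close()
print(collector.instructions)
```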

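Practical Use

Take URL harvesting in a web crawler as an example: we want to extract every link on the NetEase home page. First, a little HTML background; the following material comes from W3School.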

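What Is an HTML Hyperlink

A hyperlink can be a word, a group of words, or an image that you can click to jump to a new document or to a section of the current document.

When you move the mouse pointer over a link in a web page, the arrow turns into a small hand.

Links are created in HTML with the <a> tag.

There are two ways to use the <a> tag:

With the href attribute, to create a link to another document

With the name attribute, to create a bookmark inside a document

HTML Link Syntax

The HTML code for a link is simple. It looks like this:

<a href="url">Link text</a>

The href attribute specifies the target of the link.

The text between the start tag and the end tag is displayed as the hyperlink.

Example

<a href="http://www.w3school.com.cn/">Visit W3School</a>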

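Writing the Code

From the above we know that a link lives in the <a> start tag, and its href attribute holds the URL we want to extract. So we override the handle_starttag() method to do the job.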

# -*- coding: utf-8 -*-
"""
Created on Tue Aug 30 09:46:45 2011

@author: Nupta
"""
import urllib2
import HTMLParser


class MyParser(HTMLParser.HTMLParser):

    def handle_starttag(self, tag, attrs):
        # Print the href of every <a> tag that holds an absolute http(s) URL.
        if tag == 'a':
            for name, value in attrs:
                # value is None for an attribute written without a value.
                if name == 'href' and value and value.startswith('http'):
                    print value


if __name__ == '__main__':
    url = raw_input(u'Enter URL: '.encode('cp936'))
    # Download the whole page as a single string.
    f = urllib2.urlopen(url).read()
    my = MyParser()
    try:
        my.feed(f)
    except HTMLParser.HTMLParseError, e:
        # Report the position of the malformed markup.
        print e
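For readers on Python 3, the same idea can be sketched as follows. This is not the author's code: it assumes the html.parser module, collects links into a list instead of printing them, and parses a made-up sample string rather than downloading a page. LinkParser, links, and sample are hypothetical names. In Python 3.5 and later the HTMLParseError exception no longer exists, so the sketch has no try/except block.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Hypothetical Python 3 counterpart of MyParser that collects links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Keep only absolute http(s) links; skip valueless attributes.
            if name == "href" and value and value.startswith("http"):
                self.links.append(value)


# Made-up sample markup standing in for a downloaded page.
sample = (
    '<a href="http://www.163.com/">home</a>'
    '<a href="/local">relative</a>'
    '<a name="top"></a>'
)

parser = LinkParser()
parser.feed(sample)
parser.close()
print(parser.links)
```

Collecting into a list rather than printing makes the result easier to post-process, for example to remove duplicates before crawling further.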

Analyzing the Problem

The program prints a great many links, so most are omitted here. Note the last line:

http://www.hd315.gov.cn/beian/view.asp?bianhao=0102000102300012
http://www.itrust.org.cn/yz/pjwx.asp?wm=2012043533
http://www.bj.cyberpolice.cn/index.htm
malformed start tag, at line 3339, column 44

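While reading the HTML file, the parser hit a malformed start tag at line 3339, column 44, and raised an HTMLParseError. In zero-based terms, that is element 43 of line 3338 of the file. Because the code above used read(), we need to switch to readlines() here so that the HTML file is read into a list of lines.

    print f[3338][34:67]

The result makes the cause clear:

<a href=\'http://mail.163.com/\'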
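The lookup above can be generalized into a small helper. This is a sketch only: snippet_at and its parameters are hypothetical, the sample text is made up, and it assumes the convention used by getpos(), namely a 1-based line number and a 0-based character offset.

```python
def snippet_at(html_text, lineno, offset, before=0, after=30):
    """Return the text around a (lineno, offset) position.

    lineno is 1-based and offset is 0-based (an assumption matching
    getpos()); before and after control how much context is returned.
    """
    lines = html_text.splitlines()
    line = lines[lineno - 1]
    start = max(0, offset - before)
    return line[start:offset + after]


# Made-up three-line document with the problematic tag on line 3.
sample_html = "<html>\n<body>\n" + r"  <a href=\'http://mail.163.com/\'>" + "\n"

print(snippet_at(sample_html, lineno=3, offset=2, after=8))
```

Using splitlines() on the string returned by read() avoids a second download just to call readlines().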

The two backslash escape characters cause the parser to fail. If you are not sure how to check whether a piece of HTML is valid, paste it into the W3C validator to get an analysis:

Line 1, Column 9: an attribute value must be a literal unless it contains only name characters

<a href=\'http://mail.163.com/\'

You have used a character that is not considered a "name character" in an attribute value. Which characters are considered "name characters" varies between the different document types, but a good rule of thumb is that unless the value contains only lower or upper case letters in the range a-z you must put quotation marks around the value. In fact, unless you have extreme file size requirements it is a very very good idea to always put quote marks around your attribute values. It is never wrong to do so, and very often it is absolutely necessary.

As the validation result for the 126 mail login page shows, HTML in the wild still has a long way to go before it conforms to the XHTML specification.

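Looking for a Solution

Waiting for someone else to fix the HTML is not realistic. It would be much better if we could convert the HTML into a well-formed version ourselves. The next article will show how to use Beautiful Soup to solve this problem.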

Reposted from: https://www.cnblogs.com/yuxc/archive/2011/08/30/2159307.html
