How to decode HTML fetched with GET: web-crawler page-encoding logic
The earliest encoding was ASCII, devised in the United States. Roughly speaking, it uses one byte (8 bits) per character; since the characters needed for English text number fewer than 128 and a byte can distinguish 256 values, the scheme was perfectly adequate at the time.
As computers spread around the world, every language faced the problem of how to be represented; Chinese alone has several thousand common characters, so the original one-byte ASCII table was clearly insufficient. Unicode appeared to solve this. Strictly speaking it is only a character set: the prefix "uni-" means "unified", and it assigns every character in every language a code point, but it does not dictate how those code points are stored in memory. As a result, different concrete encodings appeared, such as UTF-16 and UTF-32, and the ideal of one universal format was not realized until the internet popularized UTF-8. UTF-8 implements the Unicode character set with its own variable-width rule: each character is encoded into 1 to 4 bytes (ASCII characters stay one byte; most CJK characters take three).
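A quick way to see that variable width for yourself (a minimal Python 3 sketch):

```
# UTF-8 is variable-width: ASCII stays at 1 byte, most CJK characters take 3,
# and characters outside the Basic Multilingual Plane take 4.
print(len('A'.encode('utf-8')))    # 1
print(len('中'.encode('utf-8')))   # 3
print(len('𝄞'.encode('utf-8')))   # 4
```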
A common point of confusion is mixing up bytes with the other data types of a programming language. Bytes are what the computer actually stores, and they are the only thing that travels over the network: a JSON or XML string must be turned into bytes before it can be sent through a socket. Converting between bytes and string data is exactly what encoding and decoding mean, and utf-8 is simply the codec specified for that conversion.
A brief note on serialization and deserialization. Serialization happens either locally or over the network. Local serialization usually means persisting an in-memory object to disk: the object (plus some of its metadata) is serialized into a string, the string is encoded with some codec (for example utf-8) into bytes, and the bytes are written to disk. Deserialization is the reverse: read the bytes from disk, decode them into a string, then parse the string back into an object.
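As a minimal illustration of that round trip (Python 3; the file name obj.json is just an example):

```
import json

obj = {'name': '爬虫', 'pages': 3}

data = json.dumps(obj, ensure_ascii=False)    # serialize: object -> str
with open('obj.json', 'wb') as f:
    f.write(data.encode('utf-8'))             # encode: str -> bytes, persist to disk

with open('obj.json', 'rb') as f:
    raw = f.read()                            # bytes read back from disk
restored = json.loads(raw.decode('utf-8'))    # decode: bytes -> str, then deserialize
assert restored == obj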
Before looking at how requests and bs4 pick an encoding, a short review of bytes, str and unicode in Python:
1. str/bytes
```
>>> s = '123'
>>> type(s)
<class 'str'>
>>> b = b'123'
>>> type(b)
<class 'bytes'>
```
2. Converting between str and bytes
The conversions between str and bytes go like this:
str → bytes: bytes(s, encoding='utf8')
bytes → str: str(b, encoding='utf-8')
The same conversion can also be done with the encode/decode methods:
str encoded into bytes: str.encode(s), i.e. s.encode('utf-8')
bytes decoded back into str: bytes.decode(b), i.e. b.decode('utf-8')
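Putting the four forms together (Python 3):

```
s = '你好'

b1 = bytes(s, encoding='utf8')    # str -> bytes via the constructor
b2 = str.encode(s, 'utf-8')       # str -> bytes via encode (same as s.encode('utf-8'))

s1 = str(b1, encoding='utf-8')    # bytes -> str via the constructor
s2 = bytes.decode(b2, 'utf-8')    # bytes -> str via decode (same as b2.decode('utf-8'))

assert b1 == b2 and s1 == s2 == s
```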
3. Strings in Python 2 vs. Python 3
Python 2 treats strings as raw bytes rather than unicode, whereas in Python 3 every str is unicode.
1. BeautifulSoup (bs4) encoding-detection logic
Code location: python2.7/site-packages/bs4/dammit.py
```
@property
def encodings(self):
    """Yield a number of encodings that might work for this markup."""
    tried = set()
    for e in self.override_encodings:
        if self._usable(e, tried):
            yield e

    # Did the document originally start with a byte-order mark
    # that indicated its encoding?
    if self._usable(self.sniffed_encoding, tried):
        yield self.sniffed_encoding

    # Look within the document for an XML or HTML encoding
    # declaration.
    if self.declared_encoding is None:
        self.declared_encoding = self.find_declared_encoding(
            self.markup, self.is_html)
    if self._usable(self.declared_encoding, tried):
        yield self.declared_encoding

    # Use third-party character set detection to guess at the
    # encoding.
    if self.chardet_encoding is None:
        self.chardet_encoding = chardet_dammit(self.markup)
    if self._usable(self.chardet_encoding, tried):
        yield self.chardet_encoding

    # As a last-ditch effort, try utf-8 and windows-1252.
    for e in ('utf-8', 'windows-1252'):
        if self._usable(e, tried):
            yield e
```
Explanation: the generator tries several sources of encoding information, in this priority order:
1. self.override_encodings — encodings explicitly supplied by the caller
2. self.sniffed_encoding
self.markup, self.sniffed_encoding = self.strip_byte_order_mark(markup)
This method inspects the byte-order mark (BOM) at the very start of the document, strips it, and returns the encoding it implies.
```
@classmethod
def strip_byte_order_mark(cls, data):
    """If a byte-order mark is present, strip it and return the encoding it implies."""
    encoding = None
    if isinstance(data, unicode):
        # Unicode data cannot have a byte-order mark.
        return data, encoding
    if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
            and (data[2:4] != '\x00\x00'):
        encoding = 'utf-16be'
        data = data[2:]
    elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
            and (data[2:4] != '\x00\x00'):
        encoding = 'utf-16le'
        data = data[2:]
    elif data[:3] == b'\xef\xbb\xbf':
        encoding = 'utf-8'
        data = data[3:]
    elif data[:4] == b'\x00\x00\xfe\xff':
        encoding = 'utf-32be'
        data = data[4:]
    elif data[:4] == b'\xff\xfe\x00\x00':
        encoding = 'utf-32le'
        data = data[4:]
    return data, encoding
```
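For example, calling the classmethod directly on bytes prefixed with a UTF-8 BOM strips the mark and reports the encoding it implies:

```
from bs4.dammit import EncodingDetector

data, enc = EncodingDetector.strip_byte_order_mark(b'\xef\xbb\xbf<html></html>')
print(enc)    # 'utf-8'
print(data)   # b'<html></html>' -- the BOM has been removed
```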
3. self.declared_encoding
self.declared_encoding = self.find_declared_encoding(self.markup, self.is_html)
This method uses regular expressions to find an encoding declaration near the top of the document (the XML prolog or an HTML <meta> tag).
The regex patterns used:
```
xml_encoding_re = re.compile(
    '^<\?.*encoding=[\'"](.*?)[\'"].*\?>'.encode(), re.I)
html_meta_re = re.compile(
    '<\s*meta[^>]+charset\s*=\s*["\']?([^>]*?)[ /;\'">]'.encode(), re.I)
```
```
@classmethod
def find_declared_encoding(cls, markup, is_html=False, search_entire_document=False):
    """Given a document, tries to find its declared encoding.

    An XML encoding is declared at the beginning of the document.

    An HTML encoding is declared in a <meta> tag, hopefully near the
    beginning of the document.
    """
    if search_entire_document:
        xml_endpos = html_endpos = len(markup)
    else:
        xml_endpos = 1024
        html_endpos = max(2048, int(len(markup) * 0.05))
    declared_encoding = None
    declared_encoding_match = xml_encoding_re.search(markup, endpos=xml_endpos)
    if not declared_encoding_match and is_html:
        declared_encoding_match = html_meta_re.search(markup, endpos=html_endpos)
    if declared_encoding_match is not None:
        declared_encoding = declared_encoding_match.groups()[0].decode(
            'ascii', 'replace')
    if declared_encoding:
        return declared_encoding.lower()
    return None
```
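Used on its own, it should behave roughly like this (assuming the patterns shown above):

```
from bs4.dammit import EncodingDetector

# HTML5-style <meta charset=...> declaration
print(EncodingDetector.find_declared_encoding(
    b'<html><head><meta charset="gbk"></head></html>', is_html=True))   # 'gbk'

# XML prolog declaration
print(EncodingDetector.find_declared_encoding(
    b'<?xml version="1.0" encoding="UTF-8"?><root/>'))                  # 'utf-8'
```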
4. self.chardet_encoding
self.chardet_encoding = chardet_dammit(self.markup)
Finally the detector falls back to the chardet package, which statistically analyses the body bytes and reports its best guess together with a confidence value.
```
import chardet

def chardet_dammit(s):
    return chardet.detect(s)['encoding']
```
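A quick sanity check of what chardet returns; the exact guess and confidence depend on the input and on the chardet version:

```
import chardet

raw = ('这是一段用于测试编码检测的中文文本。' * 20).encode('gbk')
print(chardet.detect(raw))
# e.g. {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
```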
2. requests encoding logic
response = requests.get(url, verify=False, headers=configSpider.get_head())
requests exposes two encoding-detection results.
response.encoding
Location: python2.7/site-packages/requests/adapters.py
```
response.encoding = get_encoding_from_headers(response.headers)
```
Location: python2.7/site-packages/requests/utils.py
```
def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :rtype: str
    """
    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'
```
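Assuming the version of requests quoted above, its behaviour is easy to check directly, with a plain dict standing in for the response headers:

```
from requests.utils import get_encoding_from_headers

print(get_encoding_from_headers({'content-type': 'text/html; charset=GBK'}))   # 'GBK'
print(get_encoding_from_headers({'content-type': 'text/html'}))                # 'ISO-8859-1'
print(get_encoding_from_headers({'content-type': 'application/json'}))         # None
```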
The cgi.parse_header() function:
```
def parse_header(line):
    """Parse a Content-type like header.

    Return the main content-type and a dictionary of options.
    """
    parts = _parseparam(';' + line)
    key = parts.next()
    pdict = {}
    for p in parts:
        i = p.find('=')
        if i >= 0:
            name = p[:i].strip().lower()
            value = p[i+1:].strip()
            if len(value) >= 2 and value[0] == value[-1] == '"':
                value = value[1:-1]
                value = value.replace('\\\\', '\\').replace('\\"', '"')
            pdict[name] = value
    return key, pdict
```
So response.encoding is taken straight from the response headers: if the Content-Type declares a concrete charset, that value is returned; if the media type merely contains 'text', the function returns 'ISO-8859-1'.
Many sites send nothing more than content-type: text/html, and decoding those pages as 'ISO-8859-1' clearly produces mojibake.
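A typical failure looks like this (the URL is hypothetical; imagine a GBK page served with a bare text/html header):

```
import requests

resp = requests.get('http://example.com/gbk-page.html')   # hypothetical GBK page

print(resp.headers.get('content-type'))   # 'text/html' -- no charset
print(resp.encoding)                      # 'ISO-8859-1' -- the header-based fallback
print(resp.apparent_encoding)             # e.g. 'GB2312' -- chardet's guess from the body
print(resp.text[:60])                     # decoded with the wrong codec -> mojibake
```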
response.apparent_encoding
requests also exposes apparent_encoding, which simply runs chardet over the response body; it cannot guarantee a correct answer either.
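In the chardet-based versions of requests referenced here, the property is essentially just this (from requests/models.py, approximately):

```
@property
def apparent_encoding(self):
    """The apparent encoding, provided by the chardet library."""
    return chardet.detect(self.content)['encoding']
```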
3. response.content vs. response.text in requests
```
@property
def content(self):
    """Content of the response, in bytes."""
    if self._content is False:
        # Read the contents.
        if self._content_consumed:
            raise RuntimeError(
                'The content for this response was already consumed')

        if self.status_code == 0 or self.raw is None:
            self._content = None
        else:
            self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()

    self._content_consumed = True
    # don't need to release the connection; that's been handled by urllib3
    # since we exhausted the data.
    return self._content

@property
def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using
    ``chardet``.

    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """
    # Try charset from content-type
    content = None
    encoding = self.encoding

    if not self.content:
        return str('')

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        encoding = self.apparent_encoding

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # A TypeError can be raised if encoding is None
        #
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content
```
content is the raw byte stream, while text is that stream decoded into str:
content = str(self.content, encoding, errors='replace')
Because encoding here comes from response.encoding (the header-based guess), text is only reliable when that guess is correct. If the page really is UTF-8, the raw content can be used directly in a UTF-8 environment (# -*- coding: utf-8 -*-); otherwise decoding with the wrong codec, for example the ISO-8859-1 fallback, still produces mojibake.
To sum up, the best solution is to combine what these libraries do with your own requirements and build your own detection chain (a sketch follows the list):
1. the encoding declared in the response headers
2. the byte-order mark at the start of the document
3. the encoding declared inside the document (XML prolog or <meta> tag)
4. detection with the chardet module
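A minimal sketch of such a chain (Python 3; the function name detect_encoding, the regex, and the exact priority rules are this article's own choices, not a library API):

```
import re
import chardet

# BOM signatures, longest first so UTF-32 is not mistaken for UTF-16.
BOMS = [
    (b'\x00\x00\xfe\xff', 'utf-32be'),
    (b'\xff\xfe\x00\x00', 'utf-32le'),
    (b'\xef\xbb\xbf', 'utf-8'),
    (b'\xfe\xff', 'utf-16be'),
    (b'\xff\xfe', 'utf-16le'),
]
META_RE = re.compile(br'charset\s*=\s*["\']?([\w-]+)', re.I)

def detect_encoding(resp):
    """Headers -> BOM -> in-document declaration -> chardet."""
    # 1. charset from the Content-Type header; skip ISO-8859-1, which is
    #    usually just the bare text/html fallback.
    if resp.encoding and resp.encoding.lower() != 'iso-8859-1':
        return resp.encoding
    body = resp.content
    # 2. byte-order mark at the very start of the document
    for bom, enc in BOMS:
        if body.startswith(bom):
            return enc
    # 3. encoding declared in the first couple of KB of the document itself
    match = META_RE.search(body[:2048])
    if match:
        return match.group(1).decode('ascii', 'replace')
    # 4. statistical detection over the body as a last resort
    return chardet.detect(body)['encoding'] or 'utf-8'

# usage:
# resp = requests.get(url)
# html = resp.content.decode(detect_encoding(resp), errors='replace')
```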
When calling the requests package, a simple workaround is:
```
if response.encoding == 'ISO-8859-1':
    response.encoding = response.apparent_encoding
text = response.text
```
Or borrow bs4's EncodingDetector, whose encodings property yields candidates in the priority order described above:
```
from bs4.dammit import EncodingDetector

detector = EncodingDetector(markup, override_encodings, is_html, exclude_encodings)
encoding = next(iter(detector.encodings), None)  # first candidate in priority order
```
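If you just want the decoded document rather than the list of candidates, bs4's UnicodeDammit (which drives EncodingDetector internally) may be more convenient; a small sketch, assuming resp is a requests response:

```
from bs4.dammit import UnicodeDammit

dammit = UnicodeDammit(resp.content, is_html=True)
print(dammit.original_encoding)   # the encoding bs4 settled on
html = dammit.unicode_markup      # the document decoded to unicode
```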