How to decode HTML fetched with GET: web-crawler page-encoding logic
The earliest encoding was ASCII, devised in the United States. Roughly speaking, it uses one byte (8 bits) per character; since the characters needed for English text number fewer than 128 and a byte can distinguish 256 values, the scheme was perfectly adequate at the time.
As computers spread around the world, every language faced the problem of how to be represented; Chinese alone has several thousand common characters, so the original one-byte ASCII table was clearly insufficient. Unicode appeared to solve this. Strictly speaking it is only a character set: the prefix "uni-" means "unified", and it assigns every character in every language a code point, but it does not dictate how those code points are stored in memory. As a result, different concrete encodings appeared, such as UTF-16 and UTF-32, and the ideal of one universal format was not realized until the internet popularized UTF-8. UTF-8 implements the Unicode character set with its own variable-width rule: each character is encoded into 1 to 4 bytes (ASCII characters stay one byte; most CJK characters take three).
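A quick way to see that variable width for yourself (a minimal Python 3 sketch):

```
# UTF-8 is variable-width: ASCII stays at 1 byte, most CJK characters take 3,
# and characters outside the Basic Multilingual Plane take 4.
print(len('A'.encode('utf-8')))    # 1
print(len('中'.encode('utf-8')))   # 3
print(len('𝄞'.encode('utf-8')))   # 4
```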
A common point of confusion is mixing up bytes with the other data types of a programming language. Bytes are what the computer actually stores, and they are the only thing that travels over the network: a JSON or XML string must be turned into bytes before it can be sent through a socket. Converting between bytes and string data is exactly what encoding and decoding mean, and utf-8 is simply the codec specified for that conversion.
A brief note on serialization and deserialization. Serialization happens either locally or over the network. Local serialization usually means persisting an in-memory object to disk: the object (plus some of its metadata) is serialized into a string, the string is encoded with some codec (for example utf-8) into bytes, and the bytes are written to disk. Deserialization is the reverse: read the bytes from disk, decode them into a string, then parse the string back into an object.
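As a minimal illustration of that round trip (Python 3; the file name obj.json is just an example):

```
import json

obj = {'name': '爬虫', 'pages': 3}

data = json.dumps(obj, ensure_ascii=False)    # serialize: object -> str
with open('obj.json', 'wb') as f:
    f.write(data.encode('utf-8'))             # encode: str -> bytes, persist to disk

with open('obj.json', 'rb') as f:
    raw = f.read()                            # bytes read back from disk
restored = json.loads(raw.decode('utf-8'))    # decode: bytes -> str, then deserialize
assert restored == obj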
Before looking at how requests and bs4 pick an encoding, a short review of bytes, str and unicode in Python:
1. str/bytes
```
>>> s = '123'
>>> type(s)
<class 'str'>
>>> b = b'123'
>>> type(b)
<class 'bytes'>
```
2. Converting between str and bytes
The conversions between str and bytes go like this:
str → bytes: bytes(s, encoding='utf8')
bytes → str: str(b, encoding='utf-8')
The same conversion can also be done with the encode/decode methods:
str encoded into bytes: str.encode(s), i.e. s.encode('utf-8')
bytes decoded back into str: bytes.decode(b), i.e. b.decode('utf-8')
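Putting the four forms together (Python 3):

```
s = '你好'

b1 = bytes(s, encoding='utf8')    # str -> bytes via the constructor
b2 = str.encode(s, 'utf-8')       # str -> bytes via encode (same as s.encode('utf-8'))

s1 = str(b1, encoding='utf-8')    # bytes -> str via the constructor
s2 = bytes.decode(b2, 'utf-8')    # bytes -> str via decode (same as b2.decode('utf-8'))

assert b1 == b2 and s1 == s2 == s
```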
3. Strings in Python 2 vs. Python 3
Python 2 treats strings as raw bytes rather than unicode, whereas in Python 3 every str is unicode.
1. BeautifulSoup (bs4) encoding-detection logic
Code location: python2.7/site-packages/bs4/dammit.py
```
@property
def encodings(self):
    """Yield a number of encodings that might work for this markup."""
    tried = set()
    for e in self.override_encodings:
        if self._usable(e, tried):
            yield e

    # Did the document originally start with a byte-order mark
    # that indicated its encoding?
    if self._usable(self.sniffed_encoding, tried):
        yield self.sniffed_encoding

    # Look within the document for an XML or HTML encoding
    # declaration.
    if self.declared_encoding is None:
        self.declared_encoding = self.find_declared_encoding(
            self.markup, self.is_html)
    if self._usable(self.declared_encoding, tried):
        yield self.declared_encoding

    # Use third-party character set detection to guess at the
    # encoding.
    if self.chardet_encoding is None:
        self.chardet_encoding = chardet_dammit(self.markup)
    if self._usable(self.chardet_encoding, tried):
        yield self.chardet_encoding

    # As a last-ditch effort, try utf-8 and windows-1252.
    for e in ('utf-8', 'windows-1252'):
        if self._usable(e, tried):
            yield e
```
Explanation: the generator tries several sources of encoding information, in this priority order:
1. self.override_encodings — encodings explicitly supplied by the caller
2. self.sniffed_encoding
self.markup, self.sniffed_encoding = self.strip_byte_order_mark(markup)
This method inspects the byte-order mark (BOM) at the very start of the document, strips it, and returns the encoding it implies.
```
@classmethod
def strip_byte_order_mark(cls, data):
    """If a byte-order mark is present, strip it and return the encoding it implies."""
    encoding = None
    if isinstance(data, unicode):
        # Unicode data cannot have a byte-order mark.
        return data, encoding
    if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
            and (data[2:4] != '\x00\x00'):
        encoding = 'utf-16be'
        data = data[2:]
    elif (len(data) >= 4) and (data[:2] == b'\xff\xfe') \
            and (data[2:4] != '\x00\x00'):
        encoding = 'utf-16le'
        data = data[2:]
    elif data[:3] == b'\xef\xbb\xbf':
        encoding = 'utf-8'
        data = data[3:]
    elif data[:4] == b'\x00\x00\xfe\xff':
        encoding = 'utf-32be'
        data = data[4:]
    elif data[:4] == b'\xff\xfe\x00\x00':
        encoding = 'utf-32le'
        data = data[4:]
    return data, encoding
```
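For example, calling the classmethod directly on bytes prefixed with a UTF-8 BOM strips the mark and reports the encoding it implies:

```
from bs4.dammit import EncodingDetector

data, enc = EncodingDetector.strip_byte_order_mark(b'\xef\xbb\xbf<html></html>')
print(enc)    # 'utf-8'
print(data)   # b'<html></html>' -- the BOM has been removed
```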
3. self.declared_encoding
self.declared_encoding = self.find_declared_encoding(self.markup, self.is_html)
This method uses regular expressions to find an encoding declaration near the top of the document (the XML prolog or an HTML <meta> tag).
The regex patterns used:
```
xml_encoding_re = re.compile(
    '^<\?.*encoding=[\'"](.*?)[\'"].*\?>'.encode(), re.I)
html_meta_re = re.compile(
    '<\s*meta[^>]+charset\s*=\s*["\']?([^>]*?)[ /;\'">]'.encode(), re.I)
```
```
@classmethod
def find_declared_encoding(cls, markup, is_html=False, search_entire_document=False):
    """Given a document, tries to find its declared encoding.

    An XML encoding is declared at the beginning of the document.

    An HTML encoding is declared in a <meta> tag, hopefully near the
    beginning of the document.
    """
    if search_entire_document:
        xml_endpos = html_endpos = len(markup)
    else:
        xml_endpos = 1024
        html_endpos = max(2048, int(len(markup) * 0.05))
    declared_encoding = None
    declared_encoding_match = xml_encoding_re.search(markup, endpos=xml_endpos)
    if not declared_encoding_match and is_html:
        declared_encoding_match = html_meta_re.search(markup, endpos=html_endpos)
    if declared_encoding_match is not None:
        declared_encoding = declared_encoding_match.groups()[0].decode(
            'ascii', 'replace')
    if declared_encoding:
        return declared_encoding.lower()
    return None
```
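Used on its own, it should behave roughly like this (assuming the patterns shown above):

```
from bs4.dammit import EncodingDetector

# HTML5-style <meta charset=...> declaration
print(EncodingDetector.find_declared_encoding(
    b'<html><head><meta charset="gbk"></head></html>', is_html=True))   # 'gbk'

# XML prolog declaration
print(EncodingDetector.find_declared_encoding(
    b'<?xml version="1.0" encoding="UTF-8"?><root/>'))                  # 'utf-8'
```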
4. self.chardet_encoding
self.chardet_encoding = chardet_dammit(self.markup)
Finally the detector falls back to the chardet package, which statistically analyses the body bytes and reports its best guess together with a confidence value.
```
import chardet

def chardet_dammit(s):
    return chardet.detect(s)['encoding']
```
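A quick sanity check of what chardet returns; the exact guess and confidence depend on the input and on the chardet version:

```
import chardet

raw = ('这是一段用于测试编码检测的中文文本。' * 20).encode('gbk')
print(chardet.detect(raw))
# e.g. {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
```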
2. requests encoding logic
response = requests.get(url, verify=False, headers=configSpider.get_head())
requests exposes two encoding-detection results.
response.encoding
Location: python2.7/site-packages/requests/adapters.py
```
response.encoding = get_encoding_from_headers(response.headers)
```
Location: python2.7/site-packages/requests/utils.py
```
def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :rtype: str
    """
    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'
```
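Assuming the version of requests quoted above, its behaviour is easy to check directly, with a plain dict standing in for the response headers:

```
from requests.utils import get_encoding_from_headers

print(get_encoding_from_headers({'content-type': 'text/html; charset=GBK'}))   # 'GBK'
print(get_encoding_from_headers({'content-type': 'text/html'}))                # 'ISO-8859-1'
print(get_encoding_from_headers({'content-type': 'application/json'}))         # None
```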
The cgi.parse_header() function:
```
def parse_header(line):
    """Parse a Content-type like header.

    Return the main content-type and a dictionary of options.
    """
    parts = _parseparam(';' + line)
    key = parts.next()
    pdict = {}
    for p in parts:
        i = p.find('=')
        if i >= 0:
            name = p[:i].strip().lower()
            value = p[i+1:].strip()
            if len(value) >= 2 and value[0] == value[-1] == '"':
                value = value[1:-1]
                value = value.replace('\\\\', '\\').replace('\\"', '"')
            pdict[name] = value
    return key, pdict
```
So response.encoding is taken straight from the response headers: if the Content-Type declares a concrete charset, that value is returned; if the media type merely contains 'text', the function returns 'ISO-8859-1'.
Many sites send nothing more than content-type: text/html, and decoding those pages as 'ISO-8859-1' clearly produces mojibake.
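A typical failure looks like this (the URL is hypothetical; imagine a GBK page served with a bare text/html header):

```
import requests

resp = requests.get('http://example.com/gbk-page.html')   # hypothetical GBK page

print(resp.headers.get('content-type'))   # 'text/html' -- no charset
print(resp.encoding)                      # 'ISO-8859-1' -- the header-based fallback
print(resp.apparent_encoding)             # e.g. 'GB2312' -- chardet's guess from the body
print(resp.text[:60])                     # decoded with the wrong codec -> mojibake
```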
response.apparent_encoding
requests also exposes apparent_encoding, which simply runs chardet over the response body; it cannot guarantee a correct answer either.
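In the chardet-based versions of requests referenced here, the property is essentially just this (from requests/models.py, approximately):

```
@property
def apparent_encoding(self):
    """The apparent encoding, provided by the chardet library."""
    return chardet.detect(self.content)['encoding']
```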
3. response.content vs. response.text in requests
```
@property
def content(self):
    """Content of the response, in bytes."""
    if self._content is False:
        # Read the contents.
        if self._content_consumed:
            raise RuntimeError(
                'The content for this response was already consumed')

        if self.status_code == 0 or self.raw is None:
            self._content = None
        else:
            self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()

    self._content_consumed = True
    # don't need to release the connection; that's been handled by urllib3
    # since we exhausted the data.
    return self._content

@property
def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using
    ``chardet``.

    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """
    # Try charset from content-type
    content = None
    encoding = self.encoding

    if not self.content:
        return str('')

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        encoding = self.apparent_encoding

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # A TypeError can be raised if encoding is None
        #
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content
```
content is the raw byte stream, while text is that stream decoded into str:
content = str(self.content, encoding, errors='replace')
Because encoding here comes from response.encoding (the header-based guess), text is only reliable when that guess is correct. If the page really is UTF-8, the raw content can be used directly in a UTF-8 environment (# -*- coding: utf-8 -*-); otherwise decoding with the wrong codec, for example the ISO-8859-1 fallback, still produces mojibake.
To sum up, the best solution is to combine what these libraries do with your own requirements and build your own detection chain (a sketch follows the list):
1. the encoding declared in the response headers
2. the byte-order mark at the start of the document
3. the encoding declared inside the document (XML prolog or <meta> tag)
4. detection with the chardet module
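A minimal sketch of such a chain (Python 3; the function name detect_encoding, the regex, and the exact priority rules are this article's own choices, not a library API):

```
import re
import chardet

# BOM signatures, longest first so UTF-32 is not mistaken for UTF-16.
BOMS = [
    (b'\x00\x00\xfe\xff', 'utf-32be'),
    (b'\xff\xfe\x00\x00', 'utf-32le'),
    (b'\xef\xbb\xbf', 'utf-8'),
    (b'\xfe\xff', 'utf-16be'),
    (b'\xff\xfe', 'utf-16le'),
]
META_RE = re.compile(br'charset\s*=\s*["\']?([\w-]+)', re.I)

def detect_encoding(resp):
    """Headers -> BOM -> in-document declaration -> chardet."""
    # 1. charset from the Content-Type header; skip ISO-8859-1, which is
    #    usually just the bare text/html fallback.
    if resp.encoding and resp.encoding.lower() != 'iso-8859-1':
        return resp.encoding
    body = resp.content
    # 2. byte-order mark at the very start of the document
    for bom, enc in BOMS:
        if body.startswith(bom):
            return enc
    # 3. encoding declared in the first couple of KB of the document itself
    match = META_RE.search(body[:2048])
    if match:
        return match.group(1).decode('ascii', 'replace')
    # 4. statistical detection over the body as a last resort
    return chardet.detect(body)['encoding'] or 'utf-8'

# usage:
# resp = requests.get(url)
# html = resp.content.decode(detect_encoding(resp), errors='replace')
```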
When calling the requests package, a simple workaround is:
```
if response.encoding == 'ISO-8859-1':
    response.encoding = response.apparent_encoding
text = response.text
```
Or borrow bs4's EncodingDetector, whose encodings property yields candidates in the priority order described above:
```
from bs4.dammit import EncodingDetector

detector = EncodingDetector(markup, override_encodings, is_html, exclude_encodings)
encoding = next(iter(detector.encodings), None)  # first candidate in priority order
```
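If you just want the decoded document rather than the list of candidates, bs4's UnicodeDammit (which drives EncodingDetector internally) may be more convenient; a small sketch, assuming resp is a requests response:

```
from bs4.dammit import UnicodeDammit

dammit = UnicodeDammit(resp.content, is_html=True)
print(dammit.original_encoding)   # the encoding bs4 settled on
html = dammit.unicode_markup      # the document decoded to unicode
```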