
Python crawler text is all garbled: adaptive encoding detection to fix garbled web pages

Published: 2024/1/23 | python | 豆豆

This post shares a Python crawler script that adapts to each page's encoding so that the scraped text is not garbled. It is offered here as a reference.


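# coding=utf-8
import re
import ssl
import urllib.parse
import urllib.request

import chardet  # character-set detection (third-party package)

# Skip SSL certificate verification. This relies on private ssl attributes and
# turns verification off for every HTTPS request in the process, so use it
# only for testing.
ssl._create_default_https_context = ssl._create_unverified_context

# Matches declarations such as charset=utf-8, charset="gbk" or charset: utf_8.
rr = re.compile(r"\bcharset[=:\"\s]{1,3}([-_A-Z0-9]+)", re.I)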

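def getCode(string):
    """Return the first charset name declared in string, or '' if there is none."""
    # "or ''" guards against None, e.g. a missing Content-Type header.
    p = rr.findall(string or '')
    if len(p) > 0:
        print('Encoding: ' + p[0])
        return p[0]
    print('No encoding declaration found')
    return ''

# Example call:
#getCode(r'iiifjjd charset:" utf_8iidi-oo">')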

def getHtml(url):
    headers = {
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
        'Referer': url
    }
    values = {
        'name': 'hao_hao',
        'ie': 'utf-8'
    }
    data = urllib.parse.urlencode(values)
    req = urllib.request.Request(url=url + '?' + data, headers=headers)
    #req = urllib.request.Request(url + '?' + data)
    response = urllib.request.urlopen(req)

    # Read the body once. The response is a stream, so calling read() after
    # readlines() would return empty bytes.
    the_page = response.read()

    # 1. Look for the encoding in the response headers.
    page = getCode(response.headers.get('Content-Type') or '')

    # 2. Look for the encoding in the page source.
    if page == '':
        # Charset declarations are ASCII, so decode leniently; a strict UTF-8
        # decode would raise on GBK pages.
        for line in the_page.splitlines():
            page = getCode(line.decode('ascii', 'ignore'))
            if page != '':
                break

    # 3. chardet content analysis. It reports the GBK page
    # https://mm.taobao.com/search_tstar_model.html as GB2312, so it is not
    # reliable. Use it only when the first two methods fail.
    if page == '':
        chardit1 = chardet.detect(the_page)
        page = chardit1['encoding'] or ''  # chardet may return None
        print('chardet detection\r\nEncoding: ' + page)

    # Print the response headers.
    print(response.headers)

    # Close the connection if needed.
    #response.close()

    # No encoding could be determined.
    if page == '':
        return ''
    return the_page.decode(page)  # Decode the bytes to str.
    #return the_page.decode(page).encode('utf-8')

print('===============================================')
# GBK page
html = getHtml("https://mm.taobao.com/search_tstar_model.html")
print(html)
print('===============================================')
# UTF-8 page
html = getHtml("http://kyfw.12306.cn/otn/leftTicket/init")
print(html)
print('===============================================')
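The same fallback order (response header first, then the page source) can also be tried without any network access, which makes it easier to test. The following is only a minimal sketch under several assumptions: the names `pick_encoding`, `decode_body` and `CHARSET_RE` are hypothetical and not part of the script above, the 4096-byte search window is arbitrary, and a fixed UTF-8 default stands in for the chardet step. It also promotes GB2312 and GBK to GB18030, a superset of both, which is one possible way around the GBK/GB2312 confusion noted in step 3.

```python
import codecs
import re
from email.message import Message

# Same pattern as the script above, compiled for bytes so that the body can be
# searched before its encoding is known.
CHARSET_RE = re.compile(rb'\bcharset[=:"\s]{1,3}([-_A-Z0-9]+)', re.I)


def pick_encoding(content_type, body, default='utf-8'):
    """Pick an encoding from the header, then the markup, then a default."""
    # Step 1: let the standard library parse the Content-Type header.
    msg = Message()
    if content_type:
        msg['Content-Type'] = content_type
    name = msg.get_content_charset()  # lower-cased charset name, or None

    # Step 2: search the first bytes of the body for a charset declaration.
    if not name:
        match = CHARSET_RE.search(body[:4096])
        name = match.group(1).decode('ascii').lower() if match else default

    # GB18030 is a superset of GBK and GB2312, so decoding with it also
    # covers pages that are labelled with the narrower name.
    if name in ('gb2312', 'gbk'):
        name = 'gb18030'

    # Fall back to the default if Python has no codec for the declared name.
    try:
        codecs.lookup(name)
    except LookupError:
        name = default
    return name


def decode_body(content_type, body):
    """Decode body to str; undecodable bytes become U+FFFD instead of raising."""
    return body.decode(pick_encoding(content_type, body), errors='replace')
```

With a response object from `urllib.request.urlopen`, one might call `decode_body(response.headers.get('Content-Type'), response.read())`. Whether a UTF-8 default is acceptable depends on the sites being crawled; chardet could still be used as the last step instead.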
