Python: Crawling Data and Generating a Chinese Word Cloud, Using Zhihu User Attributes as an Example
The code is as follows:
# -*- coding: utf-8 -*-
import requests
import pandas as pd
import time
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import jieba

header = {
    'authorization': 'Bearer 2|1:0|10:1515395885|4:z_c0|92:Mi4xOFQ0UEF3QUFBQUFBRU1LMElhcTVDeVlBQUFCZ0FsVk5MV2xBV3dDLVZPdEhYeGxaclFVeERfMjZvd3lOXzYzd1FB|39008996817966440159b3a15b5f921f7a22b5125eb5a88b37f58f3f459ff7f8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36',
    'X-UDID': 'ABDCtCGquQuPTtEPSOg35iwD-FA20zJg2ps=',
}

user_data = []

def get_user_data(page):
    for i in range(page):
        url = 'https://www.zhihu.com/api/v4/members/excited-vczh/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset={}&limit=20'.format(i*20)
        # response = requests.get(url, headers=header).text
        response = requests.get(url, headers=header).json()['data']  # ['data'] keeps only the data node of the JSON
        user_data.extend(response)
        print('Crawling page %s' % str(i+1))
        time.sleep(1)

if __name__ == '__main__':
    get_user_data(10)
    # pandas' from_dict() can turn the list of response records directly into a DataFrame
    # df = pd.DataFrame.from_dict(user_data)
    # df.to_csv('D:/PythonWorkSpace/TestData/zhihu/user2.csv')
    df = pd.DataFrame.from_dict(user_data).get('headline')
    df.to_csv('D:/PythonWorkSpace/TestData/zhihu/headline.txt')

    text_from_file_with_apath = open('D:/PythonWorkSpace/TestData/zhihu/headline.txt').read()
    wordlist_after_jieba = jieba.cut(text_from_file_with_apath, cut_all=True)
    wl_space_split = " ".join(wordlist_after_jieba)
    my_wordcloud = WordCloud().generate(wl_space_split)
    plt.imshow(my_wordcloud)
    plt.axis("off")
    plt.show()

Libraries that need to be installed first:
pip install matplotlib
pip install jieba
pip install wordcloud (this method turned out not to install successfully)
So I switched to another installation method: download the library from https://github.com/amueller/word_cloud, unzip it, go into the unzipped folder, hold Shift and right-click to open a command window there, and run the following command:
python setup.py install, but this also failed with an error.
Then I tried yet another installation method:
Go to http://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud and download the .whl file for the wordcloud module. After downloading, change into the directory where the file was saved and, as in the first method, run "pip install wordcloud-1.3.3-cp36-cp36m-win_amd64.whl". This time the installation succeeds.
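One thing to watch: the cp36 / win_amd64 parts of the wheel's file name have to match your interpreter (CPython 3.6, 64-bit build), or pip will refuse to install it. A quick way to check which wheel you need, as a small sketch:

import platform, struct
print(platform.python_version())   # e.g. '3.6.4' -> pick a cp36 wheel
print(struct.calcsize('P') * 8)    # 64 -> win_amd64, 32 -> win32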
The code to generate the word cloud is then as follows:
text_from_file_with_apath = open(r'D:\Python\zhihu\headline.txt', 'r', encoding='utf-8').read()
wordlist_after_jieba = jieba.cut(text_from_file_with_apath, cut_all=True)
wl_space_split = " ".join(wordlist_after_jieba)
my_wordcloud = WordCloud().generate(wl_space_split)
plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()

But the Chinese characters did not display, which was a real headache.
Instead it showed a bunch of colored boxes of various sizes. This is because the default FONT_PATH in wordcloud.py points to a font that cannot render Chinese.
After some careful digging I made an adjustment, and the Chinese finally displayed correctly.
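For reference, here is a minimal sketch of the kind of fix involved (not necessarily the exact change made here): pass a font that covers Chinese to WordCloud via its font_path parameter instead of editing FONT_PATH inside wordcloud.py. The path to simhei.ttf (the Windows SimHei font) is an assumption; substitute any Chinese-capable font available on your machine.

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Assumption: SimHei is installed at the usual Windows path; any font covering Chinese works.
font = 'C:/Windows/Fonts/simhei.ttf'

# wl_space_split is the jieba-segmented, space-joined text from the snippet above.
my_wordcloud = WordCloud(font_path=font, background_color='white',
                         width=800, height=600).generate(wl_space_split)
plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()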
Pitfall:
When reading files, Python often throws errors like this one (Python 3.4): UnicodeDecodeError: 'gbk' codec can't decode byte 0xff in position 0: illegal multibyte sequence
import codecs, sys
f = codecs.open("***.txt", "r", "utf-8")
Explicitly specifying the encoding when opening the file makes the error go away.
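Equivalently, Python 3's built-in open accepts an encoding argument, so codecs is not strictly required. Applied to the crawl script above, writing headline.txt yourself with an explicit utf-8 encoding sidesteps the GBK error on read and also avoids the row-index numbers that df.to_csv adds to the file. A minimal sketch, assuming each record in user_data carries a 'headline' field (which the script already relies on):

# Write only the headline text, with an explicit encoding (no pandas index column).
headlines = [u.get('headline', '') for u in user_data]
with open('D:/PythonWorkSpace/TestData/zhihu/headline.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(h for h in headlines if h))

# Read it back the same way; encoding='utf-8' avoids the default 'gbk' codec on Windows.
with open('D:/PythonWorkSpace/TestData/zhihu/headline.txt', 'r', encoding='utf-8') as f:
    text_from_file_with_apath = f.read()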
Reposted from: https://www.cnblogs.com/PeterZhang1520389703/p/8244633.html