Python Crawler: Scraping WeChat Official Account Topic-Tag Content and Printing It to PDF
生活随笔
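Scraping WeChat Official Account content is rather awkward: the request parameters, especially the POST parameters, take time to work out. This post scrapes the articles listed under a topic tag and uses pdfkit to print the content to PDF.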
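As a rough sketch of the pdfkit step, assuming wkhtmltopdf is installed: the helper names `wrap_html` and `save_pdf` and the sample title are hypothetical, and only `pdfkit.configuration` and `pdfkit.from_string` are pdfkit's own calls. The wrapper declares UTF-8 so that Chinese text is not garbled in the PDF.

```python
from typing import Optional


def wrap_html(body: str) -> str:
    """Wrap an HTML fragment in a minimal page that declares UTF-8."""
    return f'<html><head><meta charset="UTF-8"></head><body>{body}</body></html>'


def save_pdf(body: str, out_path: str, wkhtmltopdf: Optional[str] = None) -> None:
    """Render an HTML fragment to a PDF file with pdfkit."""
    import pdfkit  # imported lazily so wrap_html works without pdfkit installed

    # Pass an explicit binary path only when wkhtmltopdf is not on PATH.
    config = pdfkit.configuration(wkhtmltopdf=wkhtmltopdf) if wkhtmltopdf else None
    pdfkit.from_string(wrap_html(body), out_path, configuration=config)


# Building the page needs no external tools; save_pdf would write it to disk.
page = wrap_html("<h1>Sample title</h1>\n<p>Sample body</p>")
```

Calling `save_pdf(fragment, "out.pdf")` would then produce the file, provided the wkhtmltopdf binary can be found.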
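Two versions are implemented here. The first requests the web page directly. The real address, i.e. the POST URL, also takes quite a few parameters, which I did not try, so only part of the content is retrieved and the result is not ideal. The second version uses a browser driven by Selenium (headless in principle, though the script as written opens a regular Chrome window) to visit the page, grab the page source, and parse it to get the desired content.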
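The first approach boils down to pulling article links out of the JavaScript data embedded in the homepage HTML. Here is a rough sketch on a made-up snippet: the `sample_html` content and the `extract_links` helper are assumptions for illustration, not the real page, and the link pattern is a simplified one. Unlike a `set`, `dict.fromkeys` removes duplicates while keeping the order in which links appear.

```python
import re

# Hypothetical stand-in for the homepage source; the real page is larger.
sample_html = '''
<script>
var data={"list":[{"title":"A","link":"https://example.com/a","cover":"x"},
{"title":"B","link":"https://example.com/b","cover":"y"},
{"title":"A","link":"https://example.com/a","cover":"x"}]};
if (data) { render(data); }
</script>
'''


def extract_links(html: str) -> list:
    """Return the unique article links found in the embedded data block."""
    # Isolate the text between "var data={" and the next "if".
    match = re.search(r'var data=\{(.+?)if', html, re.S)
    if match is None:
        return []
    links = re.findall(r'"link":"(.+?)"', match.group(1))
    # dict.fromkeys drops duplicates but preserves first-seen order.
    return list(dict.fromkeys(links))


links = extract_links(sample_html)
print(links)
```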
I am rather lazy these days, so the code is all old, ready-made stuff: copied, tweaked a little, and used as is!
Version 1:
# -*- coding: UTF-8 -*-
# Fetch WeChat Official Account content and print it to PDF
# by WeChat: huguo00289
# https://mp.weixin.qq.com/mp/homepage?__biz=MzA4NjQ3MDk4OA==&hid=5&sn=573b1b806f9ebf63171a56ee2936b883&devicetype=android-29&version=27001239&lang=zh_CN&nettype=WIFI&a=&session_us=gh_7d55ab2d943f&wx_header=1&fontScale=100&from=timeline&isappinstalled=0&scene=1&subscene=2&clicktime=1594602258&enterid=1594602258&ascene=14
import requests
from fake_useragent import UserAgent
import os, re
import pdfkit

# Path to the local wkhtmltopdf binary that pdfkit calls
confg = pdfkit.configuration(wkhtmltopdf=r'D:\wkhtmltox-0.12.5-1.mxe-cross-win64\wkhtmltox\bin\wkhtmltopdf.exe')


class Du():
    def __init__(self, furl):
        ua = UserAgent()
        self.headers = {"User-Agent": ua.random}
        self.url = furl

    def get_urls(self):
        # Article links sit in a JavaScript "var data={...}" block on the homepage
        response = requests.get(self.url, headers=self.headers, timeout=8)
        html = response.content.decode('utf-8')
        req = re.findall(r'var data={(.+?)if', html, re.S)[0]
        urls = re.findall(r',"link":"(.+?)",', req, re.S)
        urls = set(urls)  # remove duplicate links
        print(len(urls))
        return urls

    def get_content(self, url, category):
        response = requests.get(url, headers=self.headers, timeout=8)
        print(response.status_code)
        html = response.content.decode('utf-8')
        req = re.findall(r'<div id="img-content" class="rich_media_wrp">(.+?)var first_sceen__time', html, re.S)[0]
        # Get the title
        h1 = re.findall(r'<h2 class="rich_media_title" id="activity-name">(.+?)</h2>', req, re.S)[0]
        h1 = h1.strip()
        pattern = r"[\/\\\:\*\?\"\<\>\|]"
        h1 = re.sub(pattern, "_", h1)  # replace characters not allowed in file names with underscores
        print(h1)
        # Get the article body
        detail = re.findall(r'<div class="rich_media_content " id="js_content" style="visibility: hidden;">(.+?)<script nonce=".+?" type="text/javascript">', req, re.S)[0]
        data = f'<h1>{h1}</h1>\n{detail}'
        self.dypdf(h1, data, category)
        return data

    def dypdf(self, h1, data, category):
        # Wrap the fragment in a UTF-8 page and print it to <category>/<title>.pdf
        datas = f'<html><head><meta charset="UTF-8"></head><body>{data}</body></html>'
        print("Printing content to PDF...")
        pdfkit.from_string(datas, f'{category}/{h1}.pdf', configuration=confg)
        print("PDF saved successfully!")


if __name__ == '__main__':
    furl = "https://mp.weixin.qq.com/mp/homepage?__biz=MzA4NjQ3MDk4OA==&hid=5&sn=573b1b806f9ebf63171a56ee2936b883&devicetype=android-29&version=27001239&lang=zh_CN&nettype=WIFI&a=&session_us=gh_7d55ab2d943f&wx_header=1&fontScale=100&from=timeline&isappinstalled=0&scene=1&subscene=2&clicktime=1594602258&enterid=1594602258&ascene=14"
    category = "潘通色卡(電子版)"  # topic tag name, also used as the output folder
    datas = ''
    os.makedirs(f'{category}/', exist_ok=True)
    spider = Du(furl)
    urls = spider.get_urls()
    for url in urls:
        print(f">> Crawling link: {url} ..")
        try:
            data = spider.get_content(url, category)
        except Exception as e:
            print(f"Crawl failed, error: {e}")
            continue  # skip this article so stale or undefined data is not appended
        datas = '%s%s%s' % (datas, '\n', data)
    # Print all articles together into one combined PDF
    spider.dypdf(category, datas, category)

Version 2:
# -*- coding: UTF-8 -*-
# Fetch WeChat Official Account content and print it to PDF
# by WeChat: huguo00289
# https://mp.weixin.qq.com/mp/homepage?__biz=MzA4NjQ3MDk4OA==&hid=5&sn=573b1b806f9ebf63171a56ee2936b883&devicetype=android-29&version=27001239&lang=zh_CN&nettype=WIFI&a=&session_us=gh_7d55ab2d943f&wx_header=1&fontScale=100&from=timeline&isappinstalled=0&scene=1&subscene=2&clicktime=1594602258&enterid=1594602258&ascene=14
import requests
from selenium import webdriver
import os, re, time
import pdfkit
from bs4 import BeautifulSoup

# Path to the local wkhtmltopdf binary that pdfkit calls
confg = pdfkit.configuration(wkhtmltopdf=r'D:\wkhtmltox-0.12.5-1.mxe-cross-win64\wkhtmltox\bin\wkhtmltopdf.exe')


class wx():
    def __init__(self, furl):
        self.url = furl
        self.chrome_driver = r'C:\Users\Administrator\Desktop\chromedriver_win32\chromedriver.exe'  # location of chromedriver
        # Selenium 3 style calls here and in get_urls; Selenium 4 uses Service(...) and find_elements(By.XPATH, ...) instead
        self.browser = webdriver.Chrome(executable_path=self.chrome_driver)

    def get_urls(self):
        urls = []
        self.browser.get(self.url)
        hrefs = self.browser.find_elements_by_xpath("//div[@class='article_list']/a[@class='list_item js_post']")
        for href in hrefs:
            url = href.get_attribute('href')
            urls.append(url)
        print(len(urls))
        return urls

    def get_content(self, url, category):
        self.browser.get(url)
        time.sleep(5)  # give the page time to render
        # Use the driver's page_source attribute to get the page source
        pageSource = self.browser.page_source
        soup = BeautifulSoup(pageSource, 'lxml')
        # Get the title
        h1 = re.findall(r'<h2 class="rich_media_title" id="activity-name">(.+?)</h2>', pageSource, re.S)[0]
        h1 = h1.strip()
        pattern = r"[\/\\\:\*\?\"\<\>\|]"
        h1 = re.sub(pattern, "_", h1)  # replace characters not allowed in file names with underscores
        print(h1)
        # Get the article body
        detail = soup.find('div', class_="rich_media_content")
        detail = str(detail)
        # The "tap above to follow us" banner at the top of each article, removed from the output
        del_text = """<p class="" style="margin-top: -1px; max-width: 100%; font-family: 微軟雅黑; white-space: normal; min-height: 40px; visibility: visible; height: 40px; line-height: 40px; border-radius: 10px; text-align: center; box-shadow: rgb(190, 190, 190) 0px 3px 5px; color: rgb(255, 255, 255); box-sizing: border-box !important; word-wrap: break-word !important; background-image: none; background-attachment: scroll; background-color: rgb(245, 143, 198); background-position: 0% 0%; background-repeat: repeat;"><strong class="" style="max-width: 100%; box-sizing: border-box !important; word-wrap: break-word !important;"><span style="max-width: 100%; font-size: 14px; box-sizing: border-box !important; word-wrap: break-word !important;">↑ 點擊上方<span style="max-width: 100%; box-sizing: border-box !important; word-wrap: break-word !important;">“染整百科”</span>關注我們</span></strong></p>"""
        detail = detail.replace(del_text, '')
        data = f'<h1>{h1}</h1>\n{detail}'
        self.dypdf(h1, data, category)
        return data

    def dypdf(self, h1, data, category):
        # Wrap the fragment in a UTF-8 page and print it to <category>/<title>.pdf
        datas = f'<html><head><meta charset="UTF-8"></head><body>{data}</body></html>'
        print("Printing content to PDF...")
        pdfkit.from_string(datas, f'{category}/{h1}.pdf', configuration=confg)
        print("PDF saved successfully!")

    def quit(self):
        self.browser.quit()


if __name__ == '__main__':
    furl = "https://mp.weixin.qq.com/mp/homepage?__biz=MzA4NjQ3MDk4OA==&hid=5&sn=573b1b806f9ebf63171a56ee2936b883&devicetype=android-29&version=27001239&lang=zh_CN&nettype=WIFI&a=&session_us=gh_7d55ab2d943f&wx_header=1&fontScale=100&from=timeline&isappinstalled=0&scene=1&subscene=2&clicktime=1594602258&enterid=1594602258&ascene=14"
    category = "潘通色卡(電子版)"  # topic tag name, also used as the output folder
    datas = ''
    os.makedirs(f'{category}/', exist_ok=True)
    spider = wx(furl)
    urls = spider.get_urls()
    for url in urls:
        print(f">> Crawling link: {url} ..")
        try:
            data = spider.get_content(url, category)
        except Exception as e:
            print(f"Crawl failed, error: {e}")
            continue  # skip this article so stale or undefined data is not appended
        datas = '%s%s%s' % (datas, '\n', data)
    spider.quit()
    # Print all articles together into one combined PDF
    spider.dypdf(category, datas, category)

The code above is for reference only. If it looks like someone else's, then I definitely plagiarized it!
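Two details of the scripts are worth isolating: article titles are made safe as file names by replacing characters that Windows forbids, and a failed article is best skipped so that one bad page does not stop the run or pollute the combined output. A small sketch follows, where `safe_filename`, `collect`, and the `fake_fetch` stand-in are hypothetical names for illustration only:

```python
import re


def safe_filename(title: str) -> str:
    """Replace characters that are not allowed in Windows file names."""
    return re.sub(r'[/\\:*?"<>|]', "_", title.strip())


def collect(urls, fetch):
    """Fetch every URL, skip the ones that fail, and join the results."""
    parts = []
    for url in urls:
        try:
            parts.append(fetch(url))
        except Exception as exc:
            # Report the failure and move on to the next URL.
            print(f"skipped {url}: {exc}")
    return "\n".join(parts)


def fake_fetch(url):
    # Stand-in for a real page fetch; fails for one URL on purpose.
    if url.endswith("bad"):
        raise ValueError("no content")
    return f"<h1>{url}</h1>"


name = safe_filename('  What is "PANTONE"? A/B test  ')
combined = collect(["u1", "bad", "u2"], fake_fetch)
print(name)
print(combined)
```

Collecting the fragments in a list and joining once at the end also avoids rebuilding the combined string on every iteration.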
WeChat Official Account: 二爺記
Python source code and tools, shared from time to time