Scraping Dangdang Best-Selling Book Information
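Runtime environment: Python 3.6.0
Goal: a practice exercise that scrapes information about the popular books listed on Dangdang and stores it locally.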
import requests
import re
import threading
import json

base_url = 'http://bang.dangdang.com/books/fivestars/01.00.00.00.00.00-recent30-0-0-1-'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0'
}


def get_page(page):
    """
    Fetch one page of the Dangdang ranking.
    :param page: page number
    :return: page HTML, or None if the connection fails
    """
    try:
        url = base_url + str(page)
        # print(type(url))
        response = requests.get(url=url, headers=headers)
        return response.text
    except requests.ConnectionError as e:
        print('Error', e.args)
        return None


def pase_info(item):
    """
    Extract book information.
    :param item: page HTML
    :return: list of tuples, one per book
    """
    list_num = '<div class="list_num.*?">(.*?).</div>.*?'
    pic = '<div class="pic">.*?<a.*?>.*?<img src="(.*?)" alt=.*?>.*?</a>.*?</div>.*?'
    title = '<div class="name">.*?<a href=.*?title="(.*?)">.*?</a>.*?</div>.*?'
    biaosheng = '<div class="biaosheng">(.*?)<span>(.*?)</span></div>.*?'
    price = '<div class="price">.*?<p>.*?<span.*?class="price_n">(.*?)</span>.*?</p>.*?</div>.*?'
    # re.S lets '.' match newlines, so one pattern can span a whole <li> element
    pattern = re.compile('<li>.*?{}{}{}{}{}.*?</li>'.format(list_num, pic, title, biaosheng, price), re.S)
    items = re.findall(pattern, item)
    return items


def save_info(book):
    """
    Append one book record to a local file.
    :param book: book information
    :return: None
    """
    with open('當當圖書.txt', 'a+', encoding='utf-8') as f:
        # f.write(json.dumps(book, ensure_ascii=False))
        f.write(str(book))
        f.write('\n')


def main(each):
    """
    Scrape and store all book records on one page.
    :param each: page number
    :return: None
    """
    response = get_page(each)
    if response is None:
        # get_page returns None on a connection error; skip this page
        return
    book_info = pase_info(response)
    for book in book_info:
        # print(book)
        info = {
            'num': book[0],
            'pic': book[1],
            'price': book[5],
            'biaosheng': book[3] + book[4],
            'title': book[2]
        }
        # print(info)
        save_info(info)


if __name__ == '__main__':
    MIN_PAGE = 1
    MAX_PAGE = 25
    for each in range(MIN_PAGE, MAX_PAGE + 1):
        print('Page %s' % each)
        # th = threading.Thread(target=main, args=[each])
        # th.start()
        main(each)

Run result:
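P.S. I originally wanted to use multithreading, but the threads do not finish in page order, so the printed output came out unordered and so did the stored file.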
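One possible way around the ordering problem, sketched here rather than taken from the original script, is to fetch pages in a thread pool and write the results afterwards from the main thread. `ThreadPoolExecutor.map` from the standard library returns results in the order of its inputs, regardless of which thread finishes first. The names `fetch_all_in_order` and `fake_fetch` are hypothetical. `fake_fetch` stands in for a function that would call `get_page` and `pase_info`, so the sketch runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all_in_order(fetch, pages, max_workers=5):
    """Call fetch(page) concurrently and return the results in the order of pages."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in input order,
        # even when later pages finish before earlier ones.
        return list(executor.map(fetch, pages))


def fake_fetch(page):
    """Hypothetical stand-in for get_page + pase_info; later pages finish sooner."""
    time.sleep(0.01 * (5 - page))
    return 'page %d' % page


if __name__ == '__main__':
    # Results are gathered first, then handled sequentially in page order.
    for result in fetch_all_in_order(fake_fetch, range(1, 5)):
        print(result)
```

In the script above, `fetch` could be a small function that returns `pase_info(get_page(page))`, with `save_info` then called in a plain loop over the returned list, so only the network requests run in parallel and the file is written in page order.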