Multithreaded Crawling of News Titles and Links
This post shares, for reference, a multithreaded crawler that scrapes news titles and links from the cnblogs news pages.
News list pagination URL: https://news.cnblogs.com/n/page/10/ ; the last number in the URL is the page number.
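As a quick illustration of that URL pattern (not part of the original script), the per-page addresses can be generated like this:

# The crawl targets differ only in the trailing page number.
pages = ["https://news.cnblogs.com/n/page/{}/".format(i) for i in range(1, 11)]
print(pages[0])   # https://news.cnblogs.com/n/page/1/
print(pages[-1])  # https://news.cnblogs.com/n/page/10/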
from concurrent.futures import ThreadPoolExecutor
import threading
import time
from queue import Queue
import logging

import requests
from bs4 import BeautifulSoup

# Logging setup
FORMAT = "%(asctime)s %(threadName)s %(thread)d %(message)s"
logging.basicConfig(format=FORMAT, level=logging.INFO)

# Stop signal shared by all worker threads
event = threading.Event()

# URL prefix and User-Agent value
base_url = 'https://news.cnblogs.com'
page_path = '/n/page/'
ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'

# Queues
urls = Queue()     # URLs waiting to be crawled (a "crawled" queue is omitted)
htmls = Queue()    # full HTML of each fetched page; too large and noisy to persist
outputs = Queue()  # extracted data: the result output queue


# 1. Build the URLs to crawl; start is the first page, stop is the last
def create_urls(start, stop, step=1):
    for i in range(start, stop + 1, step):
        url = "{}{}{}/".format(base_url, page_path, i)
        # print(url)
        urls.put(url)  # put the generated URL on the to-crawl queue
    print('URL creation finished')

# create_urls(1, 10)   # build the URLs for page 1 through page 10
# print(urls.qsize())  # the queue size would be 10


# 2. Send a request for each URL and enqueue the response body
def crawler():  # runs in several threads
    while not event.is_set():
        try:
            url = urls.get(True, 1)  # block for at most 1 second
            response = requests.get(url, headers={'User-agent': ua})
            with response:
                html = response.text  # read the response body as text
                htmls.put(html)       # enqueue the page content
                print('url:', url)
        # also catches the queue.Empty raised on timeout
        except Exception as e:
            print(e)
            # logging.error(e)


# 3. Parse the pages and extract the useful data
def parse():
    while not event.is_set():
        try:
            html = htmls.get(True, 1)
            soup = BeautifulSoup(html, 'lxml')     # parse the HTML
            news = soup.select('h2.news_entry a')  # select the target tags
            for n in news:
                title = n.text
                ref = base_url + n.attrs.get('href')
                print('get_title:', title, 'get_ref:', ref)
                outputs.put((title, ref))  # enqueue the extracted title and link
        except Exception as e:
            print(e)
            # logging.error(e)


# 4. Persist: append the results to a file
def save(path):
    with open(path, 'a+', encoding='utf-8') as f:
        while not event.is_set():
            try:
                title, ref = outputs.get(True, 1)  # a (title, ref) tuple
                print('save_title:', title, 'save_ref:', ref)
                f.write('{}_{}\n'.format(title, ref))
                f.flush()  # flush so crawled results reach the file immediately
            except Exception as e:
                print(e)
                # logging.error(e)


# Start the threads from a pool (at most 10 workers)
executor = ThreadPoolExecutor(max_workers=10)
executor.submit(create_urls, 1, 10)  # seed URLs; URLs found by parse could be added later
executor.submit(parse)
executor.submit(save, 'news.txt')

for i in range(7):
    executor.submit(crawler)

while True:
    cmd = input('>>>')
    if cmd.strip() == 'q':  # type q in the console to stop the threads a second later
        event.set()
        executor.shutdown()
        print('closing')
        time.sleep(1)
        break
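One caveat about the workers above: the bare except Exception treats the one-second Queue.get timeout and genuine network failures identically, so real errors are easy to miss. Below is a minimal sketch of a crawler loop that separates the two; it reuses the event, urls, htmls, and ua names from the script above, and the 10-second request timeout is an assumption, not part of the original.

from queue import Empty

def crawler():
    while not event.is_set():
        try:
            url = urls.get(True, 1)  # wait up to 1 second for work
        except Empty:
            continue                 # queue idle: re-check the stop flag
        try:
            # timeout=10 is an assumed value, not from the original script
            response = requests.get(url, headers={'User-agent': ua}, timeout=10)
            response.raise_for_status()  # turn HTTP 4xx/5xx into exceptions
            htmls.put(response.text)
        except requests.RequestException as e:
            logging.error('fetch failed for %s: %s', url, e)

With this split, an empty queue is silently retried while fetch failures are logged with the offending URL, which makes the crawler much easier to debug.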
Reposted from: https://www.cnblogs.com/hongdanni/p/10573858.html