6. Fetching Global Times news by keyword: scraping a dynamic (Ajax) page

Published: 2024/3/7


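I. Background

A while ago I wrote a crawler to collect news from Sina, but Sina's news pages do not share a consistent layout and the articles themselves are of fairly poor quality. After some comparison, Global Times (环球时报) turned out to have somewhat better articles and a more uniform page format.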

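II. Worked example

1. Approach

Here we mainly collect the international news on Global Times.
International news URL: https://world.huanqiu.com/
Scraping the news takes three steps: parse the news links on the home page ----> parse the content behind each link ----> format the text and write it to a file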

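2. Parsing the news links

The usual routine: open the home page, inspect an element, and locate the element for one news item.

The element for the "Dominica" (多米尼克) story is located as follows:

The selector is "#recomend li a", but when it is used in code the element cannot be found, which suggests the page is loaded dynamically.
This can be verified the same way: open the page source and search for this story; it is not there.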
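The check described above can also be done in a few lines of code. The following is only a minimal sketch using the standard library: fetch_html and appears_in_static_html are hypothetical helper names that do not appear in the original code, and the User-Agent value is a placeholder.

```python
import urllib.request


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download the raw HTML of a page. No JavaScript is executed,
    so anything injected by Ajax after load will be absent."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def appears_in_static_html(html: str, headline: str) -> bool:
    """Return True if the headline text is present in the raw HTML."""
    return headline in html


# Example (requires network access):
#   html = fetch_html("https://world.huanqiu.com/")
#   appears_in_static_html(html, "<a headline visible in the browser>")
```

If a headline that is visible in the browser is missing from the raw HTML, the list is most likely filled in by a separate request, which is what the Network tab is used to find next.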

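In the Network tab of the developer tools it is easy to see that the page is loaded dynamically. The Preview pane shows a response holding the 20 news items of the current page, stored as JSON.

The next step is to assemble the news links.

The thing to watch in the request URL is offset, the number of items to skip. Loosely speaking it acts as a page number: offset=0 is page 1, offset=20 is page 2, and so on.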
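The page-to-offset arithmetic can be written down as a small sketch. The URL template is the one used in this article's code; offset_for_page and list_url are hypothetical helper names introduced here for illustration, and pages are counted from 1.

```python
PAGE_SIZE = 20  # the article reports 20 items per response

# URL template taken from the article; only offset and limit vary.
LIST_API = (
    "https://world.huanqiu.com/api/list?node=%22/e3pmh22ph/e3pmh2398%22,"
    "%22/e3pmh22ph/e3pmh26vv%22,%22/e3pmh22ph/e3pn6efsl%22,"
    "%22/e3pmh22ph/efp8fqe21%22&offset={offset}&limit={limit}"
)


def offset_for_page(page: int, page_size: int = PAGE_SIZE) -> int:
    """Convert a 1-based page number into the offset query parameter."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * page_size


def list_url(page: int) -> str:
    """Build the list-API URL for a 1-based page number."""
    return LIST_API.format(offset=offset_for_page(page), limit=PAGE_SIZE)
```

With this, list_url(1) requests offset=0 and list_url(2) requests offset=20, matching the mapping described above.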

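Next, work out the address of each news item. In the JSON data, aid is the name the article is stored under; for example, if aid is 42BY7TXsqIc, the article link is https://world.huanqiu.com/article/42BY7TXsqIc. title is the headline. These two fields are all we need from the JSON.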
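As an illustration of that mapping, here is a minimal sketch that turns one decoded JSON response into (title, link) pairs. It assumes the response is a dict with a "list" key holding items that carry "aid" and "title", as described above; extract_links is a hypothetical helper name, and the sample headline in the usage comment is made up.

```python
from typing import List, Tuple

ARTICLE_BASE = "https://world.huanqiu.com/article/"


def extract_links(payload: dict, keyword: str) -> List[Tuple[str, str]]:
    """Return (title, url) pairs for items whose title contains keyword.

    Items without an aid or title are skipped, and a title is kept only
    the first time it is seen.
    """
    seen = set()
    results = []
    for item in payload.get("list") or []:
        title = item.get("title")
        aid = item.get("aid")
        if not title or not aid:
            continue
        title = str(title)
        if keyword in title and title not in seen:
            seen.add(title)
            results.append((title, ARTICLE_BASE + str(aid)))
    return results


# Example with a made-up headline:
#   extract_links({"list": [{"aid": "42BY7TXsqIc", "title": "新冠 ..."}]}, "新冠")
```

Keeping the seen titles in a set avoids rescanning the result list for every item, which matters little at 20 items per page but keeps the loop simple.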

Below is the code for the link-parsing module.

# Collect the news links
import requests
from tqdm import tqdm

keyword = '新冠'  # only headlines containing this keyword are kept
count = int(input('Number of pages to crawl (1, 2, 3, ...): '))
news_title = []
news_url = []
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36 Edg/88.0.705.81",
    "Cookie": '''UM_distinctid=177f21c29bf10d-043ddbd82b4eef-7a667166-144000-177f21c29c039b; _ma_tk=hewd6oh2rs0o2mqybcrd3s60un56m46w; REPORT_UID_=cJ869QdbSP132oZ6591juDJYZZ8wK0SC; Hm_lvt_1fc983b4c305d209e7e05d96e713939f=1614674668,1614738118,1614738291,1614908983; CNZZDATA1000010102=1231572127-1614671036-https%253A%252F%252Fwww.huanqiu.com%252F%7C1614914069; Hm_lpvt_1fc983b4c305d209e7e05d96e713939f=1614914406''',
}
path = 'https://world.huanqiu.com/api/list?node=%22/e3pmh22ph/e3pmh2398%22,%22/e3pmh22ph/e3pmh26vv%22,%22/e3pmh22ph/e3pn6efsl%22,%22/e3pmh22ph/efp8fqe21%22&offset={0}&limit=20'

print('Collecting news links...')
for i in tqdm(range(0, count)):
    try:
        url = path.format(20 * i)  # offset advances by 20 per page
        res = requests.get(url, headers=headers)
    except requests.ConnectionError:  # the exception lives on requests, not on the response
        print('error')
        continue
    if res.content:
        items = res.json().get('list') or []
        for item in items:
            title = item.get('title')
            aid = item.get('aid')
            if keyword in str(title) and title not in news_title:  # keyword filter
                url = 'https://world.huanqiu.com/article/' + str(aid)
                news_title.append(title)
                news_url.append(url)
    else:
        print('no content')

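3. Parsing the news content

Parsing the content is straightforward: pick the element locations and loop over the links.

# Fetch the news content
import re
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

news_content = []
news_time = []
news_source = []
print('Fetching news content...')
for url in tqdm(news_url):
    req = requests.get(url)
    req.encoding = 'utf-8'
    soup = BeautifulSoup(req.text, 'lxml')
    # Publication time and source
    reg_source = soup.select('div.metadata-info')
    str_source = re.findall('<a href=".*">(.*)</a>', str(reg_source), re.S)
    news_source.append(str_source)
    # The tags that surrounded this capture group are missing from the source;
    # as written the pattern matches the whole string.
    str_time = re.findall('(.*)', str(reg_source), re.S)
    news_time.append(str_time)
    # Article body
    reg_content = soup.select('.l-con.clear')
    # The surrounding tags are missing from this pattern as well.
    str_data = re.findall('(.*?)', str(reg_content), re.S)  # re.S lets "." match newlines
    str_data = ''.join(str_data)  # join the matches into a single string
    # Strip the contents of unwanted tags; the tag parts of both patterns
    # are missing from the source and must be filled in from the page markup.
    pat_str1 = '.*'
    pat_str2 = '.*'  # strips datelines such as: 海外網3月5日電
    str_data = re.sub(pat_str1, '', str_data)
    str_data = re.sub(pat_str2, '', str_data)
    news_content.append(str_data)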
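For comparison, the body text can also be pulled out without regular expressions over HTML. The sketch below is not the author's method: it swaps in the standard library's html.parser to collect the text of paragraph elements, then removes a leading dateline of the form 海外網3月5日電. It assumes the article body is laid out as p elements, which should be checked against the real page; ParagraphText, DATELINE and article_text are hypothetical names.

```python
import re
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text inside <p> elements, one entry per paragraph.
    Text inside inline tags such as <i> or <a> is kept."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._buf).strip()
            if text:  # skip empty paragraphs
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


# A short outlet name followed by "<month>月<day>日電" at the very start.
DATELINE = re.compile(r"^\S{1,12}?\d{1,2}月\d{1,2}日[電电]\s*")


def article_text(html: str) -> str:
    """Return the paragraph text of an HTML fragment, dateline removed."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return DATELINE.sub("", "\n".join(parser.paragraphs), count=1)
```

In the crawler itself the same idea could be expressed with BeautifulSoup, which is already imported there; the standard-library parser is used here only to keep the sketch dependency-free.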

4. Saving the text

# Write to a txt file
print('Writing text...')
txtname = '0305.txt'
with open(txtname, 'w', encoding='utf-8') as f:  # the with block closes the file
    for i in range(len(news_title)):
        if news_content[i] == '':
            continue  # skip articles whose body came back empty
        f.writelines(news_title[i])
        f.writelines('\n')
        f.writelines(news_source[i])
        f.writelines('\n')
        f.writelines(news_time[i])
        f.writelines('\n')
        f.writelines(news_content[i])
        f.write('\n\n\n')
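The same write step can be wrapped in a function so it is easy to try on sample data. This is a minimal sketch under the assumption that each article has already been reduced to four plain strings (title, source, time, content); write_articles is a hypothetical name, not part of the original code.

```python
from typing import Iterable, Tuple


def write_articles(path: str, articles: Iterable[Tuple[str, str, str, str]]) -> int:
    """Write (title, source, time, content) records to a UTF-8 text file.

    Records with empty content are skipped. Each record is followed by
    blank lines as a separator. Returns the number of records written.
    """
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for title, source, time_, content in articles:
            if not content:
                continue
            f.write(f"{title}\n{source}\n{time_}\n{content}\n\n\n")
            written += 1
    return written
```

With the lists built earlier, the records could be supplied as zip(news_title, news_source, news_time, news_content), provided the source and time entries are first converted from lists to strings.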

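III. Summary

The key to scraping a dynamic page is working out the news links: capture the traffic to find the request URL and the JSON payload, then pick the key fields out of it.

IV. Code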

import re

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

# Collect the news links
keyword = '新冠'  # only headlines containing this keyword are kept
count = int(input('Number of pages to crawl (1, 2, 3, ...): '))
news_title = []
news_url = []
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36 Edg/88.0.705.81",
    "Cookie": '''UM_distinctid=177f21c29bf10d-043ddbd82b4eef-7a667166-144000-177f21c29c039b; _ma_tk=hewd6oh2rs0o2mqybcrd3s60un56m46w; REPORT_UID_=cJ869QdbSP132oZ6591juDJYZZ8wK0SC; Hm_lvt_1fc983b4c305d209e7e05d96e713939f=1614674668,1614738118,1614738291,1614908983; CNZZDATA1000010102=1231572127-1614671036-https%253A%252F%252Fwww.huanqiu.com%252F%7C1614914069; Hm_lpvt_1fc983b4c305d209e7e05d96e713939f=1614914406''',
}
path = 'https://world.huanqiu.com/api/list?node=%22/e3pmh22ph/e3pmh2398%22,%22/e3pmh22ph/e3pmh26vv%22,%22/e3pmh22ph/e3pn6efsl%22,%22/e3pmh22ph/efp8fqe21%22&offset={0}&limit=20'

print('Collecting news links...')
for i in tqdm(range(0, count)):
    try:
        url = path.format(20 * i)  # offset advances by 20 per page
        res = requests.get(url, headers=headers)
    except requests.ConnectionError:  # the exception lives on requests, not on the response
        print('error')
        continue
    if res.content:
        items = res.json().get('list') or []
        for item in items:
            title = item.get('title')
            aid = item.get('aid')
            if keyword in str(title) and title not in news_title:  # keyword filter
                url = 'https://world.huanqiu.com/article/' + str(aid)
                news_title.append(title)
                news_url.append(url)
    else:
        print('no content')

# Fetch the news content
news_content = []
news_time = []
news_source = []
print('Fetching news content...')
for url in tqdm(news_url):
    req = requests.get(url)
    req.encoding = 'utf-8'
    soup = BeautifulSoup(req.text, 'lxml')
    # Publication time and source
    reg_source = soup.select('div.metadata-info')
    str_source = re.findall('<a href=".*">(.*)</a>', str(reg_source), re.S)
    news_source.append(str_source)
    # The tags that surrounded this capture group are missing from the source;
    # as written the pattern matches the whole string.
    str_time = re.findall('(.*)', str(reg_source), re.S)
    news_time.append(str_time)
    # Article body
    reg_content = soup.select('.l-con.clear')
    # The surrounding tags are missing from this pattern as well.
    str_data = re.findall('(.*?)', str(reg_content), re.S)  # re.S lets "." match newlines
    str_data = ''.join(str_data)  # join the matches into a single string
    # Strip the contents of unwanted tags; the tag parts of both patterns
    # are missing from the source and must be filled in from the page markup.
    pat_str1 = '.*'
    pat_str2 = '.*'  # strips datelines such as: 海外網3月5日電
    str_data = re.sub(pat_str1, '', str_data)
    str_data = re.sub(pat_str2, '', str_data)
    news_content.append(str_data)

# Write to a txt file
print('Writing text...')
txtname = '0305.txt'
with open(txtname, 'w', encoding='utf-8') as f:  # the with block closes the file
    for i in tqdm(range(len(news_title))):
        if news_content[i] == '':
            continue  # skip articles whose body came back empty
        f.writelines(news_title[i])
        f.writelines('\n')
        f.writelines(news_source[i])
        f.writelines('\n')
        f.writelines(news_time[i])
        f.writelines('\n')
        f.writelines(news_content[i])
        f.write('\n\n\n\n')
print('---done---')
