Web Scraping in Practice (2): Extracting Sina News Article Details

Published: 2023/12/31

This article shows how to extract the details of a Sina News article: title, time, source, body text, editor, and comment count.

import requests
from bs4 import BeautifulSoup

res = requests.get("https://news.sina.com.cn/s/2020-10-05/doc-iivhvpwz0482504.shtml")
res.encoding = 'utf-8'
# print(res.text)
soup = BeautifulSoup(res.text, 'html.parser')
print(soup.text)

1. Getting the title

soup.select(".main-title")[0].text  # the article title; which selector to use varies, so check the actual page

2. Time and source

2.1 Getting both together

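# The browser's developer tools show that the time and source live in the .date-source element
source = soup.select(".date-source")  # the element holding the article's time and source
print(source)
print('{:*^100}'.format('output'))
# Use the output above to work out how to reach the time; contents lists the children of the element
print(source[0].contents)
time0 = source[0].contents[1]
time = source[0].contents[1].contents[0]
print(time0, "\n", time)
print('{:*^100}'.format('output'))
# The time above is a str; when collecting data we want a datetime
from datetime import datetime
print(datetime.strptime(time, "%Y年%m月%d日 %H:%M"))  # convert the string to a datetime
# Get the source
print(source[0].contents[3].text)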

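Note: the intermediate output (boxed in red in the original screenshot) is printed only to work out how to reach the time and the source.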

2.2 Getting them separately

A quicker way to get the time and the source:

date = soup.select(".date")
datetime.strptime(date[0].text, "%Y年%m月%d日 %H:%M")
soup.select(".source")[0].text
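The time parsing can be wrapped in a small helper so that stray whitespace or an unexpected format does not stop the scraper. This is a minimal sketch: the name parse_sina_time and the sample strings are made up for illustration, and only the format string comes from the code above.

```python
from datetime import datetime

# Format used in this article; the literal 年/月/日 characters must match the page text.
SINA_TIME_FORMAT = "%Y年%m月%d日 %H:%M"


def parse_sina_time(text):
    """Return a datetime for text like '2020年10月05日 09:30', or None if it does not match."""
    try:
        return datetime.strptime(text.strip(), SINA_TIME_FORMAT)
    except ValueError:
        return None


# Sample inputs invented for illustration.
print(parse_sina_time("  2020年10月05日 09:30\n"))
print(parse_sina_time("no date here"))
```

Returning None instead of raising lets the caller decide whether a missing time should skip the article or just leave the field empty.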

3. Getting the body text and the editor

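# Join the text of every paragraph, dropping the <p> delimiters.
# \u3000\u3000 is leading full-width whitespace; strip() removes it.
# One-line equivalent: "".join([p.text.strip() for p in soup.select("#article p")[:-1]])
article = []
for p in soup.select("#article p")[:-1]:
    article.append(p.text.strip())
print(" ".join(article))  # paragraphs separated by a space; "\n" or another separator such as @ also works
print('{:*^100}'.format('editor'))
# The last paragraph holds the editor. The live page is in simplified Chinese, so the label is 责任编辑：
# replace() removes the label itself; strip('...') would treat it as a set of characters.
print(soup.select("#article p")[-1].text.replace('责任编辑：', '').strip())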
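One detail worth checking when removing the editor label: str.strip(chars) treats its argument as a set of characters to remove from both ends, not as a prefix. The sketch below shows the difference. The helper name remove_label, the editor name, and the exact label string are assumptions for illustration.

```python
def remove_label(text, label):
    """Remove label from the start of text only if text actually starts with it."""
    text = text.strip()
    if text.startswith(label):
        return text[len(label):].strip()
    return text


# Hypothetical editor line; the name is invented for illustration.
line = "责任编辑：任辑"

# strip() removes any of the listed characters from both ends,
# so a name made only of those characters disappears entirely.
print(repr(line.strip("责任编辑：")))

# Prefix-based removal keeps the name intact.
print(repr(remove_label(line, "责任编辑：")))
```

On Python 3.9 or later, str.removeprefix does the same job as the slicing in remove_label.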

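4. Getting the comment count

Steps for locating a piece of information:
First look under Doc in the developer tools. If it is not there, the data is not loaded synchronously with the page, so look for it under XHR and JS next.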

soup.select(".icon-comment")

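Note: this shows that the comment count is obtained some other way.
Next, go through the files under XHR and JS and search them exhaustively for the comment count.

Once the request containing the comment count is found, click Headers to get its URL.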

import requests

# The URL is long, so it is split across lines; a trailing backslash \ continues the string
URL = "https://comment.sina.com.cn/page/info?version=1&format=json\
&channel=sh&newsid=comos-ivhvpwz0482504&group=0&compress=0&ie=utf-8&oe=utf-8&page=1\
&page_size=3&t_size=3&h_size=3&thread=1&uid=unlogin_user&callback=jsonp_1601953986658&_=1601953986658"
comments = requests.get(URL)
print(comments.text)
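As an alternative to backslash continuations, the same URL can be assembled from a dictionary of query parameters with the standard library. This is a sketch, not the site's documented API: the parameter names and values are copied from the request URL above, while build_comment_url and COMMENT_ENDPOINT are names invented here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

COMMENT_ENDPOINT = "https://comment.sina.com.cn/page/info"


def build_comment_url(newsid, channel="sh"):
    """Build the comment URL for a news ID, using the parameters seen in the captured request."""
    params = {
        "version": 1,
        "format": "json",
        "channel": channel,
        "newsid": "comos-" + newsid,
        "group": 0,
        "compress": 0,
        "ie": "utf-8",
        "oe": "utf-8",
        "page": 1,
        "page_size": 3,
        "t_size": 3,
        "h_size": 3,
        "thread": 1,
        "uid": "unlogin_user",
        "callback": "jsonp_1601953986658",
        "_": "1601953986658",
    }
    return COMMENT_ENDPOINT + "?" + urlencode(params)


url = build_comment_url("ivhvpwz0482504")
print(url)
print(parse_qs(urlsplit(url).query)["newsid"])
```

Keeping the parameters in a dictionary makes it easy to see which value changes from article to article (here, only newsid).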

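Parse the response as JSON, but first remove the JSONP wrapper around it (the part underlined in red in the original screenshot).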

import json

jd = json.loads(comments.text.strip('jsonp_1601953986658').strip('()'))
jd
# Tip: browsing the same response in Chrome's developer tools is a faster way to explore jd
jd["result"]["count"]["total"]  # the comment count
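The unwrapping can also be written so that it does not depend on the exact callback name. The sketch below takes everything between the first "(" and the last ")". The function name unwrap_jsonp and the sample response are invented for illustration; only the result/count/total nesting comes from the article.

```python
import json


def unwrap_jsonp(text):
    """Return the parsed JSON found between the first '(' and the last ')' of a JSONP response."""
    start = text.find("(")
    end = text.rfind(")")
    if start == -1 or end == -1 or end < start:
        raise ValueError("not a JSONP response")
    return json.loads(text[start + 1:end])


# Made-up response body; the total of 12 is not real data.
sample = 'jsonp_1601953986658({"result": {"count": {"total": 12}}})'
jd = unwrap_jsonp(sample)
print(jd["result"]["count"]["total"])
```

Because the callback name is ignored, the same function keeps working if the callback parameter in the URL is changed.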

5. Getting the news ID

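# How to get the news ID
# Below is the address of the page the article is on
newsurl = "https://news.sina.com.cn/c/2018-11-09/doc-ihnprhzw5251381.shtml"
print(newsurl.split("/"))
print('{:*^100}'.format('output'))
# lstrip()/rstrip() remove any of the given characters rather than a fixed prefix or suffix,
# so slice the known "doc-i" prefix and ".shtml" suffix off the file name instead
filename = newsurl.split("/")[-1]
newsid = filename[len("doc-i"):-len(".shtml")]
print("news ID:", newsid)
print('{:*^100}'.format('output'))
# Get the news ID with a regular expression
import re
m = re.search(r"doc-i(.+)\.shtml", newsurl)  # returns a match object for the matched part
print(m.group(1))  # group(0) is the whole match; group(1) is the content of the parentheses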
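The regular-expression approach is the safer of the two because lstrip and rstrip take a set of characters, which can remove part of the ID itself. A small sketch using both URLs from this article (the helper name get_news_id is made up here):

```python
import re

# The dot is escaped so that it matches a literal "." before "shtml".
NEWS_ID_PATTERN = re.compile(r"doc-i(.+)\.shtml")


def get_news_id(newsurl):
    """Return the news ID embedded in a Sina article URL, or None if the URL does not match."""
    m = NEWS_ID_PATTERN.search(newsurl)
    return m.group(1) if m else None


first = "https://news.sina.com.cn/s/2020-10-05/doc-iivhvpwz0482504.shtml"
second = "https://news.sina.com.cn/c/2018-11-09/doc-ihnprhzw5251381.shtml"
print(get_news_id(first))
print(get_news_id(second))

# lstrip treats "doc-i" as the character set {d, o, c, -, i},
# so it also removes the leading "i" that belongs to the first ID.
print("doc-iivhvpwz0482504".lstrip("doc-i"))
```

The first ID comes out as ivhvpwz0482504, which matches the comos-ivhvpwz0482504 value in the comment URL, while the lstrip version loses its first character.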

6. Putting it all together

import json
import re
from datetime import datetime

import requests
from bs4 import BeautifulSoup

# The comment-count URL differs between articles only in the news ID
commentURL = "https://comment.sina.com.cn/page/info?version=1&format=json\
&channel=sh&newsid=comos-{}&group=0&compress=0&ie=utf-8&oe=utf-8&page=1\
&page_size=3&t_size=3&h_size=3&thread=1&uid=unlogin_user&callback=jsonp_1601953986658&_=1601953986658"


def getCommentCounts(newsurl):
    # Given an article URL, return its comment count
    m = re.search(r'doc-i(.+)\.shtml', newsurl)
    newsid = m.group(1)  # the news ID
    comments = requests.get(commentURL.format(newsid))
    jd = json.loads(comments.text.strip('jsonp_1601953986658').strip('()'))
    return jd["result"]["count"]["total"]


def getNewsDetail(newsurl):
    # Input: article URL; output: title, source, time, body text, editor, and comment count
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    result['title'] = soup.select(".main-title")[0].text
    result['newssource'] = soup.select(".source")[0].text
    timesource = soup.select(".date")[0].text
    result['dt'] = datetime.strptime(timesource, "%Y年%m月%d日 %H:%M")
    result['article'] = '\n'.join([p.text.strip() for p in soup.select("#article p")[:-1]])
    # The live page is in simplified Chinese, so the label is 责任编辑：
    # replace() removes the label itself; strip('...') would treat it as a set of characters
    result['editor'] = soup.select("#article p")[-1].text.replace('责任编辑：', '').strip()
    result['comments'] = getCommentCounts(newsurl)
    return result


news = "https://news.sina.com.cn/s/2020-10-05/doc-iivhvpwz0482504.shtml"
print(getNewsDetail(news))
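When getNewsDetail is run over several URLs, one page with a different layout raises an IndexError (a selector matched nothing) and stops the whole run. A possible wrapper is sketched below. collect_details and fake_fetch are hypothetical names, and fake_fetch stands in for getNewsDetail so the sketch runs without network access.

```python
def collect_details(urls, fetch):
    """Call fetch(url) for each URL and return (results, failures) instead of stopping on an error."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as exc:  # network errors, missing elements, malformed JSON, ...
            failures.append((url, repr(exc)))
    return results, failures


# Stand-in for getNewsDetail (hypothetical); the example.com URLs are placeholders.
def fake_fetch(url):
    if "missing" in url:
        raise IndexError("selector matched nothing")
    return {"url": url}


urls = ["https://example.com/a.shtml", "https://example.com/missing.shtml"]
results, failures = collect_details(urls, fake_fetch)
print(len(results), len(failures))
```

In real use, getNewsDetail would be passed as the fetch argument, and the failures list shows which pages need a different selector.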
