Scraping the Sina News homepage with Python: the first crawler from *Python Web Crawler in Action*
Install Anaconda first; the bundled Spyder IDE makes it easy to inspect variables after a run.
1. Open a cmd console and install the dependencies (the code below also uses the `lxml` parser, so install it as well):

```
pip install beautifulsoup4
pip install requests
pip install lxml
```
2. Write the code. It is short and direct, and should run as-is with a successful result, as long as Sina's page markup still matches the selectors below. `getNewsDetail` downloads one article page and extracts its fields with CSS selectors:

```python
import requests
from bs4 import BeautifulSoup
from datetime import datetime

def getNewsDetail(newsUrl):
    # Download the article page and decode it as UTF-8
    newsWeb = requests.get(newsUrl)
    newsWeb.encoding = 'utf-8'
    soup = BeautifulSoup(newsWeb.text, 'lxml')

    result = {}
    result['title'] = soup.select('.main-title')[0].text
    result['newsSource'] = soup.select('.source')[0].text
    # e.g. "2018年01月08日 14:23"
    timeSource = soup.select('.date')[0].text
    result['datetime'] = datetime.strptime(timeSource, '%Y年%m月%d日 %H:%M')
    result['article'] = soup.select('.article')[0].text
    # strip() drops the surrounding "責任編輯:" label characters, leaving the editor's name
    result['editor'] = soup.select('.show_author')[0].text.strip('責任編輯:')
    result['comment'] = soup.select('.num')[0].text
    return result
```
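The date string on the article page uses Chinese characters as separators, which `strptime` handles directly when they are embedded as literals in the format string. A minimal offline check, using a made-up sample string in the same layout as Sina's `.date` element:

```python
from datetime import datetime

# Hypothetical date string matching the layout of Sina's .date element
timeSource = '2018年01月08日 14:23'
parsed = datetime.strptime(timeSource, '%Y年%m月%d日 %H:%M')
print(parsed)  # 2018-01-08 14:23:00
```

If the page's date format ever changes, `strptime` raises a `ValueError`, which is a quick way to notice the markup has moved.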
`parseListLinks` calls the roll-list API, strips the JSONP callback wrapper from the response, and then runs `getNewsDetail` on every article URL in the list:

```python
import json
import requests

def parseListLinks(url):
    newsDetails = []
    # The API returns JSONP: " newsloadercallback({...});" — strip the wrapper
    request = requests.get(url)
    jsonLoad = json.loads(request.text.lstrip(' newsloadercallback(').rstrip(');'))

    newsUrls = []
    for item in jsonLoad['result']['data']:
        newsUrls.append(item['url'])
    for newsUrl in newsUrls:
        newsDetails.append(getNewsDetail(newsUrl))
    return newsDetails
```
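Note that `lstrip`/`rstrip` remove *characters from a set*, not a prefix or suffix: they only work here because stripping stops at `{` and `}`, which are not in the given character sets. Slicing between the first `(` and the last `)` is a less fragile alternative; a small offline sketch with a made-up payload shaped like the roll API's response:

```python
import json

# Hypothetical JSONP payload in the same shape as the roll API's response
raw = ' newsloadercallback({"result": {"data": [{"url": "http://example.com/a.shtml"}]}});'

# Take everything between the first '(' and the last ')', then parse as JSON
body = raw[raw.index('(') + 1 : raw.rindex(')')]
payload = json.loads(body)
urls = [item['url'] for item in payload['result']['data']]
print(urls)  # ['http://example.com/a.shtml']
```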
The entry point exercises both functions:

```python
if __name__ == '__main__':
    # Fetch the details of a single news page
    newsUrl = 'http://news.sina.com.cn/s/wh/2018-01-08/doc-ifyqkarr7830426.shtml'
    newsDetail = getNewsDetail(newsUrl)

    # Fetch the details of every news page in the roll list
    rollUrl = ('http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw'
               '&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&'
               'show_num=22&tag=1&format=json&page=23&callback=newsloadercallback&_=1515911333929')
    newsDetails = parseListLinks(rollUrl)
```
Summary

That covers the whole first crawler from *Python Web Crawler in Action*: fetching a single Sina News article, then every article in the roll list. Hopefully it helps with the problem you were trying to solve.