Python3 Crawler in Practice — Maoyan Movies TOP100 [requests, lxml, XPath, CSV]
- Crawl date: 2019-09-23
- Difficulty: ★☆☆☆☆
- Request URL: https://maoyan.com/board/4
- Goal: scrape the movie title, ranking, leading actors, release date, score, and cover image URL for each of the Maoyan Movies TOP100, and save the data as a CSV file
- Topics covered: the requests library, the lxml parsing library, XPath syntax, saving data to CSV files
- Complete code: https://github.com/TRHX/Python3-Spider-Practice/tree/master/BasicTraining/maoyan-top100
- More crawler practice code (continuously updated): https://github.com/TRHX/Python3-Spider-Practice
- Crawler practice column (continuously updated): https://itrhx.blog.csdn.net/article/category/9351278
Table of Contents
- 【1x00】Page-crawling loop module
- 【2x00】Parsing module
- 【3x00】Data storage module
- 【4x00】Complete code
- 【5x00】Data screenshot
【1x00】Page-crawling loop module
Looking at the Maoyan Movies TOP100 board, the request URL is: https://maoyan.com/board/4
Each page lists 10 movies. Paging through the board, the URL changes as follows:
Page 1: https://maoyan.com/board/4
Page 2: https://maoyan.com/board/4?offset=10
Page 3: https://maoyan.com/board/4?offset=20
There are 10 pages in total, so a for loop over range(0, 100, 10), which yields the offsets 0, 10, 20, ..., 90, appends each value to the URL and fetches every page in turn:
```python
import requests

# User-Agent header (also defined in the complete code below)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}


def index_page(number):
    # Request one page of the board; `number` is the offset (0, 10, ..., 90)
    url = 'https://maoyan.com/board/4?offset=%s' % number
    response = requests.get(url=url, headers=headers)
    return response.text


if __name__ == '__main__':
    for i in range(0, 100, 10):
        index = index_page(i)
```
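Not part of the original post, but a small safeguard worth considering: the site may return a non-200 status (for example when it rate-limits a client), in which case the parser below would be working on an error page. A minimal defensive sketch, assuming the same `headers` dict as above:

```python
def index_page(number):
    # Same request as above, but fail loudly on a non-200 response
    url = 'https://maoyan.com/board/4?offset=%s' % number
    response = requests.get(url=url, headers=headers, timeout=10)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text
```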
【2x00】Parsing module
Define a page-parsing function parse_page() that uses XPath expressions via the lxml parsing library to extract, in order, the movie ranking (ranking), movie title (movie_name), leading actors (performer), release date (releasetime), score (score), and cover image URL (movie_img).
The extracted leading-actor strings contain extra spaces and newline characters, so we loop over the performer list and call strip() on each item to remove the leading and trailing whitespace.
The score is split across an integer part and a fractional part in the page markup; we extract both and concatenate the matching elements to build the complete score.
Finally, zip() takes all the extracted sequences as arguments, packs their corresponding elements into one tuple per movie, and returns an iterator over those tuples (a toy illustration follows below).
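A toy illustration of the three cleanup steps just described, using made-up values shaped like what the XPath queries might return rather than real scraped data:

```python
# Hypothetical raw values (placeholders, not actual Maoyan data)
ranking = ['1']
movie_name = ['Some Movie']
performer = ['\n                主演:Actor A,Actor B\n            ']
score1 = ['9.']   # e.g. <i class="integer">9.</i>
score2 = ['5']    # e.g. <i class="fraction">5</i>

performer = [p.strip() for p in performer]   # -> ['主演:Actor A,Actor B']
score = [score1[i] + score2[i] for i in range(min(len(score1), len(score2)))]  # -> ['9.5']

# zip() pairs the i-th element of every list into one tuple per movie
print(list(zip(ranking, movie_name, performer, score)))
# [('1', 'Some Movie', '主演:Actor A,Actor B', '9.5')]
```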
```python
from lxml import etree


def parse_page(content):
    tree = etree.HTML(content)
    # Movie ranking
    ranking = tree.xpath("//dd/i/text()")
    # Movie title
    movie_name = tree.xpath('//p[@class="name"]/a/text()')
    # Leading actors: strip surrounding spaces and newlines
    performer = tree.xpath("//p[@class='star']/text()")
    performer = [p.strip() for p in performer]
    # Release date
    releasetime = tree.xpath('//p[@class="releasetime"]/text()')
    # Score: integer part + fractional part
    score1 = tree.xpath('//p[@class="score"]/i[@class="integer"]/text()')
    score2 = tree.xpath('//p[@class="score"]/i[@class="fraction"]/text()')
    score = [score1[i] + score2[i] for i in range(min(len(score1), len(score2)))]
    # Cover image URL (lazy-loaded, hence the data-src attribute)
    movie_img = tree.xpath('//img[@class="board-img"]/@data-src')
    return zip(ranking, movie_name, performer, releasetime, score, movie_img)
```
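One detail worth knowing, though the original post does not call it out: in Python 3, zip() returns a lazy, one-shot iterator, so the value parse_page() returns can only be walked once. The complete code in section 【4x00】 consumes it exactly once, which is fine; if you ever need several passes over the rows, materialize them first:

```python
rows = parse_page(index_page(0))   # a zip object: lazy and single-use
rows = list(rows)                  # materialize if more than one pass is needed
for row in rows:
    print(row)                     # (ranking, movie_name, performer, releasetime, score, movie_img)
```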
【3x00】Data storage module
Define a save_results() function that appends every record to a maoyan.csv file:
```python
import csv


def save_results(result):
    # Append one movie record (a tuple) as a row of maoyan.csv
    with open('maoyan.csv', 'a') as fp:
        writer = csv.writer(fp)
        writer.writerow(result)
```
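Two practical tweaks, not in the original code but commonly needed with the csv module: opening the file with newline='' prevents the blank rows that csv.writer otherwise produces on Windows, and an explicit encoding keeps the Chinese text readable in other tools. A hedged variant (the utf-8-sig choice is an assumption; pick whatever your viewer expects):

```python
import csv


def save_results(result):
    # newline='' lets csv.writer control line endings itself (avoids blank rows on Windows)
    # utf-8-sig writes a BOM so spreadsheet apps detect the UTF-8 encoding (assumption about your viewer)
    with open('maoyan.csv', 'a', newline='', encoding='utf-8-sig') as fp:
        writer = csv.writer(fp)
        writer.writerow(result)
```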
【4x00】Complete code
```python
# =============================================
# --*-- coding: utf-8 --*--
# @Time    : 2019-09-23
# @Author  : TRHX
# @Blog    : www.itrhx.com
# @CSDN    : https://blog.csdn.net/qq_36759224
# @FileName: maoyan.py
# @Software: PyCharm
# =============================================

import requests
from lxml import etree
import csv

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}


def index_page(number):
    url = 'https://maoyan.com/board/4?offset=%s' % number
    response = requests.get(url=url, headers=headers)
    return response.text


def parse_page(content):
    tree = etree.HTML(content)
    # Movie ranking
    ranking = tree.xpath("//dd/i/text()")
    # Movie title
    movie_name = tree.xpath('//p[@class="name"]/a/text()')
    # Leading actors
    performer = tree.xpath("//p[@class='star']/text()")
    performer = [p.strip() for p in performer]
    # Release date
    releasetime = tree.xpath('//p[@class="releasetime"]/text()')
    # Score: integer part + fractional part
    score1 = tree.xpath('//p[@class="score"]/i[@class="integer"]/text()')
    score2 = tree.xpath('//p[@class="score"]/i[@class="fraction"]/text()')
    score = [score1[i] + score2[i] for i in range(min(len(score1), len(score2)))]
    # Cover image URL
    movie_img = tree.xpath('//img[@class="board-img"]/@data-src')
    return zip(ranking, movie_name, performer, releasetime, score, movie_img)


def save_results(result):
    with open('maoyan.csv', 'a') as fp:
        writer = csv.writer(fp)
        writer.writerow(result)


if __name__ == '__main__':
    print('Start crawling...')
    for i in range(0, 100, 10):
        index = index_page(i)
        results = parse_page(index)
        for i in results:
            save_results(i)
    print('Crawling finished!')
```
【5x00】Data screenshot
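The original post ends with a screenshot of the resulting maoyan.csv, which is not reproduced here. As a text-only stand-in, a quick way to sanity-check the output file (assuming it was written with the platform default encoding, as in the code above):

```python
import csv

# Read the saved file back and report how many records it holds
with open('maoyan.csv', 'r') as fp:
    rows = list(csv.reader(fp))

print('rows saved:', len(rows))   # expect 100 if every page parsed cleanly
print('first row:', rows[0])      # (ranking, title, performer, releasetime, score, cover URL)
```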