
Python Crawler Example 1

Published: 2023/12/10  python  24  豆豆


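Scrape information from the Maoyan Movies TOP100 board (http://maoyan.com/board/4).

step1 Preparation

Goal:
Scrape the title, release time, score, and poster image of each movie on the Maoyan TOP100 board.

Analysis:
Page 1 URL: https://maoyan.com/board/4, which shows the movies ranked 1-10;
Page 2 URL: https://maoyan.com/board/4?offset=10, which shows the movies ranked 11-20;
…
Fetching the full TOP100 therefore takes 10 separate requests, with the offset parameter set to 0, 10, …, 90.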
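
The ten page URLs from the analysis can also be generated programmatically. The following is a minimal sketch, not part of the original code: the helper name build_page_urls is hypothetical, and it uses urllib.parse.urlencode from the standard library to build the query string.

```python
from urllib.parse import urlencode

BASE_URL = "https://maoyan.com/board/4"  # board URL from the analysis


def build_page_urls(pages=10, page_size=10):
    """Return one URL per page; offsets run 0, 10, ..., (pages - 1) * page_size."""
    # build_page_urls is a hypothetical helper used only for illustration.
    return [
        BASE_URL + "?" + urlencode({"offset": page * page_size})
        for page in range(pages)
    ]


if __name__ == "__main__":
    for url in build_page_urls():
        print(url)
```

In this sketch the first page is requested with an explicit offset=0, which matches the list of offsets given in the analysis.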

step2 Fetching the data

1. Fetch the source of the first page

import requests


def get_one_page(url):
    # Send a browser-like User-Agent header with the request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    html = response.text
    return html


def main():
    url = "https://maoyan.com/board/4"
    html = get_one_page(url)
    print(html)


if __name__ == '__main__':
    main()

2. Extracting the information with a regular expression

Each movie corresponds to one dd node:

<dd>
    <i class="board-index board-index-1">1</i>
    <a href="/films/1200486" title="我不是藥神" class="image-link" data-act="boarditem-click" data-val="{movieId:1200486}">
        <img src="//s3plus.meituan.net/v1/mss_e2821d7f0cfe4ac1bf9202ecf9590e67/cdn-prod/file:5788b470/image/loading_2.e3d934bf.png" alt="" class="poster-default" />
        <img data-src="https://p0.meituan.net/movie/414176cfa3fea8bed9b579e9f42766b9686649.jpg@160w_220h_1e_1c" alt="我不是藥神" class="board-img" />
    </a>
    <div class="board-item-main">
        <div class="board-item-content">
            <div class="movie-item-info">
                <p class="name"><a href="/films/1200486" title="我不是藥神" data-act="boarditem-click" data-val="{movieId:1200486}">我不是藥神</a></p>
                <p class="star">主演：徐崢,周一圍,王傳君</p>
                <p class="releasetime">上映時間：2018-07-05</p>
            </div>
            <div class="movie-item-number score-num">
                <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
            </div>
        </div>
    </div>
</dd>

The regular expression used to extract the information:

<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>

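The seven capture groups match, in order: rank, image URL, title, actors, release time, the integer part of the score, and the fractional part of the score.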
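
To see what the seven groups capture, the pattern can be run against a single dd node. The sketch below uses a trimmed copy of the sample node (several attributes are shortened for readability, which is an assumption of this example and not what the live page returns); the names SAMPLE, PATTERN, and matches are illustrative only.

```python
import re

# Trimmed copy of the sample <dd> node (attributes shortened for illustration).
SAMPLE = '''<dd>
<i class="board-index board-index-1">1</i>
<a href="/films/1200486" class="image-link">
<img data-src="https://p0.meituan.net/movie/414176cfa3fea8bed9b579e9f42766b9686649.jpg@160w_220h_1e_1c" alt="我不是藥神" class="board-img" />
</a>
<p class="name"><a href="/films/1200486" title="我不是藥神">我不是藥神</a></p>
<p class="star">主演：徐崢,周一圍,王傳君</p>
<p class="releasetime">上映時間：2018-07-05</p>
<p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>'''

PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
    r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
    re.S,  # let "." match newlines, since a <dd> node spans several lines
)

# findall returns one tuple per <dd> node, with one string per capture group.
matches = PATTERN.findall(SAMPLE)

if __name__ == "__main__":
    for group_number, value in enumerate(matches[0], start=1):
        print(group_number, repr(value))
```

Note that the actor and release-time groups still carry their label prefixes at this stage; removing them is part of the processing step.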

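3. Processing the extracted data

The raw matches are messy, so they need some cleanup.
We iterate over the matches and build a dictionary for each one, which gives structured data.

Two language features are needed:
(1) yield: roughly speaking, yield returns a value the way return does, but it remembers where it left off, so the next iteration resumes right after that point
(2) strip: removes leading and trailing whitespace from a string

Updated code: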
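
Both features can be tried in isolation. This is a small sketch for illustration: the countdown generator and the raw_actor / raw_time strings (including their surrounding whitespace) are made-up examples, not data taken from the site.

```python
def countdown(n):
    """Generator: each yield hands back one value and pauses at that point."""
    while n > 0:
        yield n
        n -= 1


gen = countdown(3)
first = next(gen)   # runs the body until the first yield
rest = list(gen)    # resumes right after that yield and collects what is left

# strip() removes leading/trailing whitespace (spaces, tabs, newlines);
# the slice then drops the fixed-length label in front of the value.
raw_actor = "\n   主演：徐崢,周一圍,王傳君\n   "
actor = raw_actor.strip()[3:]         # "主演：" is 3 characters long

raw_time = " 上映時間：2018-07-05 "
release_time = raw_time.strip()[5:]   # "上映時間：" is 5 characters long

if __name__ == "__main__":
    print(first, rest)
    print(actor)
    print(release_time)
```

The fixed slice lengths only work while the labels keep those exact lengths, so a change in the page's wording would require adjusting them.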

import re


def parse_one_page(html):
    # Raw strings keep the backslash in \d intact; re.S lets "." match newlines.
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # drop the 3-character "主演：" prefix
            'time': item[4].strip()[5:],   # drop the 5-character "上映時間：" prefix
            'score': item[5] + item[6]     # e.g. "9." + "6"
        }

4. Scraping page by page
Passing the offset parameter in the URL is all that is needed.

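def main(offset):
    # The query parameter is "offset"; a misspelled name would be ignored
    # and every request would return the first page.
    url = "https://maoyan.com/board/4?offset=" + str(offset)
    html = get_one_page(url)
    for item in parse_one_page(html):
        print(item)


if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)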

step3 Writing to a file

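import json


def write_to_file(item):
    # Append one JSON object per line to result.txt.
    with open('result.txt', 'a', encoding='utf-8') as f:
        print(type(json.dumps(item)))  # debug output: json.dumps returns a str
        f.write(json.dumps(item, ensure_ascii=False) + '\n')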
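
The ensure_ascii=False argument is what keeps the Chinese text readable in the output file. The sketch below compares both settings and reads a file of one-JSON-object-per-line records back in; the sample item, the temporary file, and the read_results helper are assumptions made for illustration and are not part of the original code.

```python
import json
import os
import tempfile

# Sample record for illustration only.
item = {'index': '1', 'title': '我不是藥神', 'score': '9.6'}

escaped = json.dumps(item)                       # default: non-ASCII becomes \uXXXX escapes
readable = json.dumps(item, ensure_ascii=False)  # keeps the Chinese characters as they are


def read_results(path):
    """Read a file with one JSON object per line back into a list of dicts."""
    # read_results is a hypothetical helper, not part of the original code.
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Append two records to a temporary file the same way write_to_file does,
# then load them again.
path = os.path.join(tempfile.mkdtemp(), 'result.txt')
for _ in range(2):
    with open(path, 'a', encoding='utf-8') as f:
        f.write(readable + '\n')

records = read_results(path)

if __name__ == "__main__":
    print(escaped)
    print(readable)
    print(records)
```

Because the file is opened in append mode, running the crawler twice adds the records a second time; deleting result.txt between runs avoids duplicates.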

step4 Results

1. Complete code:

import json
import re
import time

import requests
from requests.exceptions import RequestException


def get_one_page(url):
    """Return the page HTML, or None if the request fails."""
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    """Yield one dict per movie found in the page HTML."""
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],
            'time': item[4].strip()[5:],
            'score': item[5] + item[6]
        }


def write_to_file(content):
    """Append one JSON object per line to result.txt."""
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:
        # The request failed; skip this page instead of passing None to re.findall.
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)


if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)  # pause between requests

2. The txt file
