Python Crawlers: Scraping Ajax Dynamically Loaded Data (Douban Movies / Tencent Careers)

Published: 2024/1/1


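Scraping dynamically loaded data: Ajax

Characteristics

1. The actual data does not appear in the page source (right-click -> View page source).
2. The data is loaded when you scroll the mouse wheel or perform some other action on the page.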
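
The first characteristic can be checked quickly in code. The following is a minimal sketch, assuming you already have the page source as a string; the function name `missing_from_source` and the sample HTML and values are made up for illustration.

```python
def missing_from_source(page_source, expected_values):
    """Return the expected values that do not appear in the static HTML."""
    return [value for value in expected_values if value not in page_source]


# Hypothetical static HTML: an empty container that JavaScript fills in later.
static_html = '<html><body><div id="movie-list"></div></body></html>'

# Values you can see in the rendered page but want to look for in the source.
missing = missing_from_source(static_html, ["Example Movie", "9.1"])
print(missing)
```

If values that are visible in the browser come back as missing, the page is probably loading them asynchronously, and the XHR requests are the place to look.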

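Scraping approach

1. Press F12 to open the browser developer tools, then perform the page action and capture the network packets.
2. Find the URL of the JSON response.
   # In the developer tools, XHR lists the asynchronously loaded packets
   # XHR -> Query String Parameters shows the query parameters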
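
Once a request URL has been copied from the XHR tab, it helps to separate the base URL from its query parameters. This is a small sketch using only the standard library; the URL below is a hypothetical placeholder, not one taken from a real site.

```python
from urllib.parse import urlsplit, parse_qsl


def split_captured_url(captured_url):
    """Split a captured request URL into its base URL and a dict of query parameters."""
    parts = urlsplit(captured_url)
    base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # keep_blank_values=True keeps parameters that were sent with an empty value
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    return base_url, params


# Hypothetical URL copied from the XHR tab
base_url, params = split_captured_url(
    "https://example.com/api/list?page=1&size=10&keyword="
)
print(base_url)
print(params)
```

The resulting dictionary shows at a glance which parameters are candidates for changing from request to request.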

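Case study: scraping Douban movie data

Goal

1. Address: Douban Movies - Rankings - Drama
2. Target: movie title, movie rating

F12 packet capture (XHR)

1. Request URL (base URL): https://movie.douban.com/j/chart/top_list?
2. Query String (query parameters)
   # The captured query parameters are:
   type: 13
   interval_id: 100:90
   action: ''
   start: 0
   limit: the number of movies entered by the user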
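
To see how the captured parameters combine with the base URL, here is a minimal sketch built on the standard library. The helper names `build_params` and `build_url` are hypothetical, and treating `start` as an adjustable offset is an assumption; only the parameter names and values come from the capture above.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/j/chart/top_list"


def build_params(limit, start=0):
    """Return the captured query parameters; limit comes from the user."""
    return {
        "type": "13",
        "interval_id": "100:90",
        "action": "",
        "start": str(start),   # assumption: offset of the first movie returned
        "limit": str(int(limit)),  # int() rejects non-numeric input early
    }


def build_url(params):
    """Append the URL-encoded query string to the base URL."""
    return BASE_URL + "?" + urlencode(params)


url = build_url(build_params("5"))
print(url)
```

When using requests, the same dictionary can be passed as the `params` argument and the library performs this encoding itself.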

Using the json module

1. json.loads(JSON string): converts a JSON-formatted string into the corresponding Python data type
   # Example
   html = json.loads(res.text)
   print(type(html))
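
The conversion can be tried without making a request. The sketch below uses a made-up JSON string shaped like a list of movie records; the titles and scores are sample data, not real results.

```python
import json

# Made-up response text: a JSON array of objects with title and score fields
sample_text = '[{"title": "Example Movie A", "score": "9.1"}, {"title": "Example Movie B", "score": "8.7"}]'

movies = json.loads(sample_text)

# A JSON array becomes a Python list, and each JSON object becomes a dict
print(type(movies))
print(type(movies[0]))

# Scores arrive as strings in this sample, so convert them before comparing
scores = [float(movie["score"]) for movie in movies]
print(scores)
```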

Code implementation

import requests


class DoubanSpider(object):
    def __init__(self):
        self.url = 'https://movie.douban.com/j/chart/top_list?'
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'}

    # Fetch the page
    def get_page(self, params):
        res = requests.get(url=self.url, params=params, headers=self.headers, verify=True)
        res.encoding = 'utf-8'
        # res.json() converts the JSON body to Python objects, like json.loads(res.text)
        html = res.json()
        self.parse_page(html)

    # Parse and output the data
    def parse_page(self, html):
        # html is one big list: [{movie 1 info}, {}, {}]
        for h in html:
            # Title
            name = h['title'].strip()
            # Rating
            score = float(h['score'].strip())
            # Print to check the result
            print([name, score])

    # Entry point
    def main(self):
        limit = input('Enter the number of movies: ')
        params = {
            # Genre ID; use the value captured from the XHR query string for the chart you want
            'type': '24',
            'interval_id': '100:90',
            'action': '',
            'start': '0',
            'limit': limit,
        }
        # Call get_page with the query parameters
        self.get_page(params)


if __name__ == '__main__':
    spider = DoubanSpider()
    spider.main()

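Case study: Tencent Careers

URL and goal

1. URL: search Baidu for 騰訊招聘 (Tencent Careers) - view the job postings
2. Target: job title, job responsibilities, job requirements

F12 packet capture

First-level page JSON URL (pageIndex changes; the timestamp is not checked)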

https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1563912271089&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn

Second-level page URL (postId changes; it can be obtained from the first-level page)

https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1563912374645&postId={}&language=zh-cn
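
Both URLs are templates with one changing value each. The sketch below fills them in and also generates a fresh timestamp instead of reusing the captured one. Reading the timestamp as milliseconds since the epoch is an assumption based on its 13 digits, and the helper names and the sample post ID are hypothetical.

```python
import time

ONE_URL = (
    'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp={}'
    '&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId='
    '&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
)
TWO_URL = (
    'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp={}'
    '&postId={}&language=zh-cn'
)


def current_timestamp_ms():
    """Current time in milliseconds (assumed to be the format of the timestamp parameter)."""
    return int(time.time() * 1000)


def build_one_url(page_index):
    """First-level (listing) URL for the given page index."""
    return ONE_URL.format(current_timestamp_ms(), page_index)


def build_two_url(post_id):
    """Second-level (detail) URL for a PostId taken from the first-level response."""
    return TWO_URL.format(current_timestamp_ms(), post_id)


print(build_one_url(3))
print(build_two_url('example-post-id'))  # hypothetical post ID
```

Because the timestamp is not checked, keeping the captured value hard-coded also works; generating it is only a matter of tidiness.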

Full code implementation

import requests
import json
import time
import random


class TencentSpider(object):
    def __init__(self):
        self.headers = {'User-Agent': 'Mozilla/5.0'}
        self.one_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1563912271089&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'

    def get_page(self, url):
        res = requests.get(url, headers=self.headers)
        res.encoding = 'utf-8'
        # json.loads() converts the JSON-formatted string to Python data types
        html = json.loads(res.text)
        return html

    def parse_one_page(self, html):
        for job in html['Data']['Posts']:
            # Use a fresh dict for every job so records never share state
            job_info = {}
            job_info['job_name'] = job['RecruitPostName']
            job_info['job_address'] = job['LocationName']
            # PostId is needed to build the second-level page URL
            post_id = job['PostId']
            # Responsibilities and requirements live on the second-level page
            two_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1563912374645&postId={}&language=zh-cn'.format(post_id)
            # Request and parse the second-level page
            job_info['job_duty'], job_info['job_requirement'] = self.parse_two_page(two_url)
            print(job_info)

    def parse_two_page(self, two_url):
        html = self.get_page(two_url)
        # Responsibilities
        job_duty = html['Data']['Responsibility']
        # Requirements
        job_requirement = html['Data']['Requirement']
        return job_duty, job_requirement

    def main(self):
        # Crawl pages 1 to 10 of the listing
        for index in range(1, 11):
            url = self.one_url.format(index)
            one_html = self.get_page(url)
            self.parse_one_page(one_html)
            # Random pause between listing pages
            time.sleep(random.uniform(0.5, 1.5))


if __name__ == '__main__':
    spider = TencentSpider()
    spider.main()

Git repository: https://github.com/RyanLove1/spider_code

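Supplement:

Capturing packets in the developer tools

How to open them and commonly used options

1. Open the browser, press F12 to open the developer tools, and select the Network tab.
2. Commonly used options
   1. Network: captures network packets
      1. ALL: captures all network packets
      2. XHR: captures asynchronously loaded network packets
      3. JS: captures all JS files
   2. Sources: pretty-prints JavaScript code and lets you set breakpoints to debug it, which helps when analyzing parameters a crawler needs
   3. Console: interactive mode for testing JavaScript code
3. After capturing a specific network packet
   1. Click the packet's address on the left to open its details, shown on the right
   2. On the right:
      1. Headers: the full request information (General, Response Headers, Request Headers, Query String, Form Data)
      2. Preview: a preview of the response content
      3. Response: the response content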

Using regular expressions to process headers and form data

1. In PyCharm: press Ctrl + R and enable Regex
2. Process the headers and form data
   Search: (.*): (.*)
   Replace: "$1": "$2",
3. Click Replace All
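
The same replacement can be done in Python with the `re` module, which uses `\1` and `\2` where the editor uses `$1` and `$2`. This is a sketch rather than part of the original workflow: the function names are hypothetical and the header text is sample data.

```python
import re


def quote_header_lines(raw):
    """Turn 'Key: value' lines into '"Key": "value",' lines, as the editor replace does."""
    return re.sub(r'(.*): (.*)', r'"\1": "\2",', raw)


def headers_to_dict(raw):
    """Parse 'Key: value' lines straight into a dict."""
    # The lazy (.*?) stops at the first ': ' so values containing ': ' stay intact
    return dict(re.findall(r'^(.*?): (.*)$', raw, flags=re.MULTILINE))


# Sample text as it might be copied from the Request Headers panel
raw_headers = "Accept: */*\nUser-Agent: Mozilla/5.0"

print(quote_header_lines(raw_headers))
print(headers_to_dict(raw_headers))
```

The first function mirrors the editor's output for pasting into source code; the second skips the intermediate text and is handier inside a script.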
