
Scraping gaokao scores with Python

Published: 2025/3/21 python 17 豆豆


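Contents

F12 developer mode
How to find the relevant request in the Network panel
How to verify you have the right request
Simulating the request
Headers and anti-scraping checks
response.status_code
etree and xpath
- etree.HTML(text)
- html.xpath returns the contents of a tag
- .xpath('div[contains(@class, "sline")]')
Reference code
Simulating a Zhihu login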

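F12 developer mode

Open Google Chrome, press F12, and look at the Network panel.

How to find the relevant request in the Network panel

If the content only appears after you refresh with F5, the request is usually under the Doc filter.
If the content loads when you click a "load more" button, the request is under XHR.

How to verify you have the right request

Copy some text from the page and search for it with Ctrl+F in the Response tab. If the text is there, you have found the right request.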
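The same check can be done in code once a response body is in hand. The following is a minimal sketch, not part of the original article: the helper name body_contains and the sample markup are made up for illustration. It also shows why the encoding matters when the text you search for is Chinese.

```python
def body_contains(body, snippet, encoding='utf-8'):
    """Return True if snippet appears in the response body.

    body may be bytes (like response.content) or str (like response.text).
    """
    if isinstance(body, bytes):
        # Undecodable bytes become replacement characters instead of raising.
        text = body.decode(encoding, errors='replace')
    else:
        text = body
    return snippet in text


# Hypothetical response bodies, standing in for a real download.
utf8_body = '<div class="city">湖北</div>'.encode('utf-8')
gbk_body = '<div class="city">湖北</div>'.encode('gbk')

found_utf8 = body_contains(utf8_body, '湖北')
found_wrong_encoding = body_contains(gbk_body, '湖北')       # decoded as utf-8
found_right_encoding = body_contains(gbk_body, '湖北', 'gbk')
```

If the text is visible in the browser but the check fails in code, a wrong encoding is one likely cause. Another is that the text arrives in a different request than the one being examined.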

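Simulating the request

Find the URL, the request method, and the headers.

Copy everything from accept through user-agent. The cookie can be left empty in the program.

Capitalize the first letter of each header name and separate the entries with commas.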

import requests

# Headers copied from the browser's Network panel; the cookie is left empty.
requests_headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-encoding': 'gzip, deflate, br',
    'Accept-language': 'zh-CN,zh;q=0.9',
    'Cache-control': 'max-age=0',
    'Cookie': '',
    'Referer': 'https://www.zhihu.com/search?q=vczh&type=content',
    'Upgrade-insecure-requests': '1',
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
}
url = 'https://www.zhihu.com/topic/20004648/hot'
z = requests.get(url, headers=requests_headers)
print(z.content)
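Typing the dictionary by hand is error-prone, so one option is to paste the header text from the Network panel and convert it. The sketch below assumes the pasted text has one "name: value" pair per line. The helper name parse_raw_headers is hypothetical, and it capitalizes every hyphen-separated part of the name (User-Agent rather than User-agent), which is safe because HTTP header names are case-insensitive.

```python
def parse_raw_headers(raw):
    """Turn 'name: value' lines copied from DevTools into a headers dict."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'.
        if not line or line.startswith(':'):
            continue
        # Split on the first colon only, so URLs in the value stay intact.
        name, sep, value = line.partition(':')
        if not sep:
            continue
        name = '-'.join(part.capitalize() for part in name.strip().split('-'))
        headers[name] = value.strip()
    return headers


# Sample text, shortened from the headers used in this article.
raw_headers = """
:authority: www.zhihu.com
accept-language: zh-CN,zh;q=0.9
referer: https://www.zhihu.com/search?q=vczh&type=content
user-agent: Mozilla/5.0 (Windows NT 10.0; WOW64)
"""

parsed = parse_raw_headers(raw_headers)
```

The resulting dictionary can be passed as the headers argument of requests.get in the same way as a hand-written one.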

Headers and anti-scraping checks

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

response.status_code

A value of 200 means the server answered the request successfully.
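A small guard keeps this check in one place. This is a sketch rather than the article's own code: ensure_ok is a hypothetical helper, and SimpleNamespace objects stand in for real responses so the example runs without network access.

```python
from types import SimpleNamespace


def ensure_ok(response):
    """Return the response unchanged if its status code is 200, else raise."""
    if response.status_code != 200:
        raise RuntimeError(
            'unexpected HTTP status: {}'.format(response.status_code))
    return response


# Stand-in objects; a real requests.Response exposes the same
# status_code attribute.
ok = ensure_ok(SimpleNamespace(status_code=200))

try:
    ensure_ok(SimpleNamespace(status_code=404))
    failed = False
except RuntimeError:
    failed = True
```

With requests itself, response.raise_for_status() offers a similar built-in check that raises an HTTPError for 4xx and 5xx responses.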

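etree and xpath

etree makes it easy to pull the content you want out of the page source text.
etree comes from the lxml module.

etree.HTML(text)

Parses text, an HTML document held as a string, into an HTML element tree.

html.xpath returns the contents of a tag

html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/p/ul/li/a')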
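To see what xpath actually returns, here is a minimal, self-contained sketch. The markup is invented for illustration and wraps the list in a div, so the path is adjusted to match it; wb_data would normally be the downloaded page source.

```python
from lxml import etree

# Invented page source standing in for a downloaded document.
wb_data = """
<html><body><div><ul>
<li><a href="/a">First</a></li>
<li><a href="/b">Second</a></li>
</ul></div></body></html>
"""

html = etree.HTML(wb_data)

# Selecting elements returns a list of element objects.
links = html.xpath('/html/body/div/ul/li/a')
texts = [a.text for a in links]

# Ending the path with @href or text() returns plain strings instead.
hrefs = html.xpath('/html/body/div/ul/li/a/@href')
labels = html.xpath('/html/body/div/ul/li/a/text()')
```

Because xpath always returns a list, even for a single match, the reference code later in this article uses a small extract_first helper to take the first item safely.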


.xpath('div[contains(@class, "sline")]')

.xpath('div[contains(@class, "sline")]') matches any div whose class attribute contains sline; the class does not have to equal sline exactly.
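The difference from an exact match shows up as soon as an element carries more than one class. The sketch below uses invented markup (the clearfix and on class names are assumptions) that loosely imitates the structure the reference code expects.

```python
from lxml import etree

# Invented markup: the sline div carries a second class name.
sample = """
<div id="hub">
  <div class="sline clearfix">
    <div class="year on">2014年</div>
    <div class="year">2013年</div>
  </div>
  <div class="tline"></div>
</div>
"""

root = etree.HTML(sample)
city = root.xpath('//div[@id="hub"]')[0]

# An exact comparison fails because the attribute is "sline clearfix".
exact = city.xpath('div[@class="sline"]')

# contains() matches as long as "sline" appears in the attribute value.
partial = city.xpath('div[contains(@class, "sline")]')

years = [y.text for y in partial[0].xpath('div[contains(@class, "year")]')]
```

Note that contains() is a plain substring test, so it would also match an unrelated class name that happens to include the same letters.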

Reference code

import requests
import pandas as pd
from lxml import etree


def extract_first(selectors):
    # Return the first XPath match, or None when nothing matched.
    if len(selectors) <= 0:
        return None
    return selectors[0]


class EduScore:
    def __init__(self):
        self.score_url = 'http://www.eol.cn/html/g/fsx/index.shtml'

    def basic(self):
        # Collect the id and display name of every region block on the page.
        response = requests.get(self.score_url)
        response.encoding = 'utf-8'
        if response.status_code != 200:
            raise Exception('http status code not 200')
        html = response.text
        selector = etree.HTML(html)
        cities = selector.xpath('//div[@class="fsshowli"]')
        items = []
        for city in cities:
            items.append({
                'code': extract_first(city.xpath('@id')),
                'name': extract_first(city.xpath('div[@class="topline"]/div[@class="city"]/text()'))
            })
        return pd.DataFrame(items)

    def scores(self, code, year):
        # The page labels each year as, for example, "2014年".
        year = str(year) + '年'
        response = requests.get(self.score_url)
        if response.status_code != 200:
            raise Exception('http status code not 200')
        response.encoding = 'utf-8'
        html = response.text
        selector = etree.HTML(html)
        city = extract_first(selector.xpath('//div[@id="{}"]'.format(code)))
        if city is None:
            return None
        s_line = extract_first(city.xpath('div[contains(@class, "sline")]'))
        t_line = extract_first(city.xpath('div[contains(@class, "tline")]'))
        if s_line is None or t_line is None:
            return None
        years = []
        for x in s_line.xpath('div[contains(@class, "year")]'):
            y = extract_first(x.xpath('text()'))
            if y is None:
                continue
            years.append(y)
        if str(year) not in years:
            return None
        # The position of the year label selects the matching table.
        index = years.index(str(year))
        tables = t_line.xpath('div/table')
        if len(tables) < index + 1:
            return None
        table = tables[index]
        items = []
        for tr in table.xpath('tr'):
            items.append([
                extract_first(tr.xpath('td[1]/text()')).strip(),
                extract_first(tr.xpath('td[2]/text()')).strip(),
                extract_first(tr.xpath('td[3]/text()')).strip()
            ])
        # The first row holds the column names.
        return pd.DataFrame(items[1:], columns=items[0])


if __name__ == '__main__':
    edu_score = EduScore()
    print(edu_score.basic())
    print(edu_score.scores('hub', 2014))

Press F12 to open the browser's developer mode.

Simulating a Zhihu login
