Python Web Crawler Study Notes

Published: 2025/3/15, category: python, posted by 豆豆

Environment

Python 3 + pip environment setup

MongoDB, MySQL, and Redis environment setup

Installing the libraries commonly used for crawling

Basics

Basic principles

What a crawler is: an automated program that requests websites and extracts data from them.

Basic crawler workflow: send a request, receive the response content, parse the content, save the data.

>>> import requests
>>> response = requests.get('https://www.baidu.com')
>>> print(response.text)
<!DOCTYPE html>
...

What kinds of data to scrape: HTML documents, JSON text, images, video, and others.

Parsing methods: direct processing, JSON parsing, regular expressions, BeautifulSoup, PyQuery, XPath.

Handling JavaScript-rendered pages: analyze the AJAX requests, Splash, PyV8, Ghost.py.

How to save data: text (plain text, JSON, XML, etc.), relational databases (MySQL, Oracle, SQL Server), non-relational databases (MongoDB, Redis).
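As a minimal sketch of two of the storage options above, the following writes records to a text file as JSON lines and inserts the same records into a relational table. It uses only the standard library. Note the assumptions: sqlite3 stands in for MySQL so that the example is self-contained, and the record fields, table name, and file name are made up for illustration.

```python
import json
import os
import sqlite3
import tempfile

# Hypothetical records, standing in for data extracted by a crawler.
records = [
    {"title": "Example A", "score": 9.5},
    {"title": "Example B", "score": 8.7},
]


def save_as_json_lines(items, path):
    """Append one JSON object per line; ensure_ascii=False keeps non-ASCII text readable."""
    with open(path, "a", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def save_to_sqlite(items, conn):
    """Insert the records into a relational table using parameterized SQL."""
    conn.execute("CREATE TABLE IF NOT EXISTS movies (title TEXT, score REAL)")
    conn.executemany(
        "INSERT INTO movies (title, score) VALUES (:title, :score)", items
    )
    conn.commit()


# Text file storage (written to a temporary directory).
json_path = os.path.join(tempfile.mkdtemp(), "result.jsonl")
save_as_json_lines(records, json_path)
with open(json_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

# Relational storage (in-memory database, so nothing is left on disk).
conn = sqlite3.connect(":memory:")
save_to_sqlite(records, conn)
rows = conn.execute("SELECT title, score FROM movies ORDER BY score DESC").fetchall()
conn.close()

print(loaded)
print(rows)
```

Writing one JSON object per line keeps the file appendable, which suits a crawler that saves results page by page.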

Basic use of the urllib library

What urllib is: Python's built-in HTTP request library. It includes urllib.request (the request module), urllib.error (the exception handling module), and urllib.parse (the URL parsing module).
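Of the three modules, urllib.parse is the one that works without any network access, so it is easy to try out in isolation. The sketch below shows a few of its functions; the query parameters and the relative link are arbitrary values chosen for illustration.

```python
from urllib import parse

# urlencode turns a dict into a query string.
query = parse.urlencode({"name": "m0bu", "page": 2})

# urlparse splits a URL into its components.
parts = parse.urlparse("http://httpbin.org/get?" + query)

# parse_qs reverses urlencode; each value comes back as a list of strings.
params = parse.parse_qs(parts.query)

# urljoin resolves a relative link against the page it was found on.
absolute = parse.urljoin("http://httpbin.org/docs/index.html", "../get")

print(query)
print(parts.netloc, parts.path)
print(params)
print(absolute)
```

urljoin is the piece a crawler tends to need most, since links extracted from a page are often relative.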

Changes from Python 2

# Python 2
import urllib2
response = urllib2.urlopen('http://www.baidu.com')

# Python 3
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')

Usage in detail

""" Request """
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

""" POST request """
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {'User-Agent': 'tttt', 'Host': 'httpbin.org'}
form = {'name': 'm0bu'}
# The request body must be bytes, so URL-encode the form and then encode it
data = bytes(parse.urlencode(form), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
# Headers can also be added one at a time: req.add_header(name, value)
response = request.urlopen(req)
print(response.read().decode('utf-8'))

""" Exception handling """
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
except urllib.error.URLError as e:
    # e.reason holds the underlying cause; a timeout shows up as socket.timeout
    if isinstance(e.reason, socket.timeout):
        print('time out')

""" Proxy """
from urllib import request

# Placeholder addresses for a proxy running locally
proxy_handler = request.ProxyHandler({
    'http': 'http://127.0.0.1:1080',
    'https': 'https://127.0.0.1:1080',
})
opener = request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())

""" Cookie """
import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# The jar is filled in while the opener handles the response
for item in cookie:
    print(item.name + "=" + item.value)

Basic use of the Requests library

What Requests is: a simple, easy-to-use HTTP library implemented in Python.

""" GET request with parameters """
import requests

data = {'name': 'm0bu'}
response = requests.get("http://httpbin.org/get", params=data)
print(response.text)

""" Parsing JSON """
import requests
import json

response = requests.get("http://httpbin.org/get")
# response.json() is equivalent to json.loads(response.text)
print(response.json())
print(json.loads(response.text))

""" Binary data """
import requests

response = requests.get('https://github.com/favicon.ico')
print(response.content)
# The with block closes the file automatically
with open('favicon.ico', 'wb') as f:
    f.write(response.content)

""" Adding headers, POST request """
import requests

data = {"name": "m0bu"}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36"
}
response = requests.post('http://httpbin.org/post', headers=headers, data=data)
print(response.json())

""" Checking the status code """
import requests

response = requests.get('http://httpbin.org/')
if response.status_code == requests.codes.ok:
    print('200')
else:
    exit()

""" File upload """
import requests

with open('favicon.ico', 'rb') as f:
    files = {'file': f}
    response = requests.post("http://httpbin.org/post", files=files)
print(response.text)

""" Getting cookies """
import requests

r = requests.get("https://www.baidu.com")
print(r.cookies)
for key, value in r.cookies.items():
    print(key + '=' + value)

""" Session persistence """
import requests

# A Session keeps cookies between requests
s = requests.Session()
s.get("http://httpbin.org/cookies/set/number/12345678")
r = s.get('http://httpbin.org/cookies')
print(r.text)

""" Certificate verification """
import requests
import urllib3

# verify=False skips certificate checks; silence the resulting warning
urllib3.disable_warnings()
r = requests.get('https://www.12306.cn', verify=False)
print(r.status_code)

""" Proxy settings """
import requests

# Placeholder addresses for a proxy running locally
proxies = {
    "http": "http://127.0.0.1:1080",
    "https": "https://127.0.0.1:1080",
}
r = requests.get('http://httpbin.org/ip', proxies=proxies)
print(r.text)

""" Exception handling """
import requests
from requests.exceptions import ReadTimeout, HTTPError, RequestException

# Catch the specific exceptions first; RequestException is their base class
try:
    r = requests.get("http://httpbin.org/get", timeout=0.1)
    print(r.status_code)
except ReadTimeout:
    print("timeout")
except HTTPError:
    print('http error')
except RequestException:
    print('error')

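Regular expression basics

What a regular expression is: a logical formula for operating on strings. Predefined special characters, and combinations of them, form a "rule string" that expresses the filtering logic to apply to a string.

Regular expressions are not unique to Python; in Python they are implemented by the re module.

Prefer general (wildcard) matching, use parentheses to capture the target, prefer non-greedy mode, and use re.S when the text contains line breaks.

For convenience, prefer search over match, since match only matches at the start of the string; use group() to get the matched result.

re.findall searches the string and returns all matching substrings as a list.

re.compile compiles a regular expression string into a pattern object so that the pattern can be reused.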
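The tips above can be tried out on a small string. The following is a sketch only: the sample text and the HTML snippet are invented for illustration and do not come from any real site.

```python
import re

# Greedy vs. non-greedy: ".*" takes as much as it can, ".*?" as little as it can.
text = "Hello 1234567 World"
greedy = re.match(r"^He.*(\d+) World$", text).group(1)
non_greedy = re.match(r"^He.*?(\d+) World$", text).group(1)

# Hypothetical HTML snippet; the line break inside the first link is deliberate.
html = """<ul>
<li data-id="1"><a href="/a">First
item</a></li>
<li data-id="2"><a href="/b">Second item</a></li>
</ul>"""

# match only succeeds at the very start of the string; search scans all of it.
at_start = re.match(r"<li", html)
anywhere = re.search(r'<li data-id="(\d+)"', html)

# Compile once and reuse; parentheses capture the targets, re.S lets "." cross newlines.
item_pattern = re.compile(r'<li data-id="(\d+)"><a href="(.*?)">(.*?)</a>', re.S)
items = item_pattern.findall(html)

# The same pattern without re.S misses the entry that spans two lines.
items_without_s = re.findall(item_pattern.pattern, html)

print(greedy, non_greedy)
print(at_start, anywhere.group(1))
print(items)
print(items_without_s)
```

With the greedy pattern the group captures only the last digit, because ".*" has already consumed the rest of the number; this is the usual reason to prefer ".*?" in front of a capture group.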

""" Exercise """

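BeautifulSoup in detail

A flexible and convenient web page parsing library that is efficient and supports multiple parsers. With it you can extract information from web pages without writing regular expressions.

Parsers: the Python standard library (html.parser), the lxml HTML parser, the lxml XML parser, html5lib.

from bs4 import BeautifulSoup

# html is the page source as a string
soup = BeautifulSoup(html, 'lxml')

Use the lxml parser by default, and fall back to html.parser when necessary.

Selecting by tag name (for example soup.title) is fast but offers only weak filtering.

Use find() to query a single result and find_all() to query multiple results.

If you are familiar with CSS selectors, use select().

Remember the common ways of getting attributes and text values (for example tag['href'], tag.attrs, and tag.get_text()).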
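The recommendations above can be illustrated on an inline document. This is a minimal sketch with assumptions: the HTML is invented, and it uses html.parser rather than lxml only so that it runs without installing a second package (bs4 itself is still required).

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment.
html = """
<div class="panel">
  <h4 id="title">Movies</h4>
  <ul class="list">
    <li class="item"><a href="/film/1">First</a></li>
    <li class="item active"><a href="/film/2">Second</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, find_all() returns every match.
first_item = soup.find("li", class_="item")
all_items = soup.find_all("li", class_="item")

# select() takes a CSS selector and returns a list; select_one() returns one element.
links = soup.select("ul.list li.item a")
active_classes = soup.select_one("li.active")["class"]

# Text and attribute access.
title_text = soup.find(id="title").get_text()
hrefs = [a["href"] for a in links]

print(first_item.get_text(), len(all_items))
print(title_text, hrefs, active_classes)
```

The class attribute is multi-valued, so indexing it returns a list of class names rather than a single string.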

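PyQuery in detail

A powerful and flexible web page parsing library. If you are familiar with jQuery syntax, PyQuery is recommended.

from pyquery import PyQuery as pq

# Load the page from a URL, then query it with a jQuery-style selector
doc = pq(url='http://www.baidu.com')
print(doc('head'))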

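Selenium in detail

An automated testing tool that supports multiple browsers. In crawlers it is mainly used to solve the JavaScript rendering problem.

from selenium import webdriver

browser = webdriver.Chrome()
try:
    browser.get('https://www.taobao.com')
    # page_source is the HTML after JavaScript has run
    print(browser.page_source)
finally:
    # quit() closes the browser and ends the driver process
    browser.quit()

Official documentation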

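Hands-on

Scraping Maoyan movies with Requests + regular expressions

Target site analysis and workflow:

1. Fetch the content of a single page

2. Parse it with regular expressions

3. Save the results to a file

4. Loop over the pages and run them in parallel (multiprocessing's Pool uses processes rather than threads)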

import json
import re
import time
from multiprocessing import Pool

import requests
from requests.exceptions import RequestException


def get_one_page(url):
    """Fetch one page and return its HTML, or None on failure."""
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    """Yield one dict per movie entry found in the page."""
    pattern = re.compile(
        r'<dd>.*?board-index.*?(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # drop the 3-character label prefix
            'time': item[4].strip()[5:],   # drop the 5-character label prefix
            'score': item[5] + item[6],    # integer part + fraction part
        }


def write_to_file(content):
    """Append one result as a JSON line."""
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(offset):
    url = 'https://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:  # the request failed, nothing to parse
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)


if __name__ == "__main__":
    start = time.time()
    # One task per page: offsets 0, 10, ..., 90
    with Pool() as pool:
        pool.map(main, [i * 10 for i in range(10)])
    end = time.time()
    print(end - start)

Reposted from: https://www.cnblogs.com/skrr/p/11055821.html
