
Python practical training, day 4: crawler case 3, downloading Tieba images

Published: 2023/12/18 python 27 豆豆


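6 xpath

First, install Google's Chrome browser.

6.1 Installing the XPath plugin

Rename xpath_helper_2_0_2.crx to xpath_helper_2_0_2.rar and extract it.

In Chrome, open chrome://extensions/ and turn on developer mode (slide the switch to the right).

Click "Load unpacked".

Select the extracted xpath_helper_2_0_2 directory.

After installation, the XPath plugin's icon appears in the top-right corner of the browser.

Visit any other page, for example www.baidu.com.

Then click the XPath plugin button. Two panels appear: QUERY on the left and RESULT on the right.

Type a query expression (for example //div) into QUERY on the left, and the matching results are shown in RESULT on the right. The matching elements are highlighted in yellow on the page.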

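6.2 XPath syntax

Using Tieba as the example:

https://tieba.baidu.com/f?kw=%E5%8A%A8%E6%BC%AB&ie=utf-8&pn=50

1. Finding tags

a. Whole-document search (starts with //). Strictly speaking, // matches the tag at any depth in the document; a true absolute path starts with a single /.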

//div

//span

//a

b. Relative path (starts with ./), evaluated from the current node

./div

./span

./a

c. Child path (find a child under its parent; a // in the middle of a path matches descendants at any depth)

//div/span

//div/a

//ul/li/div/div/div/div/a

//ul/li//a
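The three path forms above can be tried outside the browser with lxml (the library this tutorial installs in section 7.2). The following is only a minimal sketch: the HTML fragment and its /p/1, /p/2 links are made up for illustration and are not real Tieba markup.

```python
from lxml import etree

# Hypothetical HTML fragment, invented for this illustration.
html = """
<div>
  <span>top</span>
  <ul>
    <li><div><a href="/p/1">first</a></div></li>
    <li><a href="/p/2">second</a></li>
  </ul>
</div>
"""

tree = etree.HTML(html)

# //a : every <a> at any depth in the document
all_links = tree.xpath('//a')

# //ul/li/a : only <a> elements that are direct children of <li>
direct_links = tree.xpath('//ul/li/a')

# //ul/li//a : <a> elements anywhere below <li>
nested_links = tree.xpath('//ul/li//a')

# ./li : relative path, evaluated from the <ul> node itself
ul = tree.xpath('//ul')[0]
items = ul.xpath('./li')

print(len(all_links), len(direct_links), len(nested_links), len(items))
```

A relative path only makes sense when it is run on a node (here ul), which is why section 7 calls xpath() on the parsed tree rather than on a string.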

2. Finding by attribute

Format:

tag[@attribute="value"]

For example:

//a[@class="j_th_tit"]

//div[@class="col2_left j_threadlist_li_left"]

//a[@class="frs-author-name j_user_card"]
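An attribute test with = compares the whole attribute string, so a class value with extra whitespace or several class names must be written exactly as it appears in the page. The sketch below illustrates this with lxml on a made-up fragment (the two links are hypothetical); contains() is shown as one possible, looser alternative, not as the method the original code uses.

```python
from lxml import etree

# Hypothetical fragment: the first class value ends with a space.
html = (
    '<a class="j_th_tit " href="/p/1">A</a>'
    '<a class="j_th_tit" href="/p/2">B</a>'
)
tree = etree.HTML(html)

# Exact comparison: only the attribute without the trailing space matches
exact = tree.xpath('//a[@class="j_th_tit"]/@href')

# Exact comparison including the trailing space
with_space = tree.xpath('//a[@class="j_th_tit "]/@href')

# Substring test: matches both (and would also match longer names
# that merely contain "j_th_tit")
tolerant = tree.xpath('//a[contains(@class, "j_th_tit")]/@href')

print(exact, with_space, tolerant)
```

If a rule that works in one tool returns an empty list in Python, comparing the rule character by character with the attribute in the page source is a good first check.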

3. Reading attributes

Format:

tag/@attribute

For example:

//a[@class="j_th_tit"]/@href

//img[@class="j_retract"]/@src

4. Getting text content

Format:

tag/text()

For example:

//a[@class="j_th_tit"]/text()

//div[@class="col2_left j_threadlist_li_left"]/text()

//a[@class="frs-author-name j_user_card"]/text()

Without text(), the result is the tag (an element).

With text(), the result is the text (a string).
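The difference between getting a tag, an attribute, and text can be seen in the result types lxml returns. This is a small sketch on an invented fragment (the /p/123 link and the "Hello" text are hypothetical):

```python
from lxml import etree

# Hypothetical fragment, invented for this illustration.
html = '<div><a class="j_th_tit" href="/p/123">Hello</a></div>'
tree = etree.HTML(html)

# No /@attr and no /text(): a list of element objects
elements = tree.xpath('//a[@class="j_th_tit"]')

# /@href: a list of attribute values (strings)
hrefs = tree.xpath('//a[@class="j_th_tit"]/@href')

# /text(): a list of text nodes (strings)
texts = tree.xpath('//a[@class="j_th_tit"]/text()')

first = elements[0]
print(first.tag, first.get('href'), first.text)
print(hrefs, texts)
```

xpath() always returns a list here, even when only one node matches, so the crawler code in section 7 loops over the result.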

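7 Crawler case 3: downloading Tieba images

Goals:

Visit a Tieba forum and find the link to every thread in it.

Follow each link into the thread and find the URL of every image in it.

Download the images.

This requires XPath rules.

7.1 Fetching the content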

    from urllib import request, parse
    import ssl
    import random

    # A list of common User-Agent strings
    ua_list = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; ) AppleWebKit/534.12 (KHTML, like Gecko) Maxthon/3.0 Safari/534.12',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.1 QQBrowser/6.9.11079.201',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
    ]

    # Load one page
    def loadPage(url):
        # Pick a random User-Agent from ua_list
        userAgent = random.choice(ua_list)
        headers = {'User-Agent': userAgent}
        # Build the request
        req = request.Request(url, headers=headers)
        # print(req)  # <urllib.request.Request object at 0x007B1370>
        # Create an SSL context that skips certificate verification
        context = ssl._create_unverified_context()
        # Open the response object
        response = request.urlopen(req, context=context)
        # print(response)  # <http.client.HTTPResponse object at 0x01F36BF0>
        # Read the response body (bytes)
        html = response.read()
        # Decode the UTF-8 bytes into a str
        content = html.decode('utf-8')
        return content

    if __name__ == '__main__':
        url = 'https://tieba.baidu.com/f?kw=%E5%8A%A8%E6%BC%AB&ie=utf-8&pn=50'
        content = loadPage(url)
        print(content)

7.2 Finding the thread links in the forum

To use XPath rules in Python, install the lxml library:

pip install lxml

Or, through the Douban mirror:

pip install lxml -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

Reference code:

    from lxml import etree

    # Load one page (the other imports and ua_list are the same as in 7.1)
    def loadPage(url):
        # Pick a random User-Agent from ua_list
        userAgent = random.choice(ua_list)
        headers = {'User-Agent': userAgent}
        # Build the request, sending the chosen User-Agent
        req = request.Request(url, headers=headers)
        # print(req)  # <urllib.request.Request object at 0x007B1370>
        # Create an SSL context that skips certificate verification
        context = ssl._create_unverified_context()
        # Open the response object
        response = request.urlopen(req, context=context)
        # print(response)  # <http.client.HTTPResponse object at 0x01F36BF0>
        # Read the response body (bytes)
        html = response.read()
        # Decode the UTF-8 bytes into a str
        content = html.decode('utf-8')
        # Build a document tree from the HTML with etree
        content = etree.HTML(content)
        # Note: the class value in this rule ends with a space
        link_list = content.xpath('//a[@class="j_th_tit "]/@href')
        for link in link_list:
            fulllink = 'https://tieba.baidu.com' + link
            print(fulllink)

At this point, the link to every thread is printed.

7.3 Finding the image links in a thread

    # Load the thread links from the forum page
    def loadPage(url):
        ...
        # Build a document tree from the HTML with etree
        content = etree.HTML(content)
        link_list = content.xpath('//a[@class="j_th_tit "]/@href')
        for link in link_list:
            fulllink = 'https://tieba.baidu.com' + link
            loadImage(fulllink)

    # Load the image links from one thread
    def loadImage(url):
        # Pick a random User-Agent from ua_list
        userAgent = random.choice(ua_list)
        headers = {'User-Agent': userAgent}
        # Build the request, sending the chosen User-Agent
        req = request.Request(url, headers=headers)
        # Create an SSL context that skips certificate verification
        context = ssl._create_unverified_context()
        # Open the response object
        response = request.urlopen(req, context=context)
        # Read the response body (bytes)
        html = response.read()
        # Decode the UTF-8 bytes into a str
        content = html.decode('utf-8')
        # Build a document tree from the HTML with etree
        content = etree.HTML(content)
        link_list = content.xpath('//img[@class="BDE_Image"]/@src')
        for link in link_list:
            print(link)

7.4 Saving the images to files

    # Load the image links from one thread
    def loadImage(url):
        ...
        # Build a document tree from the HTML with etree
        content = etree.HTML(content)
        link_list = content.xpath('//img[@class="BDE_Image"]/@src')
        for link in link_list:
            print(link)
            writeImage(link)

    # Download one image and save it locally
    def writeImage(url):
        # Pick a random User-Agent from ua_list
        userAgent = random.choice(ua_list)
        headers = {'User-Agent': userAgent}
        # Build the request, sending the chosen User-Agent
        req = request.Request(url, headers=headers)
        # Create an SSL context that skips certificate verification
        context = ssl._create_unverified_context()
        # Open the response object
        response = request.urlopen(req, context=context)
        # Read the response body (the raw image bytes)
        image = response.read()
        # Save the bytes to a file named after the last 10 characters of the URL
        filename = url[-10:]  # e.g. f57882.jpg
        with open(filename, 'wb') as f:
            f.write(image)
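Taking the last 10 characters of the URL works only while every image name has that shape; a URL with a query string or a shorter name would give an odd file name. One possible alternative, sketched here with the standard library only, is to take the last path segment of the URL. The function name filename_from_url, the fallback name, and the example.com URLs are all hypothetical.

```python
import posixpath
from urllib.parse import urlsplit, unquote


def filename_from_url(url, default='image.jpg'):
    """Return the last path segment of url, or default if there is none."""
    # urlsplit separates the path from the query string and fragment
    path = urlsplit(url).path
    # unquote turns %xx escapes back into characters before taking the basename
    name = posixpath.basename(unquote(path))
    return name or default


# Hypothetical URLs, for illustration only
print(filename_from_url('https://example.com/forum/pic/item/abcdef57882.jpg?size=big'))
print(filename_from_url('https://example.com/'))
```

In writeImage this could replace the url[-10:] line, with the rest of the function unchanged.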

7.5 User-supplied parameters

Code omitted; implement it yourself.
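As a starting point for this exercise, the forum URL can be built from a keyword and a page number with urllib.parse.urlencode. This is a sketch under assumptions: the function name tieba_url is made up, and the step of 50 per page is inferred only from the pn=50 in the sample URL above, so it should be checked against the real site.

```python
from urllib.parse import urlencode

BASE_URL = 'https://tieba.baidu.com/f'
PAGE_STEP = 50  # assumption: pn grows by 50 per page, as the sample URL suggests


def tieba_url(keyword, page):
    """Build the list-page URL for a forum keyword and a 1-based page number."""
    query = urlencode({
        'kw': keyword,          # urlencode percent-encodes the keyword as UTF-8
        'ie': 'utf-8',
        'pn': (page - 1) * PAGE_STEP,
    })
    return BASE_URL + '?' + query


# The keyword and page range would normally come from input()
for page in range(1, 4):
    print(tieba_url('动漫', page))
```

Each generated URL can then be passed to loadPage from section 7.3.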

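8 The selenium automation tool

8.1 Installing selenium

pip install selenium

8.2 Installing the PhantomJS headless browser

Extract the PhantomJS archive and add the path of its bin directory to the PATH environment variable.

Restart the cmd command line and run phantomjs; if the phantomjs prompt appears, the tool has been installed.

8.3 Visiting Baidu

https://www.baidu.com/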

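Write the following code (the same first steps as the script in 8.4):

    from selenium import webdriver
    driver = webdriver.PhantomJS()
    driver.get('https://www.baidu.com/')
    driver.save_screenshot('baidu1.png')

Now the screenshot of the Baidu home page, as loaded by the browser, is saved to baidu1.png.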

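8.4 Performing a search

    # Import the selenium tools
    from selenium import webdriver
    import time

    # This script targets Selenium 3: Selenium 4 dropped PhantomJS support and
    # later removed find_element_by_id in favour of find_element(By.ID, ...).

    # Load pages through a browser
    driver = webdriver.PhantomJS()
    # Open the page
    driver.get('https://www.baidu.com/')
    # Take a screenshot
    driver.save_screenshot('baidu1.png')

    # Find the search input box and type the keyword
    driver.find_element_by_id('kw').send_keys('古天乐')
    # Take a screenshot
    driver.save_screenshot('baidu2.png')
    # Find the button to click
    driver.find_element_by_id('su').click()
    # Wait 1 second
    time.sleep(1)
    # Take a screenshot
    driver.save_screenshot('baidu3.png')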

8.5 ChromeDriver

For Selenium to work with a mainstream browser, the driver for that browser must be installed.

For example, for Selenium to drive Chrome, the ChromeDriver driver must be installed.

Make sure the versions match.

For example, if Chrome is version 85.0, choose the 85.0 release of ChromeDriver as well.

Find and download the matching version at http://npm.taobao.org/mirrors/chromedriver/

After extracting, copy chromedriver.exe into the Chrome installation directory

C:\Program Files (x86)\Google\Chrome\Application

and add that path to the PATH environment variable.
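The version rule above (Chrome 85.0 needs ChromeDriver 85.0) can be expressed as a comparison of the leading parts of the two version strings. The helper below is a hypothetical illustration, not part of Selenium or ChromeDriver, and the version strings are example values; comparing the first two components is an assumption taken from the "85.0" wording above.

```python
def release_prefix(version, parts=2):
    """Return the first `parts` numeric components of a dotted version string."""
    return tuple(int(piece) for piece in version.split('.')[:parts])


def versions_match(chrome_version, driver_version, parts=2):
    """True when both versions share the same leading components (e.g. 85.0)."""
    return release_prefix(chrome_version, parts) == release_prefix(driver_version, parts)


# Example version strings, for illustration only
print(versions_match('85.0.4183.83', '85.0.4183.87'))
print(versions_match('85.0.4183.83', '86.0.4240.22'))
```

The Chrome version string can be read from the chrome://version page and compared with the name of the directory chosen on the download mirror.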

    # Import the selenium tools
    from selenium import webdriver

    # Load pages through a browser
    # driver = webdriver.PhantomJS()
    option = webdriver.ChromeOptions()
    option.add_argument('headless')
    driver = webdriver.Chrome(options=option)  # Chrome without a window (headless)
    # driver = webdriver.Chrome()              # Chrome with a window
    # Open the page
    driver.get('https://www.baidu.com/')

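Exercise

Crawl the instructor information from the official website of 粤嵌 (www.gec-edu.org).

It includes: instructor name, instructor position, instructor profile, and instructor photo.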

    import os
    import requests
    from lxml import etree

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0'
    }

    # Download one image and save it
    def down_load(imgpath, imgurl):
        response = requests.get(imgurl, headers=headers)
        # The with block closes the file automatically
        with open(imgpath, 'wb') as f:
            f.write(response.content)
            # shutil.copyfileobj(response.raw, f)

    # Extract the content of one page
    def pic_url(url, headers):
        pic_hxml = requests.get(url, headers=headers)
        html = etree.HTML(pic_hxml.text)
        pic1 = "//div[@class='teacher-img']/img[@class='lazy']/@src"
        name1 = "//div[@class='teacher-text']/h4/text()"
        job1 = "//div[@class='teacher-text']/h6/text()"
        inf1 = "//div[@class='teacher-text']/p/text()"
        pic2 = html.xpath(pic1)
        name2 = html.xpath(name1)
        job2 = html.xpath(job1)
        inf2 = html.xpath(inf1)
        for name, job, inf, pic in zip(name2, job2, inf2, pic2):
            imgurl = pic
            # Raw string, so the backslashes are not read as escape sequences
            path1 = os.path.abspath(r'E:\PYTHON_PROJECT\homework')
            img_name = name + '-' + job + '.png'
            imgpath = os.path.join(path1, img_name)
            print(img_name, 'http://www.gec-edu.org/' + imgurl, inf)
            down_load(imgpath, 'http://www.gec-edu.org/' + imgurl)

    # Paging: pages 1 to 6
    for i in range(1, 7, 1):
        url = 'http://www.gec-edu.org/teachers/' + str(i)
        pic_url(url, headers)
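The exercise code builds each image address by concatenating the site root with the src value. Whether src starts with a slash depends on the page, so a possible alternative is urllib.parse.urljoin, which handles both cases. This is a sketch; the upload paths and the example.com address are invented for illustration.

```python
from urllib.parse import urljoin

BASE = 'http://www.gec-edu.org/'

# Hypothetical src values, as they might appear in an <img> tag
relative_src = 'uploads/teacher1.png'
rooted_src = '/uploads/teacher1.png'
absolute_src = 'http://example.com/static/teacher1.png'

# urljoin resolves each value against the base URL
print(urljoin(BASE, relative_src))
print(urljoin(BASE, rooted_src))
print(urljoin(BASE, absolute_src))  # an absolute URL is returned unchanged
```

Plain string concatenation would produce a double slash for the second value and a broken address for the third.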
