Image Crawling, Data Parsing, and Data Persistence

Published: 2025/3/21 | Programming Q&A | 23 豆豆

This article, collected and organized by 生活随笔, covers image crawling, data parsing, and data persistence, and is shared here for reference.

Contents

1. Image Download
2. JS Dynamic Rendering
3. Data Parsing
4. Persistent Storage

1. Image Download

Baidu Images: http://image.baidu.com/
Sogou Images: https://pic.sogou.com/

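Image crawling steps:
1). Find the image download URL by inspecting the Elements and Network panels of the browser's developer tools.
2). Open the URL in the browser to verify it.
3). Write code to obtain the URL.
4). Request the URL and get the binary stream.
5). Write the binary stream to a file.

Baidu Images:

    import os
    import time

    import requests
    from lxml import etree
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Instantiate the browser object (Selenium 3 style; Selenium 4 expects
    # webdriver.Chrome(service=Service('./chromedriver.exe')) instead)
    browser = webdriver.Chrome('./chromedriver.exe')

    # Visit the page and drive its elements to get the search results
    browser.get('http://image.baidu.com/')
    input_tag = browser.find_element(By.ID, 'kw')
    input_tag.send_keys('熊二')
    search_button = browser.find_element(By.CLASS_NAME, 's_search')
    search_button.click()

    # Scroll down with JS so the page loads more results
    js = 'window.scrollTo(0, document.body.scrollHeight)'
    for times in range(3):
        browser.execute_script(js)
        time.sleep(3)
    html = browser.page_source

    # Parse the page source to get the image links
    tree = etree.HTML(html)
    url_list = tree.xpath('//div[@id="imgid"]/div/ul/li/@data-objurl')
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
    os.makedirs('./baidupics', exist_ok=True)  # open() fails if the directory is missing
    for img_url in url_list:
        # Skip links that contain 'token' before spending a request on them
        if 'token' in img_url:
            continue
        content = requests.get(url=img_url, headers=headers).content
        with open('./baidupics/%s' % img_url.split('/')[-1], 'wb') as f:
            f.write(content)

Sogou Images:

    import os
    import re

    import requests

    url = 'http://pic.sogou.com/pics?'
    params = {'query': '熊二'}
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
    res = requests.get(url=url, params=params).text

    # Pull the image links out of the response text with a regex
    url_list = re.findall(r',"(https://i\d+piccdn\.sogoucdn\.com/.*?)"]', res)
    os.makedirs('./sougoupics', exist_ok=True)
    for img_url in url_list:
        print(img_url)
        content = requests.get(url=img_url, headers=headers).content
        # Build the file name from the last URL segment
        name = img_url.split('/')[-1] + '.jpg'
        with open('./sougoupics/%s' % name, 'wb') as f:
            f.write(content)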
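Steps 4 and 5 above (request the URL, then write the binary stream to a file) can be factored into a small helper. The following is only a sketch under stated assumptions: the names filename_from_url and save_image are hypothetical, and the download itself is passed in as a fetch callable so the file handling can be tried without network access.

```python
import os
from urllib.parse import urlsplit


def filename_from_url(url, default_ext='.jpg'):
    """Derive a file name from the last path segment of a URL (hypothetical helper)."""
    # urlsplit drops the query string, so 'cat.png?x=1' becomes 'cat.png'
    name = os.path.basename(urlsplit(url).path) or 'image'
    _, ext = os.path.splitext(name)
    # Assumption: links without an extension are treated as JPEG images
    return name if ext else name + default_ext


def save_image(url, directory, fetch):
    """Fetch the bytes behind url and write them into directory.

    fetch is any callable that takes a URL and returns bytes.
    """
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename_from_url(url))
    with open(path, 'wb') as f:  # 'wb' because the content is a binary stream
        f.write(fetch(url))
    return path
```

With requests, fetch could be something like lambda u: requests.get(u, headers=headers).content, which matches the download calls used in this section.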

2. JS Dynamic Rendering

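1). Crawling with selenium: selenium is a testing framework that fully simulates a person operating the browser. The attribute to remember is page_source.
2). Basic syntax:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Instantiate the browser object
    browser = webdriver.Chrome('browser driver path')  # in the current directory: './chromedriver.exe'

    # Visit the corresponding URL
    browser.get(url)

    # Locate page elements. Selenium 3 method names, with the Selenium 4 equivalent:
    #   find_element_by_id           -> find_element(By.ID, ...)
    #   find_element_by_name         -> find_element(By.NAME, ...): the tag's name attribute value
    #   find_element_by_class_name   -> find_element(By.CLASS_NAME, ...): the class attribute value
    #   find_element_by_xpath        -> find_element(By.XPATH, ...): locate by XPath expression
    #   find_element_by_css_selector -> find_element(By.CSS_SELECTOR, ...): locate by CSS selector

    # Example: get the input box whose id is kw
    input_tag = browser.find_element(By.ID, 'kw')

    # Type text into it
    input_tag.clear()
    input_tag.send_keys('喬碧蘿殿下')

    # Click a button
    button.click()

    # Execute JS code
    js = 'window.scrollTo(0, document.body.scrollHeight)'
    for i in range(3):
        browser.execute_script(js)

    # Get the HTML source. Remember: page_source is an attribute, no parentheses
    html = browser.page_source  # str

Data parsing:
1). XPath extraction.
2). Regex extraction: writing regular expressions + using the re module.
3). BeautifulSoup: selectors --> (node selectors, method selectors, CSS selectors).

Media types: video, images, archives, software installers.
1). Get the download link.
2). Request it. With requests: response.content --> binary stream. With the scrapy framework: response.body --> binary stream.
3). Write the file:

    with open('./jdkfj/name', 'wb') as f:
        f.write(res.content)  # use response.body in scrapy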
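The scroll loop above always runs exactly three times. One possible variation, sketched here as an assumption rather than as the author's method, is to keep scrolling until document.body.scrollHeight stops growing. The function name scroll_until_stable and its parameters are hypothetical; the only driver method it relies on is execute_script.

```python
import time


def scroll_until_stable(driver, max_rounds=10, pause=1.0):
    """Scroll to the bottom until the page height stops growing.

    driver is any object with an execute_script(js) method, such as a
    selenium webdriver instance. Returns the number of scrolls performed.
    """
    last_height = driver.execute_script('return document.body.scrollHeight')
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
        time.sleep(pause)  # give lazily loaded content time to arrive
        rounds += 1
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break  # nothing new was loaded, so stop early
        last_height = new_height
    return rounds
```

A call such as scroll_until_stable(browser) followed by html = browser.page_source would take the place of the fixed loop. The max_rounds cap guards against pages that never stop loading.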

3. Data Parsing

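1. XPath

    # Coding workflow
    from lxml import etree

    # Instantiate the etree object
    tree = etree.HTML(res.text)

    # Call an XPath expression to extract data; the results come back in a list
    li_list = tree.xpath('xpath expression')

    # Nested queries: start the inner expression with ./ or .//
    for li in li_list:
        li.xpath('xpath expression')

Basic syntax:
    ./ : match downward from the current node
    .// : match anywhere beneath the current node
    nodename: locate by node name
    nodename[@attributename="value"]: locate by attribute
    Single attribute with multiple values: contains --> div[contains(@class, "item")]
    Multiple attributes: and --> div[@class="item" and @name="divtag"]
    @attributename: extract the attribute value
    text(): extract the text

Selecting by order:
1). Index: indexes start at 1.
    tree.xpath('//div/ul/li[1]/text()'): locates the first li tag
2). last(): the last item; last()-1 is the second to last.
    tree.xpath('//div/ul/li[last()]'): locates the last one
    tree.xpath('//div/ul/li[last()-1]'): locates the second to last
3). position(): select by position.
    tree.xpath('//div/ul/li[position()<4]'): locates the first three

Response object returned by the requests module:
    res.text --> text
    res.json() --> basic Python data types --> dict
    res.content --> binary stream

2. BS4 basic syntax

    # Coding workflow
    from bs4 import BeautifulSoup

    # Instantiate the soup object
    soup = BeautifulSoup(res.text, 'lxml')

    # Locate nodes
    soup.select('CSS selector')

    # CSS selector syntax: id is #, class is .
    soup.select('div > ul > li')  # single-level (child) selector
    soup.select('div li')         # multi-level (descendant) selector

Getting a node's attributes or text:
    tag.string: the direct text. When the tag contains other tags besides its own text, this returns None.
    tag.get_text(): the text, including text from nested tags
    tag['attributename']: the attribute value (try it on an attribute with two or more values, such as class, which comes back as a list)

3. Regex & the re module

Grouping & non-greedy matching: () --> 'dfkjd(kdf.*?dfdf)dfdf'
    <a href="https://www.baidu.com/kdjfkdjf.jpg">這是一個a標簽</a> --> '<a href="(https://www.baidu.com/.*?\.jpg)">'

Quantifiers:
    + : match 1 or more times
    * : match 0 or more times
    {m}: match exactly m times
    {m,n}: match m to n times
    {m,}: match at least m times
    {,n}: match at most n times

re module:
    re.findall('regex pattern', res.text) --> list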
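The effect of the non-greedy .*? in the grouping example above is easiest to see when two links sit on the same line. The snippet below is a small illustration, not part of the original article: the HTML string and the file names a.jpg and b.jpg are made up for the demonstration.

```python
import re

# Hypothetical HTML fragment with two links on one line
html = ('<a href="https://www.baidu.com/a.jpg">first</a>'
        '<a href="https://www.baidu.com/b.jpg">second</a>')

# Non-greedy: .*? stops at the first '.jpg">' it can reach,
# so each link is captured separately
lazy = re.findall(r'<a href="(https://www\.baidu\.com/.*?\.jpg)">', html)

# Greedy: .* runs on to the last '.jpg">' in the string,
# so both links are swallowed into one oversized match
greedy = re.findall(r'<a href="(https://www\.baidu\.com/.*\.jpg)">', html)

print(lazy)         # a list with the two URLs
print(len(greedy))  # 1
```

Because re.findall returns only the contents of the group when the pattern has exactly one pair of parentheses, the surrounding <a href="..."> text never appears in the results.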

4. Persistent Storage

In the snippets below, title, joke and comment are lists of extracted strings.

1. txt

    # Write to a txt file
    if title and joke and comment:
        with open('qbtxt.txt', 'a', encoding='utf-8') as txtfile:
            txtfile.write('&'.join([title[0], joke[0], comment[0]]))
            txtfile.write('\n')
            txtfile.write('********************************************\n')

2. json

    import json

    # Write to a json file
    dic = {'title': title[0], 'joke': joke[0], 'comment': comment[0]}
    with open('jsnfile.json', 'a', encoding='utf-8') as jsonfile:
        jsonfile.write(json.dumps(dic, indent=4, ensure_ascii=False))
        jsonfile.write(',' + '\n')

3. csv

    import csv

    # Write to a CSV file; newline='' keeps the csv module from adding blank rows on Windows
    with open('csvfile.csv', 'a', encoding='utf-8', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=' ')
        writer.writerow([title[0], joke[0], comment[0]])

    # scrapy framework: feed export settings in settings.py
    FEED_URI = 'file:///home/eli/Desktop/qtw.csv'
    FEED_FORMAT = 'csv'
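Appending comma-separated JSON objects, as in the json snippet above, leaves a file that json.load cannot read back in one call. One common alternative, shown here as a sketch and not as the author's method, is the JSON Lines layout: one object per line. The helper names append_jsonl, read_jsonl and append_csv_row are hypothetical, and the CSV helper uses the default comma delimiter instead of the space used above.

```python
import csv
import json


def append_jsonl(path, record):
    """Append one record as a single line of JSON."""
    with open(path, 'a', encoding='utf-8') as f:
        # ensure_ascii=False keeps non-ASCII text readable in the file
        f.write(json.dumps(record, ensure_ascii=False) + '\n')


def read_jsonl(path):
    """Read every record back into a list of dicts."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


def append_csv_row(path, row):
    """Append one row; newline='' lets the csv module control line endings."""
    with open(path, 'a', encoding='utf-8', newline='') as f:
        csv.writer(f).writerow(row)
```

A crawl loop could then call append_jsonl('jokes.jsonl', dic) once per item, and read_jsonl('jokes.jsonl') later returns all items as a list. The file name jokes.jsonl is only an example.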
