Always complaining that your phone has no "good wallpaper"? Grab 500 "beauty" pictures in one go with Python. Is that enough?
Author | 舊時晚風拂曉城  Editor | JackTian
Source | 杰哥的IT之旅 (ID: Jake_Internet)
Original link: https://blog.csdn.net/fyfugoyfa/article/details/107734468
1. Scraping one page of images
Extracting image data with regular expressions
A partial screenshot of the page source is shown below:
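As a minimal offline sketch of the regex step, the snippet below runs a pattern of the same shape against a small HTML fragment. The `sample_html` string and the file names in it are assumptions made up for illustration; they are not copied from the site's real markup.

```python
import re

# Hypothetical fragment shaped like an image list; not the site's real markup
sample_html = (
    '<li><a href="/tupian/1.html"><img src="/uploads/a.jpg" alt="Picture A" /></a></li>'
    '<li><a href="/tupian/2.html"><img src="/uploads/b.jpg" alt="Picture B" /></a></li>'
)

# Two non-greedy groups capture the src and alt attributes of each <img> tag
pattern = r'img src="(.*?)" alt="(.*?)" /'
img_info = re.findall(pattern, sample_html)

for src, name in img_info:
    print(src, name)
```

Because the pattern contains two groups, `re.findall` returns a list of `(src, alt)` tuples, which is why the download loop can unpack each item into a link and a name.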
Printing the page source at first shows garbled text; resetting the response encoding to GBK fixes it.
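To see why the encoding matters, here is a small offline sketch. It assumes the garbling comes from GBK bytes being decoded with a different codec; requests can fall back to ISO-8859-1 for text responses whose headers declare no charset. The sample string is made up for illustration.

```python
# A GBK-encoded byte string standing in for the raw response body (illustrative)
raw = "美女壁纸".encode("gbk")

# Decoding with the wrong codec produces mojibake
garbled = raw.decode("latin-1")
# Decoding with the codec the page actually uses recovers the text
fixed = raw.decode("gbk")

print(repr(garbled))
print(fixed)
```

Setting `response.encoding = 'GBK'` before reading `response.text` has the same effect as the second decode here. When the right codec is not known in advance, `response.apparent_encoding` offers a guess that can be used as a starting point.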
Code implementation:
import os
import re

import requests

# Save directory (adjust for your machine); created if it does not exist
path = r'D:\test\picture_1'
os.makedirs(path, exist_ok=True)
# Target URL
url = "http://pic.netbian.com/4kmeinv/index.html"
# Spoof browser request headers to reduce the chance of being blocked
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Referer": "http://pic.netbian.com/4kmeinv/index.html"
}

# Send the request and get the response
response = requests.get(url, headers=headers)
# The printed page source is garbled; reset the encoding
# so the content displays correctly for the extraction step
response.encoding = 'GBK'

# Regex match to extract the data we want: image link and name
img_info = re.findall('img src="(.*?)" alt="(.*?)" /', response.text)

for src, name in img_info:
    # Prefix with 'http://pic.netbian.com' to get the real image URL
    img_url = 'http://pic.netbian.com' + src
    img_content = requests.get(img_url, headers=headers).content
    img_name = name + '.jpg'
    # Save the image locally
    with open(os.path.join(path, img_name), 'wb') as f:
        print(f"Downloading image: {img_name}")
        f.write(img_content)

Extracting image data with XPath
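As a rough offline illustration of the XPath approach, the sketch below selects `<img>` elements from a small fragment and pairs each link with its name. Two things here are assumptions: the `sample` markup is invented, and the standard library's `xml.etree.ElementTree` stands in for lxml so the snippet runs without extra packages. ElementTree supports only a subset of XPath and needs well-formed input.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; the article parses the real page with lxml
sample = """
<div>
  <ul class="clearfix">
    <li><a href="/tupian/1.html"><img src="/uploads/a.jpg" alt="Picture A" /></a></li>
    <li><a href="/tupian/2.html"><img src="/uploads/b.jpg" alt="Picture B" /></a></li>
  </ul>
</div>
"""

root = ET.fromstring(sample)
# ElementTree's XPath subset cannot select attributes with /@src,
# so select the <img> elements and read the attributes from each one
imgs = root.findall(".//ul[@class='clearfix']/li/a/img")
img_src = ['http://pic.netbian.com' + img.get('src') for img in imgs]
img_alt = [img.get('alt') for img in imgs]

for src, name in zip(img_src, img_alt):
    print(src, name)
```

With lxml, an expression ending in `/@src` or `/@alt` returns the attribute values directly, so the two list comprehensions over elements are not needed there.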
Code implementation:
import os

import requests
from lxml import etree

# Save directory (adjust for your machine); created if it does not exist
path = r'D:\test\picture_1'
os.makedirs(path, exist_ok=True)
# Target URL
url = "http://pic.netbian.com/4kmeinv/index.html"
# Spoof browser request headers to reduce the chance of being blocked
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Referer": "http://pic.netbian.com/4kmeinv/index.html"
}

# Send the request and get the response
response = requests.get(url, headers=headers)
# The printed page source is garbled; reset the encoding
# so the content displays correctly for the extraction step
response.encoding = 'GBK'
html = etree.HTML(response.text)
# XPath query to extract the data we want: image links and names
img_src = html.xpath('//ul[@class="clearfix"]/li/a/img/@src')
# List comprehension to build the real image URLs
img_src = ['http://pic.netbian.com' + x for x in img_src]
img_alt = html.xpath('//ul[@class="clearfix"]/li/a/img/@alt')

for src, name in zip(img_src, img_alt):
    img_content = requests.get(src, headers=headers).content
    img_name = name + '.jpg'
    # Save the image locally
    with open(os.path.join(path, img_name), 'wb') as f:
        print(f"Downloading image: {img_name}")
        f.write(img_content)

2. Paginated scraping for batch downloads
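The scripts in this section rely on the site's pagination pattern: the first listing page is `index.html`, and later pages are `index_2.html`, `index_3.html`, and so on. A minimal sketch of building that URL list follows; the helper name `build_page_urls` is hypothetical and introduced only for illustration.

```python
BASE = 'http://pic.netbian.com/4kmeinv'


def build_page_urls(last_page):
    """Return listing-page URLs for pages 1..last_page (hypothetical helper)."""
    first = [f'{BASE}/index.html']
    rest = [f'{BASE}/index_{i}.html' for i in range(2, last_page + 1)]
    return first + rest


url_list = build_page_urls(10)
print(len(url_list))
print(url_list[0])
print(url_list[-1])
```

Note the `last_page + 1` in `range`: the upper bound is exclusive, which is why the scripts use `range(2, 11)` for 10 pages and `range(2, 51)` for 50.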
Single-threaded version
import datetime
import os
import time

import requests
from lxml import etree

# Save directory (adjust for your machine); created if it does not exist
path = r'D:\test\picture_1'
os.makedirs(path, exist_ok=True)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Referer": "http://pic.netbian.com/4kmeinv/index.html"
}
start = datetime.datetime.now()


def get_img(urls):
    for url in urls:
        # Send the request and get the response
        response = requests.get(url, headers=headers)
        # The printed page source is garbled; reset the encoding
        # so the content displays correctly for the extraction step
        response.encoding = 'GBK'
        html = etree.HTML(response.text)
        # XPath query to extract the data we want: image links and names
        img_src = html.xpath('//ul[@class="clearfix"]/li/a/img/@src')
        # List comprehension to build the real image URLs
        img_src = ['http://pic.netbian.com' + x for x in img_src]
        img_alt = html.xpath('//ul[@class="clearfix"]/li/a/img/@alt')
        for src, name in zip(img_src, img_alt):
            img_content = requests.get(src, headers=headers).content
            img_name = name + '.jpg'
            # Save the image locally
            with open(os.path.join(path, img_name), 'wb') as f:
                # print(f"Downloading image: {img_name}")
                f.write(img_content)
        # Pause between pages to avoid hammering the server
        time.sleep(1)


def main():
    # List of URLs to request
    url_list = ['http://pic.netbian.com/4kmeinv/index.html'] + [f'http://pic.netbian.com/4kmeinv/index_{i}.html' for i in range(2, 11)]
    get_img(url_list)
    delta = (datetime.datetime.now() - start).total_seconds()
    print(f"Time taken to scrape 10 pages of images: {delta}s")


if __name__ == '__main__':
    main()

The program ran successfully and scraped 10 pages, 210 images in total, in 63.682837 s.
Multithreaded version
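The multithreaded script hands one listing page to each worker in a thread pool. As a minimal offline sketch of that pattern, the snippet below replaces the network request with a short sleep; `fake_download` is a hypothetical stand-in, and the 0.05 s delay and 12 pages are arbitrary values chosen for the demo.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(page):
    """Hypothetical stand-in for a network request: waits briefly, then returns."""
    time.sleep(0.05)
    return f"page {page} done"


pages = list(range(1, 13))

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=6) as executor:
    # map() yields results in input order; consuming them also re-raises
    # any exception that occurred inside a worker
    results = list(executor.map(fake_download, pages))
elapsed = time.perf_counter() - t0

print(results[0], results[-1])
print(f"{elapsed:.2f}s with 6 workers; about {0.05 * len(pages):.2f}s if run one by one")
```

Threads help here because the work is I/O-bound: while one worker waits on the network, the others keep going. Wrapping `executor.map` in `list(...)` is a design choice worth considering, since an exception raised in a worker stays hidden if the results are never consumed.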
import datetime
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor

import requests
from lxml import etree

# Save directory (adjust for your machine); created if it does not exist
path = r'D:\test\picture_1'
os.makedirs(path, exist_ok=True)
# Pool of User-Agent strings; one is picked at random for each page
user_agent = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
start = datetime.datetime.now()


def get_img(url):
    headers = {
        "User-Agent": random.choice(user_agent),
        "Referer": "http://pic.netbian.com/4kmeinv/index.html"
    }
    # Send the request and get the response
    response = requests.get(url, headers=headers)
    # The printed page source is garbled; reset the encoding
    # so the content displays correctly for the extraction step
    response.encoding = 'GBK'
    html = etree.HTML(response.text)
    # XPath query to extract the data we want: image links and names
    img_src = html.xpath('//ul[@class="clearfix"]/li/a/img/@src')
    # List comprehension to build the real image URLs
    img_src = ['http://pic.netbian.com' + x for x in img_src]
    img_alt = html.xpath('//ul[@class="clearfix"]/li/a/img/@alt')
    for src, name in zip(img_src, img_alt):
        img_content = requests.get(src, headers=headers).content
        img_name = name + '.jpg'
        # Save the image locally
        with open(os.path.join(path, img_name), 'wb') as f:
            # print(f"Downloading image: {img_name}")
            f.write(img_content)
    # Random pause after each page to avoid hammering the server
    time.sleep(random.randint(1, 2))


def main():
    # List of URLs to request
    url_list = ['http://pic.netbian.com/4kmeinv/index.html'] + [f'http://pic.netbian.com/4kmeinv/index_{i}.html' for i in range(2, 51)]
    # The with block waits for all workers to finish before timing stops
    with ThreadPoolExecutor(max_workers=6) as executor:
        executor.map(get_img, url_list)
    delta = (datetime.datetime.now() - start).total_seconds()
    print(f"Time taken to scrape 50 pages of images: {delta}s")


if __name__ == '__main__':
    main()

The program ran successfully and scraped 50 pages, 1047 images in total, in 56.71979 s. Using multiple threads greatly improves scraping efficiency.
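To put a number on the improvement, the two reported runs can be compared by throughput. This is only a rough comparison: the runs covered different page counts and include deliberate sleeps, so the figures describe these two runs rather than a general benchmark.

```python
# Figures reported for the two runs in this article
single_images, single_seconds = 210, 63.682837
multi_images, multi_seconds = 1047, 56.71979

single_rate = single_images / single_seconds   # images per second, single thread
multi_rate = multi_images / multi_seconds      # images per second, 6 workers
speedup = multi_rate / single_rate

print(f"single-threaded: {single_rate:.2f} images/s")
print(f"multithreaded:   {multi_rate:.2f} images/s")
print(f"speedup:         {speedup:.1f}x")
```

That works out to roughly 3.3 versus 18.5 images per second, a speedup of about 5.6x with 6 workers.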
The final result is shown below:
3. Other notes