py3+requests+urllib+bs4+threading, scraping Doutu meme images
This article mainly introduces scraping Doutu meme images with py3+requests+urllib+bs4+threading. It is shared here as a reference.
For the principles and approach behind the implementation, see my other scraping-practice posts:
py3+urllib+bs4 plus anti-scraping workarounds, scrape Douban girl pictures in 20+ lines of code: http://www.cnblogs.com/uncleyong/p/6892688.html
py3+requests+json+xlwt, scraping Lagou job postings: http://www.cnblogs.com/uncleyong/p/6960044.html
py3+urllib+re, easily scraping the winning numbers of the last 100 Shuangseqiu lottery draws: http://www.cnblogs.com/uncleyong/p/6958242.html
The implementation code is as follows:
# -*- coding:utf-8 -*-
import requests, threading, time
from lxml import etree  # only needed by the commented-out XPath variant below
from bs4 import BeautifulSoup

# fetch page source
def get_html(url):
    # url = 'http://www.doutula.com/article/list/?page=1'
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    request = requests.get(url=url, headers=headers)  # send a GET request to the URL
    response = request.content.decode('utf-8')  # page source
    # print(response)
    return response

# extract the detail-page links
def get_img_html(html):
    # soup = BeautifulSoup(html, 'html.parser')
    soup = BeautifulSoup(html, 'lxml')  # parse the page
    all_a = soup.find_all('a', class_='list-group-item')  # get the <a> tags; when a tag has a class or id, include it in the query
    # class="list-group-item" is the class of the <a> tag:
    # <a class="list-group-item" href="http://www.doutula.com/article/detail/7536783">
    # print(type(all_a))  # <class 'bs4.element.ResultSet'>
    # print(all_a)
    for i in all_a:
        # print(i['href'])
        img_html = get_html(i['href'])  # fetch the detail-page source; i['href'] reads the attribute value
        # print(img_html)
        get_img(img_html)

# extract the image URLs
def get_img(html):
    # XPath variant:
    # soup = etree.HTML(html)  # initialize from the source
    # items = soup.xpath('//div[@class="artile_des"]')  # // selects matching nodes anywhere in the document, regardless of position; [] is a filter condition
    # for item in items:
    #     imgurl_list = item.xpath('table/tbody/tr/td/a/img/@onerror')
    #     # print(imgurl_list)
    #     # start_save_img(imgurl_list)
    soup = BeautifulSoup(html, 'lxml')
    items = soup.find('div', class_='swiper-slide').find_all('div', class_='artile_des')
    # This cannot be written as below: .find cannot be chained after find_all, because find returns
    # a single tag while find_all returns many, so "find one from many" this way is wrong
    # items = soup.find('div', class_='swiper-slide').find_all('div', class_='artile_des').find('img')['src']
    # print(items)
    imgurl_list = []
    for i in items:
        imgurl = i.find('img')['src']  # the src attribute under the img tag
        # print(type(imgurl))  # <class 'str'>
        # print(imgurl)
        imgurl_list.append(imgurl)
    start_save_img(imgurl_list)  # multi-thread the downloads for each image set

# download an image
x = 1
def save_img(img_url):
    # earlier variant, for when the link sat inside an onerror attribute:
    # global x  # global variable
    # x += 1
    # img_url = img_url.split('=')[-1][1:-2].replace('jp', 'jpg')  # split on '='
    # print('downloading ' + 'http:' + img_url)
    # img_content = requests.get('http:' + img_url).content
    # with open('doutu/%s.jpg' % x, 'wb') as f:  # urllib's urlretrieve can download too
    #     f.write(img_content)
    global x  # global variable
    x += 1
    print('downloading: ' + img_url)
    geshi = img_url.split('.')[-1]  # image formats vary, so slice the extension off the link to build the file name below
    img_content = requests.get(img_url).content
    with open('doutu/%s.%s' % (x, geshi), 'wb') as f:  # urllib's urlretrieve can download too
        f.write(img_content)

def start_save_img(imgurl_list):
    for i in imgurl_list:
        # print(i)
        th = threading.Thread(target=save_img, args=(i,))  # the trailing comma after i makes args a tuple
        # target is a callable (a function name) executed once the thread starts
        th.start()
        th.join()  # joining inside the loop waits for each download before starting the next, so the downloads are effectively sequential

# main function
def main():
    start_url = 'http://www.doutula.com/article/list/?page={}'
    for i in range(1, 2):
        # print(start_url.format(i))
        start_html = get_html(start_url.format(i))
        get_img_html(start_html)  # collect the image URLs from the detail pages

if __name__ == '__main__':  # script entry point
    start_time = time.time()
    main()
    end_time = time.time()
    print(start_time)
    print(end_time)
    print(end_time - start_time)

Summary
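The comment in get_img about chaining is worth a standalone illustration: find returns a single Tag, so attribute access and further .find calls work on it, while find_all returns a ResultSet that must be iterated. A minimal sketch, using a made-up HTML snippet shaped like the Doutula detail page (html.parser is used here to avoid the lxml dependency):

```python
from bs4 import BeautifulSoup

html = '''
<div class="swiper-slide">
  <div class="artile_des"><img src="a.jpg"></div>
  <div class="artile_des"><img src="b.jpg"></div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

slide = soup.find('div', class_='swiper-slide')     # find: one Tag
items = slide.find_all('div', class_='artile_des')  # find_all: a ResultSet, iterate it
srcs = [i.find('img')['src'] for i in items]        # .find works per element, not on the ResultSet
print(srcs)  # ['a.jpg', 'b.jpg']
```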
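The extension-slicing trick in save_img can be tried in isolation. The URL below is a hypothetical example for illustration; os.path.splitext is shown as the standard-library alternative (note it keeps the leading dot):

```python
import os

img_url = 'http://ws2.sinaimg.cn/large/example123.gif'  # hypothetical URL for illustration
geshi = img_url.split('.')[-1]  # everything after the last dot, as in save_img above
print(geshi)                    # gif
ext = os.path.splitext(img_url)[1]  # stdlib equivalent, keeps the dot
print(ext)                      # .gif
```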
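One caveat in the code above: because start_save_img calls th.join() immediately after th.start() inside the loop, each download finishes before the next thread starts, so the work is effectively sequential; and if the threads did run concurrently, the unprotected global counter x could race. A minimal sketch of the alternative, starting all threads first, joining them afterwards, and guarding the counter with a lock; save_img_stub is a stand-in for the real download, introduced only for this example:

```python
import threading

x = 0
x_lock = threading.Lock()

def save_img_stub(img_url):
    """Stand-in for the real save_img; only the counter logic matters here."""
    global x
    with x_lock:  # serialize increments of the shared counter
        x += 1

def start_save_img(imgurl_list):
    threads = [threading.Thread(target=save_img_stub, args=(u,)) for u in imgurl_list]
    for th in threads:  # start every thread before joining any of them
        th.start()
    for th in threads:  # then wait for all of them to finish
        th.join()

start_save_img(['url_%d' % i for i in range(50)])
print(x)  # 50
```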
The above is everything 生活随笔 has collected and organized on "py3+requests+urllib+bs4+threading, scraping Doutu meme images"; hopefully it helps you solve the problems you ran into.