Crawling High-Resolution Images from 千图网 (58pic) with Python

Published: 2023/12/31 · Category: python


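### 1. How to build a Scrapy image crawler
1. Analyze the website.
2. Choose the crawling method and strategy.
3. Create the crawler project, then define items.py.
4. Write the spider file.
5. Write the pipelines and settings.
6. Debug.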

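### 2. Challenges on 千图网 (http://www.58pic.com/)

1. The images have to be crawled from the whole site, not from a single page.
2. The images have to be the high-resolution versions, so the task is to find the address of the high-resolution file.
3. The crawler needs to get past the site's anti-crawler measures, for example by imitating a browser and by not keeping cookies. For cookies, it is enough to uncomment the corresponding line in settings.py: COOKIES_ENABLED = False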
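The settings mentioned in the third point could look roughly like the fragment below. This is a sketch, not the original project's file: the project module name qiantuwang is inferred from the class names used in the code later in this post, and the User-Agent string is only an illustrative browser string. COOKIES_ENABLED, USER_AGENT and ITEM_PIPELINES are standard Scrapy settings.

```python
# settings.py (fragment) -- a sketch under the assumptions stated above.

# Assumed project name, inferred from the QiantuwangItem / QiantuwangPipeline classes.
BOT_NAME = "qiantuwang"

# Do not store or send cookies, so requests are harder to tie to one session.
COOKIES_ENABLED = False

# Imitate a desktop browser. The exact string is illustrative only.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# Enable the download pipeline. The dotted path assumes the module layout
# qiantuwang/pipelines.py; 300 is just an ordering value between 0 and 1000.
ITEM_PIPELINES = {
    "qiantuwang.pipelines.QiantuwangPipeline": 300,
}
```

Without the ITEM_PIPELINES entry the pipeline class is never called, so items are scraped but nothing is downloaded.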

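### 3. Miscellaneous notes

1. from scrapy.http import Request is needed when a page should be followed and handled by a callback function: Request(url=…,callback=…)
2. In XPath, // selects every matching node, at any depth below the current node.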
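To make the second note concrete, here is a small illustration of what // does. It deliberately uses the standard library's xml.etree.ElementTree instead of Scrapy, so it runs without Scrapy installed; ElementTree only supports a subset of XPath and spells the descendant search as .//. The HTML snippet and its paths are made up for the example and are not taken from 58pic. In a Scrapy spider the equivalent query would be written against the response, for example response.xpath("//img/@src").

```python
import xml.etree.ElementTree as ET

# A made-up page fragment: one image sits directly in the list,
# another one is nested one level deeper.
page = """
<html>
  <body>
    <div class="list">
      <a href="/item/1.html"><img src="/thumb/1.jpg"/></a>
      <div class="nested">
        <a href="/item/2.html"><img src="/thumb/2.jpg"/></a>
      </div>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(page)

# ".//img" searches the whole subtree, so both images are found
# regardless of how deeply they are nested.
all_imgs = [img.get("src") for img in root.findall(".//img")]

# A path without "//" only follows direct children at each step,
# so the link inside the nested div is not matched.
direct_links = [a.get("href") for a in root.findall("./body/div/a")]

print(all_imgs)
print(direct_links)
```

This is why // is convenient for whole-site crawling: the query keeps working even when the markup around the target node changes depth.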

Code:

items.py

    import scrapy


    class QiantuwangItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        url = scrapy.Field()
        title = scrapy.Field()

middlewares.py

    from scrapy import signals


    class QiantuwangSpiderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the spider middleware does not modify the
        # passed objects.

        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s

        def process_spider_input(self, response, spider):
            # Called for each response that goes through the spider
            # middleware and into the spider.
            # Should return None or raise an exception.
            return None

        def process_spider_output(self, response, result, spider):
            # Called with the results returned from the Spider, after
            # it has processed the response.
            # Must return an iterable of Request, dict or Item objects.
            for i in result:
                yield i

        def process_spider_exception(self, response, exception, spider):
            # Called when a spider or process_spider_input() method
            # (from other spider middleware) raises an exception.
            # Should return either None or an iterable of Response, dict
            # or Item objects.
            pass

        def process_start_requests(self, start_requests, spider):
            # Called with the start requests of the spider, and works
            # similarly to the process_spider_output() method, except
            # that it doesn’t have a response associated.
            # Must return only requests (not items).
            for r in start_requests:
                yield r

        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)

pipelines.py

    import random
    import urllib.request


    class QiantuwangPipeline(object):
        def process_item(self, item, spider):
            try:
                # item['title'] and item['url'] are lists; use the first entry of each.
                # (The Python 2 version encoded the title to GBK here; Python 3
                # strings can be used in the file name directly.)
                title = item['title'][0]
                # A random suffix keeps images with the same title from overwriting each other.
                file = "E:/tupian/" + str(title) + str(int(random.random() * 10000)) + ".jpg"
                # Python 3: urlretrieve lives in urllib.request (urllib.urlretrieve in Python 2).
                urllib.request.urlretrieve(item['url'][0], filename=file)
            except Exception as e:
                # Print the error and keep crawling instead of stopping on one bad image.
                print(e)
            return item
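The pipeline builds the target file name from the page title plus a random number. Titles can contain characters that Windows rejects in file names (such as / or ?), and in that case the download fails inside the try block and the image is skipped. One possible way to make the path more robust is sketched below. The helper build_image_path is hypothetical and not part of the original project, and the default directory is simply the E:/tupian folder used above.

```python
import re
import uuid
from pathlib import Path

# Characters that Windows does not allow in file names, plus line breaks and tabs.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\r\n\t]')


def build_image_path(title, out_dir="E:/tupian", suffix=".jpg"):
    """Return a file path for an image title (hypothetical helper).

    Invalid characters are replaced with "_", and a short random hex
    string is appended so two images with the same title do not collide.
    """
    safe_title = _INVALID_CHARS.sub("_", title).strip() or "untitled"
    unique = uuid.uuid4().hex[:8]
    return Path(out_dir) / f"{safe_title}_{unique}{suffix}"


# Example: the slash and colon in the title are replaced before the path is built.
example = build_image_path("a/b:c", out_dir="out")
print(example.name)
```

In process_item, the result could be passed to the download call as str(build_image_path(title)). The output directory still has to exist before the first download.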
