Scraping high-resolution images from 58pic (千图网) with Python
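### I. Approach to building a Scrapy image crawler
1. Analyze the website
2. Choose the crawling method and strategy
3. Create the crawler project → define items.py
4. Write the spider file
5. Write pipelines.py and settings.py
6. Debug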
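### II. Challenges specific to 58pic (千图网, http://www.58pic.com/)
1. Images must be crawled from the entire site.
2. The high-resolution images must be downloaded: this only requires finding the high-resolution image URL.
3. The site's anti-crawler measures must be countered, e.g. by imitating a browser and not keeping cookies. For cookies, simply uncomment the corresponding line in settings.py: COOKIES_ENABLED = False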
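The anti-crawler settings in point 3 could look roughly like the fragment below. This is a sketch only, not the author's actual settings.py: the User-Agent string is an arbitrary example, and the pipeline path assumes the project is named qiantuwang (inferred from the class names in the code further down).

```python
# settings.py (fragment): a sketch, not the author's actual file.

# Present a browser-like User-Agent instead of Scrapy's default one.
# The exact string below is an arbitrary example.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)

# Do not store or send cookies. This line is present but commented out
# in the settings.py that Scrapy generates.
COOKIES_ENABLED = False

# Enable the image-saving pipeline. The module path assumes the project
# is named "qiantuwang"; adjust it to the real project name.
ITEM_PIPELINES = {
    "qiantuwang.pipelines.QiantuwangPipeline": 300,
}
```

The number 300 is only the pipeline's ordering priority; with a single pipeline any value in the usual 0 to 1000 range behaves the same.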
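### III. Miscellaneous notes
1. from scrapy.http import Request imports the Request class. To have a page handled by a callback function, yield Request(url=…, callback=…) from the spider.
2. In XPath, // selects all matching nodes at any depth, not just direct children.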
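The behaviour of // in note 2 can be illustrated without a running crawler. The sketch below uses the XPath subset in Python's standard xml.etree.ElementTree as a stand-in for Scrapy's selectors; inside a spider, response.xpath('//img/@src') applies the same any-depth matching. The HTML fragment and URLs are made up for illustration and are not 58pic's real markup.

```python
import xml.etree.ElementTree as ET

# A made-up page fragment; the real 58pic markup is not shown in this article.
html = """
<div>
  <a href="/a"><img src="http://example.com/1.jpg"/></a>
  <ul>
    <li><img src="http://example.com/2.jpg"/></li>
    <li><p><img src="http://example.com/3.jpg"/></p></li>
  </ul>
</div>
""".strip()

root = ET.fromstring(html)

# ".//img" matches every <img> below the root, however deeply nested.
all_imgs = [img.get("src") for img in root.findall(".//img")]

# "./img" (single slash) matches only direct children of the root.
direct_imgs = [img.get("src") for img in root.findall("./img")]

print(all_imgs)     # three URLs, in document order
print(direct_imgs)  # empty: no <img> is a direct child of the root <div>
```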
Code:
items.py
    import scrapy


    class QiantuwangItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        url = scrapy.Field()
        title = scrapy.Field()

middlewares.py
    from scrapy import signals


    class QiantuwangSpiderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the spider middleware does not modify the
        # passed objects.

        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s

        def process_spider_input(self, response, spider):
            # Called for each response that goes through the spider
            # middleware and into the spider.
            # Should return None or raise an exception.
            return None

        def process_spider_output(self, response, result, spider):
            # Called with the results returned from the Spider, after
            # it has processed the response.
            # Must return an iterable of Request, dict or Item objects.
            for i in result:
                yield i

        def process_spider_exception(self, response, exception, spider):
            # Called when a spider or process_spider_input() method
            # (from other spider middleware) raises an exception.
            # Should return either None or an iterable of Response, dict
            # or Item objects.
            pass

        def process_start_requests(self, start_requests, spider):
            # Called with the start requests of the spider, and works
            # similarly to the process_spider_output() method, except
            # that it doesn't have a response associated.
            # Must return only requests (not items).
            for r in start_requests:
                yield r

        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)

pipelines.py

    import random
    import urllib.request


    class QiantuwangPipeline(object):
        def process_item(self, item, spider):
            # Ported from the original Python 2 code (urllib.urlretrieve,
            # "except Exception, e") to Python 3 syntax.
            try:
                # Both fields are indexed with [0], so they are expected to hold lists.
                title = item['title'][0]
                # A random suffix reduces the chance of file name collisions.
                path = "E:/tupian/" + str(title) + str(int(random.random() * 10000)) + ".jpg"
                urllib.request.urlretrieve(item['url'][0], filename=path)
            except Exception as e:
                # Print the error and carry on with the next item.
                print(e)
            return item