Scrapy: Integrating with Selenium, and Pipeline Data Persistence
Table of Contents
- 1. Scrapy with Selenium
- 2. Pipeline data persistence
1. Scrapy with Selenium
Dynamically loaded data comes in two flavours:
1. AJAX:
   ① If the API URLs follow a pattern, you can build them yourself and crawl the interface directly.
   ② Otherwise, use the Selenium automation framework to grab the rendered data.
2. Data rendered by JavaScript:
   ① Reverse-engineer the JavaScript.
   ② Grab the rendered page with Selenium.

Selenium can scrape dynamically loaded data; Scrapy on its own cannot. For AJAX requests you can call the interface directly, but for JavaScript-rendered content Scrapy has to be combined with Selenium.
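For the first route, when the AJAX interface is predictable, you can skip the browser entirely and request the JSON endpoint yourself. A minimal sketch, assuming a hypothetical paginated endpoint (the real URL and field names have to be taken from the browser's network panel):

```python
import json

import scrapy


class AjaxNewsSpider(scrapy.Spider):
    name = 'ajax_news'
    # Hypothetical paginated JSON API -- replace with the endpoint seen in the network panel
    api_url = 'https://example.com/api/news?page={}'

    def start_requests(self):
        for page in range(1, 4):
            yield scrapy.Request(self.api_url.format(page), callback=self.parse)

    def parse(self, response):
        # Field names here are placeholders for whatever the API actually returns
        for row in json.loads(response.text).get('data', []):
            yield {'title': row.get('title'), 'url': row.get('url')}
```

The NewsSpider below takes the second route: it creates one shared Chrome driver, and a downloader middleware (shown afterwards) uses that driver to render the page before Scrapy parses it.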
```python
import scrapy
from selenium import webdriver
from selenium.webdriver import ChromeOptions

from ..items import WynewsItem


class NewsSpider(scrapy.Spider):
    name = 'news'
    start_urls = ['https://news.163.com/domestic/']

    # Hide the "Chrome is being controlled by automated software" banner
    option = ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])
    # One shared browser instance; the downloader middleware reaches it via spider.bro
    bro = webdriver.Chrome(
        executable_path=r'C:\Users\Administrator\Desktop\news\wynews\wynews\spiders\chromedriver.exe',
        options=option)

    def detail_parse(self, response):
        # Join all paragraph text of the article body
        content_list = response.xpath('//div[@id="endText"]/p//text()').extract()
        content = ''
        title = response.meta['title']
        for s in content_list:
            content += s
        item = WynewsItem()
        item["title"] = title
        item["content"] = content
        yield item

    def parse(self, response):
        # Each news entry sits in a div whose class contains "data_row"
        div_list = response.xpath('//div[contains(@class, "data_row")]')
        for div in div_list:
            link = div.xpath('./a/@href').extract_first()
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            yield scrapy.Request(url=link, callback=self.detail_parse, meta={"title": title})
```
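The spider imports WynewsItem from the project's items.py, which the original post does not show; a minimal definition matching the two fields filled in above would look like this:

```python
# items.py (assumed definition, matching the fields the spider uses)
import scrapy


class WynewsItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
```

The downloader middleware below is where the shared browser actually gets used.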
```python
# middlewares.py
import time

from scrapy.http import HtmlResponse


class WynewsDownloaderMiddleware(object):
    def process_response(self, request, response, spider):
        bro = spider.bro  # the Chrome instance created by the spider
        if request.url in spider.start_urls:
            # Render the start page in the browser and scroll to the bottom
            # so the dynamically loaded entries appear in the DOM.
            bro.get(request.url)
            time.sleep(3)
            js = 'window.scrollTo(0, document.body.scrollHeight)'
            bro.execute_script(js)
            time.sleep(3)
            response_selenium = bro.page_source
            # Hand Scrapy the fully rendered page instead of the original response
            return HtmlResponse(url=bro.current_url, body=response_selenium,
                                encoding="utf-8", request=request)
        return response
```
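The middleware only runs once it is enabled in settings.py. A minimal sketch, assuming the default project layout and the priority from Scrapy's project template:

```python
# settings.py -- assumed registration of the Selenium middleware
DOWNLOADER_MIDDLEWARES = {
    'wynews.middlewares.WynewsDownloaderMiddleware': 543,
}
```

With the middleware in place, a simple pipeline stores each item in MongoDB: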
```python
# pipelines.py
import pymongo


class WynewsPipeline(object):
    # Class-level connection: one MongoClient shared by all items
    conn = pymongo.MongoClient('localhost', 27017)
    db = conn.wynews
    table = db.newsinfo

    def process_item(self, item, spider):
        self.table.insert_one(dict(item))
        return item
```
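Likewise, a pipeline has no effect until it is registered in ITEM_PIPELINES; a sketch, again assuming the default module path:

```python
# settings.py -- assumed registration of the news pipeline
ITEM_PIPELINES = {
    'wynews.pipelines.WynewsPipeline': 300,
}
```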
2. Pipeline data persistence

Overview:
1. Pipelines are used for data persistence.
2. There are many persistence targets: MongoDB, MySQL, Redis, CSV, and so on.
3. The only method a pipeline must implement is process_item; the other hooks are optional:
- open_spider(self, spider): called when the spider is opened.
- close_spider(self, spider): called when the spider is closed.
- from_crawler(cls, crawler): a class method, marked with @classmethod, which gives access to the crawler's settings.
- process_item(self, item, spider): talks to the storage backend and saves the data; this method must be implemented.

The MongoPipeline below implements all four hooks.
```python
# pipelines.py
import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        self.db['news'].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```
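from_crawler reads MONGO_URI and MONGO_DB from the crawler settings, so both keys have to exist in settings.py; the values below are only placeholders:

```python
# settings.py -- placeholder values for the settings MongoPipeline expects
MONGO_URI = 'localhost'
MONGO_DB = 'wynews'
```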
```python
# pipelines.py
import pymysql


class MysqlPipeline(object):
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'))

    def open_spider(self, spider):
        self.db = pymysql.connect(host=self.host, user=self.user,
                                  password=self.password, database=self.database,
                                  charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        data = dict(item)
        keys = ','.join(data.keys())
        values = ','.join(['%s'] * len(data))
        # tablename is the target table; the original code leaves it undefined,
        # so it must be supplied elsewhere (e.g. as a class attribute).
        sql = 'insert into %s (%s) values (%s)' % (tablename, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item
```
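The MySQL variant pulls its five connection parameters from settings the same way; example placeholder values:

```python
# settings.py -- placeholder values for the settings MysqlPipeline expects
MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'wynews'   # example database name
MYSQL_USER = 'root'
MYSQL_PASSWORD = ''         # fill in the real password
MYSQL_PORT = 3306
```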
A pipeline class for file downloads: the spider below collects image links, and a subclass of Scrapy's built-in ImagesPipeline downloads and stores them.
```python
import scrapy

from ..items import XhxhItem


class XhSpider(scrapy.Spider):
    name = 'xh'
    start_urls = ['http://www.521609.com/qingchunmeinv/']

    def parse(self, response):
        li_list = response.xpath('//div[@class="index_img list_center"]/ul/li')
        for li in li_list:
            item = XhxhItem()
            link = li.xpath('./a[1]/img/@src').extract_first()
            item['img_link'] = 'http://www.521609.com' + link
            print(item)
            yield item
```
```python
import scrapy


class XhxhItem(scrapy.Item):
    img_link = scrapy.Field()
```
```python
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class XhxhPipeline(object):
    def process_item(self, item, spider):
        return item


class ImgPipeLine(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Issue a download request for every collected image link
        yield scrapy.Request(url=item['img_link'])

    def file_path(self, request, response=None, info=None):
        # Store each image under its original file name (relative to IMAGES_STORE)
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        # Hand the item on to any later pipeline
        return item
```
```python
ITEM_PIPELINES = {
    'xhxh.pipelines.XhxhPipeline': 300,
    'xhxh.pipelines.ImgPipeLine': 301,
}
IMAGES_STORE = './mvs'
```
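With this configuration both pipelines run for every item: XhxhPipeline (priority 300) simply passes the item through, and ImgPipeLine (priority 301) downloads each img_link and saves it under ./mvs with the file name returned by file_path.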