Scrapy and MySQL: a Scrapy crawler walkthrough, saving data as TXT, JSON, and MySQL

Published: 2024/1/1 | Database | 23 | 豆豆


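Today we will crawl Huxiu (huxiu.com).

We will go through the following steps:

Create a new Scrapy project

Define the Item objects you want to extract

Write a spider that crawls the site and extracts all the Item objects

Write an Item Pipeline to store the extracted Item objects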

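Creating the Scrapy project

Run the following commands in any directory:

scrapy startproject coolscrapy

cd coolscrapy

scrapy genspider huxiu huxiu.com

The generated project contains, among other files, items.py, pipelines.py, settings.py, and a spiders directory holding huxiu.py. (news.json and news.txt, which appear later, are the output files written at the end of the crawl.)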

Defining the Item

We define an Item by subclassing scrapy.Item and declaring its attributes as scrapy.Field objects. We are going to scrape the title, link, and summary of each entry in Huxiu's news list. The class lives in coolscrapy/items.py:

import scrapy


class CoolscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()     # title
    link = scrapy.Field()      # link
    desc = scrapy.Field()      # summary
    posttime = scrapy.Field()  # publication time

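Writing the Spider

Spiders are classes you define that Scrapy uses to crawl information from a domain (or group of domains). A spider class defines an initial list of URLs to download, how to follow links, and how to parse page content to extract Items.

To define a spider, subclass scrapy.Spider and define a few attributes:

name: the spider's name, which must be unique

start_urls: the initial URLs to download

parse(): parses the downloaded Response object, which is the method's only argument. It is responsible for parsing the returned page data and extracting Items (returned as Item objects), as well as any further URLs to follow (returned as Request objects).

Open huxiu.py in the coolscrapy/spiders folder and edit it as follows: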

# -*- coding: utf-8 -*-
import scrapy
from coolscrapy.items import CoolscrapyItem


class HuxiuSpider(scrapy.Spider):
    name = "huxiu"
    allowed_domains = ["huxiu.com"]
    start_urls = ['http://huxiu.com/index.php']

    def parse(self, response):
        items = []
        # One selector per news entry on the list page
        data = response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]')
        for sel in data:
            item = CoolscrapyItem()
            # extract_first() returns the default when nothing matched
            item['title'] = sel.xpath('./h2/a/text()').extract_first(default='No title')
            item['link'] = sel.xpath('./h2/a/@href').extract_first(default='No link found')
            item['desc'] = sel.xpath('div[@class="mob-sub"]/text()').extract_first(default='No description')
            # item['posttime'] = sel.xpath('./div[@class="mob-author"]/span/text()').extract()[0]
            print(item['title'], item['link'], item['desc'])
            items.append(item)
        return items

Now run the spider from the terminal:

scrapy crawl huxiu

If everything works, each news entry is printed.
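
Each field in parse() takes the first extracted string, or a placeholder when nothing matched. A minimal sketch of that pattern in plain Python follows. The helper name first_or_default is hypothetical (it is not part of Scrapy), and plain lists stand in for the list of strings that .extract() returns.

```python
def first_or_default(values, default):
    """Return the first element of values, or default when values is empty."""
    return values[0] if values else default


# Plain lists stand in for the result of sel.xpath(...).extract().
found = first_or_default(["Some headline"], "No title")
missing = first_or_default([], "No title")
print(found)
print(missing)
```

Centralizing the fallback keeps the parsing loop short and makes it harder to forget a field's empty case.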

Following links

To follow each news link and look at the article's details, return a Request object from parse() and register a callback that parses the article page.

Continue editing huxiu.py:

# -*- coding: utf-8 -*-
import scrapy
from coolscrapy.items import CoolscrapyItem


class HuxiuSpider(scrapy.Spider):
    name = "huxiu"
    allowed_domains = ["huxiu.com"]
    start_urls = ['http://huxiu.com/index.php']

    def parse(self, response):
        data = response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]')
        for sel in data:
            item = CoolscrapyItem()
            item['title'] = sel.xpath('./h2/a/text()').extract_first(default='No title')
            item['desc'] = sel.xpath('div[@class="mob-sub"]/text()').extract_first(default='No description')
            link = sel.xpath('./h2/a/@href').extract_first()
            if link is None:
                # No link on this entry, so there is nothing to follow
                continue
            item['link'] = link
            print(item['title'], item['link'], item['desc'])
            # Build an absolute URL and let parse_article() handle the article page
            url = response.urljoin(link)
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        detail = response.xpath('//div[@class="article-wrap"]')
        item = CoolscrapyItem()
        item['title'] = detail.xpath('./h1/text()')[0].extract().strip()
        item['link'] = response.url
        item['posttime'] = detail.xpath('./div/div[@class="column-link-box"]/span[1]/text()')[0].extract()
        print(item['title'], item['link'], item['posttime'])
        yield item

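Now parse() only extracts the links we are interested in and hands the parsing of each linked page to another method. You can build more complex crawlers on top of this pattern.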
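
The spider calls response.urljoin() because links on a list page are often relative. As a rough illustration, the standard library's urllib.parse.urljoin resolves a link against the page URL in much the same way (Scrapy additionally takes the page's base URL into account). The article paths below are made up for the example.

```python
from urllib.parse import urljoin

# The start URL from the spider; the links below are hypothetical examples.
PAGE_URL = "http://huxiu.com/index.php"


def absolute(link):
    """Resolve a possibly relative link against the page it was found on."""
    return urljoin(PAGE_URL, link)


root_relative = absolute("/article/123.html")
already_absolute = absolute("http://huxiu.com/article/456.html")
print(root_relative)
print(already_absolute)
```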

Exporting the scraped data

The simplest way to save the scraped data is to write it to a local JSON file by running:

scrapy crawl huxiu -o items.json

For a real crawling system, it is generally better to write your own Item Pipeline.
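
With the .json extension, the exported file holds a single JSON array of items, so it can be read back with the standard json module. The sketch below writes a small sample file first so that it runs on its own; the sample item and the load_items helper are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical sample standing in for the exported items.json.
sample = [
    {"title": "Example headline",
     "link": "http://huxiu.com/article/1.html",
     "posttime": "2024-01-01"},
]


def load_items(path):
    """Load a JSON array of items from path."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


path = os.path.join(tempfile.mkdtemp(), "items.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

items = load_items(path)
print(len(items), items[0]["title"])
```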

Saving data as TXT/JSON/MySQL

1. Saving data as TXT

Open pipelines.py:

import codecs
import os
import json
import pymysql


# Enable in settings.py with 'coolscrapy.pipelines.CoolscrapyPipeline': 300
class CoolscrapyPipeline(object):
    def process_item(self, item, spider):
        # Get the current working directory
        base_dir = os.getcwd()
        filename = base_dir + '/news.txt'
        # Open the file in append mode and write the item's fields
        with open(filename, 'a', encoding='utf-8') as f:
            f.write(item['title'] + '\n')
            f.write(item['link'] + '\n')
            f.write(item['posttime'] + '\n\n')
        return item
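
A pipeline is an ordinary class, so it can be tried without starting a crawl. The sketch below is a hypothetical variant (TxtPipeline is not the article's class) that opens the file once in open_spider() and closes it in close_spider(), the hooks Scrapy calls at the start and end of a crawl. A plain dict stands in for the Item, and None for the spider argument.

```python
import os
import tempfile


class TxtPipeline:
    """Hypothetical variant that keeps the output file open for the whole crawl."""

    def __init__(self, path="news.txt"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, "a", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(f"{item['title']}\n{item['link']}\n{item['posttime']}\n\n")
        return item

    def close_spider(self, spider):
        self.file.close()


def demo():
    """Run one made-up item through the pipeline and return the file contents."""
    path = os.path.join(tempfile.mkdtemp(), "news.txt")
    pipeline = TxtPipeline(path)
    pipeline.open_spider(spider=None)
    pipeline.process_item(
        {"title": "Example", "link": "http://huxiu.com/article/1.html",
         "posttime": "2024-01-01"},
        spider=None,
    )
    pipeline.close_spider(spider=None)
    with open(path, encoding="utf-8") as f:
        return f.read()


print(demo())
```

Opening the file once avoids reopening it for every item, at the cost of holding a file handle for the duration of the crawl.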

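2. Saving as JSON

Add a new class to pipelines.py. The two versions below are alternatives; use one of them:

# Either version saves JSON; enable it in settings.py with
# 'coolscrapy.pipelines.JsonPipeline': 200


# Version 1: keep the file open for the whole crawl
class JsonPipeline(object):
    def __init__(self):
        self.file = codecs.open('logs.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    # Scrapy calls close_spider() on each pipeline when the spider finishes
    def close_spider(self, spider):
        self.file.close()


# Version 2: reopen the file in append mode for every item
class JsonPipeline(object):
    def process_item(self, item, spider):
        base_dir = os.getcwd()
        filename = base_dir + '/news.json'
        # Serialize the item with json.dumps and append it to the file.
        # ensure_ascii=False is required; otherwise non-ASCII characters
        # are written as \uXXXX escape sequences.
        with codecs.open(filename, 'a', encoding='utf-8') as f:
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            f.write(line)
        return item

Both versions write one JSON object per line. They differ only in the output file name and in when the file is opened.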
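
The effect of ensure_ascii can be checked with the standard json module alone. In this small sketch the sample title is made up; both outputs decode to the same data, and only the on-disk text differs.

```python
import json

item = {"title": "虎嗅网"}  # made-up sample item with non-ASCII text

escaped = json.dumps(item)                       # default: ensure_ascii=True
readable = json.dumps(item, ensure_ascii=False)  # keeps the characters as-is

print(escaped)
print(readable)
```

Because the readable form contains non-ASCII characters, the file it is written to should be opened with an explicit encoding such as UTF-8.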

3. Saving to MySQL

Add a new class to pipelines.py:

class mysqlPipeline(object):
    def process_item(self, item, spider):
        '''
        Save the scraped information to MySQL.
        '''
        # Pull the data out of the item
        title = item['title']
        link = item['link']
        posttime = item['posttime']

        # Connect to the local newsDB database
        db = pymysql.connect(
            host='localhost',    # local database server
            user='root',         # your MySQL user name
            password='123456',   # your password
            database='newsDB',   # database name
            charset='utf8mb4',   # character encoding
            cursorclass=pymysql.cursors.DictCursor)
        try:
            # Get a cursor with cursor()
            cursor = db.cursor()
            # SQL INSERT statement; the driver fills in the %s placeholders
            # and escapes any quotes in the scraped text
            sql = "INSERT INTO NEWS(title, link, posttime) VALUES (%s, %s, %s)"
            # Execute the SQL statement
            cursor.execute(sql, (title, link, posttime))
            # Commit the change
            db.commit()
        finally:
            # Close the connection
            db.close()
        return item
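
Scraped titles often contain quotes, so values are best passed to the database driver as parameters rather than pasted into the SQL string. The sketch below shows this with the standard library's sqlite3 as a stand-in for MySQL, so it runs without a server. Note that sqlite3 uses ? placeholders where PyMySQL uses %s. The article does not show the NEWS table definition, so the three TEXT columns are an assumption, and save_news is a hypothetical helper.

```python
import sqlite3


def save_news(conn, title, link, posttime):
    """Insert one row, letting the driver handle quoting of the values."""
    conn.execute(
        "INSERT INTO NEWS(title, link, posttime) VALUES (?, ?, ?)",
        (title, link, posttime),
    )
    conn.commit()


# In-memory database; the schema is assumed, not taken from the article.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NEWS (title TEXT, link TEXT, posttime TEXT)")

save_news(conn, "It's a 'quoted' headline",
          "http://huxiu.com/article/1.html", "2024-01-01")

rows = conn.execute("SELECT title, link, posttime FROM NEWS").fetchall()
print(rows)
```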

Editing settings.py

We need to register the pipelines we wrote in settings.py so that Scrapy actually runs them.

Just add ITEM_PIPELINES as a dict. The integer values are up to you; pipelines with lower numbers process each item first.

ITEM_PIPELINES = {
    'coolscrapy.pipelines.CoolscrapyPipeline': 300,
    'coolscrapy.pipelines.JsonPipeline': 200,
    'coolscrapy.pipelines.mysqlPipeline': 100,
}
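
To see the order implied by these numbers, the dict can be sorted by value. This is only an illustration of the "lower number first" rule, not Scrapy's internal code, and run_order is a hypothetical helper.

```python
ITEM_PIPELINES = {
    'coolscrapy.pipelines.CoolscrapyPipeline': 300,
    'coolscrapy.pipelines.JsonPipeline': 200,
    'coolscrapy.pipelines.mysqlPipeline': 100,
}


def run_order(pipelines):
    """Return pipeline paths sorted by ascending priority value."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


for path in run_order(ITEM_PIPELINES):
    print(path)
```

With the values above, each item goes to mysqlPipeline first, then JsonPipeline, then CoolscrapyPipeline.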

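Now run the program:

scrapy crawl huxiu

Then check the results in news.txt and news.json and in the NEWS table.

That's all for this time. Code only becomes familiar once you type it out yourself.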
