
Python and Scrapy: My First Spider

Published: 2023/12/8 python 27 豆豆


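My goal in learning Python has always been to move toward data analysis, but data analysis is the kind of work that is hard to practice without the right environment: studying it at home requires a lot of data. So I decided to start with what Python is best known for, web scraping.

I began by looking for data close to my own life and settled on the Shenzhen real estate information system, a public system for querying property information, including building numbers, floor areas, and ownership details. The data is both close to everyday life and worth analyzing. OK, let's start scraping.

The site is fairly old, probably a system from around 2009. It is built with ASP.NET. Having written ASP.NET for a long time, I can tell that many of the controls are the stock .NET ones, and that much of the front-end code, such as the pagination, is auto-generated.

The Python framework used here is Scrapy, a well-known crawling framework. The methods you implement are callbacks, so the whole crawl is asynchronous: Scrapy is built on Twisted's event loop and handles many requests concurrently without using one thread per request.

First install Scrapy, which is simply pip install scrapy. You may run into some problems along the way; deal with them as they come. I won't go into them here; if you get stuck, search Baidu.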

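1) Create the Scrapy project. Scraping one website can be treated as one project.

In the directory of your choice, run scrapy startproject rishome

The generated directory structure contains the following:

a) items is similar to the object classes of an ORM, which correspond to database tables

b) the spiders directory holds the spiders; here I created one named itcast

c) pipelines handles writing items to the database

d) middlewares is not used here, so I'll skip it and add notes once I use it

e) settings holds the settings for this crawler

The execution order is: the itcast spider requests a page and receives HTML, parses the HTML and packs the data into item objects, then passes the items to the pipelines for processing. The whole process is asynchronous, driven by an event loop rather than by one thread per request.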
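The flow described above can be imitated in a few lines of plain Python. This is only a toy sketch, not how Scrapy is implemented: the `Request` class, `fake_download`, and the page names are all made up for illustration, and a simple queue stands in for Scrapy's scheduler and event loop. It shows that callbacks are generators that yield either further requests or finished items, and that items end up in the pipeline.

```python
from collections import deque


class Request:
    """Hypothetical stand-in for scrapy.Request: a URL plus the callback for its response."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def fake_download(url):
    # Assumption: canned pages instead of real HTTP, so the sketch runs offline.
    pages = {"list": ["detail/1", "detail/2"], "detail/1": [], "detail/2": []}
    return {"url": url, "links": pages[url]}


def parse_list(response):
    # A callback may yield further requests...
    for link in response["links"]:
        yield Request(link, parse_detail)


def parse_detail(response):
    # ...or finished items (plain dicts here).
    yield {"id": response["url"].split("/")[-1]}


def run(start_request, pipeline):
    """Process requests one by one, routing each result to the queue or the pipeline."""
    queue = deque([start_request])
    while queue:
        request = queue.popleft()
        response = fake_download(request.url)
        for result in request.callback(response):
            if isinstance(result, Request):
                queue.append(result)
            else:
                pipeline(result)


collected = []
run(Request("list", parse_list), collected.append)
print(collected)  # [{'id': '1'}, {'id': '2'}]
```

Everything here runs in a single thread; the apparent concurrency in a real crawl comes from the framework not waiting for one download to finish before starting the next.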

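2) Structure analysis

First analyze how the data on the pages is organized. What we see first is a list of projects (developments); clicking a project shows several branches, each branch has several buildings, and each building has several units.

Project list:

Building branches:

Block names:

Unit details:

The relationship is: project 1:N building 1:N unit (the block is just one attribute of the unit).

Working this out requires looking at the URL structure, not only at the pages.

When you open a project, the URL is http://ris.szpl.gov.cn/bol/projectdetail.aspx?id=37813 and id is clearly the project's id.

When you open a building in that project, the URL is http://ris.szpl.gov.cn/bol/building.aspx?id=33043&presellid=37813 where id is the building id and presellid is the project id.

When you open a unit, the URL is http://ris.szpl.gov.cn/bol/housedetail.aspx?id=1634258 where id is the unit's id.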
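As a small illustration of reading these ids out of a URL, the sketch below uses the standard library's `urllib.parse`. The helper name `query_params` is made up for this example; parsing the query string this way keeps working when a URL carries more than one parameter, as the building URL does.

```python
from urllib.parse import urlparse, parse_qs


def query_params(url):
    """Return the query string of `url` as a dict of single values."""
    query = parse_qs(urlparse(url).query)
    return {key: values[0] for key, values in query.items()}


building = query_params(
    "http://ris.szpl.gov.cn/bol/building.aspx?id=33043&presellid=37813")
print(building["id"], building["presellid"])  # 33043 37813
```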

Database design based on this analysis:

property:
    id              project id
    name            project name
buliding:
    id              building id
    propertyid      project id
    bulidingname    building name
house:
    id              unit id
    bulidingid      building id
    name            unit name
    square          unit area
    ....
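To make the 1:N relationships concrete, here is a minimal sketch of that design using the standard library's `sqlite3` in place of MySQL, so it runs without a server. The column types, the foreign-key clauses, and the sample rows are assumptions for illustration; only the table and column names come from the design above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE property (
    id INTEGER PRIMARY KEY,
    name TEXT
);
CREATE TABLE buliding (
    id INTEGER PRIMARY KEY,
    propertyid INTEGER REFERENCES property(id),
    bulidingname TEXT
);
CREATE TABLE house (
    id INTEGER PRIMARY KEY,
    bulidingid INTEGER REFERENCES buliding(id),
    name TEXT,
    square REAL
);
""")

# Placeholder rows: one project, one building in it, one unit in that building.
conn.execute("INSERT INTO property VALUES (1, 'demo project')")
conn.execute("INSERT INTO buliding VALUES (10, 1, 'demo building')")
conn.execute("INSERT INTO house VALUES (100, 10, 'demo unit', 88.5)")

# Walk from a unit up to its building and project.
row = conn.execute("""
    SELECT p.name, b.bulidingname, h.name
    FROM house h
    JOIN buliding b ON h.bulidingid = b.id
    JOIN property p ON b.propertyid = p.id
""").fetchone()
print(row)  # ('demo project', 'demo building', 'demo unit')
```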

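3) Create the items, based on the database

Items are not mapped directly to the database, so they need not mirror it exactly; class and attribute names do not have to match the database.

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html

    import scrapy


    class RishomeItem(scrapy.Item):
        id = scrapy.Field()
        name = scrapy.Field()
        # Fields the spider also sets; an Item rejects fields that are not declared.
        url = scrapy.Field()
        location = scrapy.Field()
        district = scrapy.Field()
        landstartdate = scrapy.Field()
        landyear = scrapy.Field()
        landproperty = scrapy.Field()
        houseproperty = scrapy.Field()


    class BulidingItem(scrapy.Item):
        id = scrapy.Field()
        propertyid = scrapy.Field()
        bulidingname = scrapy.Field()
        url = scrapy.Field()  # also set by the spider


    class HouseItem(scrapy.Item):
        id = scrapy.Field()
        bulidingid = scrapy.Field()
        level = scrapy.Field()
        houseno = scrapy.Field()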

Now create the spider itself

    # -*- coding: utf-8 -*-
    import scrapy
    from rishome.items import RishomeItem, BulidingItem, HouseItem
    from rishome.pipelines import RishomePipeline


    class ItcastSpider(scrapy.Spider):
        name = 'itcast'
        allowed_domains = ['ris.szpl.gov.cn']
        start_urls = ['http://ris.szpl.gov.cn/bol/']

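name is the spider's name; it is also what you pass to the command when running the spider later.

allowed_domains restricts the crawl to the listed domains; requests to any other domain are filtered out as offsite.

start_urls lists the pages the crawl starts from.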
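To illustrate the idea behind allowed_domains, here is a simplified sketch of such a domain check written with the standard library. It is an approximation for illustration, not Scrapy's actual filtering code, and the function name `is_allowed` is made up: a URL passes if its host is one of the allowed domains or a subdomain of one.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["ris.szpl.gov.cn"]
print(is_allowed("http://ris.szpl.gov.cn/bol/projectdetail.aspx?id=37813", allowed))  # True
print(is_allowed("http://example.com/", allowed))  # False
```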

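When a GET request is sent to a start URL, the result arrives in the def parse(self, response) method; response is what the server returned to us.

Next comes an important topic, XPath: finding the element you want among countless HTML tags is a skill in itself.

We search with response.xpath. A later article will cover it in more detail.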

    def parse(self, response):
        context = response.xpath('//tr[@bgcolor="#F5F9FC"]/td[3]')
        for item in context:
            title = item.xpath('a/text()').extract_first()
            idstr = item.xpath('a/@href').extract_first()
            idstr = idstr[idstr.find('=') + 1:]
            request = scrapy.Request(
                url='http://ris.szpl.gov.cn/bol/projectdetail.aspx?id=' + idstr,
                method='GET', callback=self.showdetailpage)
            yield request

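We don't need to scrape anything from the first page itself; we only follow its links to the next pages. Chrome can help us work out the XPath.

What we scrape is this table, whose rows are <tr bgcolor="#F5F9FC">. The project name is clickable, and the link on it is what we ultimately want to extract. The XPath that selects all of these cells is response.xpath('//tr[@bgcolor="#F5F9FC"]/td[3]')

// selects matching nodes anywhere in the document

//tr[@bgcolor="#F5F9FC"] selects every tr node whose bgcolor attribute is "#F5F9FC"

//tr[@bgcolor="#F5F9FC"]/td[3] selects the third td under every tr node whose bgcolor attribute is "#F5F9FC"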
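The same expression can be tried offline. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports a subset of XPath, instead of Scrapy's selectors; the table markup is a made-up, well-formed stand-in for the real page, and real-world HTML would usually need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Assumption: a tiny, well-formed stand-in for the project list table.
html = """
<table>
  <tr bgcolor="#F5F9FC">
    <td>1</td><td>x</td>
    <td><a href="projectdetail.aspx?id=37873">Project A</a></td>
  </tr>
  <tr bgcolor="#FFFFFF">
    <td>2</td><td>y</td>
    <td><a href="projectdetail.aspx?id=1">Project B</a></td>
  </tr>
</table>
""".strip()

root = ET.fromstring(html)

# Third td of every tr whose bgcolor attribute is "#F5F9FC".
cells = root.findall(".//tr[@bgcolor='#F5F9FC']/td[3]")
links = [(td.find("a").text, td.find("a").get("href")) for td in cells]
print(links)  # [('Project A', 'projectdetail.aspx?id=37873')]
```

Only the first row matches, because the predicate filters on the bgcolor value before the third cell is picked.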

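context = response.xpath('//tr[@bgcolor="#F5F9FC"]/td[3]') assigns the result to context, which is itself a list of selectors.

Since it is a collection, we can iterate over it: in for item in context: each item is one td node, and its child is the link <a href="projectdetail.aspx?id=37873" target="_parent">天安云谷產業園二期(02-08)</a>

To get the text inside the a tag, use title=item.xpath('a/text()').extract_first()

extract_first() returns the first result as a string (or None if nothing matched)

extract() returns a list containing all results

To get the value of an attribute of a tag, use idstr=item.xpath('a/@href').extract_first(), which gives us the href value of the a tag. Because this is a relative address, we build the full URL from it before sending a request.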

    idstr = idstr[idstr.find('=') + 1:]
    request = scrapy.Request(
        url='http://ris.szpl.gov.cn/bol/projectdetail.aspx?id=' + idstr,
        method='GET', callback=self.showdetailpage)
    yield request

At this point we have sent a Request to the project URL using the GET method; the response is handed to the callback self.showdetailpage.
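An alternative to cutting the id out of the href and concatenating it onto a hard-coded URL is to resolve the relative href against the page's own URL. A minimal sketch with the standard library's `urljoin`, using the start URL and the sample href from above:

```python
from urllib.parse import urljoin

base = "http://ris.szpl.gov.cn/bol/"
href = "projectdetail.aspx?id=37873"

absolute = urljoin(base, href)
print(absolute)  # http://ris.szpl.gov.cn/bol/projectdetail.aspx?id=37873
```

Inside a Scrapy callback, response.urljoin(href) performs the same resolution with the response's URL as the base.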

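Next we write the callback, where we populate the objects defined in items. Extract the content from the page, store it in a RishomeItem, and hand the populated object to the pipelines. Then, for anything that needs a deeper level, keep yielding requests.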

    def showdetailpage(self, response):
        item = RishomeItem()
        homeid = response.url[response.url.find('=') + 1:]
        item["url"] = response.url
        item["id"] = homeid
        context = response.xpath('//tr[@class="a1"]')
        for it in context:
            title = it.xpath('td[1]/div/text()').extract_first()
            if title == "項目名稱":  # project name
                content = it.xpath('td[2]/text()').extract_first()
                item['name'] = content
            if title == "宗地位置":  # parcel location
                content = it.xpath('td[2]/text()').extract_first()
                item['location'] = content
            if title == "受讓日期":  # transfer date
                content = it.xpath('td[2]/text()').extract_first()
                item['landstartdate'] = content
                content = it.xpath('td[4]/text()').extract_first()
                item['district'] = content.replace('\r\n', '').replace(' ', '')
            if title == "合同文號":  # contract number
                content = it.xpath('td[4]/div/text()').extract_first()
                item['landyear'] = content.replace('\r\n', '').replace(' ', '').replace('年', '')
            if title == "房屋用途":  # house use
                content = it.xpath('td[2]/text()').extract_first()
                item['houseproperty'] = content
            if title == "土地用途":  # land use
                content = it.xpath('td[2]/text()').extract_first()
                item['landproperty'] = content
        yield item

        # Each row of DataList1 is one building in this project.
        projectlist = response.xpath('//*[@id="DataList1"]/tr[@bgcolor="#F5F9FC"]')
        for it in projectlist:
            bi = BulidingItem()
            bulidingname = it.xpath('td[2]/text()').extract_first()
            bulidingurl = it.xpath('td[5]/a/@href').extract_first()
            bulidingurl = self.start_urls[0] + bulidingurl
            biid = bulidingurl[bulidingurl.find('id=') + 3:bulidingurl.find('&')]
            bi['id'] = biid
            bi['propertyid'] = homeid
            bi['bulidingname'] = bulidingname
            bi['url'] = bulidingurl
            yield bi
            # showhousepage handles the unit pages.
            request = scrapy.Request(bulidingurl, method='GET', callback=self.showhousepage)
            yield request
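The chain of if statements in the callback can also be written as a lookup table, with a small helper that tolerates missing values (extract_first() returns None when nothing matches, and calling .replace on None would fail). This is a sketch of that alternative, not the code used above: the helper names, the English labels, and the sample rows are placeholders for illustration.

```python
def clean(text):
    """Strip line breaks and spaces; treat a missing value (None) as an empty string."""
    if text is None:
        return ""
    return text.replace("\r\n", "").replace(" ", "")


# Hypothetical mapping from a row label on the page to the item field it fills.
FIELD_BY_LABEL = {
    "Project name": "name",
    "Parcel location": "location",
}


def rows_to_item(rows):
    """Build a dict item from (label, value) pairs, ignoring unknown labels."""
    item = {}
    for label, value in rows:
        field = FIELD_BY_LABEL.get(label.strip())
        if field is not None:
            item[field] = clean(value)
    return item


rows = [("Project name", " Demo \r\n"), ("Parcel location", None), ("Other", "x")]
print(rows_to_item(rows))  # {'name': 'Demo', 'location': ''}
```

Adding a new field then means adding one entry to the table rather than another if block.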

Now let's look at how the pipeline processes the items

Before the pipeline can be used, it has to be enabled in settings

    # settings.py
    ITEM_PIPELINES = {
        'rishome.pipelines.RishomePipeline': 300,
    }

    # pipelines.py
    import time

    import pymysql.cursors

    from rishome.items import RishomeItem, BulidingItem, HouseItem


    class RishomePipeline(object):
        def __init__(self):
            # Connect to the database
            self.connect = pymysql.connect(
                host='127.0.0.1',  # database host
                port=3306,         # database port
                db='rishome',      # database name
                user='root',       # database user
                passwd='',         # database password
                charset='utf8',    # encoding
                use_unicode=True)
            self.cursor = self.connect.cursor()

        # Every yielded item arrives in this method; items are told apart by their type.
        def process_item(self, item, spider):
            # savebuliding and savehouse are not shown in this article.
            if isinstance(item, RishomeItem):
                self.saverishome(item)
            if isinstance(item, BulidingItem):
                self.savebuliding(item)
            if isinstance(item, HouseItem):
                self.savehouse(item)
            return item  # the item must be returned

        def saverishome(self, item):
            timestr = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
            sqlstr = """
                INSERT INTO `rishome`.`property`
                    (`id`,`name`,`location`,`district`,`landstartdate`,`landyear`,
                     `landproperty`,`houseproperty`,`createtime`,`updatetime`,`url`)
                VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
            """
            # Empty strings for the date and year columns are stored as NULL.
            self.cursor.execute(sqlstr, (
                item['id'], item['name'], item['location'], item['district'],
                None if item['landstartdate'] == '' else item['landstartdate'],
                None if item['landyear'] == '' else item['landyear'],
                item['landproperty'], item['houseproperty'],
                timestr, timestr, item['url']))
            self.connect.commit()
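A pipeline can also open and close its database connection in the open_spider and close_spider hooks, which Scrapy calls once at the start and end of a crawl. The sketch below shows that shape with the standard library's `sqlite3` standing in for pymysql so that it runs without a MySQL server; the class name, the reduced three-column table, and the sample item are assumptions for illustration, and the hooks are called by hand here where Scrapy would call them.

```python
import sqlite3


class SqlitePipeline:
    """Sketch of a pipeline that owns its connection for the lifetime of the crawl."""

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE property (id TEXT PRIMARY KEY, name TEXT, landyear TEXT)")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()

    def process_item(self, item, spider):
        # Placeholders keep values out of the SQL text; an empty landyear is stored as NULL.
        self.conn.execute(
            "INSERT INTO property (id, name, landyear) VALUES (?, ?, ?)",
            (item["id"], item["name"], item.get("landyear") or None),
        )
        self.conn.commit()
        return item


# Drive the hooks manually, in the order Scrapy would call them.
pipeline = SqlitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"id": "37813", "name": "demo", "landyear": ""}, spider=None)
stored = pipeline.conn.execute("SELECT id, name, landyear FROM property").fetchall()
pipeline.close_spider(spider=None)
print(stored)  # [('37813', 'demo', None)]
```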

Finally, run the spider. Go to the spider directory and run it:

E:\scrapy\rishome\rishome\spiders>scrapy crawl itcast

That completes the whole process, from scraping pages and following URLs to storing the data in the database.
