Scraping Anjuke Second-Hand Community Detail Pages with Python

Published: 2023/12/14 python 23 豆豆


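Hi everyone! The previous two posts covered most of the basic crawler workflow, so this one raises the difficulty a little (challenge 1: it involves crawling second-level pages; challenge 2: it collects 16 fields in total). The topic: taking Shijiazhuang as the example, scrape the fields shown on the detail pages of second-hand housing communities on Anjuke. I won't say much about scraping the community list page itself, because the steps are essentially the same as in the previous post (Python爬取58同城在售樓盤房源信息, on scraping for-sale property listings from 58.com); have a look if you are interested. Enough talk, let's get started.

First, open the Anjuke website and set two filters: Shijiazhuang, and second-hand housing communities (choose whatever interests you). The filter returns 11,688 communities at 25 per page, so roughly 468 pages of data. Crawling every community would take a long time, and this post is mainly about the workflow, so we only scrape the detail pages of the first 500 communities. Next, let's see which fields a community detail page offers.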
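The page counts above are easy to check. A minimal sketch, where the variable names are my own and the numbers are the ones quoted in the paragraph:

```python
import math

TOTAL_COMMUNITIES = 11688  # count shown by the site filter, as quoted above
PER_PAGE = 25              # communities per list page
TARGET = 500               # how many communities this post scrapes

# Round up, because the last page may be only partly filled.
total_pages = math.ceil(TOTAL_COMMUNITIES / PER_PAGE)
pages_needed = math.ceil(TARGET / PER_PAGE)

print(total_pages, pages_needed)
```

So the full crawl would cover 468 list pages, while the first 500 communities fit in 20.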

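Take the first community on the first list page, 恒大御景半島, as an example and open its detail page. The page shows many fields, and our task is to scrape them: community name, district and address, average price, number of second-hand listings, number of rental listings, property type, property management fee, total floor area, total number of households, completion date, parking spaces, plot ratio, greening rate, developer, property management company, and business district, 16 fields in all.

As mentioned at the start, this example is harder than the previous two, mainly in two respects: second-level pages have to be crawled, and there are many fields to collect. Don't panic, though; it is not actually that hard. Here is the rough workflow: first scrape the URL of every community detail page from the community list page, then loop over those detail-page URLs and scrape the fields from each detail page in turn. It is basically a loop inside a loop. If that is still unclear, reading the code later should make it click.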
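The loop-inside-a-loop idea can be sketched without touching the network. Everything below (FAKE_SITE, fetch_detail_urls, fetch_detail_fields, crawl) is a hypothetical stand-in invented for illustration; the real functions appear later in the post:

```python
# Hypothetical stand-in for the website, so the control flow can run offline.
FAKE_SITE = {
    "list/p1": ["detail/a", "detail/b"],
    "list/p2": ["detail/c"],
}

def fetch_detail_urls(list_url):
    """Level 1: return the detail-page URLs found on one list page."""
    return FAKE_SITE.get(list_url, [])

def fetch_detail_fields(detail_url):
    """Level 2: return the fields scraped from one detail page."""
    return {"url": detail_url, "name": detail_url.split("/")[-1].upper()}

def crawl(list_urls):
    rows = []
    for list_url in list_urls:                          # outer loop: list pages
        for detail_url in fetch_detail_urls(list_url):  # inner loop: detail pages
            rows.append(fetch_detail_fields(detail_url))
    return rows

rows = crawl(["list/p1", "list/p2"])
print(len(rows), [r["name"] for r in rows])
```

The outer loop walks the list pages and the inner loop visits every detail page found on each of them, which is the same shape the real crawler uses.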

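1. Getting the URLs of the Anjuke Shijiazhuang second-hand community list

I won't go into how to find the URL here and will just show the result. If you are new to this, see my previous two posts on basic crawler examples.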

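# URL of the first list page
url = 'https://sjz.anjuke.com/community/p1'

# Multi-page crawl: to keep things manageable, take the first 500 communities (25 per page, so 20 pages)
for i in range(1, 21):  # pages are numbered from p1, so start at 1 rather than 0
    url = 'https://sjz.anjuke.com/community/p{}'.format(i)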

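2. Inspecting the page HTML to locate each field

Two pages are involved here: the community list page and each community's detail page. Let's look at them in turn:

(1) HTML of the community list page:

From the list page we only need two things: the detail-page URL of each community, and the average price of each community.

(2) HTML of the community detail page:

3. Parsing the pages with XPath to get each field's value

(1) Community list page: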

# Detail-page URL of each community:
link = html.xpath('.//div[@class="list-cell"]/a/@href')
# Average price of each community:
price = html.xpath('.//div[@class="list-cell"]/a/div[3]/div/strong/text()')

(2) Community detail page:

# Every field starts out as the placeholder '-'
dict_result = {'小區名稱': '-', '價格': '-', '小區地址': '-', '物業類型': '-',
               '物業費': '-', '總建面積': '-', '總戶數': '-', '建造年代': '-',
               '停車位': '-', '容積率': '-', '綠化率': '-', '開發商': '-',
               '物業公司': '-', '所屬商圈': '-', '二手房源數': '-', '租房房源數': '-'}

# The basic-info block appears under one of two class values, so every query tries both
BASIC = ('.//div[@class="comm-basic-mod "]/div[2]{0}'
         '|.//div[@class="comm-basic-mod has-pano-box "]/div[2]{0}')

dict_result['小區名稱'] = html.xpath('.//div[@class="comm-title"]/h1/text()')
dict_result['小區地址'] = html.xpath('.//div[@class="comm-title"]/h1/span/text()')

# The 11 <dd> entries of the basic-info list, in page order (dd[1] to dd[11])
dd_fields = ['物業類型', '物業費', '總建面積', '總戶數', '建造年代', '停車位',
             '容積率', '綠化率', '開發商', '物業公司', '所屬商圈']
for n, field in enumerate(dd_fields, start=1):
    dict_result[field] = html.xpath(BASIC.format('/dl/dd[{}]/text()'.format(n)))

# Listing counts: the first link is second-hand listings, the second is rentals
dict_result['二手房源數'] = html.xpath(BASIC.format('/div[3]/a[1]/text()'))
dict_result['租房房源數'] = html.xpath(BASIC.format('/div[3]/a[2]/text()'))

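4. Crawling the first page: detail data for 25 communities

I usually start by scraping just the first page, and only add the multi-page loop once every field on that page comes back correctly. If you already have some experience, feel free to skip this section and go straight to the complete code in section 5 (multi-page crawl, complete code).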

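(1) Imports and creating the file object

## Import the required packages
from lxml import etree
import requests
from fake_useragent import UserAgent
import random
import time
import csv
import re

## Create the file object
f = open('安居客網石家莊市二手房源信息.csv', 'w', encoding='utf-8-sig', newline="")
csv_write = csv.DictWriter(f, fieldnames=['小區名稱', '價格', '小區地址', '物業類型', '物業費', '總建面積', '總戶數', '建造年代', '停車位', '容積率', '綠化率', '開發商', '物業公司', '所屬商圈', '二手房源數', '租房房源數'])
csv_write.writeheader()  # write the header row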

(2) Anti-blocking settings

## Request headers: User-Agent, cookie, referer
ua = UserAgent()
headers = {
    # Randomly generated User-Agent
    "user-agent": ua.random,
    # Cookies differ per user and per visit; use the value from your own browser session (how to get it is covered in another post)
    "cookie": "sessid=C7103713-BE7D-9BEF-CFB5-6048A637E2DF; aQQ_ajkguid=263AC301-A02C-088D-AE4E-59D4B4D4726A; ctid=28; twe=2; id58=e87rkGCpsF6WHADop0A3Ag==; wmda_uuid=1231c40ad548840be4be3d965bc424de; wmda_new_uuid=1; wmda_session_id_6289197098934=1621733471115-664b82b6-8742-1591; wmda_visited_projects=%3B6289197098934; obtain_by=2; 58tj_uuid=8b1e1b8f-3890-47f7-ba3a-7fc4469ca8c1; new_session=1; init_refer=http%253A%252F%252Flocalhost%253A8888%252F; new_uv=1; _ga=GA1.2.1526033348.1621734712; _gid=GA1.2.876089249.1621734712; als=0; xxzl_cid=7be33aacf08c4431a744d39ca848819a; xzuid=717fc82c-ccb6-4394-9505-36f7da91c8c6",
    # The page we claim to have come from
    "referer": "https://sjz.anjuke.com/community/p1/",
}

## Fetch a random IP from the proxy pool; the ProxyPool project must be running
def get_proxy():
    try:
        PROXY_POOL_URL = 'http://localhost:5555/random'
        response = requests.get(PROXY_POOL_URL)
        if response.status_code == 200:
            return response.text
    # requests raises its own ConnectionError, which the built-in ConnectionError does not catch
    except requests.exceptions.ConnectionError:
        return None
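One detail worth handling is what happens when the pool has nothing to offer: get_proxy() can return None, and formatting that straight into a URL gives "http://None". A minimal sketch of a guard, where build_proxies is a hypothetical helper of my own and the pool is assumed to return plain "host:port" text:

```python
def build_proxies(proxy):
    """Turn 'host:port' (or None) into a dict usable as requests.get(..., proxies=...)."""
    if not proxy:
        return {}  # no proxy available: pass an empty mapping and connect directly
    address = "http://{}".format(proxy.strip())
    # Map both schemes: an "http"-only entry is not used for https:// URLs.
    # Assumption: the proxy itself is an HTTP proxy that can also tunnel HTTPS.
    return {"http": address, "https": address}

print(build_proxies("127.0.0.1:8080"))
print(build_proxies(None))
```

It could then be used as requests.get(url, headers=headers, proxies=build_proxies(get_proxy())), though whether a given pool's proxies handle HTTPS is something to verify for your own setup.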

(3) Function for parsing the first-level page:

It scrapes the detail-page URL and the average price of every community on the list page.

## Parse a first-level page (community list)
def get_link(url):
    text = requests.get(url=url, headers=headers, proxies={"http": "http://{}".format(get_proxy())}).text
    html = etree.HTML(text)
    link = html.xpath('.//div[@class="list-cell"]/a/@href')
    price = html.xpath('.//div[@class="list-cell"]/a/div[3]/div/strong/text()')
    # print(link)
    # print(price)
    return zip(link, price)
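Since get_link() returns zip(link, price), two properties of zip are worth keeping in mind. The sketch below uses made-up URLs and prices purely for illustration:

```python
links = [
    "https://example.com/community/1",
    "https://example.com/community/2",
    "https://example.com/community/3",
]
prices = ["12000", "9800"]  # one price fewer than links, e.g. a community with no quoted price

pairs = zip(links, prices)
first_pass = list(pairs)   # consumes the iterator; stops at the shorter list
second_pass = list(pairs)  # the iterator is already exhausted

print(len(first_pass), len(second_pass))
```

In other words, the result can only be looped over once, and if the two XPath queries return lists of different lengths the extra items are dropped silently (and, depending on which price is missing, later pairs may be misaligned). Comparing len(link) with len(price) before zipping is a cheap sanity check.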

(4) Function for parsing the second-level page, i.e. the community detail page

## Parse a second-level page (community detail page)
# The basic-info block appears under one of two class values, so every query tries both
BASIC = ('.//div[@class="comm-basic-mod "]/div[2]{0}'
         '|.//div[@class="comm-basic-mod has-pano-box "]/div[2]{0}')
# The 11 <dd> entries of the basic-info list, in page order (dd[1] to dd[11])
DD_FIELDS = ['物業類型', '物業費', '總建面積', '總戶數', '建造年代', '停車位',
             '容積率', '綠化率', '開發商', '物業公司', '所屬商圈']

def parse_message(url, price):
    dict_result = {'小區名稱': '-', '價格': '-', '小區地址': '-', '物業類型': '-',
                   '物業費': '-', '總建面積': '-', '總戶數': '-', '建造年代': '-',
                   '停車位': '-', '容積率': '-', '綠化率': '-', '開發商': '-',
                   '物業公司': '-', '所屬商圈': '-', '二手房源數': '-', '租房房源數': '-'}
    text = requests.get(url=url, headers=headers, proxies={"http": "http://{}".format(get_proxy())}).text
    html = etree.HTML(text)
    dict_result['小區名稱'] = html.xpath('.//div[@class="comm-title"]/h1/text()')
    dict_result['小區地址'] = html.xpath('.//div[@class="comm-title"]/h1/span/text()')
    for n, field in enumerate(DD_FIELDS, start=1):
        dict_result[field] = html.xpath(BASIC.format('/dl/dd[{}]/text()'.format(n)))
    dict_result['二手房源數'] = html.xpath(BASIC.format('/div[3]/a[1]/text()'))
    dict_result['租房房源數'] = html.xpath(BASIC.format('/div[3]/a[2]/text()'))
    # Simple clean-up of the scraped values
    for key, value in dict_result.items():
        value = list(map(lambda item: re.sub(r'\s+', '', item), value))  # strip newlines, tabs and other whitespace
        dict_result[key] = list(filter(None, value))  # drop the empty items left by the previous step
        if len(dict_result[key]) == 0:
            dict_result[key] = ''
        else:
            dict_result[key] = dict_result[key][0]
    dict_result['價格'] = price
    return dict_result
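The clean-up step at the end of parse_message() is easier to follow in isolation. A minimal sketch of the same logic as a standalone function; clean_field is a name of my own and the sample input is invented:

```python
import re

def clean_field(values):
    """Collapse an XPath text() result list into one string ('' when nothing usable)."""
    stripped = [re.sub(r"\s+", "", item) for item in values]  # remove spaces, newlines, tabs
    non_empty = [item for item in stripped if item]           # drop items that became empty
    return non_empty[0] if non_empty else ""

print(clean_field(["\n    ", " 公寓住宅 \t", "\n"]))
print(repr(clean_field([])))
```

Note that the regular expression removes every whitespace character, including spaces inside a value, which suits compact Chinese field values but would also squash an English phrase into one word.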

(5) Saving the data to the file: the save_csv() function

## Write the data to the CSV file
def save_csv(result):
    for row in result:  # each community's data is stored in one dict
        csv_write.writerow(row)
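To see how csv.DictWriter maps each dict onto the header columns, here is a small sketch that writes to an in-memory buffer instead of a real file. The shortened column list and the two rows are invented for illustration:

```python
import csv
import io

fieldnames = ["name", "price", "address"]  # hypothetical, shortened column list
rows = [
    {"name": "A", "price": "12000", "address": "X Road"},
    {"name": "B", "price": "", "address": "Y Street"},
]

buffer = io.StringIO()  # in-memory stand-in for the real file object
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)  # same effect as calling writerow() once per row

lines = buffer.getvalue().splitlines()
print(lines)
```

Each dict key must match an entry in fieldnames; an empty string simply becomes an empty cell, which is why parse_message() falls back to '' for fields it could not find.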

(6) Main routine when crawling only the first page

# Main routine
k = 1  # running count of communities scraped
print("************************ Page 1: crawl started ************************")
# URL of the first page
url = 'https://sjz.anjuke.com/community/p1'
# Parse the first-level page; it returns (detail-page URL, average price) pairs
link = get_link(url)
list_result = []  # holds one dict per community
for j in link:
    try:
        # Parse the second-level page, passing in the detail-page URL and the average price
        result = parse_message(j[0], j[1])
        list_result.append(result)
        print("Scraped {} records so far".format(k))
        k = k + 1  # count the communities scraped
        time.sleep(random.randint(5, 10))  # pause between requests
    except Exception as err:
        print("-----------------------------")
        print(err)
# Save the data to the file
save_csv(list_result)
print("************************ Page 1: crawl finished ************************")

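5. Multi-page crawl: the complete code

The code is fairly long, so please read it patiently. If you are just starting out with crawlers, go through section 4 first; once you can scrape the first page, the multi-page version will be much easier to follow.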

## Import the required packages
from lxml import etree
import requests
from fake_useragent import UserAgent
import random
import time
import csv
import re

## Create the file object
f = open('安居客網石家莊市二手房源信息.csv', 'w', encoding='utf-8-sig', newline="")
csv_write = csv.DictWriter(f, fieldnames=['小區名稱', '價格', '小區地址', '物業類型', '物業費', '總建面積', '總戶數', '建造年代', '停車位', '容積率', '綠化率', '開發商', '物業公司', '所屬商圈', '二手房源數', '租房房源數'])
csv_write.writeheader()  # write the header row

## Request headers: User-Agent, cookie, referer
ua = UserAgent()
headers = {
    # Randomly generated User-Agent
    "user-agent": ua.random,
    # Cookies differ per user and per visit; use the value from your own browser session (how to get it is covered in another post)
    "cookie": "sessid=C7103713-BE7D-9BEF-CFB5-6048A637E2DF; aQQ_ajkguid=263AC301-A02C-088D-AE4E-59D4B4D4726A; ctid=28; twe=2; id58=e87rkGCpsF6WHADop0A3Ag==; wmda_uuid=1231c40ad548840be4be3d965bc424de; wmda_new_uuid=1; wmda_session_id_6289197098934=1621733471115-664b82b6-8742-1591; wmda_visited_projects=%3B6289197098934; obtain_by=2; 58tj_uuid=8b1e1b8f-3890-47f7-ba3a-7fc4469ca8c1; new_session=1; init_refer=http%253A%252F%252Flocalhost%253A8888%252F; new_uv=1; _ga=GA1.2.1526033348.1621734712; _gid=GA1.2.876089249.1621734712; als=0; xxzl_cid=7be33aacf08c4431a744d39ca848819a; xzuid=717fc82c-ccb6-4394-9505-36f7da91c8c6",
    # The page we claim to have come from
    "referer": "https://sjz.anjuke.com/community/p1/",
}

## Fetch a random IP from the proxy pool; the ProxyPool project must be running
def get_proxy():
    try:
        PROXY_POOL_URL = 'http://localhost:5555/random'
        response = requests.get(PROXY_POOL_URL)
        if response.status_code == 200:
            return response.text
    # requests raises its own ConnectionError, which the built-in ConnectionError does not catch
    except requests.exceptions.ConnectionError:
        return None

## Parse a first-level page (community list)
def get_link(url):
    text = requests.get(url=url, headers=headers, proxies={"http": "http://{}".format(get_proxy())}).text
    html = etree.HTML(text)
    link = html.xpath('.//div[@class="list-cell"]/a/@href')
    price = html.xpath('.//div[@class="list-cell"]/a/div[3]/div/strong/text()')
    # print(link)
    # print(price)
    return zip(link, price)

## Parse a second-level page (community detail page)
# The basic-info block appears under one of two class values, so every query tries both
BASIC = ('.//div[@class="comm-basic-mod "]/div[2]{0}'
         '|.//div[@class="comm-basic-mod has-pano-box "]/div[2]{0}')
# The 11 <dd> entries of the basic-info list, in page order (dd[1] to dd[11])
DD_FIELDS = ['物業類型', '物業費', '總建面積', '總戶數', '建造年代', '停車位',
             '容積率', '綠化率', '開發商', '物業公司', '所屬商圈']

def parse_message(url, price):
    dict_result = {'小區名稱': '-', '價格': '-', '小區地址': '-', '物業類型': '-',
                   '物業費': '-', '總建面積': '-', '總戶數': '-', '建造年代': '-',
                   '停車位': '-', '容積率': '-', '綠化率': '-', '開發商': '-',
                   '物業公司': '-', '所屬商圈': '-', '二手房源數': '-', '租房房源數': '-'}
    text = requests.get(url=url, headers=headers, proxies={"http": "http://{}".format(get_proxy())}).text
    html = etree.HTML(text)
    dict_result['小區名稱'] = html.xpath('.//div[@class="comm-title"]/h1/text()')
    dict_result['小區地址'] = html.xpath('.//div[@class="comm-title"]/h1/span/text()')
    for n, field in enumerate(DD_FIELDS, start=1):
        dict_result[field] = html.xpath(BASIC.format('/dl/dd[{}]/text()'.format(n)))
    dict_result['二手房源數'] = html.xpath(BASIC.format('/div[3]/a[1]/text()'))
    dict_result['租房房源數'] = html.xpath(BASIC.format('/div[3]/a[2]/text()'))
    # Simple clean-up of the scraped values
    for key, value in dict_result.items():
        value = list(map(lambda item: re.sub(r'\s+', '', item), value))  # strip newlines, tabs and other whitespace
        dict_result[key] = list(filter(None, value))  # drop the empty items left by the previous step
        if len(dict_result[key]) == 0:
            dict_result[key] = ''
        else:
            dict_result[key] = dict_result[key][0]
    dict_result['價格'] = price
    return dict_result

## Write the data to the CSV file
def save_csv(result):
    for row in result:
        csv_write.writerow(row)

## Main code
k = 1  # running count of communities scraped

# Multi-page crawl: for lack of time only the first 500 communities are scraped; extend the range if you want more
for i in range(1, 21):  # 25 communities per page, so the first 500 take 20 pages
    print("************************ Page %s: crawl started ************************" % i)
    url = 'https://sjz.anjuke.com/community/p{}'.format(i)
    # Parse the first-level page; it returns (detail-page URL, average price) pairs
    link = get_link(url)
    list_result = []  # holds one dict per community
    for j in link:
        try:
            # Parse the second-level page, passing in the detail-page URL and the average price
            result = parse_message(j[0], j[1])
            list_result.append(result)  # store the dict in the list
            print("Scraped {} records so far".format(k))
            k = k + 1  # count the communities scraped
            time.sleep(random.randint(1, 3))  # pause between second-level page requests
        except Exception as err:
            print("-----------------------------")
            print(err)
    # Save the data to the file
    save_csv(list_result)
    time.sleep(random.randint(1, 3))  # pause between first-level page requests
    print("************************ Page %s: crawl finished ************************" % i)

f.close()  # flush and close the CSV file once all pages are done

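6. The final scraped data

That wraps up the third crawler example. This post used XPath to scrape detail-page data for second-hand housing communities in Shijiazhuang from Anjuke. Compared with the previous two examples it is a step harder, in two respects: it involves second-level pages, whose URLs have to be collected from the first-level pages; and it collects many fields, so you have to keep checking whether each one is actually scraped correctly. The difficulty is higher, but if you read through to the end I believe you will get a lot out of it. When I was learning this as a beginner, my first reaction was surprise that crawlers could be used this way, and it is quite fun! As for future posts: while learning I also scraped Baidu Maps POI data, 大眾點評 (Dianping) and others, which may be what I write up next. If that interests you, feel free to follow along!

If anything here is not covered thoroughly enough, please leave a comment and I will keep improving it!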

Since you are already here, why not leave a comment before you go?

