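New homes, with better build quality and amenities than resale homes, are the safe harbor every working person dreams of. Buyers usually learn about developments through a property consultant or by visiting sites in person, but that process is time-consuming and laborious. A better approach is to collect the latest information on every new development in your city programmatically, do a first-pass screening, and only then visit the shortlisted sites.
This article walks you step by step through using Python to scrape new-home data for any city from Fang.com (房天下), so you can spot changes in the property market sooner and pick the best home to live in or invest in.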
01
Page analysis
We take Shanghai as the example. The URL of the first listing page is:
https://sh.newhouse.fang.com/house/s/b91/
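This is the list of new developments in Shanghai, 749 listings in total at 20 per page. Clicking through to the next page changes the URL to: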
https://sh.newhouse.fang.com/house/s/b92/
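Clearly this is a simple static page, and the URL is assembled from a city parameter (here sh) and a page parameter (here 2, appended after the fixed b9 prefix). Clicking into a development (for example 建邦國宸府) to view its details, the URL becomes: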
https://sh.newhouse.fang.com/loupan/1210130400/housedetail.htm
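This detail page is what we actually want to scrape. The detail URL is assembled from the city parameter (here sh) and a listing ID (here 1210130400), and the listing IDs can most likely be found in the source of the listing pages.
So the crawler plan is straightforward: iterate over the listing pages to collect every listing ID, build each detail URL, and then fetch the details of each listing.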
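The URL patterns and the ID lookup described above can be sketched as two small helpers plus one regular expression. The function names and the sample HTML fragment are illustrative assumptions; only the URL shapes come from the examples in this section.

```python
import re


def list_url(city: str, page: int) -> str:
    """Build the URL of one page of a city's new-home listing."""
    return f"https://{city}.newhouse.fang.com/house/s/b9{page}/"


def detail_url(city: str, listing_id: str) -> str:
    """Build the detail-page URL for a single listing ID."""
    return f"https://{city}.newhouse.fang.com/loupan/{listing_id}/housedetail.htm"


# Hypothetical fragment of listing-page source, for illustration only.
sample_html = (
    '<a href="/loupan/1210130400.htm">A</a>'
    '<a href="/loupan/1210130400/photo/">B</a>'
)

# Any run of digits that directly follows "loupan/" is treated as a listing ID.
ids = re.findall(r"(?<=loupan/)\d+", sample_html)

print(list_url("sh", 2))
print(detail_url("sh", ids[0]))
```

Note that the same ID can match more than once in a page, as it does in this fragment, so a real crawler may want to deduplicate.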
02
Building the crawler
Open PyCharm, create a new .py file, and import the packages the crawler needs:
import requests  # send the HTTP requests
from pyquery import PyQuery as pq  # pyquery and re are used to parse the pages
import time
import re
import random
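To make the crawler less likely to be blocked, besides the basic delay between requests, it also carries a pool of User-Agent headers and proxy IPs (these were found online; you can also buy them), from which the program picks at random for each request.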
# Module-level names are already global, so no `global` statement is needed.
user_agents = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
]
proxy_list = ["218.91.13.2:46332", "121.31.176.85:8123", "218.71.161.56:80", "49.85.1.230:28643", "115.221.121.165:41674", "123.55.177.237:808"]
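As a minimal sketch of how one entry might be drawn from each pool per request, the helper below returns keyword arguments suitable for requests.get. The helper name build_request_kwargs, the placeholder pool values, and the 10-second timeout are assumptions made for illustration. The one fact it relies on is that requests looks proxies up by URL scheme, so the dictionary keys are "http" and "https".

```python
import random


def build_request_kwargs(user_agents, proxy_list, rng=random):
    """Pick a random proxy and User-Agent and return kwargs for requests.get."""
    address = rng.choice(proxy_list)
    return {
        "headers": {"User-Agent": rng.choice(user_agents)},
        # requests selects a proxy by the scheme of the target URL.
        "proxies": {"http": f"http://{address}", "https": f"http://{address}"},
        # Assumed value: gives up on a dead proxy instead of hanging.
        "timeout": 10,
    }


# Placeholder pools, not real User-Agents or proxies.
kwargs = build_request_kwargs(["UA-1", "UA-2"], ["127.0.0.1:8080"])
print(kwargs["proxies"])
```

A call would then look like requests.get(url, **kwargs). Free proxies tend to stop working quickly, so the addresses in the pool need to be checked before a long run.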
Define a get_id function that walks the listing pages, collects every listing ID, and stores them in the list idlist:
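def get_id(city):
    # Read the total listing count from the first listing page
    url = 'https://' + city + '.newhouse.fang.com/house/s/b91'
    header = {'User-Agent': random.choice(user_agents)}
    # requests selects a proxy by URL scheme, so the keys must be 'http' and
    # 'https'; a key such as 'Proxies' is never matched and no proxy is used.
    # Free proxies expire quickly, so replace proxy_list with working ones.
    address = random.choice(proxy_list)
    proxy = {'http': 'http://' + address, 'https': 'http://' + address}
    r = requests.get(url, headers=header, proxies=proxy)
    time.sleep(2)
    r.encoding = 'GBK'
    # The literal must match the page text exactly; if the page is served in
    # simplified Chinese it would read 现有新楼盘 instead.
    pattern1 = re.compile(r'(?<=現有新樓盤)\d+')
    # 20 listings per page, rounded up to whole pages
    total = (int(re.findall(pattern1, r.text)[0]) + 19) // 20
    idlist = []
    for i in range(1, total + 1):
        url = 'https://' + city + '.newhouse.fang.com/house/s/b9' + str(i)
        header = {'User-Agent': random.choice(user_agents)}
        address = random.choice(proxy_list)
        proxy = {'http': 'http://' + address, 'https': 'http://' + address}
        r = requests.get(url, headers=header, proxies=proxy)
        time.sleep(2)
        r.encoding = 'gb2312'
        # Every run of digits after "loupan/" is a listing ID
        pattern = re.compile(r'(?<=loupan/)\d+')
        ids = re.findall(pattern, r.text)
        for j in ids:
            idlist.append(j)
    # print(idlist)
    return idlist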
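Two details of this step can be checked in isolation. A listing count converts to a page count by ceiling division, whereas a plain count // 20 + 1 would request one page too many whenever the count is an exact multiple of 20. In addition, the same ID may match several times in a page's source (an assumption, since a listing page usually links to each development more than once), so deduplicating while keeping order avoids fetching a detail page twice. Both helper names are hypothetical.

```python
def page_count(total_listings: int, per_page: int = 20) -> int:
    """Ceiling division: number of pages needed to show every listing."""
    return (total_listings + per_page - 1) // per_page


def unique_in_order(ids):
    """Drop repeated IDs while keeping the order in which they first appear."""
    return list(dict.fromkeys(ids))


print(page_count(749))
print(unique_in_order(["1210130400", "1210130400", "1210130401"]))
```

For the 749 Shanghai listings mentioned earlier this gives 38 pages, and the sample list shrinks to its two distinct IDs.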
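Define a get_data function that inserts a listing ID into the detail-page URL and fetches and prints that listing's details; calling it once per ID covers every listing: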
def get_data(city, house_id):
    url = 'https://' + city + '.newhouse.fang.com/loupan/' + house_id + '/housedetail.htm'
    header = {'User-Agent': random.choice(user_agents)}
    # Proxy keys must be URL schemes; a 'Proxies' key would be ignored by requests
    address = random.choice(proxy_list)
    proxy = {'http': 'http://' + address, 'https': 'http://' + address}
    r = requests.get(url, headers=header, proxies=proxy)
    time.sleep(1)
    r.encoding = 'utf8'
    doc = pq(r.text)
    # print(doc)
    # Print the text of every element with class ts_linear, then class list
    data1 = doc('.ts_linear').items()
    for i in data1:
        print(i.text())
    data1 = doc('.list').items()
    for i in data1:
        print(i.text())
Finally, call the two functions:
ids = get_id('sh')
for house_id in ids:
    get_data('sh', house_id)
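With several hundred detail pages and free proxies, it seems reasonable to expect that some requests will fail, and an unhandled exception would end the whole run. The sketch below shows one way to keep going and record the failures. The names crawl_details and fake_fetch are hypothetical, and fake_fetch only simulates a fetch so the example runs without network access.

```python
import time


def crawl_details(ids, fetch, delay=1.0):
    """Call fetch(listing_id) for each ID; collect failures instead of stopping."""
    results, failed = {}, []
    for listing_id in ids:
        try:
            results[listing_id] = fetch(listing_id)
        except Exception as exc:  # broad on purpose: any failure skips this ID
            failed.append((listing_id, repr(exc)))
        time.sleep(delay)  # pause between requests, as in the crawler above
    return results, failed


def fake_fetch(listing_id):
    """Stand-in for a real fetch; fails for one made-up ID."""
    if listing_id == "bad":
        raise ValueError("simulated network error")
    return f"details for {listing_id}"


results, failed = crawl_details(["1210130400", "bad"], fake_fetch, delay=0)
print(results)
print(failed)
```

To use it with the real crawler, fetch could be something like lambda house_id: get_data('sh', house_id), and the IDs left in failed can be retried later.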
03
Results