A Python Crawler Project with requests: Buying a Second-Hand Home via Lianjia
Before I knew it, I had been working for more than six years, and finally had the chance to buy a home. While gathering housing information, I took some notes.
Beike's search is powerful, but it is hard to match on several dimensions at once (community, commute distance, schools, floor area), so using it directly is still costly.
To address this, I wrote this project to collect Lianjia's second-hand housing data. The project first filters communities by price, then narrows down to communities of interest based on school quality and distance from my workplace, and finally uses floor area, total price, and layout to produce a list of candidate homes, so that house-hunting effort is concentrated on a shortlist of communities and homes.
Of course, everyone can adjust the criteria to their own needs.
1. Environment overview
1.1 Basic environment
1.1.1 python
1.1.2 requests (loads pages)
1.1.3 BeautifulSoup (extracts information)
Common usage examples:
from bs4 import BeautifulSoup

soup = BeautifulSoup(a, "html.parser")  # `a` is an HTML string loaded elsewhere
soup.title.text  # 'Title'

# 1. Extracting tags
# 1.1 Extract a unique tag
soup.h1
soup.find('h1')
soup.find_all('h1')[0]
# 1.2 Extract multiple tags
soup.find_all('h2')
# [<h2>Heading 2</h2>, <h2>Heading 3</h2>]
soup.find_all(['h1', 'h2'])
# [<h1>Heading 1</h1>, <h2>Heading 2</h2>, <h2>Heading 3</h2>]
# 1.3 Use a regular expression
import re
soup.find_all(re.compile('^h'))
# [<h1>Heading 1</h1>, <h2>Heading 2</h2>, <h2>Heading 3</h2>]

# 2. Matching attributes
# 2.1 Pass the attribute name directly as a keyword argument;
#     this does not work for some names, e.g. attributes like a-b
soup.find_all('p', id='p1')      # the common case
soup.find_all('p', class_='p3')  # class is a reserved word, so append an underscore
# 2.2 The most general method: attrs
soup.find_all('p', attrs={'class': 'p3'})              # the tag must have this attribute, but may have others
soup.find_all('p', attrs={'class': 'p3', 'id': 'pp'})  # match on several attributes at once
soup.find_all('p', attrs={'class': 'p3', 'id': False}) # the tag must NOT have the given attribute
soup.find_all('p', attrs={'id': ['p1', 'p2']})         # attribute value is p1 or p2
soup.find_all('p', attrs={'class': True})              # any tag that has a class attribute
# 2.3 Match attribute values with a regular expression
import re
soup.find_all('p', attrs={'id': re.compile('^p')})

# 3. Matching on tag text
# 3.1 Match text with a regular expression
import re
soup.find_all('p', text=re.compile('paragraph'))
soup.find_all('p', text=True)
# 3.2 Match text with a function
def nothing(c):
    return c not in ['paragraph 1', 'paragraph 2', 'article']
soup.find_all('p', text=nothing)

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

# 4. Extracting content
# 4.1 Extract the text of a tag
soup.h.text     # works through several levels of nesting
soup.h.a.text   # this also works
soup.body.text  # with several children: '\nTitle\nparagraph 1\nparagraph 2\n'
# 4.2 Extract other attribute values, dictionary style; the two lines are equivalent
soup.h.a['href']
soup.h.a.get('href')

# 5. Extracting tag information
print(i.name)              # the tag name
print(i.attrs)             # all attributes of the tag
print(i.has_attr('href'))  # whether the tag has a given attribute

# 6. Examples
soup.find('p', attrs={'class': 'first'}).text  # 'text 1'
soup.find_all('p')  # [<p>text 1</p>, <p>text 2</p>], then extract the text from each
soup.find('ul', attrs={'class': 'list1'}).find_all('li')  # [<li>list 1 item 1</li>, <li>list 1 item 2</li>]
# Reference: https://zhuanlan.zhihu.com/p/35354532
1.1.4 Geolocation (Baidu API)
Method 1:
import logging
import requests

logger = logging.getLogger(__name__)

def geocodeB(address):
    base = "http://api.map.baidu.com/geocoder?address=%s&output=json&key=yourak&city=上海" % address
    response = requests.get(base)
    if response.status_code == 200:
        answer = response.json()
        if "location" in answer['result'] and "level" in answer['result']:
            return (address,
                    # round(answer['result']['location']['lng'], 5),
                    answer['result']['location']['lng'],
                    # round(answer['result']['location']['lat'], 5),
                    answer['result']['location']['lat'],
                    answer['result']["level"])
        else:
            logger.error("geocodeB %s warning:%s" % (address, answer))
            return None
    else:
        logger.error("geocodeB %s Error" % address)
        return None
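For example (assuming a valid application key has replaced the yourak placeholder in the URL above):

geocodeB("上海市世纪大道")
# -> (address, lng, lat, level) on success, None otherwise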
Method 2:
def geocodeB2(address):
    from urllib.request import urlopen
    from urllib.parse import quote, quote_plus
    import hashlib, json
    # Example GET request: http://api.map.baidu.com/geocoder/v2/?address=百度大厦&output=json&ak=yourak
    queryStr = '/geocoder/v2/?address=%s&city=上海&output=json&ak=$yourak$' % address
    # URL-encode queryStr; the reserved characters listed in `safe` are left untouched
    encodedStr = quote(queryStr, safe="/:=&?#+!$,;'@()*[]")
    # Append the SK (here the $yoursn$ placeholder) directly at the end
    rawStr = encodedStr + '$yoursn$'
    sn = hashlib.md5(quote_plus(rawStr).encode("utf8")).hexdigest()
    url = 'http://api.map.baidu.com%s&sn=%s' % (encodedStr, sn)
    req = urlopen(url)
    res = req.read().decode()  # decode the response bytes into a str
    answer = json.loads(res)   # parse the JSON payload
    if "location" in answer['result'] and "level" in answer['result']:
        return answer['result']['location']['lat'], answer['result']['location']['lng']
    else:
        logger.error("geocodeB2 %s warning:%s" % (address, answer))
        return None
Method 3:
def geocode_by_baidu(address):
    from geopy.geocoders import Baidu
    apikey = '$yourak$'  # apply for a key at http://lbsyun.baidu.com/apiconsole/key?application=key
    sn = '$yoursn$'
    g = Baidu(api_key=apikey, security_key=sn, timeout=200)
    a = g.geocode(address)
    # return (round(a.latitude, 6), round(a.longitude, 6))
    return a.latitude, a.longitude
1.1.5 Distance between two points (geopy)
# x and y are (lat, lng) tuples
def get_distance(x, y):
    from geopy.distance import geodesic
    return round(geodesic(x, y).km, 3)
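Usage is a one-liner; the coordinates below are illustrative (lat, lng) pairs, not data points from the project:

get_distance((31.2304, 121.4737), (31.2397, 121.4998))  # geodesic distance in km, rounded to 3 decimals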
1.1.6 Handling lazy loading and scroll loading (Selenium)
Selenium is a tool for testing web applications. Selenium tests run directly in a browser, just as a real user would operate it; supported browsers include IE (7, 8, 9, 10, 11), Firefox, Safari, Chrome, Opera, and more.
The crawler calls Selenium from Python to simulate a normal user visiting the site in a browser.
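A minimal sketch of driving scroll loading with Selenium, assuming Chrome and chromedriver are installed; the URL and the 2-second wait are illustrative choices, not values from the original project:

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://m.ke.com/sh/xiaoqu/pudong/")
# Scroll to the bottom repeatedly until the page height stops growing,
# so that lazily loaded list items get rendered into the DOM.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page time to load the next batch
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
html = driver.page_source  # hand the fully rendered page to BeautifulSoup
driver.quit()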
1.2 Main problems
1.2.1 Lazy loading: handled by driving a real browser with Selenium (see 1.1.6)
1.2.2 Scroll loading: handled the same way, by scrolling as sketched above
1.2.3 IP rate limiting: mitigated by writing progress to CSV so the crawler can resume across runs (see get_xiaoqu_list below), plus pacing the requests as sketched below
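For the IP limit, besides resumable CSV output, pacing the requests helps; a hedged sketch (the retry count and delays are arbitrary choices, not values from the original project):

import time
import requests

def polite_get(url, headers, retries=3, delay=2.0):
    """GET a page with a pause between attempts; returns None if every attempt fails."""
    for attempt in range(retries):
        try:
            r = requests.get(url, headers=headers, timeout=10)
            if r.status_code == 200:
                time.sleep(delay)  # slow down to reduce the chance of an IP ban
                return r.content
        except requests.RequestException:
            pass
        time.sleep(delay * (attempt + 1))  # back off a little more after each failure
    return None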
2. Preparation
2.1 Requirements analysis
My requirements as a buyer:
Budget: 4,000,000 RMB, at most 4,500,000
Schools: a second-tier school district
Layout: two bedrooms or more
Building age: built after 1990
Area: at least 60 m²
Commute: at most one hour by public transport from Century Avenue (世纪大道)
This translates into the following plan:
1. Use the education score from each community guide (小区攻略) to filter communities and fix the candidate set.
2. Filter further by (a) the prices of the homes in each community, against the budget, and (b) each community's location, dropping communities whose commute does not qualify.
3. For every community that passes, fetch its house list and decide which communities to track closely.
Notes:
1. Why not fetch homes directly? A home listing cannot tell you whether the school requirement is met, and going home to community to school would waste far more time, since there are orders of magnitude more homes than communities.
2. The budget and area requirements together bound the unit price (for example, 4,500,000 RMB / 60 m² = 75,000 RMB/m²), and filtering communities by unit price shrinks the candidate set considerably.
2.2 Analysing the page URLs
2.2.1 Fetching the community list
1. The community list URL
Lianjia only serves the first 100 pages of any listing, and Shanghai as a whole has far more than 100 pages of communities, so the communities are fetched district by district. A list URL looks like https://m.ke.com/sh/xiaoqu/pudong/bp5ep7.5pg10/, where:
bp5ep7.5 selects the 50,000-75,000 RMB/m² unit-price band (bp = begin price, ep = end price, in units of 10,000 RMB/m²);
pg is the page number.
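A small helper for composing these list URLs (the helper itself is hypothetical; the URL shape matches the one used in get_xiaoqu_list below):

def xiaoqu_list_url(area, begin_price, end_price, page):
    """Compose a community-list URL for one district, price band, and page.

    begin_price/end_price are unit prices in units of 10,000 RMB per m².
    """
    return "https://m.ke.com/sh/xiaoqu/%s/bp%sep%spg%s/" % (area, begin_price, end_price, page)

# xiaoqu_list_url("pudong", 5, 7.5, 10) -> 'https://m.ke.com/sh/xiaoqu/pudong/bp5ep7.5pg10/'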
2. Checking whether a community has reviews
In the community list fetched in step 1, the presence of a 小区攻略 (community guide) tag tells whether a community has review information.
Note: not every community exposes an education score.
Example link: https://m.ke.com/sh/xiaoqu/5011000016009/ returns the community's overall score.
2.2.2 Fetching a community's guide
The guide URL of a community is:
https://m.ke.com/sh/xiaoqu/5011000016009/gonglueV2.html?click_source=m_resblock_detail#review
A community has an overall score plus sub-scores for building quality, unit layout, transport, education quality, commercial environment, landscaping, and property management.
You can filter communities on whichever sub-scores matter to you.
For example, since education is my priority, I use the education score as the main filter: it must be at least 8, and every other score at least 6.5. A sketch of this rule as code follows.
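This rule can be written as the filter_func argument accepted by handle_hoselist_by_gonglue in section 3.3.2. A minimal sketch under my thresholds, assuming the score field names used in the guide CSV below (missing or malformed scores are treated as failing):

def score_filter(row):
    """Keep a community only if education >= 8 and every other sub-score >= 6.5."""
    others = ["jianzhu_score", "huxing_score", "jiaotong_score",
              "shangye_score", "jingguan_score", "wuye_score"]
    try:
        if float(row["jiaoyu_score"]) < 8:
            return False
        return all(row[k] != "" and float(row[k]) >= 6.5 for k in others)
    except ValueError:
        return False  # empty or malformed score: skip the community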
2.2.3 Fetching a community's house list
The house list of a community, with filters added:
https://m.ke.com/sh/ershoufang/bp350ep450l2l3ba67ea70c5011000016009
Here bp350ep450 selects a total price between 3.5 and 4.5 million RMB; l2l3 keeps 2-room and 3-room layouts; ba67ea70 keeps areas between 67 and 70 m²; and c5011000016009 selects the community with id 5011000016009.
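A small helper for composing such filtered URLs (the helper itself is hypothetical; the URL shape matches the pattern above and the code in section 3.3.1):

def ershoufang_filter_url(xiaoqu_id, bp, ep, rooms, min_area, max_area):
    """Compose a filtered second-hand house list URL for one community.

    bp/ep are total price bounds in units of 10,000 RMB; rooms is an iterable of room counts.
    """
    room_part = "".join("l%d" % r for r in rooms)
    return "https://m.ke.com/sh/ershoufang/bp%dep%d%sba%dea%dc%s" % (
        bp, ep, room_part, min_area, max_area, xiaoqu_id)

# ershoufang_filter_url("5011000016009", 350, 450, (2, 3), 67, 70)
# -> 'https://m.ke.com/sh/ershoufang/bp350ep450l2l3ba67ea70c5011000016009'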
3. Implementation
3.1 Fetching the community list
# Module-level imports assumed here and below: os, csv, re, requests, BeautifulSoup, and `logger` as in 1.1.4
def get_xiaoqu_list(self, area, save_path):
    page_size = 100
    # Only Shanghai is collected, so there is no multi-city handling
    fieldnames = ['area', 'page', 'xiaoqu_id', 'url', 'name', "brief", "loc", "build_type", "build_year", "price",
                  "have_gonglue"]
    # If the CSV does not exist yet, create it with only a header row.
    # If it already exists, collect the records that were already processed
    # (because of the IP limit, the crawler may need several runs).
    handled_list = []
    if os.path.isfile(save_path):
        with open(save_path, encoding='utf-8-sig') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                handled_list.append("%s_%s" % (row['area'], row['page']))
    else:
        with open(save_path, "a+", newline='\n', encoding='utf-8-sig') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
    handled_set = set(handled_list)
    logger.info("get_xiaoqu_list, have handled:%s " % (len(handled_set)))
    # Walk through the list pages of one Shanghai district
    with open(save_path, "a+", newline='\n', encoding='utf-8-sig') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        for page_num in range(1, page_size + 1):  # Lianjia serves at most 100 pages
            # e.g. https://m.ke.com/sh/xiaoqu/pudong/bp5ep7.5pg10/
            url = "https://m.ke.com/sh/xiaoqu/%s/bp5ep7.5pg%s/" % (area, str(page_num))
            if "%s_%s" % (area, page_num) in handled_set:
                logger.info("%s has been handled." % url)
                continue
            else:
                logger.info(url)
            # Load the page
            r = requests.get(url=url, headers=self.page_headers)
            html = r.content
            lj = BeautifulSoup(html, 'html.parser')
            page_items = lj.find_all('li', attrs={'class': 'pictext'})
            # Parse the community entries on this page
            if len(page_items) > 0:
                for item in page_items:
                    xiaoqu_url = item.a.get('href')
                    xiaoqu_id = xiaoqu_url.split("/")[-2]
                    xiaoqu_gonglue = item.find_all("p", attrs={"class": "gonglue_title"})
                    if len(xiaoqu_gonglue) == 0:
                        is_gonglue = 0
                    else:
                        is_gonglue = 1
                    xiaoqu_info = item.find_all("div", attrs={"class": "item_list"})[0]
                    xiaoqu_name = xiaoqu_info.find_all("div", attrs={"class": "item_main"})[0].string
                    xiaoqu_brief = xiaoqu_info.find_all("div", attrs={"class": "item_other"})[0].string.strip(
                        "\n\r \"")
                    xiaoqu_brief = " ".join(xiaoqu_brief.split())
                    xiaoqu_loc = xiaoqu_brief.split()[0]
                    build_type = xiaoqu_brief.split()[1]
                    build_year = re.search(r' (?P<build_year>\d{1,})年建成', xiaoqu_brief, re.I)
                    if build_year:
                        xiaoqu_build = build_year.group("build_year")
                    else:
                        xiaoqu_build = ""
                    xiaoqu_price = xiaoqu_info.find_all("span", attrs={"class": "price_total"})[0].em.string
                    xiaoqu_dict = {
                        "area": area,
                        "page": page_num,
                        "xiaoqu_id": xiaoqu_id,
                        "url": xiaoqu_url,
                        "name": xiaoqu_name,
                        "brief": xiaoqu_brief,
                        "loc": xiaoqu_loc,
                        "build_type": build_type,
                        "build_year": xiaoqu_build,
                        "price": xiaoqu_price,
                        "have_gonglue": is_gonglue
                    }
                    writer.writerow(xiaoqu_dict)
            else:
                # An empty page means we are past the last page
                break
            handled_set.update({"%s_%s" % (area, page_num)})
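To cover all of Shanghai, get_xiaoqu_list is called once per district. A hedged driver sketch; the class name LianjiaCrawler, the district slugs beyond pudong, and the file name are illustrative, not from the original project:

crawler = LianjiaCrawler()  # hypothetical class that holds page_headers and the methods above
for area in ["pudong", "minhang", "xuhui", "baoshan"]:  # district slugs, not an exhaustive list
    crawler.get_xiaoqu_list(area, "xiaoqu_list.csv")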
3.2 From the community list, fetch the communities that have a guide
3.2.1 Fetching one community's guide details
# Fetch the guide information of the community with the given id
def get_xiaoqu_gonglue_dict(self, id):
    url = "https://m.ke.com/sh/xiaoqu/%s/gonglueV2.html?click_source=m_resblock_detail#review" % id
    logger.info(url)
    # Load the guide page, e.g.
    # https://m.ke.com/sh/xiaoqu/5011000007603/gonglueV2.html?click_source=m_resblock_detail#review
    html = requests.get(url=url, headers=self.page_headers).content
    lj = BeautifulSoup(html, 'html.parser')
    loc_node = lj.find('div', attrs={'class': 'head_location'})
    if loc_node is not None:
        loc_name = loc_node.string
    else:
        loc_name = ""
    cpt_content = lj.find_all('div', attrs={'id': 'review'})[0]
    total_score = cpt_content.find('div', attrs={'class': "review_score"}).get_text().replace("综合测评得分", "")
    review_txt = ""
    if cpt_content.find('div', attrs={'class': "review_txt_box"}) is not None:
        review_txt = cpt_content.find('div', attrs={'class': "review_txt_box"}).get_text().strip(" \n\r")
    review_list_txt = cpt_content.find('ul', attrs={'class': "review_list"})
    review_list = review_list_txt.find_all('li')
    other = ""
    jianzhu_score = huxing_score = jiaotong_score = shangye_score = jiaoyu_score = jingguan_score = wuye_score = ""
    for item in review_list:
        key = item.span.string
        value = item.progress.get('value')
        if key == "建筑品质":    # building quality
            jianzhu_score = value
        elif key == "户型设计":  # unit layout
            huxing_score = value
        elif key == "交通条件":  # transport
            jiaotong_score = value
        elif key == "教育质量":  # education quality
            jiaoyu_score = value
        elif key == "商业环境":  # commercial environment
            shangye_score = value
        elif key == "花园景观":  # landscaping
            jingguan_score = value
        elif key == "物业管理":  # property management
            wuye_score = value
        else:
            other = " %s:%s " % (key, value)
    peitao_node = lj.find('div', attrs={"class": "box peitao card_box"})
    map_api_node = peitao_node.find('img') if peitao_node is not None else None
    if map_api_node is not None:
        map_api = map_api_node.get('src')
    else:
        map_api = ""

    def get_geo_from_mapapi(map_api):
        # The static-map image URL carries the coordinates in its center=lng,lat parameter
        geo = re.search(r'center=(?P<lng>[\d.]+),(?P<lat>[\d.]+)', map_api, re.I)
        if geo:
            lat = geo.group("lat")
            lng = geo.group("lng")
        else:
            lat = lng = None
        return lat, lng

    lat, lng = get_geo_from_mapapi(map_api)
    gonglue_dict = {
        "xiaoqu_id": id,
        "loc_name": loc_name,
        "total_score": total_score,
        "review_txt": review_txt if review_txt is not None else "",
        "jianzhu_score": jianzhu_score if jianzhu_score is not None else "",
        "huxing_score": huxing_score if huxing_score is not None else "",
        "jiaotong_score": jiaotong_score if jiaotong_score is not None else "",
        "jiaoyu_score": jiaoyu_score if jiaoyu_score is not None else "",
        "shangye_score": shangye_score if shangye_score is not None else "",
        "jingguan_score": jingguan_score if jingguan_score is not None else "",
        "wuye_score": wuye_score if wuye_score is not None else "",
        "map_api": map_api,
        "lng": lng if lng is not None else "",
        "lat": lat if lat is not None else "",
        "other": other
    }
    return gonglue_dict
3.2.2 Building the guide list for all communities
# Walk the community list from step 1 and build the guide list entry by entry
def handle_gonglue_by_xiaoqu(self, file_path, save_path, if_distance=False, local_geo=None):
    # Validate the arguments
    if if_distance and local_geo is None:
        logger.error("in handle_gonglue_by_xiaoqu, if if_distance is True, local_geo can't be None")
        exit(1)
    # Build the list of communities that have a guide
    url_list = []
    with open(file_path, encoding='utf-8-sig') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            if row['have_gonglue'] == "1":
                url_list.append(row['xiaoqu_id'])
    # If the guide CSV already exists, collect the records that were already processed
    handled_list = []
    fieldnames = ['xiaoqu_id', 'loc_name', 'total_score', "review_txt", "jianzhu_score", "huxing_score",
                  "jiaotong_score", "jiaoyu_score", "shangye_score", "jingguan_score", "wuye_score",
                  "map_api", "lat", "lng", "distance", "other"]
    if os.path.isfile(save_path):
        with open(save_path, encoding='utf-8-sig') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                handled_list.append(row['xiaoqu_id'])
    else:
        # Otherwise create an empty CSV containing only the header
        with open(save_path, "a+", newline='\n', encoding='utf-8-sig') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
    handled_set = set(handled_list)
    logger.info("handle_gonglue_by_xiaoqu, the length of url_list: %s" % len(url_list))
    # Fetch the guide of every community in the list
    with open(save_path, "a+", newline='\n', encoding='utf-8-sig') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        for xiaoqu_id in url_list:
            if xiaoqu_id not in handled_set:
                gonglue_dict = self.get_xiaoqu_gonglue_dict(id=xiaoqu_id)
                if if_distance:
                    distance = get_distance((gonglue_dict["lat"], gonglue_dict["lng"]), local_geo)
                    gonglue_dict["distance"] = distance
                writer.writerow(gonglue_dict)
                handled_set.update({xiaoqu_id})
            else:
                logger.info("xiaoqu %s is handled" % xiaoqu_id)
3.3 From the guide list, build the list of homes to watch
3.3.1 Fetching the house list of one community
# Given a community id, fetch the homes in it that satisfy the filters
def get_houselist_by_xiaoqu(self, xiaoqu_id):
    # e.g. https://m.ke.com/sh/ershoufang/bp350ep450l2l3ba67ea70c5011000009590
    # bp350ep450      total price band (in units of 10,000 RMB)
    # l2l3            2-room or 3-room layouts
    # ba67ea70        area band 67-70 m²
    # c5011000009590  community id
    url = "https://m.ke.com/sh/ershoufang/bp350ep450l2l3ba60ea90c%s" % xiaoqu_id
    html = requests.get(url=url, headers=self.page_headers).content
    house_list = []
    lj = BeautifulSoup(html, 'html.parser')
    # The page holds several lists: the current search plus recommendations for other communities
    view_body = lj.find('div', attrs={'class': 'list-view-section-body'})
    item_list = view_body.find_all('div', attrs={'class': 'lj-track', 'data-click-event': 'SearchClick'})
    for item in item_list:
        house_body = item.find("div", attrs={'class': 'kem__house-tile-ershou'})
        house_id = house_body.get("data-id")
        logger.info("handle house_id:%s" % house_id)
        house_txt = house_body.find("div", attrs={'class': 'house-text'})
        house_title = house_txt.find("div", attrs={"class": 'house-title'}).text
        house_desc = house_txt.find("div", attrs={"class": 'house-desc'}).string
        house_price_total = house_txt.find("span", attrs={"class": "price-total"}).strong.string
        house_price_unit = house_txt.find("span", attrs={"class": "price-unit"}).string.strip("元/平")
        house_dict = {
            "xiaoqu_id": xiaoqu_id,
            "house_id": house_id,
            "title": house_title,
            "desc": house_desc,
            "price_total": house_price_total,
            "price_unit": house_price_unit
        }
        house_list.append(house_dict)
    return house_list
3.3.2 Building the house list from the guide list
# From the guide list, pick the communities of interest, then fetch each one's house list
def handle_hoselist_by_gonglue(self, file_path, save_path, filter_func=None):
    xiaoqu_list = []
    with open(file_path, encoding='utf-8-sig') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            if filter_func is not None:
                if filter_func(row):
                    # Add the community's id to the work list
                    xiaoqu_list.append((row["xiaoqu_id"], row["loc_name"], row["distance"]))
            else:
                xiaoqu_list.append((row["xiaoqu_id"], row["loc_name"], row["distance"]))
    handled_list = []
    fieldnames = ['xiaoqu_id', 'xiaoqu_name', 'distance', 'house_id', 'title', "desc", "price_total", "price_unit"]
    if os.path.isfile(save_path):
        with open(save_path, encoding='utf-8-sig') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                handled_list.append(row['xiaoqu_id'])
    else:
        # Otherwise create an empty CSV containing only the header
        with open(save_path, "a+", newline='\n', encoding='utf-8-sig') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
    handled_set = set(handled_list)
    logger.info(
        "handle_hoselist_by_xiaoqu, to be handled: %s, have handled:%s " % (len(xiaoqu_list), len(handled_set)))
    with open(save_path, "a+", newline='\n', encoding='utf-8-sig') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        for xiaoqu_id, xiaoqu_loc_name, distance in xiaoqu_list:
            if xiaoqu_id not in handled_set:
                logger.info("handle xiaoqu:%s" % xiaoqu_id)
                house_list = self.get_houselist_by_xiaoqu(xiaoqu_id)
                if len(house_list) > 0:
                    for house_dict in house_list:
                        house_dict["xiaoqu_name"] = xiaoqu_loc_name
                        house_dict["distance"] = distance
                        writer.writerow(house_dict)
                else:
                    house_dict = {
                        "xiaoqu_id": xiaoqu_id,
                        "xiaoqu_name": xiaoqu_loc_name,
                        "distance": distance
                    }
                    writer.writerow(house_dict)
                    logger.info("xiaoqu:%s %s have no matching house." % (xiaoqu_id, xiaoqu_loc_name))
                handled_set.update({xiaoqu_id})
            else:
                logger.info("%s is handled" % xiaoqu_id)