A Python Crawler Project with requests: Buying a Second-Hand Home via Lianjia
Before I knew it, I had been working for more than six years, and finally had the chance to buy a home. While gathering housing information, I took some notes.
Beike's search is powerful, but it is hard to match on several dimensions at once (community, commute distance, schools, floor area), so using it directly is still costly.
To address this, I wrote this project to collect Lianjia's second-hand housing data. The project first filters communities by price, then narrows down to communities of interest based on school quality and distance from my workplace, and finally uses floor area, total price, and layout to produce a list of candidate homes, so that house-hunting effort is concentrated on a shortlist of communities and homes.
Of course, everyone can adjust the criteria to their own needs.
1. Environment overview
1.1 Basic environment
1.1.1 python
1.1.2 requests (loads pages)
1.1.3 BeautifulSoup (extracts information)
Common usage examples:
from bs4 import BeautifulSoup

soup = BeautifulSoup(a, "html.parser")  # `a` is an HTML string loaded elsewhere
soup.title.text  # 'Title'

# 1. Extracting tags
# 1.1 Extract a unique tag
soup.h1
soup.find('h1')
soup.find_all('h1')[0]
# 1.2 Extract multiple tags
soup.find_all('h2')
# [<h2>Heading 2</h2>, <h2>Heading 3</h2>]
soup.find_all(['h1', 'h2'])
# [<h1>Heading 1</h1>, <h2>Heading 2</h2>, <h2>Heading 3</h2>]
# 1.3 Use a regular expression
import re
soup.find_all(re.compile('^h'))
# [<h1>Heading 1</h1>, <h2>Heading 2</h2>, <h2>Heading 3</h2>]

# 2. Matching attributes
# 2.1 Pass the attribute name directly as a keyword argument;
#     this does not work for some names, e.g. attributes like a-b
soup.find_all('p', id='p1')      # the common case
soup.find_all('p', class_='p3')  # class is a reserved word, so append an underscore
# 2.2 The most general method: attrs
soup.find_all('p', attrs={'class': 'p3'})              # the tag must have this attribute, but may have others
soup.find_all('p', attrs={'class': 'p3', 'id': 'pp'})  # match on several attributes at once
soup.find_all('p', attrs={'class': 'p3', 'id': False}) # the tag must NOT have the given attribute
soup.find_all('p', attrs={'id': ['p1', 'p2']})         # attribute value is p1 or p2
soup.find_all('p', attrs={'class': True})              # any tag that has a class attribute
# 2.3 Match attribute values with a regular expression
import re
soup.find_all('p', attrs={'id': re.compile('^p')})

# 3. Matching on tag text
# 3.1 Match text with a regular expression
import re
soup.find_all('p', text=re.compile('paragraph'))
soup.find_all('p', text=True)
# 3.2 Match text with a function
def nothing(c):
    return c not in ['paragraph 1', 'paragraph 2', 'article']
soup.find_all('p', text=nothing)

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

# 4. Extracting content
# 4.1 Extract the text of a tag
soup.h.text     # works through several levels of nesting
soup.h.a.text   # this also works
soup.body.text  # with several children: '\nTitle\nparagraph 1\nparagraph 2\n'
# 4.2 Extract other attribute values, dictionary style; the two lines are equivalent
soup.h.a['href']
soup.h.a.get('href')

# 5. Extracting tag information
print(i.name)              # the tag name
print(i.attrs)             # all attributes of the tag
print(i.has_attr('href'))  # whether the tag has a given attribute

# 6. Examples
soup.find('p', attrs={'class': 'first'}).text  # 'text 1'
soup.find_all('p')  # [<p>text 1</p>, <p>text 2</p>], then extract the text from each
soup.find('ul', attrs={'class': 'list1'}).find_all('li')  # [<li>list 1 item 1</li>, <li>list 1 item 2</li>]
# Reference: https://zhuanlan.zhihu.com/p/35354532
1.1.4 Geolocation (Baidu API)
Method 1:
import logging
import requests

logger = logging.getLogger(__name__)

def geocodeB(address):
    base = "http://api.map.baidu.com/geocoder?address=%s&output=json&key=yourak&city=上海" % address
    response = requests.get(base)
    if response.status_code == 200:
        answer = response.json()
        if "location" in answer['result'] and "level" in answer['result']:
            return (address,
                    # round(answer['result']['location']['lng'], 5),
                    answer['result']['location']['lng'],
                    # round(answer['result']['location']['lat'], 5),
                    answer['result']['location']['lat'],
                    answer['result']["level"])
        else:
            logger.error("geocodeB %s warning:%s" % (address, answer))
            return None
    else:
        logger.error("geocodeB %s Error" % address)
        return None
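For example (assuming a valid application key has replaced the yourak placeholder in the URL above):

geocodeB("上海市世纪大道")
# -> (address, lng, lat, level) on success, None otherwise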
Method 2:
def geocodeB2(address):
    from urllib.request import urlopen
    from urllib.parse import quote, quote_plus
    import hashlib, json
    # Example GET request: http://api.map.baidu.com/geocoder/v2/?address=百度大厦&output=json&ak=yourak
    queryStr = '/geocoder/v2/?address=%s&city=上海&output=json&ak=$yourak$' % address
    # URL-encode queryStr; the reserved characters listed in `safe` are left untouched
    encodedStr = quote(queryStr, safe="/:=&?#+!$,;'@()*[]")
    # Append the SK (here the $yoursn$ placeholder) directly at the end
    rawStr = encodedStr + '$yoursn$'
    sn = hashlib.md5(quote_plus(rawStr).encode("utf8")).hexdigest()
    url = 'http://api.map.baidu.com%s&sn=%s' % (encodedStr, sn)
    req = urlopen(url)
    res = req.read().decode()  # decode the response bytes into a str
    answer = json.loads(res)   # parse the JSON payload
    if "location" in answer['result'] and "level" in answer['result']:
        return answer['result']['location']['lat'], answer['result']['location']['lng']
    else:
        logger.error("geocodeB2 %s warning:%s" % (address, answer))
        return None
Method 3:
def geocode_by_baidu(address):
    from geopy.geocoders import Baidu
    apikey = '$yourak$'  # apply for a key at http://lbsyun.baidu.com/apiconsole/key?application=key
    sn = '$yoursn$'
    g = Baidu(api_key=apikey, security_key=sn, timeout=200)
    a = g.geocode(address)
    # return (round(a.latitude, 6), round(a.longitude, 6))
    return a.latitude, a.longitude
1.1.5 Distance between two points (geopy)
# x and y are (lat, lng) tuples
def get_distance(x, y):
    from geopy.distance import geodesic
    return round(geodesic(x, y).km, 3)
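Usage is a one-liner; the coordinates below are illustrative (lat, lng) pairs, not data points from the project:

get_distance((31.2304, 121.4737), (31.2397, 121.4998))  # geodesic distance in km, rounded to 3 decimals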
1.1.6 Handling lazy loading and scroll loading (Selenium)
Selenium is a tool for testing web applications. Selenium tests run directly in a browser, just as a real user would operate it; supported browsers include IE (7, 8, 9, 10, 11), Firefox, Safari, Chrome, Opera, and more.
The crawler calls Selenium from Python to simulate a normal user visiting the site in a browser.
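A minimal sketch of driving scroll loading with Selenium, assuming Chrome and chromedriver are installed; the URL and the 2-second wait are illustrative choices, not values from the original project:

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://m.ke.com/sh/xiaoqu/pudong/")
# Scroll to the bottom repeatedly until the page height stops growing,
# so that lazily loaded list items get rendered into the DOM.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page time to load the next batch
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
html = driver.page_source  # hand the fully rendered page to BeautifulSoup
driver.quit()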
1.2 Main problems
1.2.1 Lazy loading: handled by driving a real browser with Selenium (see 1.1.6)
1.2.2 Scroll loading: handled the same way, by scrolling as sketched above
1.2.3 IP rate limiting: mitigated by writing progress to CSV so the crawler can resume across runs (see get_xiaoqu_list below), plus pacing the requests as sketched below
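For the IP limit, besides resumable CSV output, pacing the requests helps; a hedged sketch (the retry count and delays are arbitrary choices, not values from the original project):

import time
import requests

def polite_get(url, headers, retries=3, delay=2.0):
    """GET a page with a pause between attempts; returns None if every attempt fails."""
    for attempt in range(retries):
        try:
            r = requests.get(url, headers=headers, timeout=10)
            if r.status_code == 200:
                time.sleep(delay)  # slow down to reduce the chance of an IP ban
                return r.content
        except requests.RequestException:
            pass
        time.sleep(delay * (attempt + 1))  # back off a little more after each failure
    return None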
2. Preparation
2.1 Requirements analysis
My requirements as a buyer:
Budget: 4,000,000 RMB, at most 4,500,000
Schools: a second-tier school district
Layout: two bedrooms or more
Building age: built after 1990
Area: at least 60 m²
Commute: at most one hour by public transport from Century Avenue (世纪大道)
This translates into the following plan:
1. Use the education score from each community guide (小区攻略) to filter communities and fix the candidate set.
2. Filter further by (a) the prices of the homes in each community, against the budget, and (b) each community's location, dropping communities whose commute does not qualify.
3. For every community that passes, fetch its house list and decide which communities to track closely.
Notes:
1. Why not fetch homes directly? A home listing cannot tell you whether the school requirement is met, and going home to community to school would waste far more time, since there are orders of magnitude more homes than communities.
2. The budget and area requirements together bound the unit price (for example, 4,500,000 RMB / 60 m² = 75,000 RMB/m²), and filtering communities by unit price shrinks the candidate set considerably.
2.2 Analysing the page URLs
2.2.1 Fetching the community list
1. The community list URL
Lianjia only serves the first 100 pages of any listing, and Shanghai as a whole has far more than 100 pages of communities, so the communities are fetched district by district. A list URL looks like https://m.ke.com/sh/xiaoqu/pudong/bp5ep7.5pg10/, where:
bp5ep7.5 selects the 50,000-75,000 RMB/m² unit-price band (bp = begin price, ep = end price, in units of 10,000 RMB/m²);
pg is the page number.
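A small helper for composing these list URLs (the helper itself is hypothetical; the URL shape matches the one used in get_xiaoqu_list below):

def xiaoqu_list_url(area, begin_price, end_price, page):
    """Compose a community-list URL for one district, price band, and page.

    begin_price/end_price are unit prices in units of 10,000 RMB per m².
    """
    return "https://m.ke.com/sh/xiaoqu/%s/bp%sep%spg%s/" % (area, begin_price, end_price, page)

# xiaoqu_list_url("pudong", 5, 7.5, 10) -> 'https://m.ke.com/sh/xiaoqu/pudong/bp5ep7.5pg10/'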
2. Checking whether a community has reviews
In the community list fetched in step 1, the presence of a 小区攻略 (community guide) tag tells whether a community has review information.
Note: not every community exposes an education score.
Example link: https://m.ke.com/sh/xiaoqu/5011000016009/ returns the community's overall score.
2.2.2 Fetching a community's guide
The guide URL of a community is:
https://m.ke.com/sh/xiaoqu/5011000016009/gonglueV2.html?click_source=m_resblock_detail#review
A community has an overall score plus sub-scores for building quality, unit layout, transport, education quality, commercial environment, landscaping, and property management.
You can filter communities on whichever sub-scores matter to you.
For example, since education is my priority, I use the education score as the main filter: it must be at least 8, and every other score at least 6.5. A sketch of this rule as code follows.
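This rule can be written as the filter_func argument accepted by handle_hoselist_by_gonglue in section 3.3.2. A minimal sketch under my thresholds, assuming the score field names used in the guide CSV below (missing or malformed scores are treated as failing):

def score_filter(row):
    """Keep a community only if education >= 8 and every other sub-score >= 6.5."""
    others = ["jianzhu_score", "huxing_score", "jiaotong_score",
              "shangye_score", "jingguan_score", "wuye_score"]
    try:
        if float(row["jiaoyu_score"]) < 8:
            return False
        return all(row[k] != "" and float(row[k]) >= 6.5 for k in others)
    except ValueError:
        return False  # empty or malformed score: skip the community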
2.2.3 Fetching a community's house list
The house list of a community, with filters added:
https://m.ke.com/sh/ershoufang/bp350ep450l2l3ba67ea70c5011000016009
Here bp350ep450 selects a total price between 3.5 and 4.5 million RMB; l2l3 keeps 2-room and 3-room layouts; ba67ea70 keeps areas between 67 and 70 m²; and c5011000016009 selects the community with id 5011000016009.
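A small helper for composing such filtered URLs (the helper itself is hypothetical; the URL shape matches the pattern above and the code in section 3.3.1):

def ershoufang_filter_url(xiaoqu_id, bp, ep, rooms, min_area, max_area):
    """Compose a filtered second-hand house list URL for one community.

    bp/ep are total price bounds in units of 10,000 RMB; rooms is an iterable of room counts.
    """
    room_part = "".join("l%d" % r for r in rooms)
    return "https://m.ke.com/sh/ershoufang/bp%dep%d%sba%dea%dc%s" % (
        bp, ep, room_part, min_area, max_area, xiaoqu_id)

# ershoufang_filter_url("5011000016009", 350, 450, (2, 3), 67, 70)
# -> 'https://m.ke.com/sh/ershoufang/bp350ep450l2l3ba67ea70c5011000016009'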
3. Implementation
3.1 Fetching the community list
# Module-level imports assumed here and below: os, csv, re, requests, BeautifulSoup, and `logger` as in 1.1.4
def get_xiaoqu_list(self, area, save_path):
    page_size = 100
    # Only Shanghai is collected, so there is no multi-city handling
    fieldnames = ['area', 'page', 'xiaoqu_id', 'url', 'name', "brief", "loc", "build_type", "build_year", "price",
                  "have_gonglue"]
    # If the CSV does not exist yet, create it with only a header row.
    # If it already exists, collect the records that were already processed
    # (because of the IP limit, the crawler may need several runs).
    handled_list = []
    if os.path.isfile(save_path):
        with open(save_path, encoding='utf-8-sig') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                handled_list.append("%s_%s" % (row['area'], row['page']))
    else:
        with open(save_path, "a+", newline='\n', encoding='utf-8-sig') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
    handled_set = set(handled_list)
    logger.info("get_xiaoqu_list, have handled:%s " % (len(handled_set)))
    # Walk through the list pages of one Shanghai district
    with open(save_path, "a+", newline='\n', encoding='utf-8-sig') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        for page_num in range(1, page_size + 1):  # Lianjia serves at most 100 pages
            # e.g. https://m.ke.com/sh/xiaoqu/pudong/bp5ep7.5pg10/
            url = "https://m.ke.com/sh/xiaoqu/%s/bp5ep7.5pg%s/" % (area, str(page_num))
            if "%s_%s" % (area, page_num) in handled_set:
                logger.info("%s has been handled." % url)
                continue
            else:
                logger.info(url)
            # Load the page
            r = requests.get(url=url, headers=self.page_headers)
            html = r.content
            lj = BeautifulSoup(html, 'html.parser')
            page_items = lj.find_all('li', attrs={'class': 'pictext'})
            # Parse the community entries on this page
            if len(page_items) > 0:
                for item in page_items:
                    xiaoqu_url = item.a.get('href')
                    xiaoqu_id = xiaoqu_url.split("/")[-2]
                    xiaoqu_gonglue = item.find_all("p", attrs={"class": "gonglue_title"})
                    if len(xiaoqu_gonglue) == 0:
                        is_gonglue = 0
                    else:
                        is_gonglue = 1
                    xiaoqu_info = item.find_all("div", attrs={"class": "item_list"})[0]
                    xiaoqu_name = xiaoqu_info.find_all("div", attrs={"class": "item_main"})[0].string
                    xiaoqu_brief = xiaoqu_info.find_all("div", attrs={"class": "item_other"})[0].string.strip(
                        "\n\r \"")
                    xiaoqu_brief = " ".join(xiaoqu_brief.split())
                    xiaoqu_loc = xiaoqu_brief.split()[0]
                    build_type = xiaoqu_brief.split()[1]
                    build_year = re.search(r' (?P<build_year>\d{1,})年建成', xiaoqu_brief, re.I)
                    if build_year:
                        xiaoqu_build = build_year.group("build_year")
                    else:
                        xiaoqu_build = ""
                    xiaoqu_price = xiaoqu_info.find_all("span", attrs={"class": "price_total"})[0].em.string
                    xiaoqu_dict = {
                        "area": area,
                        "page": page_num,
                        "xiaoqu_id": xiaoqu_id,
                        "url": xiaoqu_url,
                        "name": xiaoqu_name,
                        "brief": xiaoqu_brief,
                        "loc": xiaoqu_loc,
                        "build_type": build_type,
                        "build_year": xiaoqu_build,
                        "price": xiaoqu_price,
                        "have_gonglue": is_gonglue
                    }
                    writer.writerow(xiaoqu_dict)
            else:
                # An empty page means we are past the last page
                break
            handled_set.update({"%s_%s" % (area, page_num)})
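To cover all of Shanghai, get_xiaoqu_list is called once per district. A hedged driver sketch; the class name LianjiaCrawler, the district slugs beyond pudong, and the file name are illustrative, not from the original project:

crawler = LianjiaCrawler()  # hypothetical class that holds page_headers and the methods above
for area in ["pudong", "minhang", "xuhui", "baoshan"]:  # district slugs, not an exhaustive list
    crawler.get_xiaoqu_list(area, "xiaoqu_list.csv")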
3.2 From the community list, fetch the communities that have a guide
3.2.1 Fetching one community's guide details
# Fetch the guide information of the community with the given id
def get_xiaoqu_gonglue_dict(self, id):
    url = "https://m.ke.com/sh/xiaoqu/%s/gonglueV2.html?click_source=m_resblock_detail#review" % id
    logger.info(url)
    # Load the guide page, e.g.
    # https://m.ke.com/sh/xiaoqu/5011000007603/gonglueV2.html?click_source=m_resblock_detail#review
    html = requests.get(url=url, headers=self.page_headers).content
    lj = BeautifulSoup(html, 'html.parser')
    loc_node = lj.find('div', attrs={'class': 'head_location'})
    if loc_node is not None:
        loc_name = loc_node.string
    else:
        loc_name = ""
    cpt_content = lj.find_all('div', attrs={'id': 'review'})[0]
    total_score = cpt_content.find('div', attrs={'class': "review_score"}).get_text().replace("综合测评得分", "")
    review_txt = ""
    if cpt_content.find('div', attrs={'class': "review_txt_box"}) is not None:
        review_txt = cpt_content.find('div', attrs={'class': "review_txt_box"}).get_text().strip(" \n\r")
    review_list_txt = cpt_content.find('ul', attrs={'class': "review_list"})
    review_list = review_list_txt.find_all('li')
    other = ""
    jianzhu_score = huxing_score = jiaotong_score = shangye_score = jiaoyu_score = jingguan_score = wuye_score = ""
    for item in review_list:
        key = item.span.string
        value = item.progress.get('value')
        if key == "建筑品质":    # building quality
            jianzhu_score = value
        elif key == "户型设计":  # unit layout
            huxing_score = value
        elif key == "交通条件":  # transport
            jiaotong_score = value
        elif key == "教育质量":  # education quality
            jiaoyu_score = value
        elif key == "商业环境":  # commercial environment
            shangye_score = value
        elif key == "花园景观":  # landscaping
            jingguan_score = value
        elif key == "物业管理":  # property management
            wuye_score = value
        else:
            other = " %s:%s " % (key, value)
    peitao_node = lj.find('div', attrs={"class": "box peitao card_box"})
    map_api_node = peitao_node.find('img') if peitao_node is not None else None
    if map_api_node is not None:
        map_api = map_api_node.get('src')
    else:
        map_api = ""

    def get_geo_from_mapapi(map_api):
        # The static-map image URL carries the coordinates in its center=lng,lat parameter
        geo = re.search(r'center=(?P<lng>[\d.]+),(?P<lat>[\d.]+)', map_api, re.I)
        if geo:
            lat = geo.group("lat")
            lng = geo.group("lng")
        else:
            lat = lng = None
        return lat, lng

    lat, lng = get_geo_from_mapapi(map_api)
    gonglue_dict = {
        "xiaoqu_id": id,
        "loc_name": loc_name,
        "total_score": total_score,
        "review_txt": review_txt if review_txt is not None else "",
        "jianzhu_score": jianzhu_score if jianzhu_score is not None else "",
        "huxing_score": huxing_score if huxing_score is not None else "",
        "jiaotong_score": jiaotong_score if jiaotong_score is not None else "",
        "jiaoyu_score": jiaoyu_score if jiaoyu_score is not None else "",
        "shangye_score": shangye_score if shangye_score is not None else "",
        "jingguan_score": jingguan_score if jingguan_score is not None else "",
        "wuye_score": wuye_score if wuye_score is not None else "",
        "map_api": map_api,
        "lng": lng if lng is not None else "",
        "lat": lat if lat is not None else "",
        "other": other
    }
    return gonglue_dict
3.2.2 Building the guide list for all communities
# Walk the community list from step 1 and build the guide list entry by entry
def handle_gonglue_by_xiaoqu(self, file_path, save_path, if_distance=False, local_geo=None):
    # Validate the arguments
    if if_distance and local_geo is None:
        logger.error("in handle_gonglue_by_xiaoqu, if if_distance is True, local_geo can't be None")
        exit(1)
    # Build the list of communities that have a guide
    url_list = []
    with open(file_path, encoding='utf-8-sig') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            if row['have_gonglue'] == "1":
                url_list.append(row['xiaoqu_id'])
    # If the guide CSV already exists, collect the records that were already processed
    handled_list = []
    fieldnames = ['xiaoqu_id', 'loc_name', 'total_score', "review_txt", "jianzhu_score", "huxing_score",
                  "jiaotong_score", "jiaoyu_score", "shangye_score", "jingguan_score", "wuye_score",
                  "map_api", "lat", "lng", "distance", "other"]
    if os.path.isfile(save_path):
        with open(save_path, encoding='utf-8-sig') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                handled_list.append(row['xiaoqu_id'])
    else:
        # Otherwise create an empty CSV containing only the header
        with open(save_path, "a+", newline='\n', encoding='utf-8-sig') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
    handled_set = set(handled_list)
    logger.info("handle_gonglue_by_xiaoqu, the length of url_list: %s" % len(url_list))
    # Fetch the guide of every community in the list
    with open(save_path, "a+", newline='\n', encoding='utf-8-sig') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        for xiaoqu_id in url_list:
            if xiaoqu_id not in handled_set:
                gonglue_dict = self.get_xiaoqu_gonglue_dict(id=xiaoqu_id)
                if if_distance:
                    distance = get_distance((gonglue_dict["lat"], gonglue_dict["lng"]), local_geo)
                    gonglue_dict["distance"] = distance
                writer.writerow(gonglue_dict)
                handled_set.update({xiaoqu_id})
            else:
                logger.info("xiaoqu %s is handled" % xiaoqu_id)
3.3 From the guide list, build the list of homes to watch
3.3.1 Fetching the house list of one community
# Given a community id, fetch the homes in it that satisfy the filters
def get_houselist_by_xiaoqu(self, xiaoqu_id):
    # e.g. https://m.ke.com/sh/ershoufang/bp350ep450l2l3ba67ea70c5011000009590
    # bp350ep450      total price band (in units of 10,000 RMB)
    # l2l3            2-room or 3-room layouts
    # ba67ea70        area band 67-70 m²
    # c5011000009590  community id
    url = "https://m.ke.com/sh/ershoufang/bp350ep450l2l3ba60ea90c%s" % xiaoqu_id
    html = requests.get(url=url, headers=self.page_headers).content
    house_list = []
    lj = BeautifulSoup(html, 'html.parser')
    # The page holds several lists: the current search plus recommendations for other communities
    view_body = lj.find('div', attrs={'class': 'list-view-section-body'})
    item_list = view_body.find_all('div', attrs={'class': 'lj-track', 'data-click-event': 'SearchClick'})
    for item in item_list:
        house_body = item.find("div", attrs={'class': 'kem__house-tile-ershou'})
        house_id = house_body.get("data-id")
        logger.info("handle house_id:%s" % house_id)
        house_txt = house_body.find("div", attrs={'class': 'house-text'})
        house_title = house_txt.find("div", attrs={"class": 'house-title'}).text
        house_desc = house_txt.find("div", attrs={"class": 'house-desc'}).string
        house_price_total = house_txt.find("span", attrs={"class": "price-total"}).strong.string
        house_price_unit = house_txt.find("span", attrs={"class": "price-unit"}).string.strip("元/平")
        house_dict = {
            "xiaoqu_id": xiaoqu_id,
            "house_id": house_id,
            "title": house_title,
            "desc": house_desc,
            "price_total": house_price_total,
            "price_unit": house_price_unit
        }
        house_list.append(house_dict)
    return house_list
3.3.2 Building the house list from the guide list
# From the guide list, pick the communities of interest, then fetch each one's house list
def handle_hoselist_by_gonglue(self, file_path, save_path, filter_func=None):
    xiaoqu_list = []
    with open(file_path, encoding='utf-8-sig') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            if filter_func is not None:
                if filter_func(row):
                    # Add the community's id to the work list
                    xiaoqu_list.append((row["xiaoqu_id"], row["loc_name"], row["distance"]))
            else:
                xiaoqu_list.append((row["xiaoqu_id"], row["loc_name"], row["distance"]))
    handled_list = []
    fieldnames = ['xiaoqu_id', 'xiaoqu_name', 'distance', 'house_id', 'title', "desc", "price_total", "price_unit"]
    if os.path.isfile(save_path):
        with open(save_path, encoding='utf-8-sig') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                handled_list.append(row['xiaoqu_id'])
    else:
        # Otherwise create an empty CSV containing only the header
        with open(save_path, "a+", newline='\n', encoding='utf-8-sig') as csvfile:
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
    handled_set = set(handled_list)
    logger.info(
        "handle_hoselist_by_xiaoqu, to be handled: %s, have handled:%s " % (len(xiaoqu_list), len(handled_set)))
    with open(save_path, "a+", newline='\n', encoding='utf-8-sig') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        for xiaoqu_id, xiaoqu_loc_name, distance in xiaoqu_list:
            if xiaoqu_id not in handled_set:
                logger.info("handle xiaoqu:%s" % xiaoqu_id)
                house_list = self.get_houselist_by_xiaoqu(xiaoqu_id)
                if len(house_list) > 0:
                    for house_dict in house_list:
                        house_dict["xiaoqu_name"] = xiaoqu_loc_name
                        house_dict["distance"] = distance
                        writer.writerow(house_dict)
                else:
                    house_dict = {
                        "xiaoqu_id": xiaoqu_id,
                        "xiaoqu_name": xiaoqu_loc_name,
                        "distance": distance
                    }
                    writer.writerow(house_dict)
                    logger.info("xiaoqu:%s %s have no matching house." % (xiaoqu_id, xiaoqu_loc_name))
                handled_set.update({xiaoqu_id})
            else:
                logger.info("%s is handled" % xiaoqu_id)