Crawler Notes 05
Using BeautifulSoup to scrape Anjuke rental listing data
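The script below relies on a few BeautifulSoup navigation idioms: find_all(class_=...) to collect tags by CSS class, chained tag access such as block.h3.a to walk down the tree, ['href'] and .string to read attributes and text, and repeated next_sibling hops to step past the whitespace text nodes between sibling tags. Here is a minimal, self-contained sketch of those idioms on made-up markup (the HTML snippets are hypothetical, not Anjuke's real page structure):

from bs4 import BeautifulSoup

html = '''
<div class="zu-info">
  <h3><a href="https://example.com/house/1">Listing 1</a></h3>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
for block in soup.find_all(class_='zu-info'):
    # Walk down to the first <h3>, then its first <a>; read the href attribute and the text
    print(block.h3.a['href'], block.h3.a.string)

# next_sibling returns the whitespace text node between two tags first,
# so two hops are needed to reach the next <a> tag
li = BeautifulSoup('<li><a>Area</a> <a>District</a></li>', 'lxml').li
print(li.a.next_sibling.next_sibling.string)  # District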
import requests
from bs4 import BeautifulSoup
import time
import json
import re
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
}

def save_house_image(url, name):
    r = requests.get(url, headers=headers)
    # name = str.replace(name, ' ', '_')
    # Strip every character from the title except Chinese characters and digits
    name_s = re.sub('([^\u4e00-\u9fa5\u0030-\u0039])', '', name)
    with open('image/' + name_s + '.png', 'wb') as f:
        f.write(r.content)

def get_link(url):
    result = requests.get(url, headers=headers)
    soup = BeautifulSoup(result.text, 'lxml')
    # Extract the <div> tags that wrap each listing's <a> tag
    links = soup.find_all(class_='zu-info')
    for link in links:
        # Get the URL of each listing's detail page
        href = link.h3.a['href']
        # print(link.h3.a['href'])
        house_info = get_info(href)
        print(house_info)
        # Save the listing info to houses.txt, one JSON object per line
        f.write(json.dumps(house_info, ensure_ascii=False) + '\n')

def get_info(url):
    # get_link('https://sz.zu.anjuke.com/fangyuan/p1/')
    result = requests.get(url, headers=headers)
    soup = BeautifulSoup(result.text, 'lxml')
    div = soup.find(class_='wrapper')
    # Extract the title
    title = div.h1.div.string
    print(title)
    # Extract the address
    # Address part 1: the last <li> node
    address1 = div.find(class_='lbox').ul.contents[-2].a
    # Address part 2: sibling node, the next <a> node
    address2 = address1.next_sibling.next_sibling
    # Address part 3: sibling node, the next <a> node
    address3 = address2.next_sibling.next_sibling
    address = address2.string + address3.string + ' ' + address1.string
    print(address)
    # Extract the price and the payment type
    price = div.find(class_='lbox').ul.li.span.text
    pay_type = div.find(class_='lbox').ul.li.span.next_sibling.next_sibling.string
    print(price)
    print(pay_type)
    # Extract the image URL
    image_url = div.find(id='room_pic_wrap').div.img['src']
    print(image_url)
    # Extract the broker's name
    broker_name = div.find(class_='broker-name').string
    print(broker_name)
    info = {
        'title': title,
        'address': address,
        'price': price + ' ' + pay_type,
        'image_url': image_url,
        'broker_name': broker_name,
    }
    # Save the listing's image
    save_house_image(image_url, title)
    return info

if __name__ == '__main__':
    f = open('./houses.txt', 'a+', encoding='gbk')
    urls = ['https://sz.zu.anjuke.com/fangyuan/p{}/'.format(i) for i in range(1, 11)]
    # Crawl and parse the HTML of the first 10 pages of listings
    for single_url in urls:
        get_link(single_url)
        time.sleep(1)
    f.close()
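Each line appended to houses.txt is a standalone JSON object, so the results can be inspected line by line afterwards. A minimal sketch for reading them back (assuming the same gbk encoding the script writes with):

import json

with open('./houses.txt', encoding='gbk') as f:
    for line in f:
        house = json.loads(line)
        print(house['title'], house['price'])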
Before running the script, manually create an image subdirectory (the code saves each listing's photo there as a .png file).
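Alternatively, a small guard at the start of __main__ can create the folder automatically; a minimal sketch, assuming the script's 'image/' path is kept (an optional tweak, not part of the original code):

import os

# Create the image/ output directory if it does not already exist
os.makedirs('image', exist_ok=True)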