XPath crawler: scraping national administrative-division and urban-rural classification data
This article, collected and compiled by 生活随笔, introduces an XPath crawler for scraping China's national administrative-division and urban-rural classification data; it is shared here for reference.
數(shù)據(jù)來源地址:2020年度全國行政區(qū)劃和城鄉(xiāng)劃
Code example: Hefei, Anhui Province is used as the example.
```python
import requests
from lxml import etree
import pandas as pd


def get_html(url):
    # Replace the placeholder with your own browser's User-Agent string
    header = {'user-agent': 'your browser User-Agent string'}
    try:
        response = requests.get(url, headers=header)
        # Make sure the page came back successfully before decoding
        if response.status_code == 200:
            # The stats.gov.cn pages are GBK-encoded
            return response.content.decode('gbk')
        else:
            print("{0}Unexpected HTTP status code for the page!{0}".format("-" * 10))
    except Exception as e:
        print("{0}Request failed: {1}{0}".format("-" * 10, e))


def parse_url(url, xpath_path):
    """Intermediate levels: return (name, next-level URL) pairs."""
    html = get_html(url)
    # Relative links on these pages resolve against the current directory
    next_base_url = "/".join(url.split("/")[:-1])
    HTML = etree.HTML(html)
    # Names at the current level and links to the next level down
    all_area = HTML.xpath(f'{xpath_path}/text()')
    next_link = HTML.xpath(f'{xpath_path}/@href')
    return [(name, next_base_url + "/" + link) for name, link in zip(all_area, next_link)]


def parse_url2(url, xpath_path):
    """Lowest level: names only, no further links to follow."""
    html = get_html(url)
    HTML = etree.HTML(html)
    return HTML.xpath(f'{xpath_path}/text()')


# XPath patterns for each level of the hierarchy
county_xpath = '//tr[@class="countytr"]/td[2]/a'
town_xpath = '//tr[@class="towntr"]/td[2]/a'
village_xpath = '//tr[@class="villagetr"]/td[3]'

# Hefei city page (2019 directory; Anhui = 34, Hefei = 3401)
city_url = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/34/3401.html"

result = []
# City page ==> districts/counties: names & links
for area1, county_url in parse_url(city_url, county_xpath):
    # District/county page ==> towns/subdistricts: names & links
    for area2, town_url in parse_url(county_url, town_xpath):
        # Town page ==> villages/neighborhood committees: names only
        for area3 in parse_url2(town_url, village_xpath):
            result.append([area1, area2, area3])

# Columns: district, town/subdistrict, neighborhood committee
df = pd.DataFrame(result, columns=["区", "镇/街道", "居委会"])
df.to_excel("合肥市行政区域划分.xlsx", index=False)
```
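The script above is hard-wired to Hefei. As a rough sketch of how it could be extended, the same helpers can be pointed one level higher at the Anhui province page and looped over every city first. Note the assumptions: the province page URL and the `citytr` row class (analogous to `countytr`/`towntr` above) are not taken from the original article, and a short sleep is added because rapid-fire requests may be rejected by the server.

```python
import time

# Hypothetical extension; reuses get_html/parse_url/parse_url2 and the
# *_xpath constants from the script above.
# Assumptions: the province page URL and the "citytr" row class are not
# confirmed by the original article.
province_url = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/34.html"
city_xpath = '//tr[@class="citytr"]/td[2]/a'

all_rows = []
# Province page ==> cities: names & links
for city_name, city_url in parse_url(province_url, city_xpath):
    for area1, county_url in parse_url(city_url, county_xpath):
        for area2, town_url in parse_url(county_url, town_xpath):
            for area3 in parse_url2(town_url, village_xpath):
                all_rows.append([city_name, area1, area2, area3])
            time.sleep(0.5)  # pause between town pages to avoid hammering the server

df_all = pd.DataFrame(all_rows, columns=["市", "区/县", "镇/街道", "居委会"])
df_all.to_excel("安徽省行政区域划分.xlsx", index=False)
```

Dropping the sleep or firing requests in parallel may trip the site's rate limiting, so this sketch trades speed for reliability.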
Summary

The above is everything 生活随笔 has collected on "XPath crawler: scraping national administrative-division and urban-rural classification data"; hopefully it helps you solve the problems you run into.