Scraping the National Bureau of Statistics' 2020 Five-Level Linked Administrative Divisions (Accurate)

Published: 2024/1/1

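Unless other fresh data turns up, this should be the last post in my National Bureau of Statistics series.
The approach is basically the same root-node/leaf-node idea I used in my earlier National Bureau of Statistics scrapers. Code first; I will explain the details once I have thought them through (the code contains some comments).
Scraping down to the smallest (village) level took about an hour and a half, because I used neither proxy IPs nor multiprocessing and the final pass has to visit 40,000+ township-level pages. The result is 630,000+ village-level records. Datasets found online range from 640,000 to 740,000 records, however: leaving aside that they do not match mine, they do not even agree with each other in size, and I do not know why qaq
(Update: I came across this blogger's post, https://blog.csdn.net/xuemu2008/article/details/110262257. His record count is exactly the same as the number this code produces, so the result should be completely correct.)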
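To make the root-node/leaf-node idea concrete before the real scraper, here is a minimal sketch that walks a tiny in-memory tree level by level instead of fetching web pages. Everything in it is made up for illustration: the `PAGES` data, the page ids, and the `crawl` helper are assumptions, not part of the original code. Rows with a child page become roots for the next level, rows without one are stored as leaves.

```python
# Toy stand-in for the website (made-up data): page id -> list of rows.
# Each row is (code, name, child_page); child_page is None for a leaf.
PAGES = {
    "index": [("11", "A Province", "p11"), ("12", "B Province", "p12")],
    "p11": [("1101", "A City", "p1101"), ("1102", "Leaf District", None)],
    "p12": [("1201", "B City", None)],
    "p1101": [("110101", "A County", None)],
}


def crawl(start, max_depth):
    """Walk the tree level by level; return (leaves, remaining_roots)."""
    leaves = {}                 # joined code -> joined name, for rows with no child page
    roots = {start: ("", "")}   # pages to visit at the current depth
    for depth in range(max_depth):
        next_roots = {}
        for page, (codes, names) in roots.items():
            sep = "-" if depth else ""   # no separator before the first level
            for code, name, child in PAGES[page]:
                entry = (codes + sep + code, names + sep + name)
                if child is None:
                    leaves[entry[0]] = entry[1]   # leaf: store it now
                else:
                    next_roots[child] = entry     # root: visit at the next depth
        roots = next_roots
    return leaves, roots


leaves, roots = crawl("index", 2)
print(leaves)
print(roots)
```

Stopping at depth 2 leaves one unvisited root (`p1101`), which mirrors how the scraper keeps the roots of the last requested level alongside the leaves it has collected.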

import re
import time

import requests
import xlsxwriter

time_start = time.time()
agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
# Number of leading digits of the 12-digit code to keep at each depth: 2, 4, 6, 9, 12
choose_ls = [depth * 2 if depth <= 3 else 3 * (depth - 1) for depth in range(1, 6)]
match_level = ['provincetr', 'citytr', 'countytr', 'towntr', 'villagetr']
initurl = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/index.html'
total_dict = {}
depth = 0
each_root = {initurl: ('', '')}
# Optional: 1-5 stand for province, prefecture, county, township and village level.
# The scraper collects every leaf node within this depth plus the root nodes at this depth.
max_depth = 5

while depth < max_depth:
    total_count = 0
    next_root = {}
    for url in each_root:
        code_join = each_root[url][0] + '-' if depth != 0 else each_root[url][0]
        zone_join = each_root[url][1] + '-' if depth != 0 else each_root[url][1]
        change_root = '/'.join(url.split('/')[:-1]) + '/'
        while True:
            try:
                req = requests.get(url, headers=agent)
                # Decode as GBK. Do not use req.encoding = req.apparent_encoding:
                # it is detected as 'gb2312', which cannot decode many complex characters.
                req.encoding = 'GBK'
                text = req.text
                # '.' in a regex does not match a newline, so replace newlines with a literal \n
                text = text.replace('\n', '\\n')
                special_sigh = False
                if match_level[depth] in text:
                    match_text = re.findall(r"class='%s'>(.*?)</table" % match_level[depth], text)[0]
                    break
                else:
                    search = False
                    # Dongguan, Zhongshan and Danzhou have no county-level units, so detect
                    # this case and carry the page over to the next level
                    for level in range(depth, 5):
                        if match_level[level] in text:
                            match_text = re.findall(r"class='%s'>(.*?)</table" % match_level[level], text)[0]
                            search = True
                            special_sigh = True
                            print('Special division: %s' % each_root[url][1])
                            break
                    if search:
                        break
                    else:
                        print('Server busy')
                        time.sleep(2)
            except Exception:  # the original bare "except:" also swallowed Ctrl+C
                print('Server busy')
                time.sleep(2)
        if special_sigh:
            next_root[url] = (code_join, zone_join)
        else:
            if depth != 0:
                has_tree = re.findall(r"href='(.*?)'>(\d+?)<.*?html'>(.*?)</a></td></tr>", match_text)
            else:
                base_tree = re.findall(r"href='(.*?)'>(.*?)<br/", match_text)
                has_tree = [(each[0], each[0].split('.html')[0], each[1]) for each in base_tree]
            base_no = re.findall(r"td>(\d+?)</td><td>(.*?)</td></tr>", match_text)
            no_tree = [(each[0], re.findall(r'<td>(.+)', each[1])[0] if 'td' in each[1] else each[1]) for each in base_no]
            for each in has_tree:
                each_dir = change_root + each[0]
                next_root[each_dir] = (code_join + each[1][:choose_ls[depth]], zone_join + each[2])
                if depth == 3:
                    if (total_count + 1) % 100 == 0:
                        print('Scraped %d, at path %s' % (total_count + 1, zone_join + each[2]))
                else:
                    print('At path %s' % (zone_join + each[2]))
            if no_tree:
                for each in no_tree:
                    total_dict[code_join + each[0][:choose_ls[depth]]] = zone_join + each[1]
                    if depth == 4:
                        if (total_count + 1) % 800 == 0:
                            print('Scraped %d, at path %s' % (total_count + 1, zone_join + each[1]))
                    else:
                        print('Got path %s' % (zone_join + each[1]))
        total_count += 1  # one more page processed
    depth += 1
    each_root = next_root


def decompose(each):
    # Turn a '-'-joined key/value pair into a flat [code, name, code, name, ...] row
    if type(total_dict[each]) == tuple:
        codelist = total_dict[each][0].split('-')
        namelist = total_dict[each][1].split('-')
    else:
        codelist = each.split('-')
        namelist = total_dict[each].split('-')
    if len(codelist) < depth:
        for i in range(len(codelist), depth):
            codelist.append('')
            namelist.append('')
    ziplist = list(zip(codelist, namelist))
    return [i for j in ziplist for i in j]


sort_name = ['Province', 'Prefecture', 'County', 'Township', 'Village']
real_column = [(sort_name[each] + ' code', sort_name[each] + ' name') for each in range(depth)]
flat_col = [i for each in real_column for i in each]
total_dict.update(each_root)
if depth <= 3:
    # County level and above is small (about three thousand rows), so Excel is fine
    wk = xlsxwriter.Workbook('五级联动.xlsx')
    sh = wk.add_worksheet('sheet1')
    for each in range(2 * depth):
        sh.write(0, each, flat_col[each])
    totalrow = 1
    for each in total_dict:
        flatlist = decompose(each)
        for i in range(2 * depth):
            sh.write(totalrow, i, flatlist[i])
        totalrow += 1
    wk.close()
else:
    # Below county level there is much more data and Excel has no advantage, so write CSV
    book = open('五级联动.csv', 'w', encoding='utf-8')
    book.write(','.join(flat_col) + '\n')
    for each in total_dict:
        flatten = decompose(each)
        book.write(','.join(flatten) + '\n')
    book.close()
time_end = time.time()
rest_second = time_end - time_start
print('Elapsed: %d min %d s' % divmod(rest_second, 60))
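To see what the regular expressions and the `choose_ls` prefix lengths in the code above do, the sketch below applies them to two hand-written rows. The rows are made up to mimic the markup those patterns expect (a linked row and a three-column village row); they are an assumption about the page structure, not copied from the site.

```python
import re

# Made-up rows that mimic the markup the regular expressions expect.
county_row = ("<tr class='countytr'><td><a href='01/130102.html'>130102000000</a></td>"
              "<td><a href='01/130102.html'>Example District</a></td></tr>")
village_row = ("<tr class='villagetr'><td>130102001001</td><td>111</td>"
               "<td>Example Community</td></tr>")

# Rows with a link yield (relative URL, 12-digit code, name).
linked = re.findall(r"href='(.*?)'>(\d+?)<.*?html'>(.*?)</a></td></tr>", county_row)

# Rows without a link: the captured name may still carry an extra <td> column,
# which the second pattern strips off.
plain = re.findall(r"td>(\d+?)</td><td>(.*?)</td></tr>", village_row)
plain = [(code, re.findall(r"<td>(.+)", rest)[0] if "td" in rest else rest)
         for code, rest in plain]

# Digits kept per level: province 2, prefecture 4, county 6, township 9, village 12.
choose_ls = [d * 2 if d <= 3 else 3 * (d - 1) for d in range(1, 6)]
county_code = linked[0][1][:choose_ls[2]]
village_code = plain[0][0][:choose_ls[4]]

print(linked)
print(plain)
print(choose_ls, county_code, village_code)
```

Truncating each code to its level's prefix is what lets the joined key read like `13-1301-130102-...`, with every segment carrying only the digits that are meaningful at that level.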

After sorting the village-level data with pandas sort_values, the result looks as shown in the figure:
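The sorting step is not shown in the post, so the following is only a guess at what it might look like. The rows and the column names `village_code` and `village_name` are placeholders invented for this sketch; with the real CSV you would use whatever header the scraper wrote.

```python
import pandas as pd

# Made-up rows; the column names are placeholders, not the real CSV header.
df = pd.DataFrame({
    "village_code": ["130102001002", "110101001001", "130102001001"],
    "village_name": ["C", "A", "B"],
})

# Codes are kept as strings, so sorting is lexicographic; for fixed-width
# 12-digit codes that gives the same order as sorting numerically.
df_sorted = df.sort_values("village_code").reset_index(drop=True)
print(df_sorted)
```

When loading the real file, passing `dtype=str` to `pd.read_csv` should keep the codes as text rather than converting them to integers.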
