Reading Notes (10): A Simple Python Scraper for Company Information on Qichacha, Saved in Excel Format

Published: 2023/12/15 · python · 34 · 豆豆

Today's little scraper is a simple one I wrote to help out a friend. Its goal is to collect company information from the Qichacha website.

The most important first step in programming is setting up the environment. Ours is:

Python 3.6
the BeautifulSoup module
the lxml module
the requests module
the xlwt module
geany

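First, analyze the target page:

http://www.qichacha.com/search?key=婚慶

The fields we want to extract are the company name, legal representative, registered capital, founding date, phone, email and address. Next, open Firebug and look at where each of these sits in the page:

They are located as follows: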

# Company name ---------- <a class="ma_h1" href="/firm_8c640ea3b396783ab4e013ea5f7f295e.html" target="_blank">
#                             昆明嘉馨
#                             <em>
#                             有限公司
#                         </a>
# Legal representative -- <p class="m-t-xs">
#                             法定代表人：
#                             <a class="a-blue" href="********">鄢顯莉</a>
# Registered capital ----     <span class="m-l">注冊資本：100萬</span>
# Founding date ---------     <span class="m-l">成立時間：2002-05-20</span>
#                         </p>
#                         <p class="m-t-xs">
# Phone -----------------     電話：13888677871
# Email -----------------     <span class="m-l">郵箱：-</span>
#                         </p>
# Address --------------- <p class="m-t-xs"> 地址：昆明市南屏街88號世紀廣場B2幢12樓A+F號 </p>
#                         <p></p>

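But a big problem stands in our way. After you click the search button, Qichacha does show some results, but the first thing that appears is a login page. Without logging in, what the scraper actually receives is a page containing only the first five companies plus the login window.

If we don't deal with this login page, then sorry, the scrape ends right here.

So we have to solve it. First register an account on Qichacha (steps omitted). In general there are three ways to get past the login:

build the request headers and configure cookies
use selenium
use requests.post to submit the username, password and so on

selenium drives a real browser to visit the page, but it is slow, has to wait for the page to finish loading, and is error-prone, so I dropped it right away.

The requests.post approach might work, but I did not look into it closely. The Qichacha login involves three inputs: the phone number, the password, and a slider. The slider presumably requires constructing a True value or something similar.

The first thing to try is of course building the request headers and configuring cookies. Here I should mention a mistake I made: I wrote User-Agent as User_agent, so my request header was wrong and what I got back was a page blocked by the firewall, as shown below: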

# -*- coding: utf-8 -*-
import requests
import lxml
from bs4 import BeautifulSoup


def craw(url):
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0'
    headers = {'User-Agent': user_agent}
    response = requests.get(url, headers=headers)
    # Decode the body as UTF-8 regardless of the status code
    response.encoding = 'utf-8'
    if response.status_code != 200:
        print(response.status_code)
        print('ERROR')
    soup = BeautifulSoup(response.text, 'lxml')
    print(soup)


if __name__ == '__main__':
    url = r'http://www.qichacha.com/search?key=%E5%A9%9A%E5%BA%86'
    s1 = craw(url)

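This code only prints soup, which is convenient for debugging. The request comes back with a 405 status, and the page's message says that the visited URL may pose a security threat to the site and that access has been blocked. The page looks like this: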

<!DOCTYPE html>
<html lang="zh-cn">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="a3c0e" name="data-spm"/>
<title>405</title>
<style>
html, body, div, a, h2, p { margin: 0; padding: 0; font-family: 微軟雅黑; }
a { text-decoration: none; color: #3b6ea3; }
.container { width: 1000px; margin: auto; color: #696969; }
.header { padding: 50px 0; }
.header .message { height: 36px; padding-left: 120px; background: url(https://errors.aliyun.com/images/TB1TpamHpXXXXaJXXXXeB7nYVXX-104-162.png) no-repeat 0 -128px; line-height: 36px; }
.main { padding: 50px 0; background: #f4f5f7; }
.main img { position: relative; left: 120px; }
.footer { margin-top: 30px; text-align: right; }
.footer a { padding: 8px 30px; border-radius: 10px; border: 1px solid #4babec; }
.footer a:hover { opacity: .8; }
.alert-shadow { display: none; position: absolute; top: 0; left: 0; width: 100%; height: 100%; background: #999; opacity: .5; }
.alert { display: none; position: absolute; top: 200px; left: 50%; width: 600px; margin-left: -300px; padding-bottom: 25px; border: 1px solid #ddd; box-shadow: 0 2px 2px 1px rgba(0, 0, 0, .1); background: #fff; font-size: 14px; color: #696969; }
.alert h2 { margin: 0 2px; padding: 10px 15px 5px 15px; font-size: 14px; font-weight: normal; border-bottom: 1px solid #ddd; }
.alert a { display: block; position: absolute; right: 10px; top: 8px; width: 30px; height: 20px; text-align: center; }
.alert p { padding: 20px 15px; }
</style>
</head>
<body data-spm="7663354">
<div data-spm="1998410538">
<div class="header">
<div class="container">
<div class="message">很抱歉，由于您訪問的URL有可能對網站造成安全威脅，您的訪問被阻斷。</div>
</div>
</div>
<div class="main">
<div class="container">
<img src="https://errors.aliyun.com/images/TB15QGaHpXXXXXOaXXXXia39XXX-660-117.png"/>
</div>
</div>
<div class="footer">
<div class="container">
<a data-spm-click="gostr=/waf.123.123;locaid=d001;" href="javascript:;" id="report" target="_blank">誤報反饋</a>
</div>
</div>
</div>
<div class="alert-shadow" id="alertShadow"></div>
<div class="alert" id="alertContainer">
<h2>提示：<a href="javascript:;" id="closeAlert" title="關閉">X</a></h2>
<p>感謝您的反饋，應用防火墻會盡快進行分析和確認。</p>
</div>
<script>
function show() {
  var g = function(ele) { return document.getElementById(ele); };
  var reportHandle = g('report');
  var alertShadow = g('alertShadow');
  var alertContainer = g('alertContainer');
  var closeAlert = g('closeAlert');
  var own = {};
  own.report = function() {
    // SPM
    own.alert();
  };
  own.alert = function() {
    alertShadow.style.display = 'block';
    alertContainer.style.display = 'block';
  };
  own.close = function() {
    alertShadow.style.display = 'none';
    alertContainer.style.display = 'none';
  };
};
</script>
<script charset="utf-8" src="https://errors.aliyun.com/error.js?s=3" type="text/javascript"></script>
</body>
</html>

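This error also shows how important the request headers are. Servers typically use them for a quick judgment on whether you are an attacker, a crawler, or a normal visitor. So I simply copied the browser's entire request header.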
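A typo like User_agent is easy to miss because a misspelled header name is still accepted as a valid header. The sketch below is one possible guard, not something from the original code: it compares each header name against a short list of names used in this post and reports near-misses. The helper name suspicious_headers and the 0.8 similarity cutoff are assumptions chosen for illustration.

```python
import difflib

# Standard request header names (lowercase) that appear in this post.
KNOWN_HEADERS = ['host', 'user-agent', 'accept', 'accept-language',
                 'accept-encoding', 'referer', 'cookie', 'connection',
                 'if-modified-since', 'if-none-match', 'cache-control']


def suspicious_headers(headers):
    """Return {given_name: suggested_name} for names that look misspelled.

    Hypothetical helper: exact (case-insensitive) matches pass, while names
    very close to a known header are flagged with a suggestion.
    """
    suggestions = {}
    for name in headers:
        lowered = name.lower()
        if lowered in KNOWN_HEADERS:
            continue
        close = difflib.get_close_matches(lowered, KNOWN_HEADERS, n=1, cutoff=0.8)
        if close:
            suggestions[name] = close[0]
    return suggestions


headers = {'User_agent': 'Mozilla/5.0', 'Accept': '*/*'}
print(suspicious_headers(headers))
```

Running a check like this before sending the request would have flagged User_agent and suggested user-agent.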

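One more thing to note: the browser you copy the headers from must have cookies enabled, and it must not clear cookies when it closes, whether automatically or manually. Otherwise you will again get only the first five companies, with the rest replaced by the login prompt.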
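The post pastes the browser's cookie string directly into the Cookie header. As an alternative sketch, the same string can be split into a dictionary, which is easier to inspect. The helper name parse_cookie_header and the sample cookie values are made up for illustration; only the UM_distinctid name comes from the post.

```python
def parse_cookie_header(raw):
    """Turn 'a=1; b=2' (the value of a Cookie request header) into a dict."""
    cookies = {}
    for pair in raw.split(';'):
        # Split on the first '=' only, since values may contain '=' themselves.
        name, sep, value = pair.strip().partition('=')
        if sep and name:  # skip empty or malformed pieces
            cookies[name] = value
    return cookies


# Placeholder string; a real one would be copied from the logged-in browser.
raw_cookie = 'UM_distinctid=abc123; sessionid=xyz; theme=a=b'
cookies = parse_cookie_header(raw_cookie)
print(cookies)
```

The resulting dictionary could then be passed to requests through its cookies parameter, for example requests.get(url, headers=headers, cookies=cookies), instead of a hand-written Cookie header.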

Here is the code. I will go through it piece by piece:

# -*- coding: utf-8 -*-
import requests
import lxml
from bs4 import BeautifulSoup
import xlwt


def craw(url):
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0'
    headers = {
        'Host': 'www.qichacha.com',
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0',
        'Accept': '*/*',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        'Accept-Encoding': 'gzip, deflate',
        'Referer': 'http://www.qichacha.com/',
        'Cookie': r'UM_distinctid***************',
        'Connection': 'keep-alive',
        'If-Modified-Since': 'Wed, 30 **********',
        'If-None-Match': '"59*******"',
        'Cache-Control': 'max-age=0',
    }
    response = requests.get(url, headers=headers)
    # Decode the body as UTF-8 regardless of the status code
    response.encoding = 'utf-8'
    if response.status_code != 200:
        print(response.status_code)
        print('ERROR')
    soup = BeautifulSoup(response.text, 'lxml')
    #print(soup)
    com_names = soup.find_all(class_='ma_h1')
    #print(com_names)
    #com_name1 = com_names[1].get_text()
    #print(com_name1)
    peo_names = soup.find_all(class_='a-blue')
    #print(peo_names)
    peo_phones = soup.find_all(class_='m-t-xs')
    #tags = peo_phones[4].find(text = True).strip()
    #print(tags)
    #tttt = peo_phones[0].contents[5].get_text()
    #print (tttt)
    #else_comtent = peo_phones[0].find(class_='m-l')
    #print(else_comtent)
    global com_name_list
    global peo_name_list
    global peo_phone_list
    global com_place_list
    global zhuceziben_list
    global chenglishijian_list
    print('Scraping started; please do not open the Excel file')
    for i in range(0, len(com_names)):
        # Each company has three 'm-t-xs' paragraphs: 3*i, 3*i+1, 3*i+2
        n = 1+3*i
        m = i+2*(i+1)
        peo_phone = peo_phones[n].find(text=True).strip()
        com_place = peo_phones[m].find(text=True).strip()
        zhuceziben = peo_phones[3*i].find(class_='m-l').get_text()
        chenglishijian = peo_phones[3*i].contents[5].get_text()
        peo_phone_list.append(peo_phone)
        com_place_list.append(com_place)
        zhuceziben_list.append(zhuceziben)
        chenglishijian_list.append(chenglishijian)
    for com_name, peo_name in zip(com_names, peo_names):
        com_name = com_name.get_text()
        peo_name = peo_name.get_text()
        com_name_list.append(com_name)
        peo_name_list.append(peo_name)


if __name__ == '__main__':
    com_name_list = []
    peo_name_list = []
    peo_phone_list = []
    com_place_list = []
    zhuceziben_list = []
    chenglishijian_list = []
    key_word = input('Enter the keyword to search for: ')
    print('Searching, please wait')
    for x in range(1, 11):
        url = r'http://www.qichacha.com/search?key={}#p:{}&'.format(key_word, x)
        s1 = craw(url)
    workbook = xlwt.Workbook()
    # Create the sheet object (a new sheet)
    sheet1 = workbook.add_sheet('xlwt', cell_overwrite_ok=True)
    #--- Set up the Excel style ---
    # Initialize the style
    style = xlwt.XFStyle()
    # Create the font style
    font = xlwt.Font()
    font.name = 'Times New Roman'
    font.bold = True  # bold
    # Attach the font to the style
    style.font = font
    # Write data using the style
    # sheet.write(0, 1, "xxxxx", style)
    print('Saving data; please do not open the Excel file')
    # Write data into the sheet
    name_list = ['Company name', 'Legal representative', 'Phone', 'Registered capital', 'Founding date', 'Address']
    for cc in range(0, len(name_list)):
        sheet1.write(0, cc, name_list[cc], style)
    for i in range(0, len(com_name_list)):
        sheet1.write(i+1, 0, com_name_list[i], style)        # company name
        sheet1.write(i+1, 1, peo_name_list[i], style)        # legal representative
        sheet1.write(i+1, 2, peo_phone_list[i], style)       # phone
        sheet1.write(i+1, 3, zhuceziben_list[i], style)      # registered capital
        sheet1.write(i+1, 4, chenglishijian_list[i], style)  # founding date
        sheet1.write(i+1, 5, com_place_list[i], style)       # address
    # Save the Excel file; an existing file with the same name is overwritten
    workbook.save(r'F:\work\2017_08_02\xlwt.xls')
    print('the excel save success')

First we import the four modules requests, BeautifulSoup, lxml and xlwt.

# -*- coding: utf-8 -*-
import requests
import lxml
from bs4 import BeautifulSoup
import xlwt

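A brief note on each of the four modules:

requests is a third-party module whose source is hosted on GitHub. It is friendlier to use than urllib/httplib and is generally the recommended choice. requests supports many request methods:

import requests
r1 = requests.get(r'http://www.baidu.com')
postdata = {'key':'value'}
r2 = requests.post(r'http://www.xxx.com/login',data=postdata)
r3 = requests.put(r'http://www.xxx.com/put',data={'key':'value'})
r4 = requests.delete(r'http://www.xxx.com/delete')
r5 = requests.head(r'http://www.xxx.com/get')
r6 = requests.options(r'http://www.xxx.com/get')

One more point worth mentioning is the response encoding:

import requests
r = requests.get(r'http://www.baidu.com')
print(r.content)   # returned as bytes
print(r.text)      # returned as text
print(r.encoding)  # page encoding guessed from the HTTP headers; can be changed by direct assignment

I will cover more of requests when I find the time.

BeautifulSoup is a Python library for extracting data from HTML or XML files. It converts the HTML into a document tree, and since it is a tree it has the notion of nodes, which makes its search and extraction features convenient in a scraper. There are generally two ways to use them: first, the find, find_all and related methods; second, CSS selectors.

lxml is a library for querying and processing HTML/XML documents with XPath. It only traverses part of the document, so it is faster and has a smaller memory footprint.

xlwt is a module for writing Excel files, but it can only generate a new workbook. In other words, if a file with the same name already exists at the target path, it is simply overwritten. The module also does not support reading. If you need to read, use xlrd; to write into an existing file, use xlutils together with xlrd. For details I suggest the blog post 《Python excel讀寫》.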
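The relationship between r.content, r.text and r.encoding can be illustrated without any network access. In requests, r.text is r.content decoded using r.encoding, so a wrong encoding guess produces garbled text. The sketch below imitates this with plain bytes and str; the sample string is arbitrary, and ISO-8859-1 stands in for a wrong guess.

```python
text = '企查查'
raw = text.encode('utf-8')         # what r.content would hold: bytes

right = raw.decode('utf-8')        # decoding with the correct encoding
wrong = raw.decode('iso-8859-1')   # decoding with a wrong guess

print(raw)
print(right == text)   # the round trip recovers the original text
print(wrong == text)   # the wrong encoding does not
```

This is why assigning response.encoding = 'utf-8' before reading response.text can matter when the server's headers do not state the charset.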

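The rest is straightforward: define a function, build the request headers, and fetch the page with requests. If the response status code is not 200, print the code along with 'ERROR'. Then parse the page with BeautifulSoup and lxml, pick out the information we need, declare six global lists, and add the scraped data to them with the list method append.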

def craw(url):
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0'
    headers = {
        'Host': 'www.qichacha.com',
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0',
        'Accept': '*/*',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        'Accept-Encoding': 'gzip, deflate',
        'Referer': 'http://www.qichacha.com/',
        'Cookie': r'UM_di**********1',
        'Connection': 'keep-alive',
        'If-Modified-Since': 'Wed, *********',
        'If-None-Match': '"59****"',
        'Cache-Control': 'max-age=0',
    }
    response = requests.get(url, headers=headers)
    # Decode the body as UTF-8 regardless of the status code
    response.encoding = 'utf-8'
    if response.status_code != 200:
        print(response.status_code)
        print('ERROR')
    soup = BeautifulSoup(response.text, 'lxml')
    #print(soup)
    com_names = soup.find_all(class_='ma_h1')
    #print(com_names)
    #com_name1 = com_names[1].get_text()
    #print(com_name1)
    peo_names = soup.find_all(class_='a-blue')
    #print(peo_names)
    peo_phones = soup.find_all(class_='m-t-xs')
    #tags = peo_phones[4].find(text = True).strip()
    #print(tags)
    #tttt = peo_phones[0].contents[5].get_text()
    #print (tttt)
    #else_comtent = peo_phones[0].find(class_='m-l')
    #print(else_comtent)
    global com_name_list
    global peo_name_list
    global peo_phone_list
    global com_place_list
    global zhuceziben_list
    global chenglishijian_list
    print('Scraping started; please do not open the Excel file')
    for i in range(0, len(com_names)):
        # Each company has three 'm-t-xs' paragraphs: 3*i, 3*i+1, 3*i+2
        n = 1+3*i
        m = i+2*(i+1)
        peo_phone = peo_phones[n].find(text=True).strip()
        com_place = peo_phones[m].find(text=True).strip()
        zhuceziben = peo_phones[3*i].find(class_='m-l').get_text()
        chenglishijian = peo_phones[3*i].contents[5].get_text()
        peo_phone_list.append(peo_phone)
        com_place_list.append(com_place)
        zhuceziben_list.append(zhuceziben)
        chenglishijian_list.append(chenglishijian)
    for com_name, peo_name in zip(com_names, peo_names):
        com_name = com_name.get_text()
        peo_name = peo_name.get_text()
        com_name_list.append(com_name)
        peo_name_list.append(peo_name)
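The index arithmetic in craw (3*i, 1+3*i and i+2*(i+1), which equals 3*i+2) relies on every company contributing exactly three m-t-xs paragraphs, in the order representative line, phone line, address line. A minimal sketch of the same idea, using plain strings as stand-ins for the parsed paragraphs, groups the flat list into threes. The helper name group_in_threes is hypothetical.

```python
def group_in_threes(items):
    """Split a flat list into consecutive groups of three.

    A trailing incomplete group is dropped rather than raising an error.
    """
    usable = len(items) - len(items) % 3
    return [items[i:i + 3] for i in range(0, usable, 3)]


# Stand-in strings for the 'm-t-xs' paragraphs of two companies.
paragraphs = ['rep A', 'phone A', 'address A',
              'rep B', 'phone B', 'address B']

for i, (rep, phone, address) in enumerate(group_in_threes(paragraphs)):
    # These are the same positions the original code indexes by hand.
    assert rep == paragraphs[3 * i]
    assert phone == paragraphs[1 + 3 * i]
    assert address == paragraphs[i + 2 * (i + 1)]
    print(i, rep, phone, address)
```

If the page ever contains an extra m-t-xs element, or a company is missing one, every later index shifts, so this layout assumption is worth checking against the actual HTML.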

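By calling craw repeatedly we keep appending data to the lists. Non-member accounts on Qichacha can only view ten pages of results, so ten iterations are enough. There is one thing to watch with range(): it normally starts counting from 0, but the site's first page is numbered 1 (comparing the site's URLs, especially those of the first and second pages, makes this easy to see). To loop ten times we therefore start at 1; 10 is the last value and 11 is the stop value, so it has to be written as range(1,11). After that we create the workbook and a new sheet, define the sheet's style, write the data into Excel with for loops, and finally call save() to store the file at a given path.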

if __name__ == '__main__':
    com_name_list = []
    peo_name_list = []
    peo_phone_list = []
    com_place_list = []
    zhuceziben_list = []
    chenglishijian_list = []
    key_word = input('Enter the keyword to search for: ')
    print('Searching, please wait')
    for x in range(1, 11):
        url = r'http://www.qichacha.com/search?key={}#p:{}&'.format(key_word, x)
        s1 = craw(url)
    workbook = xlwt.Workbook()
    # Create the sheet object (a new sheet)
    sheet1 = workbook.add_sheet('xlwt', cell_overwrite_ok=True)
    #--- Set up the Excel style ---
    # Initialize the style
    style = xlwt.XFStyle()
    # Create the font style
    font = xlwt.Font()
    font.name = 'Times New Roman'
    font.bold = True  # bold
    # Attach the font to the style
    style.font = font
    # Write data using the style
    # sheet.write(0, 1, "xxxxx", style)
    print('Saving data; please do not open the Excel file')
    # Write data into the sheet
    name_list = ['Company name', 'Legal representative', 'Phone', 'Registered capital', 'Founding date', 'Address']
    for cc in range(0, len(name_list)):
        sheet1.write(0, cc, name_list[cc], style)
    for i in range(0, len(com_name_list)):
        sheet1.write(i+1, 0, com_name_list[i], style)        # company name
        sheet1.write(i+1, 1, peo_name_list[i], style)        # legal representative
        sheet1.write(i+1, 2, peo_phone_list[i], style)       # phone
        sheet1.write(i+1, 3, zhuceziben_list[i], style)      # registered capital
        sheet1.write(i+1, 4, chenglishijian_list[i], style)  # founding date
        sheet1.write(i+1, 5, com_place_list[i], style)       # address
    # Save the Excel file; an existing file with the same name is overwritten
    workbook.save(r'F:\work\2017_08_02\xlwt.xls')
    print('the excel save success')
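To make the page loop concrete, the sketch below builds the ten URLs with range(1, 11) and percent-encodes the keyword the way the earlier hard-coded URL does (%E5%A9%9A%E5%BA%86 is the UTF-8 encoding of 婚庆). The function name page_urls is an assumption for illustration. One observation, offered as general HTTP behaviour rather than something the post verifies: everything after # is a URL fragment, which HTTP clients normally do not send to the server, so whether this URL form really selects a different page on the server side is not confirmed here.

```python
from urllib.parse import quote, urlsplit


def page_urls(keyword, pages=10):
    """Build the search URLs for pages 1..pages, in the form used in the post."""
    encoded = quote(keyword)  # percent-encode non-ASCII characters as UTF-8
    return ['http://www.qichacha.com/search?key={}#p:{}&'.format(encoded, page)
            for page in range(1, pages + 1)]


urls = page_urls('婚庆')
print(len(urls))
print(urls[0])

# Split the second URL to see which part is the query and which the fragment.
parts = urlsplit(urls[1])
print(parts.query)
print(parts.fragment)
```

If the scraped pages turn out to be identical, checking in the browser's network panel how the site actually requests page 2 would be the next step.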

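That is basically it for the code, and the scraping results are decent. Earlier I only had a half-finished version that handled a single page. I later added pagination and finished the storage part.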

Reposted from: https://my.oschina.net/u/3629884/blog/1532275
