Scraping institutional research data from Eastmoney (东方财富网) with Python
I recently had a task that required scraping the institutional research (机构调研) data from Eastmoney. The data lives on the 机构调研 page at http://data.eastmoney.com/jgdy/.
The page looks like this:
As you can see, there are 8464 pages of data in total. A Scrapy crawler cannot simply be pointed at the page URL here: clicking "next page" only fires a JavaScript (AJAX) request and inserts the returned data into the page, so the data for a given page cannot be obtained from the page URL alone.
Using Chrome's developer tools, we can see the request issued behind the scenes when the "next page" button is clicked:
When "next page" is clicked, the browser sends a request to the following address. Let's analyse the structure of this address:
http://data.eastmoney.com/DataCenter_V3/jgdy/xx.ashx?pagesize=50&page=2&js=var%20ZUPcjFOK&param=&sortRule=-1&sortType=0&rt=48759234
The number after &page= in this address specifies which page of data to fetch (page=2 returns the second page of 50 records), so we can retrieve any page simply by changing the number after &page=.
Now let's look at the structure of the returned data:
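The screenshot of the response is not reproduced here. As a stand-in, here is a minimal, hypothetical sketch of the response and of how it can be unpacked; the record values are invented, while the token name comes from the js= parameter of the request URL and the field names are the ones used by the parsing code further down.

import json

# hypothetical example of the returned string: "var <token>=<json>"
raw = 'var ZUPcjFOK={"pages":8464,"data":[{"SCode":"000001","SName":"XX","StartDate":"2016-05-10"}]}'
# everything after the first '=' is plain JSON
json_part = raw[raw.index('=') + 1:]
obj = json.loads(json_part)
print obj['pages']    # total number of pages
print obj['data'][0]  # one survey record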
As you can see, the returned data is a string. If we split it at the first equals sign, the part after the '=' is a JSON string holding the data we want, and the pages field of that JSON object is the total number of pages. Based on this, we can write the following function to get the total page count:
import json
import urllib

# get the total number of pages
def get_pages_count():
    url = '''http://data.eastmoney.com/DataCenter_V3/jgdy/xx.ashx?pagesize=50&page=%d''' % 1
    url += "&js=var%20ngDoXCbV&param=&sortRule=-1&sortType=0&rt=48753724"
    wp = urllib.urlopen(url)
    data = wp.read().decode("gbk")
    start_pos = data.index('=')
    json_data = data[start_pos + 1:]
    result = json.loads(json_data)
    pages = result['pages']
    return pages

Given a range of page numbers, we can then build the list of data URLs, as shown below:
# build the list of URLs for pages start..end
def get_url_list(start, end):
    url_list = []
    while start <= end:
        url = '''http://data.eastmoney.com/DataCenter_V3/jgdy/xx.ashx?pagesize=50&page=%d''' % start
        url += "&js=var%20ngDoXCbV&param=&sortRule=-1&sortType=0&rt=48753724"
        url_list.append(url)
        start += 1
    return url_list

To store the data I use SQLAlchemy's ORM to describe the data model, which is defined as follows:
from sqlalchemy import create_engine, Column, Integer, Date, Float, VARCHAR
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

# charset must be set here, otherwise Chinese text ends up garbled
engine = create_engine('mysql+mysqldb://user:passwd@ip:port/db_name?charset=utf8')
Base = declarative_base()

class jigoudiaoyan(Base):
    __tablename__ = "jigoudiaoyan"
    # auto-increment primary key
    id = Column(Integer, primary_key=True)
    # survey date
    StartDate = Column(Date, nullable=True)
    # stock name
    SName = Column(VARCHAR(255), nullable=True)
    # end date, usually empty
    EndDate = Column(Date, nullable=True)
    # reception method
    Description = Column(VARCHAR(255), nullable=True)
    # full company name
    CompanyName = Column(VARCHAR(255), nullable=True)
    # institution name
    OrgName = Column(VARCHAR(255), nullable=True)
    # company code
    CompanyCode = Column(VARCHAR(255), nullable=True)
    # reception staff
    Licostaff = Column(VARCHAR(800), nullable=True)
    # usually empty, meaning unclear
    OrgSum = Column(VARCHAR(255), nullable=True)
    # price change percentage
    ChangePercent = Column(Float, nullable=True)
    # announcement date
    NoticeDate = Column(Date, nullable=True)
    # reception place
    Place = Column(VARCHAR(255), nullable=True)
    # stock code
    SCode = Column(VARCHAR(255), nullable=True)
    # institution code
    OrgCode = Column(VARCHAR(255), nullable=True)
    # survey personnel
    Personnel = Column(VARCHAR(255), nullable=True)
    # latest price
    Close = Column(Float, nullable=True)
    # institution type
    OrgtypeName = Column(VARCHAR(255), nullable=True)
    # institution type code
    Orgtype = Column(VARCHAR(255), nullable=True)
    # main content, usually empty, meaning unclear
    Maincontent = Column(VARCHAR(255), nullable=True)

Session = sessionmaker(bind=engine)
session = Session()
# create the table
Base.metadata.create_all(engine)

With all of this in place, we can define the following function, which fetches each URL, parses the response and stores the records in the database:
import datetime
import json
import random
import time

import chardet
import urllib2

def clean_str(value):
    # empty strings are stored as NULL, everything else as utf-8 text
    value = value.encode("utf8")
    return None if value == "" else value

def parse_date(value):
    # dates arrive as "YYYY-MM-DD" strings; empty strings become NULL
    value = value.encode("utf8")
    if value == "":
        return None
    return datetime.datetime.strptime(value, "%Y-%m-%d").date()

def parse_float(value):
    # numeric fields arrive as strings; empty strings become NULL
    value = value.encode("utf8")
    if value == "":
        return None
    return float(value)

# fetch every page, parse it and save the records
def save_json_data(user_agent_list):
    pages = get_pages_count()
    len_user_agent = len(user_agent_list)
    url_list = get_url_list(1, pages)
    count = 0
    for url in url_list:
        request = urllib2.Request(url)
        request.add_header('Referer', 'http://data.eastmoney.com/jgdy/')
        # pick a random User-Agent from the pool for this request
        pos = random.randint(0, len_user_agent - 1)
        request.add_header('User-Agent', user_agent_list[pos])
        reader = urllib2.urlopen(request)
        data = reader.read()
        # detect the encoding automatically and skip anything that cannot be decoded
        encoding = chardet.detect(data)['encoding']
        data = data.decode(encoding, 'ignore')
        start_pos = data.index('=')
        json_data = data[start_pos + 1:]
        result = json.loads(json_data)
        list_data = result['data']
        count += 1
        for item in list_data:
            one = jigoudiaoyan()
            one.StartDate = parse_date(item['StartDate'])            # survey date
            one.SName = clean_str(item['SName'])                     # stock name
            one.EndDate = parse_date(item['EndDate'])                # end date
            one.Description = clean_str(item['Description'])         # reception method
            one.CompanyName = clean_str(item['CompanyName'])
            one.OrgName = clean_str(item['OrgName'])
            one.CompanyCode = clean_str(item['CompanyCode'])
            one.Licostaff = clean_str(item['Licostaff'])
            one.OrgSum = clean_str(item['OrgSum'])
            one.ChangePercent = parse_float(item['ChangePercent'])   # price change percentage
            one.NoticeDate = parse_date(item['NoticeDate'])          # announcement date
            one.Place = clean_str(item['Place'])
            one.SCode = clean_str(item['SCode'])
            one.OrgCode = clean_str(item['OrgCode'])
            one.Personnel = clean_str(item['Personnel'])
            one.Close = parse_float(item['Close'])                   # latest price
            one.OrgtypeName = clean_str(item['OrgtypeName'])
            one.Orgtype = clean_str(item['Orgtype'])
            one.Maincontent = clean_str(item['Maincontent'])
            session.add(one)
        session.commit()
        print 'percent:', count * 1.0 / pages, "complete!, now", count
        # delay 1s between pages
        time.sleep(1)
To speed up the crawl I also set up a User-Agent pool: each request draws one entry from the pool at random and uses it as that request's User-Agent. The corresponding list, user_agent_list, is defined as follows:
# pool of User-Agent strings
user_agent_list = []
user_agent_list.append("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 ")
user_agent_list.append("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50")
user_agent_list.append("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1")
user_agent_list.append("Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11")
user_agent_list.append("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 ")
user_agent_list.append("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36")

Note that I use the chardet module to detect the page encoding automatically before decoding, and, to cope with the rare case where decoding still fails, the decode call is told to skip any characters it cannot decode. The relevant excerpt:
encoding = chardet.detect(data)['encoding']
# skip anything that cannot be decoded
data = data.decode(encoding, 'ignore')

Addendum:
The last field in the URL (the rt parameter) is a timestamp, used to determine the moment at which the "latest price" is taken (maybe as an anti-crawler measure?). After reading the page source I worked out how this timestamp is generated; the code is below for anyone who needs it (as far as I can tell, Eastmoney generates this field the same way across its pages):
import time

# current timestamp used for the rt field
def get_timstamp():
    timestamp = int(int(time.time()) / 30)
    return str(timestamp)
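For completeness, a plausible top-level driver (assuming the functions, the session and the user_agent_list defined above) might look like this; the crawling code hard-codes an rt value, and get_timstamp() could be used to refresh it when building the URLs:

if __name__ == '__main__':
    # scrape every page, then check how many rows ended up in MySQL
    print 'total pages:', get_pages_count()
    save_json_data(user_agent_list)
    print 'rows saved:', session.query(jigoudiaoyan).count()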
Reposted from: https://www.cnblogs.com/zhoudayang/p/5474053.html