Scraping Zhilian Zhaopin job listings with Scrapy (Python 3) and storing them in MongoDB
A note up front: the Zhilian Zhaopin spider itself is secondary here. The real goal is a small data-mining project built on job data from Zhilian Zhaopin, and this first post covers how to scrape the job listings end to end in one go.
(I) Using the Scrapy framework
Scrapy is one of the more polished Python crawling frameworks. It supports distributed crawling and already implements the whole chain from fetching and parsing to storing results, so building on it lets you avoid a lot of unnecessary bugs, provided you understand how to use it. I won't cover installing and getting started with Scrapy here; a search engine will turn up plenty of introductions.
(二) 創(chuàng)建項(xiàng)目
Pick a convenient working directory and generate a Scrapy project with the commands below; I chose the E: drive.
If you can't remember the Scrapy commands, just type scrapy at the command prompt and it will print a list of hints.
Command 1:
scrapy startproject zhilianspider
這里是創(chuàng)建是一個(gè)工程,我們?cè)賱?chuàng)建一個(gè)spider,
Command 2:
scrapy genspider zhilian "https://m.zhaopin.com/beijing"
(III) Opening the project in PyCharm
Try to open the project like this; it saves trouble. In the screenshot, the blurred files are ones I created myself (their contents are shown in full below); the unblurred files are the ones the project generator produced.
(IV) Writing the spider
(1) items.py
```python
import scrapy


class ZhilianspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    job_name = scrapy.Field()
    job_link = scrapy.Field()
    job_info = scrapy.Field()
    job_tags = scrapy.Field()
    company = scrapy.Field()
    address = scrapy.Field()
    salary = scrapy.Field()
```

These are the fields we collect for each posting.
(2) pipelines.py (storing the data in MongoDB)
```python
import pymongo


class ZhilianspiderPipeline(object):
    def __init__(self):
        # connect=False defers the actual connection until first use
        self.client = pymongo.MongoClient("localhost", connect=False)
        db = self.client["zhilian"]
        self.collection = db["python"]

    def process_item(self, item, spider):
        content = dict(item)
        # insert() is deprecated in pymongo 3.x; insert_one() is the current API
        self.collection.insert_one(content)
        print("################### saved to MongoDB ########################")
        return item

    def close_spider(self, spider):
        self.client.close()
```

This uses a local MongoDB instance.
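If you want to see what process_item does without a running MongoDB server, the same logic can be exercised against an in-memory stand-in. FakeCollection and the sample job dict below are invented for this sketch and are not part of pymongo:

```python
# A stand-in "collection" so the pipeline logic can run without a MongoDB server.
class FakeCollection:
    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


# The core of ZhilianspiderPipeline.process_item:
# convert the Scrapy item to a plain dict and insert it.
def process_item(collection, item):
    content = dict(item)
    collection.insert_one(content)
    return item


fake = FakeCollection()
job = {'job_name': 'python developer', 'salary': '15K-25K', 'company': 'ExampleCo'}
process_item(fake, job)
print(fake.docs)
```

The real pipeline does exactly this, just against a pymongo collection instead of a list.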
(3) middlewares.py (mainly to cope with anti-scraping measures)
```python
import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

from zhilianspider.ua_phone import ua_list


class UserAgentmiddleware(UserAgentMiddleware):
    """Rotate the User-Agent header on every request."""

    def process_request(self, request, spider):
        agent = random.choice(ua_list)
        request.headers['User-Agent'] = agent
```

The contents of ua_phone.py:
```python
ua_list = [
    "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3",
    "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
]
```

(4) settings.py
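The middleware's rotation boils down to random.choice over this list. A quick standalone check of that logic, using a shortened two-entry stand-in for ua_list, looks like:

```python
import random

# A shortened stand-in for the ua_list defined in ua_phone.py.
ua_list = [
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
]

# Same logic as UserAgentmiddleware.process_request: pick one UA at random
# and set it as the request's User-Agent header.
headers = {}
headers['User-Agent'] = random.choice(ua_list)
print(headers['User-Agent'])
```

Because the choice is uniform, duplicated entries in the full list simply make those user agents more likely to be picked.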
主要是關(guān)閉了robots協(xié)議,延遲0.5秒發(fā)生請(qǐng)求,UA頭設(shè)置,pipeline下載設(shè)置。
```python
ROBOTSTXT_OBEY = False   # do not honour robots.txt
DOWNLOAD_DELAY = 0.5     # 0.5 s delay between requests

# The UA middleware implements process_request, so it must be registered as a
# downloader middleware (under SPIDER_MIDDLEWARES it would never be called).
DOWNLOADER_MIDDLEWARES = {
    'zhilianspider.middlewares.UserAgentmiddleware': 400,
}

ITEM_PIPELINES = {
    'zhilianspider.pipelines.ZhilianspiderPipeline': 300,
}
```

(5) zhilian.py (the spider's parsing logic)
```python
# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup

from zhilianspider.items import ZhilianspiderItem


class ZhilianSpider(scrapy.Spider):
    name = 'zhilian'
    allowed_domains = ['m.zhaopin.com']
    # start_urls = ['https://m.zhaopin.com/hangzhou/']
    start_urls = ['https://m.zhaopin.com/beijing-530/?keyword=python&pageindex=1&maprange=3&islocation=0']
    base_url = 'https://m.zhaopin.com/'

    def parse(self, response):
        print(response.url)
        # note: response.body (bytes), not response.text
        soup = BeautifulSoup(response.body, 'lxml')
        all_sec = soup.find('div', class_='r_searchlist positiolist').find_all('section')
        for sec in all_sec:
            d_link = sec.find('a', class_='boxsizing')['data-link']
            detail_link = self.base_url + d_link
            if detail_link:
                yield scrapy.Request(detail_link, callback=self.parse_detail)
        # is there a next-page link?
        if soup.find('a', class_='nextpage'):
            next_url = self.base_url + soup.find('a', class_='nextpage')['href']
            print('next_url ', next_url)
            # dont_filter=True: follow even if the URL was already seen
            yield scrapy.Request(next_url, callback=self.parse, dont_filter=True)

    def parse_detail(self, response):
        item = ZhilianspiderItem()
        item['job_link'] = response.url
        item['job_name'] = response.xpath('//*[@class="job-name fl"]/text()')[0].extract()
        item['company'] = response.xpath('//*[@class="comp-name"]/text()')[0].extract()
        item['address'] = response.xpath('//*[@class="add"]/text()').extract_first()
        item['job_info'] = ''.join(response.xpath('//*[@class="about-main"]/p/text()').extract())
        item['salary'] = response.xpath('//*[@class="job-sal fr"]/text()')[0].extract()
        item['job_tags'] = ';'.join(response.xpath("//*[@class='tag']/text()").extract())
        yield item
```

(V) Running the spider
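To see what the parse method extracts, here is the same BeautifulSoup traversal run over a tiny hand-made HTML snippet that imitates the listing page's structure. The snippet is invented for illustration (the real markup may differ), and 'html.parser' is used so no lxml install is needed:

```python
from bs4 import BeautifulSoup

# Invented snippet imitating the m.zhaopin.com search-result page structure.
html = """
<div class="r_searchlist positiolist">
  <section><a class="boxsizing" data-link="job/1.htm">python dev</a></section>
  <section><a class="boxsizing" data-link="job/2.htm">data engineer</a></section>
</div>
<a class="nextpage" href="beijing-530/?pageindex=2">next</a>
"""

base_url = 'https://m.zhaopin.com/'
soup = BeautifulSoup(html, 'html.parser')

# Same traversal as ZhilianSpider.parse: collect detail links...
links = []
for sec in soup.find('div', class_='r_searchlist positiolist').find_all('section'):
    links.append(base_url + sec.find('a', class_='boxsizing')['data-link'])

# ...and build the next-page URL.
next_url = base_url + soup.find('a', class_='nextpage')['href']
print(links)
print(next_url)
```

In the real spider each collected link becomes a scrapy.Request with parse_detail as its callback, and next_url is fed back into parse.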
- Method 1: scrapy crawl zhilian
- Method 2 (recommended): create a run.py file and run that instead
I used the second method: just right-click run.py and run it, which is much more convenient.
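The original post showed run.py only as a screenshot. A minimal version, following the common pattern and assuming the file sits in the project root next to scrapy.cfg, could look like this:

```python
# run.py: launch the spider from inside PyCharm instead of the command line
from scrapy import cmdline

# equivalent to typing "scrapy crawl zhilian" in a terminal
cmdline.execute('scrapy crawl zhilian'.split())
```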
(VI) The data in MongoDB
In total, 4,541 Python job postings in Beijing were scraped.
Next up: distributed scraping of rental listings from Fang.com (房天下) using scrapy_redis.