Python Web Scraping: Zhaopin (智联招聘), Part 1
Development Environment
Windows 7 or later, Python 3.4 or later
pymysql library; install with: pip3 install pymysql
selenium library, Firefox 56.0, geckodriver.exe (see selenium notes)
MySQL 5.5 database, Navicat GUI
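Before writing the scraper itself, a quick smoke test can confirm that selenium can drive Firefox and that pymysql can reach the database. This is just a sketch: it assumes geckodriver.exe is on PATH and a local MySQL server with user root, an empty password, and a database named python (the same settings used later in this post); adjust to your own setup.

import pymysql
from selenium import webdriver

# Assumes geckodriver.exe is on PATH; otherwise pass
# executable_path=r'C:\path\to\geckodriver.exe' to webdriver.Firefox()
fox = webdriver.Firefox()
fox.get('https://www.zhaopin.com/')
print(fox.title)  # should print the Zhaopin page title
fox.quit()

# Assumes a local MySQL server with an empty root password and a 'python' database
conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                       passwd='', db='python', charset='utf8')
conn.close()
print('MySQL connection OK')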
Scraping Steps
1. Analyze the Zhaopin site and fetch the page
Open https://www.zhaopin.com/, choose the city 北京, type GIS into the keyword box, and click 搜工作 (search jobs). The page then lists GIS-related postings for the Beijing area.
Press F12 to open the developer tools. The HTML elements behind the city field, the keyword field, and the search button are id=JobLocation, id=KeyWord_kw2, and class=doSearch respectively (see selenium notes). With these, the script can drive the browser to the results page automatically:
Code snippet 1:
def get_main_page(keyword, city):
    # Launch Firefox (geckodriver) and open the Zhaopin home page
    fox = webdriver.Firefox()
    url = 'https://www.zhaopin.com/'
    fox.get(url)
    time.sleep(1)
    # Fill in the city field
    jl = fox.find_element_by_id('JobLocation')
    jl.clear()
    jl.send_keys(city)
    # Fill in the keyword field
    zl = fox.find_element_by_id('KeyWord_kw2')
    zl.clear()
    zl.send_keys(keyword)
    # Click the search button and wait for the results page to load
    fox.find_element_by_class_name('doSearch').click()
    time.sleep(3)
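For example, reproducing the manual search from step 1 (assuming the imports from the complete code at the end of this post):

get_main_page('GIS', '北京')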
2. Parse the job listings
View the page source to locate each field. The listing fields are parsed as follows (a more robust way to strip the field labels is sketched after the code):
def get_everypage_info(fox, keyword, city):
    # Switch to the newest window, where the search results are shown
    fox.switch_to_window(fox.window_handles[-1])
    tables = fox.find_elements_by_tag_name('table')
    for i in range(0, len(tables)):
        if i == 0:
            # The first table is the header row; skip it.
            # (An earlier version appended a header list here: 职位名称, 公司名称, ...)
            continue
        address, develop, jingyan, graduate, require = " ", " ", " ", " ", " "
        job = tables[i].find_element_by_tag_name('a').text
        company = tables[i].find_element_by_css_selector('.gsmc a').text
        salary = tables[i].find_element_by_css_selector('.zwyx').text
        spans = tables[i].find_elements_by_css_selector('.newlist_deatil_two span')
        for j in range(0, len(spans)):
            text = spans[j].get_attribute('textContent')
            if "地点" in text:
                address = text[3:]   # strip the '地点:' label
            elif "公司规模" in text:
                develop = text[5:]   # strip the '公司规模:' label
            elif "经验" in text:
                jingyan = text[3:]   # strip the '经验:' label
            elif "学历" in text:
                graduate = text[3:]  # strip the '学历:' label
        require = (tables[i].find_element_by_css_selector('.newlist_deatil_last')
                   .get_attribute('textContent'))[8:]
The code above collects, for every posting on the page: job title, company name, work location, company size, work experience, average monthly salary, education requirement, and job description.
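The fixed slices above ([3:], [5:], [8:]) assume exact label lengths such as 地点: and 公司规模:. A slightly more defensive variant (a sketch, not part of the original script) splits on the colon instead, so a changed label cannot silently truncate or corrupt the value:

def strip_label(text):
    # '地点:北京' -> '北京'; tries the full-width colon first, then the ASCII colon
    for sep in (':', ':'):
        if sep in text:
            return text.split(sep, 1)[1].strip()
    return text.strip()

# usage inside the span loop:
#     if '地点' in text:
#         address = strip_label(text)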
3. Store the data in MySQL
Connect to MySQL and create a new table, write the records into the database row by row, and also append each job description (职位描述) to a txt file.
Connect to MySQL:
table_name = city + '_' + keyword
conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                       passwd='', db='python', charset='utf8')
cursor = conn.cursor()
Create the table:
sql = """CREATE TABLE IF NOT EXISTS %s(
    职位名称 CHAR(100),
    公司名称 CHAR(100),
    工作地点 CHAR(100),
    公司规模 CHAR(100),
    工作经验 CHAR(100),
    平均月薪 CHAR(100),
    学历要求 CHAR(100)
) default charset=UTF8""" % (table_name)
cursor.execute(sql)
Write each record to MySQL and to the txt file:
insert_row = ('insert into {0}(职位名称,公司名称,工作地点,公司规模,工作经验,平均月薪,学历要求) VALUES(%s,%s,%s,%s,%s,%s,%s)'.format(table_name))
insert_data = (job, company, address, develop, jingyan, salary, graduate)
cursor.execute(insert_row, insert_data)
conn.commit()
with open('%s职位描述.txt' % (table_name), 'a', encoding='utf-8') as f:
    f.write(require)
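Because each row gets its own execute/commit, a failure halfway through a page leaves the earlier rows already committed. A batched variant (a sketch; it reuses insert_row, conn, and cursor from above) collects the page's tuples first and writes them in one call, rolling back if anything fails:

rows = [
    (job, company, address, develop, jingyan, salary, graduate),
    # ...one tuple per posting collected from the page
]
try:
    cursor.executemany(insert_row, rows)
    conn.commit()
except pymysql.MySQLError:
    conn.rollback()  # undo the whole page if any row fails
    raise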
4. Page navigation
The "next page" (下一頁) button is located and clicked with the following code:
count = 0
while count <= 10:
    try:
        # Try to click the "next page" button
        fox.find_element_by_class_name('pagesDown-pos').click()
        break
    except Exception:
        time.sleep(8)
        count += 1
        continue
if count > 10:
    fox.close()
else:
    time.sleep(1)
    get_everypage_info(fox, keyword, city)
Note: this part matters. The while loop is what detects the last page: if clicking pagesDown-pos fails ten times in a row, the loop runs out, control falls into the if branch, and the browser is closed; as soon as a click succeeds, break leaves the loop, the else branch runs, and get_everypage_info scrapes the next page.
5. Loop over the cities in main
if __name__ == "__main__":
    citys = ['上海', '深圳', '广州', '武汉', '杭州', '南京', '成都', '青岛']  # '北京' already scraped
    job = '数据挖掘分析'
    for city in citys:
        print(" ")
        get_main_page(job, city)
Once one city's listings are finished, the script automatically moves on to the next city in the list.
6. Notes and pitfalls
(1) Creating the table: declare the table's character set with default charset=UTF8, otherwise inserting Chinese text will fail.
(2) Writing into the table: in 'insert into {0}(职位名称,...) VALUES(%s,...)'.format(table_name), the table name has to be substituted into the SQL string up front. Table and column names in an INSERT statement must not carry single or double quotes, but values passed as execute() parameters are quoted automatically, so formatting the table name in ahead of time avoids broken SQL:
insert_row = ('insert into {0}(职位名称,公司名称,工作地点,公司规模,工作经验,平均月薪,学历要求) VALUES(%s,%s,%s,%s,%s,%s,%s)'.format(table_name))
insert_data = (job, company, address, develop, jingyan, salary, graduate)
cursor.execute(insert_row, insert_data)
(3) Tune the time.sleep() calls to your network speed and machine: with a fast setup the delays can be short, otherwise lengthen them so the code does not look for HTML elements before they exist; an explicit-wait alternative is sketched below.
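Rather than tuning fixed sleeps by hand, selenium's explicit waits poll until an element is actually ready or a timeout expires. A sketch (not used in the script below; fox is the driver from get_main_page, and the selector is the same doSearch button from step 1):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for the search button instead of a fixed time.sleep()
wait = WebDriverWait(fox, 15)
button = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'doSearch')))
button.click()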
Complete code:
from selenium import webdriver
import time
import pymysql


def get_main_page(keyword, city):
    # Launch Firefox (geckodriver) and open the Zhaopin home page
    fox = webdriver.Firefox()
    url = 'https://www.zhaopin.com/'
    fox.get(url)
    time.sleep(1)
    # Fill in the city and keyword fields, then run the search
    jl = fox.find_element_by_id('JobLocation')
    jl.clear()
    jl.send_keys(city)
    zl = fox.find_element_by_id('KeyWord_kw2')
    zl.clear()
    zl.send_keys(keyword)
    fox.find_element_by_class_name('doSearch').click()
    time.sleep(3)
    get_everypage_info(fox, keyword, city)


def get_everypage_info(fox, keyword, city):
    # Switch to the newest window, where the search results are shown
    fox.switch_to_window(fox.window_handles[-1])
    tables = fox.find_elements_by_tag_name('table')
    table_name = city + '_' + keyword
    conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                           passwd='', db='python', charset='utf8')
    cursor = conn.cursor()
    sql = """CREATE TABLE IF NOT EXISTS %s(
        职位名称 CHAR(100),
        公司名称 CHAR(100),
        工作地点 CHAR(100),
        公司规模 CHAR(100),
        工作经验 CHAR(100),
        平均月薪 CHAR(100),
        学历要求 CHAR(100)
    ) default charset=UTF8""" % (table_name)
    cursor.execute(sql)
    for i in range(0, len(tables)):
        if i == 0:
            continue  # the first table is the header row
        address, develop, jingyan, graduate, require = " ", " ", " ", " ", " "
        job = tables[i].find_element_by_tag_name('a').text
        company = tables[i].find_element_by_css_selector('.gsmc a').text
        salary = tables[i].find_element_by_css_selector('.zwyx').text
        spans = tables[i].find_elements_by_css_selector('.newlist_deatil_two span')
        for j in range(0, len(spans)):
            text = spans[j].get_attribute('textContent')
            if "地点" in text:
                address = text[3:]
            elif "公司规模" in text:
                develop = text[5:]
            elif "经验" in text:
                jingyan = text[3:]
            elif "学历" in text:
                graduate = text[3:]
        require = (tables[i].find_element_by_css_selector('.newlist_deatil_last')
                   .get_attribute('textContent'))[8:]
        insert_row = ('insert into {0}(职位名称,公司名称,工作地点,公司规模,工作经验,平均月薪,学历要求) VALUES(%s,%s,%s,%s,%s,%s,%s)'.format(table_name))
        insert_data = (job, company, address, develop, jingyan, salary, graduate)
        cursor.execute(insert_row, insert_data)
        conn.commit()
        with open('%s职位描述.txt' % (table_name), 'a', encoding='utf-8') as f:
            f.write(require)
    print('Page scraped...')
    conn.close()
    # Try to reach the next page; give up after ten failed attempts
    count = 0
    while count <= 10:
        try:
            fox.find_element_by_class_name('pagesDown-pos').click()
            break
        except Exception:
            time.sleep(8)
            count += 1
            continue
    if count > 10:
        fox.close()
    else:
        time.sleep(1)
        get_everypage_info(fox, keyword, city)


if __name__ == "__main__":
    citys = ['上海', '深圳', '广州', '武汉', '杭州', '南京', '成都', '青岛']  # '北京' already scraped
    job = '数据挖掘分析'
    for city in citys:
        print(" ")
        get_main_page(job, city)
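To spot-check what landed in the database (a sketch; the table name assumes city 上海 and keyword 数据挖掘分析 as in main, and the same connection settings as above):

import pymysql

conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                       passwd='', db='python', charset='utf8')
cursor = conn.cursor()
cursor.execute('SELECT 职位名称, 公司名称, 平均月薪 FROM 上海_数据挖掘分析 LIMIT 5')
for row in cursor.fetchall():
    print(row)
conn.close()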
最后獲取的輸入如圖