Scrape Company Reviews & Ratings from Indeed in 2 Minutes
Web Scraping, Data Science
In this tutorial, I will show you how to perform web scraping using an Anaconda Jupyter notebook and the BeautifulSoup library.
We'll scrape company reviews and ratings from the Indeed platform, load them into a pandas DataFrame, and then export them to a .csv file.
Let's get straight down to business. However, if you're looking for a guide to understanding web scraping in general, I recommend reading this article from Dataquest.
Let's start by importing our three libraries:
from bs4 import BeautifulSoup
import pandas as pd
import requests
Then let's go to the Indeed website and examine which information we want. We'll be targeting the Ernst & Young firm page, which you can check at the following link:
https://www.indeed.com/cmp/Ey/reviews?fcountry=IT
Based on my location, the country is indicated as Italy, but you can choose and control that if you want.
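For example, here's a minimal sketch of how you could parametrize the country filter; the country code below is an assumption, so verify it against the fcountry value your own browser shows:

country = 'US'  # hypothetical country code; check the fcountry parameter in your browser's address bar
url = f'https://www.indeed.com/cmp/Ey/reviews?fcountry={country}'
print(url)  # https://www.indeed.com/cmp/Ey/reviews?fcountry=US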
In the next picture, we can see the various pieces of information we can target and scrape:
1- Review Title
2- Review Body
3- Rating
4- The role of the reviewer
5- The location of the reviewer
6- The review date
However, you'll notice that points 4, 5, and 6 are all on one line and will be scraped together. This can cause a bit of confusion for some people, but my advice is to scrape first and solve problems later (see the sketch right after this paragraph). So, let's try to do this.
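As a preview of that later cleanup step, here's a minimal sketch of how such a combined line splits apart; the sample string is invented for illustration, not taken from the live page:

author = 'Consultant - Milan, Italy - January 5, 2020'  # hypothetical example of the combined field
job, location, time = [part.strip() for part in author.split('-')]
print(job)       # Consultant
print(location)  # Milan, Italy
print(time)      # January 5, 2020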
After deciding what we want to scrape, we need to figure out how much to scrape: do we want only one review? One page of reviews, or all pages of reviews? I guess the answer should be all pages!
If you scroll down the page and go over to page 2, you'll find that the link for that page becomes the following:
https://www.indeed.com/cmp/Ey/reviews?fcountry=IT&start=20
Then try to go to page 3, and you'll find the link becomes the following:
https://www.indeed.com/cmp/Ey/reviews?fcountry=IT&start=40
Looks like we have a pattern here: page 2 = 20, page 3 = 40, then page 4 = 60, right? All the way until page 8 = 140.
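In other words, start = (page - 1) * 20. A quick sketch to confirm the pattern before we build the loop:

# start = (page - 1) * 20, so pages 1 through 8 map to start values 0, 20, ..., 140
for page in range(1, 9):
    print(page, f'https://www.indeed.com/cmp/Ey/reviews?fcountry=IT&start={(page - 1) * 20}')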
Let's get back to coding and start by defining the dataframe we want.
df = pd.DataFrame({'review_title': [], 'review': [], 'author': [], 'rating': []})
In the next code block, I'll write a for loop that starts from 0, jumps by 20, and stops at 140.
1- Inside that for loop, we make a GET request to the web server, which downloads the HTML contents of a given web page for us.
2- Then we use the BeautifulSoup library to parse this page and extract the text from it. We first have to create an instance of the BeautifulSoup class to parse our document.
3- Then, by inspecting the HTML, we choose the classes from the web page; classes are used when scraping to specify the exact elements we want to scrape.
4- And then we conclude by adding the results to the DataFrame we created before.
"I added a picture below of how the code should look, in case you copy it and some spaces get added incorrectly."
for i in range(0, 160, 20):  # start values 0, 20, ..., 140 cover pages 1 through 8
    url = f'https://www.indeed.com/cmp/Ey/reviews?fcountry=IT&start={i}'
    # send a User-Agent header so the request looks like it comes from a browser
    header = {'User-Agent': 'Mozilla/5.0 Gecko/20100101 Firefox/33.0 GoogleChrome/10.0'}
    page = requests.get(url, headers=header)
    soup = BeautifulSoup(page.content, 'lxml')
    # narrow the search to the reviews container, then grab every review card inside it
    results = soup.find('div', {'id': 'cmp-container'})
    elems = results.find_all(class_='cmp-Review-container')
    for elem in elems:
        title = elem.find(attrs={'class': 'cmp-Review-title'})
        review = elem.find('div', {'class': 'cmp-Review-text'})
        author = elem.find(attrs={'class': 'cmp-Review-author'})
        rating = elem.find(attrs={'class': 'cmp-ReviewRating-text'})
        # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
        df = pd.concat([df, pd.DataFrame([{'review_title': title.text,
                                           'review': review.text,
                                           'author': author.text,
                                           'rating': rating.text}])],
                       ignore_index=True)
Done! Let's check our dataframe:
df.head()
Now that the scraping is done, let's try to solve the problem we noted earlier.
Notice that the author column holds three different pieces of information separated by a dash (-).
So, let's split them:
author = df['author'].str.split('-', expand=True)
Now, let's rename the columns and delete the last one.
author = author.rename(columns={0: 'job', 1: 'location', 2: 'time'})
del author[3]  # some rows split into a fourth piece; drop that extra column
Then let's join those new columns to our original dataframe and delete the old author column.
df1 = pd.concat([df, author], axis=1)
del df1['author']
Let's examine our new dataframe:
df1.head()
Let's reorganize the columns and remove any duplicates:
df1 = df1[['job', 'review_title', 'review', 'rating', 'location', 'time']]
df1 = df1.drop_duplicates()
Then, finally, let's save the dataframe to a CSV file:
df1.to_csv('EY_indeed.csv')
You should now have a good understanding of how to scrape and extract data from Indeed. If you're already a bit familiar with web scraping, a good next step is to pick a site and try some web scraping on your own.
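If you'd like a starting point for that, here's a minimal, generic sketch to adapt; the URL and CSS class below are placeholders rather than a real target, so replace them after inspecting your chosen page:

import requests
from bs4 import BeautifulSoup

url = 'https://example.com/some-page'  # placeholder; replace with your target page
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page.content, 'lxml')
# the class name is a placeholder; find the real one by inspecting the page's HTML
for item in soup.find_all(class_='some-class'):
    print(item.get_text(strip=True))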
Happy Coding :)
Translated from: https://towardsdatascience.com/scrape-company-reviews-ratings-from-indeed-in-2-minutes-59205222d3ae