Predicting Great Books from Goodreads Data Using Python

Photo of old books by Ed Robertson on Unsplash

What
This is a data set of the first 50,000 book ids pulled from Goodreads’ API on July 30th, 2020. A few thousand ids did not make it through because the book id was changed, the URL or API broke, or the information was stored in an atypical format.
Why
From the reader’s perspective, books are a multi-hour commitment of learning and leisure (they don’t call it Goodreads for nothing). From the author’s and publisher’s perspectives, books are a way of living (with some learning and leisure too). In both cases, knowing which factors explain and predict great books will save you time and money. Because while different people have different tastes and values, knowing how a book is rated in general is a sensible starting point. You can always update it later.
Environment
It’s good practice to work in a virtual environment, a sandbox with its own libraries and versions, so we’ll make one for this project. There are several ways to do this, but we’ll use Anaconda. To create and activate an Anaconda virtual environment called ‘gr’ (for Goodreads) using Python 3.7, run the following commands in your terminal or command line:
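The commands themselves appear to have been lost when the page was extracted. Assuming a standard Anaconda install, they would look something like this:

```shell
# Create a Python 3.7 environment named 'gr' (for Goodreads) and activate it
conda create -n gr python=3.7
conda activate gr
```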
Installations
You should see ‘gr’ or whatever you named your environment at the left of your prompt. If so, run these commands. Anaconda will automatically install any dependencies of these packages, including matplotlib, numpy, pandas, and scikit-learn.
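The package list itself is missing from the extracted page. Based on the libraries used later in the article (XGBoost for modeling, eli5 for permutation importances), a plausible reconstruction is:

```shell
# xgboost and eli5 are used for modeling and permutation importances later;
# Anaconda resolves their dependencies (matplotlib, numpy, pandas, scikit-learn)
conda install -c conda-forge xgboost eli5
```

The Goodreads API wrapper mentioned below would be installed separately from its repository.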
Imports
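The import block is also missing. A minimal reconstruction covering the libraries the article uses (xgboost and eli5 are commented out so the snippet runs even where those extras aren't installed):

```python
# Core libraries for this project
import json
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Used later for boosting and permutation importances:
# from xgboost import XGBClassifier
# import eli5
```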
Data Collection
We pull the first 50,000 book ids and their associated information using a lightweight wrapper around the Goodreads API made by Michelle D. Zhang (code and documentation here), then write each as a dictionary to a JSON file called book_data.
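The collection code did not survive extraction. Below is a sketch of the loop's shape; the `client` object and its `get_book` call are illustrative stand-ins, not the wrapper's actual interface.

```python
import json

def save_books(books, path):
    """Write one JSON dict per line, the format assumed by the cleaning step."""
    with open(path, "w") as f:
        for book in books:
            f.write(json.dumps(book) + "\n")

# records = []
# for book_id in range(1, 50001):
#     try:
#         book = client.get_book(book_id)   # hypothetical wrapper call
#     except Exception:
#         continue                          # changed id, broken URL, or odd format
#     records.append(book)
# save_books(records, "book_data")
```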
Data Cleaning
We’ll define and describe some key functions below, but we’ll run them in one big wrangle function later.
Wilson Lower Bound
A rating of 4 stars based on 20 reviews and a rating of 4 stars based on 20,000 reviews are not equal. The rating based on more reviews has less uncertainty about it and is therefore a more reliable estimate of the “true” rating. In order to properly define and predict great books, we must transform average_rating by putting a penalty on uncertainty.
We’ll do this by calculating a Wilson Lower Bound, where we estimate the confidence interval of a particular rating and take its lower bound as the new rating. Ratings based on tens of thousands of reviews will barely be affected because their confidence intervals are narrow. Ratings based on fewer reviews, however, have wider confidence intervals and will be scaled down more.
Note: We modify the formula because our data is calculated from a 5-point system, not a binary system as described by Wilson. Specifically, we decrement average_rating by 1 for a conservative estimate of the true non-inflated rating, and then normalize it. If this penalty is too harsh or too light, more ratings will over time raise or lower the book’s rating, respectively. In other words, with more information, this adjustment is self-correcting.
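The adjustment described above can be sketched as follows, using z = 1.96 for a 95% confidence interval; the exact constants in the article's version may differ.

```python
import math

def wilson_lower_bound(avg_rating, num_ratings, z=1.96):
    """Wilson score lower bound adapted to Goodreads' 5-star scale.

    The rating is shifted down by 1 and normalized to [0, 1] (a conservative
    stand-in for a binary 'success' rate), the standard Wilson lower bound is
    computed, then the result is mapped back to the 1-5 scale.
    """
    if num_ratings == 0:
        return 0.0
    p_hat = (avg_rating - 1) / 4          # normalize 1-5 stars to [0, 1]
    n = num_ratings
    denominator = 1 + z**2 / n
    center = p_hat + z**2 / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    lower = (center - margin) / denominator
    return 1 + 4 * lower                  # back to the 1-5 scale

# A 4-star book with 20,000 ratings keeps nearly all of its rating,
# while the same average over 20 ratings is pulled down substantially:
# wilson_lower_bound(4.0, 20000)  ≈ 3.98
# wilson_lower_bound(4.0, 20)     ≈ 3.13
```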
Genres
Goodreads’ API returns ‘shelves’, which encompass actual genres like “science-fiction” and user-created categories like “to-read”. We extracted only the 5 most popular shelves when pulling the data to limit this kind of clean-up; here, we’ll finish the job.
After some inspection, we see that these substrings represent the bulk of non-genre shelves. We’ll filter them out using a regular expression. Note: We use two strings in the regex so the line doesn’t get cut off. Adjacent strings inside parentheses are joined at compile time.
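A sketch of the filter; the exact substring list in the article isn't shown, so this one is illustrative. Note the two adjacent strings inside the parentheses, joined at compile time as described above:

```python
import re

# Substrings that mark user-created, non-genre shelves (illustrative list)
NON_GENRE = re.compile(
    "(to-read|currently-reading|owned|favou?rit|books-i-own|wish"
    "|library|audio|kindle|ebook|default|my-books)"
)

def keep_genres(shelves):
    """Drop shelves whose names contain any non-genre substring."""
    return [s for s in shelves if not NON_GENRE.search(s)]
```

For example, `keep_genres(["science-fiction", "to-read", "fantasy", "owned-books"])` keeps only `["science-fiction", "fantasy"]`.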
All-in-one Cleaning
Now we’ll build and run one function to wrangle the data set. This way, the cleaning is more reproducible and easier to debug.
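A minimal sketch of what such a wrangle function might look like, with illustrative column names; the article's real version also applies the Wilson adjustment and genre filtering described above.

```python
import numpy as np
import pandas as pd

def wrangle(df):
    """Illustrative all-in-one cleaning pass over the raw book data."""
    df = df.copy()
    # Duplicate ids can appear when the API returns the same book twice.
    df = df.drop_duplicates(subset="book_id")
    # Books with no ratings carry no signal for the target.
    df = df[df["ratings_count"] > 0]
    # Impute missing page counts with the median rather than dropping rows.
    df["num_pages"] = df["num_pages"].fillna(df["num_pages"].median())
    return df
```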
Compare Unadjusted and Adjusted Average Ratings
Numerically, the central tendency measures, the mean (in blue) and the median (in green), decrease slightly, and the variance decreases significantly.
Visually, we can see the rating adjustment in the much smoother and wider distribution (although note that the x-axis is truncated). This is from eliminating outlier books with no or very few ratings, and scaling down ratings with high uncertainty.
Unadjusted mean: 3.82
Unadjusted median: 3.93
Unadjusted variance: 0.48

Adjusted mean: 3.71
Adjusted median: 3.77
Adjusted variance: 0.17
Data Leakage
Because our target is derived from ratings, training our model using ratings is effectively training with the target. To avoid distorting the model, we must drop these columns.
It is also possible that review_count is a bit of leakage, but it seems more like a proxy for popularity, not greatness, in the same way that pop(ular) songs are not often considered classics. Of course, we'll reconsider this if its permutation importance is suspiciously high.
Split Data
We’ll do an 85/15 train-test split, then re-split our train set to make the validation set about the same size as the test set.
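The split can be sketched as follows. The DataFrame here is a placeholder; the shapes printed in the article come from the full cleaned dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder features and target standing in for the cleaned book data
X = pd.DataFrame(np.random.rand(1000, 12))
y = pd.Series(np.random.randint(0, 2, 1000))

# 85/15 train-test split first...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

# ...then carve a validation set of the same size as the test set
# out of the remaining training data (an int test_size is a row count).
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=len(X_test), random_state=42)
```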
(20281, 12) (20281,) (4348, 12) (4348,) (4347, 12) (4347,)

Evaluation Metrics
With classes this imbalanced, accuracy (correct predictions / total predictions) can be misleading. There just aren’t enough true positives for that fraction to be the best measure of model performance. So we’ll also use ROC AUC, the area under the Receiver Operating Characteristic curve. Here is a colored drawing of one, courtesy of Martin Thoma.
Drawing of ROC AUC in the style of XKCD

The ROC curve is a plot of a classification model’s true positive rate (TPR) against its false positive rate (FPR). The ROC AUC is the area under this curve over the [0, 1] interval. Since optimal model performance maximizes true positives and minimizes false positives, the optimal point in this 1x1 plot is the top left, where the area under the curve (ROC AUC) = 1.
For imbalanced classes such as great, ROC AUC outperforms accuracy as a metric because it better reflects the relationship between true positives and false positives. It also depicts the classifier’s performance across all its values, giving us more information about when and where the model improves, plateaus, or suffers.
Fit Models
Predicting great books is a binary classification problem, so we need a classifier. Below, we’ll encode, impute, and fit to the data a linear model (Logistic Regression) and two tree-based models (Random Forests and XGBoost), then compare them to each other and to the majority baseline. We’ll calculate their accuracy and ROC AUC, and then make a visualization.
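The encode-and-impute step can be sketched with a ColumnTransformer. The column names here are stand-ins for the real ones; the article may well use different encoders.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Illustrative preprocessing: ordinal-encode the categorical columns and
# median-impute the numeric ones before handing the matrix to a classifier.
preprocess = ColumnTransformer([
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     ["author", "publisher"]),
    ("num", SimpleImputer(strategy="median"), ["num_pages", "review_count"]),
])
```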
Majority Class Baseline
First, by construction, great books are the top 20% of books by Wilson-adjusted rating. That means our majority class baseline (no books are great) has an accuracy of 80%.
Second, this “model” doesn’t improve, plateau, or suffer since it has no discernment to begin with. A randomly chosen positive would be treated no differently than a randomly chosen negative. In other words, its ROC AUC = 0.5.
Baseline Validation Accuracy: 0.8
Baseline Validation ROC AUC: 0.5
Logistic Regression
Now we’ll fit a linear model with cross-validation, re-calculate evaluation metrics, and plot a confusion matrix.
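A sketch of this step with scikit-learn's LogisticRegressionCV, on synthetic stand-in features rather than the real preprocessed data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Synthetic stand-in for the preprocessed train/validation features:
# the first column carries the signal, the rest are noise.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 5))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
X_val = rng.normal(size=(200, 5))
y_val = (X_val[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Cross-validated logistic regression, then the validation metrics
model = LogisticRegressionCV(cv=5, max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_val)

acc = accuracy_score(y_val, y_pred)
auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
cm = confusion_matrix(y_val, y_pred)   # rows: true class, cols: predicted
```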
Baseline Validation Accuracy: 0.8
Logistic Regression Validation Accuracy: 0.8013
Baseline Validation ROC AUC: 0.5
Logistic Regression Validation ROC AUC: 0.6424
Logistic Regression Confusion Matrix
Random Forest Classifier
Now we’ll do the same as above with a tree-based model with bagging (Bootstrap AGGregation).
Baseline Validation Accuracy: 0.8
Logistic Regression Validation Accuracy: 0.8013
Random Forest Validation Accuracy: 0.8222
Majority Class Baseline ROC AUC: 0.5
Logistic Regression Validation ROC AUC: 0.6424
Random Forest Validation ROC AUC: 0.8015
Random Forest Confusion Matrix
XGBoost Classifier
Now we’ll do the same as above with another tree-based model, this time with boosting.
Baseline Validation Accuracy: 0.8
Logistic Regression Validation Accuracy: 0.8013
Random Forest Validation Accuracy: 0.8245
XGBoost Validation Accuracy: 0.8427
Majority Class Baseline ROC AUC: 0.5
Logistic Regression Validation ROC AUC: 0.6424
Random Forest Validation ROC AUC: 0.8011
XGBoost Validation ROC AUC: 0.84
XGBClassifier performs the best in both accuracy and ROC AUC.
Graph and Compare Models’ ROC AUC
Below, we see that logistic regression lags far behind XGBoost and Random Forests in achieving a high ROC AUC. Among the top two, XGBoost initially outperforms RandomForest, and then the two roughly converge around FPR=0.6. We see in the lower right legend, however, that XGBoost has the highest AUC of 0.84, followed by Random Forest at 0.80 and Logistic Regression at 0.64.
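A plot like the one described can be sketched as follows. The labels and scores here are synthetic stand-ins for the fitted models' validation outputs.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")               # render off-screen
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical validation labels and model scores (placeholders)
rng = np.random.default_rng(1)
y_val = rng.integers(0, 2, 300)
scores = {"strong model": y_val * 0.6 + rng.random(300) * 0.7,
          "weak model": y_val * 0.1 + rng.random(300)}

fig, ax = plt.subplots()
for name, s in scores.items():
    fpr, tpr, _ = roc_curve(y_val, s)
    ax.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_val, s):.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="baseline (AUC = 0.50)")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.legend(loc="lower right")
fig.savefig(os.path.join(tempfile.gettempdir(), "roc_auc_comparison.png"))
```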
In less technical language, the XGBoost model was the best at classifying great books as great (true positives) and not classifying not-great books as great (false positives).
Permutation Importances
One intuitive way of identifying whether, and to what extent, something is important is to see what happens when you take it away. This is the ideal approach in a situation unconstrained by time and money.
But in the real world with real constraints, we can use permutation instead. Instead of eliminating a column’s values by dropping them, we eliminate the column’s signal by randomizing it. If the column really were a predictive feature, the order of its values would matter, and shuffling them would substantially dilute, if not destroy, the relationship. So if the feature’s predictive power isn’t really hurt, or is even helped, by randomization, we can conclude that it is not actually important.
Let’s take a closer look at the permutation importances of our XGBoost model. We’ll have to refit it to be compatible with eli5.
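The article uses eli5 for this step; an equivalent sketch with scikit-learn's own `permutation_importance`, on synthetic data where only the first column is predictive, looks like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))        # column 0 is predictive, the rest noise
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X[:300], y[:300])

# Shuffle each column in turn and measure the drop in validation ROC AUC.
result = permutation_importance(model, X[300:], y[300:],
                                scoring="roc_auc", n_repeats=10,
                                random_state=0)
# Shuffling the predictive column hurts ROC AUC far more than the noise columns.
```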
Permutation Importance Analysis
As we assumed at the beginning, review_count matters, but its importance is not suspiciously high. This does not seem to rise to the level of data leakage. What this means is that if you were wondering which book to read next, a useful indicator is how many reviews it has, a proxy for how many others have read it.
We see that genres is the second most important feature for ROC AUC in the XGBoost model.
author is third, which is surprising and perhaps a bit concerning. Because our test set is not big, the model may just be identifying authors whose books are the most highly rated in Wilson-adjusted terms, such as J.K. Rowling and Suzanne Collins. More data would be useful to test this theory.
Fourth is num_pages. I thought this would be higher for two reasons:
Takeaway
We’ve seen how to collect, clean, analyze, visualize, and model data. Some actionable takeaways are that when and who publishes a book doesn’t really matter, but its review count does — the more reviews, the better.
For further analysis, we could break down genres and authors to find out which ones were rated highest. For now, happy reading.
Translated from: https://medium.com/@ryan.koul/predicting-great-books-from-goodreads-data-using-python-1d378e7ef926