Using Spotify data to predict which “Novidades da semana” songs would become hits

Published: 2023/11/29

TL;DR

Spotify is my favorite digital music service, and I’m passionate about extracting meaningful insights from data. I therefore decided to write this article to consolidate my knowledge of some classification models and to help other beginners studying Data Science.

I built a dataset of 2755 hit and non-hit songs and extracted their audio features using the Spotipy library. I tested three classification models (Random Forest, Logistic Regression, and SVM) and chose the one with the best accuracy to predict which new songs would become hits.

1. Introduction

The Spotify API provides access to the music data available on Spotify. To use it, you have to register on the Spotify website for developers, select “Create an App”, fill in your information, and get your CLIENT_ID and CLIENT_SECRET. The API documentation and the data are easy to understand, well maintained, and include essential metadata.
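
The credential step can be sketched as follows. This is a minimal sketch, not the author's code: it assumes the Spotipy library is installed and that the CLIENT_ID and CLIENT_SECRET are stored in the environment variables SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET. The helper names load_credentials and build_client are hypothetical.

```python
import os


def load_credentials(env=None):
    """Read the Spotify app credentials from environment variables."""
    env = os.environ if env is None else env
    client_id = env.get("SPOTIPY_CLIENT_ID")
    client_secret = env.get("SPOTIPY_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise ValueError("Set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET first")
    return client_id, client_secret


def build_client(client_id, client_secret):
    """Create a Spotipy client using the client-credentials flow."""
    # Imported here so the credential helper also works without Spotipy installed.
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    auth_manager = SpotifyClientCredentials(
        client_id=client_id, client_secret=client_secret
    )
    return spotipy.Spotify(auth_manager=auth_manager)
```

With valid credentials, a client could be created with sp = build_client(*load_credentials()), and the audio features of a list of track IDs could then be requested with sp.audio_features(track_ids).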

We will try to find out which five artists have the most songs considered hits and what kind of music is most successful (positive or negative), and we will try to predict which songs in “Novidades da semana” can become hits.

2. Dataset and Features

Using the Spotipy library, I created two datasets:

2.1 Dataset

The dataset contains songs that are considered hits worldwide, i.e., the unique songs collected from the “Top 50 by country” playlists of all countries. These songs are labeled as hits (success = 1).

The dataset also contains the unique songs of random playlists from each genre (Sertanejo, Funk, Samba & Pagode, Rock, Jazz, Reggae, among others). These songs are labeled as non-hits (success = 0).

In total, the dataset has 2755 songs, each labeled as a hit or a non-hit.

2.2 Test set

The test set is composed of the songs in the “Novidades da semana” playlist of the best new releases. It will be used to predict the probability that new songs become hits.

More details about how I created the datasets can be found in my GitHub repository.

2.3 Features

Each track contains features categorized by track, artist and album information, and also audio analysis features. See more about the features HERE. The most relevant features for this article are explained in greater detail in later sections.

Let’s get started!

3. Import the libraries

We will use pandas for data manipulation, NumPy for numerical computing, matplotlib and seaborn for data visualization, and sklearn for machine learning models, evaluation, and dataset splitting.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.metrics import accuracy_score

The “Top 50 by country” playlists are updated daily. The “Novidades da semana” playlist is updated every week and can also differ based on your profile. For that reason, the csv file names contain the date on which they were generated.

dataset = pd.read_csv('spotifyAnalysis-08022020.csv')
test = pd.read_csv('predictSpotifyAnalysis-08022020.csv')

5. Data overview

Let’s visualize the dataset and its features.

dataset.head()

Using pandas.DataFrame.describe, we can see the following statistics and analyze the central tendency, dispersion and shape of a dataset’s distribution.

dataset.describe()

We can observe that the tempo, key, duration_ms, loudness and popularity features are not on the same scale as the other features, so we will rescale them in the next section.

6. Data Cleaning

There are no missing data and there is no need to treat categorical variables.

6.1 Data Rescaling

We will use MinMaxScaler, which rescales each column independently so that its values fall between 0 and 1 (the default range; a different one, such as -1 to 1, can be set with the feature_range parameter). Because the transformation is linear, the shape of the original distribution is preserved.

MinMaxScaler subtracts the column's minimum from each value and then divides the result by the difference between the column's maximum and minimum.
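
The min-max formula can be checked by hand on a single column. This is a minimal sketch in plain Python, not the sklearn implementation: the function name min_max_scale and the loudness values are made up for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lowest) / span for v in values]


# Hypothetical loudness values in dB (negative, as on Spotify).
loudness = [-60.0, -30.0, -15.0, 0.0]
print(min_max_scale(loudness))  # [0.0, 0.5, 0.75, 1.0]
```

Note that negative inputs still end up between 0 and 1, because the minimum is subtracted before dividing.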

# Rescaling tempo, key, duration_ms, loudness and popularity features.
scaler = MinMaxScaler()

scaled_values = scaler.fit_transform(dataset[['tempo', 'key', 'duration_ms', 'loudness', 'popularity']])
dataset[['tempo', 'key', 'duration_ms', 'loudness', 'popularity']] = scaled_values

scaled_values = scaler.fit_transform(test[['tempo', 'key', 'duration_ms', 'loudness', 'popularity']])
test[['tempo', 'key', 'duration_ms', 'loudness', 'popularity']] = scaled_values

7. Exploratory Data Analysis

7.1 Correlation

Correlation is a statistical technique to measure how variables are related.

Positive correlation: Indicates that the two variables move together.
Negative correlation: Indicates that the two variables move in opposite directions.
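
The sign of a correlation can be illustrated with the Pearson coefficient, which is also the default method of pandas' DataFrame.corr. This is a minimal sketch in plain Python: the function name pearson and the sample values are made up and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values, chosen only to illustrate the sign of the coefficient.
energy = [0.2, 0.4, 0.6, 0.8]
loudness = [-20.0, -14.0, -9.0, -4.0]   # rises as energy rises
acousticness = [0.9, 0.7, 0.4, 0.1]     # falls as energy rises

print(pearson(energy, loudness) > 0)      # True: positive correlation
print(pearson(energy, acousticness) < 0)  # True: negative correlation
```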

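Source: Correlation Co-efficient [1]

plt.figure(figsize=(12,12))
# numeric_only=True skips text columns such as artist and track_name (needed on pandas 2.x)
corr = dataset.corr(numeric_only=True)
# np.bool is unavailable in some NumPy releases (removed in 1.24); the built-in bool is safe everywhere
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask, 1)] = True
sns.heatmap(corr, mask=mask, annot=True, cmap="Greens")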

The pairs of variables with the strongest correlations are loudness x energy (strong and positive) and acousticness x energy (strong and negative).

Loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.

Energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

Let’s visualize the correlation between the variables.

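# Note: plot_dist_reg is a helper that is not defined in this article;
# it is presumably part of the author's complete code.
axis = ['ax0','ax1']
features = [['energy','loudness'],['energy','acousticness']]
colors = ['#48d66c', '#bd36d8']
titles = ['Energy x Loudness', 'Energy x Acousticness']
plot_dist_reg(1, 2, axis, features, colors, titles)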

It can be concluded that tracks with higher energy tend to be louder (higher loudness in decibels), and tracks with lower energy tend to be acoustic.

7.2 Class visualization

Let’s visualize the class distribution.

plt.figure(1 , figsize = (15 , 5))
ax = sns.countplot(y = 'success', data = dataset, palette="Greens")
ax.set_title('Number of success (1) and non success (0) songs')
show_values_on_bars(ax, "h", 10)
plt.show()

There are more non-hit songs than hit songs in the dataset.

7.3 Hit songs

The next step is to analyze the songs considered as hits.

# Get only hit songs
hits_df = dataset[dataset['success'] == 1]

Which five artists have the most songs considered hits?

top_artists = hits_df['artist'].value_counts()[:5]
name = top_artists.index.tolist()
amount = top_artists.values.tolist()

plt.figure(1 , figsize = (15, 5))
ax = sns.barplot(x = name, y = amount, palette="Purples_d")
ax.set_title('Artists with more hit songs')
show_values_on_bars(ax, "v", 10)
plt.show()

What kind of music is the most successful: positive or negative?

Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

As an example of what counts as a positive or a negative song, the song with the lowest valence (0.0349) in the dataset is Maia (Kamilo Sanclemente), and the song with the highest valence (0.9770) is Corona (Minutemen).

valence = hits_df['valence'].value_counts()
valence_value = valence.index.tolist()
amount = valence.values.tolist()

# Count tracks with valence >= 0.5 as positive and the rest as negative
i, high, low = 0, 0, 0
for v in valence_value:
    if float(v) >= 0.5:
        high += amount[i]
    else:
        low += amount[i]
    i += 1

print('Positive tracks: ', high)
print('Negative tracks: ', low)

output >>>
Positive tracks: 704
Negative tracks: 547

So, most hit songs are positive (happy, cheerful, euphoric).
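
The same count can also be written without an explicit loop. This is a minimal sketch, assuming pandas is available: the small hits_df below is a hypothetical stand-in for the real data, so its counts differ from the article's 704 and 547.

```python
import pandas as pd

# Hypothetical stand-in for hits_df with a handful of valence values.
hits_df = pd.DataFrame({"valence": [0.10, 0.50, 0.72, 0.35, 0.91]})

is_positive = hits_df["valence"] >= 0.5   # boolean Series, one entry per track
positive = int(is_positive.sum())         # each True counts as 1
negative = int((~is_positive).sum())

print("Positive tracks:", positive)  # Positive tracks: 3
print("Negative tracks:", negative)  # Negative tracks: 2
```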

8. Machine Learning Modeling and Evaluation

The dataset was split into training (70%) and test (30%).

# Split features and class data and drop irrelevant columns
X = dataset.drop(['success', 'artist', 'track_name'], axis=1).values
y = dataset[['success']].values

# Split train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

To predict whether a song will be a hit or not, we will use three different models (Random Forest, Logistic Regression and SVM) and select the best one based on the accuracy result.

Accuracy is the fraction of predictions that match the true class, i.e., the number of songs classified correctly divided by the total number of songs evaluated.
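
For a classifier, accuracy is usually computed as the share of predictions that match the true labels, which is what sklearn's accuracy_score returns by default. This is a minimal sketch in plain Python: the function name accuracy and the labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = hit, 0 = non-hit.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 0.75 (6 of 8 predictions are correct)
```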

8.1 Random Forest

The random forest model combines one hundred decision trees, each of which is trained on a different subset of the song features and a different subset of the training data. Each tree predicts whether a song is a hit or a non-hit, the predictions are counted as votes, and the class with the most votes becomes the final prediction [3].
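
The voting step can be sketched on its own. This is a simplified illustration with made-up votes and a hypothetical function name, majority_vote. According to its documentation, scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes, so this shows the idea rather than the library's exact behavior.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical votes from five trees for one song (1 = hit, 0 = non-hit).
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))  # 1
```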

# Create the classifier object
rf_model = RandomForestClassifier(n_estimators = 100)

# Train
rf_model.fit(X_train, y_train.ravel())

# Predict
y_pred = rf_model.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, y_pred))

output >>> Accuracy: 0.7315598548972189

8.2 Logistic Regression

The logistic regression model separates the data linearly into two categories: it predicts the probability of a binary event using a logit function, assigning a weight to each song feature, and then uses these weights to predict whether a song belongs to the “hit” or the “non-hit” category [4].
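
The prediction step of a logistic regression can be written out in a few lines. This is a minimal sketch, assuming the weights have already been learned: the weights, bias, feature values, and function names below are made up for illustration and do not come from the trained model.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_hit_probability(features, weights, bias):
    """Weighted sum of the features passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights for two scaled features (energy, valence).
weights, bias = [1.5, 0.5], -1.0
song = [0.8, 0.6]

p = predict_hit_probability(song, weights, bias)
label = 1 if p >= 0.5 else 0   # 0.5 is the usual decision threshold
print(round(p, 3), label)      # 0.622 1
```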

# Create the classifier object
lg_model = LogisticRegression()

# Train
lg_model.fit(X_train, y_train.ravel())

# Predict
y_pred = lg_model.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, y_pred))

output >>> Accuracy: 0.6952841596130592

8.3 SVM

The SVM model selects the best “hyperplane” separating the data into two categories, i.e., the one with the largest possible margin to the nearest data points of each class (the support vectors) [5].
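
For a linear SVM, classifying a song comes down to checking on which side of the hyperplane it falls. This is a minimal sketch, assuming the hyperplane w.x + b = 0 has already been found: the values of w, b, the song features, and the function names are made up for illustration.

```python
import math


def decision_value(x, w, b):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def margin_width(w):
    """Distance between the two margin boundaries: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane for two scaled features.
w, b = [3.0, 4.0], -2.5
song = [0.5, 0.5]

score = decision_value(song, w, b)
print(score, 1 if score >= 0 else 0)  # 1.0 1
print(margin_width(w))                # 0.4
```

Training looks for the w and b that make this margin as wide as possible while still separating the two classes as well as it can.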

# Create the classifier object
svm_model = svm.SVC(kernel='linear')

# Train
svm_model.fit(X_train, y_train.ravel())

# Predict
y_pred = svm_model.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, y_pred))

output >>> Accuracy: 0.6977025392986699

8.4 Evaluation

The accuracies of the three models, rounded to three decimal places, are:

Random Forest: 0.732
Logistic Regression: 0.695
SVM: 0.698
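
Picking the best model from these results can also be done in code. This is a small sketch that uses the accuracy values printed in sections 8.1 to 8.3; the variable names are illustrative.

```python
# Accuracy values copied from the outputs printed in sections 8.1 to 8.3.
scores = {
    "Random Forest": 0.7315598548972189,
    "Logistic Regression": 0.6952841596130592,
    "SVM": 0.6977025392986699,
}

# max() compares the dictionary keys by their accuracy value.
best_model = max(scores, key=scores.get)
print(best_model, round(scores[best_model], 3))  # Random Forest 0.732
```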

9. Result

As a result, the Random Forest model will be applied to predict the songs from “Novidades da semana” on Spotify.

# Drop irrelevant columns
df_test = test.drop(['artist', 'track_name'], axis=1).values

# Predict
test_predict = rf_model.predict(df_test)

# Count the songs predicted as hits
hits_predict = (test_predict == 1).sum()
print(hits_predict, "out of", len(test_predict), "was predicted as HIT")

output >>> 13 out of 60 was predicted as HIT

Which songs in “Novidades da semana” can become a hit? Let’s see the result.

df = pd.DataFrame({'Song': test['track_name'], 'Artist': test['artist'], 'Predict': test_predict})
df.sort_values(by=['Predict'], inplace=True, ascending=False)
df

10. Conclusion

From the analysis of Spotify data collected on August 2nd, 2020, it can be concluded that:

- The five artists with the most songs considered hits are Taylor Swift, KESI, Boza, Apache 207 and Bad Bunny.

- Most hit songs are positive (happy, cheerful, euphoric).

- The model with the best accuracy for predicting which new songs will be hits is Random Forest.

- The songs of “Novidades da semana” predicted to be hits, based on the characteristics of the “Top 50 by country” hits (all countries), are Clap From Road To Fast 9 Mixtape (Don Toliver), Cuidado Que Eu Te Supero (Yasmin Santos), my future (Billie Eilish), My Oasis feat. Burna Boy (Sam Smith), I Should Probably Go To Bed (Dan + Shay), Who’s Laughing Now (Ava Max), WHAT YOU GONNA DO??? (Bastille), TOMA (Luísa Sonza), Lei áurea (Borges), The Usual (Sam Fischer), By Any Means (Jorja Smith), Move Ya Hips feat. Nicki Minaj & MadeinTYO (A$AP Ferg) and Hawái (Maluma).

See the complete code HERE.

11. References

[1] Correlational Analysis: Positive, Negative and Zero Correlations. https://psychologyhub.co.uk/correlational-analysis-positive-negative-and-zero-correlations/
[2] Song hit prediction: predicting billboard hits using Spotify data. arXiv:1908.08609 [cs.IR]. arxiv.org/abs/1908.08609
[3] NAVLANI, Avinash. Understanding Random Forests Classifiers in Python. https://www.datacamp.com/community/tutorials/random-forests-classifier-python
[4] NAVLANI, Avinash. Understanding Logistic Regression in Python. https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python
[5] NAVLANI, Avinash. Support Vector Machines with Scikit-learn. https://www.datacamp.com/community/tutorials/svm-classification-scikit-learn-python

Translated from: https://medium.com/@jcarolinedias1/using-spotify-data-to-predict-which-novidades-da-semana-songs-would-become-hits-e817ae0c091