Urban Sound Classification with Librosa: Nuanced Cross-Validation
Outline
The goal of this post is two-fold:
I’ll show an example of implementing the results of an interesting research paper on classifying audio clips based on their sonic content. This will include applications of the librosa library, which is a Python package for music and audio analysis. The clips are short audio recordings of city sounds, and the classification task is predicting the appropriate category label. I’ll also cover why this particular dataset requires careful cross-validation when evaluating a model.
Original research paper
http://www.justinsalamon.com/uploads/4/3/9/4/4394963/salamon_urbansound_acmmm14.pdf
Source dataset, by paper authors
https://urbansounddataset.weebly.com/urbansound8k.html
Summary of their dataset
“This dataset contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. The classes are drawn from the urban sound taxonomy.”
I’ll extract features from these sound excerpts and fit a classifier to predict one of the 10 classes. Let’s get started!
Note on my Code
I’ve created a repo that allows you to re-create my example in full:
Script runner: https://github.com/marcmuon/urban_sound_classification/blob/master/main.py
Feature extraction module: https://github.com/marcmuon/urban_sound_classification/blob/master/audio.py
Model module: https://github.com/marcmuon/urban_sound_classification/blob/master/model.py
The script runner handles loading the source audio from disk, parsing the metadata about the source audio, and passing this information to the feature extractor and the model.
Downloading the data
You can download the data, which extracts to 7.09GB, using this form from the research paper authors: https://urbansounddataset.weebly.com/download-urbansound8k.html
Directory structure [optional section — if you want to run this yourself]
Obviously you can fork the code and re-map it to whatever directory structure you want, but if you want to follow mine:
In your home directory: create a folder called datasets, and in there place the unzipped UrbanSound8K folder [from link in ‘Downloading the Data’]
Also in your home directory: create a projects folder and put the cloned repo there ending up with ~/projects/urban_sound_classification/…
Within the code, I use some methods to automatically write the extracted feature vectors for each audio file into ~/projects/urban_sound_classification/data
I do this because the feature extraction takes a long time and you won’t want to do it twice. There’s also code that checks to see if these feature vectors exist.
tl;dr — if you follow my directory structure, you can simply run the main.py script and everything should work!
Why this problem requires careful cross-validation
Note that the source data is split up into 10 sub-folders, labeled ‘Fold1’, ‘Fold2’, etc.
We have 8,732 audio clips, each up to four seconds long, of various city sounds. The research paper authors manually created these clips and labeled them into groups such as ‘car horn’, ‘jackhammer’, ‘children playing’, and so on. In addition to the 10 folds, there are 10 classes.
The fold numbers do not have anything to do with the class labels; rather, the folds refer to the uncut audio file(s) that these 4-second training examples were spliced from.
What we don’t want is for the model to be able to learn how to classify things based on aspects of the particular underlying recording.
We want a generalizable classifier that will work with a wide array of recording types, but that still classifies the sounds correctly.
Guidance from the paper authors on proper CV
That’s why the authors have pre-built folds for us, and offered the following guidance, which is worth quoting:
Don’t reshuffle the data! Use the predefined 10 folds and perform 10-fold (not 5-fold) cross validation…
…If you reshuffle the data (e.g. combine the data from all folds and generate a random train/test split) you will be incorrectly placing related samples in both the train and test sets, leading to inflated scores that don’t represent your model’s performance on unseen data. Put simply, your results will be wrong.
Summary of the proper approach
- Train on folds 1–9, then test on fold 10 and record the score. Then train on folds 2–10, test on fold 1, and record the score.
- Repeat this until each fold has served as the holdout one time.
- The overall score will be the average of the 10 accuracy scores from the 10 different holdout sets.
Re-creating the paper results
Note that the research paper does not have any code examples. What I want to do is first see if I can re-create (more or less) the results from the paper with my own implementation.
Then, if that looks in line with their results, I’ll work on some model improvements to see if I can beat it.
Here’s a snapshot of their model accuracy across folds from the paper [their image, not mine]:
Image from Research Paper Authors — Justin Salamon, Christopher Jacoby, and Juan Pablo Bello

Thus we’d like to get up to the high 60%/low 70% accuracy across the folds as shown in 3a.
Audio Feature Extraction
Image by Author [Marc Kelechava]

Librosa is an excellent and easy-to-use Python library that implements music information retrieval techniques. I recently wrote another blog post on a model using the librosa library here. The goal of that exercise was to train an audio genre classifier on labeled audio files (label = music genre) from my personal library. Then I used that trained model to predict the genre for other untagged files in my music library.
I will use some of the music information retrieval techniques I learned from that exercise and apply them to audio feature extraction for the city sound classification problem. In particular I’ll use:
Mel-Frequency Cepstral Coefficients (MFCC)
Spectral Contrast
Chromagram
A quick detour on audio transformations [optional]
[My other blog post expands on some of this section in a bit more detail if any of this is of particular interest]
Note that it is technically possible to convert a raw audio source to a numerical vector and train that directly. However, a (downsampled) 7-minute audio file will yield a time series vector nearly ~9,000,000 floating point numbers in length!
Even for our 4-second clips, the raw time series representation is a vector of roughly 88,000 values (4 seconds at librosa’s default 22,050 Hz sampling rate). Given we only have 8,732 training examples, this is likely too high-dim to be workable.
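A quick sanity check on those sizes (a minimal sketch; the counts assume librosa’s default 22,050 Hz sampling rate on load):

# Raw waveform length is simply sample_rate * duration.
sr = 22050                # librosa.load resamples to 22,050 Hz by default
print(sr * 4)             # 88200 floats for a 4-second clip
print(sr * 7 * 60)        # 9261000 floats for a 7-minute file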
The various music information retrieval techniques reduce the dimensionality of the raw audio vector representation and make this more tractable for modeling.
The techniques that we’ll be using to extract features seek to capture different qualities about the audio over time. For instance, the MFCCs describe the spectral envelope [amplitude spectrum] of a sound. Using librosa we get this information over time — i.e., we get a matrix!
The MFCC matrix for a particular audio file will have coefficients on the y-axis and time on the x-axis. Thus we want to summarize these coefficients over time (across the x-axis, or axis=1 in numpy land). Say we take an average over time — then we get the average value for each MFCC coefficient across time, i.e., a feature vector of numbers for that particular audio file!
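As a small illustration of that aggregation (a sketch only; ‘clip.wav’ is a hypothetical path, and the full extraction class appears later in this post):

import librosa

# Load a clip and compute its MFCC matrix: (n_mfcc coefficients) x (n_frames over time)
y, sr = librosa.load("clip.wav", mono=True)   # hypothetical 4-second clip
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=25)
print(mfcc.shape)                 # (25, n_frames)

# Summarize each coefficient across time (axis=1) to get one value per coefficient
mfcc_over_time = mfcc.mean(axis=1)
print(mfcc_over_time.shape)       # (25,) -- a fixed-length feature vector for this clip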
What we can do is repeat this process for different music information retrieval techniques, or different summary statistics. For instance, the spectral contrast technique will also yield a matrix of different spectral characteristics for different frequency ranges over time. Again we can repeat the aggregation process over time and pack it into our feature vector.
What the authors used
The paper authors call out MFCCs explicitly. They mention pulling the first 25 MFCC coefficients:
“The per-frame values for each coefficient are summarized across time using the following summary statistics: minimum, maximum, median, mean, variance, skewness, kurtosis and the mean and variance of the first and second derivatives, resulting in a feature vector of dimension 225 per slice.”
Thus in their case they kept aggregating the 25 MFCCs over different summary statistics and packed them into a feature vector.
I’m going to implement something slightly different here, since it worked quite well for me in the genre classifier problem mentioned previously.
I take (for each snippet):
- Mean of the MFCC matrix over time
- Std. Dev of the MFCC matrix over time
- Mean of the Spectral Contrast matrix over time
- Std. Dev of the Spectral Contrast matrix over time
- Mean of the Chromagram matrix over time
- Std. Dev of the Chromagram matrix over time
My output (for each audio clip) will only be 82-dimensional as opposed to the 225-dim of the paper, so modeling should be quite a bit faster.
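Here’s the dimension accounting behind that 82 (a quick check, assuming 25 MFCCs, spectral contrast with n_bands=3, which yields n_bands + 1 = 4 rows, and the standard 12 chroma bins):

n_mfcc = 25
n_contrast = 3 + 1    # librosa's spectral_contrast returns n_bands + 1 rows
n_chroma = 12
print(2 * n_mfcc + 2 * n_contrast + 2 * n_chroma)    # 82: one mean and one std per row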
Finally some code! Audio feature extraction in action.
[Note that I’ll be posting code snippets both within the blog post and with GitHub Gist links. Sometimes Medium does not render GitHub Gists correctly, which is why I’m doing this. Also, all the in-document code can be copied and pasted into an IPython terminal, but the GitHub Gists cannot.]
Referring to my script runner here:
I parse through the metadata (given with the dataset) and grab the filename, fold, and class label for each audio file. Then this gets sent to an audio feature extractor class.
from pathlib import Path

metadata = parse_metadata("metadata/UrbanSound8K.csv")

audio_features = []
for row in metadata:
    path, fold, label = row
    src_path = f"{Path.home()}/datasets/UrbanSound8K/audio/fold{fold}/{path}"
    audio = AudioFeature(src_path, fold, label)
    audio.extract_features("mfcc", "spectral", "chroma")
    audio_features.append(audio)
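parse_metadata isn’t shown above; here’s a minimal sketch of what it needs to return (rows of filename, fold, and class label), assuming the standard UrbanSound8K.csv column names (the actual implementation in the repo may differ):

import csv

def parse_metadata(metadata_path):
    # UrbanSound8K.csv columns: slice_file_name, fsID, start, end, salience, fold, classID, class
    rows = []
    with open(metadata_path, newline="") as f:
        for line in csv.DictReader(f):
            rows.append((line["slice_file_name"], line["fold"], line["class"]))
    return rows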
The AudioFeature class wraps around librosa, and extracts the features you feed in as strings as shown above. It also then saves the AudioFeature object to disk for every audio clip. The process takes a while, so I save the class label and fold number in the AudioFeature object along with the feature vector. This way you can come back and play around with the model later on the extracted features.
import librosa
import numpy as np
import pickle
from pathlib import Path
import os


class AudioFeature:
    def __init__(self, src_path, fold, label):
        self.src_path = src_path
        self.fold = fold
        self.label = label
        self.y, self.sr = librosa.load(self.src_path, mono=True)
        self.features = None

    def _concat_features(self, feature):
        self.features = np.hstack(
            [self.features, feature] if self.features is not None else feature
        )

    def _extract_mfcc(self, n_mfcc=25):
        mfcc = librosa.feature.mfcc(y=self.y, sr=self.sr, n_mfcc=n_mfcc)
        mfcc_mean = mfcc.mean(axis=1).T
        mfcc_std = mfcc.std(axis=1).T
        mfcc_feature = np.hstack([mfcc_mean, mfcc_std])
        self._concat_features(mfcc_feature)

    def _extract_spectral_contrast(self, n_bands=3):
        spec_con = librosa.feature.spectral_contrast(
            y=self.y, sr=self.sr, n_bands=n_bands
        )
        spec_con_mean = spec_con.mean(axis=1).T
        spec_con_std = spec_con.std(axis=1).T
        spec_con_feature = np.hstack([spec_con_mean, spec_con_std])
        self._concat_features(spec_con_feature)

    def _extract_chroma_stft(self):
        stft = np.abs(librosa.stft(self.y))
        chroma_stft = librosa.feature.chroma_stft(S=stft, sr=self.sr)
        chroma_mean = chroma_stft.mean(axis=1).T
        chroma_std = chroma_stft.std(axis=1).T
        chroma_feature = np.hstack([chroma_mean, chroma_std])
        self._concat_features(chroma_feature)

    def extract_features(self, *feature_list, save_local=True):
        extract_fn = dict(
            mfcc=self._extract_mfcc,
            spectral=self._extract_spectral_contrast,
            chroma=self._extract_chroma_stft,
        )
        for feature in feature_list:
            extract_fn[feature]()
        if save_local:
            self._save_local()

    def _save_local(self, clean_source=True):
        out_name = self.src_path.split("/")[-1]
        out_name = out_name.replace(".wav", "")
        filename = f"{Path.home()}/projects/urban_sound_classification/data/fold{self.fold}/{out_name}.pkl"
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        with open(filename, "wb") as f:
            pickle.dump(self, f)
        if clean_source:
            self.y = None
This class implements what I described earlier — which is aggregating the various music information retrieval techniques over time, and then packing everything into a single feature vector for each audio clip.
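Since extraction is slow, it helps to be able to reload those pickled AudioFeature objects rather than recompute them. A minimal sketch of that reload step, assuming the pickles were written under ~/projects/urban_sound_classification/data/fold{n}/ as in _save_local above (the repo’s own caching logic may differ):

import pickle
from pathlib import Path

def load_cached_features(data_dir=f"{Path.home()}/projects/urban_sound_classification/data"):
    # Note: the AudioFeature class must be importable so pickle.load can reconstruct the objects
    audio_features = []
    for pkl_path in sorted(Path(data_dir).glob("fold*/*.pkl")):
        with open(pkl_path, "rb") as f:
            audio_features.append(pickle.load(f))
    return audio_features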
Modeling
Image by Author [Marc Kelechava]

Since we put all the AudioFeature objects in a list above, we can do some quick comprehensions to get what we need for modeling:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_matrix = np.vstack([audio.features for audio in audio_features])
labels = np.array([audio.label for audio in audio_features])
folds = np.array([audio.fold for audio in audio_features])

model_cfg = dict(
    model=RandomForestClassifier(
        random_state=42,
        n_jobs=10,
        class_weight="balanced",
        n_estimators=500,
        bootstrap=True,
    ),
)

model = Model(feature_matrix, labels, folds, model_cfg)
fold_acc = model.train_kfold()
The Model class will implement the cross-validation loop as described by the authors (keeping the relevant pitfalls in mind!).
As a reminder, here’s a second warning from the authors:
“Don’t evaluate just on one split! Use 10-fold (not 5-fold) cross validation and average the scores. We have seen reports that only provide results for a single train/test split, e.g. train on folds 1–9, test on fold 10 and report a single accuracy score. We strongly advise against this. Instead, perform 10-fold cross validation using the provided folds and report the average score.
Why?
Not all the splits are as “easy”. That is, models tend to obtain much higher scores when trained on folds 1–9 and tested on fold 10, compared to (e.g.) training on folds 2–10 and testing on fold 1. For this reason, it is important to evaluate your model on each of the 10 splits and report the average accuracy.
Again, your results will NOT be comparable to previous results in the literature.”
On their latter point (this is from the paper), it’s worth noting that different recordings/folds have different distributions of whether the snippets appear in the foreground or the background; this is why some folds are easy and some are hard.
tl;dr CV
- We need to train on folds 1–9, predict & score on fold 10
- Then train on folds 2–10, predict & score on fold 1
- …etc…
Averaging the scores on the test folds with this process will match the existing research AND ensure that we aren’t accidentally leaking data about the source recording to our holdout set.
Leave One Group Out
Initially, I coded the split process described above by hand using numpy with respect to the given folds. While it wasn’t too bad, I realized that scikit-learn provides a perfect solution in the form of LeaveOneGroupOut KFold splitting.
To prove to myself it is what we want, I ran a slightly altered version of the test code for the splitter from the sklearn docs:
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 1, 2, 1, 2])
groups = np.array([1, 2, 3, 1, 2, 3])

logo = LeaveOneGroupOut()
logo.get_n_splits(X, y, groups)
logo.get_n_splits(groups=groups)  # 'groups' is always required

for train_index, test_index in logo.split(X, y, groups):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("TRAIN:", X_train, "TEST:", X_test)

"""
TRAIN:
[[ 3  4]
 [ 5  6]
 [ 9 10]
 [11 12]]
TEST:
[[1 2]
 [7 8]]
TRAIN:
[[ 1  2]
 [ 5  6]
 [ 7  8]
 [11 12]]
TEST:
[[ 3  4]
 [ 9 10]]
TRAIN:
[[ 1  2]
 [ 3  4]
 [ 7  8]
 [ 9 10]]
TEST:
[[ 5  6]
 [11 12]]
"""
Note that in this toy example there are 3 groups, ‘1’, ‘2’, and ‘3’.
When I feed the group membership list for each training example to the splitter, it correctly ensures that the same group examples never appear in both train and test.
The Model Class
Thanks to sklearn this ends up being pretty easy to implement!
from sklearn.model_selection import LeaveOneGroupOut, StratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
import numpy as np
import random


class Model:
    def __init__(self, feature_matrix, labels, folds, cfg):
        self.X = feature_matrix
        self.encoder = LabelEncoder()
        self.y = self.encoder.fit_transform(labels)
        self.folds = folds
        self.cfg = cfg
        self.val_fold_scores_ = []

    def train_kfold(self):
        logo = LeaveOneGroupOut()
        for train_index, test_index in logo.split(self.X, self.y, self.folds):
            X_train, X_test = self.X[train_index], self.X[test_index]
            y_train, y_test = self.y[train_index], self.y[test_index]

            ss = StandardScaler(copy=True)
            X_train = ss.fit_transform(X_train)
            X_test = ss.transform(X_test)

            clf = self.cfg["model"]
            clf.fit(X_train, y_train)

            y_pred = clf.predict(X_test)
            fold_acc = accuracy_score(y_test, y_pred)
            self.val_fold_scores_.append(fold_acc)
        return self.val_fold_scores_
Here I add in some scaling, but in essence the splitter gives us the desired CV. On each iteration of the splitter I train on 9 folds and predict on the holdout fold. This happens 10 times, and then we can average over the returned list of 10 scores on the holdout folds.
Results
"""In: fold_acc
Out:
[0.6632302405498282,
0.7083333333333334,
0.6518918918918919,
0.6404040404040404,
0.7585470085470085,
0.6573511543134872,
0.6778042959427207,
0.6910669975186104,
0.7230392156862745,
0.7825567502986858]
In: np.mean(fold_acc)
Out: 0.6954224928485881
"""
A 69.5% average is right in line with what the authors report in their paper for the top models! Thus I’m feeling good that this was implemented as they envisioned. Note that they also show fold 10 was the easiest to score on (we see that too), so we’re in line there as well.
Why not run a hyperparameter search for this task? [very optional]
Here’s where things get a little tricky.
A ‘Normal’ CV Process:
If we could train/test/split arbitrarily, we could do something like: shuffle all the data, hold out a test set, split the remainder into a train set and a validation set, fit candidate models on the train set, and score them on the validation set.

Because we would have that separate validation set, we could repeat the fit-and-score steps a bunch of times on different model families or different parameter search ranges.
Then when we are done, we’d take our final model and see if it generalizes using the holdout test set, which we hadn’t touched up to that point.
But how is this going to work within our fold-based LeaveOneGroupOut approach? Imagine we tried to set up a GridSearchCV as follows:
def train_kfold(self):
    logo = LeaveOneGroupOut()
    for train_index, test_index in logo.split(self.X, self.y, self.folds):
        X_train, X_test = self.X[train_index], self.X[test_index]
        y_train, y_test = self.y[train_index], self.y[test_index]

        ss = StandardScaler(copy=True)
        X_train = ss.fit_transform(X_train)
        X_test = ss.transform(X_test)

        clf = self.cfg["model"]
        kf = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)

        # This isn't valid. In this inner CV, clips from the same recording fold
        # could end up in both the train and validation sets of GridSearchCV.
        grid_search = GridSearchCV(
            estimator=clf,
            param_grid=self.cfg["param_grid"],
            cv=kf,
            return_train_score=True,
            verbose=3,
            refit=True,
        )
        grid_search.fit(X_train, y_train)
        self.trained_models_.append(grid_search.best_estimator_)

        y_pred = grid_search.predict(X_test)
        fold_acc = accuracy_score(y_test, y_pred)
        self.val_fold_scores_.append(fold_acc)
    return self.val_fold_scores_
But now when GridSearchCV runs the inner split, we’ll run into the same problem that we had solved by using LeaveOneGroupOut!
That is, imagine the first run of this loop where the test set is fold 1 and the train set is on folds 2–10. If we then pass the train set (of folds 2–10) into the inner GridSearchCV loop, we’ll end up with inner KFold cases where the same fold is used in the inner GridSearchCV train and the inner GridSearchCV test.
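To make the leak concrete, here’s a small check (a sketch reusing the toy X, y, and groups arrays from the LeaveOneGroupOut example above) showing that a plain StratifiedKFold split pays no attention to group membership, so the same group lands on both sides of the inner split:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 1, 2, 1, 2])
groups = np.array([1, 2, 3, 1, 2, 3])

kf = StratifiedKFold(n_splits=3)
for train_index, val_index in kf.split(X, y):
    shared = set(groups[train_index]) & set(groups[val_index])
    # Any non-empty set here means a group is leaking across the inner train/val split
    print("inner-train groups:", groups[train_index],
          "inner-val groups:", groups[val_index],
          "shared:", shared)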
Thus it’s going to end up (very likely) overfitting the choice of best params within the inner GridSearchCV loop.
And hence, I’m not going to run a hyperparameter search within the LeaveOneGroupOut loop.
Next Steps
I’m pretty pleased this correctly implemented the research paper — at least in terms of very closely matching their results.
In terms of feature extraction, I’d also like to consider the nuances of misclassifications between classes and see if I can think up better features for the hard examples. For instance, the model definitely gets confused between the air conditioner and engine idling classes. To check this, I have some code in my prior audio blog post that you can use to look at the False Positive Rate and False Negative Rate per class: https://github.com/marcmuon/audio_genre_classification/blob/master/model.py#L84-L128
Thanks for reading this far! I intend to do a 2nd part of this post addressing the Next Steps soon. Some other work that might be of interest can be found here:
https://github.com/marcmuon
https://medium.com/@marckelechava
Citations
J. Salamon, C. Jacoby and J. P. Bello, “A Dataset and Taxonomy for Urban Sound Research”, 22nd ACM International Conference on Multimedia, Orlando, USA, Nov. 2014. [ACM] [PDF] [BibTeX]
Original article: https://towardsdatascience.com/urban-sound-classification-with-librosa-nuanced-cross-validation-5b5eb3d9ee30