Neural Style Transfer for Audio in PyTorch
There have been some really interesting applications of style transfer. It basically aims to take the ‘style’ from one image and change the ‘content’ image to match that style.
But so far it hasn’t really been applied to audio. So I explored the idea of applying neural style transfer to audio. To be frank, the results were less than stellar, but I’m going to keep working on this in the future.
For this exercise, I’m going to be using clips from the Joe Rogan podcast. I’m trying to make Joe Rogan, from the Joe Rogan Experience, sound like Joey Diaz, from the Church of What’s Happening Now. Joe Rogan already does a pretty good impression of Joey Diaz. But I’d like to improve his impression using deep learning.
First I’m going to download the YouTube videos. There’s a neat trick mentioned on GitHub that allows you to download small segments of YouTube videos. That’s handy because I don’t want to download the entire video. You’ll need youtube-dl and ffmpeg for this step.
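The gist of the trick is to ask youtube-dl for the direct media URL and then let ffmpeg fetch only the slice you need. Here’s a minimal sketch (the video URL, timestamps and output filename are placeholders):

```python
import subprocess

def download_segment(video_url, start, duration, out_file):
    # youtube-dl -g resolves the direct media URL(s) without downloading;
    # DASH formats may print separate video and audio URLs, so we just
    # take the first one here
    direct_url = subprocess.check_output(
        ["youtube-dl", "-g", video_url], text=True
    ).splitlines()[0]
    # -ss seeks to the start time and -t caps the duration, so ffmpeg
    # only pulls down the segment we care about
    subprocess.run(
        ["ffmpeg", "-ss", start, "-i", direct_url,
         "-t", duration, "-c", "copy", out_file],
        check=True,
    )

download_segment("https://www.youtube.com/watch?v=...", "00:01:00", "30", "joe_rogan.mp4")
```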
Loss Functions
There are two types of loss for this: content loss and style loss.
Ideally we want both content and style loss to be minimised.
Content loss
The content loss function takes in an input matrix and a content matrix. The content matrix corresponds to Joe Rogan’s audio. It returns the weighted content distance between the input matrix and the content matrix. This is implemented as a torch module, and the distance can be calculated using nn.MSELoss.
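A minimal sketch of what that module can look like, following the PyTorch neural style transfer tutorial (the `weight` argument is the scaling factor discussed later):

```python
import torch.nn as nn

class ContentLoss(nn.Module):
    def __init__(self, target, weight):
        super().__init__()
        # detach the target so it is treated as a fixed constant,
        # not something to backpropagate through
        self.target = target.detach() * weight
        self.weight = weight
        self.criterion = nn.MSELoss()

    def forward(self, input):
        # weighted mean squared error between the input features
        # and the content (Joe Rogan) features
        self.loss = self.criterion(input * self.weight, self.target)
        # return the input unchanged so the module can sit inside
        # a sequential model
        return input
```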
This implementation of content loss was largely borrowed from here.
Style loss
When looking at the style we really just want to extract the way in which Joey Diaz speaks. We don’t want to extract the exact words he says. But we do want to get the tone, the intonation, the inflection, etc. from his speech. For that we’ll need to get the Gram matrix.
To calculate this we take the first slice in the input matrix and flatten it. Flattening this slice removes a lot of audio information. Then we take another slice from the input matrix and flatten it. We take the dot product of the flattened matrices.
A dot product is a measure of how similar the two matrices are. If the matrices are similar we’ll get a really large result. If they are very different we’ll get a very small result.
So for example, let’s say that the first flattened matrix corresponds to pitch, and the second flattened matrix corresponds to volume. If we get a high dot product, then when volume is high, pitch is also high. Or in other words, when Joey talks very loudly his voice increases in pitch.
The dot products can give us very large numbers. We normalize them by dividing each element by the total number of elements in the matrix.
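Putting those pieces together, a sketch of the Gram matrix and the style loss built on top of it might look like this (it assumes the features arrive as a (batch, channels, samples) tensor):

```python
import torch
import torch.nn as nn

def gram_matrix(input):
    b, ch, s = input.size()
    # flatten each channel slice of the feature map into a vector
    features = input.view(b * ch, s)
    # dot products between every pair of flattened slices
    g = torch.mm(features, features.t())
    # normalize by the total number of elements in the matrix
    return g.div(b * ch * s)

class StyleLoss(nn.Module):
    def __init__(self, target_features, weight):
        super().__init__()
        # the target Gram matrix comes from Joey Diaz's audio features
        self.target = gram_matrix(target_features).detach() * weight
        self.weight = weight
        self.criterion = nn.MSELoss()

    def forward(self, input):
        self.loss = self.criterion(gram_matrix(input) * self.weight, self.target)
        return input
```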
Convert Wav to Matrix
To convert the waveform audio to a matrix that we can pass to pytorch I’ll use librosa. Most of this code was borrowed from Dmitry Ulyanov's github repo and Alish Dipani's github repo.
We get the short-time Fourier transform of the audio using the librosa library. The window size for this is 2048, which is also the default setting. There is scope here for replacing this code with code from torchaudio, but this works for now.
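A sketch of that conversion, based on the approach in the repos above (the filename is a placeholder):

```python
import librosa
import numpy as np

N_FFT = 2048  # STFT window size, librosa's default

def wav_to_matrix(filename):
    audio, sample_rate = librosa.load(filename)
    # complex STFT -> keep the magnitude only, compress it with log1p
    spectrogram = librosa.stft(audio, n_fft=N_FFT)
    return np.log1p(np.abs(spectrogram)), sample_rate
```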
Create CNN
This CNN is very shallow. It consists of 2 convolutions with a ReLU in between them. I originally took the CNN used here, but I’ve made a few changes (a sketch of the resulting architecture follows the list below):
- Firstly, I added content loss. This wasn’t added before and is obviously very useful. We’d like to know how close (or far away) the audio sounds to the original content.
- Secondly, I added a ReLU to the model. It’s pretty well established that nonlinear activations are desirable in a neural network. Adding a ReLU improved the model significantly.
- I increased the number of steps from 2,500 to 20,000.
- I slightly deepened the network by adding a layer of Conv1d. Style loss and content loss are calculated after this layer. This improved the model as well, but adding the ReLU resulted in the largest improvement by far.
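Here is a sketch of the resulting architecture; the channel sizes are illustrative rather than the exact values from my notebook. The frequency bins of the spectrogram act as the input channels for Conv1d (an STFT with a window of 2048 gives 1025 bins):

```python
import torch.nn as nn

model = nn.Sequential(
    # frequency bins act as channels: a 2048-sample STFT -> 1025 bins
    nn.Conv1d(in_channels=1025, out_channels=4096, kernel_size=3, padding=1),
    nn.ReLU(),  # the nonlinearity that gave the biggest improvement
    nn.Conv1d(in_channels=4096, out_channels=4096, kernel_size=3, padding=1),
    # the ContentLoss and StyleLoss modules from earlier are appended
    # here, so both losses are computed on the final layer's features
)
```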
I personally found that my loss values, particularly for style loss, were very low; so low they were almost 0. I rectified this by multiplying each loss by a style_weight and a content_weight. This seems like a crude solution, but according to fastai what you care about is the direction of the loss and its relative size. So I think it’s alright for now.
Run style transfer
Now I’ll run the style transfer. This will use the optim.Adam optimizer. This piece of code was taken from the pytorch tutorial for neural style transfer. For each iteration of the network the style loss and content loss are calculated. That in turn is used to get the gradients. The gradients are multiplied by the learning rate, which in turn updates the input audio matrix. In pytorch the optimizer requires a closure function.
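A sketch of that loop, including the style_weight and content_weight multipliers mentioned earlier (the weight values, learning rate and the `content_audio` starting point are illustrative; `style_loss` and `content_loss` are the loss modules inserted into the model above):

```python
import torch.optim as optim

style_weight = 1e6    # illustrative scaling factors that keep the
content_weight = 1e2  # raw losses from being vanishingly small
num_steps = 20000

# optimise the input spectrogram itself, starting from the content audio
input_audio = content_audio.clone().requires_grad_(True)
optimizer = optim.Adam([input_audio], lr=0.01)

for step in range(num_steps):
    def closure():
        optimizer.zero_grad()
        # a forward pass populates .loss on the loss modules
        model(input_audio)
        loss = style_weight * style_loss.loss + content_weight * content_loss.loss
        loss.backward()
        return loss
    optimizer.step(closure)
```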
Reconstruct the Audio
Finally the audio needs to be reconstructed. To do that, librosa’s inverse short-time Fourier transform can be used.
Then we write to an audio file and use the jupyter notebook extension to play the audio in the notebook.
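Since the spectrogram only kept the magnitude of the STFT, the phase has to be estimated when inverting; one way to do that is librosa’s Griffin-Lim, which runs the inverse STFT iteratively. A sketch of the reconstruction, where `output` stands for the optimised input matrix from the loop above:

```python
import numpy as np
import librosa
import soundfile as sf
from IPython.display import Audio, display

# undo the log1p compression applied when building the spectrogram
spectrogram = np.expm1(output.detach().cpu().numpy().squeeze())
# estimate the phase and invert the STFT to get a waveform back
audio = librosa.griffinlim(spectrogram)
sf.write("output.wav", audio, sample_rate)
# play the result inline in the notebook
display(Audio("output.wav"))
```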
The notebook for this can be found on GitHub.
Originally published at https://spiyer99.github.io on August 2, 2020.