Simplifying Audio Data: FFT, STFT and MFCC
What should we know about sound? Sound is produced when an object vibrates. Those vibrations set the surrounding air molecules oscillating, creating alternating regions of high and low air pressure, and this alternation of pressure propagates as a wave.
Some key terms in audio processing:
- Amplitude — perceived as loudness.
- Frequency — perceived as pitch.
- Sample rate — the number of samples taken from the sound per second. A sample rate of 22000 Hz means 22000 samples are captured every second.
- Bit depth — the amplitude resolution of each recorded sample, much like bit depth per pixel in an image. So 24-bit sound has finer resolution than 16-bit.
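To make sample rate and bit depth concrete, here is a minimal NumPy sketch (the 440 Hz tone and 2-second duration are arbitrary choices for illustration):

```python
import numpy as np

# At a sample rate of 22050 Hz, one second of audio is 22050 amplitude values.
sample_rate = 22050            # samples per second
duration = 2.0                 # seconds
t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)

# A 440 Hz sine wave sampled at that rate.
signal = np.sin(2 * np.pi * 440 * t)
print(len(signal))             # 44100 samples for 2 seconds

# Bit depth sets amplitude resolution: 16-bit audio has 2**16 discrete levels.
print(2 ** 16)                 # 65536
```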
Here I have used the sound of a piano key from freesound.org:
import numpy as np
import matplotlib.pyplot as plt
import librosa, librosa.display

FIG_SIZE = (15, 10)

# load audio file with librosa (resampled to 22050 Hz); `file` is the path to the audio file
signal, sample_rate = librosa.load(file, sr=22050)

# plot the waveform (time domain)
plt.figure(figsize=FIG_SIZE)
# note: in librosa >= 0.10, waveplot was replaced by librosa.display.waveshow
librosa.display.waveplot(signal, sr=sample_rate, alpha=0.4)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Waveform")
plt.savefig('waveform.png', dpi=100)
plt.show()
To move a wave from the time domain to the frequency domain we perform a Fast Fourier Transform (FFT) on the data. The Fourier transform decomposes a periodic sound into a sum of sine waves, each oscillating at a different frequency. It is quite incredible: any sound, as long as it is periodic, can be described as the superposition of a bunch of sine waves at different frequencies.
Below I have shown how two sine waves of different amplitude and frequency combine into one.
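The superposition itself is just addition. A minimal NumPy sketch (the 5 Hz / 20 Hz frequencies and 1.0 / 0.5 amplitudes are arbitrary illustrative values):

```python
import numpy as np

sample_rate = 22050
t = np.linspace(0, 1.0, sample_rate, endpoint=False)

# Two sine waves with different amplitudes and frequencies...
wave1 = 1.0 * np.sin(2 * np.pi * 5 * t)
wave2 = 0.5 * np.sin(2 * np.pi * 20 * t)

# ...combine by simple addition (superposition).
combined = wave1 + wave2
```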
# perform Fourier transform
fft = np.fft.fft(signal)

# calculate abs values on complex numbers to get magnitude
spectrum = np.abs(fft)

# create frequency variable
f = np.linspace(0, sample_rate, len(spectrum))

# take half of the spectrum and frequency (the FFT output is symmetric for real signals)
left_spectrum = spectrum[:int(len(spectrum)/2)]
left_f = f[:int(len(spectrum)/2)]

# plot spectrum
plt.figure(figsize=FIG_SIZE)
plt.plot(left_f, left_spectrum, alpha=0.4)
plt.xlabel("Frequency")
plt.ylabel("Magnitude")
plt.title("Power spectrum")
plt.savefig('FFT.png')
plt.show()
By applying the Fourier transform we move into the frequency domain: the x-axis now shows frequency, and magnitude is a function of frequency itself. But we lose all information about time. The power spectrum is a snapshot of all the frequency components that together form the sound; it tells us that the different frequencies carry different power, averaged over the whole recording. It is static.

That is a problem, because audio is a time series: things change over time, and we want to know how they change. With the plain Fourier transform we can't capture that, so we are missing a lot of information.

The solution is the Short-Time Fourier Transform (STFT). The STFT computes several Fourier transforms at different intervals, and in doing so preserves information about time and the way the sound evolves. The interval at which we perform each Fourier transform is given by the frame size: a frame is a fixed number of samples. We consider, for example, only the first 200 samples, compute a Fourier transform there, then shift by a hop length and repeat over the rest of the waveform. The result is a spectrogram, which gives us information about time + frequency + magnitude.
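The framing idea described above can be sketched with plain NumPy. This is a deliberately naive version for illustration: unlike librosa.stft, it applies no window function and no padding, and the 440 Hz test tone is an arbitrary choice.

```python
import numpy as np

def naive_stft(signal, n_fft=2048, hop_length=512):
    """Slide a frame of n_fft samples over the signal, advancing by
    hop_length each time, and take an FFT of every frame."""
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop_length):
        frame = signal[start:start + n_fft]
        frames.append(np.fft.rfft(frame))     # rfft: one-sided spectrum of a real frame
    # shape: (n_frames, n_fft // 2 + 1) -> one row of frequency bins per frame
    return np.array(frames)

sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
spec = np.abs(naive_stft(np.sin(2 * np.pi * 440 * t)))
print(spec.shape)   # (40, 1025): 40 frames, 1025 frequency bins
```

Each frequency bin k corresponds to k * sr / n_fft Hz, so the 440 Hz tone should peak near bin 41 (441.4 Hz) in every frame.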
# STFT -> spectrogram
hop_length = 512   # in num. of samples
n_fft = 2048       # window in num. of samples

# calculate duration of hop length and window in seconds
hop_length_duration = float(hop_length) / sample_rate
n_fft_duration = float(n_fft) / sample_rate

print("STFT hop length duration is: {}s".format(hop_length_duration))
print("STFT window duration is: {}s".format(n_fft_duration))
# perform STFT
stft = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)

# calculate abs values on complex numbers to get magnitude
spectrogram = np.abs(stft)

# display spectrogram
plt.figure(figsize=FIG_SIZE)
librosa.display.specshow(spectrogram, sr=sample_rate, hop_length=hop_length)
plt.xlabel("Time")
plt.ylabel("Frequency")
plt.colorbar()
plt.title("Spectrogram")
plt.savefig('spectrogram.png')
plt.show()

# apply logarithm to cast amplitude to decibels
log_spectrogram = librosa.amplitude_to_db(spectrogram)

plt.figure(figsize=FIG_SIZE)
librosa.display.specshow(log_spectrogram, sr=sample_rate, hop_length=hop_length)
plt.xlabel("Time")
plt.ylabel("Frequency")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram (dB)")
plt.savefig('spectrogram_log.png')
plt.show()
In the spectrogram we have time on the x-axis and frequency on the y-axis, plus a third axis given by colour: the colour tells us how strongly a given frequency is present in the sound at a given time. For example, here we can see that low-frequency content dominates most of the audio.
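Reading the "colour axis" programmatically just means reading magnitudes out of the 2-D array. A small sketch on fake data (the spectrogram here is random with a planted peak at bin 41, purely for illustration):

```python
import numpy as np

# A (hypothetical) magnitude spectrogram: rows are frequency bins, columns
# are time frames. The brightest colour in the plot corresponds to the
# largest magnitude in a column.
sr, n_fft = 22050, 2048
rng = np.random.default_rng(0)
spectrogram = rng.random((n_fft // 2 + 1, 100))   # fake (freq_bins, frames) data
spectrogram[41, :] += 10.0                        # plant a strong component at bin 41

# Dominant frequency bin in each frame, converted to Hz (bin k -> k * sr / n_fft).
dominant_bins = spectrogram.argmax(axis=0)
dominant_hz = dominant_bins * sr / n_fft
print(dominant_hz[0])   # ~441.4 Hz (41 * 22050 / 2048)
```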
Mel-Frequency Cepstral Coefficients, MFCCs for short, capture many aspects of sound. If, for example, a guitar and a flute play the same melody, the frequencies and the rhythm will be more or less the same (depending on the performance), but what changes is the quality of the sound, and MFCCs are capable of capturing that information.

To extract MFCCs we perform a Fourier transform and move from the time domain to the frequency domain, so MFCCs are fundamentally a frequency-domain feature. Their great advantage over spectrograms is that they approximate the human auditory system: they try to model the way we perceive frequency. That is very important if you then want to do deep learning, so that your data represents the way we actually process audio.

The result of extracting MFCCs is a vector of coefficients. You can specify how many; in most music applications you want between 13 and 39 coefficients. As with the STFT, the coefficients are computed at each frame, so you can see how the MFCCs evolve over time.
# extract 13 MFCCs
MFCCs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_fft=n_fft,
                             hop_length=hop_length, n_mfcc=13)

# display MFCCs
plt.figure(figsize=FIG_SIZE)
librosa.display.specshow(MFCCs, sr=sample_rate, hop_length=hop_length)
plt.xlabel("Time")
plt.ylabel("MFCC coefficients")
plt.colorbar()
plt.title("MFCCs")
plt.savefig('mfcc.png')
plt.show()
Here the 13 MFCC coefficients are on the y-axis and time is on the x-axis; the redder the colour, the larger the value of that coefficient in that time frame.
MFCCs are used in a number of audio applications. They were originally introduced for speech recognition, but they are also used in music recognition, musical instrument classification, and music genre classification.
Link to code:
Code for sine waves
Code for FFT, STFT and MFCC’s
Translated from: https://medium.com/analytics-vidhya/simplifying-audio-data-fft-stft-mfcc-for-machine-learning-and-deep-learning-443a2f962e0e