Sleep Audio Segmentation and Recognition (Part 3)
Paper 1: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
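[Abstract] Audio pattern recognition is an important research topic in machine learning and covers several tasks, including audio tagging, acoustic scene classification, music classification, speech emotion classification, and sound event detection. Recently, neural networks have been applied to audio pattern recognition problems. However, previous systems were built on specific datasets with limited durations. In computer vision and natural language processing, systems pretrained on large-scale datasets have generalized well to several tasks, yet there is limited research on pretraining systems on large-scale datasets for audio pattern recognition. In this paper, we propose pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset. These PANNs are transferred to other audio-related tasks. We investigate the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks. We propose an architecture called Wavegram-Logmel-CNN that uses both the log-mel spectrogram and the waveform as input features. Our best PANN system achieves a state-of-the-art mean average precision (mAP) of 0.439 on AudioSet tagging, outperforming the previous best system at 0.392. We transfer PANNs to six audio pattern recognition tasks and demonstrate state-of-the-art performance in several of them.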
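The abstract reports results as mean average precision (mAP). As a reminder of what that number measures, here is a minimal sketch of non-interpolated average precision per class, averaged over classes. The function names and the toy scores are made up for illustration; this is not the PANNs evaluation code, and real evaluations typically use a library routine.

```python
def average_precision(scores, labels):
    """Average precision for one class: mean of precision@rank over the positive clips."""
    # Rank clips by descending score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    # A class without positives has no defined AP; return None so it can be skipped.
    return sum(precisions) / len(precisions) if precisions else None


def mean_average_precision(score_matrix, label_matrix):
    """Mean of the per-class AP values; rows are clips, columns are classes."""
    num_classes = len(score_matrix[0])
    aps = []
    for c in range(num_classes):
        ap = average_precision([row[c] for row in score_matrix],
                               [row[c] for row in label_matrix])
        if ap is not None:
            aps.append(ap)
    return sum(aps) / len(aps)


# Toy data (hypothetical): 4 clips, 2 classes.
scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
print(mean_average_precision(scores, labels))
```

In the toy data, class 0 ranks one negative clip above a positive one and class 1 is ranked perfectly, so the mean lands between the two per-class values.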
Paper 2: Towards Duration Robust Weakly Supervised Sound Event Detection
> [1]
Introduction
Sound event detection (SED) research classifies and localizes particular audio events (e.g., dog barking, alarm ringing) within an audio clip, assigning each event a label along with a start point (onset) and an endpoint (offset).
Label assignment is usually referred to as tagging, while the onset/offset detection is referred to as localization.
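To make the tagging/localization distinction concrete, the sketch below takes hypothetical per-frame 0/1 decisions for a single class and derives both outputs: a clip-level tag and a list of (onset, offset) pairs in seconds. The frame rate and the example sequence are assumptions chosen for illustration.

```python
def frames_to_events(active, frame_rate_hz):
    """Turn a per-frame 0/1 activity sequence into (onset_s, offset_s) pairs."""
    events = []
    start = None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i                      # an event begins at this frame
        elif not is_active and start is not None:
            events.append((start / frame_rate_hz, i / frame_rate_hz))
            start = None
    if start is not None:                  # event still running at the end of the clip
        events.append((start / frame_rate_hz, len(active) / frame_rate_hz))
    return events


# Hypothetical frame-level decisions for one class ("dog barking") at 10 frames per second.
dog_bark = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
tag_present = any(dog_bark)                             # tagging: is the event in the clip?
events = frames_to_events(dog_bark, frame_rate_hz=10)   # localization: when does it occur?
print(tag_present, events)
```

For a multi-label clip, the same conversion would simply be run once per class.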
SED can be used for query-based sound retrieval [1], smart cities and homes [2], [3], as well as voice activity detection [4].
Unlike common classification tasks such as image or speaker recognition, a single audio clip might contain multiple different sound events (multi-output), sometimes occurring simultaneously (multi-label).
In particular, the localization task escalates the difficulty within the scope of SED, since different sound events have various time lengths, and each occurrence is unique.
Two main approaches exist to train an effective localization model: fully supervised SED and weakly supervised SED (WSSED).
Fully supervised approaches, which potentially perform better than weakly supervised ones, require manual time-stamp labeling.
However, manual labeling is a significant hindrance to scaling to large datasets because of its high labor cost.
This paper primarily focuses on WSSED, which only has access to clip-level event labels during training yet is required to predict onsets and offsets at the inference stage.
Challenges such as the Detection and Classification of Acoustic Scenes and Events (DCASE) exemplify the difficulties in training robust SED systems.
DCASE challenge datasets are real-world recordings (e.g., audio with no quality control and lossy compression), thus containing unknown noises and scenarios.
Specifically, in each challenge since 2017, at least one task was primarily concerned with WSSED. Most previous work focuses on providing single-target, task-specific solutions for WSSED on either the tagging, segment, or event level.
Tagging-level solutions are often capable of localizing event boundaries, yet their temporal consistency is inferior to that of segment- and event-level methods.
This has been seen during the DCASE2017 challenge, where no single model could win both the tagging and localization subtasks.
Solutions optimized for segment level often utilize a fixed target time resolution (e.g., 1 Hz), inhibiting fine-scale localization performance (e.g., 50 Hz).
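A small sketch shows why a fixed 1 Hz target resolution limits fine-scale localization. It assumes, for illustration only, that an output segment counts as active whenever the event overlaps it, and it works in integer milliseconds to avoid floating-point rounding. The event boundaries are hypothetical.

```python
def quantize_event_ms(onset_ms, offset_ms, resolution_hz):
    """Snap an event to the output grid of a model emitting `resolution_hz` decisions per second.

    Assumes `resolution_hz` divides 1000 evenly, so the step is a whole number of milliseconds.
    """
    step_ms = 1000 // resolution_hz
    q_onset = (onset_ms // step_ms) * step_ms        # round the onset down to a segment border
    q_offset = -(-offset_ms // step_ms) * step_ms    # round the offset up (ceiling division)
    return q_onset, q_offset


true_event = (300, 720)                                  # a hypothetical 420 ms event
coarse = quantize_event_ms(*true_event, resolution_hz=1)
fine = quantize_event_ms(*true_event, resolution_hz=50)
print(coarse, fine)
```

At 1 Hz the 420 ms event is reported as a full one-second segment, while the 50 Hz grid (20 ms steps) can represent its boundaries exactly.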
Lastly, successful event-level solutions require prior knowledge about each event's duration to obtain temporally consistent predictions.
Previous work in [5] showed that successful models such as the DCASE2018 task 4 winner are biased towards predicting tags from long-duration clips, which might limit their ability to generalize to different datasets (e.g., deploying the same model on a new dataset), since new datasets possibly contain events of short or unknown duration.
In contrast, we aim to enhance WSSED performance, specifically in duration estimation regarding short, abrupt events, without a pre-estimation of each respective event's individual weight.
Related Work
Most current approaches within SED and WSSED utilize neural networks, in particular convolutional neural networks [6], [7] (CNN) and convolutional recurrent neural networks [4], [5] (CRNN).
CNN models generally excel at audio tagging [8], [9] and scale with data, yet fall behind CRNN approaches in onset and offset estimation [10].
Apart from different modeling methods, many recent works propose other approaches for the localization conundrum.
A plethora of temporal pooling strategies have been proposed, aiming to summarize frame-level beliefs into a single clip-wise probability.
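The passage does not list the pooling strategies it has in mind, so the sketch below shows three commonly used ones as examples: mean pooling, max pooling, and linear softmax pooling (each frame weighted by its own probability). The frame probabilities are invented for illustration.

```python
def mean_pool(frame_probs):
    """Every frame contributes equally to the clip-level probability."""
    return sum(frame_probs) / len(frame_probs)


def max_pool(frame_probs):
    """The single most confident frame decides the clip-level probability."""
    return max(frame_probs)


def linear_softmax_pool(frame_probs):
    """Weight each frame by its own probability: sum(p^2) / sum(p)."""
    total = sum(frame_probs)
    return sum(p * p for p in frame_probs) / total if total > 0 else 0.0


# Hypothetical frame-level probabilities for one class: a short event in a mostly silent clip.
frame_probs = [0.1, 0.1, 0.9, 0.8, 0.1]
clip_mean = mean_pool(frame_probs)
clip_max = max_pool(frame_probs)
clip_linsoft = linear_softmax_pool(frame_probs)
print(clip_mean, clip_linsoft, clip_max)
```

Mean pooling dilutes short events in long clips, max pooling passes the training signal through a single frame, and linear softmax sits between the two, which is one reason the choice of pooling matters for weakly supervised localization.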
Contribution:
In our work, we modify and extend the framework of [5] further towards other datasets and aim to analyze the benefits and the limits of duration robust training.
Our main goal with this work is to bridge the gap between real-world SED and research models and to facilitate a common framework that works well on both the tagging and the localization level without utilizing dataset-specific knowledge.
Our contributions are:
- A new, lightweight model architecture for WSSED using L4-norm temporal subsampling.
- A novel thresholding technique named triple threshold, bridging the gap between tagging and localization performance.
- Verification of our proposed approach across three publicly available datasets, without the requirement of manually optimizing towards dataset-specific hyperparameters.
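The excerpt names L4-norm temporal subsampling but does not define it. A minimal sketch of generic Lp-norm pooling with p = 4 over non-overlapping windows follows, assuming the usual per-window formula (sum of |x|^p, then the p-th root). The window size and feature values are hypothetical, and this is not taken from the paper's code.

```python
def lp_pool_1d(values, window, p=4.0):
    """Non-overlapping Lp-norm pooling: each window maps to (sum |x|^p)^(1/p)."""
    pooled = []
    # Trailing values that do not fill a complete window are dropped.
    for start in range(0, len(values) - window + 1, window):
        chunk = values[start:start + window]
        pooled.append(sum(abs(x) ** p for x in chunk) ** (1.0 / p))
    return pooled


# Hypothetical activations along the time axis; pooling halves the time resolution.
features = [0.1, 0.9, 0.2, 0.2, 0.0, 0.5, 0.5, 0.5]
l4 = lp_pool_1d(features, window=2, p=4)   # L4-norm: dominated by the larger value
l1 = lp_pool_1d(features, window=2, p=1)   # p = 1 reduces to sum pooling
print(l4)
print(l1)
```

As p grows the result approaches max pooling, and at p = 1 it is plain sum pooling, so p = 4 behaves like a soft maximum that still lets weaker frames contribute.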
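The triple threshold is only named here, not described. One plausible reading, stated as an assumption, is a clip-level threshold that first decides whether a class is present at all, followed by double (hysteresis) thresholding on the frame-level probabilities. The sketch below implements that reading; the threshold values are placeholders, and the paper and repository in the references should be consulted for the actual procedure.

```python
def double_threshold(frame_probs, low, high):
    """Hysteresis: keep runs of frames >= low that contain at least one frame >= high."""
    n = len(frame_probs)
    active = [False] * n
    i = 0
    while i < n:
        if frame_probs[i] >= low:
            j = i
            while j < n and frame_probs[j] >= low:   # extend the run while it stays above `low`
                j += 1
            if max(frame_probs[i:j]) >= high:        # the run needs one confident frame
                for k in range(i, j):
                    active[k] = True
            i = j
        else:
            i += 1
    return active


def triple_threshold(clip_prob, frame_probs, clip_thr=0.5, low=0.2, high=0.75):
    """Assumed scheme: clip-level gate first, then double thresholding on the frames."""
    if clip_prob < clip_thr:                         # class judged absent from the clip
        return [False] * len(frame_probs)
    return double_threshold(frame_probs, low, high)


# Hypothetical outputs for one class.
frame_probs = [0.1, 0.3, 0.8, 0.4, 0.1, 0.3, 0.3, 0.1]
print(triple_threshold(0.9, frame_probs))   # confident clip: first run kept, weak second run dropped
print(triple_threshold(0.3, frame_probs))   # clip-level probability too low: nothing is reported
```

Under this reading, the clip-level gate keeps frame-level false alarms from producing events for classes the tagger rejects, which fits the stated goal of linking tagging and localization performance.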
References
[1]: Paper: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9335265
[2]: Source code: https://github.com/RicherMans/CDur