How Deep Learning Can Keep You Safe with Real-Time Crime Alerts
Citizen scans thousands of public first responder radio frequencies 24 hours a day in major cities across the US. The collected information is used to provide real-time safety alerts about incidents like fires, robberies, and missing persons to more than 5M users. Having humans listen to 1000+ hours of audio daily made it very challenging for the company to launch new cities. To continue scaling, we built ML models that could discover critical safety incidents from audio.
Our custom software-defined radios (SDRs) capture large swathes of radio frequency (RF) and create optimized audio clips that are sent to an ML model to flag relevant clips. The flagged clips are sent to operations analysts to create incidents in the app, and finally, users near the incidents are notified.
Figure 1. Safety alerts workflow (Image by Author)
Adapting a Public Speech-to-Text Engine to Our Problem Domain
Figure 2. Clip classifier using public speech-to-text engine (Image by Author)
We started with a top-performing speech-to-text engine, chosen based on word error rate (WER). Police use a lot of special codes that are not part of the normal vernacular; for example, an NYPD officer requests backup units by transmitting a “Signal 13”. We customized the vocabulary to our domain using speech contexts.
We also boosted some words to fit our domain; for example, “assault” isn't used colloquially but is very common in our use case. We had to bias our models towards detecting “assault” over “a salt”.
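As a rough illustration of what that customization looks like, here is a minimal sketch of passing domain phrases and a boost value to a cloud speech API. The article doesn't name the engine, so Google Cloud Speech-to-Text, the phrase list, and the boost value below are all assumptions.

```python
# Hypothetical sketch of biasing a cloud speech-to-text engine toward domain
# vocabulary. The provider (Google Cloud Speech-to-Text), phrase list, and
# boost value are assumptions made purely for illustration.
from google.cloud import speech

client = speech.SpeechClient()

# Phrase hints nudge recognition toward police codes and incident words,
# e.g. preferring "assault" over "a salt".
context = speech.SpeechContext(
    phrases=["Signal 13", "assault", "robbery", "missing person"],
    boost=15.0,  # relative weighting, tuned empirically
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    speech_contexts=[context],
)

with open("clip.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

for result in client.recognize(config=config, audio=audio).results:
    print(result.alternatives[0].transcript)
```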
After tuning the parameters, we were able to get reasonable accuracy for transcriptions in some cities. The next step was to use the transcribed data of the audio clips and figure out which ones were relevant to Citizen.
Binary Classifier Based on Transcriptions and Audio Features
We modeled a binary classification problem with the transcriptions as input and a confidence level as output. XGBoost gave us the best performance on our dataset.
We had insight from someone who previously worked in law enforcement that radio transmissions about major incidents in some cities are preceded by special alert tones to get the attention of police on the ground. This extra feature helped make our model more reliable, especially in cases of bad transcriptions. Some other useful features we found were the police channel and transmission IDs.
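A minimal sketch of the kind of classifier described above, combining bag-of-words features from the transcription with the auxiliary signals mentioned (alert tone, channel). The toy data, feature names, and hyperparameters are illustrative assumptions, not the production setup.

```python
# Sketch of the transcription-based relevance classifier. Toy data, feature
# names, and hyperparameters are illustrative assumptions.
import pandas as pd
import xgboost as xgb
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

clips = pd.DataFrame({
    "transcription": ["signal 13 shots fired", "routine traffic stop",
                      "robbery in progress", "meal break requested"],
    "has_alert_tone": [1, 0, 1, 0],   # special alert tone preceded the audio
    "channel_id": [3, 7, 3, 7],       # which police channel it came from
    "is_relevant": [1, 0, 1, 0],      # analyst-provided label
})

# Text features from the transcription plus the auxiliary clip metadata.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
text_features = vectorizer.fit_transform(clips["transcription"])
aux_features = csr_matrix(clips[["has_alert_tone", "channel_id"]].astype(float).values)
X = hstack([text_features, aux_features]).tocsr()

clf = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X, clips["is_relevant"])

# The predicted probability acts as the confidence level used to flag clips.
confidence = clf.predict_proba(X)[:, 1]
```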
We A/B tested the ML model in the operations workflow. After a few days of running the test, we saw no degradation in the incidents created by analysts who were working only from the model-flagged clips.
We launched the model in a few cities. Now a single analyst could handle multiple cities at once, which wasn’t previously possible! With the new spare capacity on operations, we were able to launch multiple new cities.
Figure 3. Model rollout leading to a significant reduction in audio for analysts (Image by Author)
Beyond a Public Speech-to-Text Engine
The model didn't turn out to be a panacea for all our problems. We could only use it in a few cities that had good-quality audio. Public speech-to-text engines are trained on phone audio, which has a different acoustic profile than radio; as a result, the transcription quality was sometimes unreliable. Transcriptions were completely unusable for the older analog systems, which were very noisy.
We tried multiple models from multiple providers, but none of them was trained on an acoustic profile similar to our dataset, and none could handle the noisy audio.
We explored replacing the speech-to-text engine with one trained on our own data while keeping the rest of the pipeline the same. However, that required several hundred hours of transcribed audio, which was very slow and expensive to generate. We had the option of streamlining the process by transcribing only the “important” words defined in our vocabulary and leaving blanks for the irrelevant words, but that was still only an incremental reduction in effort.
Eventually, we decided to build a custom speech processing pipeline for our problem domain.
Convolutional Neural Network for Keyword Spotting
Since we only care about the presence of keywords, we didn’t need to find the right order of words and could reduce our problem to keyword spotting. That was a much easier problem to solve! We decided to do so using a convolutional neural network (CNN) trained on our dataset.
Using CNNs over recurrent neural networks (RNNs) or long short-term memory (LSTM) models meant that we could train much faster and iterate more quickly. We also evaluated the Transformer model, which is massively parallel but requires a lot of hardware to run. Since we were only looking for short-term dependencies between audio segments to detect words, a computationally simple CNN seemed a better choice than a Transformer, and it freed up hardware so we could be more aggressive with hyperparameter tuning.
Figure 4. Clip flagging model with a CNN for keyword spotting (Image by Author)
We split the audio clips into fixed-duration subclips and gave a positive label to a subclip if a vocabulary word was present. We then marked an audio clip as useful if any such subclip was found in it. During training, we tested how varying the subclip duration affected convergence. Long subclips made it much harder for the model to figure out which portion of the clip was useful, and also harder to debug. Short subclips meant that words were split across multiple subclips, which made them harder for the model to identify. We were able to tune this hyperparameter and find a reasonable duration.
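A small sketch of how that subclip labeling could work, assuming keyword annotations are given as (start, end) spans in seconds; the 2-second subclip duration below is only a placeholder for the hyperparameter we tuned.

```python
# Sketch of splitting a clip into fixed-duration subclips and labeling each
# subclip positive if an annotated keyword span overlaps it. The 2 s duration
# is a placeholder for the tuned hyperparameter.
import numpy as np

def label_subclips(audio, sr, keyword_spans, subclip_sec=2.0):
    """audio: 1-D waveform; sr: sample rate in Hz;
    keyword_spans: list of (start_sec, end_sec) for annotated vocabulary words."""
    hop = int(subclip_sec * sr)
    subclips, labels = [], []
    for start in range(0, len(audio), hop):
        chunk = audio[start:start + hop]
        if len(chunk) < hop:                        # pad the final subclip
            chunk = np.pad(chunk, (0, hop - len(chunk)))
        t0, t1 = start / sr, (start + hop) / sr
        has_keyword = any(s < t1 and e > t0 for s, e in keyword_spans)
        subclips.append(chunk)
        labels.append(int(has_keyword))
    return np.stack(subclips), np.array(labels)

# The whole clip is marked useful if any of its subclips is labeled positive.
```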
For each subclip, we converted the audio into MFCC coefficients and added the first- and second-order derivatives, generating the features with a frame size of 25 ms and a stride of 10 ms. The features were then fed into a neural network built on the Keras Sequential model with a TensorFlow backend. The first layer is a Gaussian noise layer, which makes the model more robust to noise differences between radio channels. We tried an alternative approach of artificially overlaying real noise onto clips, but that slowed down training significantly with no meaningful performance gains.
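A sketch of that feature extraction using librosa. The 25 ms frame and 10 ms stride come from the text; the 16 kHz sample rate and 13 coefficients are assumptions.

```python
# Sketch of MFCC extraction with first- and second-order deltas. Frame size
# and stride come from the text; sample rate and n_mfcc are assumptions.
import numpy as np
import librosa

def extract_features(waveform, sr=16000, n_mfcc=13):
    mfcc = librosa.feature.mfcc(
        y=waveform, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),       # 25 ms frame
        hop_length=int(0.010 * sr),  # 10 ms stride
    )
    delta1 = librosa.feature.delta(mfcc, order=1)
    delta2 = librosa.feature.delta(mfcc, order=2)
    # One feature vector per frame: shape (num_frames, 3 * n_mfcc).
    return np.concatenate([mfcc, delta1, delta2], axis=0).T
```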
We then added blocks of Conv1D, BatchNormalization, and MaxPooling1D layers. Batch normalization helped with model convergence, and max pooling helped make the model more robust to minor variations in speech and to channel noise. We also tried adding dropout layers, but they didn't improve the model meaningfully. Finally, we added a densely connected layer that fed into a single-unit output layer with sigmoid activation.
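A minimal Keras sketch of that architecture. The layer types (GaussianNoise, Conv1D, BatchNormalization, MaxPooling1D, dense layers, sigmoid output) follow the text; the filter counts, kernel sizes, noise level, and input shape are illustrative guesses.

```python
# Minimal sketch of the keyword-spotting CNN. Layer types follow the text;
# filter counts, kernel sizes, noise stddev, and input shape are guesses.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(num_frames=200, num_features=39):
    model = models.Sequential([
        layers.Input(shape=(num_frames, num_features)),
        # Gaussian noise for robustness to noise differences between channels.
        layers.GaussianNoise(0.1),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        # Single sigmoid output: probability that the subclip contains a keyword.
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.Recall(), tf.keras.metrics.Precision()])
    return model

model = build_model()
model.summary()
```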
Generating Labeled Data
Figure 5. Labeling process for audio clips (Image by Author)
To label the training data, we gave annotators the list of keywords for our domain and asked them to mark the start and end positions within a clip, along with the word label, whenever a vocabulary word was present.
To ensure the annotations were reliable, we had a 10% overlap across annotators and measured how they performed on the overlapping clips. Once we had ~50 hours of labeled data, we started the training process, and we kept collecting more data while iterating on it.
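A small sketch of the reliability check on the overlapped clips. The article only says performance on the overlap was calculated; using percent agreement and Cohen's kappa here is our own choice.

```python
# Sketch of checking annotator reliability on the ~10% of clips labeled twice.
# Percent agreement / Cohen's kappa are our choice of metrics, not the article's.
from sklearn.metrics import cohen_kappa_score

def annotator_agreement(labels_a, labels_b):
    """Per-subclip binary labels from two annotators on the same overlapped clips."""
    percent = sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
    return percent, cohen_kappa_score(labels_a, labels_b)

print(annotator_agreement([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))
```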
Since some words in our vocabulary were much more common than others, our model performed reasonably on the common words but struggled with the rarer ones that had fewer examples. We tried creating artificial examples of those words by overlaying the word utterance onto other clips; however, the performance gains were not commensurate with actually getting labeled data for those words. Eventually, as our model improved on common words, we ran it on unlabeled audio clips and excluded the ones in which it found those words. That helped us cut down on redundant words in our future labeling.
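A sketch of that filtering step: score unlabeled clips with the interim model and set aside the ones where it is already confident a keyword (mostly the common words it has mastered) is present, so annotation effort goes to the rarer vocabulary. The confidence threshold is a placeholder.

```python
# Sketch of pre-filtering unlabeled clips before annotation: clips the interim
# model already scores as containing a keyword are excluded, concentrating
# labeling effort on rarer vocabulary. The 0.9 threshold is a placeholder.
import numpy as np

def select_clips_for_labeling(model, clip_features, threshold=0.9):
    """clip_features: array of shape (num_clips, num_frames, num_features)."""
    scores = model.predict(clip_features).ravel()
    # Keep only clips where the model is not yet confident, i.e. likely to
    # contain rare words or no keywords at all.
    return np.flatnonzero(scores < threshold)
```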
Model Launch
After several iterations of data collection and hyperparameter tuning, we were able to train a model with high recall on our vocabulary words and reasonable precision. High recall was very important to capture critical safety alerts. The flagged clips are always listened to before an alert is sent, so false positives were not a huge concern.
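Since flagged clips are reviewed by an analyst before any alert goes out, the operating point can be pushed toward recall. Here is a sketch of choosing the decision threshold that way on a validation set; the 0.95 recall target is a placeholder.

```python
# Sketch of picking the decision threshold with the best precision among those
# that still meet a recall target. The 0.95 target is a placeholder.
from sklearn.metrics import precision_recall_curve

def threshold_for_recall(y_true, y_scores, min_recall=0.95):
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    # precision/recall have one more element than thresholds; drop the last point.
    viable = [(t, p) for t, p, r in zip(thresholds, precision[:-1], recall[:-1])
              if r >= min_recall]
    # Among thresholds meeting the recall target, take the most precise one.
    return max(viable, key=lambda tp: tp[1])[0]
```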
We A/B tested the model in some boroughs of New York City. It was able to cut down audio volume by 50–75% (depending on the channel). It also clearly outperformed our earlier model built on the public speech-to-text engine, since NYC's audio is very noisy due to its analog systems.
Somewhat surprisingly, we then found that the model transferred well to audio from Chicago even though it was trained on NYC data. After collecting a few hours of Chicago clips, we were able to transfer-learn from the NYC model to get reasonable performance in Chicago.
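A sketch of what that transfer-learning step might look like. Freezing the convolutional layers and fine-tuning only the dense head is an assumption (the article doesn't say which layers were retrained), and the file path and datasets are hypothetical.

```python
# Sketch of transfer-learning the NYC keyword spotter to Chicago. Freezing the
# conv layers is an assumption; the model path and datasets are hypothetical.
import tensorflow as tf

nyc_model = tf.keras.models.load_model("keyword_spotter_nyc.h5")

for layer in nyc_model.layers:
    # Keep the learned audio feature detectors; retrain only the dense head.
    layer.trainable = isinstance(layer, tf.keras.layers.Dense)

nyc_model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.Recall()])

# chicago_train / chicago_val: tf.data.Datasets of (features, label) pairs built
# from the few hours of labeled Chicago clips (assumed to exist).
nyc_model.fit(chicago_train, validation_data=chicago_val, epochs=5)
```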
Conclusion
Our speech processing pipeline with the custom deep neural network was broadly applicable to police audio from major US cities. It discovered critical safety incidents from the audio, allowing Citizen to expand rapidly into cities across the country and serve the mission of keeping communities safe.
Picking a computationally simple CNN architecture over RNNs, LSTMs, or Transformers, and simplifying our labeling process, were the major breakthroughs that allowed us to outperform public speech-to-text models in a very short time and with limited resources.
Translated from: https://towardsdatascience.com/how-deep-learning-can-keep-you-safe-with-real-time-crime-alerts-95778aca5e8a