Sign Language Recognition Using Deep Learning
TL;DR: This post presents a dual-camera first-person vision translation system for sign language built with convolutional neural networks. A prototype was developed to recognize 24 gestures. The vision system is composed of a head-mounted camera and a chest-mounted camera, and the machine learning model is composed of two convolutional neural networks, one for each camera.
Introduction
Sign language recognition is a problem that has been addressed in research for years. However, we are still far from having a complete solution available in our society.
Among the works developed to address this problem, most have been based on two approaches: contact-based systems, such as sensor gloves, or vision-based systems, using only cameras. The latter is far cheaper, and the boom of deep learning makes it even more appealing.
This post presents a prototype of a dual-cam first-person vision translation system for sign language using convolutional neural networks. The post is divided into three main parts: the system design, the dataset, and the deep learning model training and evaluation.
VISION SYSTEM
Vision is a key factor in sign language: every sign language is intended to be understood by a person located in front of the signer, because from this perspective a gesture is completely observable. Viewing a gesture from another perspective makes it difficult or almost impossible to understand, since not every finger position and movement will be observable.
Trying to understand sign language from a first-person perspective has the same limitation: some gestures end up looking the same. This ambiguity can be solved by placing more cameras in different positions. In this way, what one camera can't see can be perfectly observable by another camera.
The vision system is composed of two cameras: a head-mounted camera and a chest-mounted camera. With these two cameras we obtain two different views of a sign, a top view and a bottom view, that work together to identify signs.
Sign corresponding to the letter Q in the Panamanian Sign Language from a top view and a bottom view perspective (image by author)

Another benefit of this design is that the user gains autonomy. This is something not achieved in classical approaches, in which the user is not the person with the disability but a third person who needs to take out a system with a camera and point it at the signer while the sign is being performed.
DATASET
To develop the first prototype of this system, a dataset of 24 static signs from the Panamanian Manual Alphabet was used.
(SENADIS, Lengua de Señas Panameñas)

To model this problem as an image recognition problem, dynamic gestures such as the letters J, Z, RR, and Ñ were discarded because of the extra complexity they add to the solution.
Data collection and preprocessing
To collect the dataset, four users were asked to wear the vision system and perform every gesture for 10 seconds while both cameras were recording at a 640x480 pixel resolution.
The users were asked to perform this process in three different scenarios: indoors, outdoors, and against a green background. For the indoor and outdoor scenarios, the users were asked to move around while performing the gestures in order to obtain images with different backgrounds, light sources, and positions. The green background scenario was intended for a data augmentation process described later.
After obtaining the videos, the frames were extracted and reduced to a 125x125 pixel resolution.
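As a reference, here is a minimal sketch of that extraction step, assuming the recordings are ordinary video files readable by OpenCV; the file paths and output layout are hypothetical.

```python
import os
import cv2

def extract_frames(video_path, out_dir, size=(125, 125)):
    """Extract every frame from a 640x480 recording and downscale it."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # INTER_AREA is a sensible interpolation choice when shrinking images
        small = cv2.resize(frame, size, interpolation=cv2.INTER_AREA)
        cv2.imwrite(os.path.join(out_dir, f"frame_{idx:05d}.png"), small)
        idx += 1
    cap.release()

extract_frames("recordings/sign_Q_top.mp4", "dataset/top/Q")  # hypothetical paths
```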
From left to right: green background scenario, indoors, and outdoors (image by author)

Data augmentation
Since the preprocessing before the convolutional neural networks was simplified to just rescaling, the background always gets passed to the model. In this case, the model needs to be able to recognize a sign despite the different backgrounds it can have.
To improve the generalization capability of the model, more images were artificially generated by replacing the green backgrounds with different backgrounds. This way, more data is obtained without investing too much time.
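A minimal chroma-key sketch of this background replacement is shown below; the green thresholds are assumptions that would need tuning per recording.

```python
import cv2
import numpy as np

def replace_green_background(frame_bgr, background_bgr):
    """Swap green-screen pixels for pixels from a new background image."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Hue ~35-85 covers typical greens on OpenCV's 0-179 hue scale (assumed range)
    mask = cv2.inRange(hsv, (35, 60, 60), (85, 255, 255))
    background = cv2.resize(background_bgr,
                            (frame_bgr.shape[1], frame_bgr.shape[0]))
    out = frame_bgr.copy()
    out[mask > 0] = background[mask > 0]
    return out
```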
Images with new backgrounds (image by author)

During training, another data augmentation process was added, consisting of transformations such as rotations, changes in light intensity, and rescaling.
Variations in rotation, light intensity, and rescaling (image by author)

These two data augmentation processes were chosen to help improve the generalization capability of the model.
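These training-time transformations map naturally onto Keras' ImageDataGenerator; the parameter values below are illustrative, not the ones used in the original work.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale=1.0 / 255,            # normalize pixel values
    rotation_range=15,            # small random rotations (degrees)
    brightness_range=(0.7, 1.3),  # random light-intensity changes
    zoom_range=0.1,               # random rescaling
)

train_flow = datagen.flow_from_directory(
    "dataset/bottom",             # hypothetical directory layout
    target_size=(125, 125),
    batch_size=32,
    class_mode="categorical",
)
```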
Top view and bottom view datasets
This problem was modeled as a multiclass classification problem with 24 classes, and the problem itself was divided into two smaller multiclass classification problems.
To decide which gestures would be classified with the top view model and which ones with the bottom view model, all the gestures that looked too similar from the bottom view perspective were assigned to the top view model, and the rest of the gestures were assigned to the bottom view model. So basically, the top view model was used to solve ambiguities.
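The post does not spell out how the two models' predictions are merged at inference time. Since the two label sets are disjoint, one plausible scheme, sketched below under that assumption, is to run both models and keep the more confident prediction.

```python
import numpy as np

def predict_sign(bottom_model, top_model, bottom_img, top_img,
                 bottom_labels, top_labels):
    """Run both views and keep the most confident prediction.

    Assumes each model outputs softmax probabilities over its own
    (disjoint) subset of the 24 signs.
    """
    b = bottom_model.predict(bottom_img[np.newaxis], verbose=0)[0]
    t = top_model.predict(top_img[np.newaxis], verbose=0)[0]
    if b.max() >= t.max():
        return bottom_labels[int(b.argmax())]
    return top_labels[int(t.argmax())]
```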
As a result, the dataset was divided into two parts, one for each model as shown in the following table.
DEEP LEARNING MODEL
As the state-of-the-art technology, convolutional neural networks were the option chosen for facing this problem. Two models were trained: one model for the top view and one for the bottom view.
Architecture
The same convolutional neural network architecture was used for both the top view and the bottom view models; the only difference is the number of output units.
The architecture of the convolutional neural networks is shown in the following figure.
Convolutional neural network architecture

To improve the generalization capability of the models, dropout was applied between the layers of the fully connected block.
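The exact layer configuration lives in the figure above, which is not reproduced here; as a stand-in, here is a generic sketch with the two properties the text does state, 125x125 inputs and dropout in the fully connected block. Filter counts and layer sizes are assumptions.

```python
from tensorflow.keras import layers, models

def build_model(num_classes):
    model = models.Sequential([
        layers.Input(shape=(125, 125, 3)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),  # dropout in the fully connected block
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# The two models differ only in their number of output units:
# bottom_model = build_model(num_classes=...)  # bottom-view signs
# top_model = build_model(num_classes=...)     # top-view signs
```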
Evaluation
The models were evaluated on a test set with data corresponding to normal indoor use of the system; in other words, a person acting as the observer appears in the background, similar to the input image in the figure above (Convolutional neural network architecture). The results are shown below.
Although the model learned to classify some signs, such as Q, R, and H, the results are in general somewhat discouraging. It seems that the generalization capability of the models wasn't too good. However, the model was also tested with real-time data, showing the potential of the system.
The bottom view model was tested with real-time video against a uniform green background. I wore the chest-mounted camera, capturing video at 5 frames per second, while running the bottom view model on my laptop and trying to fingerspell the word fútbol (my favorite sport, in Spanish). The entry for every letter was emulated by a click. The results are shown in the following video.
Demo video of the bottom view model running with real-time video

Note: Due to the model performance, I had to repeat it several times until I ended up with a good demo video.
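For reference, here is a rough sketch of what such a demo loop could look like, reusing the hypothetical bottom_model and bottom_labels from the earlier sketches; the commit key (spacebar instead of a mouse click) and the display details are assumptions.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture(0)  # chest-mounted camera
word = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    img = cv2.resize(frame, (125, 125)) / 255.0
    probs = bottom_model.predict(img[np.newaxis], verbose=0)[0]
    letter = bottom_labels[int(probs.argmax())]
    cv2.putText(frame, f"{letter} -> {''.join(word)}", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    cv2.imshow("demo", frame)
    key = cv2.waitKey(200)   # ~5 frames per second
    if key == ord(" "):      # commit the current letter to the word
        word.append(letter)
    elif key == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```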
Conclusions
Sign language recognition is a hard problem if we consider all the possible combinations of gestures that a system of this kind needs to understand and translate. That being said, probably the best way to solve this problem is to divide it into simpler problems, and the system presented here would correspond to a possible solution to one of them.
The system didn't perform too well, but it was demonstrated that a first-person sign language translation system can be built using only cameras and convolutional neural networks.
It was observed that the model tends to confuse several signs with each other, such as U and W. But thinking a bit about it, maybe it doesn't need to have a perfect performance, since using an orthography corrector or a word predictor would increase the translation accuracy.
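A tiny illustration of that point: even with per-letter confusions (say, U read as W), a dictionary-based corrector can often recover the intended word. The vocabulary here is a toy assumption.

```python
import difflib

VOCAB = ["futbol", "hola", "gracias", "futuro"]  # hypothetical vocabulary

def correct(fingerspelled):
    """Return the closest dictionary word, or the input if nothing is close."""
    match = difflib.get_close_matches(fingerspelled, VOCAB, n=1, cutoff=0.6)
    return match[0] if match else fingerspelled

print(correct("fwtbol"))  # -> "futbol" despite the misread letter
```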
The next step is to analyze the solution and study ways to improve the system. Some improvements could come from collecting more quality data, trying more convolutional neural network architectures, or redesigning the vision system.
End words
I developed this project as part of my thesis work in university, and I was motivated by the feeling of working on something new. Although the results weren't too great, I think it can be a good starting point for building a better and bigger system.
If you are interested in this work, here is the link to my thesis (it is written in Spanish)
Thanks for reading!
Source: https://towardsdatascience.com/sign-language-recognition-using-deep-learning-6549268c60bd