Paper Reading Notes - Facial Emotion Recognition Using Deep Learning: Review and Insights
Table of Contents
- Abstract
- Introduction
- Facial Available Databases
- Facial Emotion Recognition Using Deep Learning
- Discussion and Comparison
- Conclusion and Future Work
- Analysis and Summary
Abstract
- Automatic emotion recognition based on facial expressions is an active research field that has been presented and applied in several areas such as safety, health, and human-machine interfaces. Researchers in this field are interested in developing techniques to interpret and encode facial expressions and to extract these features so that computers can make better predictions. With the remarkable success of deep learning, the different architectures of this technique are being exploited to achieve better performance. The purpose of this paper is to survey recent work on automatic facial emotion recognition (FER) via deep learning. We highlight the contributions treated and the architectures and databases used, and we present the progress made by comparing the proposed methods and the results obtained. The interest of this paper is to serve and guide researchers by reviewing recent work and providing insights for improving this field.
- Summary: this paper surveys deep-learning-based facial emotion recognition, describing the contributions, architectures, and databases of recent work and comparing the proposed methods and their results.
Introduction
-
Automatic emotion recognition is a large and important research area that addresses two different subjects: psychological human emotion recognition and artificial intelligence (AI). The emotional state of humans can be obtained from verbal and non-verbal information captured by various sensors, for example from facial changes [1], tone of voice [2], and physiological signals [3]. In 1967, Mehrabian [4] showed that 55% of emotional information was visual, 38% vocal, and 7% verbal. Facial changes during communication are the first signs that transmit the emotional state, which is why most researchers are very interested in this modality.
- Vocabulary: "address two subjects" (tackle two research topics); visual, vocal, verbal (the three channels of emotional information).
- Notes: why facial expression draws the most attention: per Mehrabian, 55% of emotional information is conveyed visually.
-
Extracting features from one face to another is a difficult and sensitive task, required for better classification. In 1978, Ekman and Friesen [5] were among the first scientists interested in facial expressions; they developed FACS (the Facial Action Coding System), in which facial movements are described by Action Units (AUs). They broke the human face down into 46 AUs, each coded by one or more facial muscles.
- Notes: FACS decomposes facial movement into 46 action units (AUs), each coded by one or more facial muscles.
-
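To make the AU idea concrete, here is a small sketch that maps a set of detected AUs to a prototypical emotion. The AU combinations below are commonly cited EMFACS-style prototypes, not a table from this paper, so treat both the numbers and the emotion names as illustrative assumptions.

```python
# Illustrative sketch: mapping detected Action Units (AUs) to prototypical
# emotions. These AU combinations are commonly cited EMFACS-style prototypes,
# not the exact coding from the paper.
AU_PROTOTYPES = {
    "happiness": {6, 12},        # cheek raiser + lip corner puller
    "sadness":   {1, 4, 15},     # inner brow raiser + brow lowerer + lip corner depressor
    "surprise":  {1, 2, 5, 26},  # brow raisers + upper lid raiser + jaw drop
    "anger":     {4, 5, 7, 23},  # brow lowerer + lid raiser/tightener + lip tightener
}

def classify_aus(detected):
    """Return the first prototype whose AUs are all present, else 'unknown'."""
    for emotion, aus in AU_PROTOTYPES.items():
        if aus <= set(detected):
            return emotion
    return "unknown"
```

For example, `classify_aus({6, 12})` returns `"happiness"`, while an AU set matching no prototype falls through to `"unknown"`.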
According to statistics compiled by Philipp et al. [6], automatic FER is the most studied modality compared to the others, but it is not an easy task because each person expresses emotion in his or her own way. Several obstacles and challenges in this area should not be neglected, such as variations in head pose, illumination, age, gender, and background, as well as the occlusion problem caused by sunglasses, scarves, skin illness, etc.
- Vocabulary: occlusion (part of the face being hidden, e.g. by sunglasses or a scarf).
- Notes: the challenges of FER: everyone expresses emotion differently, and many confounding factors (pose, illumination, age, gender, background, occlusion) interfere.
-
Several traditional methods are used for extracting facial features, such as geometric and texture features, for example local binary patterns (LBP) [7], facial action units [5], local directional patterns (LDP) [8], and Gabor wavelets [9]. In recent years, deep learning has been a very successful and efficient approach thanks to architectures that allow automatic feature extraction and classification, such as the convolutional neural network (CNN) and the recurrent neural network (RNN); this is what prompted researchers to start using this technique to recognize human emotions. Researchers have put considerable effort into developing deep neural network architectures, which produce very satisfactory results in this area.
- Vocabulary: "exist" here reads as "existing", modifying "methods".
- Notes: several hand-crafted feature extraction methods exist (LBP, AUs, LDP, Gabor wavelets); deep learning, with its automatic feature extraction and classification via CNNs and RNNs, has pushed FER forward.
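Of the hand-crafted features listed above, LBP is simple enough to sketch directly. This is the classic 3x3 operator: each pixel is re-coded by thresholding its 8 neighbours against the centre value and packing the results into one byte.

```python
import numpy as np

# Minimal sketch of the classic 3x3 Local Binary Pattern (LBP) operator:
# each interior pixel is replaced by an 8-bit code, one bit per neighbour,
# set when that neighbour is >= the centre pixel.
def lbp_3x3(img):
    """img: 2-D grayscale array. Returns LBP codes for the interior pixels."""
    # neighbour offsets, clockwise from the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = img[1:-1, 1:-1]
    code = np.zeros_like(center, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = img[1 + dy:img.shape[0] - 1 + dy,
                    1 + dx:img.shape[1] - 1 + dx]
        code |= (neigh >= center).astype(np.uint8) << bit
    return code
```

A flat (all-zero) patch yields code 255 (every neighbour ties with the centre), while a bright centre surrounded by darker pixels yields 0; histograms of these codes over image regions form the LBP feature vector.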
-
In this paper, we provide a review of recent advances in sensing emotions by recognizing facial expressions with different deep learning architectures. We present recent results from 2016 to 2019 with an interpretation of the problems and contributions. The paper is organized as follows: in section two we introduce some available public databases; in section three we present a recent state of the art on FER using deep learning; sections four and five close with a discussion and comparisons, then a general conclusion with future work.
- Notes: structure of the paper: section two covers public databases, section three the state of the art in deep-learning FER, sections four and five the discussion and comparison, and the last section the conclusion and future work.
Facial Available Databases
- One of the success factors of deep learning is training the neural network with examples. Several FER databases are now available to researchers for this task, each differing from the others in the number and size of images and videos, and in the variations of illumination, population, and face pose. Some are presented in Table 1, and we will note their presence in the works cited in the following section.
- 深度學(xué)習(xí)的一個(gè)成功因素就是使用樣例去訓(xùn)練神經(jīng)元網(wǎng)絡(luò),一些表情識(shí)別的數(shù)據(jù)庫(kù)可以讓研究者利用來(lái)完成這個(gè)任務(wù),每一個(gè)在數(shù)量和圖片以及視頻的大小上都是彼此不同的,他們的遮擋度,人口密度和面部姿態(tài)也是不同的。下面的table1展示了一些,在后面的幾個(gè)部分中,在我們引用的時(shí)候可以注意到他們的存在。
Facial Emotion Recognition Using Deep Learning
-
Despite the notable success of traditional facial recognition methods based on the extraction of handcrafted features, over the past decade researchers have turned to the deep learning approach due to its high automatic recognition capacity. In this context, we present some recent FER studies that propose deep learning methods for better detection, trained and tested on several static or sequential databases.
- Vocabulary: "in this context" (against this background); "directed to the deep learning approach" (turned toward deep learning).
- Notes: deep learning's automatic recognition performance is high, so it has become the main research direction for FER.
-
Mollahosseini et al. [23] propose a deep CNN for FER across several available databases. After extracting the facial landmarks from the data, the images are reduced to 48x48 pixels. Then they apply a data augmentation technique. The architecture consists of two convolution-pooling layers followed by two inception-style modules, which contain convolutional layers of size 1x1, 3x3, and 5x5. They demonstrate the network-in-network technique, which increases local performance thanks to convolution layers applied locally, and also reduces the over-fitting problem.
- Vocabulary: "across several available databases" (on multiple public databases); augmentation (enlarging the training set with transformed copies).
- Notes: Mollahosseini et al. resize images to 48x48 after landmark extraction, apply data augmentation, and use two convolution-pooling layers plus two inception modules (1x1, 3x3, and 5x5 convolutions); the network-in-network idea improves local performance and reduces over-fitting.
- To do: look further into Mollahosseini et al.'s CNN-based FER method.
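The inception module above can be understood at the shape level: with "same" padding, the parallel 1x1, 3x3, and 5x5 branches all preserve the spatial size, so the module simply concatenates their outputs along the channel axis. The branch channel counts in this sketch are made up for illustration; the paper does not give them here.

```python
# Shape-level sketch of an inception-style module: parallel 1x1, 3x3 and 5x5
# convolution branches whose outputs are concatenated along channels.
# Branch channel counts are illustrative assumptions, not the paper's values.
def conv_out_size(n, kernel, stride=1, pad=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * pad - kernel) // stride + 1

def inception_output_shape(h, w, branch_channels=(16, 32, 8)):
    """With 'same' padding (pad = k // 2, stride 1) every branch keeps HxW,
    so the module just stacks channels: C_out = sum of branch channels."""
    for k, pad in [(1, 0), (3, 1), (5, 2)]:
        assert conv_out_size(h, k, pad=pad) == h  # 'same' padding check
    return (sum(branch_channels), h, w)
```

For a 12x12 feature map with the assumed branch widths, the module outputs a (56, 12, 12) tensor: the spatial grid is untouched and only the channel count grows.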
-
Lopes et al. [24] studied the impact of data pre-processing before training the network in order to obtain a better emotion classification. Data augmentation, rotation correction, cropping, down-sampling to 32x32 pixels, and intensity normalization are the steps applied before a CNN consisting of two convolution-pooling layers ending with two fully connected layers of 256 and 7 neurons. The best weights obtained at the training stage are used at the test stage. This experiment was evaluated on three accessible databases: CK+, JAFFE, and BU-3DFE. The researchers show that combining all of these pre-processing steps is more effective than applying them separately.
- Vocabulary: data augmentation; rotation correction; cropping; down-sampling; intensity normalization.
- Notes: Lopes et al. also use a CNN, but add a pre-processing pipeline (augmentation, rotation correction, cropping, 32x32 down-sampling, intensity normalization) before training; combining all the steps improves recognition more than applying any of them alone.
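The pre-processing chain above can be sketched in a few lines of numpy. This is a hedged approximation: it assumes rotation correction has already happened, crops a central square, down-samples by striding rather than proper interpolation, and normalizes intensity to zero mean and unit variance. The exact methods and parameters in the paper may differ.

```python
import numpy as np

# Hedged sketch of the Lopes et al. pre-processing pipeline: crop,
# down-sample to 32x32, intensity-normalise, flip-augment.
# Illustrative only; the paper's exact operations may differ.
def preprocess(face, out=32):
    """face: 2-D grayscale array, assumed already rotation-corrected."""
    h, w = face.shape
    s = min(h, w)                          # crop: central square
    top, left = (h - s) // 2, (w - s) // 2
    face = face[top:top + s, left:left + s]
    step = s // out                        # crude down-sampling by striding
    face = face[::step, ::step][:out, :out].astype(np.float64)
    face = (face - face.mean()) / (face.std() + 1e-8)  # intensity normalisation
    return face

def augment(face):
    """Minimal augmentation: the image plus its horizontal mirror."""
    return [face, face[:, ::-1]]
```

Running `preprocess` on any grayscale face larger than 32x32 yields a 32x32 array with roughly zero mean, and `augment` doubles the training examples per image.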
-
These pre-processing techniques were also implemented by Mohammadpour et al. [25]. They propose a novel CNN for detecting AUs of the face. The network uses two convolution layers, each followed by max pooling, and ends with two fully connected layers that indicate the numbers of the activated AUs.
- Vocabulary: novel (as an adjective: new, original).
- Notes: Mohammadpour et al. build a CNN dedicated to detecting which AUs are active, rather than classifying emotions directly.
-
In 2018, to address the vanishing or exploding gradient problem, Cai et al. [26] proposed a novel CNN architecture with Sparse Batch normalization (SBP). The property of this network is to use two successive convolution layers at the beginning, followed by max pooling then SBP; to reduce the over-fitting problem, dropout is applied in the middle of the three fully connected layers. For the facial occlusion problem, Li et al. [27] present a new CNN method: the data is first fed into a VGGNet, then a CNN with an attention mechanism (ACNN) is applied. This architecture was trained and tested on three large databases: FED-RO, RAF-DB, and AffectNet.
- Vocabulary: attention mechanism.
- Notes:
- Cai et al.: two successive convolution layers, max pooling, SBP, and dropout between the fully connected layers to curb over-fitting, targeting vanishing/exploding gradients.
- Li et al.: VGGNet features followed by an attention-based CNN (ACNN) to handle occlusion; trained and tested on FED-RO, RAF-DB, and AffectNet.
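The dropout trick Cai et al. rely on is simple enough to show directly. This is the standard "inverted dropout" formulation: at training time each activation is kept with probability `keep_p` and rescaled, so the expected activation is unchanged and test time needs no extra scaling. It is a sketch of the general technique, not of Cai et al.'s specific layers.

```python
import numpy as np

# Standard inverted dropout: randomly zero activations during training and
# rescale the survivors, leaving the expected activation unchanged.
def dropout(x, keep_p=0.5, train=True, rng=None):
    if not train:
        return x                        # test time: identity
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) < keep_p
    return x * mask / keep_p            # rescale the kept units
```

Applied to a vector of ones with `keep_p=0.8`, the output contains only 0 and 1.25, and its mean stays close to 1, which is exactly why no correction is needed at inference.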
-
Detection of the essential parts of the face was proposed by Yolcu et al. [28]. They used three CNNs with the same architecture, each detecting one part of the face, such as the eyebrow, eye, or mouth. Before the images are fed into the CNNs, they go through a cropping stage and facial key-point detection. The iconic face obtained, combined with the raw image, is fed into a second type of CNN to detect the facial expression. The researchers show that this method offers better accuracy than using the raw images or the iconized face alone (see Fig. 1.a).
- Notes: Yolcu et al. feed the iconized (feature-extracted) face together with the raw image into the expression CNN, which beats using either input alone. Architecture: three identical CNNs, each specialized in one facial part (eyebrow, eye, mouth), preceded by cropping and key-point detection.
-
In 2019, Agrawal and Mittal [29] studied the influence of varying CNN parameters on the recognition rate using the FER2013 database. First, all the images are resized to 64x64 pixels; then they vary the size and number of filters and the type of optimizer (Adam, SGD, AdaDelta) on a simple CNN containing two successive convolution layers, with the second layer playing the role of max pooling, followed by a softmax function for classification. Based on these studies, the researchers created two novel CNN models achieving average accuracies of 65.23% and 65.77%; the particularity of these models is that they contain neither fully connected layers nor dropout, and the same filter size is kept throughout the network.
- Vocabulary: fully connected layers; dropout.
- Notes: Agrawal and Mittal studied how CNN parameters (filter size and count, optimizer choice) affect the recognition rate on FER2013, and derived two models reaching 65.23% and 65.77% average accuracy that use no fully connected layers or dropout and keep one filter size throughout.
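The softmax classification head mentioned above turns the network's raw scores (logits) for the emotion classes into a probability distribution. A numerically stable version subtracts the maximum logit before exponentiating:

```python
import numpy as np

# Numerically stable softmax: converts raw class scores (logits) into a
# probability distribution over the emotion classes.
def softmax(logits):
    z = logits - np.max(logits)   # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

The output always sums to 1, and the predicted emotion is simply the argmax of the resulting probabilities.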
-
Deepak Jain et al. [30] propose a novel deep CNN containing two residual blocks, each with four convolution layers. The model is trained on the JAFFE and CK+ databases after a pre-processing step of cropping and normalizing the intensity of the images.
- Vocabulary: residual (here, residual blocks: shortcut connections that add the block's input to its output).
- Notes: Deepak Jain et al. deepen the network with two residual blocks of four convolution layers each, after cropping and intensity normalization.
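The residual idea behind those blocks is compact: the block computes y = x + F(x), so even an untrained inner transform cannot erase the input signal and gradients can flow through the shortcut. Here F is a toy linear stand-in for the four convolution layers, not the paper's exact block.

```python
import numpy as np

# Sketch of a residual connection: the block's output is its input plus the
# inner transform's output, y = x + F(x). `f` is a toy stand-in for the four
# convolution layers of the paper's block.
def residual_block(x, f):
    return x + f(x)

# toy inner transform: a small linear map standing in for conv layers
f = lambda x: 0.1 * x
```

With the zero transform the block reduces to the identity, which is what makes very deep stacks of such blocks trainable.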
-
Kim et al. [31] study the variation of facial expressions during an emotional state and propose a spatio-temporal architecture combining a CNN and an LSTM. First, the CNN learns the spatial features of the facial expression across all frames of the emotional state; an LSTM is then applied to preserve the whole sequence of these spatial features. Yu et al. [32] present a novel architecture called Spatio-Temporal Convolutional with Nested LSTM (STC-NLSTM), based on three deep learning sub-networks: a 3DCNN for extracting spatio-temporal features, followed by a temporal T-LSTM to preserve the temporal dynamics, then a convolutional C-LSTM to model the multi-level features.
- Vocabulary: spatial feature; spatio-temporal.
- Notes: Kim et al. combine a CNN (spatial features per frame) with an LSTM (the temporal sequence of those features); Yu et al.'s STC-NLSTM stacks a 3DCNN, a T-LSTM, and a C-LSTM for spatio-temporal modelling.
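To make the CNN-LSTM combination concrete, here is a single LSTM step in numpy: the per-frame CNN features x_t enter the cell, which carries a hidden state and a cell memory (h, c) across the frame sequence. The weights are random and the dimensions are illustrative; this sketches the mechanism, not a trained model from any cited paper.

```python
import numpy as np

# One step of a minimal LSTM cell: gates computed from the current input and
# previous hidden state update a cell memory c and hidden state h.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """x: (d_in,), h and c: (d,). W: (4d, d_in), U: (4d, d), b: (4d,)."""
    z = W @ x + U @ h + b
    d = h.size
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
    g = np.tanh(z[3*d:])
    c = f * c + i * g                  # update the cell memory
    h = o * np.tanh(c)                 # expose the new hidden state
    return h, c
```

Iterating `lstm_step` over the CNN features of each video frame is the essence of the spatio-temporal pipelines above: the CNN summarizes each frame, the LSTM summarizes the sequence.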
-
A deep convolutional BiLSTM architecture was proposed by Liang et al. [33]. They create two DCNNs, one designated for spatial features and the other for extracting temporal features from facial expression sequences; these features are fused into a 256-dimensional vector, and a BiLSTM network classifies it into one of the six basic emotions. For the pre-processing stage, they use a multitask cascaded convolutional network to detect the face, then apply data augmentation to broaden the database (see Fig. 1.b).
- Notes: Liang et al. fuse spatial and temporal DCNN features into one 256-dimensional vector and classify it with a BiLSTM; pre-processing uses a multitask cascaded CNN for face detection plus data augmentation.
-
All of the researchers cited previously classify the basic emotions: happiness, disgust, surprise, anger, fear, sadness, and neutral. Fig. 3 presents some of the different architectures proposed by the researchers mentioned above.
- Notes: all the works above target the six basic emotions plus neutral: happiness, disgust, surprise, anger, fear, sadness, and neutral (Fig. 3 shows the architectures).
Discussion and Comparison
-
In this paper, we clearly note the significant interest of researchers in deep-learning FER over recent years. The automatic FER task goes through several steps: data pre-processing, the proposed model architecture, and finally emotion recognition.
- Notes: automatic FER pipelines share three stages: data pre-processing, the model architecture, and the final emotion classification.
-
Pre-processing is an important step, present in all the papers cited in this review. It consists of several techniques such as resizing and cropping images to reduce training time, spatial and pixel-intensity normalization, and data augmentation to increase the diversity of the images and reduce the over-fitting problem. All these techniques are well presented by Lopes et al. [24].
- Vocabulary: the diversity of the images; reduce the over-fitting problem.
- Notes: data augmentation increases training-set diversity and reduces over-fitting.
-
Several methods and contributions presented in this review achieved high accuracy. Mollahosseini et al. [23] showed the performance gained by adding inception layers to the network; Mohammadpour et al. [25] prefer extracting AUs from the face to classifying the emotions directly; Li et al. [27] study the problem of occluded images; to get a deeper network, Deepak et al. [30] propose adding residual blocks; Yolcu et al. [28] show the advantage of adding the iconized face to the network input compared with training on the raw images alone; and Agrawal and Mittal [29] offer two new CNN architectures after an in-depth study of the impact of CNN parameters on the recognition rate. Most of these methods presented competitive results above 90% (see Table 2).
- Vocabulary: occluded image; residual blocks.
- Notes: a side-by-side comparison of each cited work's specialty; most report accuracy above 90% (Table 2).
-
For extracting spatio-temporal features, researchers proposed different deep learning structures such as combinations of CNN-LSTM, 3DCNN, and deep CNNs. According to the results obtained, the methods proposed by Yu et al. [32] and Liang et al. [33] achieve better precision, higher than 99%, compared to the method used by Kim et al. [31].
- Notes: on sequence data, Yu et al. and Liang et al. surpass Kim et al., reaching rates above 99%.
-
Researchers achieve high precision in FER by applying CNNs to spatial data; for sequential data, they use combinations of CNN and RNN, especially LSTM networks, which indicates that the CNN is the basic deep learning network for FER. For the CNN parameters, the softmax function and the Adam optimization algorithm are the most used. We also note that, to test the effectiveness of a proposed architecture, researchers train and test their model on several databases, and we clearly see that the recognition rate varies from one database to another with the same DL model (see Table 2).
- Notes: CNNs are the backbone for spatial data and CNN-RNN (LSTM) combinations for sequences; softmax and Adam dominate; the same model's recognition rate varies across databases (Table 2).
Conclusion and Future Work
- This paper presented recent research on FER, allowing us to know the latest developments in this area. We described the different CNN and CNN-LSTM architectures recently proposed by various researchers, and presented several databases, containing spontaneous images collected from the real world as well as images produced in laboratories (see Table 1), whose purpose is accurate detection of human emotions. We also presented a discussion showing the high rates obtained by researchers, which highlights that machines are becoming more capable of interpreting emotions, implying that human-machine interaction is becoming more and more natural.
- Notes: the review covers recent CNN and CNN-LSTM architectures and both in-the-wild and lab-collected databases (Table 1); the high recognition rates suggest machines will interpret emotions better and better, making human-machine interaction more natural.
- 生詞和詞組:
- spontaneous 自發(fā)的,自然的,天然產(chǎn)生的,無(wú)意識(shí)的
- spontaneous images 自然的圖片,天然的圖片
- highlight v.突出,強(qiáng)調(diào),醒目 n. 最好的部分
- FER is one of the most important ways of providing information about the emotional state, but current systems are limited to learning only the six basic emotions plus neutral. This conflicts with everyday life, where emotions are more complex. This will push future work toward building larger databases and creating powerful deep learning architectures that recognize all basic and secondary emotions. Moreover, emotion recognition has moved from unimodal analysis to complex multimodal systems. Pantic and Rothkrantz [36] show that multimodality is one of the conditions for ideal detection of human emotion. Researchers are now pushing to create powerful multimodal deep learning architectures and databases, for example the fusion of audio and visual cues studied by Zhang et al. [37], and the audio-visual and physiological modalities of Ringeval et al. [38].
- 表情識(shí)別是提供有關(guān)情緒狀態(tài)信息的一個(gè)十分重要的渠道。但是他們往往受學(xué)習(xí)表情數(shù)量的限制,表情數(shù)量往往只有六種還有一種中立表情。這就和人們每天生活所表現(xiàn)出來(lái)的相違背,因?yàn)榍榫w往往是更加復(fù)雜的。這將推動(dòng)研究者在未來(lái)工作中創(chuàng)建更大的數(shù)據(jù)庫(kù)和更有效的學(xué)習(xí)框架,去識(shí)別所有基礎(chǔ)表情和衍生的表情。除此之外,今天的表情識(shí)別已經(jīng)由單模態(tài)分析過(guò)渡到了復(fù)雜系統(tǒng)多模態(tài)。Pantic et Rothkrantz [36] 指出,多模態(tài)是人類情緒理想化識(shí)別的一種情況。研究者現(xiàn)在正在推進(jìn)他們的研究去創(chuàng)建和提供一個(gè)更有效的多模態(tài)的深度表情識(shí)別模型和數(shù)據(jù)庫(kù)。比如說(shuō), Zhang et al. [37] 研究的音頻和視頻的融合,Ringeval et al. [38]研究的音視頻和生理學(xué)相融合的模式。
- 生詞和詞組:
- unimodal analyssi 單模態(tài)分析
- complex system multimodal 復(fù)雜系統(tǒng)多態(tài)
- the fusion of audio 音頻的融合
- physiological 生理學(xué)
- psychology 心理學(xué)
Analysis and Summary
- After reading this survey I feel more confident, but problems remain: I have not yet followed up on many of the state-of-the-art papers it cites, and the review itself deserves a careful second reading.