An analytical review of audio-visual systems for detecting personal protective equipment on the human face
Since 2019, countries around the world have faced the rapid spread of the pandemic caused by the COVID-19 coronavirus infection, which the global community continues to fight to this day. Despite the evident effectiveness of personal respiratory protective equipment against coronavirus infection, many people neglect to wear protective face masks in public places. Monitoring compliance and promptly identifying violators of public health rules therefore requires modern information technologies that detect protective masks on people's faces from video and audio information. This article provides an analytical review of existing and emerging intelligent information technologies for bimodal analysis of the voice and facial characteristics of a person wearing a mask. There are many studies on mask detection from video images, and a significant number of publicly available corpora contain images of faces both with and without masks, collected in various ways. Research and development aimed at detecting personal respiratory protective equipment from the acoustic characteristics of human speech remains scarce, since this direction only began to develop during the pandemic caused by the COVID-19 coronavirus infection. Existing systems help prevent the spread of coronavirus infection by recognizing the presence or absence of face masks; such systems also assist in the remote diagnosis of COVID-19 by detecting the first symptoms of the viral infection from acoustic characteristics. However, a number of problems in the automatic diagnosis of COVID-19 symptoms and the detection of face masks remain unsolved. First of all, the accuracy of mask and coronavirus detection is low, which rules out automatic diagnosis without the involvement of experts (medical personnel). Many systems cannot operate in real time, which makes it impossible to monitor the wearing of protective masks in public places. Furthermore, most existing systems cannot be embedded in a smartphone so that users could test for coronavirus infection anywhere. Another major problem is collecting data from patients infected with COVID-19, since many people are unwilling to share confidential information
Neural network-based method for visual recognition of driver's voice commands using attention mechanism
Visual speech recognition, or automated lip-reading, systems are actively applied to speech-to-text translation. Video data prove useful in multimodal speech recognition systems, particularly when acoustic data are difficult to use or unavailable altogether. The main purpose of this study is to improve driver command recognition by analyzing visual information, thereby reducing touch interaction with various vehicle systems (multimedia and navigation systems, phone calls, etc.) while driving. We propose a method for automated lip-reading of the driver's speech while driving, based on a deep neural network with the 3DResNet18 architecture. Extending the architecture with a bidirectional LSTM model and an attention mechanism achieves higher recognition accuracy at the cost of a slight decrease in performance. Two variants of neural network architectures for visual speech recognition are proposed and investigated. With the first architecture, driver command recognition accuracy was 77.68%, which is 5.78 percentage points lower than with the second architecture, whose accuracy was 83.46%. System performance, measured by the real-time factor (RTF), was 0.076 for the first architecture and 0.183 for the second, more than twice as high. The proposed method was tested on data from the multimodal RUSAVIC corpus recorded in a car. The results of the study can be used in audio-visual speech recognition systems, which are recommended in high-noise conditions, for example, when driving a vehicle. In addition, the analysis performed allows us to choose the optimal neural network model for visual speech recognition for subsequent incorporation into an assistive system based on a mobile device
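To make the reported metrics concrete, here is a minimal PyTorch sketch of a visual speech recognizer in the spirit described above: a 3D-CNN front-end, a bidirectional LSTM, and attention pooling over time, plus a real-time factor (RTF) helper. All layer sizes, the single-channel input format, and the helper itself are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact network):
# 3D-CNN front-end + bidirectional LSTM + additive attention pooling.
import time
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, num_commands: int, hidden: int = 256):
        super().__init__()
        # 3D convolution over (time, height, width) of the mouth-region clip
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(5, 7, 7),
                      stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(32),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)   # additive attention scores
        self.head = nn.Linear(2 * hidden, num_commands)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, 1, frames, H, W) grayscale mouth crops
        feats = self.frontend(clips)                     # (B, C, T, H', W')
        feats = feats.mean(dim=(3, 4)).transpose(1, 2)   # (B, T, C)
        seq, _ = self.lstm(feats)                        # (B, T, 2*hidden)
        weights = torch.softmax(self.attn(seq), dim=1)   # (B, T, 1)
        pooled = (weights * seq).sum(dim=1)              # weighted summary
        return self.head(pooled)                         # command logits

def real_time_factor(model: nn.Module, clip: torch.Tensor,
                     fps: float = 25.0) -> float:
    """RTF = processing time / clip duration; RTF < 1 means real-time."""
    model.eval()
    start = time.perf_counter()
    with torch.no_grad():
        model(clip)
    elapsed = time.perf_counter() - start
    duration = clip.shape[2] / fps  # frames / frames-per-second
    return elapsed / duration
```

Under this definition, both reported values (0.076 and 0.183) are below 1, which is why the abstract can treat the slower, more accurate architecture as still usable in real time.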
Analysis of information and mathematical support for recognizing human affective states
This article presents an analytical review of research in the field of affective computing. This field is a component of artificial intelligence that studies methods, algorithms, and systems for analyzing human affective states during interaction with other people, computer systems, or robots. In intelligent data analysis, affect is understood as a manifestation of psychological reactions to a stimulating event, which may unfold over both short and long periods and vary in the intensity of the experience. Affects in this field are divided into four types: affective emotions, basic emotions, mood, and affective disorders. Affective states manifest themselves in verbal data and in non-verbal behavioral characteristics: the acoustic and linguistic characteristics of speech, facial expressions, gestures, and body postures. The review provides a comparative analysis of existing information support for the automatic recognition of human affective states, using emotion, sentiment, aggression, and depression as examples. The few Russian-language affective databases are still significantly inferior in volume and quality to electronic resources in other world languages, which necessitates considering a wide range of additional approaches, methods, and algorithms applicable under limited training and test data, and poses the tasks of developing new approaches to data augmentation, model transfer learning, and the adaptation of foreign-language resources. The article describes methods for analyzing unimodal visual, acoustic, and linguistic information, as well as multimodal approaches to the recognition of affective states. A multimodal approach to the automatic analysis of affective states improves recognition accuracy relative to unimodal solutions. The review notes a trend in current research: neural network methods are gradually displacing classical deterministic methods thanks to better recognition quality and efficient processing of large volumes of data. An advantage of multi-task hierarchical approaches is the ability to extract new types of knowledge, including knowledge about the influence, correlation, and interaction of several affective states with one another, which potentially leads to improved recognition quality. Potential requirements for affective state analysis systems under development and the main directions for further research are given
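As an aside on the multi-task hierarchical approaches mentioned above, the sketch below shows the basic pattern in PyTorch: a shared encoder feeds separate heads for several affective states, so joint training can exploit their correlations. The task set, layer sizes, and unweighted loss sum are illustrative assumptions rather than a system from the review.

```python
# Minimal multi-task affect sketch (illustrative assumptions): one shared
# encoder, one head per affective phenomenon, losses summed for joint training.
import torch
import torch.nn as nn

class MultiTaskAffectModel(nn.Module):
    def __init__(self, feat_dim: int = 512, n_emotions: int = 7,
                 n_sentiments: int = 3):
        super().__init__()
        # Shared representation learned from any modality's features
        self.shared = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.emotion = nn.Linear(256, n_emotions)
        self.sentiment = nn.Linear(256, n_sentiments)
        self.depression = nn.Linear(256, 1)  # binary indicator (logit)

    def forward(self, feats: torch.Tensor) -> dict:
        h = self.shared(feats)
        return {"emotion": self.emotion(h),
                "sentiment": self.sentiment(h),
                "depression": self.depression(h)}

# Joint training sums per-task losses, letting correlated states inform
# each other (random placeholder features and labels for a batch of 4):
model = MultiTaskAffectModel()
out = model(torch.randn(4, 512))
loss = (nn.functional.cross_entropy(out["emotion"], torch.randint(0, 7, (4,)))
        + nn.functional.cross_entropy(out["sentiment"],
                                      torch.randint(0, 3, (4,)))
        + nn.functional.binary_cross_entropy_with_logits(
              out["depression"].squeeze(1), torch.rand(4).round()))
loss.backward()
```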
A Multimodal User Interface for an Assistive Robotic Shopping Cart
This paper presents the research and development of a prototype of the assistive mobile information robot (AMIR). The main features of the presented prototype are voice and gesture-based interfaces, with Russian speech and sign language recognition and synthesis techniques, and a high degree of robot autonomy. The AMIR prototype is intended to serve as a robotic cart for shopping in grocery stores and supermarkets. Among the main topics covered in this paper are the presentation of the interface (three modalities), the single-handed gesture recognition system (based on a collected database of Russian sign language elements), and the technical description of the robotic platform (architecture, navigation algorithm). The use of multimodal interfaces, namely the speech and gesture modalities, makes human-robot interaction natural and intuitive, while sign language recognition allows hearing-impaired people to use this robotic cart. The AMIR prototype has promising prospects for real usage in supermarkets, both due to its assistive capabilities and its multimodal user interface
Audio-Visual Speech and Gesture Recognition by Sensors of Mobile Devices
Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable speech recognition, particularly when audio is corrupted by noise. Additional visual information can be used for both automatic lip-reading and gesture recognition. Hand gestures are a form of non-verbal communication and can be used as a very important part of modern human–computer interaction systems. Currently, audio and video modalities are easily accessible by sensors of mobile devices. However, there is no out-of-the-box solution for automatic audio-visual speech and gesture recognition. This study introduces two deep neural network-based model architectures: one for AVSR and one for gesture recognition. The main novelty regarding audio-visual speech recognition lies in fine-tuning strategies for both visual and acoustic features and in the proposed end-to-end model, which considers three modality fusion approaches: prediction-level, feature-level, and model-level. The main novelty in gesture recognition lies in a unique set of spatio-temporal features, including those that consider lip articulation information. As there are no available datasets for the combined task, we evaluated our methods on two different large-scale corpora—LRW and AUTSL—and outperformed existing methods on both audio-visual speech recognition and gesture recognition tasks. We achieved AVSR accuracy for the LRW dataset equal to 98.76% and gesture recognition rate for the AUTSL dataset equal to 98.56%. The results obtained demonstrate not only the high performance of the proposed methodology, but also the fundamental possibility of recognizing audio-visual speech and gestures by sensors of mobile devices
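The three modality fusion approaches named above can be illustrated with a short PyTorch sketch. The layer sizes, the use of a transformer layer for model-level fusion, and the assumption that both modality embeddings share one dimensionality are ours, not the paper's exact models.

```python
# Minimal sketch (illustrative assumptions) of prediction-level,
# feature-level, and model-level fusion of audio features `a` and
# visual features `v` for one utterance.
import torch
import torch.nn as nn

class FusionDemo(nn.Module):
    def __init__(self, dim: int = 256, n_classes: int = 500):
        super().__init__()
        self.cls_a = nn.Linear(dim, n_classes)        # audio-only classifier
        self.cls_v = nn.Linear(dim, n_classes)        # video-only classifier
        self.cls_feat = nn.Linear(2 * dim, n_classes) # feature-level head
        # Model-level: a shared layer in which the two modality tokens
        # interact before classification.
        self.mix = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                              batch_first=True)
        self.cls_model = nn.Linear(dim, n_classes)

    def forward(self, a: torch.Tensor, v: torch.Tensor) -> dict:
        # Prediction-level fusion: average the per-modality logits.
        pred_level = (self.cls_a(a) + self.cls_v(v)) / 2
        # Feature-level fusion: concatenate features, then classify.
        feat_level = self.cls_feat(torch.cat([a, v], dim=-1))
        # Model-level fusion: joint processing of both modality embeddings.
        tokens = torch.stack([a, v], dim=1)           # (B, 2, dim)
        model_level = self.cls_model(self.mix(tokens).mean(dim=1))
        return {"prediction": pred_level, "feature": feat_level,
                "model": model_level}
```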
An intelligent gesture-based user interface for controlling an assistive mobile information robot
This article presents a gesture-based user interface for a robotic shopping trolley. The trolley is designed as a mobile robotic platform that helps customers in shops and supermarkets. Among its main functions are navigating through the store, providing information on item availability and location, and transporting the items bought. One of the important features of the developed interface is the gestural modality, or, more precisely, a recognition system for Russian sign language elements. The interface design, as well as the interaction strategy, is presented in flowcharts, and an attempt was made to demonstrate the gestural modality as a natural part of an assistive information robot. In addition, a short overview of mobile robots is given in the paper, and a CNN-based gesture recognition technique is provided. The Russian sign language recognition option is of high importance due to the relatively large number of native speakers (signers)
A Review of Recent Advances on Deep Learning Methods for Audio-Visual Speech Recognition
This article provides a detailed review of recent advances in audio-visual speech recognition (AVSR) methods that have been developed over the last decade (2013–2023). Despite the recent success of audio speech recognition systems, the problem of audio-visual (AV) speech decoding remains challenging. In comparison to the previous surveys, we mainly focus on the important progress brought with the introduction of deep learning (DL) to the field and skip the description of long-known traditional "hand-crafted" methods. In addition, we also discuss the recent application of DL toward AV speech fusion and recognition. We first discuss the main AV datasets used in the literature for AVSR experiments since we consider it a data-driven machine learning (ML) task. We then consider the methodology used for visual speech recognition (VSR). Subsequently, we also consider recent AV methodology advances. We then separately discuss the evolution of the core AVSR methods, pre-processing and augmentation techniques, and modality fusion strategies. We conclude the article with a discussion on the current state of AVSR and provide our vision for future research
EMOLIPS: Towards Reliable Emotional Speech Lip-Reading
In this article, we present a novel approach for emotional speech lip-reading (EMOLIPS). This two-level approach to emotional speech-to-text recognition based on visual data processing is motivated by human perception and by recent developments in multimodal deep learning. The proposed approach uses visual speech data to determine the type of speech emotion. The speech data are then processed using one of the emotional lip-reading models trained from scratch. This essentially resolves the multi-emotional lip-reading issue associated with most real-life scenarios. We implemented these models as a combination of an EMO-3DCNN-GRU architecture for emotion recognition and a 3DCNN-BiLSTM architecture for automatic lip-reading. We evaluated the models on the CREMA-D and RAVDESS emotional speech corpora. In addition, this article provides a detailed review of recent advances in automated lip-reading and emotion recognition developed over the last 5 years (2018–2023). In comparison to existing research, we mainly focus on the valuable progress brought by the introduction of deep learning to the field and skip the description of traditional approaches. The EMOLIPS approach significantly improves state-of-the-art phrase recognition accuracy by taking into account the emotional features of the pronounced audio-visual speech, reaching 91.9% and 90.9% for RAVDESS and CREMA-D, respectively. Moreover, we present an extensive experimental investigation that demonstrates how different emotions (happiness, anger, disgust, fear, sadness, and neutral), valence classes (positive, neutral, and negative), and a binary split (emotional and neutral) affect automatic lip-reading
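The two-level idea can be summarized in a brief PyTorch sketch: a first-level visual emotion recognizer selects which per-emotion lip-reading model decodes the clip. The class list, per-clip routing, and module names are illustrative assumptions, not the published EMOLIPS code.

```python
# Minimal two-level sketch (illustrative assumptions): level 1 predicts the
# emotion, level 2 routes the clip to an emotion-specific lip-reader.
import torch
import torch.nn as nn

EMOTIONS = ["happiness", "anger", "disgust", "fear", "sadness", "neutral"]

class TwoLevelLipReader(nn.Module):
    def __init__(self, emotion_model: nn.Module,
                 lip_readers: dict):
        super().__init__()
        self.emotion_model = emotion_model             # level 1: emotion
        self.lip_readers = nn.ModuleDict(lip_readers)  # level 2: per-emotion

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (1, C, T, H, W), a single video of the mouth region
        emo_logits = self.emotion_model(clip)
        emotion = EMOTIONS[int(emo_logits.argmax(dim=-1))]
        # Decode with the lip-reading model specialized for that emotion.
        return self.lip_readers[emotion](clip)
```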
Multimodal Personality Traits Assessment (MuPTA) Corpus: The Impact of Spontaneous and Read Speech
Automatic personality traits assessment (PTA) provides high-level, intelligible predictive inputs for subsequent critical downstream tasks, such as job interview recommendations and mental healthcare monitoring. In this work, we introduce a novel Multimodal Personality Traits Assessment (MuPTA) corpus. Our MuPTA corpus is unique in that it contains both spontaneous and read speech collected in the medium-resourced Russian language. We present a novel audio-visual approach for PTA that is used to set up baseline results on this corpus. We further analyze the impact of spontaneous and read speech types on PTA predictive performance. We find that for the audio modality, PTA predictive performance on short signals is almost equal regardless of speech type, while PTA using the video modality is more accurate on spontaneous speech than on read speech regardless of signal length