An analytical review of audio-visual systems for detecting personal protective equipment on the human face
Since 2019, countries around the world have faced the rapid spread of the pandemic caused by the COVID-19 coronavirus infection, which the global community continues to fight to this day. Despite the evident effectiveness of personal respiratory protective equipment against coronavirus infection, many people neglect to wear protective face masks in public places. Monitoring compliance and promptly identifying violators of public health rules therefore requires modern information technologies that detect protective masks on people's faces from video and audio information. This article provides an analytical review of existing and emerging intelligent information technologies for bimodal analysis of the voice and facial characteristics of a person wearing a mask. There are many studies on mask detection from video images, and a significant number of publicly available corpora contain images of faces both with and without masks, collected in various ways. Research and development aimed at detecting personal respiratory protective equipment from the acoustic characteristics of human speech remains scarce, since this direction only began to develop during the pandemic caused by the COVID-19 coronavirus infection. Existing systems help prevent the spread of coronavirus infection by recognizing the presence or absence of face masks; such systems also assist in the remote diagnosis of COVID-19 by detecting the first symptoms of the viral infection from acoustic characteristics. However, a number of problems in the automatic diagnosis of COVID-19 symptoms and the detection of face masks remain unsolved. First of all, the accuracy of mask and coronavirus detection is low, which rules out automatic diagnosis without the involvement of experts (medical personnel). Many systems cannot operate in real time, which makes it impossible to monitor the wearing of protective masks in public places. Furthermore, most existing systems cannot be embedded in a smartphone so that users could test for coronavirus infection anywhere. Another major problem is collecting data from patients infected with COVID-19, since many people are unwilling to share confidential information
Neural network-based method for visual recognition of driver's voice commands using attention mechanism
Visual speech recognition, or automated lip-reading, systems are actively applied to speech-to-text translation. Video data prove useful in multimodal speech recognition systems, particularly when acoustic data are difficult to use or unavailable altogether. The main purpose of this study is to improve driver command recognition by analyzing visual information, thereby reducing touch interaction with various vehicle systems (multimedia and navigation systems, phone calls, etc.) while driving. We propose a method for automated lip-reading of the driver's speech while driving, based on a deep neural network with the 3DResNet18 architecture. Extending the architecture with a bidirectional LSTM model and an attention mechanism achieves higher recognition accuracy at the cost of a slight decrease in performance. Two variants of neural network architectures for visual speech recognition are proposed and investigated. With the first architecture, driver command recognition accuracy was 77.68%, which is 5.78 percentage points lower than with the second architecture, whose accuracy was 83.46%. System performance, measured by the real-time factor (RTF), was 0.076 for the first architecture and 0.183 for the second, more than twice as high. The proposed method was tested on data from the multimodal RUSAVIC corpus recorded in a car. The results of the study can be used in audio-visual speech recognition systems, which are recommended in high-noise conditions, for example, when driving a vehicle. In addition, the analysis performed allows us to choose the optimal neural network model for visual speech recognition for subsequent incorporation into an assistive system based on a mobile device
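To make the reported metrics concrete, here is a minimal PyTorch sketch of a visual speech recognizer in the spirit described above: a 3D-CNN front-end, a bidirectional LSTM, and attention pooling over time, plus a real-time factor (RTF) helper. All layer sizes, the single-channel input format, and the helper itself are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact network):
# 3D-CNN front-end + bidirectional LSTM + additive attention pooling.
import time
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, num_commands: int, hidden: int = 256):
        super().__init__()
        # 3D convolution over (time, height, width) of the mouth-region clip
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(5, 7, 7),
                      stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(32),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)   # additive attention scores
        self.head = nn.Linear(2 * hidden, num_commands)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, 1, frames, H, W) grayscale mouth crops
        feats = self.frontend(clips)                     # (B, C, T, H', W')
        feats = feats.mean(dim=(3, 4)).transpose(1, 2)   # (B, T, C)
        seq, _ = self.lstm(feats)                        # (B, T, 2*hidden)
        weights = torch.softmax(self.attn(seq), dim=1)   # (B, T, 1)
        pooled = (weights * seq).sum(dim=1)              # weighted summary
        return self.head(pooled)                         # command logits

def real_time_factor(model: nn.Module, clip: torch.Tensor,
                     fps: float = 25.0) -> float:
    """RTF = processing time / clip duration; RTF < 1 means real-time."""
    model.eval()
    start = time.perf_counter()
    with torch.no_grad():
        model(clip)
    elapsed = time.perf_counter() - start
    duration = clip.shape[2] / fps  # frames / frames-per-second
    return elapsed / duration
```

Under this definition, both reported values (0.076 and 0.183) are below 1, which is why the abstract can treat the slower, more accurate architecture as still usable in real time.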
Analysis of information and mathematical support for recognizing human affective states
This article presents an analytical review of research in the field of affective computing. This field is a component of artificial intelligence that studies methods, algorithms, and systems for analyzing human affective states during interaction with other people, computer systems, or robots. In intelligent data analysis, affect is understood as a manifestation of psychological reactions to a stimulating event, which may unfold over both short and long periods and vary in the intensity of the experience. Affects in this field are divided into four types: affective emotions, basic emotions, mood, and affective disorders. Affective states manifest themselves in verbal data and in non-verbal behavioral characteristics: the acoustic and linguistic characteristics of speech, facial expressions, gestures, and body postures. The review provides a comparative analysis of existing information support for the automatic recognition of human affective states, using emotion, sentiment, aggression, and depression as examples. The few Russian-language affective databases are still significantly inferior in volume and quality to electronic resources in other world languages, which necessitates considering a wide range of additional approaches, methods, and algorithms applicable under limited training and test data, and poses the tasks of developing new approaches to data augmentation, model transfer learning, and the adaptation of foreign-language resources. The article describes methods for analyzing unimodal visual, acoustic, and linguistic information, as well as multimodal approaches to the recognition of affective states. A multimodal approach to the automatic analysis of affective states improves recognition accuracy relative to unimodal solutions. The review notes a trend in current research: neural network methods are gradually displacing classical deterministic methods thanks to better recognition quality and efficient processing of large volumes of data. An advantage of multi-task hierarchical approaches is the ability to extract new types of knowledge, including knowledge about the influence, correlation, and interaction of several affective states with one another, which potentially leads to improved recognition quality. Potential requirements for affective state analysis systems under development and the main directions for further research are given
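As an aside on the multi-task hierarchical approaches mentioned above, the sketch below shows the basic pattern in PyTorch: a shared encoder feeds separate heads for several affective states, so joint training can exploit their correlations. The task set, layer sizes, and unweighted loss sum are illustrative assumptions rather than a system from the review.

```python
# Minimal multi-task affect sketch (illustrative assumptions): one shared
# encoder, one head per affective phenomenon, losses summed for joint training.
import torch
import torch.nn as nn

class MultiTaskAffectModel(nn.Module):
    def __init__(self, feat_dim: int = 512, n_emotions: int = 7,
                 n_sentiments: int = 3):
        super().__init__()
        # Shared representation learned from any modality's features
        self.shared = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.emotion = nn.Linear(256, n_emotions)
        self.sentiment = nn.Linear(256, n_sentiments)
        self.depression = nn.Linear(256, 1)  # binary indicator (logit)

    def forward(self, feats: torch.Tensor) -> dict:
        h = self.shared(feats)
        return {"emotion": self.emotion(h),
                "sentiment": self.sentiment(h),
                "depression": self.depression(h)}

# Joint training sums per-task losses, letting correlated states inform
# each other (random placeholder features and labels for a batch of 4):
model = MultiTaskAffectModel()
out = model(torch.randn(4, 512))
loss = (nn.functional.cross_entropy(out["emotion"], torch.randint(0, 7, (4,)))
        + nn.functional.cross_entropy(out["sentiment"],
                                      torch.randint(0, 3, (4,)))
        + nn.functional.binary_cross_entropy_with_logits(
              out["depression"].squeeze(1), torch.rand(4).round()))
loss.backward()
```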
A Multimodal User Interface for an Assistive Robotic Shopping Cart
This paper presents the research and development of a prototype of the assistive mobile information robot (AMIR). The main features of the presented prototype are voice and gesture-based interfaces, with Russian speech and sign language recognition and synthesis techniques, and a high degree of robot autonomy. The AMIR prototype is intended to serve as a robotic cart for shopping in grocery stores and supermarkets. Among the main topics covered in this paper are the presentation of the interface (three modalities), the single-handed gesture recognition system (based on a collected database of Russian sign language elements), and the technical description of the robotic platform (architecture, navigation algorithm). The use of multimodal interfaces, namely the speech and gesture modalities, makes human-robot interaction natural and intuitive, while sign language recognition allows hearing-impaired people to use this robotic cart. The AMIR prototype has promising prospects for real usage in supermarkets, both due to its assistive capabilities and its multimodal user interface
Audio-Visual Speech and Gesture Recognition by Sensors of Mobile Devices
Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable speech recognition, particularly when audio is corrupted by noise. Additional visual information can be used for both automatic lip-reading and gesture recognition. Hand gestures are a form of non-verbal communication and can be used as a very important part of modern human–computer interaction systems. Currently, audio and video modalities are easily accessible by sensors of mobile devices. However, there is no out-of-the-box solution for automatic audio-visual speech and gesture recognition. This study introduces two deep neural network-based model architectures: one for AVSR and one for gesture recognition. The main novelty regarding audio-visual speech recognition lies in fine-tuning strategies for both visual and acoustic features and in the proposed end-to-end model, which considers three modality fusion approaches: prediction-level, feature-level, and model-level. The main novelty in gesture recognition lies in a unique set of spatio-temporal features, including those that consider lip articulation information. As there are no available datasets for the combined task, we evaluated our methods on two different large-scale corpora—LRW and AUTSL—and outperformed existing methods on both audio-visual speech recognition and gesture recognition tasks. We achieved AVSR accuracy for the LRW dataset equal to 98.76% and gesture recognition rate for the AUTSL dataset equal to 98.56%. The results obtained demonstrate not only the high performance of the proposed methodology, but also the fundamental possibility of recognizing audio-visual speech and gestures by sensors of mobile devices
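The three modality fusion approaches named above can be illustrated with a short PyTorch sketch. The layer sizes, the use of a transformer layer for model-level fusion, and the assumption that both modality embeddings share one dimensionality are ours, not the paper's exact models.

```python
# Minimal sketch (illustrative assumptions) of prediction-level,
# feature-level, and model-level fusion of audio features `a` and
# visual features `v` for one utterance.
import torch
import torch.nn as nn

class FusionDemo(nn.Module):
    def __init__(self, dim: int = 256, n_classes: int = 500):
        super().__init__()
        self.cls_a = nn.Linear(dim, n_classes)        # audio-only classifier
        self.cls_v = nn.Linear(dim, n_classes)        # video-only classifier
        self.cls_feat = nn.Linear(2 * dim, n_classes) # feature-level head
        # Model-level: a shared layer in which the two modality tokens
        # interact before classification.
        self.mix = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                              batch_first=True)
        self.cls_model = nn.Linear(dim, n_classes)

    def forward(self, a: torch.Tensor, v: torch.Tensor) -> dict:
        # Prediction-level fusion: average the per-modality logits.
        pred_level = (self.cls_a(a) + self.cls_v(v)) / 2
        # Feature-level fusion: concatenate features, then classify.
        feat_level = self.cls_feat(torch.cat([a, v], dim=-1))
        # Model-level fusion: joint processing of both modality embeddings.
        tokens = torch.stack([a, v], dim=1)           # (B, 2, dim)
        model_level = self.cls_model(self.mix(tokens).mean(dim=1))
        return {"prediction": pred_level, "feature": feat_level,
                "model": model_level}
```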
An intelligent gesture-based user interface for controlling an assistive mobile information robot
This article presents a gesture-based user interface for a robotic shopping trolley. The trolley is designed as a mobile robotic platform that helps customers in shops and supermarkets. Among its main functions are navigating through the store, providing information on item availability and location, and transporting the items bought. One of the important features of the developed interface is the gestural modality, or, more precisely, a recognition system for Russian sign language elements. The interface design, as well as the interaction strategy, is presented in flowcharts, and an attempt was made to demonstrate the gestural modality as a natural part of an assistive information robot. In addition, a short overview of mobile robots is given in the paper, and a CNN-based gesture recognition technique is provided. The Russian sign language recognition option is of high importance due to the relatively large number of native speakers (signers)
A Review of Recent Advances on Deep Learning Methods for Audio-Visual Speech Recognition
This article provides a detailed review of recent advances in audio-visual speech recognition (AVSR) methods that have been developed over the last decade (2013–2023). Despite the recent success of audio speech recognition systems, the problem of audio-visual (AV) speech decoding remains challenging. In comparison to the previous surveys, we mainly focus on the important progress brought with the introduction of deep learning (DL) to the field and skip the description of long-known traditional "hand-crafted" methods. In addition, we also discuss the recent application of DL toward AV speech fusion and recognition. We first discuss the main AV datasets used in the literature for AVSR experiments since we consider it a data-driven machine learning (ML) task. We then consider the methodology used for visual speech recognition (VSR). Subsequently, we also consider recent AV methodology advances. We then separately discuss the evolution of the core AVSR methods, pre-processing and augmentation techniques, and modality fusion strategies. We conclude the article with a discussion on the current state of AVSR and provide our vision for future research
EMOLIPS: Towards Reliable Emotional Speech Lip-Reading
In this article, we present a novel approach for emotional speech lip-reading (EMOLIPS). This two-level approach to emotional speech-to-text recognition based on visual data processing is motivated by human perception and by recent developments in multimodal deep learning. The proposed approach uses visual speech data to determine the type of speech emotion. The speech data are then processed using one of the emotional lip-reading models trained from scratch. This essentially resolves the multi-emotional lip-reading issue associated with most real-life scenarios. We implemented these models as a combination of an EMO-3DCNN-GRU architecture for emotion recognition and a 3DCNN-BiLSTM architecture for automatic lip-reading. We evaluated the models on the CREMA-D and RAVDESS emotional speech corpora. In addition, this article provides a detailed review of recent advances in automated lip-reading and emotion recognition developed over the last 5 years (2018–2023). In comparison to existing research, we mainly focus on the valuable progress brought by the introduction of deep learning to the field and skip the description of traditional approaches. The EMOLIPS approach significantly improves state-of-the-art phrase recognition accuracy by taking into account the emotional features of the pronounced audio-visual speech, reaching 91.9% and 90.9% for RAVDESS and CREMA-D, respectively. Moreover, we present an extensive experimental investigation that demonstrates how different emotions (happiness, anger, disgust, fear, sadness, and neutral), valence classes (positive, neutral, and negative), and a binary split (emotional and neutral) affect automatic lip-reading
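The two-level idea can be summarized in a brief PyTorch sketch: a first-level visual emotion recognizer selects which per-emotion lip-reading model decodes the clip. The class list, per-clip routing, and module names are illustrative assumptions, not the published EMOLIPS code.

```python
# Minimal two-level sketch (illustrative assumptions): level 1 predicts the
# emotion, level 2 routes the clip to an emotion-specific lip-reader.
import torch
import torch.nn as nn

EMOTIONS = ["happiness", "anger", "disgust", "fear", "sadness", "neutral"]

class TwoLevelLipReader(nn.Module):
    def __init__(self, emotion_model: nn.Module,
                 lip_readers: dict):
        super().__init__()
        self.emotion_model = emotion_model             # level 1: emotion
        self.lip_readers = nn.ModuleDict(lip_readers)  # level 2: per-emotion

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (1, C, T, H, W), a single video of the mouth region
        emo_logits = self.emotion_model(clip)
        emotion = EMOTIONS[int(emo_logits.argmax(dim=-1))]
        # Decode with the lip-reading model specialized for that emotion.
        return self.lip_readers[emotion](clip)
```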
Multimodal Personality Traits Assessment (MuPTA) Corpus: The Impact of Spontaneous and Read Speech
Automatic personality traits assessment (PTA) provides high-level, intelligible predictive inputs for subsequent critical downstream tasks, such as job interview recommendations and mental healthcare monitoring. In this work, we introduce a novel Multimodal Personality Traits Assessment (MuPTA) corpus. Our MuPTA corpus is unique in that it contains both spontaneous and read speech collected in the medium-resourced Russian language. We present a novel audio-visual approach for PTA that is used to set up baseline results on this corpus. We further analyze the impact of spontaneous and read speech types on PTA predictive performance. We find that for the audio modality, PTA predictive performance on short signals is almost equal regardless of speech type, while PTA using the video modality is more accurate on spontaneous speech than on read speech regardless of signal length