210 research outputs found
I hear you eat and speak: automatic recognition of eating condition and food type, use-cases, and impact on ASR performance
We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i. e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6 k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech; the database is made publicly available for research purposes. We first demonstrate that, for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We then propose automatic classification based both on brute-forced low-level acoustic features and on higher-level features related to intelligibility, obtained from an automatic speech recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of the eating condition (i. e., eating or not eating) can be solved easily, independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, reaching up to 62.3% average recall for multi-way classification of the eating condition, i. e., discriminating the six types of food as well as not eating. Early fusion of the intelligibility-related features with the brute-forced acoustic feature set improves performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with a determination coefficient of up to 56.2%.
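As a rough illustration of the evaluation protocol named above (a minimal sketch, not the authors' exact pipeline), leave-one-speaker-out SVM classification can be set up with scikit-learn; the feature matrix X, labels y, and speaker IDs groups below are random placeholders standing in for precomputed acoustic features of the utterances.

```python
# Minimal sketch: leave-one-speaker-out SVM evaluation (placeholder data).
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(180, 64))        # placeholder: one feature vector per utterance
y = rng.integers(0, 7, size=180)      # placeholder: six food classes + "not eating"
groups = np.repeat(np.arange(30), 6)  # placeholder: speaker ID of each utterance

fold_uars = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    clf.fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])
    # unweighted average recall, i.e. the "average recall" reported above
    fold_uars.append(recall_score(y[test_idx], y_pred,
                                  average="macro", zero_division=0))

print(f"mean UAR over held-out speakers: {np.mean(fold_uars):.3f}")
```

Holding out all utterances of one speaker per fold prevents the classifier from exploiting speaker identity, which matters here because every speaker produced utterances under several eating conditions.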
Computer audition for emotional wellbeing
This thesis is focused on the application of computer audition (i. e., machine listening) methodologies for monitoring states of emotional wellbeing. Computer audition is a growing field and has been successfully applied to an array of use cases in recent years. There are several advantages to audio-based computational analysis; for example, audio can be recorded non-invasively, stored economically, and can capture rich information on happenings in a given environment, e. g., human behaviour. Maintaining emotional wellbeing is a challenge for humans, and emotion-altering conditions, including stress and anxiety, have become increasingly common in recent years. Such conditions manifest in the body, inherently changing how we express ourselves. Research shows these alterations are perceivable within vocalisation, suggesting that speech-based audio monitoring may be valuable for developing artificially intelligent systems that target improved wellbeing. Furthermore, computer audition applies machine learning and other computational techniques to audio understanding; by combining computer audition with applications in computational paralinguistics and emotional wellbeing, this research also concerns the broader field of empathy for Artificial Intelligence (AI). To this end, speech-based audio modelling that incorporates and understands paralinguistic wellbeing-related states may be a vital cornerstone for improving the degree of empathy that an artificial intelligence can exhibit.
To summarise, this thesis investigates the extent to which speech-based computer audition methodologies can be utilised to understand human emotional wellbeing. A fundamental background on the fields in question as they pertain to emotional wellbeing is first presented, followed by an outline of the applied audio-based methodologies. Next, several machine learning experiments focused on emotional wellbeing applications are detailed, including the analysis and recognition of under-researched phenomena in speech, e. g., anxiety and markers of stress. Core contributions of this thesis include the collection of several related datasets, hybrid fusion strategies for an emotional gold standard, novel machine learning strategies for data interpretation, and an in-depth acoustic-based computational evaluation of several human states. All of these contributions focus on ascertaining the advantage of audio in the context of modelling emotional wellbeing. Given the sensitive nature of human wellbeing, the ethical implications of developing and applying such systems are discussed throughout.
A survey on perceived speaker traits: personality, likability, pathology, and the first challenge
The INTERSPEECH 2012 Speaker Trait Challenge aimed at a unified test-bed for perceived speaker traits and was the first challenge of this kind: personality in the five OCEAN personality dimensions, likability of speakers, and intelligibility of pathologic speakers. In the present article, we give a brief overview of the state of the art in these three fields of research and describe the three sub-challenges in terms of the challenge conditions, the baseline results provided by the organisers, and a new openSMILE feature set, which has been used for computing the baselines and which has been provided to the participants. Furthermore, we summarise the approaches and the results presented by the participants to show the various techniques that are currently applied to solve these classification tasks.
The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing
Work on voice sciences over recent decades has led to a proliferation of acoustic parameters that are used quite selectively and are not always extracted in a similar fashion. With many independent teams working in different research areas, shared standards become an essential safeguard to ensure compliance with state-of-the-art methods, allowing appropriate comparison of results across studies and potential integration and combination of extraction and recognition systems. In this paper, we propose a basic standard acoustic parameter set for various areas of automatic voice analysis, such as paralinguistic or clinical speech analysis. In contrast to a large brute-force parameter set, we present a minimalistic set of voice parameters, selected based on a) their potential to index affective physiological changes in voice production, b) their proven value in former studies as well as their automatic extractability, and c) their theoretical significance. The set is intended to provide a common baseline for the evaluation of future research and to eliminate differences caused by varying parameter sets or even different implementations of the same parameters. Our implementation is publicly available with the openSMILE toolkit. Comparative evaluations of the proposed feature set and the large baseline feature sets of the INTERSPEECH challenges show a high performance of the proposed set in relation to its size.
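For illustration only, the set can be extracted with the Python wrapper of the openSMILE toolkit mentioned above; this is a minimal sketch assuming the publicly available opensmile package, and the audio file name is a placeholder rather than a detail from the paper.

```python
# Minimal sketch: extracting GeMAPS functionals via the opensmile package
# (pip install opensmile); "speech_sample.wav" is a placeholder file name.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.GeMAPSv01b,       # minimalistic GeMAPS set
    feature_level=opensmile.FeatureLevel.Functionals,  # utterance-level statistics
)
features = smile.process_file("speech_sample.wav")
print(features.shape)              # expected (1, 62): 62 GeMAPS functionals
print(list(features.columns[:3]))  # e.g. pitch- and loudness-related parameters
```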
Analytical review of audiovisual systems for detecting personal protective equipment on the human face
Since 2019, all countries of the world have faced the rapid spread of the pandemic caused by the COVID-19 coronavirus infection, which the world community continues to fight to the present day. Despite the obvious effectiveness of personal respiratory protection equipment against coronavirus infection, many people neglect to wear protective face masks in public places. Therefore, to monitor compliance and promptly identify violators of public health regulations, it is necessary to apply modern information technologies that can detect protective masks on people's faces using video and audio information. The article presents an analytical review of existing and emerging intelligent information technologies for bimodal analysis of the voice and facial characteristics of a masked person. There are many studies on detecting masks from video images, and a significant number of corpora containing images of faces both with and without masks, collected by various methods, can be found in open access. Research and development aimed at detecting personal respiratory protection equipment from the acoustic characteristics of human speech is still quite scarce, since this direction only began to develop during the pandemic caused by the COVID-19 coronavirus infection. Existing systems help to prevent the spread of coronavirus infection by recognising the presence or absence of masks on the face, and they also assist in the remote diagnosis of COVID-19 by detecting the first symptoms of a viral infection from acoustic characteristics. However, to date, there are a number of unresolved problems in the field of automatic diagnosis of COVID-19 symptoms and of the presence or absence of masks on people's faces. First of all, the accuracy of detecting masks and coronavirus infection is still low, which does not allow automatic diagnosis to be performed without the involvement of experts (medical personnel). Many systems are not able to operate in real time, which makes it impossible to control and monitor the wearing of protective masks in public places. Also, most of the existing systems cannot be embedded in a smartphone, so that users could diagnose the presence of coronavirus infection anywhere. Another major problem is the collection of data from patients infected with COVID-19, as many people do not agree to share confidential information.
- …