    Development of an Algorithm for Human Emotion Recognition Using a Convolutional Neural Network Based on Audio Data

    Objectives. This article describes the experience of creating an algorithm for recognizing the emotional state of a subject. Methods. Image processing methods are used. Results. The proposed algorithm recognizes the emotional states of a subject on the basis of an audio data set. The accuracy of the algorithm was improved by changing the data set supplied to the input of the neural network. The stages of training the convolutional neural network on a pre-prepared set of audio data are described, as is the structure of the algorithm. To validate the neural network, a separate set of audio data that did not participate in training was selected. As a result of the study, graphs were constructed demonstrating the accuracy of the proposed method. After the initial results were obtained, the possibilities for improving the algorithm in terms of ergonomics and accuracy were analyzed, and a strategy was developed to achieve a better result and obtain a more accurate algorithm. Based on the conclusions presented in the article, the rationale is given for the choice of data representation and of the software package needed to implement the software part of the algorithm. Conclusion. The proposed algorithm achieves high accuracy and does not require large computational costs.
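
    The article itself includes no code, but the pipeline it describes (audio rendered as spectrogram "images" and fed to a convolutional network) can be sketched briefly. The following is a minimal illustration, not the authors' implementation; librosa for feature extraction, PyTorch for the model, and all names, layer sizes, clip length, and the 8-class output are assumptions.

```python
# Minimal sketch of a spectrogram + CNN emotion classifier (illustrative).
# Assumptions (not from the paper): librosa features, PyTorch model,
# 8 emotion classes, fixed-length 3 s clips at 16 kHz.
import librosa
import torch
import torch.nn as nn

def audio_to_logmel(path, sr=16000, n_mels=64, duration=3.0):
    """Load a clip and convert it to a log-mel 'image' for the CNN."""
    y, _ = librosa.load(path, sr=sr, duration=duration)
    y = librosa.util.fix_length(y, size=int(sr * duration))  # pad/trim
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)  # shape: (n_mels, frames)

class EmotionCNN(nn.Module):
    def __init__(self, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes)
        )

    def forward(self, x):          # x: (batch, 1, n_mels, frames)
        return self.head(self.features(x))

# Usage: one clip -> class logits.
# logmel = audio_to_logmel("clip.wav")
# x = torch.from_numpy(logmel).float()[None, None]  # add batch+channel dims
# logits = EmotionCNN()(x)
```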

    iMetricGAN: Intelligibility Enhancement for Speech-in-Noise using Generative Adversarial Network-based Metric Learning

    The intelligibility of natural speech is seriously degraded in adverse noisy environments. In this work, we propose a deep learning-based speech modification method to compensate for the intelligibility loss, under the constraint that the root mean square (RMS) level and duration of the speech signal are maintained before and after modification. Specifically, we utilize an iMetricGAN approach to optimize speech intelligibility metrics with generative adversarial networks (GANs). Experimental results show that the proposed iMetricGAN outperforms conventional state-of-the-art algorithms in terms of objective measures, i.e., speech intelligibility in bits (SIIB) and extended short-time objective intelligibility (ESTOI), under a cafeteria noise condition. In addition, formal listening tests reveal significant intelligibility gains when both noise and reverberation exist. Comment: 5 pages, submitted to INTERSPEECH 202
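
    The abstract does not reproduce the networks, so the sketch below only illustrates the MetricGAN-style training idea the method builds on: a discriminator D learns to predict the intelligibility score of modified speech, and the generator G is then trained to drive that prediction toward the best score. The layer sizes, the spectral-mask generator, and the `intelligibility_metric` callable (standing in for SIIB/ESTOI) are assumptions, not the paper's design.

```python
# Toy sketch of MetricGAN-style metric learning (not the authors' code).
# D regresses a normalized intelligibility score of the modified speech;
# G is trained so that D's prediction approaches the best score (1.0).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(257, 257), nn.Sigmoid())   # spectral gain mask
D = nn.Sequential(nn.Linear(257, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

def train_step(speech_mag, intelligibility_metric):
    """speech_mag: (batch, 257) magnitude frames; metric returns scores in [0, 1]."""
    modified = G(speech_mag) * speech_mag          # RMS/duration handled elsewhere

    # --- D step: regress the true metric score of the modified speech ---
    with torch.no_grad():
        target = intelligibility_metric(modified)  # (batch, 1) in [0, 1]
    d_loss = ((D(modified.detach()) - target) ** 2).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- G step: push D's predicted score toward the maximum, 1.0 ---
    g_loss = ((D(G(speech_mag) * speech_mag) - 1.0) ** 2).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```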

    CRNN-based joint azimuth and elevation localization with the Ambisonics intensity vector

    We present a source localization system for first-order Ambisonics (FOA) content based on a stacked convolutional and recurrent neural network (CRNN). We propose to use as input to the CRNN the FOA acoustic intensity vector, which is easy to compute and closely linked to the sound direction of arrival (DoA). The system estimates the DoA of a point source in both azimuth and elevation. We conduct an experimental evaluation in configurations including reverberation, noise, and various speaker orientations with respect to the microphone. The results show that the proposed architecture and input allow the network to return accurate location estimates in realistic conditions, compared to another recent CRNN-based system.
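
    Part of the paper's motivation is that this input is cheap to compute from the four B-format channels. A short sketch under common conventions (channel order W, X, Y, Z; scipy STFT); the per-bin energy normalization is an assumption, and scaling constants are omitted.

```python
# Sketch of the FOA acoustic intensity input and a broadband DoA estimate.
import numpy as np
from scipy.signal import stft

def foa_intensity(foa, fs=16000, nperseg=512):
    """foa: (4, samples) array of B-format channels W, X, Y, Z.
    Returns the active intensity vector per time-frequency bin, (3, F, T)."""
    _, _, spec = stft(foa, fs=fs, nperseg=nperseg)   # (4, F, T), complex
    W, XYZ = spec[0], spec[1:]
    I = np.real(np.conj(W)[None] * XYZ)              # Re{W* . [X, Y, Z]}
    return I / (np.abs(spec) ** 2).sum(axis=0, keepdims=True)  # normalize

def doa_from_intensity(I):
    """Azimuth/elevation from the mean intensity direction (radians)."""
    ix, iy, iz = I.reshape(3, -1).mean(axis=1)
    azimuth = np.arctan2(iy, ix)
    elevation = np.arctan2(iz, np.hypot(ix, iy))
    return azimuth, elevation
```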

    Attention-Inspired Artificial Neural Networks for Speech Processing: A Systematic Review

    Artificial Neural Networks (ANNs) were created inspired by the neural networks in the human brain and have been widely applied in speech processing. The application areas of ANNs include speech recognition, speech emotion recognition, language identification, speech enhancement, and speech separation, among others. Likewise, given that speech processing performed by humans involves complex cognitive processes known as auditory attention, a growing number of papers have proposed deep learning-based ANNs that incorporate some mechanism intended to mirror the human attention process. However, while these ANN approaches include attention, there is no categorization of the attention mechanisms integrated into deep learning algorithms or of their relation to human auditory attention. Therefore, we consider it necessary to review the different attention-inspired ANN approaches, to show both academic and industry experts the available models for a wide variety of applications. Based on the PRISMA methodology, we present a systematic review of the literature published since 2000 in which deep learning algorithms are applied to diverse problems related to speech processing. In this paper, 133 research works are selected and the following aspects are described: (i) the most relevant features, (ii) the ways in which attention has been implemented, (iii) their hypothetical relationship with human attention, and (iv) the evaluation metrics used. Additionally, the four publications most closely related to human attention were analyzed, and their strengths and weaknesses were determined.
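
    Whatever the variant, most of the surveyed models share scaled dot-product attention as a core operation. As a point of reference for the review's terminology, here is a minimal NumPy version; it is illustrative only and not tied to any of the 133 selected works.

```python
# Scaled dot-product attention: each query forms a softmax-weighted
# average of the values, weighted by query-key similarity.
import numpy as np

def attention(Q, K, V):
    """Q: (n_q, d), K: (n_k, d), V: (n_k, d_v). Returns (n_q, d_v)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted sum of values

# Example: 2 query frames attend over 5 key/value frames of dimension 8.
rng = np.random.default_rng(0)
out = attention(rng.normal(size=(2, 8)), rng.normal(size=(5, 8)),
                rng.normal(size=(5, 8)))
print(out.shape)  # (2, 8)
```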

    Automatic Identification of Emotional Information in Spanish TV Debates and Human-Machine Interactions

    Automatic emotion detection is a very attractive field of research that can help build more natural human–machine interaction systems. However, several issues arise when real scenarios are considered, such as the tendency toward neutrality, which makes it difficult to obtain balanced datasets, or the lack of standards for the annotation of emotional categories. Moreover, the intrinsic subjectivity of emotional information increases the difficulty of obtaining valuable data to train machine learning-based algorithms. In this work, two different real scenarios were tackled: human–human interactions in TV debates and human–machine interactions with a virtual agent. For comparison purposes, an analysis of the emotional information was conducted in both, and a profiling of the speakers associated with each task was carried out. Furthermore, different classification experiments show that deep learning approaches can be useful for detecting speakers' emotional information, mainly for arousal, valence, and dominance levels, reaching a 0.7 F1-score. The research presented in this paper was conducted as part of the AMIC and EMPATHIC projects, which received funding from the Spanish Ministry of Science under grants TIN2017-85854-C4-3-R and PDC2021-120846-C43 and from the European Union's Horizon 2020 research and innovation program under grant agreement No. 769872. The first author also received a PhD scholarship from the University of the Basque Country UPV/EHU, PIF17/310.
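
    The abstract does not detail the classifiers; one common way to predict arousal, valence, and dominance levels jointly is a shared encoder with one head per dimension. The sketch below is a generic illustration of that pattern, not the paper's architecture; the feature size (88, e.g. an eGeMAPS-style acoustic vector) and the three-level outputs are assumptions.

```python
# Generic sketch: shared encoder with separate classification heads for
# arousal, valence, and dominance levels (illustrative only).
import torch
import torch.nn as nn

class AVDClassifier(nn.Module):
    def __init__(self, n_features=88, n_levels=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU())
        self.heads = nn.ModuleDict({
            dim: nn.Linear(128, n_levels)   # e.g. low / neutral / high
            for dim in ("arousal", "valence", "dominance")
        })

    def forward(self, x):                   # x: (batch, n_features)
        h = self.encoder(x)
        return {dim: head(h) for dim, head in self.heads.items()}

# Usage: logits = AVDClassifier()(torch.randn(4, 88))  # dict of 3 outputs
```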

    Robust acoustic domain identification with its application to speaker diarization

    With the rise in multimedia content over the years, more variety is observed in the recording environments of audio. An audio processing system might therefore benefit from a front-end module that identifies the acoustic domain. In this paper, we demonstrate the idea of acoustic domain identification (ADI) for speaker diarization. We first present a detailed study of the various domains of the third DIHARD challenge, highlighting the factors that differentiate them from each other. Our main contribution is to develop a simple and efficient solution for ADI; in the present work, we explore speaker embeddings for this task. Next, we integrate the ADI module with the speaker diarization framework of the DIHARD III challenge. Performance improved substantially over the baseline when the thresholds for agglomerative hierarchical clustering were optimized according to the respective domains: we achieved relative DER improvements of more than 5% and 8% for the core and full conditions, respectively, on Track 1 of the DIHARD III evaluation set.
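
    The domain-dependent thresholding idea is easy to illustrate: classify the recording's domain first, then cluster the speaker embeddings with a distance threshold chosen for that domain. A hedged sketch using scikit-learn follows; the domain names and threshold values are placeholders, not the paper's tuned settings.

```python
# Sketch of domain-dependent agglomerative hierarchical clustering (AHC)
# for diarization. Thresholds and domain labels are illustrative only.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

DOMAIN_THRESHOLDS = {"broadcast": 0.6, "meeting": 0.45, "web_video": 0.7}

def diarize(embeddings, domain):
    """embeddings: (n_segments, dim) speaker embeddings for one recording.
    Returns a cluster (speaker) label per segment."""
    ahc = AgglomerativeClustering(
        n_clusters=None,                              # cut by distance instead
        distance_threshold=DOMAIN_THRESHOLDS[domain], # domain-specific cut
        metric="cosine",
        linkage="average",
    )
    return ahc.fit_predict(embeddings)

# labels = diarize(np.random.rand(20, 256), domain="meeting")
```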

    Spatial Acoustic Vector Based Sound Field Reproduction

    Spatial sound field reproduction aims to recreate an immersive sound field over a spatial region. The existing sound pressure based approaches to spatial sound field reproduction focus on accurately approximating the original sound pressure over space, which ignores the perceptual accuracy of the reproduced sound field. In the literature, the acoustic vectors of particle velocity and sound intensity appear to be closely linked with the human perception of sound localization. Therefore, in this thesis, we explore the spatial distributions of the acoustic vectors and seek to develop algorithms that perceptually reproduce the original sound field over a continuous spatial region based on these vectors. A theory of spatial acoustic vectors is first developed, in which the spatial distributions of particle velocity and sound intensity are derived from sound pressure. To extract the desired sound pressure from a mixed sound field environment, a 3D sound field separation technique is also formulated. Based on this theory, a series of reproduction techniques are proposed to improve perceptual performance. The outcomes resulting from this theory are: (i) derivation of a particle velocity assisted 3D sound field reproduction technique which allows for non-uniform loudspeaker geometry with a limited number of loudspeakers, (ii) design of a particle velocity based mixed-source sound field translation technique for binaural reproduction that can provide sound field translation with a good perceptual experience over a large space, (iii) derivation of an intensity matching technique that can reproduce the desired sound field in a spherical region by controlling the sound intensity on the surface of the region, and (iv) two intensity based multizone sound field reproduction algorithms that can reproduce the desired sound field over multiple spatial zones. Finally, these techniques are evaluated against conventional approaches through numerical simulations and real-world experiments.
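
    For reference, the two quantities the thesis builds on have standard frequency-domain definitions (the thesis's own notation and sign conventions may differ): particle velocity follows from Euler's equation, and the active sound intensity is the real part of the pressure-velocity product.

```latex
% Standard acoustic definitions (time convention e^{i\omega t}), with
% sound pressure p, angular frequency \omega, and air density \rho_0:
\mathbf{v}(\mathbf{r},\omega) = -\frac{1}{i\omega\rho_0}\,\nabla p(\mathbf{r},\omega),
\qquad
\mathbf{I}(\mathbf{r},\omega) = \frac{1}{2}\,\operatorname{Re}\!\left\{ p(\mathbf{r},\omega)\,\mathbf{v}^{*}(\mathbf{r},\omega) \right\}
```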