22 research outputs found

    Towards a Classifier to Recognize Emotions Using Voice to Improve Recommendations

    The recognition of emotions from tone of voice is currently a tool with high potential for making recommendations, since it allows recommendations to be personalized using the user's mood as information. However, recognizing emotions from tone of voice is a complex task, since the signal must be pre-processed before the emotion can be recognized. Most current proposals use recurrent networks based on sequences with a temporal relationship. The disadvantage of these networks is their high runtime, which makes them difficult to use in real-time applications. Furthermore, when defining this type of classifier, culture and language must be taken into account, since the tone of voice for the same emotion can vary with these cultural factors. In this work we propose a culturally adapted model for recognizing emotions from tone of voice using convolutional neural networks. This type of network has a relatively short execution time, allowing its use in real-time applications. The results we have obtained improve the current state of the art, reaching a 93.6% success rate on the validation set. This work is partially supported by the Spanish Government project TIN2017-89156-R, GVA-CEICE project PROMETEO/2018/002, Generalitat Valenciana and European Social Fund FPI grant ACIF/2017/085, Universitat Politecnica de Valencia research grant (PAID-10-19), and by the Spanish Government (RTI2018-095390-B-C31).
    Fuentes-López, JM.; Taverner-Aparicio, JJ.; Rincón Arango, JA.; Botti Navarro, VJ. (2020). Towards a Classifier to Recognize Emotions Using Voice to Improve Recommendations. Springer. 218-225. https://doi.org/10.1007/978-3-030-51999-5_18
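    The paper does not include an implementation; the sketch below is only a minimal illustration of the kind of convolutional classifier described, assuming log-mel-spectrogram inputs and a small PyTorch network (the layer sizes, input shape, and emotion label set are illustrative assumptions, not the authors' exact architecture).

```python
import torch
import torch.nn as nn

# Illustrative emotion label set; the paper's actual classes may differ.
EMOTIONS = ["happy", "sad", "angry", "neutral", "fear", "surprise"]

class EmotionCNN(nn.Module):
    """Small CNN over log-mel spectrograms shaped (batch, 1, n_mels, n_frames)."""
    def __init__(self, n_classes=len(EMOTIONS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# Example: a batch of 8 utterances, 64 mel bands, 128 frames each.
logits = EmotionCNN()(torch.randn(8, 1, 64, 128))
probs = torch.softmax(logits, dim=1)   # per-emotion probabilities
```

    A single feed-forward pass like this has a fixed, small cost per utterance, which is the runtime advantage over recurrent models that the abstract emphasizes.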

    MLP-based Log Spectral Energy Mapping for Robust Overlapping Speech Recognition

    This paper investigates a multilayer perceptron (MLP) based acoustic feature mapping to extract robust features for automatic speech recognition (ASR) of overlapping speech. The MLP is trained to learn the mapping from log mel filter bank energies (MFBEs) extracted from distant microphone recordings containing multiple overlapping speakers to log MFBEs extracted from the clean speech signal. The outputs of the MLP are then used to generate mel-frequency cepstral coefficient (MFCC) acoustic features, which are subsequently used in acoustic model adaptation and system evaluation. The proposed approach is evaluated through extensive studies on the MONC corpus, which includes both non-overlapping single-speaker and overlapping multi-speaker conditions. We demonstrate that, by learning the mapping between log MFBEs extracted from noisy and clean signals, the performance of the ASR system can be significantly improved in the overlapping multi-speaker condition compared to a conventional delay-and-sum beamforming approach, while keeping the performance of the system on the single non-overlapping speaker condition intact.
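    As a rough illustration of the described pipeline, the sketch below maps a context window of noisy log-MFBE frames through a small MLP to enhanced log-MFBEs and then derives cepstral features with a DCT; the layer sizes, context width, and filter bank dimension are assumptions, not the paper's actual configuration.

```python
import numpy as np
import torch
import torch.nn as nn
from scipy.fftpack import dct

N_MFBE = 24      # mel filter bank channels (assumed)
CONTEXT = 4      # frames of context on each side (assumed)

# MLP mapping a window of noisy log-MFBE frames to one enhanced log-MFBE frame.
mapper = nn.Sequential(
    nn.Linear(N_MFBE * (2 * CONTEXT + 1), 512),
    nn.Sigmoid(),
    nn.Linear(512, N_MFBE),
)

def map_and_cepstra(noisy_logmfbe: np.ndarray, n_ceps: int = 13) -> np.ndarray:
    """noisy_logmfbe: (n_frames, N_MFBE) from the distant microphone.
    Returns MFCC-like features derived from the MLP's enhanced log-MFBEs."""
    padded = np.pad(noisy_logmfbe, ((CONTEXT, CONTEXT), (0, 0)), mode="edge")
    windows = np.stack([padded[i:i + 2 * CONTEXT + 1].ravel()
                        for i in range(len(noisy_logmfbe))])
    with torch.no_grad():
        enhanced = mapper(torch.from_numpy(windows).float()).numpy()
    # Cepstral coefficients via a type-II DCT over the enhanced log energies.
    return dct(enhanced, type=2, axis=1, norm="ortho")[:, :n_ceps]
```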

    Recognition of people's emotional states by analyzing their voice: an experience between R&D and technology transfer

    Speech emotion recognition (SER) systems aim to identify a person's emotional state by analyzing only their voice; that is, the system must select the class (happy, angry, sad, fear, surprise, etc.) that is most probable for the input audio. SER is a topic of interest in the field of digital audio processing because of the potential applications that can be built on it, for example: systems that interact with humans based on perceived emotions, assistants in psychological therapy, and lie detection in interrogations, among others. This work summarizes our initial experience in developing a SER system as a result of the collaboration between the DI-UNSa and the SE-911, both institutions of the province of Salta. Sociedad Argentina de Informática e Investigación Operativa
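    At its core, the classification step the abstract describes amounts to scoring each emotion class for the input audio and selecting the most probable one; the snippet below shows only that selection, with a hypothetical class list and scores made up for illustration.

```python
import numpy as np

# Hypothetical per-class scores produced by some acoustic model for one utterance.
classes = ["happy", "angry", "sad", "fear", "surprise", "neutral"]
scores = np.array([2.1, 0.3, -0.5, 0.0, 1.2, 1.8])

probs = np.exp(scores - scores.max())
probs /= probs.sum()                         # softmax over emotion classes
predicted = classes[int(np.argmax(probs))]   # the most probable class
print(predicted, probs.round(3))
```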

    Your password is music to my ears: cloud-based authentication using sound

    This paper details research in progress into identifying and addressing the threats faced by voice assistants and audio-based digital systems. The popularity of these systems continues to grow, as does the number of applications and scenarios they are used in. Smart speakers, smart home devices, mobile phones, telephone banking, and even vehicle controls all benefit from being controllable to some extent by voice, without diverting the user's attention to a screen or requiring an input device such as a keyboard. Whilst this removes barriers to use for those with accessibility challenges such as visual impairment or motor skills issues, and opens up a much more convenient user experience, a number of cyber security threats remain unaddressed. This paper details a threat modeling exercise and suggests a model to address the key threats whilst retaining the usability associated with voice-driven systems, by using an additional sound-based authentication factor.
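    The abstract does not spell out how the sound-based factor works; purely as a generic illustration (not the authors' scheme), the sketch below derives a standard time-based one-time code and renders it as audio tones that a client device could play and a listening service could verify. The digit-to-tone mapping and frequencies are arbitrary assumptions.

```python
import hashlib, hmac, struct, time
import numpy as np

def totp(secret: bytes, step: int = 30, digits: int = 6) -> str:
    """Standard time-based one-time password (RFC 6238 style)."""
    counter = struct.pack(">Q", int(time.time()) // step)
    mac = hmac.new(secret, counter, hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    code = (struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF) % 10 ** digits
    return str(code).zfill(digits)

def code_to_tones(code: str, sr: int = 16000, dur: float = 0.15) -> np.ndarray:
    """Encode each digit as a short sine tone: digit d -> (600 + 100*d) Hz."""
    t = np.arange(int(sr * dur)) / sr
    return np.concatenate([np.sin(2 * np.pi * (600 + 100 * int(d)) * t) for d in code])

# Waveform the client could play as an additional authentication factor.
audio = code_to_tones(totp(b"shared-secret"))
```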

    Nonlinear Dynamic Invariants for Continuous Speech Recognition

    In this work, nonlinear acoustic information is combined with traditional linear acoustic information in order to produce a noise-robust set of features for speech recognition. Classical acoustic modeling techniques for speech recognition have relied on a standard assumption of linear acoustics where signal processing is primarily performed in the signal's frequency domain. While these conventional techniques have demonstrated good performance under controlled conditions, the performance of these systems suffers significant degradations when the acoustic data is contaminated with previously unseen noise. The objective of this thesis was to determine whether nonlinear dynamic invariants are able to boost speech recognition performance when combined with traditional acoustic features. Several sets of experiments are used to evaluate both clean and noisy speech data. The invariants resulted in a maximum relative increase of 11.1% for the clean evaluation set. However, an average relative decrease of 7.6% was observed for the noise-contaminated evaluation sets. The fact that recognition performance decreased with the use of dynamic invariants suggests that additional research is required for robust filtering of phase spaces constructed from noisy time series.
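    Dynamic invariants such as the correlation dimension or Lyapunov exponents are typically estimated from a phase space reconstructed by time-delay embedding of the speech samples; the sketch below shows only that embedding step, with an arbitrary delay and embedding dimension rather than the values used in the thesis.

```python
import numpy as np

def delay_embed(x: np.ndarray, dim: int = 5, tau: int = 7) -> np.ndarray:
    """Takens time-delay embedding: map a 1-D frame x[n] to points
    (x[n], x[n+tau], ..., x[n+(dim-1)*tau]) in a dim-dimensional phase space."""
    n_points = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau:i * tau + n_points] for i in range(dim)], axis=1)

frame = np.random.randn(400)          # stand-in for one windowed speech frame
phase_space = delay_embed(frame)      # shape: (n_points, dim)
# Dynamic invariants (correlation dimension, Lyapunov exponents, etc.) would
# then be estimated from these reconstructed trajectories.
```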

    Improved time-frequency features and electrode placement for EEG-based biometric person recognition

    This work introduces a novel feature extraction method for biometric recognition using EEG data and provides an analysis of the impact of electrode placements on performance. The feature extraction method is based on the wavelet transform of the raw EEG signal. The logarithms of the wavelet coefficients are further processed using the discrete cosine transform (DCT). The DCT coefficients from each wavelet band are used to form the feature vectors for classification. As an application in the biometrics scenario, the effect of electrode locations on person recognition is also investigated, and suggestions are made for electrode positioning to improve performance. The effectiveness of the proposed feature was investigated in both identification and verification scenarios. Identification rates of 98.24% and 93.28% were obtained using the EEG Motor Movement/Imagery Dataset (MM/I) and the UCI EEG Database Dataset respectively, which compares favorably with other published reports while using a significantly smaller number of electrodes. The performance of the proposed system also showed substantial improvements in the verification scenario when compared with similar systems from the published literature. A multi-session analysis is simulated using eyes-open and eyes-closed recordings from the MM/I database. It is found that the proposed feature is less influenced by the time separation between training and testing than a conventional feature based on power spectral analysis.
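    A minimal sketch of the described feature extraction, assuming PyWavelets for the decomposition; the wavelet family, decomposition level, number of retained DCT coefficients, and the exact logarithm formulation are assumptions rather than the paper's settings.

```python
import numpy as np
import pywt
from scipy.fftpack import dct

def eeg_feature_vector(signal: np.ndarray, wavelet: str = "db4",
                       level: int = 5, n_dct: int = 10) -> np.ndarray:
    """Wavelet-decompose one EEG channel, take logs of the coefficients in each
    band, apply a DCT per band, and concatenate the leading DCT coefficients."""
    bands = pywt.wavedec(signal, wavelet, level=level)
    feats = []
    for coeffs in bands:
        # Log of squared coefficients (small floor avoids log(0)); the exact
        # log formulation is an assumption, the abstract only says "logarithms".
        log_band = np.log(coeffs ** 2 + 1e-12)
        feats.append(dct(log_band, type=2, norm="ortho")[:n_dct])
    return np.concatenate(feats)

features = eeg_feature_vector(np.random.randn(160 * 4))  # e.g. 4 s at 160 Hz
```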