15,826 research outputs found

    A comparison of features for large population speaker identification

    Get PDF
    Bibliography: leaves 95-104.Speech recognition systems all have one criterion in common; they perform better in a controlled environment using clean speech. Though performance can be excellent, even exceeding human capabilities for clean speech, systems fail when presented with speech data from more realistic environments such as telephone channels. The differences using a recognizer in clean and noisy environments are extreme, and this causes one of the major obstacles in producing commercial recognition systems to be used in normal environments. It is the lack of performance of speaker recognition systems with telephone channels that this work addresses. The human auditory system is a speech recognizer with excellent performance, especially in noisy environments. Since humans perform well at ignoring noise more than any machine, auditory-based methods are the promising approaches since they attempt to model the working of the human auditory system. These methods have been shown to outperform more conventional signal processing schemes for speech recognition, speech coding, word-recognition and phone classification tasks. Since speaker identification has received lot of attention in speech processing because of its waiting real-world applications, it is attractive to evaluate the performance using auditory models as features. Firstly, this study rums at improving the results for speaker identification. The improvements were made through the use of parameterized feature-sets together with the application of cepstral mean removal for channel equalization. The study is further extended to compare an auditory-based model, the Ensemble Interval Histogram, with mel-scale features, which was shown to perform almost error-free in clean speech. The previous studies of Elli to be more robust to noise were conducted on speaker dependent, small population, isolated words and now are extended to speaker independent, larger population, continuous speech. This study investigates whether the Elli representation is more resistant to telephone noise than mel-cepstrum as was shown in the previous studies, when now for the first time, it is applied for speaker identification task using the state-of-the-art Gaussian mixture model system

    Information Loss in the Human Auditory System

    Full text link
    From the eardrum to the auditory cortex, where acoustic stimuli are decoded, there are several stages of auditory processing and transmission where information may potentially get lost. In this paper, we aim at quantifying the information loss in the human auditory system by using information theoretic tools. To do so, we consider a speech communication model, where words are uttered and sent through a noisy channel, and then received and processed by a human listener. We define a notion of information loss that is related to the human word recognition rate. To assess the word recognition rate of humans, we conduct a closed-vocabulary intelligibility test. We derive upper and lower bounds on the information loss. Simulations reveal that the bounds are tight and we observe that the information loss in the human auditory system increases as the signal to noise ratio (SNR) decreases. Our framework also allows us to study whether humans are optimal in terms of speech perception in a noisy environment. Towards that end, we derive optimal classifiers and compare the human and machine performance in terms of information loss and word recognition rate. We observe a higher information loss and lower word recognition rate for humans compared to the optimal classifiers. In fact, depending on the SNR, the machine classifier may outperform humans by as much as 8 dB. This implies that for the speech-in-stationary-noise setup considered here, the human auditory system is sub-optimal for recognizing noisy words

    Closed-loop auditory-based representation for robust speech recognition

    Get PDF
    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010.Includes bibliographical references (p. 93-96).A closed-loop auditory based speech feature extraction algorithm is presented to address the problem of unseen noise for robust speech recognition. This closed-loop model is inspired by the possible role of the medial olivocochlear (MOC) efferent system of the human auditory periphery, which has been suggested in [6, 13, 42] to be important for human speech intelligibility in noisy environment. We propose that instead of using a fixed filter bank, the filters used in a feature extraction algorithm should be more flexible to adapt dynamically to different types of background noise. Therefore, in the closed-loop model, a feedback mechanism is designed to regulate the operating points of filters in the filter bank based on the background noise. The model is tested on a dataset created from TIDigits database. In this dataset, five kinds of noise are added to synthesize noisy speech. Compared with the standard MFCC extraction algorithm, the proposed closed-loop form of feature extraction algorithm provides 9.7%, 9.1% and 11.4% absolution word error rate reduction on average for three kinds of filter banks respectively.by Chia-ying Lee.S.M

    Feature extraction based on bio-inspired model for robust emotion recognition

    Get PDF
    Emotional state identification is an important issue to achieve more natural speech interactive systems. Ideally, these systems should also be able to work in real environments in which generally exist some kind of noise. Several bio-inspired representations have been applied to artificial systems for speech processing under noise conditions. In this work, an auditory signal representation is used to obtain a novel bio-inspired set of features for emotional speech signals. These characteristics, together with other spectral and prosodic features, are used for emotion recognition under noise conditions. Neural models were trained as classifiers and results were compared to the well-known mel-frequency cepstral coefficients. Results show that using the proposed representations, it is possible to significantly improve the robustness of an emotion recognition system. The results were also validated in a speaker independent scheme and with two emotional speech corpora.Fil: Albornoz, Enrique Marcelo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; ArgentinaFil: Milone, Diego Humberto. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; ArgentinaFil: Rufiner, Hugo Leonardo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Santa Fe. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional. Universidad Nacional del Litoral. Facultad de Ingeniería y Ciencias Hídricas. Instituto de Investigación en Señales, Sistemas e Inteligencia Computacional; Argentin

    Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition

    Full text link
    This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their face. Furthermore, the method does not rely on external annotations, thus complying with cognitive development. Instead, the method uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method using a large multi-person face-to-face interaction dataset. The results show good performance in a speaker dependent setting. However, in a speaker independent setting the proposed method yields a significantly lower performance. We believe that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions.Comment: 10 pages, IEEE Transactions on Cognitive and Developmental System
    • …
    corecore