4 research outputs found

    ASR Systems in Noisy Environment: Analysis and Solutions for Increasing Noise Robustness

    This paper analyzes Automatic Speech Recognition (ASR) systems suitable for use in noisy environments and suggests optimum configurations under various noise conditions. The behavior of standard parameterization techniques was analyzed from the viewpoint of robustness against background noise, for Mel-frequency cepstral coefficients (MFCC), Perceptual linear predictive (PLP) coefficients, and their modified forms combining the main blocks of PLP and MFCC. The second part is devoted to the analysis and contribution of modified techniques containing frequency-domain noise suppression and voice activity detection (VAD). The above-mentioned techniques were tested with signals recorded in real noisy environments, on a Czech digit recognition task and on the AURORA databases. Finally, the contribution of special VAD-selective training and MLLR adaptation of acoustic models was studied for various signal features.
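    As a rough illustration of the kind of front-end discussed above, the following Python sketch extracts MFCCs and applies a crude energy-based voice activity detector; the file name, frame sizes, and threshold are assumptions for illustration only, not the configuration used in the paper.

        # Minimal sketch of an MFCC front-end with a simple energy-based VAD.
        # All settings below are illustrative assumptions, not the paper's setup.
        import numpy as np
        import librosa

        y, sr = librosa.load("noisy_digits.wav", sr=16000)   # hypothetical input file

        # 13 Mel-frequency cepstral coefficients, 25 ms frames, 10 ms hop
        mfcc = librosa.feature.mfcc(
            y=y, sr=sr, n_mfcc=13,
            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
        )

        # Crude energy-based VAD: keep frames whose log energy exceeds the
        # noise floor (estimated from the quietest 10% of frames) by 6 dB.
        rms = librosa.feature.rms(
            y=y, frame_length=int(0.025 * sr), hop_length=int(0.010 * sr)
        )[0]
        log_e = 20 * np.log10(rms + 1e-10)
        speech = log_e > np.percentile(log_e, 10) + 6.0

        mfcc_speech = mfcc[:, speech]        # non-speech frames dropped
        print(mfcc.shape, mfcc_speech.shape)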

    Reconocedor de habla basado en la extracción de características articulatorias (Speech recognizer based on articulatory feature extraction)

    Automatic speech recognition systems aim to provide a natural voice interface between humans and machines. In many cases they adopt the strategy of imitating, as far as possible, the mechanisms of human-to-human communication. The implementation of such a system is therefore critical and must take into account the problems it faces, such as additive noise and speaker variability. This final-year project tests new feature extraction techniques that use articulatory information, in order to determine whether the resulting system performs better. To do so, articulatory features are extracted from the speech signal and classified with a hybrid model based on neural networks (multilayer perceptrons). For feature extraction, 7 classifiers were built (an eighth was added later), one for each of the 7 articulatory dimensions defined; each dimension takes different values depending on the nature of the sound produced. The differences between an ideal environment and a real one (with additive noise) were also considered, in order to evaluate the resulting loss in performance. The results not only give an overall picture of the system's performance, but also show which characteristics of the voice are most robust to disturbances caused by ambient noise.
    Ingeniería Técnica en Sonido e Imagen
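    The sketch below illustrates, in Python with scikit-learn, what one of the per-dimension articulatory classifiers might look like; the label set, feature dimensionality, network size, and the stand-in data are assumptions, not the configuration actually used in this project.

        # Minimal sketch: an MLP classifier for one hypothetical articulatory
        # dimension. Labels, features and network size are illustrative only.
        import numpy as np
        from sklearn.neural_network import MLPClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(0)
        X = rng.normal(size=(5000, 39))            # stand-in acoustic frames (e.g. MFCC + deltas)
        labels = ["labial", "alveolar", "velar", "vowel", "silence"]
        y = rng.choice(labels, size=5000)          # stand-in per-frame labels

        clf = make_pipeline(
            StandardScaler(),
            MLPClassifier(hidden_layer_sizes=(256,), max_iter=50, random_state=0),
        )
        clf.fit(X, y)

        # In a hybrid system, per-frame posteriors from each articulatory
        # classifier would be combined and passed to the decoder.
        posteriors = clf.predict_proba(X[:10])
        print(posteriors.shape)                    # (10, number of classes)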

    The acoustics of place of articulation in English plosives

    PhD Thesis. This thesis investigates certain aspects of the acoustics of plosives' place of articulation that have not been addressed by most previous studies, namely:
    1. To test the performance of a technique for collapsing F2onset and F2mid into a single attribute, termed F2R. Results: F2R distinguishes place with effectively the same accuracy as F2onset+F2mid, being within ±1 percentage point of F2onset+F2mid at its strongest over most of the conditions examined.
    2. To compare the strength of burst-based attributes at distinguishing place of articulation with and without normalization by individual speaker. Results: Lobanov normalization on average boosted the classification accuracy of individual attributes by 1.4 percentage points, but this modest improvement shrank or disappeared when the normalized attributes were combined into a single classification.
    3. To examine the effect of different spectral representations (Hz-dB, Bark-phon, and Bark-sone) on the accuracy of the burst attributes. The results are mixed but mostly suggest that the choice between these representations is not a major factor in the classification accuracy of the attributes (mean difference of 1 to 1.5 percentage points); the choice of frequency region in the burst (mid versus high) is a far more important factor (13 percentage-point difference in mean classification accuracy).
    4. To compare the performance of some traditional-phonetic burst attributes with the first 12 coefficients of the discrete cosine transform (DCT). The motivation for this comparison is that phonetic science has a long tradition of developing burst attributes tailored to the specific task of extracting place-of-articulation information from the burst, whereas automatic speech recognition (ASR) has long used attributes that are theoretically expected to capture more of the variance in the burst. Results: the DCT coefficients yielded a higher burst classification accuracy than the traditional phonetic attributes, by 3 percentage points.
    Economic and Social Research Council
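    To make two of the compared attribute manipulations concrete, the following Python sketch shows per-speaker Lobanov (z-score) normalization of an attribute and the first 12 DCT coefficients of a burst spectrum; the column names and stand-in values are hypothetical and do not reproduce the thesis's own attribute definitions.

        # Minimal sketch: Lobanov normalization by speaker and the first 12
        # DCT coefficients of a burst spectrum. All values are stand-ins.
        import numpy as np
        import pandas as pd
        from scipy.fft import dct

        # Lobanov normalization: z-score each attribute within speaker.
        df = pd.DataFrame({
            "speaker": ["s1", "s1", "s2", "s2"],
            "F2R": [1500.0, 1800.0, 1400.0, 1750.0],   # stand-in values (Hz)
        })
        df["F2R_lobanov"] = df.groupby("speaker")["F2R"].transform(
            lambda x: (x - x.mean()) / x.std(ddof=0)
        )

        # First 12 DCT coefficients of a (log-magnitude) burst spectrum.
        burst_spectrum_db = np.random.default_rng(0).normal(size=129)  # stand-in spectrum
        dct_coeffs = dct(burst_spectrum_db, type=2, norm="ortho")[:12]

        print(df)
        print(dct_coeffs.shape)    # (12,)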