
    Psychoacoustic improvement of Wiener filtering: some recent approaches and a new method

    Keywords: musical noise, distortion, Wiener filter, psychoacoustics, speech signal

    A single channel speech enhancement technique exploiting human auditory masking properties

    To enhance extremely corrupted speech signals, an Improved Psychoacoustically Motivated Spectral Weighting Rule (IPMSWR) is proposed that controls the predefined residual noise level by a time-frequency dependent parameter. Unlike conventional Psychoacoustically Motivated Spectral Weighting Rules (PMSWR), the level of the residual noise is here varied throughout the enhanced speech based on the discrimination between regions of speech presence and speech absence by means of segmental SNR within critical bands. Controlling the residual noise level in the noise-only regions in this way avoids the unpleasant residual noise perceived at very low SNRs. To derive the gain coefficients, the computation of the masking curve and the estimation of the corrupting noise power are required. Since the clean speech is generally not available to a single-channel speech enhancement technique, the rough clean speech components needed to compute the masking curve are here obtained using advanced spectral subtraction techniques. To estimate the corrupting noise, a new technique is employed that relies on noise power estimation using rapid adaptation and recursive smoothing principles. The performance of the proposed approach is objectively and subjectively compared to that of conventional approaches to highlight the aforementioned improvements.
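The idea of a gain rule whose residual-noise floor varies with a speech-presence decision can be illustrated with a minimal sketch. The code below is our own simplified stand-in, not the paper's IPMSWR: all names and thresholds are hypothetical, and the crude frame-level segmental-SNR test replaces the paper's critical-band discrimination.

```python
import numpy as np

def spectral_gain(noisy_power, noise_power, presence_thr_db=-5.0):
    """Hypothetical per-bin gain with a controlled residual-noise floor.

    A Wiener-like gain is lower-bounded by a floor that depends on a
    crude segmental-SNR speech-presence test: frames judged noise-only
    get a lower floor, so residual noise there is attenuated harder.
    Inputs are (bins x frames) power spectra.
    """
    snr = np.maximum(noisy_power - noise_power, 1e-12) / np.maximum(noise_power, 1e-12)
    wiener = snr / (1.0 + snr)  # classic Wiener gain, always < 1
    seg_snr_db = 10.0 * np.log10(np.mean(snr, axis=0, keepdims=True))
    speech_present = seg_snr_db > presence_thr_db  # crude presence test
    floor = np.where(speech_present, 0.1, 0.03)    # higher floor under speech
    return np.maximum(wiener, floor)
```

The two floor values are illustrative; in the paper the residual level is a time-frequency dependent parameter rather than a per-frame constant.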

    Morphologically filtered power-normalized cochleograms as robust, biologically inspired features for ASR

    In this paper, we present advances in the modeling of the masking behavior of the human auditory system (HAS) to enhance the robustness of the feature extraction stage in automatic speech recognition (ASR). The solution adopted is based on a nonlinear filtering of a spectro-temporal representation applied simultaneously to both frequency and time domains, as if it were an image, using mathematical morphology operations. A particularly important component of this architecture is the so-called structuring element (SE), which in the present contribution is designed as a single three-dimensional pattern using physiological facts, in such a way that it closely resembles the masking phenomena taking place in the cochlea. A proper choice of spectro-temporal representation lends validity to the model throughout the whole frequency spectrum and intensity span, accounting for the variability of the masking properties of the HAS in these two domains. The best results were achieved with the representation introduced as part of the power normalized cepstral coefficients (PNCC) together with a spectral subtraction step. This method has been tested on the Aurora 2, Wall Street Journal and ISOLET databases, including both classical hidden Markov model (HMM) and hybrid artificial neural network (ANN)-HMM back-ends. In these, the proposed front-end analysis provides substantial and significant improvements compared to baseline techniques: up to 39.5% relative improvement compared to MFCC, and 18.7% compared to PNCC, on the Aurora 2 database. This contribution has been supported by an Airbus Defense and Space Grant (Open Innovation - SAVIER) and Spanish Government-CICYT projects TEC2014-53390-P and TEC2014-61729-EX.
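As a rough illustration of morphological filtering of a spectro-temporal representation, the sketch below applies a flat two-dimensional structuring element to a (frequency x time) spectrogram using SciPy's greyscale morphology. The element sizes are arbitrary placeholders and the function name is ours; the paper's SE is a single physiologically designed three-dimensional pattern, not this flat rectangle.

```python
import numpy as np
from scipy.ndimage import grey_opening

def morphological_filter(spectrogram, se_freq=5, se_time=3):
    """Smooth a (freq x time) log-spectrogram with a flat structuring
    element, treating the representation as an image. Greyscale opening
    suppresses narrow peaks that a masking-inspired filter would absorb.
    se_freq/se_time (in bins) are illustrative sizes only.
    """
    footprint = np.ones((se_freq, se_time), dtype=bool)
    return grey_opening(spectrogram, footprint=footprint)
```

Opening with a flat element is anti-extensive (the output never exceeds the input), which is one reason it behaves like a masking-style smoother on spectral detail.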

    ASR Feature Extraction with Morphologically-Filtered Power-Normalized Cochleograms

    Proceedings of: 15th Annual Conference of the International Speech Communication Association. Singapore, September 14-18, 2014. In this paper we present advances in the modeling of the masking behavior of the Human Auditory System to enhance the robustness of the feature extraction stage in Automatic Speech Recognition. The solution adopted is based on a non-linear filtering of a spectro-temporal representation applied simultaneously to both the frequency and time domains, by processing it using mathematical morphology operations as if it were an image. A particularly important component of this architecture is the so-called structuring element: biologically based considerations are addressed in the present contribution to design an element that closely resembles the masking phenomena taking place in the cochlea. The second feature of this contribution is the choice of the underlying spectro-temporal representation. The best results were achieved with the representation introduced as part of the Power Normalized Cepstral Coefficients together with a spectral subtraction step. On the Aurora 2 noisy continuous digits task, we report relative error reductions of 18.7% compared to PNCC and 39.5% compared to MFCC. This contribution has been supported by an Airbus Defense and Space Grant (Open Innovation - SAVIER) and Spanish Government-CICYT project 2011-26807/TEC.

    Loudness of the singing voice: A room acoustics perspective

    This thesis examines ectophonic (sounds created outside the human body) and autophonic (sound from one's own voice) loudness perception for the operatic voice, within the context of room acoustics. Ectophonic loudness perception was modelled within the context of room acoustics for the operatic voice in chapter two. These models were then used to explore the loudness envelope of the messa di voce (MDV), where psychoacoustically based measures were shown to perform better than the physical acoustic measures used in previous studies. The third chapter addressed autophonic loudness perception, while presenting limitations in modelling it in a manner similar to ectophonic loudness models. Some of these limitations were addressed in chapter four with two experiments in which the autophonic loudness of opera singers was explored using direct psychoacoustical scaling methods, within simulated room acoustic environments. In the first experiment, a power-law relationship between autophonic loudness and the sound pressures produced was observed for the magnitude production task, with different power-law exponents for different phonemes. The contribution of room acoustics to autophonic loudness scaling was not statistically significant. The Lombard slope, as it applies to autophonic perception and room acoustics, was also studied, with some evidence found in its support. The second experiment in chapter four explored autophonic loudness for more continuous vocalisations (crescendi, decrescendi, and MDV) using adapted direct scaling methods. The results showed that sensorimotor mechanisms seem to be more important than hearing and room acoustics in autophonic loudness perception, which is consistent with previous research. Overall, this thesis showed that the effect of room acoustics on the loudness of the singing voice needs to be assessed based on the communication scenario. This has relevance for voice analysis, loudness perception in general, room acoustics simulation, and vocal pedagogy.
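A power-law relationship between autophonic loudness L and sound pressure p has the form L = k * p**a, and the exponent can be estimated by a straight-line fit in log-log coordinates. The snippet below is an illustrative sketch of that standard procedure (the function name is ours; the thesis reports different exponents per phoneme, which this toy fit does not model).

```python
import numpy as np

def fit_power_law(pressure, loudness):
    """Fit L = k * p**a by linear regression in log-log space:
    log L = a * log p + log k, so the slope is the exponent a."""
    log_p, log_l = np.log(pressure), np.log(loudness)
    a, log_k = np.polyfit(log_p, log_l, 1)
    return np.exp(log_k), a
```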

    Discrimination of Speech From Non-Speech Based on Multiscale Spectro-Temporal Modulations

    We describe a content-based audio classification algorithm based on novel multiscale spectro-temporal modulation features inspired by a model of auditory cortical processing. The task explored is to discriminate speech from non-speech consisting of animal vocalizations, music and environmental sounds. Although this is a relatively easy task for humans, it is still difficult to automate well, especially in noisy and reverberant environments. The auditory model captures basic processes occurring from the early cochlear stages to the central cortical areas. The model generates a multidimensional spectro-temporal representation of the sound, which is then analyzed by a multilinear dimensionality reduction technique and classified by a Support Vector Machine (SVM). Generalization of the system to signals at high levels of additive noise and reverberation is evaluated and compared to two existing approaches [1], [2]. The results demonstrate the advantages of the auditory model over the other two systems, especially at low SNRs and high reverberation.
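As a toy stand-in for spectro-temporal modulation analysis (the paper uses a multiscale cortical model followed by multilinear dimensionality reduction, not this shortcut), one can take low-order two-dimensional Fourier coefficients of a spectrogram as a crude modulation-feature vector; slow joint variations across frequency and time land in the low-index bins.

```python
import numpy as np

def modulation_features(spectrogram, n_keep=8):
    """Crude modulation features: magnitude of the 2-D Fourier transform
    of a (freq x time) spectrogram, keeping only the lowest
    n_keep x n_keep modulation coefficients as a flat feature vector."""
    mod = np.abs(np.fft.fft2(spectrogram))
    return mod[:n_keep, :n_keep].ravel()
```

A vector like this could then be fed to any off-the-shelf classifier (the paper uses an SVM).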

    Methods for the automatic classification of audio content

    This study presents an overview of different methods of digital signal processing and pattern recognition that are frequently applicable to the automatic recognition, classification and description of audio content. Moreover, strategies for the combination of the said methods are discussed. Some published practical applications from different areas are cited to illustrate the use of the basic methods and the combined recognition strategies. A brief overview of human auditory perception is also given, with emphasis on the aspects that are important for audio recognition.

    Bio-motivated features and deep learning for robust speech recognition

    International Mention in the doctoral degree. In spite of the enormous leap forward that Automatic Speech Recognition (ASR) technologies have experienced over the last five years, their performance under harsh environmental conditions is still far from that of humans, preventing their adoption in several real applications. In this thesis the challenge of robustness of modern automatic speech recognition systems is addressed along two main research lines. The first one focuses on modeling the human auditory system to improve the robustness of the feature extraction stage, yielding novel auditory-motivated features. Two main contributions are produced. On the one hand, a model of the masking behaviour of the Human Auditory System (HAS) is introduced, based on the non-linear filtering of a speech spectro-temporal representation applied simultaneously to both frequency and time domains. This filtering is accomplished by using image processing techniques, in particular mathematical morphology operations with a specifically designed Structuring Element (SE) that closely resembles the masking phenomena that take place in the cochlea. On the other hand, the temporal patterns of auditory-nerve firings are modeled. Most conventional acoustic features are based on short-time energy per frequency band, discarding the information contained in the temporal patterns. Our contribution is the design of several types of feature extraction schemes based on the synchrony effect of auditory-nerve activity, showing that modeling this effect can indeed improve speech recognition accuracy in the presence of additive noise. Both models are further integrated into the well-known Power Normalized Cepstral Coefficients (PNCC). The second research line addresses the problem of robustness in noisy environments by means of Deep Neural Network (DNN)-based acoustic modeling and, in particular, of Convolutional Neural Network (CNN) architectures.
A deep residual network scheme is proposed and adapted for our purposes, allowing Residual Networks (ResNets), originally intended for image processing tasks, to be used in speech recognition, where the network input is small in comparison with usual image dimensions. We have observed that ResNets on their own already enhance the robustness of the whole system against noisy conditions. Moreover, our experiments demonstrate that their combination with the auditory-motivated features devised in this thesis provides significant improvements in recognition accuracy in comparison to other state-of-the-art CNN-based ASR systems under mismatched conditions, while maintaining the performance in matched scenarios. The proposed methods have been thoroughly tested and compared with other state-of-the-art proposals for a variety of datasets and conditions. The obtained results prove that our methods outperform other state-of-the-art approaches and reveal that they are suitable for practical applications, especially where the operating conditions are unknown. This thesis focuses on proposing solutions to the problem of robust speech recognition; to that end, two research lines have been pursued. In the first line, novel feature extraction schemes have been proposed, based on modeling the behaviour of the human auditory system, especially the masking and synchrony phenomena. In the second, recognition rates are improved through the use of deep learning techniques in combination with the proposed features. The main goal of the proposed methods is to improve the accuracy of the recognition system when the operating conditions are not known, although the opposite case has also been addressed.
Specifically, our main proposals are the following. Simulating the human auditory system with the aim of improving the recognition rate in difficult conditions, mainly in high-noise situations, by proposing novel feature extraction schemes:
• Modeling the masking behaviour of the human auditory system, using image processing techniques on the spectrum, specifically by designing a morphological filter that captures this effect.
• Modeling the synchrony effect that takes place in the auditory nerve.
• Integrating both models into the well-known Power Normalized Cepstral Coefficients (PNCC).
Applying deep learning techniques with the aim of making the system more robust to noise, in particular through the use of deep convolutional neural networks, such as residual networks. Finally, applying the proposed features in combination with deep neural networks, with the main goal of obtaining significant improvements when the training and test conditions do not match.
Programa Oficial de Doctorado en Multimedia y Comunicaciones. Committee: President: Javier Ferreiros López; Secretary: Fernando Díaz de María; Member: Rubén Solera Ureñ
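The skip-connection idea behind the residual networks used in the thesis can be sketched in a few lines of NumPy. This toy one-dimensional block (names, kernel sizes, and padding choices are ours) is illustrative of the y = relu(x + F(x)) pattern only, not the thesis architecture.

```python
import numpy as np

def conv1d_same(x, w):
    """'Same'-size 1-D convolution along the time axis of a
    (channels x time) array, using edge padding; w must have odd length."""
    pad = len(w) // 2
    xp = np.pad(x, ((0, 0), (pad, pad)), mode="edge")
    return np.stack([np.convolve(ch, w, mode="valid") for ch in xp])

def residual_block(features, w1, w2):
    """Minimal residual unit: y = relu(x + F(x)), where F is two small
    convolutions with a ReLU in between. The skip connection lets the
    block learn a correction to the identity rather than a full mapping."""
    h = np.maximum(conv1d_same(features, w1), 0.0)
    h = conv1d_same(h, w2)
    return np.maximum(features + h, 0.0)
```

The same pattern scales to the 2-D convolutions over (frequency x time) inputs that a speech ResNet would use.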