
    An application of an auditory periphery model in speaker identification

    The number of applications of automatic Speaker Identification (SID) is growing due to advanced technologies for secure access and authentication in services and devices. A 2016 study found that the Cascade of Asymmetric Resonators with Fast-Acting Compression (CAR-FAC) cochlear model achieved the best fit to a set of human auditory physiological data among seven recent cochlear models. Motivated by this result, I apply the CAR-FAC cochlear model to an SID task for the first time, aiming to approach the performance of the human auditory system. This thesis investigates the potential of the CAR-FAC model in SID, examining its capability in both text-dependent and text-independent tasks. It also investigates the contributions of different parameters, nonlinearities, and stages of the CAR-FAC to SID accuracy. The performance of the CAR-FAC is compared with another recent cochlear model, the Auditory Nerve (AN) model. In addition, three FFT-based auditory features – Mel-Frequency Cepstral Coefficients (MFCC), Frequency Domain Linear Prediction (FDLP), and Gammatone Frequency Cepstral Coefficients (GFCC) – are included to compare their performance with the cochlear features. This comparison allows me to investigate a better front-end for a noise-robust SID system. Three statistical classifiers – a Gaussian Mixture Model with Universal Background Model (GMM-UBM), a Support Vector Machine (SVM), and an i-vector system – were used to evaluate performance; these classifiers allow me to investigate nonlinearities in the cochlear front-ends. Performance is evaluated under clean and noisy conditions for a wide range of noise levels. Techniques to improve the performance of a cochlear algorithm are also investigated; it was found that applying a cube root and a DCT to the cochlear output substantially enhances SID accuracy.
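    The cube-root-plus-DCT post-processing mentioned at the end of the abstract can be illustrated with a minimal sketch. This is not the thesis code: the function name, array shape, and channel count are illustrative assumptions, and the input is a placeholder for the per-channel output of a cochlear model such as CAR-FAC.

    # Minimal sketch: cube-root compression followed by a DCT across channels,
    # the post-processing step the abstract reports as improving SID accuracy.
    # `cochlear_output` is a hypothetical (n_frames, n_channels) array of
    # non-negative per-channel energies from a cochlear front-end.
    import numpy as np
    from scipy.fftpack import dct

    def cochlear_cepstral_features(cochlear_output, n_coeffs=20):
        compressed = np.cbrt(cochlear_output)                     # cube-root amplitude compression
        cepstra = dct(compressed, type=2, axis=1, norm="ortho")   # DCT decorrelates the channels
        return cepstra[:, :n_coeffs]                              # keep the lowest-order coefficients

    # Example with random placeholder data: 100 frames, 64 cochlear channels
    feats = cochlear_cepstral_features(np.abs(np.random.randn(100, 64)))
    print(feats.shape)  # (100, 20)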

    Cortical Spike Synchrony as a Measure of Input Familiarity

    J.G.O. was supported by the Ministerio de Economía y Competitividad and FEDER (Spain, project FIS2015-66503-C3-1-P) and the ICREA Academia programme. E.U. acknowledges support from the Scottish Universities Life Sciences Alliance (SULSA) and HPC-Europa2. Peer-reviewed postprint.

    Sound source coding in the azimuthal plane: separating sounds via short-term interaural time difference estimations

    The interaural time difference (ITD) is the main cue for localizing low-frequency sounds (below ~2 kHz) in the azimuthal plane. The extractors of this cue are neurons in two nuclei of the mammalian auditory brainstem: the medial superior olive (MSO) and the low-frequency limb of the lateral superior olive (lLSO). The read-out mechanism at the population level is unknown, as single neurons show different responses to frequency-varying stimuli. This poses a challenge especially for natural sound stimuli and complex auditory scenes, which cover a wide range of frequencies, i.e., have a very broad spectrum. To find an encoder of ITDs, we have developed so-called effective population models of the human MSO and lLSO. They are effective in the sense that each individual neuron is identified by the three defining properties that determine its frequency-dependent ITD tuning: the best frequency (BF), the characteristic delay (CD), and the characteristic phase (CP). We have formulated an ITD decoding strategy in the 2D space spanned by the lLSO and MSO membrane potentials. From each hemisphere, a separate ITD can be decoded; these two estimates can be weighted and balanced to retrieve the location of sound sources in the horizontal plane. To this end, we make use of so-called short-term ITDs, which are successive estimates in small time windows. Our results indicate that sound localization can be performed correctly in time windows as short as 1 ms. To separate stimuli within complex auditory scenes, we fit Gaussian Mixture Models to the distributions of short-term ITD estimates. The results show that sound separation can be performed reliably when the long-term ITD estimate (a distribution of short-term ITDs) covers a time interval longer than 1 s. Furthermore, we conclude that sounds can be separated and reconstructed from complex auditory scenes based solely on one auditory cue, the ITD.
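    The separation step can be sketched as follows. The ITD values are synthetic placeholders standing in for short-term estimates decoded from the MSO/lLSO population responses, and the two-source setup and parameter values are assumptions for illustration only.

    # Minimal sketch: fit a Gaussian Mixture Model to a distribution of
    # short-term ITD estimates and assign each estimate to a source.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Two hypothetical sources at ITDs of -300 us and +200 us, with estimation noise
    short_term_itds = np.concatenate([
        rng.normal(-300e-6, 40e-6, 2000),
        rng.normal(+200e-6, 40e-6, 2000),
    ]).reshape(-1, 1)

    gmm = GaussianMixture(n_components=2, random_state=0).fit(short_term_itds)
    print(gmm.means_.ravel())               # recovered source ITDs (in seconds)
    labels = gmm.predict(short_term_itds)   # source assignment per short-term estimate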

    Neuromorphic auditory computing: towards a digital, event-based implementation of the hearing sense for robotics

    This work aims to advance the development of neuromorphic audio processing systems in robots through the implementation of an open-source neuromorphic cochlea, event-based models of the primary auditory nuclei, and their use in real-time robotics applications. First, the main gaps in working with neuromorphic cochleae were identified; among them, the accessibility and usability of such sensors is a critical aspect. Silicon cochleae may not be as flexible as desired for some applications, but FPGA-based sensors are an alternative for fast prototyping and proof-of-concept applications. Therefore, a software tool was implemented for generating open-source, user-configurable Neuromorphic Auditory Sensor (NAS) models that can be deployed on any FPGA, removing the aforementioned barriers for the neuromorphic research community. Next, the biological principles of the animal auditory system were studied with the aim of continuing the development of the NAS. More specifically, the principles of binaural hearing were studied in depth in order to implement event-based models for real-time sound source localization. Two different approaches were followed to extract interaural time differences from event-based auditory signals: on the one hand, a digital, event-based design of the Jeffress model was implemented; on the other hand, a novel digital implementation of the Time Difference Encoder model was designed and implemented on FPGA. Finally, three different robotic platforms were used to evaluate the performance of the proposed real-time neuromorphic audio processing architectures. An audio-guided central pattern generator was used to control a hexapod robot in real time using spiking neural networks on SpiNNaker. Then, a sensory integration application was implemented, combining sound source localization and obstacle avoidance for autonomous robot navigation. Lastly, the NAS was integrated into the iCub robotic platform, the first time an event-based cochlea has been used in a humanoid robot. The conclusions obtained are presented, and new features and improvements are proposed for future work.
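    The Jeffress-style approach can be illustrated with a minimal sketch of coincidence detection across an internal delay line. The event streams, coincidence window, and delay grid below are illustrative assumptions and do not correspond to the actual NAS/FPGA event format or the thesis implementation.

    # Minimal sketch of a Jeffress-style scheme: each candidate internal delay counts
    # near-coincident left/right events, and the delay with the most coincidences is
    # taken as the ITD estimate. Event times are in seconds.
    import numpy as np

    def jeffress_itd(left_events, right_events, max_itd=1e-3, n_delays=41, window=20e-6):
        delays = np.linspace(-max_itd, max_itd, n_delays)   # internal delay line
        counts = np.zeros(n_delays)
        right_events = np.asarray(right_events)
        for i, d in enumerate(delays):
            shifted = right_events + d                      # apply candidate delay to one ear
            # count left events with a right event inside the coincidence window
            counts[i] = sum(np.any(np.abs(shifted - t) < window) for t in left_events)
        return delays[np.argmax(counts)]

    # Example: the right-ear events lag the left-ear events by 300 us
    left = np.arange(0.0, 0.1, 2e-3)
    right = left + 300e-6
    print(jeffress_itd(left, right))   # ~ -3e-4 s, with the sign given by this convention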

    Bio-motivated features and deep learning for robust speech recognition

    In spite of the enormous leap forward that Automatic Speech Recognition (ASR) technologies have experienced over the last five years, their performance under harsh environmental conditions is still far from that of humans, which prevents their adoption in several real applications. In this thesis, the challenge of robustness in modern ASR systems is addressed along two main research lines. The first focuses on modeling the human auditory system to improve the robustness of the feature extraction stage, yielding novel auditory-motivated features. Two main contributions are made. On the one hand, a model of the masking behaviour of the Human Auditory System (HAS) is introduced, based on non-linear filtering of a speech spectro-temporal representation applied simultaneously in the frequency and time domains. This filtering is accomplished with image processing techniques, in particular mathematical morphology operations with a specifically designed Structuring Element (SE) that closely resembles the masking phenomena taking place in the cochlea. On the other hand, the temporal patterns of auditory-nerve firings are modeled. Most conventional acoustic features are based on short-time energy per frequency band, discarding the information contained in the temporal patterns. Our contribution is the design of several feature extraction schemes based on the synchrony effect of auditory-nerve activity, showing that modeling this effect can indeed improve speech recognition accuracy in the presence of additive noise. Both models are further integrated into the well-known Power-Normalized Cepstral Coefficients (PNCC). The second research line addresses the problem of robustness in noisy environments by means of Deep Neural Network (DNN)-based acoustic modeling and, in particular, Convolutional Neural Network (CNN) architectures. A deep residual network scheme is proposed and adapted for our purposes, allowing Residual Networks (ResNets), originally intended for image processing tasks, to be used in speech recognition, where the network input is small in comparison with usual image dimensions. We have observed that ResNets on their own already enhance the robustness of the whole system against noisy conditions. Moreover, our experiments demonstrate that their combination with the auditory-motivated features devised in this thesis provides significant improvements in recognition accuracy over other state-of-the-art CNN-based ASR systems under mismatched conditions, while maintaining performance in matched scenarios. The proposed methods have been thoroughly tested and compared with other state-of-the-art proposals on a variety of datasets and conditions. The results show that our methods outperform other state-of-the-art approaches and are suitable for practical applications, especially where the operating conditions are unknown.
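    The morphological masking idea can be illustrated with a minimal sketch: a grey-scale dilation of a log-spectrogram with a small structuring element that spreads energy over neighbouring time-frequency bins. The SE shape and sizes below are illustrative assumptions, not the specifically designed SE of the thesis.

    # Minimal sketch: spectro-temporal masking via mathematical morphology.
    import numpy as np
    from scipy.ndimage import grey_dilation

    def morphological_masking(log_spec, se_shape=(3, 7)):
        # Asymmetric footprint: longer along time (axis 1) than along frequency (axis 0),
        # so each bin is dominated by the strongest components in its neighbourhood,
        # loosely mimicking simultaneous and temporal masking.
        footprint = np.ones(se_shape, dtype=bool)
        return grey_dilation(log_spec, footprint=footprint)

    # Placeholder (n_bands, n_frames) log-spectrogram
    spec = 10.0 * np.random.randn(40, 200)
    masked = morphological_masking(spec)
    print(masked.shape)   # (40, 200)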