
    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications able to operate in real-world environments, such as mobile communication services and smart homes.
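
    As background for the modelling and search stages listed above (this is the standard statistical formulation of ASR, included as a reminder rather than as material from the book), the recogniser searches the hypothesis space for the word sequence that best combines the acoustic and language model scores:

        \hat{W} = \arg\max_{W} \; p(X \mid W)\, P(W)

    where X is the sequence of acoustic feature vectors, p(X | W) is the acoustic model, P(W) is the language model, and the arg max is approximated by efficient search (decoding) algorithms.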

    Acoustic Approaches to Gender and Accent Identification

    There has been considerable research on the problems of speaker and language recognition from samples of speech. A less researched problem is that of accent recognition. Although this is a similar problem to language identification, different accents of a language exhibit more fine-grained differences between classes than languages do. This presents a tougher problem for traditional classification techniques. In this thesis, we propose and evaluate a number of techniques for gender and accent classification. These techniques are novel modifications and extensions to state-of-the-art algorithms, and they result in enhanced performance on gender and accent recognition. The first part of the thesis focuses on the problem of gender identification, and presents a technique that gives improved performance in situations where training and test conditions are mismatched. The bulk of this thesis is concerned with the application of the i-Vector technique to accent identification; this is the most successful approach to acoustic classification to have emerged in recent years. We show that it is possible to achieve high-accuracy accent identification without reliance on transcriptions and without utilising phoneme recognition algorithms. The thesis describes various stages in the development of i-Vector-based accent classification that improve on the standard approaches usually applied for speaker or language identification, which are insufficient on their own. We demonstrate that very good accent identification performance is possible with acoustic methods by considering different i-Vector projections, front-end parameters, i-Vector configuration parameters, and an optimised fusion of the resulting i-Vector classifiers obtainable from the same data. We claim to have achieved the best accent identification performance on the test corpus for acoustic methods, with up to 90% identification rate. This performance is even better than that of previously reported acoustic-phonotactic systems on the same corpus, and is very close to the performance obtained via transcription-based accent identification. Finally, we demonstrate that the utilisation of our techniques for speech recognition purposes leads to considerably lower word error rates. Keywords: Accent Identification, Gender Identification, Speaker Identification, Gaussian Mixture Model, Support Vector Machine, i-Vector, Factor Analysis, Feature Extraction, British English, Prosody, Speech Recognition
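
    A back-end of the kind described above can be sketched in a few lines. The following hypothetical Python example assumes the i-vectors have already been extracted from the audio (the total variability model itself is not shown) and uses random placeholder data; it only illustrates an LDA projection, two simple classifiers, and a score-level fusion, not the thesis' actual configuration.

    # Hypothetical sketch: back-end classification of pre-extracted i-vectors
    # for accent identification, with a simple score-level fusion of two classifiers.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    n_accents, dim = 14, 400                   # illustrative: 14 accent classes, 400-dim i-vectors
    X_train = rng.normal(size=(1400, dim))     # placeholder i-vectors (one row per utterance)
    y_train = rng.integers(0, n_accents, 1400)
    X_test = rng.normal(size=(200, dim))

    # i-Vector projection: LDA maps i-vectors to a class-discriminative subspace.
    lda = LinearDiscriminantAnalysis(n_components=n_accents - 1)
    Z_train = lda.fit_transform(X_train, y_train)
    Z_test = lda.transform(X_test)

    # Classifier A: linear SVM on the projected i-vectors.
    svm = SVC(kernel="linear", probability=True).fit(Z_train, y_train)
    scores_svm = svm.predict_proba(Z_test)

    # Classifier B: cosine scoring against per-accent mean i-vectors.
    means = np.stack([X_train[y_train == k].mean(axis=0) for k in range(n_accents)])
    unit = lambda a: a / np.linalg.norm(a, axis=-1, keepdims=True)
    scores_cos = unit(X_test) @ unit(means).T

    # Score-level fusion: weighted sum of the two (normalised) score sets.
    fused = 0.7 * scores_svm + 0.3 * (scores_cos - scores_cos.mean()) / scores_cos.std()
    predicted_accent = fused.argmax(axis=1)

    In a real system the fusion weights would be tuned on held-out data, and additional projections (for example within-class covariance normalisation) would typically be compared before fusing.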

    Bio-motivated features and deep learning for robust speech recognition

    International Doctorate Mention. In spite of the enormous leap forward that Automatic Speech Recognition (ASR) technologies have experienced over the last five years, their performance under harsh environmental conditions is still far from that of humans, preventing their adoption in several real applications. In this thesis the challenge of robustness of modern automatic speech recognition systems is addressed following two main research lines. The first one focuses on modeling the human auditory system to improve the robustness of the feature extraction stage, yielding novel auditory-motivated features. Two main contributions are produced. On the one hand, a model of the masking behaviour of the Human Auditory System (HAS) is introduced, based on the non-linear filtering of a speech spectro-temporal representation applied simultaneously to both the frequency and time domains. This filtering is accomplished by using image processing techniques, in particular mathematical morphology operations with a specifically designed Structuring Element (SE) that closely resembles the masking phenomena that take place in the cochlea. On the other hand, the temporal patterns of auditory-nerve firings are modeled. Most conventional acoustic features are based on short-time energy per frequency band, discarding the information contained in the temporal patterns. Our contribution is the design of several types of feature extraction schemes based on the synchrony effect of auditory-nerve activity, showing that the modeling of this effect can indeed improve speech recognition accuracy in the presence of additive noise. Both models are further integrated into the well-known Power Normalized Cepstral Coefficients (PNCC). The second research line addresses the problem of robustness in noisy environments by means of Deep Neural Network (DNN)-based acoustic modeling and, in particular, of Convolutional Neural Network (CNN) architectures. A deep residual network scheme is proposed and adapted for our purposes, allowing Residual Networks (ResNets), originally intended for image processing tasks, to be used in speech recognition, where the network input is small in comparison with usual image dimensions. We have observed that ResNets on their own already enhance the robustness of the whole system against noisy conditions. Moreover, our experiments demonstrate that their combination with the auditory-motivated features devised in this thesis provides significant improvements in recognition accuracy in comparison to other state-of-the-art CNN-based ASR systems under mismatched conditions, while maintaining the performance in matched scenarios. The proposed methods have been thoroughly tested and compared with other state-of-the-art proposals for a variety of datasets and conditions. The obtained results show that our methods outperform other state-of-the-art approaches and reveal that they are suitable for practical applications, especially where the operating conditions are unknown.
    The aim of this thesis is to propose solutions to the problem of robust speech recognition; to that end, two lines of research have been pursued. In the first, novel feature extraction schemes are proposed, based on modelling the behaviour of the human auditory system, especially the masking and synchrony phenomena. In the second, recognition rates are improved through deep learning techniques used in conjunction with the proposed features. The main goal of the proposed methods is to improve recognition accuracy when the operating conditions are unknown, although the matched case has also been addressed. Specifically, our main proposals are the following. First, simulating the human auditory system in order to improve recognition rates in difficult conditions, mainly high-noise situations, by proposing novel feature extraction schemes; in this direction, our proposals are to model the masking behaviour of the human auditory system using image processing techniques on the spectrum, in particular by designing a morphological filter that captures this effect, to model the synchrony effect that takes place in the auditory nerve, and to integrate both models into the well-known Power Normalized Cepstral Coefficients (PNCC). Second, applying deep learning techniques to make the system more robust to noise, in particular deep convolutional neural networks such as residual networks. Finally, applying the proposed features in combination with deep neural networks, with the main goal of obtaining significant improvements when training and test conditions do not match. Programa Oficial de Doctorado en Multimedia y Comunicaciones. Chair: Javier Ferreiros López. Secretary: Fernando Díaz de María. Member: Rubén Solera Ureñ
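
    A minimal sketch of the masking model described in the first research line, assuming a log-power spectrogram as input; the structuring element below is a plain rectangle used only for illustration, not the cochlea-inspired SE designed in the thesis.

    # Illustrative sketch: morphological filtering of a spectrogram to mimic
    # masking applied simultaneously along the frequency and time axes.
    import numpy as np
    from scipy.signal import chirp, spectrogram
    from scipy.ndimage import grey_erosion, grey_dilation

    fs = 16000
    t = np.linspace(0, 2, 2 * fs, endpoint=False)
    x = chirp(t, f0=200, f1=3000, t1=2) + 0.1 * np.random.randn(t.size)   # noisy test signal

    f, frames, Sxx = spectrogram(x, fs=fs, nperseg=400, noverlap=240)
    log_spec = 10 * np.log10(Sxx + 1e-10)      # frequency bins x time frames

    # Structuring element spanning a few frequency bins and time frames, so the
    # non-linear filter acts on both domains at once.
    se = np.ones((3, 5))

    # Morphological opening (erosion followed by dilation): weak components that
    # would be masked by stronger neighbouring energy are suppressed.
    masked = grey_dilation(grey_erosion(log_spec, footprint=se), footprint=se)

    Features computed from the filtered representation (or its integration into PNCC) would then feed the acoustic model in place of conventional log-energy features.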

    Investigating the neural mechanisms underlying audio-visual perception using electroencephalography (EEG)

    Traditionally, research into how we perceive our external world has focused on a unisensory approach, examining how information is processed by one sense at a time. This produced a vast literature of results revealing how our brains process information from the different senses, from fields such as psychophysics, animal electrophysiology, and neuroimaging. However, we know from our own experience that we use more than one sense at a time to understand our external world. Therefore, to fully understand perception we must understand not only how the brain processes information from individual sensory modalities, but also how and when this information interacts and combines with information from other modalities. In short, we need to understand the phenomenon of multisensory perception. The work in this thesis describes three experiments that aim to provide new insights into this topic. Specifically, the three experiments presented here focus on examining when and where effects related to multisensory perception emerge in neural signals, and whether or not these effects can be related to behaviour in a time-resolved way and on a trial-by-trial basis. These experiments were carried out using a novel combination of psychophysics, high-density electroencephalography (EEG), and advanced computational methods (linear discriminant analysis and mutual information analysis).

    Experiment 1 (Chapter 3) investigated how behavioural and neural signals are modulated by the reliability of sensory information. Previous work has shown that subjects weight sensory cues in proportion to their relative reliabilities; high-reliability cues are assigned a higher weight and have more influence on the final perceptual estimate, while low-reliability cues are assigned a lower weight and have less influence. Despite this widespread finding, it remains unclear when neural correlates of sensory reliability emerge during a trial, and whether or not modulations in neural signals due to reliability relate to modulations in behavioural reweighting. To investigate these questions we used a combination of psychophysics, EEG-based neuroimaging, single-trial decoding, and regression modelling. Subjects performed an audio-visual rate discrimination task where the modality (auditory, visual, audio-visual), stimulus stream rate (8 to 14 Hz), visual reliability (high/low), and congruency in rate between audio-visual stimuli (± 2 Hz) were systematically manipulated. For the behavioural data and the EEG components (derived using linear discriminant analysis), a set of perceptual and neural weights was calculated for each time point. The behavioural results revealed that participants weighted sensory information based on reliability: as visual reliability decreased, auditory weighting increased. These modulations in perceptual weights emerged early after stimulus onset (48 ms). The EEG data revealed that neural correlates of sensory reliability and perceptual weighting were also evident in decoding signals, and that these occurred surprisingly early in the trial (84 ms). Finally, source localisation suggested that these correlates originated in early sensory (occipital/temporal) and parietal regions, respectively. Overall, these results provide the first insights into the temporal dynamics underlying human cue weighting, and suggest that it is an early, dynamic, and distributed process in the brain.
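
    The single-trial, time-resolved decoding approach can be illustrated with a toy sketch (random placeholder data and invented dimensions; the thesis' actual preprocessing, regressors, and weight estimation are not reproduced here): a linear discriminant is trained at every time point to separate two stimulus conditions from the multi-channel EEG, giving one decoding component per time point.

    # Toy sketch of time-resolved single-trial decoding: one linear discriminant
    # per time point, evaluated with cross-validation. All data are simulated.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    n_trials, n_channels, n_times = 400, 64, 150   # e.g. 64-channel EEG, 150 samples per epoch
    eeg = rng.normal(size=(n_trials, n_channels, n_times))
    labels = rng.integers(0, 2, n_trials)          # two conditions (e.g. high vs low stimulus rate)

    accuracy = np.zeros(n_times)
    for ti in range(n_times):
        clf = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
        # Decode the condition from the spatial pattern at this single time point.
        accuracy[ti] = cross_val_score(clf, eeg[:, :, ti], labels, cv=5).mean()

    # The per-time-point discriminant outputs (decoding components) can then be
    # related to condition, reliability, or behaviour, as described above.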
    Experiment 2 (Chapter 4) expanded on this work by investigating how oscillatory power is modulated by the reliability of sensory information. To this end, we used a time-frequency approach to analyse the data collected for the work in Chapter 3. Our results showed that significant effects in the theta and alpha bands over fronto-central regions occurred during the same early time windows as the shift in perceptual weighting (100 ms and 250 ms respectively). Specifically, we found that theta power (4–6 Hz) was lower and alpha power (10–12 Hz) was higher in audio-visual conditions where visual reliability was low, relative to conditions where visual reliability was high. These results suggest that changes in oscillatory power may underlie reliability-based cue weighting in the brain, and that these changes occur early during the sensory integration process.

    Finally, Experiment 3 (Chapter 5) moved away from reliability-based cue weighting and focused on investigating cases where spatially and temporally incongruent auditory and visual cues interact to affect behaviour. These interactions are known collectively as “cross-modal associations”, and past work has shown that observers have preferred and non-preferred stimulus pairings. For example, subjects will frequently pair high-pitched tones with small objects and low-pitched tones with large objects. However, it is still unclear when and where these associations are reflected in neural signals, and whether they emerge at an early perceptual level or a later decisional level. To investigate these questions we used a modified version of the implicit association test (IAT) to examine the modulation of behavioural and neural signals underlying an auditory pitch – visual size cross-modal association. Congruency was manipulated by assigning two stimuli (one auditory and one visual) to each of the left and right response keys and changing this assignment across blocks to create congruent (left key: high tone – small circle, right key: low tone – large circle) and incongruent (left key: low tone – small circle, right key: high tone – large circle) pairings of stimuli. On each trial, subjects were presented with only one of the four stimuli (auditory high tone, auditory low tone, visual small circle, visual large circle), and asked to report which was presented as quickly and accurately as possible. The key assumption of such a design is that subjects should respond faster when two associated (i.e. congruent) stimuli are assigned to the same response key than when two non-associated stimuli are. In line with this, our behavioural results demonstrated that subjects responded faster in blocks where congruent pairings of stimuli were assigned to the response keys (high pitch – small circle and low pitch – large circle) than in blocks where incongruent pairings were. The EEG results demonstrated that information about auditory pitch and visual size could be extracted from neural signals using two approaches to single-trial analysis (linear discriminant analysis and mutual information analysis) early during the trial (50 ms), with the strongest information contained over posterior and temporal electrodes for auditory trials, and posterior electrodes for visual trials. EEG components related to auditory pitch were significantly modulated by cross-modal congruency over temporal and frontal regions early in the trial (~100 ms), while EEG components related to visual size were modulated later (~220 ms) over frontal and temporal electrodes. For the auditory trials, these EEG components were significantly predictive of single-trial reaction times, yet for the visual trials they were not. As a result, the data support an early, short-latency origin of cross-modal associations, and suggest that these may originate in a bottom-up manner during early sensory processing rather than from high-level inference processes. Importantly, the findings were consistent across both analysis methods, suggesting these effects are robust.

    To summarise, the results across all three experiments showed that it is possible to extract meaningful single-trial information from the EEG signal and relate it to behaviour on a time-resolved basis. As a result, the work presented here steps beyond previous studies to provide new insights into the temporal dynamics of audio-visual perception in the brain. All experiments, although employing different paradigms and investigating different processes, showed early neural correlates of audio-visual perception emerging across early sensory, parietal, and frontal regions. Together, these results provide support for the prevailing modern view that the entire cortex is essentially multisensory and that multisensory effects can emerge at all stages of the perceptual process.
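
    The second single-trial method mentioned above, mutual information analysis, can be illustrated with a similarly simplified sketch: the continuous EEG component is binned and the mutual information with a binary stimulus attribute is estimated from the joint histogram. The data, binning, and absence of bias correction here are placeholders rather than the thesis' actual implementation.

    # Toy sketch of histogram-based mutual information between a single-trial
    # EEG component and a binary stimulus attribute (e.g. high vs low pitch).
    import numpy as np
    from sklearn.metrics import mutual_info_score

    rng = np.random.default_rng(2)
    n_trials = 500
    stimulus = rng.integers(0, 2, n_trials)                  # binary stimulus label
    component = stimulus * 0.8 + rng.normal(size=n_trials)   # simulated EEG component carrying some signal

    # Discretise the continuous component into equipopulated bins, then estimate
    # I(component; stimulus) from the joint histogram (result in nats).
    edges = np.quantile(component, np.linspace(0, 1, 5)[1:-1])
    binned = np.digitize(component, edges)
    mi = mutual_info_score(stimulus, binned)
    print(f"MI between component and stimulus: {mi:.3f} nats")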