Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals, methods for speech feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications able to operate in real-world environments, such as mobile communication services and smart homes.
Evaluation and analysis of hybrid intelligent pattern recognition techniques for speaker identification
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The rapid pace of technological progress in recent years has led to a tremendous rise in the use of biometric authentication systems. The objective of this research is to investigate the problem of identifying a speaker from their voice regardless of the content (i.e. text-independent), and to design efficient methods of combining face and voice to produce a robust authentication system.
A novel approach to speaker identification is developed using wavelet analysis and multiple neural networks, including the Probabilistic Neural Network (PNN), General Regression Neural Network (GRNN) and Radial Basis Function Neural Network (RBF NN), combined with an AND voting scheme. This approach is tested on the GRID and VidTIMIT corpora and the comprehensive test results have been validated against state-of-the-art approaches. The system was found to be competitive: it improved the recognition rate by 15% compared to classical Mel-frequency Cepstral Coefficients (MFCC), and reduced the recognition time by 40% compared to the Back Propagation Neural Network (BPNN), Gaussian Mixture Models (GMM) and Principal Component Analysis (PCA).
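A minimal sketch of how an AND voting rule over several classifiers can be applied is shown below; the three prediction vectors stand in for the PNN, GRNN and RBF NN outputs, and the rejection label is an illustrative assumption rather than the thesis's exact decision logic.

```python
import numpy as np

def and_vote(predictions, reject_label=-1):
    """Fuse per-classifier speaker labels with an AND rule.

    predictions: array (n_classifiers, n_utterances) of predicted speaker IDs.
    An utterance is accepted only if every classifier agrees on the same
    speaker; otherwise it is marked with reject_label.
    """
    predictions = np.asarray(predictions)
    agreed = np.all(predictions == predictions[0], axis=0)
    return np.where(agreed, predictions[0], reject_label)

# Hypothetical outputs of three classifiers (stand-ins for PNN, GRNN, RBF NN)
pnn  = [3, 7, 2, 5]
grnn = [3, 7, 4, 5]
rbf  = [3, 7, 2, 5]
print(and_vote([pnn, grnn, rbf]))   # -> [ 3  7 -1  5]
```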
Another novel approach based on vowel formant analysis is implemented using Linear Discriminant Analysis (LDA). Vowel-formant-based speaker identification is well suited to real-time implementation and requires only a few bytes of information to be stored for each speaker, making it both storage and time efficient. Tested on GRID and VidTIMIT, the proposed scheme was found to be 85.05% accurate when Linear Predictive Coding (LPC) is used to extract the vowel formants, which is much higher than the accuracy of BPNN and GMM. Since the proposed scheme requires no training other than creating a small database of vowel formants, it is faster as well. Furthermore, an increasing number of speakers makes it difficult for BPNN and GMM to sustain their accuracy, whereas the proposed score-based methodology remains almost linear.
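To illustrate the vowel-formant idea, the sketch below estimates formant frequencies by fitting LPC coefficients to a voiced frame and taking the angles of the resonant roots of the prediction polynomial; the frame length, LPC order, sampling rate and placeholder signal are assumptions for illustration and not the thesis settings.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_formants(frame, sr, order=12, n_formants=3):
    """Estimate the first few formant frequencies of a voiced frame via LPC."""
    frame = frame * np.hamming(len(frame))                  # taper the frame
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])  # normal equations
    roots = np.roots(np.concatenate(([1.0], -a)))           # poles of 1/A(z)
    roots = roots[np.imag(roots) > 0]                       # one per conjugate pair
    freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))     # pole angle -> Hz
    return freqs[freqs > 90][:n_formants]                   # drop near-DC poles

# Placeholder "vowel": two resonances plus noise, 30 ms at 16 kHz (not real speech)
sr = 16000
t = np.arange(int(0.03 * sr)) / sr
rng = np.random.default_rng(0)
frame = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 2300 * t) \
        + 0.01 * rng.normal(size=t.size)
print(lpc_formants(frame, sr))                              # roughly [300, 2300]
```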
Finally, a novel audio-visual fusion-based identification system is implemented using GMM and MFCC for speaker identification and PCA for face recognition. The results of speaker identification and face recognition are fused at different levels, namely the feature, score and decision levels. Both the score-level and decision-level (with OR voting) fusions were shown to outperform feature-level fusion in terms of accuracy and error resilience. This result is in line with the distinct nature of the two modalities, whose individual characteristics are diluted when combined at the feature level. The GRID and VidTIMIT test results confirm that the proposed scheme is one of the best candidates for the fusion of face and voice owing to its low computational time and high recognition accuracy.
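The fusion strategies compared above can be summarised in a small sketch. It assumes each modality yields a per-identity score vector (speaker scores from the GMM, face scores from PCA matching); the z-normalisation, equal weighting and acceptance threshold are illustrative choices, not the configuration evaluated in the thesis.

```python
import numpy as np

def zscore(s):
    return (s - s.mean()) / (s.std() + 1e-9)

def score_level_fusion(voice_scores, face_scores, w_voice=0.5):
    """Identify by a weighted sum of z-normalised per-identity scores."""
    fused = w_voice * zscore(voice_scores) + (1 - w_voice) * zscore(face_scores)
    return int(np.argmax(fused))

def decision_level_or(voice_scores, face_scores, claimed_id, threshold=0.0):
    """OR voting: accept the claimed identity if either modality accepts it."""
    return (zscore(voice_scores)[claimed_id] > threshold or
            zscore(face_scores)[claimed_id] > threshold)

# Hypothetical scores over four enrolled identities
voice = np.array([0.1, 2.3, 0.4, 0.2])   # e.g. rescaled GMM log-likelihoods
face  = np.array([0.3, 1.9, 0.5, 0.1])   # e.g. rescaled negative PCA distances
print(score_level_fusion(voice, face))                 # -> 1
print(decision_level_or(voice, face, claimed_id=1))    # -> True
```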
Acoustic Approaches to Gender and Accent Identification
There has been considerable research on the problems of speaker and language recognition
from samples of speech. A less researched problem is that of accent recognition. Although this
is a similar problem to language identification, different accents of a language exhibit more
fine-grained differences between classes than languages. This presents a tougher problem
for traditional classification techniques. In this thesis, we propose and evaluate a number of
techniques for gender and accent classification. These techniques are novel modifications and
extensions to state-of-the-art algorithms, and they result in enhanced performance on gender
and accent recognition.
The first part of the thesis focuses on the problem of gender identification, and presents a
technique that gives improved performance in situations where training and test conditions are
mismatched.
The bulk of this thesis is concerned with the application of the i-Vector technique to accent
identification, which is the most successful approach to acoustic classification to have emerged
in recent years. We show that it is possible to achieve high accuracy accent identification without
reliance on transcriptions and without utilising phoneme recognition algorithms. The thesis describes various stages in the development of i-Vector based accent classification that improve upon the standard approaches usually applied for speaker or language identification, which prove insufficient for this task. We demonstrate that very good accent identification performance is possible with acoustic methods by considering different i-Vector projections, frontend parameters, i-Vector configuration parameters, and an optimised fusion of the i-Vector classifiers that can be obtained from the same data.
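To make the i-Vector back-end concrete, the following sketch shows one common recipe for acoustic accent classification: project i-vectors with LDA, score test i-vectors against per-accent mean models with cosine similarity, and fuse the score matrices of several such classifiers with a weighted sum. The i-vector extractor itself is assumed to be given, and the dimensions, class counts and weights are illustrative rather than the thesis configuration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_cosine_backend(ivectors, labels, lda_dim=10):
    """LDA projection plus unit-length per-accent mean models for cosine scoring."""
    lda = LinearDiscriminantAnalysis(n_components=lda_dim).fit(ivectors, labels)
    proj = lda.transform(ivectors)
    classes = np.unique(labels)
    means = np.stack([proj[labels == c].mean(axis=0) for c in classes])
    means /= np.linalg.norm(means, axis=1, keepdims=True)
    return lda, classes, means

def cosine_scores(lda, means, ivectors):
    proj = lda.transform(ivectors)
    proj /= np.linalg.norm(proj, axis=1, keepdims=True)
    return proj @ means.T                     # (n_utterances, n_accents)

def fuse(score_matrices, weights):
    """Weighted-sum fusion of several i-vector classifiers' score matrices."""
    return sum(w * s for w, s in zip(weights, score_matrices))

# Hypothetical low-dimensional i-vectors for 14 accent classes
rng = np.random.default_rng(0)
X, y = rng.normal(size=(280, 100)), np.repeat(np.arange(14), 20)
lda, classes, means = train_cosine_backend(X, y)
scores = cosine_scores(lda, means, X[:5])
print(classes[np.argmax(scores, axis=1)])
```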
We claim to have achieved the best accent identification performance on the test corpus for acoustic methods, with up to a 90% identification rate. This performance is even better than that of previously reported acoustic-phonotactic systems on the same corpus, and is very close to the performance obtained via transcription-based accent identification. Finally, we demonstrate that the use of our techniques for speech recognition purposes leads to considerably lower word error rates.
Keywords: Accent Identification, Gender Identification, Speaker Identification, Gaussian
Mixture Model, Support Vector Machine, i-Vector, Factor Analysis, Feature Extraction, British
English, Prosody, Speech Recognition
Bio-motivated features and deep learning for robust speech recognition
International Doctorate Mention. In spite of the enormous leap forward that Automatic Speech Recognition (ASR) technologies have experienced over the last five years, their performance under harsh environmental conditions is still far from that of humans, preventing their adoption in several real-world applications.
In this thesis the challenge of robustness of modern automatic speech
recognition systems is addressed following two main research lines.
The first one focuses on modeling the human auditory system to improve the robustness of the feature extraction stage, yielding novel auditory-motivated features. Two main contributions are produced.
On the one hand, a model of the masking behaviour of the Human
Auditory System (HAS) is introduced, based on the non-linear filtering
of a speech spectro-temporal representation applied simultaneously
to both frequency and time domains. This filtering is accomplished
by using image processing techniques, in particular mathematical
morphology operations with a specifically designed Structuring Element
(SE) that closely resembles the masking phenomena that take
place in the cochlea. On the other hand, the temporal patterns of
auditory-nerve firings are modeled. Most conventional acoustic features are based on short-time energy per frequency band, discarding the information contained in the temporal patterns. Our contribution
is the design of several types of feature extraction schemes based on
the synchrony effect of auditory-nerve activity, showing that the modeling
of this effect can indeed improve speech recognition accuracy in
the presence of additive noise. Both models are further integrated into the well-known Power Normalized Cepstral Coefficients (PNCC).
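A rough sketch of the morphological-filtering idea follows: a log-Mel spectrogram is filtered with grey-scale morphology whose structuring element spans a few frequency channels and frames, loosely imitating simultaneous and temporal masking. The flat rectangular structuring element and the opening operation are generic illustrations; the thesis designs a specific, cochlea-motivated SE that is not reproduced here.

```python
import numpy as np
from scipy.ndimage import grey_erosion, grey_dilation

def morphological_masking(log_mel, se_freq=3, se_time=5):
    """Grey-scale opening of a (n_mels, n_frames) log-Mel spectrogram.

    The structuring element spans se_freq frequency channels and se_time
    frames, so each time-frequency bin is shaped by its spectro-temporal
    neighbourhood, loosely imitating masking in the cochlea.
    """
    footprint = np.ones((se_freq, se_time), dtype=bool)   # flat rectangular SE
    eroded = grey_erosion(log_mel, footprint=footprint)
    return grey_dilation(eroded, footprint=footprint)     # erosion + dilation = opening

# Placeholder "spectrogram" standing in for real log-Mel features
rng = np.random.default_rng(1)
log_mel = rng.normal(size=(40, 200))                      # 40 Mel bands, 200 frames
print(morphological_masking(log_mel).shape)               # (40, 200)
```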
The second research line addresses the problem of robustness in
noisy environments by means of the use of Deep Neural Networks
(DNNs)-based acoustic modeling and, in particular, of Convolutional
Neural Networks (CNNs) architectures. A deep residual network
scheme is proposed and adapted for our purposes, allowing Residual
Networks (ResNets), originally intended for image processing tasks,
to be used in speech recognition where the network input is small
in comparison with usual image dimensions. We have observed that
ResNets on their own already enhance the robustness of the whole system
against noisy conditions. Moreover, our experiments demonstrate that their combination with the auditory-motivated features devised in this thesis provides significant improvements in recognition accuracy in comparison to other state-of-the-art CNN-based ASR systems
under mismatched conditions, while maintaining the performance in
matched scenarios.
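As an illustration of the kind of adaptation described above, the sketch below defines a small residual network whose stem avoids the aggressive early downsampling of image ResNets, since spectro-temporal inputs (a few dozen Mel bands by a short context window) are much smaller than typical images. All layer sizes and the output-target count are assumptions, not the thesis architecture.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convolutions plus an identity skip."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)

class SmallInputResNet(nn.Module):
    """ResNet-style acoustic model for small spectro-temporal inputs.

    Unlike image ResNets, the stem uses a stride-1 3x3 convolution and no
    max-pooling, so the limited frequency/time resolution is preserved.
    """
    def __init__(self, n_classes, channels=32, n_blocks=4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(n_blocks)])
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(channels, n_classes))

    def forward(self, x):                     # x: (batch, 1, n_mels, n_frames)
        return self.head(self.blocks(self.stem(x)))

# Hypothetical usage: 40 Mel bands, 11-frame context window, 2000 tied-state targets
model = SmallInputResNet(n_classes=2000)
print(model(torch.randn(8, 1, 40, 11)).shape)   # torch.Size([8, 2000])
```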
The proposed methods have been thoroughly tested and compared
with other state-of-the-art proposals for a variety of datasets and
conditions. The obtained results prove that our methods outperform
other state-of-the-art approaches and reveal that they are suitable for
practical applications, especially where the operating conditions are unknown.

The objective of this thesis is to propose solutions to the problem of robust speech recognition; to this end, two lines of research have been pursued. In the first line, novel feature extraction schemes have been proposed, based on modeling the behaviour of the human auditory system, in particular the masking and synchrony phenomena. In the second, recognition rates are improved through the use of deep learning techniques in combination with the proposed features. The main objective of the proposed methods is to improve the accuracy of the recognition system when the operating conditions are unknown, although the opposite (matched) case has also been addressed. In particular, our main proposals are the following:

Simulating the human auditory system with the aim of improving the recognition rate in difficult conditions, mainly in highly noisy situations, by proposing novel feature extraction schemes. Following this direction, our main proposals are detailed below:
• Modeling the masking behaviour of the human auditory system, using image processing techniques on the spectrum, specifically by designing a morphological filter that captures this effect.
• Modeling the synchrony effect that takes place in the auditory nerve.
• The integration of both models into the well-known Power Normalized Cepstral Coefficients (PNCC).

Applying deep learning techniques with the aim of making the system more robust against noise, in particular through the use of deep convolutional neural networks such as residual networks.

Finally, applying the proposed features in combination with deep neural networks, with the main objective of obtaining significant improvements when the training and test conditions do not match.

Official Doctoral Programme in Multimedia and Communications. Chair: Javier Ferreiros López; Secretary: Fernando Díaz de María; Examiner: Rubén Solera Ureña.
Investigating the neural mechanisms underlying audio-visual perception using electroencephalography (EEG)
Traditionally, research into how we perceive our external world has focused on the unisensory approach, examining how information is processed by one sense at a time. This has produced a vast literature of results revealing how our brains process information from the different senses, drawing on fields such as psychophysics, animal electrophysiology, and neuroimaging. However, we know from our own experience that we use more than one sense at a time to understand our external world. Therefore, to fully understand perception, we must understand not only how the brain processes information from individual sensory modalities, but also how and when this information interacts and combines with information from other modalities. In short, we need to understand the phenomenon of multisensory perception.
The work in this thesis describes three experiments aimed to provide new insights into this topic. Specifically, the three experiments presented here focused on examining when and where effects related to multisensory perception emerged in neural signals, and whether or not these effects could be related to behaviour in a time-resolved way and on a trial-by-trial basis. These experiments were carried out using a novel combination of psychophysics, high density electroencephalography (EEG), and advanced computational methods (linear discriminant analysis and mutual information analysis).
Experiment 1 (Chapter 3) investigated how behavioural and neural signals are modulated by the reliability of sensory information. Previous work has shown that subjects will weight sensory cues in proportion to their relative reliabilities; high reliability cues are assigned a higher weight and have more influence on the final perceptual estimate, while low reliability cues are assigned a lower weight and have less influence. Despite this widespread finding, it remains unclear when neural correlates of sensory reliability emerge during a trial, and whether or not modulations in neural signals due to reliability relate to modulations in behavioural reweighting. To investigate these questions we used a combination of psychophysics, EEG-based neuroimaging, single-trial decoding, and regression modelling. Subjects performed an audio-visual rate discrimination task where the modality (auditory, visual, audio-visual), stimulus stream rate (8 to 14 Hz), visual reliability (high/low), and congruency in rate between audio-visual stimuli (± 2 Hz) were systematically manipulated. For the behavioural and EEG components (derived using linear discriminant analysis), a set of perceptual and neural weights were calculated for each time point. The behavioural results revealed that participants weighted sensory information based on reliability: as visual reliability decreased, auditory weighting increased. These modulations in perceptual weights emerged early after stimulus onset (48 ms). The EEG data revealed that neural correlates of sensory reliability and perceptual weighting were also evident in decoding signals, and that these occurred surprisingly early in the trial (84 ms). Finally, source localisation suggested that these correlates originated in early sensory (occipital/temporal) and parietal regions respectively. Overall, these results provide the first insights into the temporal dynamics underlying human cue weighting in the brain, and suggest that it is an early, dynamic, and distributed process in the brain.
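A stripped-down sketch of this kind of time-resolved single-trial decoding is given below: at each time point an LDA decoder is trained across trials on the channel data, and its cross-validated decision values serve as the single-trial EEG component that can subsequently be regressed against behaviour. The array shapes and cross-validation scheme are illustrative assumptions, not the analysis pipeline used in the thesis.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict

def time_resolved_decoding(eeg, labels, cv=5):
    """Decode a binary condition from multichannel EEG at every time point.

    eeg:    array (n_trials, n_channels, n_times)
    labels: array (n_trials,) with the condition of each trial
    Returns cross-validated decision values, one per trial and time point,
    which serve as the single-trial "EEG component".
    """
    n_trials, _, n_times = eeg.shape
    components = np.zeros((n_trials, n_times))
    for t in range(n_times):
        components[:, t] = cross_val_predict(
            LinearDiscriminantAnalysis(), eeg[:, :, t], labels,
            cv=cv, method="decision_function")
    return components

# Hypothetical data: 200 trials, 64 channels, 100 time points
rng = np.random.default_rng(2)
eeg = rng.normal(size=(200, 64, 100))
labels = rng.integers(0, 2, size=200)
comp = time_resolved_decoding(eeg, labels)
# comp[:, t] can now be regressed against behaviour (choices, perceptual
# weights) to ask when neural signals begin to predict it.
print(comp.shape)                             # (200, 100)
```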
Experiment 2 (Chapter 4) expanded on this work by investigating how oscillatory power was modulated by the reliability of sensory information. To this end, we used a time-frequency approach to analyse the data collected for the work in Chapter 3. Our results showed that significant effects in the theta and alpha bands over fronto-central regions occurred during the same early time windows as a shift in perceptual weighting (100 ms and 250 ms respectively). Specifically, we found that theta power (4-6 Hz) was lower and alpha power (10-12 Hz) was higher in audio-visual conditions where visual reliability was low, relative to conditions where visual reliability was high. These results suggest that changes in oscillatory power may underlie reliability-based cue weighting in the brain, and that these changes occur early during the sensory integration process.
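The band-power contrast can be illustrated with a short sketch: band-pass filter single trials in the theta or alpha range, take the squared Hilbert envelope, and average power over an early post-stimulus window before comparing the two visual-reliability conditions. The filter design, sampling rate and analysis windows are illustrative choices rather than those used in the thesis.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def band_power(trials, sfreq, band, window):
    """Mean single-trial band power in a post-stimulus window.

    trials: array (n_trials, n_times) for one channel or component
    band:   (low, high) band edges in Hz, e.g. (4, 6) for theta
    window: (start, stop) sample indices of the analysis window
    """
    sos = butter(4, band, btype="band", fs=sfreq, output="sos")
    filtered = sosfiltfilt(sos, trials, axis=-1)
    power = np.abs(hilbert(filtered, axis=-1)) ** 2        # squared envelope
    return power[:, window[0]:window[1]].mean(axis=-1)

# Hypothetical single-channel epochs: 100 trials, 1 s at 250 Hz
rng = np.random.default_rng(3)
sfreq, trials = 250, rng.normal(size=(100, 250))
theta = band_power(trials, sfreq, band=(4, 6), window=(25, 75))     # ~100-300 ms
alpha = band_power(trials, sfreq, band=(10, 12), window=(50, 100))  # ~200-400 ms
# Compare, e.g., theta[low_reliability].mean() against theta[high_reliability].mean()
print(theta.mean(), alpha.mean())
```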
Finally, Experiment 3 (Chapter 5) moved away from examining reliability-based cue weighting and focused on investigating cases where spatially and temporally incongruent auditory and visual cues interact to affect behaviour. Past work has shown that observers have preferred and non-preferred stimulus pairings, known collectively as "cross-modal associations". For example, subjects will frequently pair high-pitched tones with small objects and low-pitched tones with large objects. However, it is still unclear when and where these associations are reflected in neural signals, and whether they emerge at an early perceptual level or a later decisional level. To investigate these questions we used a modified version of the implicit association test (IAT) to examine the modulation of behavioural and neural signals underlying an auditory pitch – visual size cross-modal association. Congruency was manipulated by assigning two stimuli (one auditory and one visual) to each of the left or right response keys and changing this assignment across blocks to create congruent (left key: high tone – small circle, right key: low tone – large circle) and incongruent (left key: low tone – small circle, right key: high tone – large circle) pairings of stimuli. On each trial, subjects were presented with only one of the four stimuli (auditory high tone, auditory low tone, visual small circle, visual large circle), and asked to respond which was presented as quickly and accurately as possible. The key assumption with such a design is that subjects should respond faster when associated (i.e. congruent) stimuli are assigned to the same response key than when two non-associated stimuli are. In line with this, our behavioural results demonstrated that subjects responded faster on blocks where congruent pairings of stimuli were assigned to the response keys (high pitch-small circle and low pitch-large circle) than on blocks where incongruent pairings were. The EEG results demonstrated that information about auditory pitch and visual size could be extracted from neural signals using two approaches to single-trial analysis (linear discriminant analysis and mutual information analysis) early during the trial (50 ms), with the strongest information contained over posterior and temporal electrodes for auditory trials, and posterior electrodes for visual trials. EEG components related to auditory pitch were significantly modulated by cross-modal congruency over temporal and frontal regions early in the trial (~100 ms), while EEG components related to visual size were modulated later (~220 ms) over frontal and temporal electrodes. For the auditory trials, these EEG components were significantly predictive of single-trial reaction times, yet for the visual trials the components were not. As a result, the data support an early and short-latency origin of cross-modal associations, and suggest that these may originate in a bottom-up manner during early sensory processing rather than from high-level inference processes. Importantly, the findings were consistent across both analysis methods, suggesting these effects are robust.
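The mutual-information analysis can be approximated with a generic sketch that estimates, at each time point, the mutual information between a single-trial EEG feature (here the amplitude at one electrode) and the stimulus category. The off-the-shelf scikit-learn estimator and the data shapes are assumptions; the thesis's specific estimator is not reproduced here.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mi_timecourse(eeg, labels):
    """Mutual information (nats) between EEG amplitude and stimulus class.

    eeg:    array (n_trials, n_times) for one electrode
    labels: array (n_trials,) of stimulus categories (e.g. high vs low pitch)
    Returns one MI estimate per time point.
    """
    n_trials, n_times = eeg.shape
    mi = np.zeros(n_times)
    for t in range(n_times):
        mi[t] = mutual_info_classif(eeg[:, t:t + 1], labels,
                                    discrete_features=False, random_state=0)[0]
    return mi

# Hypothetical data: 300 trials, 150 time points, binary pitch label
rng = np.random.default_rng(4)
eeg = rng.normal(size=(300, 150))
labels = rng.integers(0, 2, size=300)
mi = mi_timecourse(eeg, labels)
print(mi.argmax(), mi.max())   # time point carrying most stimulus information
```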
To summarise, the results across all three experiments showed that it is possible to extract meaningful, single-trial information from the EEG signal and relate it to behaviour on a time-resolved basis. As a result, the work presented here steps beyond previous studies to provide new insights into the temporal dynamics of audio-visual perception in the brain. All experiments, although employing different paradigms and investigating different processes, showed early neural correlates of audio-visual perception emerging across early sensory, parietal, and frontal regions. Together, these results provide support for the prevailing modern view that the entire cortex is essentially multisensory and that multisensory effects can emerge at all stages during the perceptual process.