512 research outputs found
A detection-based pattern recognition framework and its applications
The objective of this dissertation is to present a detection-based pattern recognition framework and demonstrate its applications in automatic speech recognition and broadcast news video story segmentation.
Inspired by the studies of modern cognitive psychology and real-world pattern recognition systems, a detection-based pattern recognition framework is proposed to provide an alternative solution for some complicated pattern recognition problems. The primitive features are first detected and the task-specific knowledge hierarchy is constructed level by level; then a variety of heterogeneous information sources are combined together and the high-level context is incorporated as additional information at certain stages.
A detection-based framework is a â divide-and-conquerâ design paradigm for pattern recognition problems, which will decompose a conceptually difficult problem into many elementary sub-problems that can be handled directly and reliably. Some information fusion strategies will be employed to integrate the evidence from a lower level to form the evidence at a higher level. Such a fusion procedure continues until reaching the top level. Generally, a detection-based framework has many advantages: (1) more flexibility in both detector design and fusion strategies, as these two parts
can be optimized separately; (2) parallel and distributed computational components in primitive feature detection. In such a component-based framework, any primitive component can be replaced by a new one while other components remain unchanged; (3) incremental information integration; (4) high level context information as additional information sources, which can be combined with bottom-up processing at any stage.
This dissertation presents the basic principles, criteria, and techniques for detector design and hypothesis verification based on the statistical detection and decision theory. In addition, evidence fusion strategies were investigated in this dissertation. Several novel detection algorithms and evidence fusion methods were proposed and their effectiveness was justified in automatic speech recognition and broadcast news video segmentation system. We believe such a detection-based framework can be employed
in more applications in the future.Ph.D.Committee Chair: Lee, Chin-Hui; Committee Member: Clements, Mark; Committee Member: Ghovanloo, Maysam; Committee Member: Romberg, Justin; Committee Member: Yuan, Min
Interactive Pattern Recognition applied to Natural Language Processing
This thesis is about Pattern Recognition. In the last decades, huge efforts have been
made to develop automatic systems able to rival human capabilities in this field. Although
these systems achieve high productivity rates, they are not precise enough in
most situations. Humans, on the contrary, are very accurate but comparatively quite
slower. This poses an interesting question: the possibility of benefiting from both
worlds by constructing cooperative systems.
This thesis presents diverse contributions to this kind of collaborative approach.
The point is to improve the Pattern Recognition systems by properly introducing a
human operator into the system. We call this Interactive Pattern Recognition (IPR).
Firstly, a general proposal for IPR will be stated. The aim is to develop a framework
to easily derive new applications in this area. Some interesting IPR issues are
also introduced. Multi-modality or adaptive learning are examples of extensions that
can naturally fit into IPR.
In the second place, we will focus on a specific application. A novel method to
obtain high quality speech transcriptions (CAST, Computer Assisted Speech Transcription).
We will start by proposing a CAST formalization and, next, we will cope
with different implementation alternatives. Practical issues, as the system response
time, will be also taken into account, in order to allow for a practical implementation
of CAST. Word graphs and probabilistic error correcting parsing are tools that will
be used to reach an alternative formulation that allows for the use of CAST in a real
scenario.
Afterwards, a special application within the general IPR framework will be discussed.
This is intended to test the IPR capabilities in an extreme environment, where
no input pattern is available and the system only has access to the user actions to produce
a hypothesis. Specifically, we will focus here on providing assistance in the
problem of text generation.Rodríguez Ruiz, L. (2010). Interactive Pattern Recognition applied to Natural Language Processing [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8479Palanci
Recent Advances in Social Data and Artificial Intelligence 2019
The importance and usefulness of subjects and topics involving social data and artificial intelligence are becoming widely recognized. This book contains invited review, expository, and original research articles dealing with, and presenting state-of-the-art accounts pf, the recent advances in the subjects of social data and artificial intelligence, and potentially their links to Cyberspace
Representation Analysis Methods to Model Context for Speech Technology
Speech technology has developed to levels equivalent with human parity through the use of deep neural networks. However, it is unclear how the learned dependencies within these networks can be attributed to metrics such as recognition performance. This research focuses on strategies to interpret and exploit these learned context dependencies to improve speech recognition models. Context dependency analysis had not yet been explored for speech recognition networks.
In order to highlight and observe dependent representations within speech recognition models, a novel analysis framework is proposed. This analysis framework uses statistical correlation indexes to compute the coefficiency between neural representations. By comparing the coefficiency of neural representations between models using different approaches, it is possible to observe specific context dependencies within network layers. By providing insights on context dependencies it is then possible to adapt modelling approaches to become more computationally efficient and improve recognition performance. Here the performance of End-to-End speech recognition models are analysed, providing insights on the acoustic and language modelling context dependencies. The modelling approach for a speaker recognition task is adapted to exploit acoustic context dependencies and reach comparable performance with the state-of-the-art methods, reaching 2.89% equal error rate using the Voxceleb1 training and test sets with 50% of the parameters. Furthermore, empirical analysis of the
role of acoustic context for speech emotion recognition modelling revealed that emotion cues are presented as a distributed event. These analyses and results for speech recognition applications aim to provide objective direction for future development of automatic speech recognition systems
Proceedings of the 17th Annual Conference of the European Association for Machine Translation
Proceedings of the 17th Annual Conference of the European Association for Machine Translation (EAMT
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes
Robust learning of acoustic representations from diverse speech data
Automatic speech recognition is increasingly applied to new domains. A key challenge is
to robustly learn, update and maintain representations to cope with transient acoustic
conditions. A typical example is broadcast media, for which speakers and environments
may change rapidly, and available supervision may be poor. The concern of this
thesis is to build and investigate methods for acoustic modelling that are robust to the
characteristics and transient conditions as embodied by such media.
The first contribution of the thesis is a technique to make use of inaccurate transcriptions as supervision for acoustic model training. There is an abundance of audio
with approximate labels, but training methods can be sensitive to label errors, and their
use is therefore not trivial. State-of-the-art semi-supervised training makes effective
use of a lattice of supervision, inherently encoding uncertainty in the labels to avoid
overfitting to poor supervision, but does not make use of the transcriptions. Existing
approaches that do aim to make use of the transcriptions typically employ an algorithm
to filter or combine the transcriptions with the recognition output from a seed model,
but the final result does not encode uncertainty. We propose a method to combine the
lattice output from a biased recognition pass with the transcripts, crucially preserving
uncertainty in the lattice where appropriate. This substantially reduces the word error
rate on a broadcast task.
The second contribution is a method to factorise representations for speakers and
environments so that they may be combined in novel combinations. In realistic scenarios,
the speaker or environment transform at test time might be unknown, or there may be
insufficient data to learn a joint transform. We show that in such cases, factorised, or
independent, representations are required to avoid deteriorating performance. Using
i-vectors, we factorise speaker or environment information using multi-condition training
with neural networks. Specifically, we extract bottleneck features from networks trained
to classify either speakers or environments. The resulting factorised representations
prove beneficial when one factor is missing at test time, or when all factors are seen,
but not in the desired combination.
The third contribution is an investigation of model adaptation in a longitudinal
setting. In this scenario, we repeatedly adapt a model to new data, with the constraint
that previous data becomes unavailable. We first demonstrate the effect of such a
constraint, and show that using a cyclical learning rate may help. We then observe
that these successive models lend themselves well to ensembling. Finally, we show
that the impact of this constraint in an active learning setting may be detrimental to
performance, and suggest to combine active learning with semi-supervised training to
avoid biasing the model.
The fourth contribution is a method to adapt low-level features in a parameter-efficient and interpretable manner. We propose to adapt the filters in a neural feature
extractor, known as SincNet. In contrast to traditional techniques that warp the
filterbank frequencies in standard feature extraction, adapting SincNet parameters is
more flexible and more readily optimised, whilst maintaining interpretability. On a task
adapting from adult to child speech, we show that this layer is well suited for adaptation
and is very effective with respect to the small number of adapted parameters
- …