507 research outputs found
Hidden Markov models and neural networks for speech recognition
The Hidden Markov Model (HMM) is one of the most successful modeling approaches for acoustic events in speech recognition, and more recently it has proven useful for several problems in biological sequence analysis. Although the HMM is good at capturing the temporal nature of processes such as speech, it has a very limited capacity for recognizing complex patterns involving more than first-order dependencies in the observed data sequences. This is due to the first-order state process and the assumption of state-conditional independence between observations. Artificial Neural Networks (NNs) are almost the opposite: they cannot model dynamic, temporally extended phenomena very well, but are good at static classification and regression tasks. Combining the two frameworks in a sensible way can therefore lead to a more powerful model with better classification abilities. The overall aim of this work has been to develop a probabilistic hybrid of hidden Markov models and neural networks and ..
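The two assumptions named in this abstract (a first-order state process and observations that depend only on the current state) can be made concrete with a minimal forward-algorithm sketch for a discrete HMM. This is a generic textbook illustration, not the hybrid model developed in the work; the function name and the two-state toy parameters are hypothetical.

```python
def forward_likelihood(initial, transition, emission, observations):
    """Return P(observations) under a discrete first-order HMM.

    initial[j]       : P(state_1 = j)
    transition[i][j] : P(state_t = j | state_{t-1} = i)   (first-order state process)
    emission[j][o]   : P(obs_t = o | state_t = j)         (state-conditional independence)
    """
    n_states = len(initial)
    # alpha[j] = P(o_1 .. o_t, state_t = j)
    alpha = [initial[j] * emission[j][observations[0]] for j in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emission[j][obs]
            * sum(alpha[i] * transition[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state, two-symbol toy model (values chosen for illustration only).
initial = [0.6, 0.4]
transition = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.9, 0.1], [0.2, 0.8]]

likelihood = forward_likelihood(initial, transition, emission, [0, 1, 0])
```

Each recursion step looks only at the previous state distribution and the current observation, which is why dependencies spanning more than one step cannot be represented directly.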
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advancements in speech
recognition, text-to-speech synthesis, automatic speech recognition, and
emotion recognition, propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC and HMM, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field
Recognizing Speech in a Novel Accent: The Motor Theory of Speech Perception Reframed
The motor theory of speech perception holds that we perceive the speech of
another in terms of a motor representation of that speech. However, when we
have learned to recognize a foreign accent, it seems plausible that recognition
of a word rarely involves reconstruction of the speech gestures of the speaker
rather than the listener. To better assess the motor theory and this
observation, we proceed in three stages. Part 1 places the motor theory of
speech perception in a larger framework based on our earlier models of the
adaptive formation of mirror neurons for grasping, and for viewing extensions
of that mirror system as part of a larger system for neuro-linguistic
processing, augmented by the present consideration of recognizing speech in a
novel accent. Part 2 then offers a novel computational model of how a listener
comes to understand the speech of someone speaking the listener's native
language with a foreign accent. The core tenet of the model is that the
listener uses hypotheses about the word the speaker is currently uttering to
update probabilities linking the sound produced by the speaker to phonemes in
the native language repertoire of the listener. This, on average, improves the
recognition of later words. This model is neutral regarding the nature of the
representations it uses (motor vs. auditory). It serves as a reference point for
the discussion in Part 3, which proposes a dual-stream neuro-linguistic
architecture to revisit claims for and against the motor theory of speech
perception and the relevance of mirror neurons, and extracts some implications
for the reframing of the motor theory
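The core tenet described in Part 2 (using word hypotheses to update probabilities that link heard sounds to native phonemes) might be sketched as follows. The abstract does not give the update rule, so the smoothed count-based update, the one-to-one alignment of sounds to phonemes, and all names here (`AccentAdapter`, `update`, `probability`) are assumptions made for illustration.

```python
from collections import defaultdict


class AccentAdapter:
    """Toy listener model: tracks P(native phoneme | heard sound) with smoothed counts."""

    def __init__(self, phoneme_inventory, prior_count=1.0):
        self.phonemes = list(phoneme_inventory)
        self.prior_count = prior_count          # assumed uniform prior per phoneme
        self.counts = defaultdict(float)        # keyed by (sound, phoneme)

    def probability(self, sound, phoneme):
        """Current estimate of P(phoneme | sound)."""
        total = sum(
            self.counts[(sound, p)] + self.prior_count for p in self.phonemes
        )
        return (self.counts[(sound, phoneme)] + self.prior_count) / total

    def update(self, sounds, hypothesized_phonemes, confidence=1.0):
        """Strengthen sound-to-phoneme links implied by a word hypothesis.

        Assumes the heard sounds and the phonemes of the hypothesized word
        are already aligned one-to-one.
        """
        for sound, phoneme in zip(sounds, hypothesized_phonemes):
            self.counts[(sound, phoneme)] += confidence


# Hypothetical example: the speaker produces "z" where the listener expects "th".
adapter = AccentAdapter(["t", "d", "th"])
before = adapter.probability("z", "th")
adapter.update(["z"], ["th"])
adapter.update(["z"], ["th"])
after = adapter.probability("z", "th")
```

After two consistent hypotheses, the estimate for the "z" to "th" link rises, which is the sense in which earlier words improve the recognition of later ones.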
A motion-based approach for audio-visual automatic speech recognition
The research work presented in this thesis introduces novel approaches for both visual
region of interest extraction and visual feature extraction for use in audio-visual
automatic speech recognition. In particular, the speaker's movement that occurs
during speech is used to isolate the mouth region in video sequences and motion-based
features obtained from this region are used to provide new visual features for
audio-visual automatic speech recognition. The mouth region extraction approach
proposed in this work is shown to give superior performance compared with existing
colour-based lip segmentation methods. The new features are obtained from three
separate representations of motion in the region of interest, namely the difference in
luminance between successive images, block matching based motion vectors and
optical flow. The new visual features are found to improve visual-only and audio-visual
speech recognition performance when compared with the commonly-used
appearance feature-based methods.
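The first of the three motion representations named above, the difference in luminance between successive images, can be illustrated with a minimal sketch. The frames are small nested lists of luminance values, and the summed "motion energy" is a hypothetical summary statistic, not a feature defined in the thesis.

```python
def luminance_difference(previous_frame, current_frame):
    """Per-pixel absolute luminance change between two successive frames."""
    return [
        [abs(cur - prev) for prev, cur in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(previous_frame, current_frame)
    ]


def motion_energy(difference):
    """Total luminance change; large values suggest movement in the region."""
    return sum(sum(row) for row in difference)


# Hypothetical 2x3 luminance frames (values 0-255).
frame_a = [[10, 10, 10], [10, 10, 10]]
frame_b = [[10, 40, 10], [10, 10, 5]]

diff = luminance_difference(frame_a, frame_b)
energy = motion_energy(diff)
```

Pixels that change between frames stand out in `diff`, which is the basic idea behind using speaker movement to locate the mouth region.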
In addition, a novel approach is proposed for visual feature extraction from either the
discrete cosine transform or discrete wavelet transform representations of the mouth
region of the speaker. In this work, the image transform is explored from a new
viewpoint of data discrimination; in contrast to the more conventional data
preservation viewpoint. The main findings of this work are that audio-visual
automatic speech recognition systems using the new features extracted from the
frequency bands selected according to their discriminatory abilities generally
outperform those using features designed for data preservation.
To establish the noise robustness of the new features proposed in this work, their
performance has been studied in the presence of a range of different types of noise and at
various signal-to-noise ratios. In these experiments, the audio-visual automatic speech
recognition systems based on the new approaches were found to give superior
performance both to audio-visual systems using appearance-based features and to
audio-only speech recognition systems
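Testing at "various signal-to-noise ratios" usually means scaling a noise recording before adding it to clean speech. The sketch below shows one common way to do that; the thesis does not describe its mixing procedure, so the function names and sample values are assumptions.

```python
import math


def average_power(samples):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    target_noise_power = average_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / average_power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


# Hypothetical short signals, for illustration only.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy = mix_at_snr(speech, noise, snr_db=10)
```

Lower `snr_db` values give a larger noise gain, so a single noise recording can be reused across the whole range of test conditions.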
Relative-fuzzy: a novel approach for handling complex ambiguity for software engineering of data mining models
Two main classes of uncertainty are commonly defined: fuzziness and ambiguity, where ambiguity is a ‘one-to-many’ relationship between the syntax and semantics of a proposition. This definition appears to ignore the ‘many-to-many’ type of ambiguity. In this thesis, we use the term complex-uncertainty for this many-to-many type of ambiguity.
This research proposes a new approach for handling the complex ambiguity type of uncertainty that may exist in data, for software engineering of predictive Data Mining (DM) classification models. The proposed approach is based on Relative-Fuzzy Logic (RFL), a novel type of fuzzy logic. RFL defines a new formulation of the problem of ambiguity type of uncertainty in terms of States Of Proposition (SOP). RFL describes its membership (semantic) value by using the new definition of Domain of Proposition (DOP), which is based on the relativity principle as defined by possible-worlds logic.
To propose RFL, one question must be answered: how can these two approaches, i.e. fuzzy logic and possible-world logic, be combined to produce a new set of membership values (and later a logic) that is able to handle fuzziness and multiple viewpoints at the same time? This is achieved by giving possible-world logic the ability to quantify multiple viewpoints, to model fuzziness in each of these viewpoints, and to express the result in a new set of membership values.
Furthermore, a new architecture of Hierarchical Neural Network (HNN) called ML/RFL-Based Net has been developed in this research, along with a new learning algorithm and a new recalling algorithm. The architecture, learning algorithm and recalling algorithm of ML/RFL-Based Net follow the principles of RFL. This new type of HNN is considered to be an RFL computation machine.
The ability of the Relative Fuzzy-based DM prediction model to tackle the problem of complex ambiguity type of uncertainty has been tested. Special-purpose Integrated Development Environment (IDE) software, which generates a DM prediction model for speech recognition, has been developed in this research too, which is called RFL4ASR. This special purpose IDE is an extension of the definition of the traditional IDE.
Using multiple sets of TIMIT speech data, the ML/RFL-Based Net prediction model achieves a classification accuracy of 69.2308%. This accuracy is higher than the best results of the WEKA data mining machines given the same speech data
Deep learning in spoken language identification (Syväoppiminen puhutun kielen tunnistamisessa)
This thesis applies deep learning based classification techniques to identify natural languages from speech. The primary motivation behind this thesis is to implement accurate techniques for segmenting multimedia materials by the languages spoken in them.
Several existing state-of-the-art, deep learning based approaches are discussed and a subset of the discussed approaches is selected for quantitative experimentation. The selected model architectures are trained on several well-known spoken language identification datasets containing several different languages. Segmentation granularity varies between models, some supporting input audio lengths of 0.2 seconds, while others require 10-second-long input to make a language decision.
Results from the thesis experiments show that an unsupervised representation of acoustic units, produced by a deep sequence-to-sequence auto encoder, cannot reach the language identification performance of a supervised representation, produced by a multilingual phoneme recognizer. Contrary to most existing results, in this thesis, acoustic-phonetic language classifiers trained on labeled spectral representations outperform phonotactic classifiers trained on bottleneck features of a multilingual phoneme recognizer. More work is required, using transcribed datasets and automatic speech recognition techniques, to investigate why phoneme embeddings did not outperform simple, labeled spectral features.
While an accurate online language segmentation tool for multimedia materials could not be constructed, as spoken language identification proved considerably more challenging than estimated at the start of the work, the work completed in this thesis provides several insights for building feasible, modern spoken language identification systems. As a side-product of the experiments performed during this thesis, a free open source spoken language identification software library called "lidbox" was developed, allowing future experiments to begin where the experiments of this thesis end.
- …