An Efficient and Robust Multi-Stream Framework for End-to-End Speech Recognition
In voice-enabled domestic or meeting environments, distributed microphone arrays aim to transcribe distant-speech interaction into text with high accuracy.
However, under dynamic corruption from noise, reverberation, or human movement, there is no guarantee that any one microphone array (stream) remains constantly informative. In such cases, an appropriate strategy for dynamically fusing the streams is necessary.
The multi-stream paradigm in Automatic Speech Recognition (ASR) considers scenarios where parallel streams carry diverse or complementary task-related knowledge. Such streams may correspond to microphone arrays, frequency bands, different modalities, and so on. Robust stream fusion is therefore crucial to emphasize the more informative streams over corrupted ones, especially under unseen conditions. This thesis focuses on improving the performance and robustness of speech recognition in multi-stream scenarios.
With the increasing use of Deep Neural Networks (DNNs) in ASR, End-to-End (E2E) approaches, which directly transcribe human speech into text, have received growing attention. In this thesis, a multi-stream framework is presented based on the joint Connectionist Temporal Classification/Attention (CTC/Attention) E2E model, where parallel streams are represented by separate encoders. On top of the regular attention networks, a secondary stream-fusion network steers the decoder toward the most informative streams.
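The stream-fusion step can be pictured as a second attention layer operating over the per-stream context vectors. The sketch below is a minimal numpy illustration of one plausible (additive-attention) parameterisation, not the thesis's actual network; the matrices `W` and `V`, the vector `w`, and the dimensions are illustrative assumptions.

```python
import numpy as np

def stream_fusion(contexts, dec_state, W, V, w):
    """Hierarchical attention over per-stream context vectors.

    contexts : (n_streams, d) context vectors, one per encoder stream
    dec_state: (d,) current decoder state
    Returns the fused context vector and the stream-level weights.
    """
    # Additive (Bahdanau-style) energy for each stream
    energies = np.array([w @ np.tanh(W @ dec_state + V @ c) for c in contexts])
    # Softmax turns energies into stream-level attention weights
    weights = np.exp(energies - energies.max())
    weights /= weights.sum()
    # The fused context emphasises the more informative stream
    fused = weights @ contexts
    return fused, weights

# Toy example with two streams and 4-dimensional states
rng = np.random.default_rng(0)
d = 4
contexts = rng.normal(size=(2, d))
W, V, w = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
fused, weights = stream_fusion(contexts, rng.normal(size=d), W, V, w)
```

The stream weights sum to one, so a heavily corrupted stream can be down-weighted at every decoding step without retraining the frame-level attention of either encoder.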
The MEM-Array model aims at improving far-field ASR robustness using microphone arrays, each handled by a separate encoder. Because an increasing number of streams (encoders) requires substantial memory and massive amounts of parallel data, a practical two-stage training strategy is designed to address these issues. Furthermore, a two-stage augmentation scheme is presented to improve the robustness of the multi-stream model. In MEM-Res, two heterogeneous encoders with different architectures, temporal resolutions, and separate CTC networks work in parallel to extract complementary information from the same acoustics. Compared with the best single-stream performance, both models achieve substantial improvements, outperforming alternative fusion strategies.
While the proposed framework optimizes the use of information in multi-stream scenarios, this thesis also studies Performance Monitoring (PM) measures to predict whether the recognition results of an E2E model are reliable without ground-truth knowledge. Four PM techniques are investigated, suggesting that PM measures based on attention distributions and decoder posteriors are well correlated with true performance.
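One simple PM measure of the kind described, the average entropy of the decoder's per-step posteriors, can be sketched as follows. This is only an illustrative confidence proxy; the four measures actually studied in the thesis may be defined differently.

```python
import numpy as np

def mean_posterior_entropy(posteriors):
    """Average entropy (in nats) of per-step decoder posteriors.

    posteriors: (steps, vocab) array whose rows sum to 1.
    Low entropy means the decoder is confident at each step;
    high entropy flags potentially unreliable output.
    """
    p = np.clip(posteriors, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=1).mean())

# A confident decoder puts most mass on one token per step...
confident = np.array([[0.97, 0.01, 0.01, 0.01],
                      [0.01, 0.97, 0.01, 0.01]])
# ...while a maximally uncertain one is uniform over the vocabulary
uncertain = np.full((2, 4), 0.25)
```

Because no reference transcript is needed, such a score can be computed at inference time and thresholded to decide whether a hypothesis should be trusted.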
Characterization of speaker recognition in noisy channels
Speaker recognition is a frequently overlooked form of biometric security. Text-independent speaker identification is used by financial services, forensic experts, and human-computer interaction developers to extract information that is transmitted along with a spoken message, such as the identity, gender, age, and emotional state of a speaker. Speech features are classified as either low-level or high-level characteristics. High-level speech features are associated with syntax, dialect, and the overall meaning of a spoken message. In contrast, low-level features such as pitch and phonemic spectra are associated much more with the physiology of the human vocal tract. These low-level features are also the easiest and least computationally intensive characteristics of speech to extract. Once extracted, modern speaker recognition systems attempt to fit these features to statistical classification models. One such widely used model is the Gaussian Mixture Model (GMM). Testing of speaker recognition systems is standardized by NIST in the frequently updated NIST Speaker Recognition Evaluation (NIST-SRE). The results of the tests outlined in the standard are ultimately presented as Detection Error Tradeoff (DET) curves and detection cost function scores. This thesis presents a new method of measuring the effects of channel impediments on the quality of identifications made by GMM-based speaker recognition systems. With the exception of the NIST-SRE, no standardized or extensive testing of speaker recognition systems in noisy channels has been conducted. Thorough testing of speaker recognition systems is conducted in channel model simulators. Additionally, the NIST-SRE error metric is evaluated against a newly proposed metric for gauging the performance and improvements of speaker recognition systems.
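The maximum-likelihood classification step at the heart of GMM-based identification can be illustrated with a deliberately simplified stand-in: a single diagonal-covariance Gaussian per speaker instead of a full mixture (real systems fit multi-component GMMs, typically to MFCC features). All data below are synthetic toy features, and the speaker labels are hypothetical.

```python
import numpy as np

def diag_gauss_loglik(X, mean, var):
    """Per-frame log-likelihood under a diagonal-covariance Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var).sum(axis=1)

def identify(test_frames, models):
    """Pick the speaker whose model gives the highest total log-likelihood."""
    scores = {spk: diag_gauss_loglik(test_frames, m, v).sum()
              for spk, (m, v) in models.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(1)
# Toy 2-D "features" for two speakers with well-separated means
train_a = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
train_b = rng.normal(loc=3.0, scale=1.0, size=(200, 2))
models = {spk: (X.mean(axis=0), X.var(axis=0))
          for spk, X in {"A": train_a, "B": train_b}.items()}
test = rng.normal(loc=3.0, scale=1.0, size=(50, 2))  # drawn like speaker B
```

Channel impediments of the kind studied in the thesis would distort `test` relative to the training conditions, shrinking the log-likelihood gap between the competing models and degrading the identification.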
Integrating incremental learning and episodic memory models of the hippocampal region.
By integrating previous computational models of corticohippocampal function, the authors develop and test a unified theory of the neural substrates of familiarity, recollection, and classical conditioning. This approach integrates models from two traditions of hippocampal modeling, those of episodic memory and incremental learning, by drawing on an earlier mathematical model of conditioning, SOP (A. Wagner, 1981). The model describes how a familiarity signal may arise from parahippocampal cortices, giving a novel explanation for the finding that the neural response to a stimulus in these regions decreases with increasing stimulus familiarity. Recollection is ascribed to the hippocampus proper. It is shown how the properties of episodic representations in the neocortex, parahippocampal gyrus, and hippocampus proper may explain phenomena in classical conditioning. The model reproduces the effects of hippocampal, septal, and broad hippocampal region lesions on contextual modulation of classical conditioning, blocking, learned irrelevance, and latent inhibition.
Development of an acoustic communication link for micro underwater vehicles
PhD Thesis

In recent years there has been an increasing trend towards the use of
Micro Remotely Operated Vehicles (μROVs), such as the Videoray and
Seabotix LBV products, for a range of subsea applications, including
environmental monitoring, harbour security, military surveillance and
offshore inspection. A major operational limitation is the umbilical cable,
which is traditionally used to supply power and communications to the
vehicle. This tether has often been found to significantly restrict the
agility of the vehicle or in extreme cases, result in entanglement with
subsea structures.
This thesis addresses the challenges associated with developing a reliable
full-duplex wireless communications link aimed at tetherless operation
of a μROV. Previous research has demonstrated the ability to
support highly compressed video transmissions over several kilometres
through shallow water channels with large range-depth ratios. However,
the physical constraints of these platforms paired with the system cost
requirements pose significant additional challenges.
Firstly, the physical size and weight of transducers for the LF (8-16 kHz)
and MF (16-32 kHz) bands would significantly affect the dynamics of a
vehicle measuring less than 0.5 m in length. Therefore, this thesis explores
the challenges associated with moving the operating frequency up to a
centre frequency of around 50 kHz, along with the opportunities for
increased data rate and tracking accuracy afforded by the higher bandwidth.
The typical operating radius of μROVs is less than 200m, in water
< 100m deep, which gives rise to multipath channels characterised by
long timespread and relatively sparse arrivals. Hence, the system must
be optimised for performance in these conditions. The hardware costs of
large multi-element receiver arrays are prohibitive when compared to the
cost of the μROV platform. Additionally, the physical size of such arrays
complicates deployment from small surface vessels. Although some
recent developments in iterative equalisation and decoding structures
have enhanced the performance of single-element receivers, these are
not found to be adequate in such channels. This work explores the
optimum cost/performance trade-off by combining a micro beamforming
array with a Bit-Interleaved Coded Modulation with Iterative Decoding
(BICM-ID) receiver structure.
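The array half of that trade-off can be illustrated with classical delay-and-sum beamforming: each element's signal is aligned by its steering delay and the results are averaged, reinforcing the look direction while averaging down uncorrelated noise. The numpy sketch below assumes known integer-sample delays and is not the thesis's actual array processing.

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Delay-and-sum beamformer: align each element by its steering
    delay (in samples) and average the aligned signals."""
    out = np.zeros(signals.shape[1])
    for sig, d in zip(signals, delays):
        out += np.roll(sig, -d)   # integer-sample alignment for simplicity
    return out / len(signals)

fs = 48_000
t = np.arange(1024) / fs
tone = np.sin(2 * np.pi * 5_000 * t)      # wanted signal
rng = np.random.default_rng(2)
# 4-element array: same tone arriving with different delays, plus
# independent unit-variance noise at each element
delays = [0, 3, 6, 9]
signals = np.stack([np.roll(tone, d) + rng.normal(scale=1.0, size=t.size)
                    for d in delays])
beamformed = delay_and_sum(signals, delays)
```

With four elements, the coherent signal adds in phase while the noise power is reduced by roughly the number of elements, which is why even a small "micro" array can relax the burden placed on the iterative BICM-ID decoder.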
The highly dynamic nature of μROVs, with rapid acceleration/deceleration
and complex thruster/wake effects, is also a significant challenge to reliable
continuous communications. The thesis also explores how these effects
can best be mitigated via advanced Doppler correction techniques,
and via adaptive coding and modulation over a simultaneous frequency-multiplexed
downlink. In order to fully explore continuous adaptation of
the transmitted signals, a real-time full-duplex communication system
was constructed in hardware, utilising low-cost components and a highly
optimised PC-based receiver structure. Rigorous testing, both under
laboratory conditions and through extensive field trials, has enabled the
author to explore the performance of the communication link on a vehicle
carrying out typical operations while presenting a wide range of channel,
noise, Doppler and transmission latency conditions. This has led to a
comprehensive set of design recommendations for a reliable and cost-effective
link capable of continuous throughputs of >30 kbit/s.
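A basic form of the Doppler correction mentioned above is resampling the received waveform to undo the time compression or dilation caused by platform motion. The sketch below assumes the Doppler factor is already known and uses linear interpolation; an actual modem would estimate the factor (e.g. from a known preamble) and use a higher-quality polyphase resampler. The sample rate and tone frequency are illustrative.

```python
import numpy as np

def doppler_compensate(rx, a):
    """Invert a Doppler time-scaling rx[k] = tx(a*k) by resampling
    the received signal at the instants k/a (linear interpolation)."""
    n = rx.size
    idx = np.arange(n) / a
    idx = idx[idx <= n - 1]          # stay inside the received buffer
    return np.interp(idx, np.arange(n), rx)

fs = 48_000                          # assumed sample rate
f = 2_000                            # demo tone frequency
v, c = 2.0, 1500.0                   # 2 m/s closing speed; sound speed in seawater
a = 1 + v / c                        # Doppler compression factor
k = np.arange(4096)
tx = np.sin(2 * np.pi * f * k / fs)          # transmitted tone
rx = np.sin(2 * np.pi * f * a * k / fs)      # motion compresses the waveform
y = doppler_compensate(rx, a)
```

Even a 2 m/s closing speed shifts the waveform by a large fraction of a cycle over a few thousand samples, which is why continuous Doppler tracking is essential for the rapidly manoeuvring vehicles described here.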
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Speech enhancement and speech separation are two related tasks, whose purpose
is to extract one target speech signal or several target speech signals,
respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have
been tackled using signal processing and machine learning techniques applied to
the available acoustic signals. Since the visual aspect of speech is
essentially unaffected by the acoustic environment, visual information from the
target speakers, such as lip movements and facial expressions, has also been
used for speech enhancement and speech separation systems. In order to
efficiently fuse acoustic and visual information, researchers have exploited
the flexibility of data-driven approaches, specifically deep learning,
achieving strong performance. The ceaseless proposal of a large number of
techniques to extract features and fuse multimodal information has highlighted
the need for an overview that comprehensively describes and discusses
audio-visual speech enhancement and separation based on deep learning. In this
paper, we provide a systematic survey of this research topic, focusing on the
main elements that characterise the systems in the literature: acoustic
features; visual features; deep learning methods; fusion techniques; training
targets and objective functions. In addition, we review deep-learning-based
methods for speech reconstruction from silent videos and audio-visual sound
source separation for non-speech signals, since these methods can be more or
less directly applied to audio-visual speech enhancement and separation.
Finally, we survey commonly employed audio-visual speech datasets, given their
central role in the development of data-driven approaches, and evaluation
methods, because they are generally used to compare different systems and
determine their performance.
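As a concrete example of one commonly used training target in this literature, the Ideal Ratio Mask (IRM) assigns each time-frequency bin the fraction of energy belonging to speech. The spectrogram shapes and data below are synthetic stand-ins for STFT magnitudes; a real system would compute them from the clean and noise waveforms.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag):
    """Ideal Ratio Mask: per time-frequency bin, the ratio of speech
    energy to total energy, bounded in [0, 1]. Often used as the
    regression target for mask-based enhancement networks."""
    return clean_mag**2 / (clean_mag**2 + noise_mag**2 + 1e-12)

rng = np.random.default_rng(3)
# Stand-in magnitude spectrograms (freq_bins x frames)
clean = np.abs(rng.normal(size=(257, 100)))
noise = np.abs(rng.normal(size=(257, 100)))
mask = ideal_ratio_mask(clean, noise)
# Applying the mask to the mixture magnitude attenuates noise-dominated bins
mixture = np.sqrt(clean**2 + noise**2)
enhanced = mask * mixture
```

An audio-visual model would predict such a mask from both the mixture spectrogram and visual features (e.g. lip movements), trained with one of the objective functions surveyed in the paper.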