60 research outputs found

    AN EFFICIENT AND ROBUST MULTI-STREAM FRAMEWORK FOR END-TO-END SPEECH RECOGNITION

    In voice-enabled domestic or meeting environments, distributed microphone arrays aim to transcribe distant-speech interaction with high accuracy. However, in the presence of dynamic noise, reverberation, or human movement, there is no guarantee that any one microphone array (stream) remains constantly informative, so a strategy for dynamically fusing streams is necessary. The multi-stream paradigm in Automatic Speech Recognition (ASR) considers scenarios where parallel streams carry diverse or complementary task-related knowledge; such streams may be microphone arrays, frequency bands, different modalities, and so on. Robust stream fusion is therefore crucial to emphasize the more informative streams over corrupted ones, especially under unseen conditions. This thesis focuses on improving the performance and robustness of speech recognition in multi-stream scenarios. With the increasing use of Deep Neural Networks (DNNs) in ASR, End-to-End (E2E) approaches, which directly transcribe human speech into text, have received growing attention. This thesis presents a multi-stream framework based on the joint Connectionist Temporal Classification/Attention (CTC/Attention) E2E model, in which parallel streams are represented by separate encoders. On top of the regular attention networks, a secondary stream-fusion network steers the decoder toward the most informative streams. The MEM-Array model aims to improve far-field ASR robustness using microphone arrays, each handled by a separate encoder. Because an increasing number of streams (encoders) requires substantial memory and massive amounts of parallel data, a practical two-stage training strategy is designed to address these issues, and a two-stage augmentation scheme is presented to further improve the robustness of the multi-stream model. In MEM-Res, two heterogeneous encoders with different architectures, temporal resolutions, and separate CTC networks work in parallel to extract complementary information from the same acoustics. Compared with the best single-stream performance, both models achieve substantial improvements and outperform alternative fusion strategies. While the proposed framework optimizes the use of information in multi-stream scenarios, this thesis also studies Performance Monitoring (PM) measures that predict whether the recognition results of an E2E model are reliable without ground-truth knowledge. Four PM techniques are investigated, suggesting that PM measures based on attention distributions and decoder posteriors are well correlated with true performance.
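
    As a rough illustration of the hierarchical fusion idea described above, the sketch below combines per-stream attention contexts with a secondary stream-level attention that weights the streams at each decoding step. This is a minimal PyTorch sketch with illustrative module names and dimensions, not the thesis implementation.

```python
# Hedged sketch of hierarchical stream fusion for a multi-encoder E2E model:
# frame-level attention per stream, then stream-level attention over the
# resulting context vectors. All names and dimensions are illustrative.
import torch
import torch.nn as nn

class StreamFusionDecoderStep(nn.Module):
    def __init__(self, enc_dim=256, dec_dim=256, n_streams=2):
        super().__init__()
        # First level: one content-based frame-attention scorer per stream.
        self.frame_att = nn.ModuleList(
            [nn.Linear(enc_dim + dec_dim, 1) for _ in range(n_streams)]
        )
        # Second level: attention over the per-stream context vectors.
        self.stream_att = nn.Linear(enc_dim + dec_dim, 1)

    def forward(self, enc_outs, dec_state):
        # enc_outs: list of (batch, time_s, enc_dim) tensors, one per stream
        # dec_state: (batch, dec_dim) current decoder hidden state
        contexts = []
        for s, h in enumerate(enc_outs):
            q = dec_state.unsqueeze(1).expand(-1, h.size(1), -1)
            e = self.frame_att[s](torch.cat([h, q], dim=-1)).squeeze(-1)
            a = torch.softmax(e, dim=-1)                 # frame weights
            contexts.append(torch.bmm(a.unsqueeze(1), h).squeeze(1))
        c = torch.stack(contexts, dim=1)                 # (batch, n_streams, enc_dim)
        q = dec_state.unsqueeze(1).expand(-1, c.size(1), -1)
        b = torch.softmax(
            self.stream_att(torch.cat([c, q], dim=-1)).squeeze(-1), dim=-1
        )                                                # stream weights
        return torch.bmm(b.unsqueeze(1), c).squeeze(1)   # fused context
```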

    Characterization of speaker recognition in noisy channels

    Speaker recognition is a frequently overlooked form of biometric security. Text-independent speaker identification is used by financial services, forensic experts, and human-computer interaction developers to extract information that is transmitted along with a spoken message, such as the identity, gender, age, and emotional state of a speaker. Speech features are classified as either low-level or high-level characteristics. High-level speech features are associated with syntax, dialect, and the overall meaning of a spoken message. In contrast, low-level features such as pitch and phonemic spectra are associated much more closely with the physiology of the human vocal tract. These low-level features are also the easiest and least computationally intensive characteristics of speech to extract. Once features are extracted, modern speaker recognition systems fit them to statistical classification models, one of the most widely used being the Gaussian Mixture Model (GMM). Testing of speaker recognition systems is standardized by NIST in the frequently updated NIST Speaker Recognition Evaluation (NIST-SRE), whose results are ultimately presented as Detection Error Tradeoff (DET) curves and detection cost function scores. This thesis presents a new method for measuring the effects of channel impediments on the quality of identifications made by GMM-based speaker recognition systems. With the exception of the NIST-SRE, no standardized or extensive testing of speaker recognition systems in noisy channels has been conducted; here, thorough testing is conducted in channel model simulators. Additionally, the NIST-SRE error metric is evaluated against a newly proposed metric for gauging the performance and improvement of speaker recognition systems.
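
    To make the classification step concrete, the following is a minimal sketch of GMM-based text-independent speaker identification: one GMM per enrolled speaker, with identification by maximum average log-likelihood. Feature extraction (e.g. MFCCs) is assumed to happen elsewhere; component counts and shapes are illustrative, not the thesis configuration.

```python
# Minimal sketch of GMM-based speaker identification using scikit-learn.
# Low-level features (e.g. MFCC frames) are assumed precomputed.
import numpy as np
from sklearn.mixture import GaussianMixture

def enroll(features_by_speaker, n_components=16):
    """features_by_speaker: dict speaker_id -> (n_frames, n_dims) array."""
    models = {}
    for spk, feats in features_by_speaker.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        models[spk] = gmm.fit(feats)      # one GMM per enrolled speaker
    return models

def identify(models, test_features):
    """Return the speaker whose GMM gives the highest mean log-likelihood."""
    scores = {spk: gmm.score(test_features) for spk, gmm in models.items()}
    return max(scores, key=scores.get), scores
```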

    Audio-Visual Speech Enhancement Based on Deep Learning

    Integrating incremental learning and episodic memory models of the hippocampal region.

    By integrating previous computational models of corticohippocampal function, the authors develop and test a unified theory of the neural substrates of familiarity, recollection, and classical conditioning. This approach integrates models from two traditions of hippocampal modeling, those of episodic memory and incremental learning, by drawing on an earlier mathematical model of conditioning, SOP (A. Wagner, 1981). The model describes how a familiarity signal may arise from parahippocampal cortices, giving a novel explanation for the finding that the neural response to a stimulus in these regions decreases with increasing stimulus familiarity. Recollection is ascribed to the hippocampus proper. It is shown how the properties of episodic representations in the neocortex, parahippocampal gyrus, and hippocampus proper may explain phenomena in classical conditioning. The model reproduces the effects of hippocampal, septal, and broad hippocampal region lesions on contextual modulation of classical conditioning, blocking, learned irrelevance, and latent inhibition.
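
    The core familiarity claim, that the neural response to a stimulus decreases as it becomes familiar, can be illustrated with a toy Hebbian sketch. This is not the authors' model; it is a minimal illustration in which a weight vector is nudged toward each input, so the novelty response to a repeated stimulus shrinks across presentations.

```python
# Toy illustration (not the paper's model) of a familiarity signal that
# decreases with stimulus repetition.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(50)                       # synthetic "parahippocampal" weights
stim = rng.standard_normal(50)
stim /= np.linalg.norm(stim)           # unit-norm stimulus pattern

for trial in range(5):
    response = 1.0 - w @ stim          # high for novel, low for familiar input
    w += 0.5 * response * stim         # Hebbian-style update toward the input
    print(f"presentation {trial + 1}: response = {response:.3f}")
```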

    Development of an acoustic communication link for micro underwater vehicles

    In recent years there has been an increasing trend towards the use of Micro Remotely Operated Vehicles (μROVs), such as the Videoray and Seabotix LBV products, for a range of subsea applications, including environmental monitoring, harbour security, military surveillance, and offshore inspection. A major operational limitation is the umbilical cable, which is traditionally used to supply power and communications to the vehicle. This tether often significantly restricts the agility of the vehicle or, in extreme cases, results in entanglement with subsea structures. This thesis addresses the challenges of developing a reliable full-duplex wireless communications link for tetherless operation of a μROV. Previous research has demonstrated highly compressed video transmission over several kilometres through shallow-water channels with large range-depth ratios. However, the physical constraints of these platforms, paired with system cost requirements, pose significant additional challenges. Firstly, the physical size and weight of transducers for the LF (8-16 kHz) and MF (16-32 kHz) bands would significantly affect the dynamics of a vehicle measuring less than 0.5 m long. This thesis therefore explores the challenges of moving the operating frequency up to a centre of around 50 kHz, along with the opportunities the higher bandwidth offers for increased data rate and tracking. The typical operating radius of μROVs is less than 200 m, in water less than 100 m deep, which gives rise to multipath channels characterised by long time spread and relatively sparse arrivals; the system must therefore be optimised for these conditions. The hardware costs of large multi-element receiver arrays are prohibitive compared with the cost of the μROV platform, and the physical size of such arrays complicates deployment from small surface vessels. Although recent developments in iterative equalisation and decoding structures have enhanced the performance of single-element receivers, they are not adequate in such channels. This work explores the optimum cost/performance trade-off in combining a micro beamforming array with a Bit Interleaved Coded Modulation with Iterative Decoding (BICM-ID) receiver structure. The highly dynamic nature of μROVs, with rapid acceleration and deceleration and complex thruster/wake effects, is also a significant challenge to reliable continuous communications. The thesis explores how these effects can best be mitigated via advanced Doppler correction techniques, and via adaptive coding and modulation over a simultaneous frequency-multiplexed downlink. To fully explore continuous adaptation of the transmitted signals, a real-time full-duplex communication system was constructed in hardware, utilising low-cost components and a highly optimised PC-based receiver structure. Rigorous testing, both in laboratory conditions and through extensive field trials, enabled the author to explore the performance of the communication link on a vehicle carrying out typical operations under a wide range of channel, noise, Doppler, and transmission latency conditions. This has led to a comprehensive set of design recommendations for a reliable and cost-effective link capable of continuous throughputs of more than 30 kbit/s.
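
    One concrete piece of such a receiver chain is bulk Doppler correction by resampling the received waveform. The hedged sketch below assumes the relative-velocity estimate is already available (in a real system it would come from, e.g., preamble correlation); function names and parameters are illustrative, not the thesis implementation.

```python
# Sketch of bulk Doppler correction: a closing velocity compresses the
# received waveform by a factor 1/(1 + v/c), so we stretch it back by
# resampling with a rational approximation of (1 + v/c).
import numpy as np
from fractions import Fraction
from scipy.signal import resample_poly

def doppler_correct(rx, v_rel, c=1500.0, max_denominator=10000):
    """Resample rx to undo a bulk Doppler time-scaling.

    rx    : received samples (1-D array)
    v_rel : estimated relative velocity in m/s (positive = closing)
    c     : nominal sound speed in water, m/s
    """
    ratio = Fraction(1.0 + v_rel / c).limit_denominator(max_denominator)
    return resample_poly(rx, ratio.numerator, ratio.denominator)

# Example: a vehicle closing at 2 m/s compresses the waveform by ~0.13%.
rx = np.random.default_rng(1).standard_normal(48000)
rx_corrected = doppler_correct(rx, v_rel=2.0)
```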

    An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

    Speech enhancement and speech separation are two related tasks, whose purpose is to extract one (enhancement) or more (separation) target speech signals from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The rapid proliferation of techniques for extracting features and fusing multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance.
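
    As a minimal illustration of the fusion step common to many of the surveyed systems, the sketch below concatenates time-aligned audio and visual embeddings and predicts a time-frequency mask for the target speaker. The architecture and dimensions are illustrative and do not correspond to any specific published system.

```python
# Hedged sketch of early audio-visual fusion for mask-based enhancement.
import torch
import torch.nn as nn

class AVFusionMasker(nn.Module):
    def __init__(self, audio_dim=257, visual_dim=512, hidden=256):
        super().__init__()
        self.fuse = nn.LSTM(audio_dim + visual_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * hidden, audio_dim), nn.Sigmoid())

    def forward(self, audio_feats, visual_feats):
        # audio_feats : (batch, frames, audio_dim), e.g. magnitude spectra
        # visual_feats: (batch, frames, visual_dim), e.g. lip-region embeddings
        #               already upsampled to the audio frame rate
        x = torch.cat([audio_feats, visual_feats], dim=-1)  # early fusion
        h, _ = self.fuse(x)
        return self.mask(h) * audio_feats                   # masked spectrum
```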