20 research outputs found

    Low latency modeling of temporal contexts for speech recognition

    Get PDF
    This thesis focuses on the development of neural network acoustic models for large vocabulary continuous speech recognition (LVCSR) to satisfy the design goals of low latency and low computational complexity. Low latency enables online speech recognition; and low computational complexity helps reduce the computational cost both during training and inference. Long span sequential dependencies and sequential distortions in the input vector sequence are a major challenge in acoustic modeling. Recurrent neural networks have been shown to effectively model these dependencies. Specifically, bidirectional long short term memory (BLSTM) networks, provide state-of-the-art performance across several LVCSR tasks. However the deployment of bidirectional models for online LVCSR is non-trivial due to their large latency; and unidirectional LSTM models are typically preferred. In this thesis we explore the use of hierarchical temporal convolution to model long span temporal dependencies. We propose a sub-sampled variant of these temporal convolution neural networks, termed time-delay neural networks (TDNNs). These sub-sampled TDNNs reduce the computation complexity by ~5x, compared to TDNNs, during frame randomized pre-training. These models are shown to be effective in modeling long-span temporal contexts, however there is a performance gap compared to (B)LSTMs. As recent advancements in acoustic model training have eliminated the need for frame randomized pre-training we modify the TDNN architecture to use higher sampling rates, as the increased computation can be amortized over the sequence. These variants of sub- sampled TDNNs provide performance superior to unidirectional LSTM networks, while also affording a lower real time factor (RTF) during inference. However we show that the BLSTM models outperform both the TDNN and LSTM models. We propose a hybrid architecture interleaving temporal convolution and LSTM layers which is shown to outperform the BLSTM models. Further we improve these BLSTM models by using higher frame rates at lower layers and show that the proposed TDNN- LSTM model performs similar to these superior BLSTM models, while reducing the overall latency to 200 ms. Finally we describe an online system for reverberation robust ASR, using the above described models in conjunction with other data augmentation techniques like reverberation simulation, which simulates far-field environments, and volume perturbation, which helps tackle volume variation even without gain normalization

    Advanced Techniques for Ground Penetrating Radar Imaging

    Get PDF
    Ground penetrating radar (GPR) has become one of the key technologies in subsurface sensing and, in general, in non-destructive testing (NDT), since it is able to detect both metallic and nonmetallic targets. GPR for NDT has been successfully introduced in a wide range of sectors, such as mining and geology, glaciology, civil engineering and civil works, archaeology, and security and defense. In recent decades, improvements in georeferencing and positioning systems have enabled the introduction of synthetic aperture radar (SAR) techniques in GPR systems, yielding GPR–SAR systems capable of providing high-resolution microwave images. In parallel, the radiofrequency front-end of GPR systems has been optimized in terms of compactness (e.g., smaller Tx/Rx antennas) and cost. These advances, combined with improvements in autonomous platforms, such as unmanned terrestrial and aerial vehicles, have fostered new fields of application for GPR, where fast and reliable detection capabilities are demanded. In addition, processing techniques have been improved, taking advantage of the research conducted in related fields like inverse scattering and imaging. As a result, novel and robust algorithms have been developed for clutter reduction, automatic target recognition, and efficient processing of large sets of measurements to enable real-time imaging, among others. This Special Issue provides an overview of the state of the art in GPR imaging, focusing on the latest advances from both hardware and software perspectives

    Deep Learning for Distant Speech Recognition

    Full text link
    Deep learning is an emerging technology that is considered one of the most promising directions for reaching higher levels of artificial intelligence. Among the other achievements, building computers that understand speech represents a crucial leap towards intelligent machines. Despite the great efforts of the past decades, however, a natural and robust human-machine speech interaction still appears to be out of reach, especially when users interact with a distant microphone in noisy and reverberant environments. The latter disturbances severely hamper the intelligibility of a speech signal, making Distant Speech Recognition (DSR) one of the major open challenges in the field. This thesis addresses the latter scenario and proposes some novel techniques, architectures, and algorithms to improve the robustness of distant-talking acoustic models. We first elaborate on methodologies for realistic data contamination, with a particular emphasis on DNN training with simulated data. We then investigate on approaches for better exploiting speech contexts, proposing some original methodologies for both feed-forward and recurrent neural networks. Lastly, inspired by the idea that cooperation across different DNNs could be the key for counteracting the harmful effects of noise and reverberation, we propose a novel deep learning paradigm called network of deep neural networks. The analysis of the original concepts were based on extensive experimental validations conducted on both real and simulated data, considering different corpora, microphone configurations, environments, noisy conditions, and ASR tasks.Comment: PhD Thesis Unitn, 201
    corecore