30 research outputs found

    Low latency modeling of temporal contexts for speech recognition

    Get PDF
    This thesis focuses on the development of neural network acoustic models for large vocabulary continuous speech recognition (LVCSR) to satisfy the design goals of low latency and low computational complexity. Low latency enables online speech recognition, and low computational complexity helps reduce the computational cost during both training and inference. Long-span sequential dependencies and sequential distortions in the input vector sequence are a major challenge in acoustic modeling. Recurrent neural networks have been shown to model these dependencies effectively. Specifically, bidirectional long short-term memory (BLSTM) networks provide state-of-the-art performance across several LVCSR tasks. However, the deployment of bidirectional models for online LVCSR is non-trivial due to their large latency, and unidirectional LSTM models are typically preferred. In this thesis we explore the use of hierarchical temporal convolution to model long-span temporal dependencies. We propose a sub-sampled variant of these temporal convolution neural networks, termed time-delay neural networks (TDNNs). These sub-sampled TDNNs reduce the computational complexity by ~5x, compared to TDNNs, during frame-randomized pre-training. These models are shown to be effective in modeling long-span temporal contexts; however, there is a performance gap compared to (B)LSTMs. As recent advancements in acoustic model training have eliminated the need for frame-randomized pre-training, we modify the TDNN architecture to use higher sampling rates, as the increased computation can be amortized over the sequence. These variants of sub-sampled TDNNs provide performance superior to unidirectional LSTM networks, while also affording a lower real-time factor (RTF) during inference. However, we show that the BLSTM models outperform both the TDNN and LSTM models. We propose a hybrid architecture interleaving temporal convolution and LSTM layers, which is shown to outperform the BLSTM models. Further, we improve these BLSTM models by using higher frame rates at lower layers and show that the proposed TDNN-LSTM model performs similarly to these superior BLSTM models, while reducing the overall latency to 200 ms. Finally, we describe an online system for reverberation-robust ASR, using the above-described models in conjunction with other data augmentation techniques like reverberation simulation, which simulates far-field environments, and volume perturbation, which helps tackle volume variation even without gain normalization.
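
    The effect of sub-sampling in a TDNN can be illustrated with a minimal sketch. It counts, for a single output frame, which time offsets have to be evaluated at the input of each layer. The splice offsets below are hypothetical values chosen for illustration and are not taken from the thesis; the function name `frames_needed` is likewise an assumption. The count is per output frame only, so it does not reflect the sharing of computation across a whole sequence.

```python
def frames_needed(layers):
    """Offsets that must be evaluated at each layer's input to produce
    one output frame at t = 0.

    `layers` lists the splice offsets of each layer, bottom layer first.
    The result is ordered the same way: index 0 is the raw network input.
    """
    needed = {0}
    per_layer = []
    # Walk from the top layer down, expanding the set of required frames.
    for offsets in reversed(layers):
        needed = {t + o for t in needed for o in offsets}
        per_layer.append(sorted(needed))
    return per_layer[::-1]


# Hypothetical splice offsets (illustrative, not the thesis configuration).
sub_sampled = [[-2, -1, 0, 1, 2], [-1, 2], [-3, 3], [-7, 2]]
# Dense counterpart: every frame between the same end points is spliced.
dense = [list(range(min(o), max(o) + 1)) for o in sub_sampled]

for name, layers in [("sub-sampled", sub_sampled), ("dense", dense)]:
    needed = frames_needed(layers)
    context = (needed[0][0], needed[0][-1])
    hidden = sum(len(frames) for frames in needed[1:])
    print(f"{name}: input context {context}, "
          f"hidden activations per output frame: {hidden}")
```

    Under these assumed offsets both variants see the same input context, while the sub-sampled variant evaluates fewer hidden activations per output frame.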

    Total coloring of 1-toroidal graphs of maximum degree at least 11 and no adjacent triangles

    Full text link
    A total coloring of a graph $G$ is an assignment of colors to the vertices and the edges of $G$ such that every pair of adjacent or incident elements receives distinct colors. The total chromatic number of $G$, denoted by $\chi''(G)$, is the minimum number of colors in a total coloring of $G$. The well-known Total Coloring Conjecture (TCC) says that every graph with maximum degree $\Delta$ admits a total coloring with at most $\Delta + 2$ colors. A graph is 1-toroidal if it can be drawn on the torus such that every edge crosses at most one other edge. In this paper, we investigate the total coloring of 1-toroidal graphs, and prove that the TCC holds for 1-toroidal graphs with maximum degree at least 11 and some restrictions on the triangles. Consequently, if $G$ is a 1-toroidal graph with maximum degree $\Delta$ at least 11 and without adjacent triangles, then $G$ admits a total coloring with at most $\Delta + 2$ colors. Comment: 10 pages
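
    The definition of a total coloring can be checked mechanically. The following is a minimal sketch, assuming a simple graph given as a list of vertex pairs; the function names and the triangle example are illustrative and do not come from the paper.

```python
from collections import Counter
from itertools import combinations


def is_total_coloring(edges, vertex_color, edge_color):
    """Return True if no two adjacent or incident elements share a color."""
    for u, v in edges:
        c = edge_color[frozenset((u, v))]
        # Adjacent vertices must differ.
        if vertex_color[u] == vertex_color[v]:
            return False
        # An edge must differ from both of its end points.
        if c in (vertex_color[u], vertex_color[v]):
            return False
    # Edges sharing an end point must differ.
    for e, f in combinations(edges, 2):
        if set(e) & set(f) and edge_color[frozenset(e)] == edge_color[frozenset(f)]:
            return False
    return True


def max_degree(edges):
    """Maximum degree of a simple graph given as a list of vertex pairs."""
    degree = Counter(v for e in edges for v in e)
    return max(degree.values())


# Illustrative example: a triangle, which has maximum degree 2.
edges = [("a", "b"), ("b", "c"), ("a", "c")]
vertex_color = {"a": 0, "b": 1, "c": 2}
edge_color = {
    frozenset(("a", "b")): 2,
    frozenset(("b", "c")): 0,
    frozenset(("a", "c")): 1,
}

colors_used = set(vertex_color.values()) | set(edge_color.values())
print(is_total_coloring(edges, vertex_color, edge_color))
print(len(colors_used) <= max_degree(edges) + 2)
```

    The pairwise edge comparison is quadratic in the number of edges, which is fine for small examples but would need an adjacency-based check for large graphs.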

    NICNET - a Hierarchic distributed computer-communication network for decision support in the Indian Government

    Get PDF
    A decision support information system for the Indian Government is being evolved, based on the design of a predominantly query-based computer network with hierarchical distributed databases and random access communication. The four-level hierarchy spans 439 districts at the lowest level, the Central Government headquarters in New Delhi, the set of 32 State Capitals and Union Territories, and the set of four Regional Centres. With interference tolerance and random access as the two guiding principles behind the choice, Spread Spectrum transmission with a Code Division Multiple Access system of satellite communication was adopted. Each node of the network is a 32-bit computer capable of local bulk storage of up to three units of 300 megabytes each for query-accessible distributed databases. The design and implementation of such a distributed database has endowed the network with the capability to distribute the data related to such databases over various nodes in the network, so that a query can be accepted from any of the nodes.

    On topological relaxations of chromatic conjectures

    Get PDF
    There are several famous unsolved conjectures about the chromatic number that were relaxed and already proven to hold for the fractional chromatic number. We discuss similar relaxations for the topological lower bound(s) of the chromatic number. In particular, we prove that such a relaxed version is true for the Behzad-Vizing conjecture and also discuss the conjectures of Hedetniemi and of Hadwiger from this point of view. For the latter, a similar statement was already proven in an earlier paper of the first author with G. Tardos; our main concern here is that the so-called odd Hadwiger conjecture looks much more difficult in this respect. We prove that the statement of the odd Hadwiger conjecture holds for large enough Kneser graphs and Schrijver graphs of any fixed chromatic number.

    National Informatics Centre

    No full text
    175-17

    On Total Chromatic Number of a Graph

    No full text

    3-D CNN MODELS FOR FAR-FIELD MULTI-CHANNEL SPEECH RECOGNITION

    No full text
    Automatic speech recognition (ASR) in far-field reverberant environments, especially when involving natural conversational multiparty speech conditions, is challenging even with state-of-the-art recognition methodologies. The two main issues are artifacts in the signal due to reverberation and the presence of multiple speakers. In this paper, we propose a three-dimensional (3-D) convolutional neural network (CNN) architecture for multi-channel far-field ASR. This architecture processes the time, frequency, and channel dimensions of the input spectrogram to learn representations using convolutional layers. Experiments are performed on the REVERB challenge LVCSR task and the augmented multi-party (AMI) LVCSR task using the array microphone recordings. The proposed method shows improvements over the baseline system that uses beamforming of the multi-channel audio along with a conventional 2-D CNN framework (absolute improvements of 1.1% over the beamformed baseline system on the AMI dataset).

    Low Latency Acoustic Modeling Using Temporal Convolution and LSTMs

    No full text