14 research outputs found

    Improving state-of-the-art continuous speech recognition systems using the N-best paradigm with neural networks

    In an effort to advance the state of the art in continuous speech recognition employing hidden Markov models (HMM), Segmental Neural Nets (SNN) were introduced recently to ameliorate the well-known limitations of HMMs, namely, the conditional-independence limitation and the relative difficulty with which HMMs can handle segmental features. We describe a hybrid SNN/HMM system that combines the speed and performance of our HMM system with the segmental modeling capabilities of SNNs. The integration of the two acoustic modeling techniques is achieved successfully via the N-best rescoring paradigm. The N-best lists are used not only for recognition, but also during training. This discriminative training using N-best lists is demonstrated to improve performance. When tested on the DARPA Resource Management speaker-independent corpus, the hybrid SNN/HMM system decreases the error by about 20% compared to the state-of-the-art HMM system.

    An approach to segmenting continuous speech into speech units

    The article describes an algorithm for segmenting a speech signal into speech units: phonemes, their combinations, and pauses. The algorithm is based on transforming the speech signal into a special two-dimensional image, an autocorrelation portrait. To locate the boundaries of speech units, the portrait of the analysed signal is aligned with reference portraits of each speech unit. The alignment uses dynamic programming, which yields an optimal distance between portraits. This work was supported by the Russian Foundation for Basic Research (RFBR), projects No. 16-48-732046 and No. 16-48-730305.
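The dynamic-programming alignment mentioned above is essentially dynamic time warping (DTW) over the columns of two portraits. The sketch below shows only that alignment step for two 2-D arrays represented as lists of feature columns; the construction of the autocorrelation portraits themselves is not shown, and the Euclidean local distance is an assumption.

```python
# Classic DTW: cumulative-cost table over two sequences of feature columns,
# returning the optimal (minimum) cumulative distance between them.

def dtw_distance(a, b):
    """a, b: lists of equal-length feature columns (lists of floats)."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two columns (local cost).
            d = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            # Extend the cheapest of the three admissible predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because DTW tolerates local stretching and compression along the time axis, a reference portrait can match an analysed segment even when the speech unit is produced faster or slower than the reference.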

    Learning Word Embeddings: Unsupervised Methods for Fixed-size Representations of Variable-length Speech Segments

    Fixed-length embeddings of words are very useful for a variety of tasks in speech and language processing. Here we systematically explore two methods of computing fixed-length embeddings for variable-length sequences. We evaluate their susceptibility to phonetic and speaker-specific variability on English, a high-resource language, and Xitsonga, a low-resource language, using two evaluation metrics: ABX word discrimination and ROC-AUC on same-different phoneme n-grams. We show that a simple downsampling method supplemented with length information can outperform the variable-length input feature representation on both evaluations. Recurrent autoencoders, trained without supervision, can yield even better results at the expense of increased computational complexity.
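The downsampling baseline described above can be sketched in a few lines: pick a fixed number of frames at uniformly spaced positions in the variable-length sequence, concatenate them, and append the original length as an extra feature. The choice of k and the unscaled length feature are assumptions for illustration, not the paper's exact configuration.

```python
# Fixed-size embedding of a variable-length frame sequence by uniform
# downsampling, supplemented with length information.

def downsample_embedding(frames, k=10):
    """frames: list of feature vectors (lists of floats).
    Returns a vector of size k * dim + 1, regardless of len(frames)."""
    n = len(frames)
    # k uniformly spaced frame indices covering the whole segment.
    idx = [round(i * (n - 1) / (k - 1)) for i in range(k)]
    emb = [x for i in idx for x in frames[i]]
    emb.append(float(n))  # append the original length as a feature
    return emb
```

Two segments of different durations thus map to vectors of identical size, which is what makes downstream comparison (e.g. the ABX discrimination test mentioned above) straightforward.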

    Multi-Stream Speech Recognition

    In this paper, we discuss a new automatic speech recognition (ASR) approach based on independent processing and recombination of several feature streams. In this framework, it is assumed that the speech signal is represented in terms of multiple input streams, each input stream representing a different characteristic of the signal. If the streams are entirely synchronous, they may be accommodated simply (as they usually are in state-of-the-art systems). However, as discussed in the paper, it may be necessary to permit some degree of asynchrony between streams. This paper introduces the basic framework of a statistical structure that can accommodate multiple (asynchronous) observation streams, possibly exhibiting different frame rates. This approach is then applied to the particular case of multi-band speech recognition and is shown to yield significantly better noise robustness.
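For the simple fully synchronous case mentioned above, stream recombination can be sketched as a weighted sum of per-stream log-likelihoods. The equal default weights are an assumption, and this sketch deliberately does not model the asynchrony or differing frame rates that the paper's full framework accommodates.

```python
# Synchronous stream recombination: each stream (e.g. a frequency band in
# multi-band recognition) supplies one log-likelihood per state; the combined
# score per state is a weighted sum across streams.

def combine_streams(stream_loglikes, weights=None):
    """stream_loglikes: list of per-stream log-likelihood lists,
    one value per state; all streams score the same set of states."""
    k = len(stream_loglikes)
    if weights is None:
        weights = [1.0 / k] * k  # equal weighting by default
    n_states = len(stream_loglikes[0])
    return [
        sum(w * ll[s] for w, ll in zip(weights, stream_loglikes))
        for s in range(n_states)
    ]

# Two streams scoring two states; state 0 wins after recombination.
combined = combine_streams([[-1.0, -4.0], [-3.0, -2.0]])
# combined == [-2.0, -3.0]
```

One appeal of this design for noise robustness is that a corrupted band can be down-weighted without retraining the models for the clean bands.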

    End-to-end Lip-reading: A Preliminary Study

    Deep lip-reading is the combination of the domains of computer vision and natural language processing. It uses deep neural networks to extract speech from silent videos. Most works in lip-reading use a multi-staged training approach due to the complex nature of the task. A single-stage, end-to-end, unified training approach, which is an ideal of machine learning, is also the goal in lip-reading. However, pure end-to-end systems have not yet been able to perform as well as non-end-to-end systems. Some exceptions to this are the very recent Temporal Convolutional Network (TCN) based architectures. This work lays out a preliminary study of deep lip-reading, with a special focus on various end-to-end approaches. The research aims to test whether a purely end-to-end approach is justifiable for a task as complex as deep lip-reading. To achieve this, the meaning of pure end-to-end is first defined and several lip-reading systems that follow the definition are analysed. The system that most closely matches the definition is then adapted for pure end-to-end experiments. Four main contributions have been made: i) an analysis of 9 different end-to-end deep lip-reading systems; ii) creation and public release of a pipeline to adapt the sentence-level Lipreading Sentences in the Wild 3 (LRS3) dataset into word level; iii) pure end-to-end training of a TCN-based network and evaluation on the LRS3 word-level dataset as a proof of concept; iv) a public online portal to analyse visemes and experiment with live end-to-end lip-reading inference. The study is able to verify that pure end-to-end is a sensible approach and an achievable goal for deep machine lip-reading.

    End-to-End Deep Lip-reading: A Preliminary Study

    Deep lip-reading is the use of deep neural networks to extract speech from silent videos. Most works in lip-reading use a multi-staged training approach due to the complex nature of the task. A single-stage, end-to-end, unified training approach, which is an ideal of machine learning, is also the goal in lip-reading. However, pure end-to-end systems have so far failed to perform as well as non-end-to-end systems. Some exceptions to this are the very recent Temporal Convolutional Network (TCN) based architectures (Martinez et al., 2020; Martinez et al., 2021). This work lays out a preliminary study of deep lip-reading, with a special focus on various end-to-end approaches. The research aims to test whether a purely end-to-end approach is justifiable for a task as complex as deep lip-reading. To achieve this, the meaning of pure end-to-end is first defined and several lip-reading systems that follow the definition are analysed. The system that most closely matches the definition is then adapted for pure end-to-end experiments. We make four main contributions: i) an analysis of 9 different end-to-end deep lip-reading systems; ii) creation and public release of a pipeline to adapt the sentence-level Lipreading Sentences in the Wild 3 (LRS3) dataset into word level; iii) pure end-to-end training of a TCN-based network and evaluation on the LRS3 word-level dataset as a proof of concept; iv) a public online portal to analyse visemes and experiment with live end-to-end lip-reading inference. The study is able to verify that pure end-to-end is a sensible approach and an achievable goal for deep machine lip-reading.

    Context-dependent modeling in a segment-based speech recognition system

    Thesis (M. Eng.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1997. Includes bibliographical references (leaves 78-80). By Benjamin M. Serridge, M.Eng.