83 research outputs found
Combining Residual Networks with LSTMs for Lipreading
We propose an end-to-end deep learning architecture for word-level visual
speech recognition. The system is a combination of spatiotemporal
convolutional, residual and bidirectional Long Short-Term Memory networks. We
train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging
database of 500-size target-words consisting of 1.28sec video excerpts from BBC
TV broadcasts. The proposed network attains word accuracy equal to 83.0,
yielding 6.8 absolute improvement over the current state-of-the-art, without
using information about word boundaries during training or testing.Comment: Submitted to Interspeech 201
Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information
This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages, which lack resources for speech and language processing. We focus on finding approaches which allow using data from multiple languages to improve the performance for those languages on different levels, such as feature extraction, acoustic modeling and language modeling. Under application aspects, this thesis also includes research work on non-native and Code-Switching speech
Разновидности глубоких искусственных нейронных сетей для систем распознавания речи
This paper presents a survey of basic methods for acoustic and language model development based on artificial neural networks for automatic speech recognition systems. The hybrid and tandem approaches for combination of Hidden Markov Models and artificial neural networks for acoustic modelling are given. The creation of language models using feedforward and recurrent neural networks is described. The survey of researches, conducted in this field, shows that application of artificial neural networks at the stages of both acoustic and language modeling allows decreasing word error rate.В статье представлен аналитический обзор основных разновидностей акустических и языковых моделей на основе искусственных нейронных сетей для систем автоматического распознавания речи. Рассмотрены гибридный и тандемный под-ходы объединения скрытых марковских моделей и искусственных нейронных сетей для акустического моделирования, описано построение языковых моделей с применением сетей прямого распространения и рекуррентных нейросетей. Обзор исследований в данной области показывает, что применение искусственных нейронных сетей как на этапе акустического, так и на этапе языкового моделирования позволяет снизить ошибку распознавания слов
Low Resource Efficient Speech Retrieval
Speech retrieval refers to the task of retrieving the information, which is useful or relevant to a user query, from speech collection. This thesis aims to examine ways in which speech retrieval can be improved in terms of requiring low resources - without extensively annotated corpora on which automated processing systems are typically built - and achieving high computational efficiency.
This work is focused on two speech retrieval technologies, spoken keyword retrieval and spoken document classification. Firstly, keyword retrieval - also referred to as keyword search (KWS) or spoken term detection - is defined as the task of retrieving the occurrences of a keyword specified by the user in text form, from speech collections. We make advances in an open vocabulary KWS platform using context-dependent Point Process Model (PPM). We further accomplish a PPM-based lattice generation framework, which improves KWS performance and enables automatic speech recognition (ASR) decoding.
Secondly, the massive volumes of speech data motivate the effort to organize and search speech collections through spoken document classification. In classifying real-world unstructured speech into predefined classes, the wildly collected speech recordings can be extremely long, of varying length, and contain multiple class label shifts at variable locations in the audio. For this reason each spoken document is often first split into sequential segments, and then each segment is independently classified. We present a general purpose method for classifying spoken segments, using a cascade of language independent acoustic modeling, foreign-language to English translation lexicons, and English-language classification. Next, instead of classifying each segment independently, we demonstrate that exploring the contextual dependencies across sequential segments can provide large classification performance improvements. Lastly, we remove the need of any orthographic lexicon and instead exploit alternative unsupervised approaches to decoding speech in terms of automatically discovered word-like or phoneme-like units. We show that the spoken segment representations based on such lexical or phonetic discovery can achieve competitive classification performance as compared to those based on a domain-mismatched ASR or a universal phone set ASR
Adaptation of speech recognition systems to selected real-world deployment conditions
Tato habilitační práce se zabývá problematikou adaptace systémů
rozpoznávání řeči na vybrané reálné podmínky nasazení. Je koncipována
jako sborník celkem dvanácti článků, které se touto problematikou
zabývají. Jde o publikace, jejichž jsem hlavním autorem
nebo spoluatorem, a které vznikly v rámci několika navazujících
výzkumných projektů. Na řešení těchto projektů jsem se
podílel jak v roli člena výzkumného týmu, tak i v roli řešitele nebo
spoluřešitele.
Publikace zařazené do tohoto sborníku lze rozdělit podle tématu
do tří hlavních skupin. Jejich společným jmenovatelem je
snaha přizpůsobit daný rozpoznávací systém novým podmínkám či
konkrétnímu faktoru, který významným způsobem ovlivňuje jeho
funkci či přesnost.
První skupina článků se zabývá úlohou neřízené adaptace na
mluvčího, kdy systém přizpůsobuje svoje parametry specifickým
hlasovým charakteristikám dané mluvící osoby. Druhá část práce
se pak věnuje problematice identifikace neřečových událostí na vstupu
do systému a související úloze rozpoznávání řeči s hlukem
(a zejména hudbou) na pozadí. Konečně třetí část práce se zabývá
přístupy, které umožňují přepis audio signálu obsahujícího promluvy
ve více než v jednom jazyce. Jde o metody adaptace existujícího
rozpoznávacího systému na nový jazyk a metody identifikace
jazyka z audio signálu.
Obě zmíněné identifikační úlohy jsou přitom vyšetřovány zejména
v náročném a méně probádaném režimu zpracování po jednotlivých
rámcích vstupního signálu, který je jako jediný vhodný pro on-line
nasazení, např. pro streamovaná data.This habilitation thesis deals with adaptation of automatic speech
recognition (ASR) systems to selected real-world deployment conditions.
It is presented in the form of a collection of twelve articles
dealing with this task; I am the main author or a co-author of these
articles. They were published during my work on several consecutive
research projects. I have participated in the solution of them
as a member of the research team as well as the investigator or a
co-investigator.
These articles can be divided into three main groups according to
their topics. They have in common the effort to adapt a particular
ASR system to a specific factor or deployment condition that affects
its function or accuracy.
The first group of articles is focused on an unsupervised speaker
adaptation task, where the ASR system adapts its parameters to
the specific voice characteristics of one particular speaker. The second
part deals with a) methods allowing the system to identify
non-speech events on the input, and b) the related task of recognition
of speech with non-speech events, particularly music, in the
background. Finally, the third part is devoted to the methods
that allow the transcription of an audio signal containing multilingual
utterances. It includes a) approaches for adapting the existing
recognition system to a new language and b) methods for identification
of the language from the audio signal.
The two mentioned identification tasks are in particular investigated
under the demanding and less explored frame-wise scenario,
which is the only one suitable for processing of on-line data streams
Аналитический обзор интегральных систем распознавания речи
This article presents an analytic survey of various end-to-end speech recognition systems, as well as some approaches to their construction and optimization. We consider models based on connectionist temporal classification (CTC), models based on encoder-decoder architecture with attention mechanism and models using conditional random field (CRF). We also describe integration possibilities with language models at a stage of decoding. We see that such an approach significantly reduces recognition error rates for end-to-end models. A survey of research works in this subject area reveals that end-to-end systems allow achieving results close to that of the state-of-the-art hybrid models. Nevertheless, end-to-end models use simple configuration and demonstrate a high speed of learning and decoding. In addition, we consider popular frameworks and toolkits for creating speech recognition systems.Приведен аналитический обзор разновидностей интегральных (end-to-end) систем для распознавания речи, методов их построения, обучения и оптимизации. Рассмотрены варианты моделей на основе коннекционной временной классификации (CTC) в качестве функции потерь для нейронной сети, модели на основе механизма внимания и шифратор-дешифратор моделей. Также рассмотрены нейронные сети, построенные с использованием условных случайных полей (CRF), которые являются обобщением скрытых марковских моделей, что позволяет исправить многие недостатки стандартных гибридных систем распознавания речи, например, предположение о том, что элементы входных последовательностей звуков речи являются независимыми случайными величинами. Также описаны возможности интеграции с языковыми моделями на этапе декодирования, демонстрирующие существенное сокращение ошибки распознавания для интеграционных моделей. Описаны различные модификации и улучшения стандартных интегральных архитектур систем распознавания речи, как, например, обобщение коннекционной классификации и использовании регуляризации в моделях, основанных на механизмах внимания. Обзор исследований, проводимых в данной предметной области, показывает, что интегральные системы распознавания речи позволяют достичь результатов, сравнимых с результатами стандартных систем, использующих скрытые марковские модели, но с применением более простой конфигурации и быстрой работой системы распознавания как при обучении, так и при декодировании. Рассмотрены наиболее популярные и развивающиеся библиотеки и инструментарии для построения интегральных систем распознавания речи, такие как TensorFlow, Eesen, Kaldi и другие. Проведено сравнение описанных инструментариев по критериям простоты и доступности их использования для реализации интегральных систем распознавания речи
- …