75 research outputs found

Subphonetic Modeling for Speech Recognition

Author: Mei-yuh Hwang
Xuedong Huang
Publication date: 01/01/1992

How to capture important acoustic clues and estimate essential parameters reliably is one of the central issues in speech recognition, since we will never have sufficient training data to model various acoustic-phonetic phenomena. Successful examples include subword models with many smoothing techniques. In comparison with subword models, subphonetic modeling may provide a finer level of detail. We propose to model subphonetic events with Markov states and treat the state in phonetic hidden Markov models as our basic subphonetic unit, the senone. A word model is a concatenation of state-dependent senones, and senones can be shared across different word models. Senones not only allow parameter sharing, but also enable pronunciation optimization and new word learning, where the phonetic baseform is replaced by the senonic baseform. In this paper, we report preliminary subphonetic modeling results, which not only significantly reduced the word error rate for speaker-independent continuous speech recognition but also demonstrated a novel application for new word learning.
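The idea in the abstract above, that a word model is a concatenation of senones and that senones can be shared across word models, can be illustrated with a minimal sketch. The senone IDs, the three example words, and the helper names `senone_usage` and `shared_senones` are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from collections import Counter

# Hypothetical senonic baseforms: each word is a sequence of senone IDs,
# where one ID stands for one shared HMM state distribution.
# These IDs and words are illustrative assumptions, not data from the paper.
SENONIC_BASEFORMS = {
    "cat": [12, 47, 47, 103, 8],
    "cap": [12, 47, 47, 103, 31],
    "tap": [8, 52, 47, 103, 31],
}


def senone_usage(baseforms):
    """Count how many distinct words use each senone ID."""
    usage = Counter()
    for senones in baseforms.values():
        # set() so a senone repeated inside one word is counted once
        usage.update(set(senones))
    return usage


def shared_senones(baseforms):
    """Return the sorted senone IDs that appear in more than one word."""
    return sorted(s for s, n in senone_usage(baseforms).items() if n > 1)


if __name__ == "__main__":
    print(shared_senones(SENONIC_BASEFORMS))
```

Because the parameters of a shared senone are estimated from every word that uses it, the sharing shown here is what lets sparse training data be pooled across word models.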

CiteSeerX

Crossref

The Microsoft 2017 Conversational Speech Recognition System

Author: Alleva F.
Droppo J.
Huang X.
Stolcke A.
Wu L.
Xiong W.
Publication date: 24/08/2017

We describe the 2017 version of Microsoft's conversational speech recognition system, in which we update our 2016 system with recent developments in neural-network-based acoustic and language modeling to further advance the state of the art on the Switchboard speech recognition task. The system adds a CNN-BLSTM acoustic model to the set of model architectures we combined previously, and includes character-based and dialog session aware LSTM language models in rescoring. For system combination we adopt a two-stage approach, whereby subsets of acoustic models are first combined at the senone/frame level, followed by a word-level voting via confusion networks. We also added a confusion network rescoring step after system combination. The resulting system yields a 5.1% word error rate on the 2000 Switchboard evaluation set.
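The two-stage combination described in the abstract above can be sketched in a heavily simplified form. This is not the system's actual method: plain averaging of senone posteriors is an assumption (the abstract does not say how frame-level scores are combined), and the word-level stage here votes over hypotheses that are assumed to be already aligned slot by slot, whereas real confusion networks also handle alignment and posterior weights. The function names are hypothetical.

```python
from collections import Counter


def average_frame_posteriors(model_posteriors):
    """Stage 1 (assumed form): average senone posteriors across models.

    model_posteriors[m][t][s] is the posterior of senone s at frame t
    according to model m. All models must share frame and senone counts.
    """
    n_models = len(model_posteriors)
    n_frames = len(model_posteriors[0])
    n_senones = len(model_posteriors[0][0])
    return [
        [
            sum(model_posteriors[m][t][s] for m in range(n_models)) / n_models
            for s in range(n_senones)
        ]
        for t in range(n_frames)
    ]


def vote_words(aligned_hypotheses):
    """Stage 2 (simplified): majority vote in each pre-aligned word slot."""
    result = []
    for slot in zip(*aligned_hypotheses):
        word, _count = Counter(slot).most_common(1)[0]
        result.append(word)
    return result


if __name__ == "__main__":
    combined = average_frame_posteriors([[[0.8, 0.2]], [[0.4, 0.6]]])
    print(combined)
    print(vote_words([["a", "b", "c"], ["a", "x", "c"], ["a", "b", "d"]]))
```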

arXiv.org e-Print Archive

Crossref

Transformer-based Acoustic Modeling for Hybrid Speech Recognition

Author: Fuegen Christian
Huang Hongzhao
Le Duc
Liu Chunxi
Mahadeokar Jay
Mohamed Abdelrahman
Seltzer Michael L.
Tjandra Andros
Wang Yongqiang
Xiao Alex
Zhang Frank
Zhang Xiaohui
Zweig Geoffrey
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 29/04/2020

We propose and evaluate transformer-based acoustic models (AMs) for hybrid speech recognition. Several modeling choices are discussed in this work, including various positional embedding methods and an iterated loss to enable training deep transformers. We also present a preliminary study of using limited right context in transformer models, which makes streaming applications possible. We demonstrate that on the widely used Librispeech benchmark, our transformer-based AM outperforms the best published hybrid result by 19% to 26% relative when the standard n-gram language model (LM) is used. Combined with neural network LM for rescoring, our proposed approach achieves state-of-the-art results on Librispeech. Our findings are also confirmed on a much larger internal dataset. Comment: to appear in ICASSP 202
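The limited right context mentioned in the abstract above is commonly realized as a self-attention mask. The sketch below builds such a mask for a single layer, assuming unlimited left context and a fixed number of look-ahead frames; the function name and parameters are hypothetical, and the abstract does not specify the paper's exact masking scheme.

```python
def limited_right_context_mask(n_frames, right_context):
    """Build a boolean attention mask for one self-attention layer.

    mask[i][j] is True when frame i may attend to frame j. Every past
    frame is visible (an assumption), and at most `right_context`
    future frames are visible.
    """
    if right_context < 0:
        raise ValueError("right_context must be non-negative")
    return [
        [j <= i + right_context for j in range(n_frames)]
        for i in range(n_frames)
    ]


if __name__ == "__main__":
    for row in limited_right_context_mask(4, 1):
        print(["x" if allowed else "." for allowed in row])
```

When several such layers are stacked, the effective look-ahead accumulates across layers, so the per-layer right context has to be chosen with the total latency budget in mind.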

arXiv.org e-Print Archive

Crossref

Analyzing Autoencoder-Based Acoustic Word Embeddings

Author: Goldwater Sharon
Kamper Herman
Matusevych Yevgen
Publication date: 03/04/2020

Recent studies have introduced methods for learning acoustic word embeddings (AWEs): fixed-size vector representations of words which encode their acoustic features. Despite the widespread use of AWEs in speech processing research, they have only been evaluated quantitatively in their ability to discriminate between whole word tokens. To better understand the applications of AWEs in various downstream tasks and in cognitive modeling, we need to analyze the representation spaces of AWEs. Here we analyze basic properties of AWE spaces learned by a sequence-to-sequence encoder-decoder model in six typologically diverse languages. We first show that these AWEs preserve some information about words' absolute duration and speaker. At the same time, the representation space of these AWEs is organized such that the distance between words' embeddings increases with those words' phonetic dissimilarity. Finally, the AWEs exhibit a word onset bias, similar to patterns reported in various studies on human speech processing and lexical access. We argue this is a promising result and encourage further evaluation of AWEs as a potentially useful tool in cognitive science, which could provide a link between speech processing and lexical memory. Comment: 6 pages, 7 figures, accepted to BAICS workshop (ICLR2020

arXiv.org e-Print Archive

Edinburgh Research Explorer

Automatsko raspoznavanje hrvatskoga govora velikoga vokabulara (Croatian Large Vocabulary Automatic Speech Recognition)

Author: Ivo Ipšić
Miran Pobar
Sanda Martinčić-Ipšić
Publication venue: KoREMA - Croatian Society for Communications, Computing, Electronics, Measurement and Control
Publication date: 01/01/2011

This paper presents procedures used for the development of a Croatian large vocabulary automatic speech recognition system (LVASR). The proposed acoustic model is based on context-dependent triphone hidden Markov models and Croatian phonetic rules. Different acoustic and language models, developed using a large collection of Croatian speech, are discussed and compared. The paper proposes the feature vectors and acoustic modeling procedures with which the lowest word error rates for Croatian speech are achieved. In addition, Croatian language modeling procedures are evaluated and adopted for speaker-independent spontaneous speech recognition. The presented experiments and results show that the proposed approach to automatic speech recognition, using context-dependent acoustic modeling based on Croatian phonetic rules and a parameter tying procedure, can be used for efficient Croatian large vocabulary speech recognition with word error rates below 5%.
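Word error rate, the metric reported in the abstract above, is conventionally computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a minimal version assuming whitespace tokenization and no text normalization; the function name and the example sentences are illustrative and unrelated to the paper's data.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a rate below 5% means fewer than one word-level error per twenty reference words on average.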

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

Automatic architecture design by likelihood-based context clustering with crossvalidation

Author: Rogina Ivica
Publication date: 02/08/2007

KITopen

Multilingual Articulatory Features

Author: Metze Florian
Schultz Tanja
Stüker Sebastian
Waibel Alex
Publication date: 13/06/2008

KITopen