
    The 2005 AMI system for the transcription of speech in meetings

    In this paper we describe the 2005 AMI system for the transcription of speech in meetings used for participation in the 2005 NIST RT evaluations. The system was designed for participation in the speech-to-text part of the evaluations, in particular for transcription of speech recorded with multiple distant microphones and independent headset microphones. System performance was tested on both conference room and lecture style meetings. Although input sources are processed using different front-ends, the recognition process is based on a unified system architecture. The system operates in multiple passes and makes use of state-of-the-art technologies such as discriminative training, vocal tract length normalisation, heteroscedastic linear discriminant analysis, speaker adaptation with maximum likelihood linear regression and minimum word error rate decoding. We report the system performance on the official development and test sets for the NIST RT05s evaluations. The system was jointly developed in less than 10 months by a multi-site team and was shown to achieve very competitive performance.
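
    To make one of the listed techniques concrete, below is a minimal Python sketch of the piecewise-linear frequency warp commonly used for vocal tract length normalisation. The warp factor, cut frequency and band edge are illustrative assumptions, not the AMI front-end's actual configuration.

        import numpy as np

        def vtln_warp(freqs, alpha, f_max=8000.0, f_cut=7000.0):
            """Piecewise-linear VTLN warp (illustrative parameters).

            Below f_cut the frequency axis is scaled by alpha; above
            f_cut a second linear segment maps (f_cut, alpha * f_cut)
            to (f_max, f_max) so the band edge stays fixed and the
            warp remains continuous.
            """
            freqs = np.asarray(freqs, dtype=float)
            return np.where(
                freqs <= f_cut,
                alpha * freqs,
                alpha * f_cut
                + (f_max - alpha * f_cut) * (freqs - f_cut) / (f_max - f_cut),
            )

        # Warp the centre frequencies of a mel filterbank for a speaker
        # whose warp factor (found by a maximum-likelihood search in
        # practice) is assumed here to be 1.1.
        centres = np.linspace(100.0, 7600.0, 24)
        print(vtln_warp(centres, alpha=1.1))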

    Visual units and confusion modelling for automatic lip-reading

    Automatic lip-reading (ALR) is a challenging task because the visual speech signal is known to be missing some important information, such as voicing. We propose an approach to ALR that acknowledges that this information is missing but assumes that it is substituted or deleted in a systematic way that can be modelled. We describe a system that learns such a model and then incorporates it into decoding, which is realised as a cascade of weighted finite-state transducers. Our results show a small but statistically significant improvement in recognition accuracy. We also investigate the issue of suitable visual units for ALR, and show that visemes are sub-optimal, not because they introduce lexical ambiguity, but because the reduction in modelling units entailed by their use reduces accuracy.
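
    The lexical ambiguity introduced by visemes can be illustrated with a toy sketch. The many-to-one phoneme-to-viseme grouping below is a simplified illustration, not the paper's actual unit set; it shows how words that differ only in voicing or nasality collapse to the same viseme sequence.

        # A toy phoneme-to-viseme table: bilabials /p b m/ share one
        # visual class, as do /f v/, etc.  Grouping is illustrative only.
        PHONEME_TO_VISEME = {
            "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
            "f": "V_labiodental", "v": "V_labiodental",
            "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar",
            "ae": "V_open",
        }

        def to_visemes(phonemes):
            """Collapse a phoneme sequence to its viseme sequence."""
            return [PHONEME_TO_VISEME[p] for p in phonemes]

        # "pat", "bat" and "mat" become visually identical: the voicing
        # and nasality cues that separate /p b m/ are not on the lips.
        for word, phones in [("pat", ["p", "ae", "t"]),
                             ("bat", ["b", "ae", "t"]),
                             ("mat", ["m", "ae", "t"])]:
            print(word, "->", to_visemes(phones))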

    Kalman tracking of linear predictor and harmonic noise models for noisy speech enhancement

    This paper presents a speech enhancement method based on the tracking and denoising of the formants of a linear prediction (LP) model of the spectral envelope of speech and the parameters of a harmonic noise model (HNM) of its excitation. The main advantages of tracking and denoising the prominent energy contours of speech are the efficient use of the spectral and temporal structures of successive speech frames and a mitigation of the processing artefact known as ‘musical noise’ or ‘musical tones’. The formant-tracking linear prediction (FTLP) model estimation consists of three stages: (a) speech pre-cleaning based on spectral amplitude estimation, (b) formant tracking across successive speech frames using the Viterbi method, and (c) Kalman filtering of the formant trajectories across successive speech frames. The HNM parameters for the excitation signal comprise: the voiced/unvoiced decision, the fundamental frequency, the harmonics’ amplitudes and the variance of the noise component of the excitation. A frequency-domain pitch extraction method is proposed that searches for the peak signal-to-noise ratios (SNRs) at the harmonics. For each speech frame, several pitch candidates are calculated. An estimate of the pitch trajectory across successive frames is obtained using a Viterbi decoder. The trajectories of the noisy excitation harmonics across successive speech frames are modeled and denoised using Kalman filters. The proposed method is used to deconstruct noisy speech, denoise its model parameters and then reconstitute speech from its cleaned parts. Experimental evaluations show the performance gains of the formant tracking, pitch extraction and noise reduction stages.
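
    As a minimal sketch of the Kalman-filtering stage, the code below smooths a single noisy formant trajectory with a scalar random-walk Kalman filter. The process and measurement variances are illustrative placeholders, not the paper's estimated values, and the full method tracks several formants and harmonics jointly.

        import numpy as np

        def kalman_track(observations, q=50.0**2, r=150.0**2):
            """Scalar Kalman filter for one formant track (Hz/frame).

            State model:   f_t = f_{t-1} + w_t,  w_t ~ N(0, q)
            Observation:   z_t = f_t + v_t,      v_t ~ N(0, r)
            q and r are illustrative process/measurement variances.
            """
            f, p = observations[0], r          # initialise from frame 0
            track = []
            for z in observations:
                p = p + q                      # predict
                k = p / (p + r)                # Kalman gain
                f = f + k * (z - f)            # update with observation
                p = (1.0 - k) * p
                track.append(f)
            return np.array(track)

        # A noisy first-formant contour: slow glide plus noise.
        rng = np.random.default_rng(0)
        true_f1 = np.linspace(500.0, 700.0, 100)
        noisy_f1 = true_f1 + rng.normal(0.0, 150.0, size=100)
        print(kalman_track(noisy_f1)[:5])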

    Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System

    In this paper, we explore the encoding/pooling layer and loss function in end-to-end speaker and language recognition systems. First, a unified and interpretable end-to-end system for both speaker and language recognition is developed. It accepts variable-length input and produces an utterance-level result. In the end-to-end system, the encoding layer aggregates the variable-length input sequence into an utterance-level representation. Besides basic temporal average pooling, we introduce a self-attentive pooling layer and a learnable dictionary encoding layer to obtain the utterance-level representation. In terms of the loss function for open-set speaker verification, to obtain more discriminative speaker embeddings, center loss and angular softmax loss are introduced into the end-to-end system. Experimental results on the VoxCeleb and NIST LRE 07 datasets show that the performance of the end-to-end learning system can be significantly improved by the proposed encoding layers and loss functions.

    Comment: Accepted for Speaker Odyssey 2018
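
    A minimal numpy sketch of self-attentive pooling as described above: a learned scoring vector produces per-frame weights via a softmax, and the utterance-level embedding is the weighted average of the frame features. The shapes and the single-head form are assumptions for illustration; the paper's actual layer may differ.

        import numpy as np

        def self_attentive_pooling(h, W, v):
            """Single-head self-attentive pooling.

            h : (T, D) frame-level features for one utterance
            W : (D, A) projection into an attention space
            v : (A,)   scoring vector
            Returns a (D,) utterance-level embedding.
            """
            scores = np.tanh(h @ W) @ v              # (T,) raw scores
            weights = np.exp(scores - scores.max())  # stable softmax
            weights /= weights.sum()
            return weights @ h                       # weighted average

        # Variable-length input: 137 frames of 64-dim features
        # pooled into a single 64-dim utterance representation.
        rng = np.random.default_rng(0)
        h = rng.normal(size=(137, 64))
        W = rng.normal(size=(64, 32)) * 0.1
        v = rng.normal(size=32)
        print(self_attentive_pooling(h, W, v).shape)   # (64,)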

    The Microsoft 2017 Conversational Speech Recognition System

    We describe the 2017 version of Microsoft's conversational speech recognition system, in which we update our 2016 system with recent developments in neural-network-based acoustic and language modeling to further advance the state of the art on the Switchboard speech recognition task. The system adds a CNN-BLSTM acoustic model to the set of model architectures we combined previously, and includes character-based and dialog-session-aware LSTM language models in rescoring. For system combination we adopt a two-stage approach, whereby subsets of acoustic models are first combined at the senone/frame level, followed by word-level voting via confusion networks. We also added a confusion network rescoring step after system combination. The resulting system yields a 5.1% word error rate on the 2000 Switchboard evaluation set.
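
    The word-level voting step can be sketched as follows: each confusion network bin holds competing words with scores contributed by the component systems, and the combined output takes the highest-scoring word per bin. The `vote` helper, the bins and the scores here are invented for illustration; the actual system derives its posteriors from recognition lattices.

        from collections import defaultdict

        def vote(bins):
            """Word-level voting over aligned confusion network bins.

            bins: list of bins; each bin is a list of (word, score)
            pairs from the component systems.  Scores for the same
            word are summed and the best word per bin is emitted.
            """
            output = []
            for bin_ in bins:
                totals = defaultdict(float)
                for word, score in bin_:
                    totals[word] += score
                best = max(totals, key=totals.get)
                if best != "<eps>":    # bin votes for a deletion
                    output.append(best)
            return output

        # Three systems disagree on the second word; summed scores
        # decide, and the third bin is deleted by the <eps> vote.
        bins = [
            [("the", 0.9), ("the", 0.8), ("a", 0.6)],
            [("cat", 0.5), ("hat", 0.6), ("cat", 0.4)],
            [("<eps>", 0.7), ("sat", 0.4), ("<eps>", 0.5)],
        ]
        print(vote(bins))   # ['the', 'cat']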
