Comparing Human and Machine Errors in Conversational Speech Transcription
Recent work in automatic recognition of conversational telephone speech (CTS)
has achieved accuracy levels comparable to human transcribers, although there
is some debate over how to precisely quantify human performance on this task, using
the NIST 2000 CTS evaluation set. This raises the question of what systematic
differences, if any, may be found differentiating human from machine
transcription errors. In this paper we approach this question by comparing the
output of our most accurate CTS recognition system to that of a standard speech
transcription vendor pipeline. We find that the most frequent substitution,
deletion and insertion error types of both outputs show a high degree of
overlap. The only notable exception is that the automatic recognizer tends to
confuse filled pauses ("uh") and backchannel acknowledgments ("uhhuh"). Humans
tend not to make this error, presumably due to the distinctive and opposing
pragmatic functions attached to these words. Furthermore, we quantify the
correlation between human and machine errors at the speaker level, and
investigate the effect of speaker overlap between training and test data.
Finally, we report on an informal "Turing test" asking humans to discriminate
between automatic and human transcription error cases.
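The substitution, deletion and insertion error types mentioned above are conventionally obtained from a minimum edit distance alignment between a reference and a hypothesis transcript. The following is a minimal sketch of that standard alignment, not the scoring tool used in the paper; the function name `align_errors` and the example sentences are illustrative assumptions.

```python
from collections import Counter


def align_errors(ref, hyp):
    """Count error events from a Levenshtein alignment of two word lists.

    Returns a Counter keyed by (type, ref_word, hyp_word), where type is
    "sub", "del" or "ins". Illustrative helper, not the paper's scorer.
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back through the table to recover the individual error events.
    errors = Counter()
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            if mismatch:
                errors[("sub", ref[i - 1], hyp[j - 1])] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            errors[("del", ref[i - 1], None)] += 1
            i -= 1
        else:
            errors[("ins", None, hyp[j - 1])] += 1
            j -= 1
    return errors


def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / reference length."""
    return sum(align_errors(ref, hyp).values()) / len(ref)


if __name__ == "__main__":
    reference = "uh i think so".split()
    hypothesis = "uhhuh i think".split()
    for event, count in align_errors(reference, hypothesis).items():
        print(event, count)
    print("WER:", word_error_rate(reference, hypothesis))
```

Aggregating these counters over a whole test set, separately for the human and the machine transcripts, would give the kind of most-frequent-error lists the abstract compares.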
Latent Dirichlet Allocation Based Organisation of Broadcast Media Archives for Deep Neural Network Adaptation
This paper presents a new method for the discovery of latent domains in diverse speech data, for use in adapting Deep Neural Networks (DNNs) for Automatic Speech Recognition. Our work focuses on transcription of multi-genre broadcast media, which is often only categorised broadly in terms of high level genres such as sports, news, documentary, etc. However, in terms of acoustic modelling these categories are coarse. Instead, it is expected that a mixture of latent domains can better represent the complex and diverse behaviours within a TV show, and therefore lead to better and more robust performance. We propose a new method, whereby these latent domains are discovered with Latent Dirichlet Allocation, in an unsupervised manner. These are used to adapt DNNs using the Unique Binary Code (UBIC) representation for the LDA domains. Experiments conducted on a set of BBC TV broadcasts, with more than 2,000 shows for training and 47 shows for testing, show that the use of LDA-UBIC DNNs reduces the error by up to 13% relative compared to the baseline hybrid DNN models.
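One plausible reading of the UBIC adaptation step can be sketched as follows. This assumes, without confirmation from the abstract, that the code is a one-hot vector for the most probable LDA domain and that it is appended to every acoustic feature frame before the DNN input layer. The names `ubic_code` and `augment_frames`, and the example posteriors, are hypothetical.

```python
def ubic_code(domain_posteriors):
    """Return a one-hot code for the most probable latent domain.

    Assumption: the binary code is one-hot over the LDA domains.
    """
    best = max(range(len(domain_posteriors)), key=domain_posteriors.__getitem__)
    return [1.0 if k == best else 0.0 for k in range(len(domain_posteriors))]


def augment_frames(frames, code):
    """Append the same domain code to every acoustic feature frame."""
    return [list(frame) + list(code) for frame in frames]


if __name__ == "__main__":
    # Hypothetical LDA posteriors over three latent domains for one segment.
    posteriors = [0.1, 0.7, 0.2]
    code = ubic_code(posteriors)
    # Two toy two-dimensional feature frames.
    frames = [[0.5, -0.2], [0.1, 0.3]]
    for frame in augment_frames(frames, code):
        print(frame)
```

In this sketch the network sees the domain identity as extra input dimensions, which is one common way to make a single acoustic model domain-aware.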
Automatic transcription of multi-genre media archives
This paper describes some recent results of our collaborative work on
developing a speech recognition system for the automatic transcription
of media archives from the British Broadcasting Corporation (BBC). The
material includes a wide diversity of shows with their associated
metadata. The latter are highly diverse in terms of completeness,
reliability and accuracy. First, we investigate how to improve lightly
supervised acoustic training, when timestamp information is inaccurate
and when speech deviates significantly from the transcription, and how
to perform evaluations when no reference transcripts are available.
An automatic timestamp correction method as well as word and segment
level combination approaches between the lightly supervised transcripts
and the original programme scripts are presented which yield improved
metadata. Experimental results show that systems trained using the
improved metadata consistently outperform those trained with only the
original lightly supervised decoding hypotheses. Secondly, we show that
the recognition task may benefit from systems trained on a combination
of in-domain and out-of-domain data. Working with tandem HMMs, we
describe Multi-level Adaptive Networks, a novel technique for
incorporating information from out-of-domain posterior features using
deep neural networks. We show that it provides a substantial reduction in
WER over other systems including a PLP-based baseline, in-domain tandem
features, and the best out-of-domain tandem features. This research was supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology). This paper was presented at the First Workshop on Speech, Language and Audio in Multimedia, August 22-23, 2013; Marseille. It was published in CEUR Workshop Proceedings at http://ceur-ws.org/Vol-1012/
Improving lightly supervised training for broadcast transcription
This paper investigates improving lightly supervised acoustic
model training for an archive of broadcast data. Standard
lightly supervised training uses automatically derived decoding
hypotheses using a biased language model. However, as the
actual speech can deviate significantly from the original programme
scripts that are supplied, the quality of standard lightly
supervised hypotheses can be poor. To address this issue, word
and segment level combination approaches are used between
the lightly supervised transcripts and the original programme
scripts which yield improved transcriptions. Experimental results
show that systems trained using these improved transcriptions
consistently outperform those trained using only the original
lightly supervised decoding hypotheses. This is shown to be
the case for both the maximum likelihood and minimum phone
error trained systems. The research leading to these results was supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology). This is the accepted manuscript version. The final version is available at http://www.isca-speech.org/archive/interspeech_2013/i13_2187.html
Cross-domain paraphrasing for improving language modelling using out-of-domain data
In natural languages the variability in the underlying linguistic
generation rules significantly alters the observed surface word
sequence they create, and thus introduces a mismatch against
other data generated via alternative realizations associated with,
for example, a different domain. Hence, direct modelling of
out-of-domain data can result in poor generalization to the in-domain
data of interest. To handle this problem, this paper
investigated using cross-domain paraphrastic language models
to improve in-domain language modelling (LM) using out-of-domain
data. Phrase level paraphrase models learnt from each
domain were used to generate paraphrase variants for the data
of other domains. These were used to both improve the context
coverage of in-domain data, and reduce the domain mismatch of
the out-of-domain data. A significant error rate reduction of 0.6%
absolute was obtained on a state-of-the-art conversational telephone
speech recognition task using a cross-domain paraphrastic
multi-level LM trained on a billion words of mixed conversational
and broadcast news data. Consistent improvements on
the in-domain data context coverage were also obtained. The research leading to these results was supported by EPSRC Programme
Grant EP/I031022/1 (Natural Speech Technology) and DARPA
under the Broad Operational Language Translation (BOLT) program. This is the accepted manuscript. The final version is available at http://www.isca-speech.org/archive/interspeech_2013/i13_3424.htm
Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems
Fundamental modelling differences between hybrid and end-to-end (E2E)
automatic speech recognition (ASR) systems create large diversity and
complementarity among them. This paper investigates multi-pass rescoring and
cross adaptation based system combination approaches for hybrid TDNN and
Conformer E2E ASR systems. In multi-pass rescoring, state-of-the-art hybrid
LF-MMI trained CNN-TDNN system featuring speed perturbation, SpecAugment and
Bayesian learning hidden unit contributions (LHUC) speaker adaptation was used
to produce initial N-best outputs before being rescored by the speaker adapted
Conformer system using a 2-way cross system score interpolation. In cross
adaptation, the hybrid CNN-TDNN system was adapted to the 1-best output of the
Conformer system or vice versa. Experiments on the 300-hour Switchboard corpus
suggest that the combined systems derived using either of the two system
combination approaches outperformed the individual systems. The best combined
system obtained using multi-pass rescoring produced statistically significant
word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9%
relative) over the stand alone Conformer system on the NIST Hub5'00, Rt03 and
Rt02 evaluation data. Comment: It's accepted to ISCA 202
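The multi-pass rescoring idea described in this abstract can be illustrated with a minimal sketch: a first-pass system supplies an N-best list with scores, a second system scores each hypothesis, and the two scores are linearly interpolated. The interpolation weight, the log-score convention, the function name `rescore_nbest` and the toy scores are all assumptions made for illustration; the abstract does not specify them.

```python
def rescore_nbest(nbest, second_pass_score, weight=0.5):
    """Pick the best hypothesis after 2-way score interpolation.

    nbest: list of (hypothesis, first_pass_log_score) pairs.
    second_pass_score: callable returning the second system's log score.
    weight: interpolation weight given to the second system (assumed value).
    """
    def combined(item):
        hyp, first_score = item
        return (1.0 - weight) * first_score + weight * second_pass_score(hyp)

    best_hyp, _ = max(nbest, key=combined)
    return best_hyp


if __name__ == "__main__":
    # Toy first-pass N-best list with log scores (hypothetical numbers).
    nbest = [("a b c", -1.0), ("a b d", -1.2)]
    # Toy second-pass scores standing in for a second ASR system.
    second = {"a b c": -3.0, "a b d": -1.0}
    print(rescore_nbest(nbest, second.get, weight=0.5))
```

With `weight=0.0` the first-pass ranking is kept unchanged, so the weight controls how much the second system is allowed to reorder the list.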