81 research outputs found

Lightly supervised alignment of subtitles on multi-genre broadcasts

Author: Deena S.
Doulaty M.
Hain T.
Hasan M.
Khaliq B.
Milner R.
Ng R.W.M.
Olcoz J.
Saz Torralba O.
Publication venue: Springer Nature
Publication date: 01/12/2018

This paper describes a system for performing alignment of subtitles to audio on multi-genre broadcasts using a lightly supervised approach. Accurate alignment of subtitles plays a substantial role in the daily work of media companies and currently still requires large human effort. Here, a comprehensive approach to performing this task in an automated way using lightly supervised alignment is proposed. The paper explores the different alternatives to speech segmentation, lightly supervised speech recognition and alignment of text streams. The proposed system uses lightly supervised decoding to improve the alignment accuracy by performing language model adaptation using the target subtitles. The system thus built achieves the third best reported result in the alignment of broadcast subtitles in the Multi-Genre Broadcast (MGB) challenge, with an F1 score of 88.8%. This system is available for research and other non-commercial purposes through webASR, the University of Sheffield’s cloud-based speech technology web service. Taking as inputs an audio file and untimed subtitles, webASR can produce timed subtitles in multiple formats, including TTML, WebVTT and SRT.
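As an illustration of the last step described in this abstract (turning aligned subtitles into a timed format), the following is a minimal sketch of SRT serialisation. It is not webASR code; the function names and the `(start, end, text)` cue layout are assumptions made for this example.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(cues):
    """Render (start_seconds, end_seconds, text) cues as SRT text.

    The cue layout is a hypothetical input format chosen for this sketch.
    """
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    example = [(0.5, 2.25, "Good evening."), (2.4, 5.0, "Here is the news.")]
    print(to_srt(example))
```

Each cue is a sequence number, a timing line and the subtitle text; the alignment system's job is to supply the start and end times that the untimed subtitles lack.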

White Rose Research Online

Automatic transcription of multi-genre media archives

Author: Bell PJ
Gales MJF
Hain T
Lanchantin P
Liu X
Long Y
Quinnell J
Renals S
Saz O
Seigel MS
Swietojanski P
Woodland PC
Publication venue: CEUR Workshop Proceedings
Publication date: 01/01/2013

This paper describes some recent results of our collaborative work on developing a speech recognition system for the automatic transcription of media archives from the British Broadcasting Corporation (BBC). The material includes a wide diversity of shows with their associated metadata. The latter are highly diverse in terms of completeness, reliability and accuracy. First, we investigate how to improve lightly supervised acoustic training when timestamp information is inaccurate and when speech deviates significantly from the transcription, and how to perform evaluations when no reference transcripts are available. An automatic timestamp correction method and word- and segment-level combination approaches between the lightly supervised transcripts and the original programme scripts are presented, which yield improved metadata. Experimental results show that systems trained using the improved metadata consistently outperform those trained with only the original lightly supervised decoding hypotheses. Secondly, we show that the recognition task may benefit from systems trained on a combination of in-domain and out-of-domain data. Working with tandem HMMs, we describe Multi-level Adaptive Networks, a novel technique for incorporating information from out-of-domain posterior features using deep neural networks. We show that it provides a substantial reduction in WER over other systems including a PLP-based baseline, in-domain tandem features, and the best out-of-domain tandem features. This research was supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology). This paper was presented at the First Workshop on Speech, Language and Audio in Multimedia, August 22-23, 2013; Marseille. It was published in CEUR Workshop Proceedings at http://ceur-ws.org/Vol-1012/

CiteSeerX

Edinburgh Research Explorer

Apollo (Cambridge)

White Rose Research Online

CUED - Cambridge University Engineering Department

Recurrent neural network language model adaptation for multi-genre broadcast speech recognition and alignment

Author: Deena S.
Doulaty M.
Hain T.
Hasan M.
Saz O.
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/03/2019

Recurrent neural network language models (RNNLMs) generally outperform n-gram language models when used in automatic speech recognition. Adapting RNNLMs to new domains is an open problem and current approaches can be categorised as either feature-based or model-based. In feature-based adaptation, the input to the RNNLM is augmented with auxiliary features, whilst model-based adaptation includes model fine-tuning and the introduction of adaptation layer(s) in the network. In this paper, the properties of both types of adaptation are investigated on multi-genre broadcast speech recognition. Existing techniques for both types of adaptation are reviewed and the proposed techniques for model-based adaptation, namely the linear hidden network (LHN) adaptation layer and the K-component adaptive RNNLM, are investigated. Moreover, new features derived from the acoustic domain are investigated for RNNLM adaptation. The contributions of this paper include two hybrid adaptation techniques: the fine-tuning of feature-based RNNLMs and a feature-based adaptation layer. Moreover, the semi-supervised adaptation of RNNLMs using genre information is also proposed. The ASR systems were trained using 700h of multi-genre broadcast speech. The gains obtained when using the RNNLM adaptation techniques proposed in this work are consistent when using RNNLMs trained on an in-domain set of 10M words and on a combination of in-domain and out-of-domain sets of 660M words, with approx. 10% perplexity and 2% relative word error rate improvements on a 28.3h test set. The best RNNLM adaptation techniques for ASR are also evaluated on a lightly supervised alignment of subtitles task for the same data, where the use of RNNLM adaptation leads to an absolute increase in the F-measure of 0.5%.

Crossref

White Rose Research Online

A SYSTEM FOR AUTOMATIC ALIGNMENT OF BROADCAST MEDIA CAPTIONS USING WEIGHTED FINITE-STATE TRANSDUCERS

Author: Peter Bell
Steve Renals
Publication date: 06/03/2020

We describe our system for alignment of broadcast media captions in the 2015 MGB Challenge. A precise time alignment of previously-generated subtitles to media data is important in the process of caption generation by broadcasters. However, this task is challenging due to the highly diverse, often noisy content of the audio, and because the subtitles are frequently not a verbatim representation of the actual words spoken. Our system employs a two-pass approach with appropriately constrained weighted finite state transducers (WFSTs) to enable good alignment even when the audio quality would be challenging for conventional ASR. The system achieves an f-score of 0.8965 on the MGB Challenge development set.
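To make the f-score reported in this abstract concrete, here is a minimal sketch of how precision, recall and F-score can be computed over word timings. It is not the official MGB Challenge scoring tool: the matching rule and the 0.1 s tolerance are assumptions chosen for illustration.

```python
def alignment_f_score(reference, hypothesis, tolerance=0.1):
    """Return (precision, recall, f_score) for aligned words.

    Each item is (word, start_seconds, end_seconds). A hypothesis word is
    counted as correct when an unused reference word has the same text and
    both its start and end times lie within `tolerance` seconds. The
    tolerance value is an assumption, not taken from the challenge rules.
    """
    used = [False] * len(reference)
    matches = 0
    for word, start, end in hypothesis:
        for i, (ref_word, ref_start, ref_end) in enumerate(reference):
            if (not used[i] and word == ref_word
                    and abs(start - ref_start) <= tolerance
                    and abs(end - ref_end) <= tolerance):
                used[i] = True  # each reference word can be matched once
                matches += 1
                break
    precision = matches / len(hypothesis) if hypothesis else 0.0
    recall = matches / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    ref = [("good", 0.0, 0.3), ("evening", 0.3, 0.8), ("everyone", 0.8, 1.4)]
    hyp = [("good", 0.05, 0.32), ("evening", 0.6, 1.0)]
    print(alignment_f_score(ref, hyp))
```

In the toy data the second hypothesis word has the right text but is timed too late, so it lowers precision, while the missing third word lowers recall.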

CiteSeerX

A system for automatic alignment of broadcast media captions using weighted finite-state transducers

Author: Bell Peter
Renals Steve
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2015

Edinburgh Research Explorer

Lattice-based lightly-supervised acoustic model training

Author: Bell Peter
Fainberg Joachim
Klejch Ondrej
Renals Steve
Publication venue: International Speech Communication Association
Publication date: 13/07/2019

In the broadcast domain there is an abundance of related text data and partial transcriptions, such as closed captions and subtitles. This text data can be used for lightly supervised training, in which text matching the audio is selected using an existing speech recognition model. Current approaches to light supervision typically filter the data based on matching error rates between the transcriptions and biased decoding hypotheses. In contrast, semi-supervised training does not require matching text data, instead generating a hypothesis using a background language model. State-of-the-art semi-supervised training uses lattice-based supervision with the lattice-free MMI (LF-MMI) objective function. We propose a technique to combine inaccurate transcriptions with the lattices generated for semi-supervised training, thus preserving uncertainty in the lattice where appropriate. We demonstrate that this combined approach reduces the expected error rates over the lattices, and reduces the word error rate (WER) on a broadcast task. Comment: Proc. INTERSPEECH 201

arXiv.org e-Print Archive

Crossref

Edinburgh Research Explorer

Improved acoustic modelling for automatic literacy assessment of children

Author: Hain T.
Nicolao M.
Sanders M.
Publication venue: International Speech Communication Association
Publication date: 02/09/2018

Automatic literacy assessment of children is a complex task that normally requires carefully annotated data. This paper focuses on a system for the assessment of reading skills, aiming at the detection of a range of fluency and pronunciation errors. Naturally, reading is a prompted task, and thereby the acquisition of training data for acoustic modelling should be straightforward. However, given the prominence of errors in the training set and the importance of labelling them in the transcription, a lightly supervised approach to acoustic modelling has better chances of success. A method based on weighted finite state transducers is proposed to model specific prompt corrections, such as repetitions, substitutions, and deletions, as observed in real recordings. Iterative cycles of lightly supervised training are performed in which decoding improves the transcriptions and the derived models. Improvements are due to increasing accuracy in phone-to-sound alignment and in the training data selection. The effectiveness of the proposed methods for relabelling and acoustic modelling is assessed through experiments on the CHOREC corpus, in terms of sequence error rate and alignment accuracy. Improvements over the baseline of up to 60% and 23.3% respectively are observed.

Crossref

White Rose Research Online

Robust learning of acoustic representations from diverse speech data

Author: Fainberg Joachim
Publication venue: The University of Edinburgh
Publication date: 31/07/2021

Automatic speech recognition is increasingly applied to new domains. A key challenge is to robustly learn, update and maintain representations to cope with transient acoustic conditions. A typical example is broadcast media, for which speakers and environments may change rapidly, and available supervision may be poor. The concern of this thesis is to build and investigate methods for acoustic modelling that are robust to the characteristics and transient conditions as embodied by such media. The first contribution of the thesis is a technique to make use of inaccurate transcriptions as supervision for acoustic model training. There is an abundance of audio with approximate labels, but training methods can be sensitive to label errors, and their use is therefore not trivial. State-of-the-art semi-supervised training makes effective use of a lattice of supervision, inherently encoding uncertainty in the labels to avoid overfitting to poor supervision, but does not make use of the transcriptions. Existing approaches that do aim to make use of the transcriptions typically employ an algorithm to filter or combine the transcriptions with the recognition output from a seed model, but the final result does not encode uncertainty. We propose a method to combine the lattice output from a biased recognition pass with the transcripts, crucially preserving uncertainty in the lattice where appropriate. This substantially reduces the word error rate on a broadcast task. The second contribution is a method to factorise representations for speakers and environments so that they may be combined in novel combinations. In realistic scenarios, the speaker or environment transform at test time might be unknown, or there may be insufficient data to learn a joint transform. We show that in such cases, factorised, or independent, representations are required to avoid deteriorating performance. 
Using i-vectors, we factorise speaker or environment information using multi-condition training with neural networks. Specifically, we extract bottleneck features from networks trained to classify either speakers or environments. The resulting factorised representations prove beneficial when one factor is missing at test time, or when all factors are seen, but not in the desired combination. The third contribution is an investigation of model adaptation in a longitudinal setting. In this scenario, we repeatedly adapt a model to new data, with the constraint that previous data becomes unavailable. We first demonstrate the effect of such a constraint, and show that using a cyclical learning rate may help. We then observe that these successive models lend themselves well to ensembling. Finally, we show that the impact of this constraint in an active learning setting may be detrimental to performance, and suggest combining active learning with semi-supervised training to avoid biasing the model. The fourth contribution is a method to adapt low-level features in a parameter-efficient and interpretable manner. We propose to adapt the filters in a neural feature extractor, known as SincNet. In contrast to traditional techniques that warp the filterbank frequencies in standard feature extraction, adapting SincNet parameters is more flexible and more readily optimised, whilst maintaining interpretability. On a task adapting from adult to child speech, we show that this layer is well suited for adaptation and is very effective with respect to the small number of adapted parameters.
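The abstract mentions a cyclical learning rate without saying which schedule is used. As one common possibility, the sketch below implements a triangular schedule in which the rate rises linearly from a base value to a maximum and back again; the function name and all default values are assumptions for illustration only.

```python
import math


def triangular_lr(step, base_lr=1e-4, max_lr=1e-3, half_cycle=100):
    """Triangular cyclical learning rate for a given training step.

    The rate climbs from base_lr to max_lr over `half_cycle` steps and
    descends over the next `half_cycle` steps, then repeats. All default
    values are hypothetical, not taken from the thesis.
    """
    cycle = math.floor(1 + step / (2 * half_cycle))
    # Position within the current cycle: 1 at the ends, 0 at the peak.
    x = abs(step / half_cycle - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


if __name__ == "__main__":
    for step in (0, 50, 100, 150, 200):
        print(step, triangular_lr(step))
```

In a longitudinal setting one could, for instance, restart the cycle each time a new batch of data arrives, although whether the thesis does so is not stated in the abstract.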

Edinburgh Research Archive

A lightly supervised approach to detect stuttering in children's speech

Author: Alharbi S.
Brumfitt S.
Green P.
Hasan M.
Simons A.J.H.
Publication venue: International Speech Communication Association
Publication date: 02/09/2018

© 2018 International Speech Communication Association. All rights reserved. In speech pathology, new assistive technologies using ASR and machine learning approaches are being developed for detecting speech disorder events. Classically trained ASR models tend to remove disfluencies from spoken utterances, due to their focus on producing clean and readable text output. However, diagnostic systems need to be able to track speech disfluencies, such as stuttering events, in order to determine the severity level of stuttering. To achieve this, ASR systems must be adapted to recognise full verbatim utterances, including pseudo-words and non-meaningful part-words. This work proposes a training regime to address this problem and preserve a full verbatim output of stuttering speech. We use a lightly supervised approach using task-oriented lattices to recognise the stuttering speech of children performing a standard reading task. This approach improved the WER by 27.8% relative to a baseline that uses word-lattices generated from the original prompt. The improved results preserved 63% of stuttering events (including sound, word, part-word and phrase repetition, and revision). This work also proposes a separate correction layer on top of the ASR that detects prolongation events (which are poorly recognised by the ASR). This increases the percentage of preserved stuttering events to 70%.
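For readers unfamiliar with the metric, the sketch below shows a standard word error rate (WER) computation via word-level edit distance, plus the arithmetic behind a relative reduction. The sentences and error rates in the example are invented for illustration and are not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # A verbatim reference keeps the repeated word; a "cleaned" hypothesis
    # that drops it is penalised with one deletion.
    print(word_error_rate("the the cat sat", "the cat sat"))
    # Hypothetical error rates, used only to show the arithmetic.
    print(relative_reduction(0.40, 0.30))
```

Scoring against a verbatim reference is what makes dropped repetitions count as errors, which is why a system tuned for clean, readable output is penalised on stuttered speech.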

Crossref

White Rose Research Online

Metadiscourse Tagging in Academic Lectures

Author: Alharbi Ghada
Publication venue: University of Sheffield Conference Proceedings
Publication date: 31/08/2016

This thesis presents a study into the nature and structure of academic lectures, with a special focus on metadiscourse phenomena. Metadiscourse refers to a set of linguistic expressions that signal specific discourse functions such as Introduction: “Today we will talk about...” and Emphasising: “This is an important point”. These functions are important because they are part of lecturers’ strategies for aiding the understanding of what happens in a lecture. The knowledge of their presence and identity could serve as initial steps toward downstream applications that will require functional analysis of lecture content such as a browser for lecture archives, summarisation, or an automatic minute-taker for lectures. One challenging aspect for metadiscourse detection and classification is that the set of expressions is semi-fixed, meaning that different phrases can indicate the same function. To that end, a four-stage approach is developed to study metadiscourse in academic lectures. First, a corpus of metadiscourse for academic lectures from Physics and Economics courses is built by adapting an existing scheme that describes functional-oriented metadiscourse categories. Second, because producing reference transcripts is a time-consuming task and prone to some errors due to the manual efforts required, an automatic speech recognition (ASR) system is built specifically to produce transcripts of lectures. Since the reference transcripts lack time-stamp information, an alignment system is applied to the reference to be able to evaluate the ASR system. Then, a model is developed using Support Vector Machines (SVMs) to classify metadiscourse tags using both textual and acoustical features. The results show that n-grams are the most inductive features for the task; however, due to data sparsity the model does not generalise for unseen n-grams. This limits its ability to solve the variation issue in metadiscourse expressions.
Continuous Bag-of-Words (CBOW) provides a promising solution as this can capture both the syntactic and semantic similarities between words and thus is able to solve the generalisation issue. However, CBOW ignores the word order completely, something which is very important to be retained when classifying metadiscourse tags. The final stage aims to address the issue of sequence modelling by developing a joint CBOW and Convolutional Neural Network (CNN) model. CNNs can work with continuous features such as word embeddings in an elegant and robust fashion by producing a fixed-size feature vector that is able to identify indicative local information for the tagging task. The results show that metadiscourse tagging using CNNs outperforms the SVM model significantly even on ASR outputs, owing to its ability to predict a sequence of words that is more representative for the task regardless of its position in the sentence. In addition, the inclusion of other features such as part-of-speech (POS) tags and prosodic cues improved the results further. These findings are consistent in both disciplines. The final contribution in this thesis is to investigate the suitability of using metadiscourse tags as discourse features in the lecture structure segmentation model, despite the fact that the task is approached as a classification model and most of the state-of-the-art models are unsupervised. In general, the obtained results show remarkable improvements over the state-of-the-art models in both disciplines.
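The remark that CBOW ignores word order can be seen in a few lines. The sketch below assumes a CBOW-style sentence representation formed by averaging word vectors; the two-dimensional toy embeddings are invented for illustration and do not come from the thesis.

```python
def average_embedding(words, embeddings):
    """Average the vectors of `words`, one common bag-of-words composition."""
    vectors = [embeddings[word] for word in words]
    dim = len(vectors[0])
    return [sum(vec[k] for vec in vectors) / len(vectors) for k in range(dim)]


if __name__ == "__main__":
    # Hypothetical toy embeddings; real ones would be learned from data.
    toy = {"today": [1.0, 0.0], "we": [0.0, 1.0], "discuss": [1.0, 1.0]}
    original = average_embedding(["today", "we", "discuss"], toy)
    shuffled = average_embedding(["discuss", "we", "today"], toy)
    # Both orderings produce the same vector, so order is lost.
    print(original == shuffled)
```

A convolutional layer applied over the sequence of word vectors, as in the joint model described in the abstract, sees neighbouring words together and can therefore retain local order information that the average discards.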

White Rose E-theses Online