33 research outputs found
ALISA: An automatic lightly supervised speech segmentation and alignment tool
This paper describes the ALISA tool, which implements a lightly supervised method for sentence-level alignment of speech with imperfect transcripts. Its intended use is to enable the creation of new speech corpora from a multitude of resources in a language-independent fashion, thus avoiding the need to record or transcribe speech data. The method is designed so that it requires minimum user intervention and expert knowledge, and it is able to align data in languages which employ alphabetic scripts. It comprises a GMM-based voice activity detector and a highly constrained grapheme-based speech aligner. The method is evaluated objectively against a gold standard segmentation and transcription, as well as subjectively through building and testing speech synthesis systems from the retrieved data. Results show that on average, 70% of the original data is correctly aligned, with a word error rate of less than 0.5%. In one case, subjective listening tests show a statistically significant preference for voices built on the gold transcript, but this is small and in other tests, no statistically significant differences between the systems built from the fully supervised training data and the one which uses the proposed method are found.
Using Adaptation to Improve Speech Transcription Alignment in Noisy and Reverberant Environments
When using data retrieved from the internet to create new speech databases, the recording conditions can often be highly variable within and between sessions. This variance influences the overall performance of any automatic speech and text alignment techniques used to process this data. In this paper we discuss the use of speaker adaptation methods to address this issue. Starting from a baseline system for automatic sentence-level segmentation and speech and text alignment based on GMMs and grapheme HMMs, respectively, we employ Maximum A Posteriori (MAP) and Constrained Maximum Likelihood Linear Regression (CMLLR) techniques to model the variation in the data in order to increase the amount of confidently aligned speech. We tested 29 different scenarios, which include reverberation, 8-talker babble noise and white noise, each in various combinations and SNRs. Results show that the MAP-based segmentation's performance is very much influenced by the noise type, as well as the presence or absence of reverberation. On the other hand, the CMLLR adaptation of the acoustic models gives an average 20% increase in the aligned data percentage for the majority of the studied scenarios. Index Terms: speech alignment, speech segmentation, adaptive training, CMLLR, MAP, VA
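The abstract above mentions noise added at various SNRs. As a minimal sketch of what mixing at a target SNR involves (not the authors' setup: the function names, the synthetic sine "speech" and the Gaussian white noise are all assumptions made for illustration), the noise is scaled so that the ratio of speech power to noise power matches the requested value in decibels:

```python
import math
import random

def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    # SNR_dB = 10 * log10(P_speech / P_noise)  =>  P_noise = P_speech / 10^(SNR_dB / 10)
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]

def measured_snr_db(speech, noisy):
    """Recover the SNR of a mixture when the clean signal is known."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(power(speech) / power(residual))

# Hypothetical stand-ins: a 220 Hz tone as "speech" and Gaussian white noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
white = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, white, snr_db=10)
```

Babble noise or reverberation would need recorded noise or an impulse response in place of `white`; the scaling step stays the same.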
The Simple4All entry to the Blizzard Challenge 2013
We describe the synthetic voices entered into the 2013 Blizzard Challenge by the SIMPLE4ALL consortium. The 2013 Blizzard Challenge presents an opportunity to test and benchmark some of the tools we have been developing to address two problems of interest: 1) how best to learn from plentiful "found" data, and 2) how to produce systems in arbitrary new languages with minimal annotated data and language-specific expertise on the part of the system builders. We here explain how our tools were used to address these problems on the different tasks of the challenge, and provide some discussion of the evaluation results. Index Terms: statistical parametric speech synthesis, speech alignment, speech segmentation, style diarisation, unsupervise
Multi-dialect Arabic broadcast speech recognition
Dialectal Arabic speech research suffers from the lack of labelled resources and
standardised orthography. There are three main challenges in dialectal Arabic
speech recognition: (i) finding labelled dialectal Arabic speech data, (ii) training
robust dialectal speech recognition models from limited labelled data and (iii)
evaluating speech recognition for dialects with no orthographic rules. This thesis
is concerned with the following three contributions:
Arabic Dialect Identification: We are mainly dealing with Arabic speech
without prior knowledge of the spoken dialect. Arabic dialects could be sufficiently
diverse to the extent that one can argue that they are different languages
rather than dialects of the same language. We have two contributions:
First, we use crowdsourcing to annotate a multi-dialectal speech corpus collected
from the Al Jazeera TV channel. We obtained utterance-level dialect labels for 57
hours of high-quality speech, drawn from almost 1,000 hours and covering four
major varieties of dialectal Arabic (DA): Egyptian, Levantine, Gulf or Arabian
Peninsula, and North African or Moroccan. Second, we build an Arabic dialect identification
(ADI) system. We explored two main groups of features, namely acoustic
features and linguistic features. For the linguistic features, we look at a wide
range of features, addressing words, characters and phonemes. With respect to
acoustic features, we look at raw features such as mel-frequency cepstral coefficients
combined with shifted delta cepstra (MFCC-SDC), bottleneck features and
the i-vector as a latent variable. We studied both generative and discriminative
classifiers, in addition to deep learning approaches, namely deep neural network
(DNN) and convolutional neural network (CNN). In our work, we propose Arabic
as a five-class dialect challenge comprising the previously mentioned four
dialects as well as Modern Standard Arabic.
Arabic Speech Recognition: We introduce our effort in building Arabic automatic
speech recognition (ASR) and we create an open research community
to advance it. This section has two main goals: First, creating a framework for
Arabic ASR that is publicly available for research. We address our effort in building
two multi-genre broadcast (MGB) challenges. MGB-2 focuses on broadcast
news using more than 1,200 hours of speech and 130M words of text collected
from the broadcast domain. MGB-3, however, focuses on dialectal multi-genre
data with limited non-orthographic speech collected from YouTube, with special
attention paid to transfer learning. Second, building a robust Arabic ASR system
and reporting a competitive word error rate (WER) to use it as a potential
benchmark to advance the state of the art in Arabic ASR. Our overall system is
a combination of five acoustic models (AM): unidirectional long short-term memory
(LSTM), bidirectional LSTM (BLSTM), time delay neural network (TDNN),
TDNN layers along with LSTM layers (TDNN-LSTM) and finally TDNN layers
followed by BLSTM layers (TDNN-BLSTM). The AMs are trained as purely
sequence-trained neural networks using lattice-free maximum mutual information (LF-MMI).
The generated lattices are rescored using a four-gram language model
(LM) and a recurrent neural network with maximum entropy (RNNME) LM.
Our official WER is 13%, which is the lowest WER reported on this task.
Evaluation: The third part of the thesis addresses our effort in evaluating dialectal
speech with no orthographic rules. Our methods learn from multiple
transcribers and align the speech hypothesis to overcome the non-orthographic
aspects. Our multi-reference WER (MR-WER) approach is similar to the BLEU
score used in machine translation (MT). We have also automated this process
by learning different spelling variants from Twitter data. We mine automatically
from a huge collection of tweets in an unsupervised fashion to build more than
11M n-to-m lexical pairs, and we propose a new evaluation metric: dialectal
WER (WERd). Finally, we tried to estimate the word error rate (e-WER) with
no reference transcription using decoding and language features. We show that
our word error rate estimation is robust for many scenarios with and without the
decoding features.
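To make the evaluation problem concrete, the sketch below computes standard WER with a word-level edit distance and then scores a hypothesis against several reference transcriptions by keeping the lowest WER. This is a simplified stand-in, not the thesis's MR-WER or WERd: the minimum-over-references rule, the function names and the transliterated example sentences are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (substitution, insertion, deletion cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)

def min_reference_wer(references, hypothesis):
    """Assumed simplification: score against each reference, keep the best."""
    return min(wer(r, hypothesis) for r in references)

# Hypothetical transcriptions of one utterance with differing spellings.
references = ["ana rayeh el bet", "ana rayih il beit"]
hypothesis = "ana rayeh il bet"
single = wer(references[1], hypothesis)            # 2 substitutions / 4 words
multi = min_reference_wer(references, hypothesis)  # 1 substitution / 4 words
```

With a single reference, a valid spelling variant counts as an error; allowing several references lowers the penalty for variants that any transcriber accepted.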
Speech Recognition Challenge in the Wild: Arabic MGB-3
This paper describes the Arabic MGB-3 Challenge - Arabic Speech Recognition
in the Wild. Unlike last year's Arabic MGB-2 Challenge, for which the
recognition task was based on more than 1,200 hours of broadcast TV news
recordings from Aljazeera Arabic TV programs, MGB-3 emphasises dialectal Arabic
using a multi-genre collection of Egyptian YouTube videos. Seven genres were
used for the data collection: comedy, cooking, family/kids, fashion, drama,
sports, and science (TEDx). A total of 16 hours of videos, split evenly across
the different genres, were divided into adaptation, development and evaluation
data sets. The Arabic MGB-Challenge comprised two tasks: A) Speech
transcription, evaluated on the MGB-3 test set, along with the 10 hour MGB-2
test set to report progress on the MGB-2 evaluation; B) Arabic dialect
identification, introduced this year in order to distinguish between four major
Arabic dialects - Egyptian, Levantine, North African, Gulf, as well as Modern
Standard Arabic. Two hours of audio per dialect were released for development
and a further two hours were used for evaluation. For dialect identification,
both lexical features and i-vector bottleneck features were shared with
participants in addition to the raw audio recordings. Overall, thirteen teams
submitted ten systems to the challenge. We outline the approaches adopted in
each system, and summarise the evaluation results.
The UEDIN ASR Systems for the IWSLT 2014 Evaluation
This paper describes the University of Edinburgh (UEDIN) ASR systems for the 2014 IWSLT Evaluation. Notable features of the English system include deep neural network acoustic models in both tandem and hybrid configuration with the use of multi-level adaptive networks, LHUC adaptation and Maxout units. The German system includes lightly supervised training and a new method for dictionary generation. Our voice activity detection system now uses a semi-Markov model to incorporate a prior on utterance lengths. There are improvements of up to 30% relative WER on the tst2013 English test set.
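For readers unfamiliar with the term, a relative WER improvement is measured as a fraction of the baseline error rate rather than in absolute percentage points. A minimal sketch, with hypothetical numbers that do not come from the paper:

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical example: a drop from 20.0% to 14.0% WER is 6 points absolute,
# which is a 30% reduction relative to the baseline.
reduction = relative_wer_reduction(20.0, 14.0)
```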