Topic Identification for Speech without ASR
Modern topic identification (topic ID) systems for speech use automatic
speech recognition (ASR) to produce speech transcripts, and perform supervised
classification on such ASR outputs. However, under resource-limited conditions,
the manually transcribed speech required to develop standard ASR systems can be
severely limited or unavailable. In this paper, we investigate alternative
unsupervised solutions to obtaining tokenizations of speech in terms of a
vocabulary of automatically discovered word-like or phoneme-like units, without
depending on the supervised training of ASR systems. Moreover, using automatic
phoneme-like tokenizations, we demonstrate that a convolutional neural network
based framework for learning spoken document representations provides
competitive performance compared to a standard bag-of-words representation, as
evidenced by comprehensive topic ID evaluations on both single-label and
multi-label classification tasks.
Comment: 5 pages, 2 figures; accepted for publication at Interspeech 201
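The bag-of-words baseline that this abstract compares against can be sketched as a multinomial naive Bayes classifier over bag-of-units counts. This is a generic illustration, not the paper's system; the unit IDs, topic labels, and toy corpora below are invented for the example.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Multinomial naive Bayes over bag-of-units counts.

    docs: list of token lists (e.g., automatically discovered unit IDs);
    labels: one topic label per document; alpha: add-alpha smoothing.
    """
    vocab = sorted({t for d in docs for t in d})
    priors = Counter(labels)
    class_counts = {}
    for d, y in zip(docs, labels):
        class_counts.setdefault(y, Counter()).update(d)
    model = {}
    for y, counts in class_counts.items():
        total = sum(counts.values())
        denom = total + alpha * len(vocab)
        model[y] = (
            math.log(priors[y] / len(docs)),                        # log prior
            {t: math.log((counts[t] + alpha) / denom) for t in vocab},
            math.log(alpha / denom),                                # unseen-unit fallback
        )
    return model

def predict(model, doc):
    def score(y):
        log_prior, log_like, oov = model[y]
        return log_prior + sum(log_like.get(t, oov) for t in doc)
    return max(model, key=score)

# Toy tokenizations in terms of discovered units, for two topics.
docs = [["u1", "u2", "u1"], ["u1", "u3"], ["u4", "u5"], ["u4", "u4", "u6"]]
labels = ["sports", "sports", "weather", "weather"]
model = train_nb(docs, labels)
print(predict(model, ["u1", "u2"]))   # -> sports
print(predict(model, ["u4", "u6"]))   # -> weather
```

The same pipeline applies whether the tokens are ASR words, discovered word-like units, or phoneme-like units; only the vocabulary changes.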
An Empirical Evaluation of Zero Resource Acoustic Unit Discovery
Acoustic unit discovery (AUD) is a process of automatically identifying a
categorical acoustic unit inventory from speech and producing corresponding
acoustic unit tokenizations. AUD provides an important avenue for unsupervised
acoustic model training in a zero resource setting where expert-provided
linguistic knowledge and transcribed speech are unavailable. To further
facilitate the zero-resource AUD process, in this paper we demonstrate that
acoustic feature representations can be significantly improved by (i)
performing linear discriminant analysis (LDA) in an unsupervised, self-trained
fashion, and (ii) leveraging the resources of other languages by building a
multilingual bottleneck (BN) feature extractor for effective cross-lingual
generalization. Moreover, we perform comprehensive evaluations of AUD efficacy
on multiple downstream speech applications; their correlated performance
suggests that AUD evaluation is feasible using alternative language resources
when only a subset of the standard evaluation resources is available, as is
typical in zero-resource applications.
Comment: 5 pages, 1 figure; accepted for publication at ICASSP 201
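The self-trained LDA idea can be sketched minimally: pseudo-labels come from an unsupervised clustering step (here, assignment to the nearest of two fixed seed points, a deterministic stand-in for k-means over frames), and a two-class Fisher discriminant is then fit to those pseudo-labels. The 2-D features, seeds, and data are illustrative, not the paper's setup.

```python
def fisher_lda_selftrained(frames, seeds):
    """Two-class Fisher LDA fit to pseudo-labels from clustering.

    frames: 2-D feature vectors; seeds: two fixed cluster centres
    (a deterministic stand-in for unsupervised k-means).
    """
    # Step 1: pseudo-label each frame by its nearest seed.
    def nearest(x):
        d = [sum((a - b) ** 2 for a, b in zip(x, s)) for s in seeds]
        return d.index(min(d))
    clusters = [[], []]
    for x in frames:
        clusters[nearest(x)].append(x)

    # Step 2: class means and pooled within-class scatter (2x2).
    def mean(pts):
        return [sum(p[k] for p in pts) / len(pts) for k in range(2)]
    m = [mean(c) for c in clusters]
    sw = [[0.0, 0.0], [0.0, 0.0]]
    for c, mu in zip(clusters, m):
        for p in c:
            d = [p[0] - mu[0], p[1] - mu[1]]
            for i in range(2):
                for j in range(2):
                    sw[i][j] += d[i] * d[j]

    # Step 3: w = Sw^-1 (m0 - m1), via an explicit 2x2 inverse.
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    dm = [m[0][0] - m[1][0], m[0][1] - m[1][1]]
    return [inv[0][0] * dm[0] + inv[0][1] * dm[1],
            inv[1][0] * dm[0] + inv[1][1] * dm[1]]

frames = [(0, 0), (1, 1), (0, 1), (1, 0), (5, 5), (6, 6), (5, 6), (6, 5)]
w = fisher_lda_selftrained(frames, seeds=[(0, 0), (6, 6)])
proj = [w[0] * x + w[1] * y for x, y in frames]
# The two pseudo-clusters separate cleanly along the learned direction.
```

In practice the projection replaces or augments the raw features, and clustering and LDA can be alternated so each refines the other.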
Phoneme-based topic spotting on the Switchboard Corpus
Thesis (MScEng)--Stellenbosch University, 2002.
ENGLISH ABSTRACT: The field of topic spotting in conversational speech deals with the problem of identifying
"interesting" conversations or speech extracts contained within large volumes of speech
data. Typical applications where the technology can be found include the surveillance
and screening of messages before referring to human operators. Closely related methods
can also be used for data-mining of multimedia databases, literature searches, language
identification, call routing and message prioritisation.
The first topic spotting systems used words as the most basic units. However, because of the
poor performance of speech recognisers, a large amount of topic-specific hand-transcribed
training data is needed. It is for this reason that researchers started concentrating on methods
using phonemes instead, because the errors then occur on smaller, and therefore less
important, units. Phoneme-based methods consequently make it feasible to use computer
generated transcriptions as training data.
Building on word-based methods, a number of phoneme-based systems have emerged.
The two most promising ones are the Euclidean Nearest Wrong Neighbours (ENWN) algorithm
and the newly developed Stochastic Method for the Automatic Recognition of
Topics (SMART). Previous experiments on the Oregon Graduate Institute of Science and
Technology's Multi-Language Telephone Speech Corpus suggested that SMART yields a
large improvement over ENWN which outperformed competing phoneme-based systems
in evaluations. However, the small amount of data available for these experiments meant
that more rigorous testing was required.
In this research, the algorithms were therefore re-implemented to run on the much larger
Switchboard Corpus. Subsequently, a substantial improvement of SMART over ENWN
was observed, confirming the result that was previously obtained. In addition to this,
an investigation was conducted into the improvement of SMART. This resulted in a new
counting strategy with a corresponding improvement in performance.
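Neither ENWN nor SMART is specified in enough detail here to reproduce. As a hedged illustration of the phoneme-based approach the abstract describes, the sketch below scores a phoneme sequence by a smoothed bigram log-likelihood ratio between a topic model and a background model, both of which could be trained on errorful, computer-generated transcriptions. The phoneme strings are invented for the example; this is a generic baseline, not either thesis algorithm.

```python
import math
from collections import Counter

def bigrams(seq):
    return list(zip(seq, seq[1:]))

def train_bigram_models(topic_docs, background_docs, alpha=1.0):
    """Add-alpha smoothed phoneme-bigram models for topic and background."""
    topic = Counter(b for d in topic_docs for b in bigrams(d))
    background = Counter(b for d in background_docs for b in bigrams(d))
    vocab = set(topic) | set(background)
    def log_prob(counts):
        denom = sum(counts.values()) + alpha * len(vocab)
        return ({b: math.log((counts[b] + alpha) / denom) for b in vocab},
                math.log(alpha / denom))   # fallback for unseen bigrams
    return log_prob(topic), log_prob(background)

def spot_score(seq, topic_model, background_model):
    """Log-likelihood ratio; positive scores favour the topic."""
    (t_lp, t_oov), (b_lp, b_oov) = topic_model, background_model
    return sum(t_lp.get(b, t_oov) - b_lp.get(b, b_oov) for b in bigrams(seq))

topic_docs = [["k", "ae", "t"], ["k", "ae", "t", "s"]]
background_docs = [["d", "ao", "g"], ["d", "ao", "g", "z"]]
tm, bm = train_bigram_models(topic_docs, background_docs)
print(spot_score(["k", "ae", "t"], tm, bm) > 0)   # True: topic-like
print(spot_score(["d", "ao", "g"], tm, bm) > 0)   # False: background-like
```

Because the statistics are over short phoneme n-grams rather than whole words, recognition errors corrupt individual counts rather than entire lexical features, which is the robustness argument made in the abstract.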
A summary of the 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition
We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding zero resource (unsupervised) speech technologies and related models of early language acquisition. Centered around the tasks of phonetic and lexical discovery, we consider unified evaluation metrics, present two new approaches for improving speaker independence in the absence of supervision, and evaluate the application of Bayesian word segmentation algorithms to automatic subword unit tokenizations. Finally, we present two strategies for integrating zero resource techniques into supervised settings, demonstrating the potential of unsupervised methods to improve mainstream technologies.
Discriminative and adaptive training for robust speech recognition and understanding
Robust automatic speech recognition (ASR) and understanding (ASU) under various conditions remains a challenging problem even with the advances of deep learning. To achieve robust ASU, two discriminative training objectives are proposed for keyword spotting and topic classification: (1) to accurately recognize the semantically important keywords, non-uniform error cost minimum classification error (MCE) training of deep neural network (DNN) and bi-directional long short-term memory (BLSTM) acoustic models is proposed to minimize the recognition errors of only the keywords; (2) to compensate for the mismatched objectives of speech recognition and understanding, minimum semantic error cost training of the BLSTM acoustic model is proposed to generate semantically accurate lattices for topic classification.
Further, to expand the application of the ASU system to various conditions, four adaptive training approaches are proposed to improve the robustness of ASR under different conditions: (1) to suppress the effect of inter-speaker variability on the speaker-independent DNN acoustic model, speaker-invariant training is proposed to learn a deep representation in the DNN that is both senone-discriminative and speaker-invariant through adversarial multi-task training; (2) to achieve condition-robust unsupervised adaptation with parallel data, adversarial teacher-student learning is proposed to suppress multiple factors of condition variability during knowledge transfer from a well-trained source-domain LSTM acoustic model to the target domain; (3) to further improve adversarial learning for unsupervised adaptation with non-parallel data, domain separation networks are used to enhance the domain invariance of the senone-discriminative deep representation by explicitly modeling the private component that is unique to each domain; (4) to achieve robust far-field ASR, an LSTM adaptive beamforming network is proposed to estimate real-time beamforming filter coefficients to cope with non-stationary environmental noise and the dynamic nature of source and microphone positions.
Ph.D.
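The effect of a non-uniform error cost can be illustrated with a minimal Bayes decision rule: under a 0/1 cost the decision follows the raw posterior, but raising the cost of missing a keyword shifts decisions toward keyword hypotheses. The classes, posteriors, and costs below are invented for illustration; the thesis itself builds the non-uniform cost into MCE training of the acoustic model rather than into a post-hoc decision rule.

```python
def bayes_decision(posteriors, cost):
    """Pick the hypothesis with minimum expected cost.

    posteriors: {true_class: probability}; cost[true][decision].
    """
    def expected_cost(decision):
        return sum(p * cost[true][decision] for true, p in posteriors.items())
    return min(posteriors, key=expected_cost)

posteriors = {"keyword": 0.4, "filler": 0.6}

# Uniform 0/1 cost: every error is equally bad.
uniform = {"keyword": {"keyword": 0, "filler": 1},
           "filler": {"keyword": 1, "filler": 0}}
# Non-uniform cost: missing a keyword is 3x as costly as a false alarm.
non_uniform = {"keyword": {"keyword": 0, "filler": 3},
               "filler": {"keyword": 1, "filler": 0}}

print(bayes_decision(posteriors, uniform))      # filler
print(bayes_decision(posteriors, non_uniform))  # keyword
```

The same posterior of 0.4 for "keyword" loses under the uniform cost but wins once keyword misses are weighted more heavily, which is the behaviour a keyword-focused training objective aims for.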
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
Unsupervised crosslingual adaptation of tokenisers for spoken language recognition
Phone tokenisers are used in spoken language recognition (SLR) to obtain elementary
phonetic information. We present a study on the use of deep neural
network tokenisers. Unsupervised crosslingual adaptation was performed to
adapt the baseline tokeniser trained on English conversational telephone speech
data to different languages. Two training and adaptation approaches, namely
cross-entropy adaptation and state-level minimum Bayes risk adaptation, were
tested in a bottleneck i-vector and a phonotactic SLR system. The SLR systems
using the tokenisers adapted to different languages were combined using score
fusion, giving 7-18% reduction in minimum detection cost function (minDCF)
compared with the baseline configurations without adapted tokenisers. Analysis
of results showed that the ensemble tokenisers gave diverse representation of
phonemes, thus bringing complementary effects when SLR systems with different
tokenisers were combined. SLR performance was also shown to be related
to the quality of the adapted tokenisers.
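The minimum detection cost function (minDCF) cited above follows the standard NIST-style definition: DCF(θ) = C_miss·P_miss(θ)·P_tgt + C_fa·P_fa(θ)·(1 − P_tgt), minimized over the decision threshold θ and normalized by the cost of the better trivial system. The sketch below uses illustrative scores and default cost parameters, not the paper's evaluation settings.

```python
def min_dcf(target_scores, nontarget_scores, p_target=0.5, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all decision thresholds."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = float("inf")
    for th in thresholds:
        # Accept a trial when its score is at or above the threshold.
        p_miss = sum(s < th for s in target_scores) / len(target_scores)
        p_fa = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
        best = min(best, dcf)
    # Normalize by the better of "accept everything" / "reject everything".
    return best / min(c_miss * p_target, c_fa * (1 - p_target))

# Toy scores: higher means more target-like.
print(min_dcf([2.0, 3.0, 0.5], [0.0, 1.0, -1.0]))
```

A relative reduction such as the 7-18% reported above would be computed between the minDCF values of the fused and baseline systems.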