Topic Identification for Speech without ASR
Modern topic identification (topic ID) systems for speech use automatic
speech recognition (ASR) to produce speech transcripts, and perform supervised
classification on such ASR outputs. However, under resource-limited conditions,
the manually transcribed speech required to develop standard ASR systems can be
severely limited or unavailable. In this paper, we investigate alternative
unsupervised solutions to obtaining tokenizations of speech in terms of a
vocabulary of automatically discovered word-like or phoneme-like units, without
depending on the supervised training of ASR systems. Moreover, using automatic
phoneme-like tokenizations, we demonstrate that a convolutional neural network
based framework for learning spoken document representations provides
competitive performance compared to a standard bag-of-words representation, as
evidenced by comprehensive topic ID evaluations on both single-label and
multi-label classification tasks.
Comment: 5 pages, 2 figures; accepted for publication at Interspeech 201
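The bag-of-words baseline that this abstract compares against can be sketched as a multinomial naive Bayes classifier over bag-of-units counts. This is a generic illustration, not the paper's system; the unit IDs, topic labels, and toy corpora below are invented for the example.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Multinomial naive Bayes over bag-of-units counts.

    docs: list of token lists (e.g., automatically discovered unit IDs);
    labels: one topic label per document; alpha: add-alpha smoothing.
    """
    vocab = sorted({t for d in docs for t in d})
    priors = Counter(labels)
    class_counts = {}
    for d, y in zip(docs, labels):
        class_counts.setdefault(y, Counter()).update(d)
    model = {}
    for y, counts in class_counts.items():
        total = sum(counts.values())
        denom = total + alpha * len(vocab)
        model[y] = (
            math.log(priors[y] / len(docs)),                        # log prior
            {t: math.log((counts[t] + alpha) / denom) for t in vocab},
            math.log(alpha / denom),                                # unseen-unit fallback
        )
    return model

def predict(model, doc):
    def score(y):
        log_prior, log_like, oov = model[y]
        return log_prior + sum(log_like.get(t, oov) for t in doc)
    return max(model, key=score)

# Toy tokenizations in terms of discovered units, for two topics.
docs = [["u1", "u2", "u1"], ["u1", "u3"], ["u4", "u5"], ["u4", "u4", "u6"]]
labels = ["sports", "sports", "weather", "weather"]
model = train_nb(docs, labels)
print(predict(model, ["u1", "u2"]))   # -> sports
print(predict(model, ["u4", "u6"]))   # -> weather
```

The same pipeline applies whether the tokens are ASR words, discovered word-like units, or phoneme-like units; only the vocabulary changes.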
An Empirical Evaluation of Zero Resource Acoustic Unit Discovery
Acoustic unit discovery (AUD) is a process of automatically identifying a
categorical acoustic unit inventory from speech and producing corresponding
acoustic unit tokenizations. AUD provides an important avenue for unsupervised
acoustic model training in a zero resource setting where expert-provided
linguistic knowledge and transcribed speech are unavailable. To further
facilitate the zero-resource AUD process, in this paper we demonstrate that
acoustic feature representations can be significantly improved by (i)
performing linear discriminant analysis (LDA) in an unsupervised, self-trained
fashion, and (ii) leveraging the resources of other languages by building a
multilingual bottleneck (BN) feature extractor for effective cross-lingual
generalization. Moreover, we perform comprehensive evaluations of AUD efficacy
on multiple downstream speech applications; their correlated performance
suggests that AUD evaluation is feasible using alternative language resources
when only a subset of the standard evaluation resources is available, as is
typical in zero-resource applications.
Comment: 5 pages, 1 figure; accepted for publication at ICASSP 201
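The self-trained LDA idea can be sketched minimally: pseudo-labels come from an unsupervised clustering step (here, assignment to the nearest of two fixed seed points, a deterministic stand-in for k-means over frames), and a two-class Fisher discriminant is then fit to those pseudo-labels. The 2-D features, seeds, and data are illustrative, not the paper's setup.

```python
def fisher_lda_selftrained(frames, seeds):
    """Two-class Fisher LDA fit to pseudo-labels from clustering.

    frames: 2-D feature vectors; seeds: two fixed cluster centres
    (a deterministic stand-in for unsupervised k-means).
    """
    # Step 1: pseudo-label each frame by its nearest seed.
    def nearest(x):
        d = [sum((a - b) ** 2 for a, b in zip(x, s)) for s in seeds]
        return d.index(min(d))
    clusters = [[], []]
    for x in frames:
        clusters[nearest(x)].append(x)

    # Step 2: class means and pooled within-class scatter (2x2).
    def mean(pts):
        return [sum(p[k] for p in pts) / len(pts) for k in range(2)]
    m = [mean(c) for c in clusters]
    sw = [[0.0, 0.0], [0.0, 0.0]]
    for c, mu in zip(clusters, m):
        for p in c:
            d = [p[0] - mu[0], p[1] - mu[1]]
            for i in range(2):
                for j in range(2):
                    sw[i][j] += d[i] * d[j]

    # Step 3: w = Sw^-1 (m0 - m1), via an explicit 2x2 inverse.
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    dm = [m[0][0] - m[1][0], m[0][1] - m[1][1]]
    return [inv[0][0] * dm[0] + inv[0][1] * dm[1],
            inv[1][0] * dm[0] + inv[1][1] * dm[1]]

frames = [(0, 0), (1, 1), (0, 1), (1, 0), (5, 5), (6, 6), (5, 6), (6, 5)]
w = fisher_lda_selftrained(frames, seeds=[(0, 0), (6, 6)])
proj = [w[0] * x + w[1] * y for x, y in frames]
# The two pseudo-clusters separate cleanly along the learned direction.
```

In practice the projection replaces or augments the raw features, and clustering and LDA can be alternated so each refines the other.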
Phoneme-based topic spotting on the Switchboard Corpus
Thesis (MScEng)--Stellenbosch University, 2002.
ENGLISH ABSTRACT: The field of topic spotting in conversational speech deals with the problem of identifying
"interesting" conversations or speech extracts contained within large volumes of speech
data. Typical applications where the technology can be found include the surveillance
and screening of messages before referring to human operators. Closely related methods
can also be used for data-mining of multimedia databases, literature searches, language
identification, call routing and message prioritisation.
The first topic spotting systems used words as the most basic units. However, because of the
poor performance of speech recognisers, a large amount of topic-specific hand-transcribed
training data is needed. It is for this reason that researchers started concentrating on methods
using phonemes instead, because the errors then occur on smaller, and therefore less
important, units. Phoneme-based methods consequently make it feasible to use computer
generated transcriptions as training data.
Building on word-based methods, a number of phoneme-based systems have emerged.
The two most promising ones are the Euclidean Nearest Wrong Neighbours (ENWN) algorithm
and the newly developed Stochastic Method for the Automatic Recognition of
Topics (SMART). Previous experiments on the Oregon Graduate Institute of Science and
Technology's Multi-Language Telephone Speech Corpus suggested that SMART yields a
large improvement over ENWN which outperformed competing phoneme-based systems
in evaluations. However, the small amount of data available for these experiments meant
that more rigorous testing was required.
In this research, the algorithms were therefore re-implemented to run on the much larger
Switchboard Corpus. Subsequently, a substantial improvement of SMART over ENWN
was observed, confirming the result that was previously obtained. In addition to this,
an investigation was conducted into the improvement of SMART. This resulted in a new
counting strategy with a corresponding improvement in performance.
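Neither ENWN nor SMART is specified in enough detail here to reproduce. As a hedged illustration of the phoneme-based approach the abstract describes, the sketch below scores a phoneme sequence by a smoothed bigram log-likelihood ratio between a topic model and a background model, both of which could be trained on errorful, computer-generated transcriptions. The phoneme strings are invented for the example; this is a generic baseline, not either thesis algorithm.

```python
import math
from collections import Counter

def bigrams(seq):
    return list(zip(seq, seq[1:]))

def train_bigram_models(topic_docs, background_docs, alpha=1.0):
    """Add-alpha smoothed phoneme-bigram models for topic and background."""
    topic = Counter(b for d in topic_docs for b in bigrams(d))
    background = Counter(b for d in background_docs for b in bigrams(d))
    vocab = set(topic) | set(background)
    def log_prob(counts):
        denom = sum(counts.values()) + alpha * len(vocab)
        return ({b: math.log((counts[b] + alpha) / denom) for b in vocab},
                math.log(alpha / denom))   # fallback for unseen bigrams
    return log_prob(topic), log_prob(background)

def spot_score(seq, topic_model, background_model):
    """Log-likelihood ratio; positive scores favour the topic."""
    (t_lp, t_oov), (b_lp, b_oov) = topic_model, background_model
    return sum(t_lp.get(b, t_oov) - b_lp.get(b, b_oov) for b in bigrams(seq))

topic_docs = [["k", "ae", "t"], ["k", "ae", "t", "s"]]
background_docs = [["d", "ao", "g"], ["d", "ao", "g", "z"]]
tm, bm = train_bigram_models(topic_docs, background_docs)
print(spot_score(["k", "ae", "t"], tm, bm) > 0)   # True: topic-like
print(spot_score(["d", "ao", "g"], tm, bm) > 0)   # False: background-like
```

Because the statistics are over short phoneme n-grams rather than whole words, recognition errors corrupt individual counts rather than entire lexical features, which is the robustness argument made in the abstract.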
A summary of the 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition
We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding zero resource (unsupervised) speech technologies and related models of early language acquisition. Centered around the tasks of phonetic and lexical discovery, we consider unified evaluation metrics, present two new approaches for improving speaker independence in the absence of supervision, and evaluate the application of Bayesian word segmentation algorithms to automatic subword unit tokenizations. Finally, we present two strategies for integrating zero resource techniques into supervised settings, demonstrating the potential of unsupervised methods to improve mainstream technologies.
Discriminative and adaptive training for robust speech recognition and understanding
Robust automatic speech recognition (ASR) and understanding (ASU) under various conditions remains a challenging problem even with the advances of deep learning. To achieve robust ASU, two discriminative training objectives are proposed for keyword spotting and topic classification: (1) to accurately recognize the semantically important keywords, non-uniform error cost minimum classification error (MCE) training of deep neural network (DNN) and bi-directional long short-term memory (BLSTM) acoustic models is proposed to minimize the recognition errors of only the keywords; (2) to compensate for the mismatched objectives of speech recognition and understanding, minimum semantic error cost training of the BLSTM acoustic model is proposed to generate semantically accurate lattices for topic classification.
Further, to expand the application of the ASU system to various conditions, four adaptive training approaches are proposed to improve the robustness of ASR under different conditions: (1) to suppress the effect of inter-speaker variability on the speaker-independent DNN acoustic model, speaker-invariant training is proposed to learn a deep representation in the DNN that is both senone-discriminative and speaker-invariant through adversarial multi-task training; (2) to achieve condition-robust unsupervised adaptation with parallel data, adversarial teacher-student learning is proposed to suppress multiple factors of condition variability during knowledge transfer from a well-trained source-domain LSTM acoustic model to the target domain; (3) to further improve adversarial learning for unsupervised adaptation with non-parallel data, domain separation networks are used to enhance the domain invariance of the senone-discriminative deep representation by explicitly modeling the private component that is unique to each domain; (4) to achieve robust far-field ASR, an LSTM adaptive beamforming network is proposed to estimate real-time beamforming filter coefficients to cope with non-stationary environmental noise and the dynamic nature of source and microphone positions.
Ph.D.
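The effect of a non-uniform error cost can be illustrated with a minimal Bayes decision rule: under a 0/1 cost the decision follows the raw posterior, but raising the cost of missing a keyword shifts decisions toward keyword hypotheses. The classes, posteriors, and costs below are invented for illustration; the thesis itself builds the non-uniform cost into MCE training of the acoustic model rather than into a post-hoc decision rule.

```python
def bayes_decision(posteriors, cost):
    """Pick the hypothesis with minimum expected cost.

    posteriors: {true_class: probability}; cost[true][decision].
    """
    def expected_cost(decision):
        return sum(p * cost[true][decision] for true, p in posteriors.items())
    return min(posteriors, key=expected_cost)

posteriors = {"keyword": 0.4, "filler": 0.6}

# Uniform 0/1 cost: every error is equally bad.
uniform = {"keyword": {"keyword": 0, "filler": 1},
           "filler": {"keyword": 1, "filler": 0}}
# Non-uniform cost: missing a keyword is 3x as costly as a false alarm.
non_uniform = {"keyword": {"keyword": 0, "filler": 3},
               "filler": {"keyword": 1, "filler": 0}}

print(bayes_decision(posteriors, uniform))      # filler
print(bayes_decision(posteriors, non_uniform))  # keyword
```

The same posterior of 0.4 for "keyword" loses under the uniform cost but wins once keyword misses are weighted more heavily, which is the behaviour a keyword-focused training objective aims for.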
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
Unsupervised crosslingual adaptation of tokenisers for spoken language recognition
Phone tokenisers are used in spoken language recognition (SLR) to obtain elementary
phonetic information. We present a study on the use of deep neural
network tokenisers. Unsupervised crosslingual adaptation was performed to
adapt the baseline tokeniser trained on English conversational telephone speech
data to different languages. Two training and adaptation approaches, namely
cross-entropy adaptation and state-level minimum Bayes risk adaptation, were
tested in a bottleneck i-vector and a phonotactic SLR system. The SLR systems
using the tokenisers adapted to different languages were combined using score
fusion, giving 7-18% reduction in minimum detection cost function (minDCF)
compared with the baseline configurations without adapted tokenisers. Analysis
of results showed that the ensemble tokenisers gave diverse representation of
phonemes, thus bringing complementary effects when SLR systems with different
tokenisers were combined. SLR performance was also shown to be related
to the quality of the adapted tokenisers.
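The minimum detection cost function (minDCF) cited above follows the standard NIST-style definition: DCF(θ) = C_miss·P_miss(θ)·P_tgt + C_fa·P_fa(θ)·(1 − P_tgt), minimized over the decision threshold θ and normalized by the cost of the better trivial system. The sketch below uses illustrative scores and default cost parameters, not the paper's evaluation settings.

```python
def min_dcf(target_scores, nontarget_scores, p_target=0.5, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all decision thresholds."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = float("inf")
    for th in thresholds:
        # Accept a trial when its score is at or above the threshold.
        p_miss = sum(s < th for s in target_scores) / len(target_scores)
        p_fa = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
        best = min(best, dcf)
    # Normalize by the better of "accept everything" / "reject everything".
    return best / min(c_miss * p_target, c_fa * (1 - p_target))

# Toy scores: higher means more target-like.
print(min_dcf([2.0, 3.0, 0.5], [0.0, 1.0, -1.0]))
```

A relative reduction such as the 7-18% reported above would be computed between the minDCF values of the fused and baseline systems.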