535 research outputs found
Topic Identification for Speech without ASR
Modern topic identification (topic ID) systems for speech use automatic
speech recognition (ASR) to produce speech transcripts, and perform supervised
classification on such ASR outputs. However, under resource-limited conditions,
the manually transcribed speech required to develop standard ASR systems can be
severely limited or unavailable. In this paper, we investigate alternative
unsupervised solutions to obtaining tokenizations of speech in terms of a
vocabulary of automatically discovered word-like or phoneme-like units, without
depending on the supervised training of ASR systems. Moreover, using automatic
phoneme-like tokenizations, we demonstrate that a convolutional neural network
based framework for learning spoken document representations provides
competitive performance compared to a standard bag-of-words representation, as
evidenced by comprehensive topic ID evaluations on both single-label and
multi-label classification tasks.Comment: 5 pages, 2 figures; accepted for publication at Interspeech 201
An Empirical Evaluation of Zero Resource Acoustic Unit Discovery
Acoustic unit discovery (AUD) is a process of automatically identifying a
categorical acoustic unit inventory from speech and producing corresponding
acoustic unit tokenizations. AUD provides an important avenue for unsupervised
acoustic model training in a zero resource setting where expert-provided
linguistic knowledge and transcribed speech are unavailable. Therefore, to
further facilitate zero-resource AUD process, in this paper, we demonstrate
acoustic feature representations can be significantly improved by (i)
performing linear discriminant analysis (LDA) in an unsupervised self-trained
fashion, and (ii) leveraging resources of other languages through building a
multilingual bottleneck (BN) feature extractor to give effective cross-lingual
generalization. Moreover, we perform comprehensive evaluations of AUD efficacy
on multiple downstream speech applications, and their correlated performance
suggests that AUD evaluations are feasible using different alternative language
resources when only a subset of these evaluation resources can be available in
typical zero resource applications.Comment: 5 pages, 1 figure; Accepted for publication at ICASSP 201
Experimental studies on effect of speaking mode on spoken term detection
The objective of this paper is to study the effect of speaking mode on spoken term detection (STD) system. The experiments are conducted with respect to query words recorded in isolated manner and words cut out from continuous speech. Durations of phonemes in query words greatly vary between these two modes. Hence pattern matching stage plays a crucial role which takes care of temporal variations. Matching is done using Subsequence dynamic time warping (DTW) on posterior features of query and reference utterances, obtained by training Multilayer perceptron (MLP). The difference in performance of the STD system for different phoneme groupings (45, 25, 15 and 6 classes) is also analyzed. Our STD system is tested on Telugu broadcast news. Major difference in STD system performance is observed for recorded and cut-out types of query words. It is observed that STD system performance is better with query words cut out from continuous speech compared to words recorded in isolated manner. This performance difference can be accounted for large temporal variations
Word hypothesis from undifferentiated, errorful phonetic strings
This thesis investigates a dynamic programming approach to word hypothesis in the context of a speaker independent, large vocabulary, continuous speech recognition system. Using a method known as Dynamic Time Warping, an undifferentiated phonetic string (one without word boundaries) is parsed to produce all possible words contained in a domain specific lexicon. Dynamic Time Warping is a common method of sequence comparison used in matching the acoustic feature vectors representing an unknown input utterance and some reference utterance. The cumulative least cost path, when compared with some threshold can be used as a decision criterion for recognition. This thesis attempts to extend the DTW technique using strings of phonetic symbols, instead. Three variables that were found to affect the parsing process include: (1) minimum distance threshold, (2) the number of word candidates accepted at any given phonetic index, and (3) the lexical search space used for reference pattern comparisons. The performance of this parser as a function of these variables is discussed. Also discussed is the performance of the parser at a variety of input error conditions
- …