249 research outputs found

    A K-nearest neighbours approach to unsupervised spoken term discovery

    Unsupervised spoken term discovery is the task of finding recurrent acoustic patterns in speech without any annotations. Current approaches consist of two steps: (1) discovering similar patterns in speech, and (2) partitioning those pairs of acoustic tokens using graph clustering methods. We propose a new approach for the first step. Previous systems used various approximation algorithms to make the search tractable on large amounts of data. Our approach is based on an optimized k-nearest neighbours (KNN) search coupled with a fixed word embedding algorithm. The results show that the KNN algorithm is robust across languages, consistently outperforms the DTW-based baseline, and is competitive with current state-of-the-art spoken term discovery systems.
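
To make the first step concrete, here is a minimal sketch of a nearest-neighbour search over fixed-dimensional segment embeddings. It is a brute-force illustration, not the optimized search the abstract refers to; the function names `cosine` and `knn`, the choice of cosine similarity, and the toy 2-D vectors are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def knn(query, embeddings, k):
    """Return the indices of the k embeddings most similar to the query.

    Brute force: every embedding is scored, then the top k are kept.
    """
    ranked = sorted(
        range(len(embeddings)),
        key=lambda i: cosine(query, embeddings[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings of four speech segments (toy values).
segments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
neighbours = knn([1.0, 0.1], segments, k=2)
```

Because each segment is reduced to one vector, a comparison costs a single dot product rather than a frame-by-frame alignment, which is what makes a nearest-neighbour search attractive at scale.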

    Unsupervised speech processing with applications to query-by-example spoken term detection

    Thesis (Ph.D.) by Yaodong Zhang, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013. Cataloged from PDF version of thesis. Includes bibliographical references (p. 163-173). This thesis is motivated by the challenge of searching and extracting useful information from speech data in a completely unsupervised setting. In many real-world speech processing problems, obtaining annotated data is neither cost- nor time-effective. We therefore ask how much we can learn from speech data without any transcription. To address this question, we choose query-by-example spoken term detection as a specific scenario to demonstrate that this task can be done in the unsupervised setting without any annotations. To build the unsupervised spoken term detection framework, we contribute three main techniques that together form a complete workflow. First, we present two posteriorgram-based speech representations which enable speaker-independent and noisy spoken term matching. The feasibility and effectiveness of both posteriorgram features are demonstrated through a set of spoken term detection experiments on different datasets. Second, we show two lower-bounding methods for Dynamic Time Warping (DTW) based pattern matching algorithms. Both algorithms greatly outperform conventional DTW in a single-threaded computing environment. Third, we describe the parallel implementation of the lower-bounded DTW search algorithm. Experimental results indicate that the total running time of the entire spoken term detection system grows linearly with corpus size. We also present the training of large Deep Belief Networks (DBNs) on Graphics Processing Units (GPUs). The phonetic classification experiment on the TIMIT corpus showed a speed-up of 36x for pre-training and 45x for back-propagation for a two-layer DBN trained on the GPU platform compared to the CPU platform.

    Neural approaches to spoken content embedding

    Comparing spoken segments is a central operation in speech processing. Traditional approaches in this area have favored frame-level dynamic programming algorithms, such as dynamic time warping, because they require no supervision, but they are limited in performance and efficiency. As an alternative, acoustic word embeddings (fixed-dimensional vector representations of variable-length spoken word segments) have begun to be considered for such tasks as well. However, the current space of such discriminative embedding models, training approaches, and their application to real-world downstream tasks is limited. We start by considering "single-view" training losses where the goal is to learn an acoustic word embedding model that separates same-word and different-word spoken segment pairs. Then, we consider "multi-view" contrastive losses. In this setting, acoustic word embeddings are learned jointly with embeddings of character sequences to generate acoustically grounded embeddings of written words, or acoustically grounded word embeddings. In this thesis, we contribute new discriminative acoustic word embedding (AWE) and acoustically grounded word embedding (AGWE) approaches based on recurrent neural networks (RNNs). We improve model training in terms of both efficiency and performance. We take these developments beyond English to several low-resource languages and show that multilingual training improves performance when labeled data is limited. We apply our embedding models, both monolingual and multilingual, to the downstream tasks of query-by-example speech search and automatic speech recognition. Finally, we show how our embedding approaches compare with and complement more recent self-supervised speech models. Comment: PhD thesis.

    Unsupervised modeling of latent topics and lexical units in speech audio

    Thesis (S.M.) by David F. Harwath, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013. Cataloged from PDF version of thesis. Includes bibliographical references (p. 67-70). Zero-resource speech processing involves the automatic analysis of a collection of speech data in a completely unsupervised fashion without the benefit of any transcriptions or annotations of the data. In this thesis, we describe a zero-resource framework that automatically discovers important words, phrases and topical themes present in an audio corpus. This system employs a segmental dynamic time warping (S-DTW) algorithm for acoustic pattern discovery in conjunction with a probabilistic model which treats the topic and pseudo-word identity of each discovered pattern as hidden variables. By applying an Expectation-Maximization (EM) algorithm, our method estimates the latent probability distributions over the pseudo-words and topics associated with the discovered patterns. Using this information, we produce informative acoustic summaries of the dominant topical themes of the audio document collection.

    Parsimonious Higher-Order Hidden Markov Models for Improved Array-CGH Analysis with Applications to Arabidopsis thaliana

    Array-based comparative genomic hybridization (Array-CGH) is an important technology in molecular biology for the detection of DNA copy number polymorphisms between closely related genomes. Hidden Markov Models (HMMs) are popular tools for the analysis of Array-CGH data, but current methods are based only on first-order HMMs, which have limited ability to model spatial dependencies between measurements of closely adjacent chromosomal regions. Here, we develop parsimonious higher-order HMMs enabling the interpolation between a mixture model ignoring spatial dependencies and a higher-order HMM exhaustively modeling spatial dependencies. We apply parsimonious higher-order HMMs to the analysis of Array-CGH data of the accessions C24 and Col-0 of the model plant Arabidopsis thaliana. We compare these models against first-order HMMs and other existing methods using a reference of known deletions and sequence deviations. We find that parsimonious higher-order HMMs clearly improve the identification of these polymorphisms. Moreover, we perform a functional analysis of identified polymorphisms revealing novel details of genomic differences between C24 and Col-0. Additional model evaluations are done on widely considered Array-CGH data of human cell lines, indicating that parsimonious HMMs are also well-suited for the analysis of non-plant-specific data. All these results indicate that parsimonious higher-order HMMs are useful for Array-CGH analyses. An implementation of parsimonious higher-order HMMs is available as part of the open source Java library Jstacs (www.jstacs.de/index.php/PHHMM).

    Unsupervised neural and Bayesian models for zero-resource speech processing

    Zero-resource speech processing is a growing research area which aims to develop methods that can discover linguistic structure and representations directly from unlabelled speech audio. Such unsupervised methods would allow speech technology to be developed in settings where transcriptions, pronunciation dictionaries, and text for language modelling are not available. Similar methods are required for cognitive models of language acquisition in human infants, and for developing robotic applications that are able to automatically learn language in a novel linguistic environment. There are two central problems in zero-resource speech processing: (i) finding frame-level feature representations which make it easier to discriminate between linguistic units (phones or words), and (ii) segmenting and clustering unlabelled speech into meaningful units. The claim of this thesis is that both top-down modelling (using knowledge of higher-level units to learn, discover and gain insight into their lower-level constituents) as well as bottom-up modelling (piecing together lower-level features to give rise to more complex higher-level structures) are advantageous in tackling these two problems. The thesis is divided into three parts. The first part introduces a new autoencoder-like deep neural network for unsupervised frame-level representation learning. This correspondence autoencoder (cAE) uses weak top-down supervision from an unsupervised term discovery system that identifies noisy word-like terms in unlabelled speech data. In an intrinsic evaluation of frame-level representations, the cAE outperforms several state-of-the-art bottom-up and top-down approaches, achieving a relative improvement of more than 60% over the previous best system. This shows that the cAE is particularly effective in using top-down knowledge of longer-spanning patterns in the data; at the same time, we find that the cAE is only able to learn useful representations when it is initialized using bottom-up pretraining on a large set of unlabelled speech. The second part of the thesis presents a novel unsupervised segmental Bayesian model that segments unlabelled speech data and clusters the segments into hypothesized word groupings. The result is a complete unsupervised tokenization of the input speech in terms of discovered word types; the system essentially performs unsupervised speech recognition. In this approach, a potential word segment (of arbitrary length) is embedded in a fixed-dimensional vector space. The model, implemented as a Gibbs sampler, then builds a whole-word acoustic model in this embedding space while jointly performing segmentation. We first evaluate the approach in a small-vocabulary multi-speaker connected digit recognition task, where we report unsupervised word error rates (WER) by mapping the unsupervised decoded output to ground truth transcriptions. The model achieves around 20% WER, outperforming a previous HMM-based system by about 10% absolute. To achieve this performance, the acoustic word embedding function (which maps variable-duration segments to single vectors) is refined in a top-down manner by using terms discovered by the model in an outer loop of segmentation. The third and final part of the study extends the small-vocabulary system in order to handle larger vocabularies in conversational speech data. To our knowledge, this is the first full-coverage segmentation and clustering system that is applied to large-vocabulary multi-speaker data. To improve efficiency, the system incorporates a bottom-up syllable boundary detection method to eliminate unlikely word boundaries. We compare the system on English and Xitsonga datasets to several state-of-the-art baselines. We show that by imposing a consistent top-down segmentation while also using bottom-up knowledge from detected syllable boundaries, both single-speaker and multi-speaker versions of our system outperform a purely bottom-up single-speaker syllable-based approach. We also show that the discovered clusters can be made less speaker- and gender-specific by using features from the cAE (which incorporates both top-down and bottom-up learning). The system's discovered clusters are still less pure than those of two multi-speaker unsupervised term discovery systems, but provide far greater coverage. In summary, the different models and systems presented in this thesis show that both top-down and bottom-up modelling can improve representation learning, segmentation and clustering of unlabelled speech data.

    Text mining with the WEBSOM

    The emerging field of text mining applies methods from data mining and exploratory data analysis to analyzing text collections and to conveying information to the user in an intuitive manner. Visual, map-like displays provide a powerful and fast medium for portraying information about large collections of text. Relationships between text items and collections, such as similarity, clusters, gaps and outliers, can be communicated naturally using spatial relationships, shading, and colors. In the WEBSOM method the self-organizing map (SOM) algorithm is used to automatically organize very large and high-dimensional collections of text documents onto two-dimensional map displays. The map forms a document landscape where similar documents appear close to each other at points of the regular map grid. The landscape can be labeled with automatically identified descriptive words that convey properties of each area and also act as landmarks during exploration. With the help of an HTML-based interactive tool the ordered landscape can be used in browsing the document collection and in performing searches on the map. An organized map offers an overview of an unknown document collection, helping the user in familiarizing herself with the domain. Map displays that are already familiar can be used as visual frames of reference for conveying properties of unknown text items. Static, thematically arranged document landscapes provide meaningful backgrounds for dynamic visualizations of, for example, time-related properties of the data. Search results can be visualized in the context of related documents. Experiments on document collections of various sizes, text types, and languages show that the WEBSOM method is scalable and generally applicable. Preliminary results in a text retrieval experiment indicate that even when the additional value provided by the visualization is disregarded, the document maps perform at least comparably with more conventional retrieval methods.

    Representation learning for unsupervised speech processing

    Automatic speech recognition for our most widely used languages has recently seen substantial improvements, driven by improved training procedures for deep artificial neural networks, cost-effective availability of computational power at large scale, and, crucially, availability of large quantities of labelled training data. This success cannot be transferred to low- and zero-resource languages where the requisite transcriptions are unavailable. Unsupervised speech processing promises better methods for dealing with under-resourced languages. Here we investigate unsupervised neural network based models for learning frame- and sequence-level representations with the goal of improving zero-resource speech processing. Good representations eliminate differences in accent, gender, channel characteristics, and other factors to model subword or whole-term units for within- and across-speaker speech unit discrimination. We present two contributions focussing on unsupervised learning of frame-level representations: (1) an improved version of the correspondence autoencoder applied to the INTERSPEECH 2015 Zero Resource Challenge, and (2) a proposed model for learning representations that explicitly optimize speech unit discrimination. We also present two contributions focussing on efficiency and scalability of unsupervised speech processing: (1) a proposed model and pilot experiments for learning a linear-time approximation of the quadratic-time dynamic time warping algorithm, and (2) a series of model proposals for learning fixed-size representations of variable-length speech segments enabling efficient vector space similarity measures.
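
For readers unfamiliar with why dynamic time warping is called quadratic-time, the following is a minimal sketch of the textbook algorithm. It is not code from the thesis; the function name `dtw_distance` is hypothetical, and the use of scalar frames with an absolute-difference cost is a simplifying assumption (real speech frames are feature vectors compared with a vector distance).

```python
import math


def dtw_distance(a, b):
    """Textbook DTW cost between two sequences of numbers.

    Fills an (len(a) + 1) x (len(b) + 1) table, so both time and memory
    grow with the product of the two sequence lengths.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the best alignment cost of a[:i] against b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_cost = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = frame_cost + min(
                cost[i - 1][j],      # a advances, b repeats its frame
                cost[i][j - 1],      # b advances, a repeats its frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

The nested loops visit every pair of frames once, which is the cost that fixed-size segment representations avoid by replacing the alignment with a single vector comparison.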

    Morphologically motivated word classes for very large vocabulary speech recognition of Finnish and Estonian

    We study class-based n-gram and neural network language models for very large vocabulary speech recognition of two morphologically rich languages: Finnish and Estonian. Due to morphological processes such as derivation, inflection and compounding, the models need to be trained with vocabulary sizes of several millions of word types. Class-based language modelling is in this case a powerful approach to alleviate the data sparsity and reduce the computational load. For a very large vocabulary, bigram statistics may not be an optimal way to derive the classes. We thus study utilizing the output of a morphological analyzer to achieve efficient word classes. We show that efficient classes can be learned by refining the morphological classes to smaller equivalence classes using merging, splitting and exchange procedures with suitable constraints. This type of classification can improve the results, particularly when language model training data is not very large. We also extend the previous analyses by rescoring the hypotheses obtained from a very large vocabulary recognizer using class-based neural network language models. We show that despite the fixed vocabulary, carefully constructed classes for word-based language models can in some cases result in lower error rates than subword-based unlimited vocabulary language models. Peer reviewed.
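
As an illustration of how a class-based model reduces data sparsity, here is a minimal sketch of the standard class bigram factorization, P(w | w_prev) = P(c | c_prev) * P(w | c), estimated by unsmoothed relative frequencies. It is not the paper's system: the function name `train_class_bigram`, the hand-written word-to-class map, and the two toy Finnish sentences are assumptions for this example, and a real model would add smoothing and derive classes automatically.

```python
from collections import Counter


def train_class_bigram(sentences, word2class):
    """Return a function prob(word, prev_word) from a class bigram model."""
    class_bigrams = Counter()   # (previous class, class) transition counts
    history_counts = Counter()  # how often each class occurs as a history
    word_counts = Counter()     # token count of each word
    class_totals = Counter()    # token count of each class

    for sentence in sentences:
        classes = [word2class[w] for w in sentence]
        for word, cls in zip(sentence, classes):
            word_counts[word] += 1
            class_totals[cls] += 1
        for prev_cls, cls in zip(classes, classes[1:]):
            class_bigrams[(prev_cls, cls)] += 1
            history_counts[prev_cls] += 1

    def prob(word, prev_word):
        cls, prev_cls = word2class[word], word2class[prev_word]
        if history_counts[prev_cls] == 0 or class_totals[cls] == 0:
            return 0.0  # unsmoothed: unseen events get zero probability
        p_class = class_bigrams[(prev_cls, cls)] / history_counts[prev_cls]
        p_word = word_counts[word] / class_totals[cls]
        return p_class * p_word

    return prob


# Toy data: two inflected forms of one noun share the class "N".
word2class = {"talo": "N", "talossa": "N", "on": "V"}
prob = train_class_bigram([["talo", "on"], ["talossa", "on"]], word2class)
p_seen = prob("on", "talo")
p_reverse = prob("talo", "on")
```

Transition statistics are pooled over all words in a class, so an inflected form seen only once still benefits from the counts of every other form that shares its class.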