Single Speaker Segmentation and Inventory Selection Using Dynamic Time Warping Self Organization and Joint Multigram Mapping
In speech synthesis, the inventory of units is usually decided by inspection, on the basis of phonological and phonetic expertise. The ephone (or emergent phone) project at CSTR is investigating how self-organisation techniques can be applied to build an inventory from collected acoustic data together with the constraints of a synthesis lexicon. In this paper we describe a prototype inventory creation method that uses dynamic time warping (DTW) for acoustic clustering and a joint multigram approach for relating a series of symbols that represent the speech to these emerged units. We initially examined two symbol sets: 1) a baseline of standard phones; 2) orthographic symbols. The success of the approach is evaluated by comparing word boundaries generated by the emergent phones against those created using state-of-the-art HMM segmentation. Initial results suggest the DTW segmentation can match word boundaries with a root mean square error (RMSE) of 35 ms. Mapping units onto phones resulted in a higher RMSE of 103 ms. This error increased when multiple multigram types were added and when the default number of unit clusters was reduced from 40 (our baseline) to 10. Orthographic matching had a still higher RMSE of 125 ms. To conclude, we discuss future work that we believe can reduce this error to a level sufficient for the techniques to be applied in a unit selection synthesis system.
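The acoustic comparison at the heart of the clustering step can be sketched as a standard dynamic time warping distance. This is a minimal illustration, not the project's implementation: the scalar frame values and absolute-difference local cost stand in for the real spectral feature vectors and distance measure.

```python
# Minimal dynamic time warping (DTW) distance between two frame
# sequences. Frames are plain floats here for illustration; real
# acoustic clustering would use spectral feature vectors.

def dtw_distance(a, b):
    """Return the DTW alignment cost between sequences a and b."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

Because DTW lets one frame align to several frames of the other sequence, two utterances of the same unit spoken at different rates can still receive a low distance, which is what makes it usable as a clustering metric.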
Statistical language modeling based on variable-length sequences
In natural language, and especially in spontaneous speech, people often group words to form phrases which become usual expressions. This is due to phonological reasons (to make pronunciation easier) or to semantic ones (to remember a phrase more easily by assigning a meaning to a block of words). Classical language models do not adequately take such phrases into account. A better approach consists in modeling some word sequences as if they were individual dictionary elements. Sequences are considered as additional entries of the vocabulary, on which language models are computed. In this paper, we present a method for automatically retrieving the most relevant phrases from a corpus of written sentences. The originality of our approach resides in the fact that the extracted phrases are obtained from a linguistically tagged corpus; the obtained phrases are therefore linguistically viable. To measure the contribution of classes to retrieving phrases, we implemented the same algorithm without using classes; the class-based method outperformed the class-free method by 11%. Our approach uses information-theoretic criteria which ensure a high statistical consistency and make the decision of selecting a potential sequence optimal with respect to the language perplexity. We propose several variants of language models with and without word sequences. Among them, we present a model in which the trigger pairs are linguistically more significant. We show that the use of sequences decreases the word error rate and improves the normalized perplexity. For instance, the best sequence model improves the perplexity by 16% and the accuracy of our dictation system (MAUD) by approximately 14%. Experiments, in terms of perplexity and recognition rate, were carried out on a vocabulary of 20,000 words extracted from a corpus of 43 million words made up of two years of the French newspaper Le Monde. The acoustic model (HMM) is trained on the Bref80 corpus.
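The phrase-selection idea can be illustrated with a simple pointwise mutual information (PMI) criterion over bigrams: word pairs that co-occur far more often than chance are promoted to single lexicon entries. This is only a sketch of the information-theoretic step; the toy corpus and threshold are invented, and the grammatical classes and perplexity-based validation described above are omitted.

```python
# Toy phrase selection: promote bigrams with high pointwise mutual
# information (PMI) to candidate lexicon entries. Corpus and
# threshold are illustrative only.
import math
from collections import Counter

def select_phrases(corpus, pmi_threshold=1.0):
    """Return bigrams whose PMI exceeds the threshold.

    corpus: list of sentences, each a list of word tokens.
    """
    unigrams = Counter(w for sent in corpus for w in sent)
    bigrams = Counter(
        (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
    )
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    phrases = []
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi                    # joint bigram probability
        p_x = unigrams[w1] / n_uni             # marginal probabilities
        p_y = unigrams[w2] / n_uni
        pmi = math.log(p_xy / (p_x * p_y))
        if pmi > pmi_threshold:
            phrases.append((w1, w2))
    return phrases
```

Selected pairs would then be rewritten as single tokens (e.g. `new_york`) before the n-gram model is estimated, so the model assigns them a probability directly instead of underestimating them as a product of independent word probabilities.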
The Unsupervised Acquisition of a Lexicon from Continuous Speech
We present an unsupervised learning algorithm that acquires a
natural-language lexicon from raw speech. The algorithm is based on the optimal
encoding of symbol sequences in an MDL framework, and uses a hierarchical
representation of language that overcomes many of the problems that have
stymied previous grammar-induction procedures. The forward mapping from symbol
sequences to the speech stream is modeled using features based on articulatory
gestures. We present results on the acquisition of lexicons and language models
from raw speech, text, and phonetic transcripts, and demonstrate that our
algorithm compares very favorably to other reported results with respect to
segmentation performance and statistical efficiency. (Comment: 27-page technical report.)
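The MDL objective can be illustrated with a toy dynamic program: given a fixed candidate lexicon with a probability (hence a code length in bits) per word, find the segmentation of an unsegmented symbol string that minimizes total description length. The lexicon and probabilities below are invented, and the full algorithm also searches over lexicons rather than taking one as given.

```python
# Toy MDL-style segmentation: choose the split of an unsegmented
# string that minimizes total code length -sum(log2 p(word)).
# The lexicon here is a hypothetical example, not learned.
import math

def best_segmentation(text, lexicon):
    """lexicon maps word -> probability; return (bits, word list)."""
    INF = float("inf")
    n = len(text)
    max_len = max(map(len, lexicon))
    # best[i] = (bits, start index of last word, last word) for text[:i]
    best = [(INF, None, None) for _ in range(n + 1)]
    best[0] = (0.0, None, None)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            w = text[j:i]
            if w in lexicon and best[j][0] != INF:
                bits = best[j][0] - math.log2(lexicon[w])
                if bits < best[i][0]:
                    best[i] = (bits, j, w)
    if best[n][0] == INF:
        return INF, []  # no segmentation possible with this lexicon
    words, i = [], n
    while i > 0:          # backtrace the optimal split
        _, j, w = best[i]
        words.append(w)
        i = j
    return best[n][0], list(reversed(words))
```

In the full MDL setting the lexicon itself also costs bits to encode, so the learner trades off lexicon size against how compactly the lexicon lets it encode the data; this sketch shows only the data-encoding half of that trade-off.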
Unsupervised Language Acquisition
This thesis presents a computational theory of unsupervised language
acquisition, precisely defining procedures for learning language from ordinary
spoken or written utterances, with no explicit help from a teacher. The theory
is based heavily on concepts borrowed from machine learning and statistical
estimation. In particular, learning takes place by fitting a stochastic,
generative model of language to the evidence. Much of the thesis is devoted to
explaining conditions that must hold for this general learning strategy to
arrive at linguistically desirable grammars. The thesis introduces a variety of
technical innovations, among them a common representation for evidence and
grammars, and a learning strategy that separates the ``content'' of linguistic
parameters from their representation. Algorithms based on it suffer from few of
the search problems that have plagued other computational approaches to
language acquisition.
The theory has been tested on problems of learning vocabularies and grammars
from unsegmented text and continuous speech, and mappings between sound and
representations of meaning. It performs extremely well on various objective
criteria, acquiring knowledge that causes it to assign almost exactly the same
structure to utterances as humans do. This work has application to data
compression, language modeling, speech recognition, machine translation,
information retrieval, and other tasks that rely on either structural or
stochastic descriptions of language. (Comment: PhD thesis, 133 pages.)
Combining Evidence from Unconstrained Spoken Term Frequency Estimation for Improved Speech Retrieval
This dissertation considers the problem of information retrieval in speech. Today's speech retrieval
systems generally use a large vocabulary continuous speech
recognition system to first hypothesize the words which were spoken.
Because these systems have a predefined lexicon, words which
fall outside of the lexicon can significantly reduce search quality---as measured
by Mean Average Precision (MAP). This is particularly important because these Out-Of-Vocabulary (OOV)
words are often rare and therefore good discriminators for topically relevant speech segments.
The focus of this dissertation is on handling these out-of-vocabulary query words. The approach
is to combine results from a word-based speech retrieval system with those from vocabulary-independent
ranked utterance retrieval. The goal of ranked utterance retrieval is to rank speech utterances
by the system's confidence that they contain a particular spoken word, which is accomplished by ranking
the utterances by the estimated frequency of the word in the utterance. Several
new approaches for estimating this frequency are considered, motivated by the disparity between
reference phoneme sequences and their errorful hypothesized counterparts. The first method learns alternate pronunciations or
degradations from actual recognition hypotheses and incorporates these variants into a new generative estimator for
term frequency. A second method learns transformations of several easily computed features in a discriminative
model for the same task. Both methods significantly improved ranked utterance retrieval in an experimental
validation on new speech.
The best of these ranked utterance retrieval methods is then combined with a word-based speech retrieval system. The combination
approach uses a normalization learned in an additive model, which maps the retrieval status values from each system into estimated probabilities
of relevance that are easily combined. Using this combination, much of the MAP lost because of OOV words is recovered. Evaluated on a
collection of spontaneous, conversational speech, the system recovers 57.5% of the MAP lost on short (title-only) queries and
41.3% on longer (title plus description) queries.
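The combination step can be sketched as follows: each system's retrieval status value (RSV) is mapped to an estimated probability of relevance with a logistic normalization, and the probabilities are then combined. The coefficients and mixing weight below are invented placeholders, not the parameters the dissertation learns in its additive model.

```python
# Sketch of evidence combination: per-system logistic normalization
# of retrieval status values (RSVs) into probabilities of relevance,
# then a weighted sum. All coefficients are illustrative; in practice
# they would be fit on training queries.
import math

def to_probability(rsv, a, b):
    """Logistic mapping from a raw retrieval score to P(relevant)."""
    return 1.0 / (1.0 + math.exp(-(a * rsv + b)))

def combined_score(rsv_word, rsv_rur, w=0.5):
    """Combine word-based and ranked-utterance-retrieval evidence."""
    p_word = to_probability(rsv_word, a=1.2, b=-0.5)  # hypothetical fit
    p_rur = to_probability(rsv_rur, a=0.8, b=-1.0)    # hypothetical fit
    return w * p_word + (1.0 - w) * p_rur
```

Normalizing both systems onto a common probability scale is what makes the scores comparable in the first place; raw RSVs from a word-based index and a phonetic ranker are otherwise on incompatible scales.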
Beyond the Conventional Statistical Language Models: The Variable-Length Sequences Approach
Peer-reviewed international conference paper; international audience. In natural language, several sequences of words are very frequent. A classical language model, like the n-gram, does not adequately take such sequences into account, because it underestimates their probabilities. A better approach consists in modelling word sequences as if they were individual dictionary elements. Sequences are considered as additional entries of the word lexicon, on which language models are computed. In this paper, we present an original method for automatically determining the most important phrases in corpora. This method is based on information-theoretic criteria, which ensure a high statistical consistency, and on French grammatical classes, which capture additional types of linguistic dependencies. In addition, the perplexity is used in order to make the decision of selecting a potential sequence more accurate. We also propose several variants of language models with and without word sequences. Among them, we present a model in which the trigger pairs are linguistically more significant. The originality of this model, compared with the commonly used trigger approaches, is the use of word sequences to estimate the trigger pairs without limiting itself to single words. Experimental tests, in terms of perplexity and recognition rate, are carried out on a vocabulary of 20,000 words and a corpus of 43 million words. The use of the word sequences proposed by our algorithm reduces perplexity by more than 16% compared to models limited to single words. The introduction of these word sequences in our dictation machine improves the accuracy by approximately 15%.
Redefining Concatenative Speech Synthesis for Use in Spontaneous Conversational Dialogues; A Study with the GBO Corpus
This chapter describes how a very large corpus of conversational speech is being tested as a source of units for concatenative speech synthesis. It shows that the challenge no longer lies in phone-sized unit selection, but in categorising larger units for their affective and pragmatic effect. The work is by nature exploratory, but much progress has been achieved, and we now have the beginnings of an understanding of the types of grammar and the ontology of vocal productions that will be required for the interactive synthesis of conversational speech. The chapter describes the processes involved and explains some of the features selected for optimal expressive speech rendering.