7 research outputs found

    Silence Models in Weighted Finite-State Transducers

    We investigate the effects of different silence modelling strategies in Weighted Finite-State Transducers for Automatic Speech Recognition. We show that the choice of silence models, and the way they are included in the transducer, can have a significant effect on the size of the resulting transducer; we present a means to prevent particularly large silence overheads. Our conclusions include that context-free silence modelling fits well with transducer-based grammars, whereas modelling silence as a monophone and as a context incurs larger overheads.
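
    As a toy illustration of the size trade-off described above (hypothetical data structures, not the paper's code), the following Python sketch counts the arcs each strategy adds to a small word-loop transducer: a shared, context-free silence loop costs one arc regardless of vocabulary size, while silence that participates in context duplicates every word arc.

        # Represent a transducer as a list of (src, dst, input, output) arcs
        # over a toy one-state word-loop grammar. Labels are illustrative.
        words = ["hello", "world"]
        arcs = [(0, 0, w, w) for w in words]

        # Strategy 1: context-free silence -- a single optional self-loop
        # at the word-boundary state, shared by every word.
        context_free = arcs + [(0, 0, "sil", "<eps>")]

        # Strategy 2: silence modelled in context -- each word arc gains a
        # silence-suffixed variant, so overhead grows with the vocabulary.
        in_context = arcs + [(0, 0, w + "+sil", w) for w in words]

        print(len(context_free) - len(arcs))  # 1 extra arc
        print(len(in_context) - len(arcs))    # len(words) extra arcs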

    To Separate Speech! A System for Recognizing Simultaneous Speech

    The PASCAL Speech Separation Challenge (SSC) is based on a corpus of sentences from the Wall Street Journal task read by two speakers simultaneously and captured with two circular eight-channel microphone arrays. This work describes our system for the recognition of such simultaneous speech. Our system has four principal components: a person tracker returns the locations of both active speakers, as well as segmentation information for each utterance, which are often of unequal length; two beamformers in generalized sidelobe canceller (GSC) configuration separate the simultaneous speech by setting their active weight vectors according to a minimum mutual information (MMI) criterion; a postfilter and binary mask operating on the outputs of the beamformers further enhance the separated speech; and finally an automatic speech recognition (ASR) engine based on a weighted finite-state transducer (WFST) returns the most likely word hypotheses for the separated streams. In addition to optimizing each of these components, we investigated the effect of the filter bank design used to perform subband analysis and synthesis during beamforming. On the SSC development data, our system achieved a word error rate of 39.6%.
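
    The generalized sidelobe canceller structure used by the beamformers can be sketched for a single subband snapshot as below; the uniform steering vector, SVD-based blocking matrix, and zero-initialised active weights are illustrative assumptions, and the minimum mutual information criterion that actually sets the active weights in this system is not reproduced here.

        import numpy as np

        n_mics = 8                                   # one circular eight-channel array
        x = np.random.randn(n_mics) + 1j * np.random.randn(n_mics)  # subband snapshot

        d = np.ones(n_mics) / np.sqrt(n_mics)        # assumed steering vector
        w_q = d / (d.conj() @ d)                     # quiescent (delay-and-sum) weights

        # Blocking matrix: columns spanning the null space of d^H, so the
        # desired signal cannot leak into the adaptive branch.
        B = np.linalg.svd(d.reshape(1, -1).conj(), full_matrices=True)[2][1:].conj().T

        w_a = np.zeros(n_mics - 1, dtype=complex)    # active weights (MMI-adapted in the paper)
        y = w_q.conj() @ x - w_a.conj() @ (B.conj().T @ x)  # GSC output for this bin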

    A Weighted Finite State Transducer tutorial

    The concepts of WFSTs are summarised, including structural and stochastic optimisations. A typical composition process for ASR is described. Some experiments show that care should be taken with silence models.
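
    The composition process referred to above typically combines the HMM topology (H), context-dependency (C), lexicon (L) and grammar (G) transducers inside-out; the sketch below uses hypothetical compose/det/mini stand-ins for the corresponding WFST operations (OpenFst's compose, determinize and minimize play these roles in practice), and the exact placement of the optimisations varies between recipes.

        def build_network(H, C, L, G, compose, det, mini):
            # compose, det and mini are hypothetical stand-ins for WFST
            # composition, determinization and minimization.
            LG = det(compose(L, G))      # pronunciations composed with the grammar
            CLG = det(compose(C, LG))    # expand to context-dependent phones
            HCLG = compose(H, CLG)       # expand to HMM states
            return mini(HCLG)            # final optimised recognition network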

    The Juicer LVCSR Decoder - User Manual for Juicer version 0.5.0

    Juicer is a decoder for HMM-based large vocabulary speech recognition that uses a weighted finite state transducer (WFST) representation of the search space. The package consists of a number of command line utilities: the Juicer decoder itself, along with a number of tools and scripts that are used to combine the various ASR knowledge sources (language model, pronunciation dictionary, acoustic models) into a single, optimised WFST that is input to the decoder.
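
    Once the knowledge sources are compiled into a single WFST, the decoder's job is essentially a Viterbi search over that network; the heavily simplified token-passing sketch below (not Juicer's implementation) illustrates the idea, omitting the epsilon-input arcs and beam pruning a real decoder must handle, and assuming a hypothetical acoustic_score callback.

        def decode(arcs, final_states, frames, acoustic_score):
            # arcs: dict mapping state -> [(next_state, in_label, out_label, weight)]
            # acoustic_score(frame, in_label): hypothetical acoustic model callback
            tokens = {0: (0.0, [])}                  # assume state 0 is the start state
            for frame in frames:
                new = {}
                for s, (cost, words) in tokens.items():
                    for ns, ilabel, olabel, w in arcs.get(s, []):
                        c = cost + w + acoustic_score(frame, ilabel)
                        out = words + [olabel] if olabel != "<eps>" else words
                        if ns not in new or c < new[ns][0]:
                            new[ns] = (c, out)       # keep the best token per state
                tokens = new
            best = min((s for s in tokens if s in final_states),
                       key=lambda s: tokens[s][0])
            return tokens[best][1]                   # most likely word hypothesis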

    Learning commonsense human-language descriptions from temporal and spatial sensor-network data

    Thesis (S.M.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2006. Includes bibliographical references (p. 105-109) and index.
    Embedded-sensor platforms are advancing toward such sophistication that they can differentiate between subtle actions. For example, when placed in a wristwatch, such platforms can tell whether a person is shaking hands or turning a doorknob. Sensors placed on objects in the environment now report many parameters, including object location, movement, sound, and temperature. A persistent problem, however, is describing these sense data in meaningful human language. This is an important problem that appears across domains ranging from organizational security surveillance to individual activity journaling. Previous models of activity recognition pigeonhole descriptions into small, formal categories specified in advance; for example, location is often categorized as "at home" or "at the office." These models have not been able to adapt to the wider range of complex, dynamic, and idiosyncratic human activities. We hypothesize that commonsense, semantically related knowledge bases can be used to bootstrap learning algorithms for classifying and recognizing human activities from sensors. Our system, LifeNet, is a first-person commonsense inference model consisting of a graph whose nodes are drawn from a large repository of commonsense assertions expressed in human-language phrases. LifeNet is used to construct a mapping between streams of sensor data and partially ordered sequences of events, co-located in time and space. Further, by gathering sensor data in vivo, we are able to validate and extend the commonsense knowledge from which LifeNet is derived. LifeNet is evaluated in the context of its performance on a sensor-network platform distributed in an office environment. We hypothesize that mapping sensor data into LifeNet will act as a "semantic mirror" that meaningfully interprets sensory data into cohesive patterns in order to understand and predict human action.
    by Bo Morgan. S.M.
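
    As a toy illustration of the sensor-to-language mapping described above (not LifeNet's actual inference model), the sketch below spreads activation from sensor-triggered events to linked commonsense phrases and returns the best-supported description; the phrases and links are invented for the example.

        # Invented commonsense links: observed event -> plausible descriptions.
        edges = {
            "hand is moving": ["person is shaking hands", "person is turning a doorknob"],
            "door is nearby": ["person is turning a doorknob"],
        }

        def describe(sensor_events):
            # Count how much support each candidate phrase receives from
            # the observed events and return the best-supported one.
            scores = {}
            for event in sensor_events:
                for phrase in edges.get(event, []):
                    scores[phrase] = scores.get(phrase, 0) + 1
            return max(scores, key=scores.get) if scores else None

        print(describe(["hand is moving", "door is nearby"]))
        # -> "person is turning a doorknob"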