Search CORE

512 research outputs found

The Unsupervised Acquisition of a Lexicon from Continuous Speech

Author: de Marcken Carl
Publication venue
Publication date: 01/01/1995
Field of study

We present an unsupervised learning algorithm that acquires a natural-language lexicon from raw speech. The algorithm is based on the optimal encoding of symbol sequences in an MDL framework, and uses a hierarchical representation of language that overcomes many of the problems that have stymied previous grammar-induction procedures. The forward mapping from symbol sequences to the speech stream is modeled using features based on articulatory gestures. We present results on the acquisition of lexicons and language models from raw speech, text, and phonetic transcripts, and demonstrate that our algorithm compares very favorably to other reported results with respect to segmentation performance and statistical efficiency.Comment: 27 page technical repor

arXiv.org e-Print Archive

CiteSeerX

DSpace@MIT

Learning Physically Grounded Lexicons from Spoken Utterances

Author: Kotaro Funakoshi
Mikio Nakano
Naoto Iwahashi
Ryo Taguchi
Takashi Nose
Tsuneo Nitta
Publication venue: 'IntechOpen'
Publication date: 25/01/2012
Field of study

IntechOpen

Integrated Phoneme Subspace Method for Speech Feature Extraction

Author
Publication venue: Springer
Publication date
Field of study

Springer - Publisher Connector

Speaker segmentation and clustering

Author: Ajmera
Ajmera
Almpanidis
Barras
Bimbot
Campbell
Campbell
Cettolo
Constantine Kotropoulos
Delacourt
Deller
Fiscus
Gales
Garofolo
Godfrey
Graff
Graff
Graff
Hansen
Harb
Hess
Huang
Jain
Kim
Know
Lapidot
Lu
Manjunath
Margarita Kotti
Meignier
Oppenheim
Pellom
Reynolds
Sondhi
Tranter
Vassiliki Moschou
Ververidis
Wang
Wu
Wu
Zhou
Zhu
Publication venue: 'Elsevier BV'
Publication date: 01/01/2008
Field of study

This survey focuses on two challenging speech processing topics, namely: speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. Concerning speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, the algorithm advantages and disadvantages are indicated, insight to the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering. © 2007 Elsevier B.V. All rights reserved

CiteSeerX

Crossref

Spiral - Imperial College Digital Repository

Speech vocoding for laboratory phonology

Author: Benus Stefan
Cernak Milos
Lazaridis Alexandros
Publication venue: 'Elsevier BV'
Publication date: 19/05/2015
Field of study

Using phonological speech vocoding, we propose a platform for exploring relations between phonology and speech processing, and in broader terms, for exploring relations between the abstract and physical structures of a speech signal. Our goal is to make a step towards bridging phonology and speech processing and to contribute to the program of Laboratory Phonology. We show three application examples for laboratory phonology: compositional phonological speech modelling, a comparison of phonological systems and an experimental phonological parametric text-to-speech (TTS) system. The featural representations of the following three phonological systems are considered in this work: (i) Government Phonology (GP), (ii) the Sound Pattern of English (SPE), and (iii) the extended SPE (eSPE). Comparing GP- and eSPE-based vocoded speech, we conclude that the latter achieves slightly better results than the former. However, GP - the most compact phonological speech representation - performs comparably to the systems with a higher number of phonological features. The parametric TTS based on phonological speech representation, and trained from an unlabelled audiobook in an unsupervised manner, achieves intelligibility of 85% of the state-of-the-art parametric speech synthesis. We envision that the presented approach paves the way for researchers in both fields to form meaningful hypotheses that are explicitly testable using the concepts developed and exemplified in this paper. On the one hand, laboratory phonologists might test the applied concepts of their theoretical models, and on the other hand, the speech processing community may utilize the concepts developed for the theoretical phonological models for improvements of the current state-of-the-art applications

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Speech Synthesis Based on Hidden Markov Models

Author: Nankaku Y.
Oura K.
Toda T.
Tokuda K.
Yamagishi J.
Zen H.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/05/2013
Field of study

Edinburgh Research Explorer

Statistical Language Modeling for Automatic Speech Recognition of Agglutinative Languages

Author: Ebru Ar&#305
Ha&#351
Janne Pylkk&#246
Mikko Kurimo
Murat Sara&#231
Tanel Alum&#228
Teemu Hirsim&#228
Publication venue: 'IntechOpen'
Publication date: 01/11/2008
Field of study

IntechOpen

Integrating Articulatory Features into HMM-based Parametric Speech Synthesis

Author: Ling Zhenhua
Richmond Korin
Wang Ren-Hua
Yamagishi Junichi
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2009
Field of study

This paper presents an investigation of ways to integrate articulatory features into Hidden Markov Model (HMM)-based parametric speech synthesis, primarily with the aim of improving the performance of acoustic parameter generation. The joint distribution of acoustic and articulatory features is estimated during training and is then used for parameter generation at synthesis time in conjunction with a maximum-likelihood criterion. Different model structures are explored to allow the articulatory features to influence acoustic modeling: model clustering, state synchrony and cross-stream feature dependency. The results of objective evaluation show that the accuracy of acoustic parameter prediction can be improved when shared clustering and asynchronous-state model structures are adopted for combined acoustic and articulatory features. More significantly, our experiments demonstrate that modeling the dependency between these two feature streams can make speech synthesis more flexible. The characteristics of synthetic speech can be easily controlled by modifying generated articulatory features as part of the process of acoustic parameter generation

CiteSeerX

Crossref

Edinburgh Research Archive

Edinburgh Research Explorer

Recent development of the HMM-based speech synthesis system (HTS)

Author: Black Alan W
Masuko Takashi
Nose Takashi
Oura Keiichiro
Sako Shinji
Toda Tomoki
Tokuda Keiichi
Yamagishi Junichi
Zen Heiga
Publication venue
Publication date: 01/01/2009
Field of study

A statistical parametric approach to speech synthesis based on hidden Markov models (HMMs) has grown in popularity over the last few years. In this approach, spectrum, excitation, and duration of speech are simultaneously modeled by context-dependent HMMs, and speech waveforms are generate from the HMMs themselves. Since December 2002, we have publicly released an open-source software toolkit named “HMM-based speech synthesis system (HTS)” to provide a research and development toolkit for statistical parametric speech synthesis. This paper describes recent developments of HTS in detail, as well as future release plans

CiteSeerX

NAIST Academic Repository

Edinburgh Research Archive

Edinburgh Research Explorer

Hokkaido University Collection of Scholarly and Academic Papers

Automatic Emotion Recognition: Quantifying Dynamics and Structure in Human Behavior.

Author: Kim Yelin
Publication venue
Publication date: 01/01/2016
Field of study

Emotion is a central part of human interaction, one that has a huge influence on its overall tone and outcome. Today's human-centered interactive technology can greatly benefit from automatic emotion recognition, as the extracted affective information can be used to measure, transmit, and respond to user needs. However, developing such systems is challenging due to the complexity of emotional expressions and their dynamics in terms of the inherent multimodality between audio and visual expressions, as well as the mixed factors of modulation that arise when a person speaks. To overcome these challenges, this thesis presents data-driven approaches that can quantify the underlying dynamics in audio-visual affective behavior. The first set of studies lay the foundation and central motivation of this thesis. We discover that it is crucial to model complex non-linear interactions between audio and visual emotion expressions, and that dynamic emotion patterns can be used in emotion recognition. Next, the understanding of the complex characteristics of emotion from the first set of studies leads us to examine multiple sources of modulation in audio-visual affective behavior. Specifically, we focus on how speech modulates facial displays of emotion. We develop a framework that uses speech signals which alter the temporal dynamics of individual facial regions to temporally segment and classify facial displays of emotion. Finally, we present methods to discover regions of emotionally salient events in a given audio-visual data. We demonstrate that different modalities, such as the upper face, lower face, and speech, express emotion with different timings and time scales, varying for each emotion type. We further extend this idea into another aspect of human behavior: human action events in videos. We show how transition patterns between events can be used for automatically segmenting and classifying action events. Our experimental results on audio-visual datasets show that the proposed systems not only improve performance, but also provide descriptions of how affective behaviors change over time. We conclude this dissertation with the future directions that will innovate three main research topics: machine adaptation for personalized technology, human-human interaction assistant systems, and human-centered multimedia content analysis.PhDElectrical Engineering: SystemsUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/133459/1/yelinkim_1.pd

Deep Blue Documents at the University of Michigan