1,732 research outputs found
Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech
We describe a statistical approach for modeling dialogue acts in
conversational speech, i.e., speech-act-like units such as Statement, Question,
Backchannel, Agreement, Disagreement, and Apology. Our model detects and
predicts dialogue acts based on lexical, collocational, and prosodic cues, as
well as on the discourse coherence of the dialogue act sequence. The dialogue
model is based on treating the discourse structure of a conversation as a
hidden Markov model and the individual dialogue acts as observations emanating
from the model states. Constraints on the likely sequence of dialogue acts are
modeled via a dialogue act n-gram. The statistical dialogue grammar is combined
with word n-grams, decision trees, and neural networks modeling the
idiosyncratic lexical and prosodic manifestations of each dialogue act. We
develop a probabilistic integration of speech recognition with dialogue
modeling, to improve both speech recognition and dialogue act classification
accuracy. Models are trained and evaluated using a large hand-labeled database
of 1,155 conversations from the Switchboard corpus of spontaneous
human-to-human telephone speech. We achieved good dialogue act labeling
accuracy (65% based on errorful, automatically recognized words and prosody,
and 71% based on word transcripts, compared to a chance baseline accuracy of
35% and human accuracy of 84%) and a small reduction in word recognition error.
Comment: 35 pages, 5 figures. Changes in copy editing (note title spelling changed
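The decoding idea in the abstract above (a dialogue act n-gram combined with per-utterance evidence inside a hidden Markov model) can be illustrated with a minimal Viterbi sketch. This is not the authors' implementation: it assumes a bigram over dialogue acts, and the function name `viterbi_dialogue_acts`, the two-act label set, and all probabilities below are hypothetical values chosen for illustration.

```python
import math

def viterbi_dialogue_acts(obs_loglik, log_trans, log_init):
    """Return the most likely dialogue act sequence under a bigram HMM.

    obs_loglik: one dict per utterance, {act: log P(evidence | act)}
    log_trans:  {(prev_act, act): log P(act | prev_act)}  (assumed bigram)
    log_init:   {act: log P(act)} for the first utterance
    """
    acts = list(log_init)
    # best[a] = best log score of any act sequence ending in act a so far
    best = {a: log_init[a] + obs_loglik[0][a] for a in acts}
    backpointers = []
    for evidence in obs_loglik[1:]:
        scores, ptrs = {}, {}
        for a in acts:
            # Pick the predecessor act that maximizes the path score into a
            prev = max(acts, key=lambda p: best[p] + log_trans[(p, a)])
            scores[a] = best[prev] + log_trans[(prev, a)] + evidence[a]
            ptrs[a] = prev
        best = scores
        backpointers.append(ptrs)
    # Trace back from the best final act to recover the full sequence
    path = [max(acts, key=lambda a: best[a])]
    for ptrs in reversed(backpointers):
        path.append(ptrs[path[-1]])
    return path[::-1]

# Hypothetical toy numbers: the second utterance is ambiguous on its own,
# so the act bigram (a Question is usually followed by a Statement) decides.
log = math.log
log_init = {"Statement": log(0.6), "Question": log(0.4)}
log_trans = {
    ("Statement", "Statement"): log(0.5), ("Statement", "Question"): log(0.5),
    ("Question", "Statement"): log(0.9), ("Question", "Question"): log(0.1),
}
obs_loglik = [
    {"Statement": log(0.2), "Question": log(0.8)},
    {"Statement": log(0.5), "Question": log(0.5)},
]
decoded = viterbi_dialogue_acts(obs_loglik, log_trans, log_init)
print(decoded)
```

In this toy case the sequence model resolves an utterance whose local evidence is evenly split, which is the role the abstract assigns to the statistical dialogue grammar.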
FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis
Conversational Text-to-Speech (TTS) aims to synthesize an utterance with the
right linguistic and affective prosody in a conversational context. The
correlation between the current utterance and the dialogue history at the
utterance level was used to improve the expressiveness of synthesized speech.
However, the fine-grained information in the dialogue history at the word level
also has an important impact on the prosodic expression of an utterance, which
has not been well studied in prior work. Therefore, we propose a novel
expressive conversational TTS model, termed FCTalker, that learns the fine-
and coarse-grained context dependency at the same time during speech
generation. Specifically, FCTalker includes fine- and coarse-grained
encoders to exploit the word- and utterance-level context dependency. To model
the word-level dependencies between an utterance and its dialogue history, the
fine-grained dialogue encoder is built on top of a dialogue BERT model. The
experimental results show that the proposed method outperforms all baselines
and generates more expressive speech that is contextually appropriate. We
release the source code at: https://github.com/walker-hyf/FCTalker.
Comment: 5 pages, 4 figures, 1 table. Submitted to ICASSP 2023.
Continuous Interaction with a Virtual Human
Attentive Speaking and Active Listening require that a Virtual Human be capable of simultaneous perception/interpretation and production of communicative behavior. A Virtual Human should be able to signal its attitude and attention while it is listening to its interaction partner, and be able to attend to its interaction partner while it is speaking – and modify its communicative behavior on the fly based on what it perceives from its partner. This report presents the results of a four-week summer project that was part of eNTERFACE’10. The project resulted in progress on several aspects of continuous interaction, such as scheduling and interrupting multimodal behavior, automatic classification of listener responses, generation of response-eliciting behavior, and models for appropriate reactions to listener responses. A pilot user study was conducted with ten participants. In addition, the project yielded a number of deliverables that are released for public access.
Large Vocabulary Automatic Recognition of Croatian Speech (Automatsko raspoznavanje hrvatskoga govora velikoga vokabulara)
This paper presents the procedures used to develop a Croatian large vocabulary automatic speech recognition system (LVASR). The proposed acoustic model is based on context-dependent triphone hidden Markov models and Croatian phonetic rules. Different acoustic and language models, developed using a large collection of Croatian speech, are discussed and compared. The paper compares procedures for computing feature vectors and for acoustic modeling, and proposes those that achieve the lowest word error rates for Croatian speech. In addition, Croatian language modeling procedures are evaluated and adopted for speaker-independent spontaneous speech recognition. The presented experiments and results show that the proposed approach, which uses context-dependent acoustic modeling based on Croatian phonetic rules together with a parameter tying procedure, can be used for efficient Croatian large vocabulary speech recognition with word error rates below 5%.
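The abstract above reports its results as word error rate. As a reminder of how that figure is commonly computed, here is a minimal sketch assuming the standard definition: word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name `word_error_rate` and the example strings are hypothetical, and the paper's own scoring tool and text normalization are not stated in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("b" -> "x") and one deletion ("d") over four reference words
print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a rate below 5% means fewer than one word-level error per twenty reference words on average.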
Statistical parametric speech synthesis using conversational data and phenomena
Statistical parametric text-to-speech synthesis currently relies on predefined and highly
controlled prompts read in a “neutral” voice. This thesis presents work on utilising
recordings of free conversation for the purpose of filled pause synthesis and as an
inspiration for improved general modelling of speech for text-to-speech synthesis purposes.
A corpus of both standard prompts and free conversation is presented and the
potential usefulness of conversational speech as the basis for text-to-speech voices
is validated. Additionally, through psycholinguistic experimentation it is shown that
filled pauses can have potential subconscious benefits to the listener but that current
text-to-speech voices cannot replicate these effects. A method for pronunciation variant
forced alignment is presented in order to obtain a more accurate automatic speech
segmentation, something that is particularly poor for spontaneously produced speech.
This pronunciation variant alignment is utilised not only to create a more accurate underlying
acoustic model, but also as the driving force behind creating more natural
pronunciation prediction at synthesis time. While this improves both the standard and
spontaneous voices, the naturalness of spontaneous-speech-based voices still lags behind
the quality of voices based on standard read prompts. Thus, the synthesis of filled
pauses is investigated in relation to specific phonetic modelling of filled pauses and
through techniques for the mixing of standard prompts with spontaneous utterances in
order to retain the higher quality of standard speech based voices while still utilising
the spontaneous speech for filled pause modelling. A method for predicting where to
insert filled pauses in the speech stream is also developed and presented, relying on
an analysis of human filled pause usage and a mix of language modelling methods.
The method achieves an insertion accuracy in close agreement with human usage. The
various approaches are evaluated and their improvements documented throughout the
thesis. At the end, however, the resulting filled pause quality is assessed through a repetition
of the psycholinguistic experiments and an evaluation of the compilation of all
developed methods.