AutoPrep: An Automatic Preprocessing Framework for In-the-Wild Speech Data
Recently, the utilization of extensive open-sourced text data has
significantly advanced the performance of text-based large language models
(LLMs). However, the use of in-the-wild large-scale speech data in the speech
technology community remains constrained. One reason for this limitation is
that a considerable amount of the publicly available speech data is compromised
by background noise, overlapping speech, lack of speech segmentation
information, missing speaker labels, and incomplete transcriptions, which can
greatly hinder its usefulness. On the other hand, human annotation of speech
data is both time-consuming and costly. To address this issue, we introduce an
automatic in-the-wild speech data preprocessing framework (AutoPrep) in this
paper, which is designed to enhance speech quality, generate speaker labels,
and produce transcriptions automatically. The proposed AutoPrep framework
comprises six components: speech enhancement, speech segmentation, speaker
clustering, target speech extraction, quality filtering and automatic speech
recognition. Experiments conducted on the open-sourced WenetSpeech and our
self-collected AutoPrepWild corpora demonstrate that the proposed AutoPrep
framework can generate preprocessed data with DNSMOS and PDNSMOS scores
similar to those of several open-sourced TTS datasets. The corresponding TTS system can
achieve up to 0.68 in-domain speaker similarity
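
The six components named in the abstract can be read as a sequential pipeline. The sketch below is only an illustration of that reading, not the authors' implementation: the stage order is assumed from the order in which the abstract lists the components, and the `Utterance` record, the `quality` score, the threshold value, and all function names are hypothetical. Every stage except quality filtering is a placeholder that only records that it ran.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Utterance:
    """Hypothetical record for one audio clip moving through the pipeline."""
    audio_id: str
    quality: float                      # assumed quality score, higher is better
    speaker: Optional[str] = None
    text: Optional[str] = None
    history: List[str] = field(default_factory=list)


Stage = Callable[[List[Utterance]], List[Utterance]]


def tag(name: str) -> Stage:
    """Placeholder stage: records its name and leaves the data unchanged."""
    def run(batch: List[Utterance]) -> List[Utterance]:
        for utt in batch:
            utt.history.append(name)
        return batch
    return run


def quality_filter(threshold: float) -> Stage:
    """Drop utterances whose (hypothetical) quality score is below threshold."""
    def run(batch: List[Utterance]) -> List[Utterance]:
        kept = [utt for utt in batch if utt.quality >= threshold]
        for utt in kept:
            utt.history.append("quality_filtering")
        return kept
    return run


def build_pipeline(threshold: float = 3.0) -> List[Stage]:
    """Stages in the order the abstract lists them (an assumption)."""
    return [
        tag("speech_enhancement"),
        tag("speech_segmentation"),
        tag("speaker_clustering"),
        tag("target_speech_extraction"),
        quality_filter(threshold),
        tag("automatic_speech_recognition"),
    ]


def run_pipeline(batch: List[Utterance], stages: List[Stage]) -> List[Utterance]:
    """Apply each stage in turn, feeding its output to the next stage."""
    for stage in stages:
        batch = stage(batch)
    return batch
```

Keeping each stage as a function from a batch to a batch means a real model could replace any placeholder without touching the others.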
Continuous Interaction with a Virtual Human
Attentive Speaking and Active Listening require that a Virtual Human be capable of simultaneous perception/interpretation and production of communicative behavior. A Virtual Human should be able to signal its attitude and attention while it is listening to its interaction partner, be able to attend to its interaction partner while it is speaking, and modify its communicative behavior on the fly based on what it perceives from its partner. This report presents the results of a four-week summer project that was part of eNTERFACE’10. The project resulted in progress on several aspects of continuous interaction, such as scheduling and interrupting multimodal behavior, automatic classification of listener responses, generation of response-eliciting behavior, and models for appropriate reactions to listener responses. A pilot user study was conducted with ten participants. In addition, the project yielded a number of deliverables that are released for public access.