The listening talker: A review of human and algorithmic context-induced modifications of speech
Speech output technology is finding widespread application, including in scenarios where intelligibility might be compromised - at least for some listeners - by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns as a response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. Consequently, the review provides a roadmap for future work in improving the robustness of speech output.
Prosody-Based Automatic Segmentation of Speech into Sentences and Topics
A crucial step in processing speech audio data for information extraction,
topic detection, or browsing/playback is to segment the input into sentence and
topic units. Speech segmentation is challenging, since the cues typically
present for segmenting text (headers, paragraphs, punctuation) are absent in
spoken language. We investigate the use of prosody (information gleaned from
the timing and melody of speech) for these tasks. Using decision tree and
hidden Markov modeling techniques, we combine prosodic cues with word-based
approaches, and evaluate performance on two speech corpora, Broadcast News and
Switchboard. Results show that the prosodic model alone performs on par with,
or better than, word-based statistical language models -- for both true and
automatically recognized words in news speech. The prosodic model achieves
comparable performance with significantly less training data, and requires no
hand-labeling of prosodic events. Across tasks and corpora, we obtain a
significant improvement over word-only models using a probabilistic combination
of prosodic and lexical information. Inspection reveals that the prosodic
models capture language-independent boundary indicators described in the
literature. Finally, cue usage is task and corpus dependent. For example, pause
and pitch features are highly informative for segmenting news speech, whereas
pause, duration and word-based cues dominate for natural conversation.
Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2),
Special Issue on Accessing Information in Spoken Audio, September 200
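The abstract above mentions a probabilistic combination of prosodic and lexical information but does not say how the two models are combined. As a rough illustration only, the sketch below uses log-linear interpolation of two boundary posteriors, which is one common scheme and not necessarily the authors' method. The function name `combine_boundary_posteriors` and the `weight` parameter are hypothetical.

```python
import math


def combine_boundary_posteriors(p_prosody, p_lexical, weight=0.5):
    """Log-linearly interpolate two estimates of P(boundary) at one word gap.

    ASSUMPTION: this interpolation scheme is illustrative; the abstract does
    not specify how prosodic and lexical scores are combined.
    `weight` is the share given to the prosodic model (0.0 to 1.0).
    """
    eps = 1e-12  # keep probabilities away from 0 and 1 so log() is defined

    def clip(p):
        return min(max(p, eps), 1.0 - eps)

    pp, pl = clip(p_prosody), clip(p_lexical)

    # Weighted log scores for the "boundary" and "no boundary" hypotheses.
    log_boundary = weight * math.log(pp) + (1.0 - weight) * math.log(pl)
    log_none = weight * math.log(1.0 - pp) + (1.0 - weight) * math.log(1.0 - pl)

    # Renormalise over the two hypotheses, subtracting the max for stability.
    m = max(log_boundary, log_none)
    b = math.exp(log_boundary - m)
    n = math.exp(log_none - m)
    return b / (b + n)
```

When both models agree, the combined posterior equals their shared value; when they disagree, it falls between the two, pulled toward whichever model `weight` favours.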
Effects of acoustic features modifications on the perception of dysarthric speech - preliminary study (pitch, intensity and duration modifications)
Marking stress is important in conveying meaning and drawing the listener's attention to specific parts of a message. Extensive research has shown that healthy speakers mark stress using three main acoustic cues: pitch, intensity, and duration. The relationship between acoustic and perceptual cues is vital in the development of a computer-based tool that aids therapists in providing effective treatment to people with dysarthria. It is, therefore, important to investigate the acoustic cue deficiencies in dysarthric speech and the potential compensatory techniques needed for effective treatment. In this paper, the relationship between acoustic and perceptual cues in dysarthric speech is investigated. This is achieved by modifying stress-marked sentences from 10 speakers with ataxic dysarthria. Each speaker produced 30 sentences, using 10 Subject-Verb-Object-Adjective (SVOA) structured sentences across three stress conditions. These stress conditions are stress on the initial (S), medial (O) and final (A) target words respectively. To effectively measure the deficiencies in dysarthric speech, the acoustic features (pitch, intensity, and duration) are modified incrementally. The paper presents the techniques involved in the modification of these acoustic features. The effects of these modifications are analysed in 25% increments in pitch, intensity and duration. For robustness and validation, 50 untrained listeners participated in the listening experiment. The results and the relationship between acoustic modifications (what is measured) and perception (what is heard) in dysarthric speech are discussed.
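The stimulus design described in this abstract can be laid out as a small enumeration. The sketch below is illustrative only: the speaker, sentence, stress-position and feature counts come from the abstract, while the number of 25% steps (`n_steps`) and the reading of each step as a multiplicative factor relative to the original value are assumptions. The names `scale_factors` and `build_conditions` are hypothetical.

```python
from itertools import product

SPEAKERS = range(1, 11)              # 10 speakers with ataxic dysarthria
SENTENCES = range(1, 11)             # 10 SVOA-structured sentences
STRESS_POSITIONS = ("S", "O", "A")   # initial, medial and final target words
FEATURES = ("pitch", "intensity", "duration")


def scale_factors(n_steps, step=0.25):
    """Return multiplicative factors 1.25, 1.50, ... for n_steps increments.

    ASSUMPTION: each 25% increment is taken relative to the original value;
    the abstract does not state the number of steps or the reference point.
    """
    return [1.0 + step * k for k in range(1, n_steps + 1)]


def build_conditions(n_steps):
    """Enumerate (speaker, sentence, stress, feature, factor) tuples."""
    utterances = product(SPEAKERS, SENTENCES, STRESS_POSITIONS)
    return [
        (spk, sent, pos, feat, factor)
        for (spk, sent, pos) in utterances
        for feat in FEATURES
        for factor in scale_factors(n_steps)
    ]
```

With these counts, each speaker contributes 10 × 3 = 30 utterances, giving 300 source utterances in total before any modification is applied.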
Automatic Feedback for L2 Prosody Learning
We have designed automatic feedback for the realisation of the prosody of a foreign language. Besides classical F0 displays, two kinds of feedback are provided to learners, each of them based upon a comparison between a reference and the learner's production. The first feedback, a diagnosis, provided both in the form of a short text and visual displays such as arrows, comes from an acoustic evaluation of the learner's realisation; it deals with two prosodic cues: the melodic curve and phoneme duration. The second feedback is perceptual and consists of replacing the learner's prosodic cues (duration and F0) with those of the reference. A pilot experiment has been undertaken to test the immediate impact of the "advanced" feedback proposed here. We have chosen to test the production of English lexical accent in isolated words by French speakers. The experiment shows that feedback based upon diagnosis and speech modification enables French learners with a low production level to improve their realisations of English lexical accents more than (simple) auditory feedback does. In contrast, for the advanced learners involved in this study, auditory feedback appears to be as effective as more elaborate feedback.