Stylization of Pitch with Syllable-Based Linear Segments
Fundamental frequency contours for speech, as obtained by common pitch tracking algorithms, contain a great deal of fine detail that is unlikely to hold much perceptual significance for listeners. In our experiments, a radically reduced pitch contour consisting of a single linear segment per syllable was judged by listeners to be as natural as the original pitch track, based on high-quality analysis-synthesis. We describe the algorithms both for segmenting speech into syllables, by fitting Gaussians to the energy envelope, and for approximating the pitch contour by independent linear segments for each syllable. We report a web-based test in which 40 listeners compared the stylized pitch contour resyntheses to equivalent resyntheses based on the original pitch track, and to pitch tracks stylized by the existing Momel algorithm. Listeners preferred the original pitch contour to the linear approximation in only 60% of cases, where 50% would indicate random guessing. By contrast, the original was preferred over Momel in 74% of cases.
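The per-syllable linear stylization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the `stylize_pitch` helper, its argument layout, and the use of a least-squares fit are assumptions:

```python
import numpy as np

def stylize_pitch(times, f0, syllable_bounds):
    """Replace an F0 contour with one independent linear segment per
    syllable, fit by least squares over the voiced (non-NaN) frames."""
    stylized = np.full_like(f0, np.nan)
    for start, end in syllable_bounds:
        voiced = (times >= start) & (times < end) & ~np.isnan(f0)
        if voiced.sum() >= 2:
            slope, intercept = np.polyfit(times[voiced], f0[voiced], deg=1)
            seg = (times >= start) & (times < end)
            stylized[seg] = slope * times[seg] + intercept
    return stylized
```

Each syllable's segment is fit independently, so the stylized contour may be discontinuous at syllable boundaries, matching the radical reduction described in the abstract.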
CoPaSul Manual -- Contour-based parametric and superpositional intonation stylization
The purposes of the CoPaSul toolkit are (1) automatic prosodic annotation and
(2) prosodic feature extraction from syllable to utterance level. CoPaSul
stands for contour-based, parametric, superpositional intonation stylization.
In this framework intonation is represented as a superposition of global and
local contours that are described parametrically in terms of polynomial
coefficients. On the global level (usually associated with, but not
necessarily restricted to, intonation phrases) the stylization serves to
represent register
in terms of time-varying F0 level and range. On the local level (e.g. accent
groups), local contour shapes are described. From this parameterization several
features related to prosodic boundaries and prominence can be derived.
Furthermore, prosodic contour classes can be obtained bottom-up by clustering
these coefficients. In addition to the stylization-based features, standard F0
and energy measures (e.g. mean and variance) as well as rhythmic aspects can
be calculated. At present, automatic annotation comprises:
segmentation into interpausal chunks, syllable nucleus extraction, and
unsupervised localization of prosodic phrase boundaries and prominent
syllables. F0 and, in part, energy feature sets can be derived for: standard
measures (such as median and IQR), register in terms of F0 level and range,
prosodic boundaries, local contour shapes, bottom-up derived contour classes,
the Gestalt of accent groups in terms of their deviation from higher-level
prosodic units, and rhythmic aspects quantifying the relation between F0 and
energy contours and prosodic event rates.
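The parametric stylization can be illustrated by fitting a polynomial to a local F0 stretch over time normalized to [-1, 1], with the coefficients serving as shape features. This is a minimal sketch under assumed conventions, not CoPaSul's actual code; the `contour_coefs` name and the default order are assumptions:

```python
import numpy as np

def contour_coefs(f0, order=3):
    """Fit a polynomial to a local F0 contour over time normalized to
    [-1, 1]; the coefficient vector describes the contour's shape."""
    t = np.linspace(-1, 1, len(f0))
    # np.polyfit returns coefficients from highest to lowest degree.
    return np.polyfit(t, f0, deg=order)
```

Clustering such coefficient vectors (e.g. with k-means) would then yield contour classes bottom-up, as described above.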
Prosody-Based Automatic Segmentation of Speech into Sentences and Topics
A crucial step in processing speech audio data for information extraction,
topic detection, or browsing/playback is to segment the input into sentence and
topic units. Speech segmentation is challenging, since the cues typically
present for segmenting text (headers, paragraphs, punctuation) are absent in
spoken language. We investigate the use of prosody (information gleaned from
the timing and melody of speech) for these tasks. Using decision tree and
hidden Markov modeling techniques, we combine prosodic cues with word-based
approaches, and evaluate performance on two speech corpora, Broadcast News and
Switchboard. Results show that the prosodic model alone performs on par with,
or better than, word-based statistical language models -- for both true and
automatically recognized words in news speech. The prosodic model achieves
comparable performance with significantly less training data, and requires no
hand-labeling of prosodic events. Across tasks and corpora, we obtain a
significant improvement over word-only models using a probabilistic combination
of prosodic and lexical information. Inspection reveals that the prosodic
models capture language-independent boundary indicators described in the
literature. Finally, cue usage is task and corpus dependent. For example, pause
and pitch features are highly informative for segmenting news speech, whereas
pause, duration, and word-based cues dominate for natural conversation.

Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2),
Special Issue on Accessing Information in Spoken Audio, September 200
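The paper combines the models with decision-tree and HMM techniques; as a simpler illustration of probabilistically combining prosodic and lexical information, the two posteriors for a boundary at a word transition can be interpolated log-linearly. The `combine_boundary_scores` helper and the interpolation weight are assumptions, not the paper's method:

```python
import math

def combine_boundary_scores(p_prosody, p_lm, weight=0.5):
    """Log-linear interpolation of a prosodic-model posterior and a
    language-model posterior for a sentence boundary.
    Assumes both posteriors are strictly positive."""
    log_p = weight * math.log(p_prosody) + (1 - weight) * math.log(p_lm)
    return math.exp(log_p)
```

A boundary would then be hypothesized wherever the combined score exceeds a tuned threshold.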
Effects of tonal alignment on lexical identification in Italian
The aim of this paper is to examine the role of tonal alignment in the variety of Italian spoken in Naples. We focus on the effects of intonation on the perception of minimal pairs contrasting in consonant duration. Spectrographic analyses show that the timing of the pitch accent varies with syllable structure (see also [1]). In open syllables (CV), the pitch peak is realized within the stressed vowel, while in closed syllables (CVC) the peak is reached at the end of the accented syllable, associated with the final consonant. In order to analyze these effects, series of words contrasting in consonant duration and embedded in the same segmental environment were produced by a native speaker. Two kinds of manipulation were performed: first, we modified the length of the stressed vowel and the following consonant in five steps; then, the timing of the pitch peak was modified in four steps. Finally, a set of resynthesized stimuli was created by combining all the duration and pitch steps: this set constituted the basis for the perception experiments. We asked thirteen Neapolitan listeners to identify each stimulus as one word or the other of each pair. Our results show that the manipulation of intonation was significant for the stimuli derived from the CVC words; that is, a garden-path effect (a shift in responses) related to the timing of the pitch peak was found. These results support the hypothesis that listeners use temporal alignment in the perception of segmental identity, and that the contribution of intonation, both in production and in perception, is a fundamental source of linguistic information.
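The stepwise manipulation of pitch-peak timing can be illustrated with a stylized rise-fall accent whose peak position is a free parameter. This is a hypothetical sketch; the `rise_fall` helper and the frequency values are assumptions, not the stimuli actually used:

```python
import numpy as np

def rise_fall(times, t_peak, f_low=120.0, f_high=180.0):
    """Stylized rise-fall accent in Hz: linear rise from f_low to a
    peak at t_peak, then linear fall back to f_low at the end.
    Assumes times[0] < t_peak < times[-1]."""
    f0 = np.empty_like(times)
    rise = times <= t_peak
    f0[rise] = f_low + (f_high - f_low) * (times[rise] - times[0]) / (t_peak - times[0])
    fall = ~rise
    f0[fall] = f_high - (f_high - f_low) * (times[fall] - t_peak) / (times[-1] - t_peak)
    return f0
```

A four-step continuum of peak alignments, analogous to the manipulation described above, could then be generated as `[rise_fall(times, tp) for tp in np.linspace(0.3, 0.6, 4)]`.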
Emotion Recognition from Speech Signals and Perception of Music
This thesis deals with emotion recognition from speech signals. We aim to improve the feature extraction step by drawing on the perception of music. In music theory, different pitch intervals (consonant, dissonant) and chords are believed to evoke different feelings in listeners. The question is whether a similar mechanism links the perception of music and the perception of emotional speech. Our research will follow three stages. First, the relationship between speech and music at the segmental and supra-segmental levels will be analyzed. Second, the encoding of emotions through music will be investigated. Third, a description of the most common features used for emotion recognition from speech will be provided. We will additionally derive new high-level musical features, which we expect to improve the recognition rate for the basic spoken emotions.
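One music-inspired feature of the kind proposed, pitch intervals measured in semitones, can be sketched as follows. The `semitone_intervals` helper is hypothetical and not drawn from the thesis; interval sizes in semitones are the standard raw material for classifying intervals as consonant or dissonant:

```python
import numpy as np

def semitone_intervals(f0):
    """Frame-to-frame pitch intervals in semitones (12 semitones per
    octave); assumes a voiced, strictly positive F0 sequence in Hz."""
    f0 = np.asarray(f0, dtype=float)
    return 12.0 * np.log2(f0[1:] / f0[:-1])
```

Histograms of such intervals over an utterance could then be binned into consonant and dissonant classes to form high-level features for an emotion classifier.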