
    Stylization of Pitch with Syllable-Based Linear Segments

    Fundamental frequency contours for speech, as obtained by common pitch tracking algorithms, contain a great deal of fine detail that is unlikely to hold much perceptual significance for listeners. In our experiments, based on high-quality analysis-synthesis, a radically reduced pitch contour consisting of a single linear segment per syllable was judged by listeners to be as natural as the original pitch track. We describe the algorithms both for segmenting speech into syllables, by fitting Gaussians to the energy envelope, and for approximating the pitch contour with an independent linear segment for each syllable. We report a web-based test in which 40 listeners compared resyntheses of the stylized pitch contours to equivalent resyntheses based on the original pitch tracks, and also to pitch tracks stylized by the existing Momel algorithm. Listeners preferred the original pitch contour to the linear approximation in only 60% of cases, where 50% would indicate random guessing. By contrast, the original was preferred over Momel in 74% of cases.
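    The core of the stylization is a least-squares line fitted independently to the F0 samples of each syllable. A minimal sketch of that per-syllable fit follows; the interface and names are illustrative, and the paper's syllabification step (Gaussian fits to the energy envelope) is omitted here.

```python
def fit_line(times, f0):
    """Ordinary least-squares line f0 ~ a*t + b over one syllable's samples."""
    n = len(times)
    mt = sum(times) / n
    mf = sum(f0) / n
    denom = sum((t - mt) ** 2 for t in times)
    a = 0.0 if denom == 0 else sum((t - mt) * (f - mf)
                                   for t, f in zip(times, f0)) / denom
    b = mf - a * mt
    return a, b

def stylize(times, f0, syllable_spans):
    """Replace the contour inside each (start, end) syllable span by the
    line fitted to that span alone; samples outside spans are untouched."""
    styl = list(f0)
    for start, end in syllable_spans:
        idx = [i for i, t in enumerate(times) if start <= t < end]
        if len(idx) < 2:
            continue  # too few voiced samples to fit a line
        a, b = fit_line([times[i] for i in idx], [f0[i] for i in idx])
        for i in idx:
            styl[i] = a * times[i] + b
    return styl
```

    Because each segment is fitted independently, the stylized contour is generally discontinuous at syllable boundaries, which matches the paper's "independent linear segments" formulation.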

    CoPaSul Manual -- Contour-based parametric and superpositional intonation stylization

    The CoPaSul toolkit serves two purposes: (1) automatic prosodic annotation and (2) prosodic feature extraction from the syllable to the utterance level. CoPaSul stands for contour-based, parametric, superpositional intonation stylization. In this framework intonation is represented as a superposition of global and local contours that are described parametrically in terms of polynomial coefficients. On the global level (usually associated with, but not necessarily restricted to, intonation phrases), the stylization serves to represent register in terms of time-varying F0 level and range. On the local level (e.g. accent groups), local contour shapes are described. From this parameterization several features related to prosodic boundaries and prominence can be derived. Furthermore, by clustering the coefficients, prosodic contour classes can be obtained in a bottom-up way. In addition to stylization-based feature extraction, standard F0 and energy measures (e.g. mean and variance) as well as rhythmic aspects can be calculated. Currently, automatic annotation comprises segmentation into interpausal chunks, syllable nucleus extraction, and unsupervised localization of prosodic phrase boundaries and prominent syllables. F0 and, in part, energy feature sets can be derived for standard measurements (such as median and IQR), register in terms of F0 level and range, prosodic boundaries, local contour shapes, bottom-up derived contour classes, the Gestalt of accent groups in terms of their deviation from higher-level prosodic units, and rhythmic aspects quantifying the relation between F0 and energy contours and prosodic event rates.
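    The polynomial parameterization can be illustrated with a small sketch: each local contour is fitted by a polynomial over time normalized to [0, 1], and the coefficients serve as shape features (and as input to bottom-up clustering). This is an assumed, simplified interface for illustration, not CoPaSul's own API.

```python
def polyfit(x, y, deg):
    """Least-squares polynomial fit via the normal equations.
    Returns coefficients in increasing-power order: c0 + c1*x + c2*x**2 + ..."""
    n = deg + 1
    # Normal equations A c = b with A[i][j] = sum x^(i+j), b[i] = sum y*x^i.
    A = [[sum(xi ** (i + j) for xi in x) for j in range(n)] for i in range(n)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back-substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / A[i][i]
    return coeffs

def contour_features(times, f0, deg=3):
    """Normalize time to [0, 1] and return the polynomial shape coefficients
    of one local contour (e.g. one accent group)."""
    t0, t1 = times[0], times[-1]
    xs = [(t - t0) / (t1 - t0) for t in times]
    return polyfit(xs, f0, deg)
```

    Time normalization makes the coefficients comparable across contours of different durations, which is what allows them to be clustered into contour classes.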

    Pitch elbow detection


    Prosody-Based Automatic Segmentation of Speech into Sentences and Topics

    A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models -- for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration and word-based cues dominate for natural conversation.

    Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio, September 200
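    The paper's "probabilistic combination of prosodic and lexical information" integrates a prosodic decision tree with a hidden-Markov-model language model. Shown below instead is a much simpler stand-in, a weighted geometric (log-linear) interpolation of two boundary posteriors, renormalized over the two outcomes; the function name and the weight parameter are illustrative, not from the paper.

```python
def boundary_posterior(p_prosodic, p_lexical, w=0.5):
    """Combine a prosodic-model and a lexical-model posterior for a
    sentence/topic boundary at one word transition. w weights the
    prosodic model; w=1.0 trusts prosody alone, w=0.0 the words alone."""
    yes = (p_prosodic ** w) * (p_lexical ** (1.0 - w))
    no = ((1.0 - p_prosodic) ** w) * ((1.0 - p_lexical) ** (1.0 - w))
    return yes / (yes + no)
```

    In practice the weight would be tuned on held-out data, which mirrors the paper's observation that the best cue mix is task and corpus dependent.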

    Effects of tonal alignment on lexical identification in Italian

    The aim of this paper is to examine the role of tonal alignment in the variety of Italian spoken in Naples. We focus on the effects of intonation on the perception of minimal pairs contrasting in consonant duration. Spectrographic analyses show that the timing of the pitch accent varies with syllable structure (see also [1]). In open syllables (CV), the pitch peak is realized within the stressed vowel, while in closed syllables (CVC) the peak is reached at the end of the accented syllable, associated with the final consonant. To analyze these effects, series of words contrasting in consonant duration and embedded in the same segmental environment were produced by a native speaker. Two kinds of manipulation were performed: first, we modified the length of the stressed vowel and the following consonant in five steps; then the timing of the pitch peak was modified in four steps. Finally, a set of resynthesized stimuli was created by combining all the duration and pitch steps; this set formed the basis for the perception experiments. We asked thirteen Neapolitan listeners to hear the stimuli and identify each one as one word or the other of its pair. Our results show that the manipulation of intonation was significant for the stimuli derived from the CVC words; that is, a garden-path effect (a shift in responses) related to the timing of the pitch peak was found. These results support the hypothesis that listeners use temporal alignment in the perception of segmental identity, and that intonation, in both production and perception, is a fundamental source of linguistic information.

    Emotion Recognition from Speech Signals and Perception of Music

    This thesis deals with emotion recognition from speech signals. We aim to improve the feature extraction step by drawing on the perception of music. In music theory, different pitch intervals (consonant, dissonant) and chords are believed to evoke different feelings in listeners. The question is whether a similar mechanism links the perception of music and the perception of emotional speech. Our research will follow three stages. First, the relationship between speech and music at the segmental and supra-segmental levels will be analyzed. Second, the encoding of emotions through music will be investigated. Third, a description of the most common features used for emotion recognition from speech will be provided. We will additionally derive new high-level musical features, which should improve the recognition rate for the basic spoken emotions.