
    Extending automatic transcripts in a unified data representation towards a prosodic-based metadata annotation and evaluation

    This paper describes a framework that extends automatic speech transcripts in order to accommodate relevant information coming from manual transcripts, the speech signal itself, and other resources, like lexica. The proposed framework automatically collects, relates, computes, and stores all relevant information together in a self-contained data source, making it possible to easily provide a wide range of interconnected information suitable for speech analysis, training, and evaluating a number of automatic speech processing tasks. The main goal of this framework is to integrate different linguistic and paralinguistic layers of knowledge for a more complete view of their representation and interactions in several domains and languages. The processing chain is composed of two main stages: the first integrates the relevant manual annotations into the speech recognition data, and the second further enriches that output in order to accommodate prosodic information. The described framework has been used for the identification and analysis of structural metadata in automatic speech transcripts. Initially put to use for automatic detection of punctuation marks and for capitalization recovery from speech data, it has also recently been used for studying the characterization of disfluencies in speech. It has already been applied to several domains of Portuguese corpora, and also to English and Spanish Broadcast News corpora.

    Three-dimensional point-cloud room model in room acoustics simulations


    Models and analysis of vocal emissions for biomedical applications: 5th International Workshop: December 13-15, 2007, Firenze, Italy

    The proceedings of the MAVEBA Workshop, which is held on a biennial basis, collect the scientific papers presented at the conference as both oral and poster contributions. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, and biomedical engineering methods for the analysis of voice signals and images as a support to clinical diagnosis and classification of vocal pathologies. The Workshop is sponsored by Ente Cassa Risparmio di Firenze, COST Action 2103, Biomedical Signal Processing and Control Journal (Elsevier Eds.), and IEEE Biomedical Engineering Soc. Special Issues of International Journals have been, and will be, published, collecting selected papers from the conference.

    Automatic prosodic analysis for computer aided pronunciation teaching

    Correct pronunciation of spoken language requires the appropriate modulation of acoustic characteristics of speech to convey linguistic information at a suprasegmental level. Such prosodic modulation is a key aspect of spoken language and is an important component of foreign language learning, for purposes of both comprehension and intelligibility. Computer aided pronunciation teaching involves automatic analysis of the speech of a non-native talker in order to provide a diagnosis of the learner's performance in comparison with the speech of a native talker. This thesis describes research undertaken to automatically analyse the prosodic aspects of speech for computer aided pronunciation teaching. It is necessary to describe the suprasegmental composition of a learner's speech in order to characterise significant deviations from a native-like prosody, and to offer some kind of corrective diagnosis. Phonological theories of prosody aim to describe the suprasegmental composition of speech..

    Models and analysis of vocal emissions for biomedical applications

    This book of Proceedings collects the papers presented at the 3rd International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, MAVEBA 2003, held 10-12 December 2003, Firenze, Italy. The workshop is organised every two years, and aims to stimulate contacts between specialists active in research and industrial developments in the area of voice analysis for biomedical applications. The scope of the Workshop includes all aspects of voice modelling and analysis, ranging from fundamental research to all kinds of biomedical applications and related established and advanced technologies.

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes.

    The Pitch Range of Italians and Americans. A Comparative Study

    Linguistic experiments have investigated the nature of F0 span and level in cross-linguistic comparisons. However, only a few studies have focused on the elaboration of a generally agreed methodology that may provide a unifying approach to the analysis of pitch range (Ladd, 1996; Patterson and Ladd, 1999; Daly and Warren, 2001; Bishop and Keating, 2010; Mennen et al. 2012). Pitch variation is used in different languages to convey different linguistic and paralinguistic meanings that may range from the expression of sentence modality to the marking of emotional and attitudinal nuances (Grice and Baumann, 2007). A number of factors have to be taken into consideration when determining the existence of measurable and reliable differences in pitch values. Daly and Warren (2001) demonstrated the importance of some independent variables such as language, age, body size, speaker sex (female vs. male), socio-cultural background, regional accents, speech task (read sentences vs. spontaneous dialogues), sentence type (questions vs. statements) and measurement scales (Hertz, semitones, ERB etc.). Consistent with the model proposed by Mennen et al. (2012), my analysis of pitch range is based on the investigation of LTD (long-term distributional) and linguistic measures. LTD measures deal with the F0 distribution within a speaker’s contour (e.g. F0 minimum, F0 maximum, F0 mean, F0 median, standard deviation, F0 span) while linguistic measures are linked to specific targets within the contour, such as peaks and valleys (e.g. high and low landmarks) and preserve the temporal sequences of pitch contours. This investigation analyzed the characteristics of pitch range production and perception in English sentences uttered by Americans and Italians.
Four experiments were conducted to examine different phenomena: i) the contrast between measures of F0 level and span in utterances produced by Americans and Italians (experiments 1-2); ii) the contrast between the pitch range produced by males and females in L1 and L2 (experiment 1); iii) the F0 patterns in different sentence types, that is, yes-no questions, wh-questions, and exclamations (experiment 2); iv) listeners’ evaluations of pitch span in terms of ±interesting, ±excited, ±credible, ±friendly ratings of different sentence types (experiments 3-4); v) the correlation between pitch span of the sentences and the evaluations given by American and Italian listeners (experiment 3); vi) the listeners’ evaluations of pitch span values in manipulated stimuli, whose F0 span was re-synthesized under three conditions: narrow span, original span, and wide span (experiment 4); vii) the different evaluations given to the sentences by male and female listeners. The results of this investigation supported the following generalizations. First, pitch span more than level was found to be a cue for non-nativeness, because L2 speakers of English used a narrower span, compared to the native norm. What is more, the experimental data in the production studies indicated that the mode of sentences was better captured by F0 span than level. Second, the Italian learners of English were influenced by their L1 and transferred L1 pitch range variation into their L2. The English sentences produced by the Italians had overall higher pitch levels and narrower pitch span than those produced by the Americans. In addition, the Italians used overall higher pitch levels when speaking Italian and lower levels when speaking English. Conversely, their pitch span was generally higher in English and lower in Italian. 
When comparing productions in English, the Italian females used higher F0 levels than the American females; vice versa, the Italian males showed slightly lower F0 levels than the American males. Third, there was a systematic relation between pitch span values and the listeners’ evaluations of the sentences. The two groups of listeners (the Americans and the Italians) rated the stimuli with larger pitch span as more interesting, exciting and credible than the stimuli with narrower pitch span. Thus, the listeners relied on the perceived pitch span to differentiate among the stimuli. Fourth, both the American and the Italian speakers were considered more friendly when the pitch span of their sentences was widened (wide span manipulation) and less friendly when the pitch span was narrowed (narrow span manipulation). This happened in all the stimuli regardless of the native language of the speakers (American vs. Italian).
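
The LTD measures listed in the abstract above (F0 minimum, maximum, mean, median, standard deviation, span) can be illustrated with a minimal sketch. This is not the study's own procedure: the function name `ltd_measures`, the convention that unvoiced frames are coded as 0, and the choice to report span both in Hertz and in semitones (using the standard conversion 12 · log2(max/min)) are assumptions made here for illustration.

```python
import math
import statistics


def ltd_measures(f0_hz):
    """Long-term distributional (LTD) measures for one speaker's F0 samples in Hz.

    Assumption (not from the source): unvoiced frames are coded as 0 and skipped.
    """
    voiced = [f for f in f0_hz if f > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced F0 samples")
    f0_min, f0_max = min(voiced), max(voiced)
    return {
        "min_hz": f0_min,
        "max_hz": f0_max,
        "mean_hz": statistics.mean(voiced),
        "median_hz": statistics.median(voiced),
        "sd_hz": statistics.stdev(voiced),  # sample standard deviation
        "span_hz": f0_max - f0_min,
        # Span on a semitone scale: 12 semitones per doubling of frequency.
        "span_st": 12 * math.log2(f0_max / f0_min),
    }


if __name__ == "__main__":
    # Hypothetical contour; zeros stand for unvoiced frames.
    print(ltd_measures([0, 100, 200, 0, 150]))
```

Reporting span in semitones as well as Hertz makes spans comparable across speakers with different overall pitch levels, which is one reason the choice of measurement scale matters in comparisons such as those described above.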