
    Activity Report 2004


    Acoustic Modelling for Under-Resourced Languages

    Automatic speech recognition systems have so far been developed for only a very few of the 4,000-7,000 existing languages. In this thesis we examine methods to rapidly create acoustic models for new, possibly under-resourced languages in a time- and cost-effective manner. To this end, we examine the use of multilingual models, the application of articulatory features across languages, and the automatic discovery of word-like units in unwritten languages.

    A multimodal real-time MRI articulatory corpus of French for speech research

    In this work we describe the creation of ArtSpeechMRIfr: a real-time and static magnetic resonance imaging (rtMRI and 3D MRI) database of the vocal tract. The database also contains processed data: denoised audio, its phonetically aligned annotation, articulatory contours, and vocal tract volume information, which provides a rich resource for speech research. The database is built on data from two male speakers of French. It covers a number of phonetic contexts in the controlled part, as well as spontaneous speech, 3D MRI scans of sustained vocalic articulations, and scans of the dental casts of the subjects. The rtMRI corpus consists of 79 synthetic sentences constructed from a phonetized dictionary, which makes it possible to shorten the duration of acquisitions while keeping very good coverage of the phonetic contexts that exist in French. The 3D MRI includes acquisitions for 12 French vowels and 10 consonants, each of which was pronounced in several vocalic contexts. Articulatory contours (tongue, jaw, epiglottis, larynx, velum, lips) as well as 3D volumes were manually drawn for a part of the images.

    The Emotion Probe: On the Universality of Cross-Linguistic and Cross-Gender Speech Emotion Recognition via Machine Learning

    Machine Learning (ML) algorithms within a human–computer framework are the leading force in speech emotion recognition (SER). However, few studies explore cross-corpora aspects of SER; this work aims to explore the feasibility and characteristics of cross-linguistic, cross-gender SER. Three ML classifiers (SVM, Naïve Bayes and MLP) are applied to acoustic features obtained through a procedure based on Kononenko’s discretization and correlation-based feature selection. The system encompasses five emotions (disgust, fear, happiness, anger and sadness), using the Emofilm database, comprising short clips of English movies and the respective Italian and Spanish dubbed versions, for a total of 1115 annotated utterances. The results show MLP to be the most effective classifier, with accuracies higher than 90% for single-language approaches, while the cross-language classifier still yields accuracies higher than 80%. The results show cross-gender tasks to be more difficult than those involving two languages, suggesting greater differences between emotions expressed by male versus female subjects than between different languages. Four feature domains, namely RASTA, F0, MFCC and spectral energy, are algorithmically assessed as the most effective, refining existing literature and approaches based on standard sets. To our knowledge, this is one of the first studies encompassing cross-gender and cross-linguistic assessments of SER.
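    The abstract describes a pipeline of acoustic feature selection followed by classification, trained on one language and tested on another. Below is a minimal sketch of such a pipeline in Python with scikit-learn; the feature dimensionality, the placeholder data, and the use of mutual-information ranking as a stand-in for Kononenko's discretization with correlation-based feature selection are illustrative assumptions, not the authors' exact procedure.

```python
# Hypothetical sketch: train on one language, test on another (cross-linguistic SER).
# Kononenko's discretization + correlation-based feature selection are approximated
# here by a mutual-information ranking; an illustrative stand-in, not the paper's method.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def cross_language_ser(X_train, y_train, X_test, y_test, k_features=50):
    """Train an MLP emotion classifier on one language and test on another."""
    model = make_pipeline(
        StandardScaler(),                                 # normalize acoustic features
        SelectKBest(mutual_info_classif, k=k_features),   # keep the most informative features
        MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
    )
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)                    # accuracy on the held-out language

# Example with random placeholder data (5 emotion classes, 384-dim features).
rng = np.random.default_rng(0)
X_en, y_en = rng.normal(size=(400, 384)), rng.integers(0, 5, 400)   # "English" split
X_it, y_it = rng.normal(size=(200, 384)), rng.integers(0, 5, 200)   # "Italian" split
print(cross_language_ser(X_en, y_en, X_it, y_it))
```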

    Perceptual Continuity and Naturalness of Expressive Strength in Singing Voices Based on Speech Morphing

    This paper experimentally shows the importance of perceptual continuity of the expressive strength in vocal timbre for natural change in vocal expression. In order to synthesize varied and continuous expressive strengths in vocal timbre, we investigated gradually changing expressions by applying the STRAIGHT speech morphing algorithm to singing voices. Here, a singing voice without expression is used as the base of morphing, and singing voices with three different expressions are used as the targets. Through statistical analyses of perceptual evaluations, we confirmed that the proposed morphing algorithm provides perceptual continuity of vocal timbre. Our results showed the following: (i) gradual strengths in absolute evaluations, and (ii) a perceptually linear strength obtained by correcting the morph-ratio intervals with the inverse (reciprocal) of an equation that approximates the perceptual strength. Finally, we concluded that applying continuity was highly effective for achieving perceptual naturalness, judging from the results showing that (iii) our gradual transformation method performs well for perceived naturalness.
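    Point (ii) above hinges on inverting a function that maps the morph ratio to perceived strength, so that the synthesized steps are perceptually equidistant. A minimal sketch of that correction is given below; the power-law form of the perceptual model is an assumption for illustration, since the abstract does not give the fitted equation.

```python
# Hypothetical sketch: choose morph ratios that are equally spaced in *perceived*
# expressive strength. The perceptual model s(r) below (a power-law fit) is an
# assumption for illustration; the paper fits its own equation to listener ratings.
import numpy as np

def perceived_strength(r, a=1.0, p=0.6):
    """Assumed mapping from morph ratio r in [0, 1] to perceived strength."""
    return a * np.power(r, p)

def corrected_morph_ratios(n_steps, a=1.0, p=0.6):
    """Invert the perceptual model so the steps are perceptually equidistant."""
    targets = np.linspace(0.0, a, n_steps)    # equally spaced perceived strengths
    return np.power(targets / a, 1.0 / p)     # inverse of perceived_strength

print(corrected_morph_ratios(5))  # morph ratios to feed the morphing step
```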

    Spoken dialog systems based on online generated stochastic finite-state transducers

    This is the author’s version of a work that was accepted for publication in Speech Communication. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document, and changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Speech Communication 83 (2016) 81–93, DOI 10.1016/j.specom.2016.07.011. © 2016 Elsevier B.V. All rights reserved.

    In this paper, we present an approach for the development of spoken dialog systems based on the statistical modelization of the dialog manager. This work focuses on three points: the modelization of the dialog manager using Stochastic Finite-State Transducers, an unsupervised way to generate training corpora, and a mechanism based on the online generation of synthetic dialogs to address the problem of coverage. Our proposal has been developed and applied to a sports facilities booking task at the university. We present experiments evaluating the system behavior on a set of dialogs acquired using the Wizard of Oz technique, as well as experiments with real users. The experiments show that the method proposed to increase the coverage of the dialog system was useful for finding new valid paths in the model to achieve the user goals, providing good results with real users.

    This work is partially supported by the project ASLP-MULAN: Audio, Speech and Language Processing for Multimedia Analytics (MINECO TIN2014-54288-C4-3-R).

    Hurtado Oliver, LF.; Planells Lerma, J.; Segarra Soriano, E.; Sanchís Arnal, E. (2016). Spoken dialog systems based on online generated stochastic finite-state transducers. Speech Communication, 83:81-93. https://doi.org/10.1016/j.specom.2016.07.011
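    As a rough illustration of the core data structure, the sketch below implements a toy stochastic finite-state transducer acting as a dialog manager: arcs map a dialog state and user act to weighted system actions. The states, acts, and probabilities are invented for a booking-style task and are not the paper's model or corpus.

```python
# Hypothetical sketch of a stochastic finite-state transducer as a dialog manager:
# transitions map (state, user dialog act) to weighted (system action, next state)
# options, and the manager follows the most probable arc. All values are invented.
from collections import defaultdict

class StochasticFST:
    def __init__(self):
        # (state, user_act) -> list of (probability, system_act, next_state)
        self.arcs = defaultdict(list)

    def add_arc(self, state, user_act, prob, system_act, next_state):
        self.arcs[(state, user_act)].append((prob, system_act, next_state))

    def step(self, state, user_act):
        """Return the most probable system action and the resulting state."""
        prob, system_act, next_state = max(self.arcs[(state, user_act)])
        return system_act, next_state

# Toy booking dialog: ask for the facility, then the time slot, then confirm.
dm = StochasticFST()
dm.add_arc("start",    "request_booking", 0.9, "ask_facility", "facility")
dm.add_arc("facility", "inform_facility", 0.8, "ask_time",     "time")
dm.add_arc("time",     "inform_time",     0.7, "confirm",      "closing")

state = "start"
for act in ["request_booking", "inform_facility", "inform_time"]:
    system_act, state = dm.step(state, act)
    print(act, "->", system_act)
```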

    Linguistically motivated parameter estimation methods for a superpositional intonation model

    This paper proposes two novel approaches for parameter estimation of a superpositional intonation model. These approaches introduce linguistic and paralinguistic assumptions for initializing a pre-existing standard method. In addition, all restrictions on the configuration of commands were eliminated. The proposed linguistic hypotheses can be based on either pitch accents or lexical stress, which gives rise to two different estimation methods. These two hypotheses were validated by comparing estimation performance with that of two standard methods, one manual and one automatic. The results of the experiments on German, English and Spanish corpora show that the proposed methods outperform the standard ones.

    Fil: Torres, Humberto Maximiliano. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Inmunología, Genética y Metabolismo. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Inmunología, Genética y Metabolismo; Argentina
    Fil: Gurlekian, Jorge Alberto. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Inmunología, Genética y Metabolismo. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Inmunología, Genética y Metabolismo; Argentina
    Fil: Mixdorff, Hansjörg. Beuth University Berlin; Alemania
    Fil: Pfitzinger, Hartmut. Pfitzinger Voice Design; Alemania
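    The abstract refers to the "commands" of a superpositional intonation model but does not spell out its equations. The sketch below assumes a Fujisaki-style formulation, in which log F0 is the sum of a base value, phrase components, and accent components; all parameter values are invented for illustration.

```python
# Hypothetical sketch of a Fujisaki-style superpositional intonation model, which the
# abstract's "commands" terminology suggests but does not spell out. Parameter values
# (Fb, alpha, beta, command times and amplitudes) are invented for illustration.
import numpy as np

def phrase_component(t, alpha=2.0):
    """Impulse response of the phrase control mechanism: Gp(t) = a^2 * t * exp(-a t)."""
    return np.where(t >= 0, alpha**2 * t * np.exp(-alpha * t), 0.0)

def accent_component(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, clipped at gamma."""
    g = np.where(t >= 0, 1.0 - (1.0 + beta * t) * np.exp(-beta * t), 0.0)
    return np.minimum(g, gamma)

def log_f0(t, fb=120.0, phrase_cmds=((0.0, 0.5),), accent_cmds=((0.3, 0.8, 0.4),)):
    """ln F0(t) = ln Fb + sum of phrase components + sum of accent components."""
    y = np.full_like(t, np.log(fb))
    for t0, ap in phrase_cmds:                 # (onset time, amplitude)
        y += ap * phrase_component(t - t0)
    for t1, t2, aa in accent_cmds:             # (onset, offset, amplitude)
        y += aa * (accent_component(t - t1) - accent_component(t - t2))
    return y

t = np.linspace(0.0, 2.0, 200)
f0 = np.exp(log_f0(t))                         # F0 contour in Hz
print(f0.min(), f0.max())
```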

    Linguistically-constrained formant-based i-vectors for automatic speaker recognition

    This is the author’s version of a work that was accepted for publication in Speech Communication. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document, and changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Speech Communication, Vol. 76 (2016), DOI 10.1016/j.specom.2015.11.002.

    This paper presents a large-scale study of the discriminative abilities of formant frequencies for automatic speaker recognition. Exploiting both the static and dynamic information in formant frequencies, we present linguistically-constrained formant-based i-vector systems providing well-calibrated likelihood ratios per comparison of the occurrences of the same isolated linguistic units in two given utterances. As a first result, the reported analysis of the discriminative and calibration properties of the different linguistic units provides useful insights, for instance, to forensic phonetic practitioners. Furthermore, it is shown that the set of most discriminative units varies from speaker to speaker. Secondly, linguistically-constrained systems are combined at score level through average and logistic regression speaker-independent fusion rules, exploiting the different speaker-distinguishing information spread among the different linguistic units. Testing on the English-only trials of the core condition of the NIST 2006 SRE (24,000 voice comparisons of 5-minute telephone conversations from 517 speakers, 219 male and 298 female), we report equal error rates of 9.57% and 12.89% for male and female speakers respectively, using only formant frequencies as speaker-discriminative information. Additionally, when the formant-based system is fused with a cepstral i-vector system, we obtain relative improvements of ∼6% in EER (from 6.54% to 6.13%) and ∼15% in minDCF (from 0.0327 to 0.0279), compared to the cepstral system alone.

    This work has been supported by the Spanish Ministry of Economy and Competitiveness (project CMC-V2: Caracterizacion, Modelado y Compensacion de Variabilidad en la Señal de Voz, TEC2012-37585-C02-01). The authors would also like to thank SRI for providing the Decipher phonetic transcriptions of the NIST 2004, 2005 and 2006 SREs, which made it possible to carry out this work.
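    The two score-level fusion rules named in the abstract, a simple average and a speaker-independent logistic-regression combiner, can be sketched as below; the score arrays and trial labels are random placeholders rather than real SRE scores.

```python
# Hypothetical sketch of the two score-level fusion rules named in the abstract:
# a simple average and a logistic-regression combiner trained on development trials.
# The score arrays and labels below are random placeholders, not real SRE scores.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_trials = 1000
labels = rng.integers(0, 2, n_trials)                  # 1 = same speaker, 0 = different
formant_scores = rng.normal(loc=labels, scale=1.5)     # formant-based subsystem scores
cepstral_scores = rng.normal(loc=labels, scale=1.0)    # cepstral i-vector subsystem scores

# Rule 1: average fusion.
avg_fused = (formant_scores + cepstral_scores) / 2.0

# Rule 2: logistic-regression fusion (weights learned on a development set).
X = np.column_stack([formant_scores, cepstral_scores])
fuser = LogisticRegression().fit(X, labels)
lr_fused = fuser.decision_function(X)                  # fused log-odds-style scores

print(avg_fused[:3], lr_fused[:3])
```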