71 research outputs found

    Methods for speaking style conversion from normal speech to high vocal effort speech

    This thesis deals with vocal-effort-focused speaking style conversion (SSC). Specifically, we studied two topics on the conversion of normal speech to high vocal effort speech. The first topic involves the conversion of normal speech to shouted speech. We employed this conversion in a speaker recognition system with a vocal effort mismatch between test and enrollment utterances (shouted vs. normal speech). The mismatch degrades the system's speaker identification performance. As a solution, we proposed an SSC system that included a novel spectral mapping, used alongside a statistical mapping technique, to transform the mel-frequency spectral energies of normal speech enrollment utterances towards their counterparts in shouted speech. We evaluated the proposed solution by comparing speaker identification rates for a state-of-the-art i-vector-based speaker recognition system with and without applying SSC to the enrollment utterances. Our results showed that applying the proposed SSC pre-processing to the enrollment data considerably improves the speaker identification rates. The second topic involves normal-to-Lombard speech conversion. We proposed a vocoder-based parametric SSC system to perform the conversion. This system first extracts speech features using the vocoder. Next, a mapping technique, robust to data scarcity, maps the features. Finally, the vocoder synthesizes the mapped features into speech. For comparison, we used two vocoders in the conversion system: a glottal vocoder and the widely used STRAIGHT. We assessed the converted speech from the two vocoder cases with two subjective listening tests that measured similarity to Lombard speech and naturalness. The similarity test showed that, for both vocoder cases, our proposed SSC system was able to convert normal speech to Lombard speech. The naturalness test showed that samples converted using the glottal vocoder were clearly more natural than those obtained with STRAIGHT.
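    The idea of mapping normal-speech mel-frequency spectral energies towards their shouted counterparts can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis' actual statistical mapping: it assumes a small parallel set of frame-aligned feature pairs and fits an independent per-band linear map by least squares; the function names and toy data are invented for the example.

    ```python
    import numpy as np

    def fit_band_mapping(normal, shouted):
        """Fit y = a*x + b independently per mel band via least squares.

        normal, shouted: arrays of shape (frames, bands).
        Returns (a, b), each of shape (bands,).
        """
        a = np.empty(normal.shape[1])
        b = np.empty(normal.shape[1])
        for k in range(normal.shape[1]):
            a[k], b[k] = np.polyfit(normal[:, k], shouted[:, k], 1)
        return a, b

    def apply_mapping(normal, a, b):
        """Map normal-speech mel energies towards the shouted domain."""
        return normal * a + b

    # Toy demonstration with synthetic "mel energies"
    rng = np.random.default_rng(0)
    normal = rng.uniform(0.1, 1.0, size=(200, 20))
    shouted = 1.5 * normal + 0.2        # pretend shouting rescales each band
    a, b = fit_band_mapping(normal, shouted)
    mapped = apply_mapping(normal, a, b)
    print(np.allclose(mapped, shouted, atol=1e-6))  # → True
    ```

    In the thesis' setting, such a mapping would be applied to the enrollment utterances before i-vector extraction; a real system would use a richer statistical mapping than a per-band line.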

    Shouted Speech Compensation for Speaker Verification Robust to Vocal Effort Conditions

    The performance of speaker verification systems degrades when the vocal effort conditions of enrollment and test utterances differ (e.g., shouted vs. normal speech), a likely situation in non-cooperative speaker verification tasks. In this paper, we present a study of different methods for the linear compensation of embeddings, making use of Gaussian mixture models to cluster the shouted and normal speech domains. These compensation techniques are borrowed from the area of robustness for automatic speech recognition, and in this work we apply them to compensate for the mismatch between shouted and normal conditions in speaker verification. Before compensation, the shouted condition is automatically detected by means of logistic regression. The process is computationally light and is performed in the back-end of an x-vector system. Experimental results show that applying the proposed approach in the presence of vocal effort mismatch yields up to a 13.8% relative improvement in equal error rate with respect to a system that applies neither shouted speech detection nor compensation.
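    The back-end pipeline described above (detect the shouted condition, then linearly compensate the embedding) can be sketched as follows. This is a hedged simplification, not the paper's exact recipe: domain means stand in for the GMM clusters, compensation is a plain mean shift, and all data and names are synthetic assumptions.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    dim = 16
    normal = rng.normal(0.0, 1.0, size=(300, dim))   # "normal" embeddings
    shouted = rng.normal(2.0, 1.0, size=(300, dim))  # shifted "shouted" domain

    # 1) Train the shouted/normal detector (logistic regression, as in the paper).
    X = np.vstack([normal, shouted])
    y = np.concatenate([np.zeros(300), np.ones(300)])
    detector = LogisticRegression().fit(X, y)

    # 2) Estimate a linear compensation from the domain means
    #    (a stand-in for the paper's GMM-based compensation).
    offset = normal.mean(axis=0) - shouted.mean(axis=0)

    def compensate(emb):
        """Shift an embedding towards the normal domain iff detected as shouted."""
        if detector.predict(emb[None, :])[0] == 1:
            return emb + offset
        return emb

    test_emb = rng.normal(2.0, 1.0, size=dim)        # a shouted test vector
    comp = compensate(test_emb)
    print(np.linalg.norm(comp - normal.mean(axis=0)) <
          np.linalg.norm(test_emb - normal.mean(axis=0)))  # → True
    ```

    The compensated embedding lands closer to the normal-speech domain, which is the effect the paper exploits to reduce the enrollment/test mismatch.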

    Mothers Reveal More of Their Vocal Identity When Talking to Infants

    Voice timbre – the unique acoustic information in a voice by which its speaker can be recognized – is particularly critical in mother-infant interaction. Correct identification of vocal timbre is necessary for infants to recognize their mothers as familiar both before and after birth, providing a basis for social bonding between infant and mother. The exact mechanisms underlying infant voice recognition remain ambiguous and have predominantly been studied in terms of the cognitive voice recognition abilities of the infant. Here, we show – for the first time – that caregivers actively maximize their chances of being correctly recognized by presenting more details of their vocal timbre through adjustments to their voices known as infant-directed speech (IDS) or baby talk, a vocal register that is widespread across most of the world's cultures. Using acoustic modelling (k-means clustering of Mel-frequency cepstral coefficients) of IDS in comparison with adult-directed speech (ADS), we found in two cohorts of speakers – US English and Swiss German mothers – that voice timbre clusters in IDS are significantly larger than comparable clusters in ADS. This effect leads to a more detailed representation of timbre in IDS, with subsequent benefits for recognition. Critically, an automatic speaker identification system using a Gaussian mixture model based on Mel-frequency cepstral coefficients performed significantly better in two experiments when trained with IDS as opposed to ADS. We argue that IDS has evolved as part of an adaptive set of evolutionary strategies that promote indexical signalling by caregivers to their offspring, thereby promoting social bonding via voice and the acquisition of linguistic systems.
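    The cluster-size comparison behind the abstract's key finding can be illustrated on synthetic "MFCC" frames: fit k-means separately to IDS-like and ADS-like features and measure cluster size as the mean distance of frames to their assigned centroid (per-frame inertia). The feature dimensions, cluster count, and spreads below are illustrative assumptions, not the paper's values.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    def mean_cluster_spread(features, k=4, seed=0):
        """Per-frame k-means inertia: a simple proxy for timbre cluster size."""
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
        return km.inertia_ / len(features)

    rng = np.random.default_rng(2)
    centers = rng.normal(size=(4, 12))
    # ADS proxy: tight clusters; IDS proxy: same centers, larger within-cluster spread
    ads = np.vstack([c + 0.3 * rng.normal(size=(100, 12)) for c in centers])
    ids_ = np.vstack([c + 0.8 * rng.normal(size=(100, 12)) for c in centers])

    print(mean_cluster_spread(ids_) > mean_cluster_spread(ads))  # → True
    ```

    Larger per-cluster spread in IDS corresponds to the "more detailed representation of timbre" the paper reports, which in turn benefits a downstream speaker-identification model trained on those features.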

    Real-time noise-robust speech detection

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 87-89). As part of the development of an autonomous forklift at the Agile Robotics Lab of MIT's Computer Science and Artificial Intelligence Lab (CSAIL), this thesis explores the effectiveness and application of various noise-robust techniques for real-time speech detection in real environments. Dynamic noises in the environment (including motor noise, babble noise, and other noises in a warehouse setting) can dramatically alter the speech signal, making speech detection much more difficult. Beyond the noise environments, another issue is the urgent nature of the situation, which leads to the production of shouted speech. Given these constraints, the forklift must be highly accurate in detecting speech at all times, since safety is a major concern in our application. This thesis analyzes speech properties that are useful for distinguishing speech from noise in various noise environments, examining various features in an effort to optimize the overall shout detection system. In addition to identifying speech features, this thesis also uses common signal processing techniques to enhance the speech signals in audio waveforms. Building on the optimal speech features and speech enhancement techniques, we present a shout detection algorithm optimized for the autonomous forklift application. We measure the performance of the resulting system by comparing it to other baseline systems and show a 38% improvement over a baseline system. By Kevin Y. Luu. M.Eng.
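    A common starting point for the kind of noise-robust speech/shout detection the thesis studies is frame-level energy with an adaptive noise-floor threshold. The sketch below is an assumption-laden minimal baseline, not the thesis' system: the frame length, percentile noise-floor estimate, and margin are invented parameters.

    ```python
    import numpy as np

    def frame_energies(signal, frame_len=160):
        """Mean-square energy per non-overlapping frame."""
        n = len(signal) // frame_len
        frames = signal[: n * frame_len].reshape(n, frame_len)
        return (frames ** 2).mean(axis=1)

    def detect_speech(signal, frame_len=160, margin_db=10.0):
        """Flag frames whose energy exceeds the noise floor by margin_db."""
        e = frame_energies(signal, frame_len)
        noise_floor = np.percentile(e, 20)  # assumes >=20% of frames are noise-only
        threshold = noise_floor * 10 ** (margin_db / 10)
        return e > threshold

    # Toy 1 s signal at 16 kHz: low-level noise with a loud burst in the middle
    rng = np.random.default_rng(3)
    sig = 0.01 * rng.normal(size=16000)
    sig[6000:10000] += 0.5 * np.sin(2 * np.pi * 300 * np.arange(4000) / 16000)
    flags = detect_speech(sig)
    print(flags[45], flags[5])  # → True False
    ```

    Real warehouse noise (motors, babble) is non-stationary, which is exactly why the thesis moves beyond such an energy baseline to richer features and enhancement front-ends.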

    Investigations into the Effects of Increased Vocal Effort on Speech, Including the Use Case of Automatic Speaker Recognition

    The goal of this work was to analyze the effects of increased vocal effort on speech and on speech-processing systems. Automatic speaker recognition with mismatched vocal effort between training and test data was chosen as the example scenario, with two focal points. First, the acoustic changes caused by increased vocal effort were analyzed. Then, a speaker verification system for mismatched vocal effort was realized. The investigation of acoustic changes under increased vocal effort dealt with the statistical analysis of spectral changes when vocal effort is raised. Different spectral parameters were analyzed and compared with one another. These parameters were then used for the automatic classification of vocal effort and systematically evaluated. Since it was known that both the spectrum and the fundamental frequency change, an analysis of the relationships between the fundamental frequency and the spectral parameters followed. To realize a speaker verification system for mismatched vocal effort, a framework for automatic speaker recognition was first implemented. To improve speaker recognition under differing vocal effort, various standard features, fundamental-frequency-based features, and newly developed features were evaluated with respect to the influence of increased vocal effort, and the best features and feature combinations were determined. For further improvement, methods for adapting the test data or the models were investigated. The best results of all tests were then presented in comparison. Finally, the results of the work were summarized and possible continuations of the work were outlined
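    One spectral parameter commonly used for the kind of vocal-effort analysis this thesis describes is spectral tilt, since raised vocal effort typically boosts high-frequency energy. The sketch below measures tilt as the dB ratio of energy above vs. below 1 kHz; the cutoff frequency and the toy sinusoidal signals are illustrative assumptions, not the thesis' parameters.

    ```python
    import numpy as np

    def spectral_tilt_db(signal, sr=16000, cutoff_hz=1000):
        """dB ratio of spectral energy above vs. below cutoff_hz."""
        spec = np.abs(np.fft.rfft(signal)) ** 2
        freqs = np.fft.rfftfreq(len(signal), 1 / sr)
        low = spec[freqs < cutoff_hz].sum()
        high = spec[freqs >= cutoff_hz].sum()
        return 10 * np.log10(high / low)

    sr = 16000
    t = np.arange(sr) / sr
    # "Normal effort" proxy: dominant low-frequency component
    normal = np.sin(2 * np.pi * 200 * t) + 0.1 * np.sin(2 * np.pi * 3000 * t)
    # "High effort" proxy: boosted high-frequency component
    loud = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)
    print(spectral_tilt_db(loud) > spectral_tilt_db(normal))  # → True
    ```

    A parameter like this could feed the automatic vocal-effort classification step the abstract mentions, alongside fundamental-frequency-based features.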