71 research outputs found

    Methods for speaking style conversion from normal speech to high vocal effort speech

    This thesis deals with vocal-effort-focused speaking style conversion (SSC). Specifically, we studied two topics on the conversion of normal speech to high vocal effort speech. The first topic involves the conversion of normal speech to shouted speech. We employed this conversion in a speaker recognition system with a vocal effort mismatch between test and enrollment utterances (shouted vs. normal speech). The mismatch degrades the system's speaker identification performance. As a solution, we proposed an SSC system that included a novel spectral mapping, used alongside a statistical mapping technique, to transform the mel-frequency spectral energies of normal speech enrollment utterances towards their counterparts in shouted speech. We evaluated the proposed solution by comparing speaker identification rates for a state-of-the-art i-vector-based speaker recognition system with and without applying SSC to the enrollment utterances. Our results showed that applying the proposed SSC pre-processing to the enrollment data considerably improves the speaker identification rates. The second topic involves normal-to-Lombard speech conversion. We proposed a vocoder-based parametric SSC system to perform the conversion. This system first extracts speech features using the vocoder. Next, a mapping technique, robust to data scarcity, maps the features. Finally, the vocoder synthesizes the mapped features into speech. For comparison, we used two vocoders in the conversion system: a glottal vocoder and the widely used STRAIGHT. We assessed the converted speech from the two vocoder cases with two subjective listening tests that measured similarity to Lombard speech and naturalness. The similarity test showed that, for both vocoder cases, our proposed SSC system was able to convert normal speech to Lombard speech. The naturalness test showed that samples converted using the glottal vocoder were clearly more natural than those obtained with STRAIGHT.
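    The idea of mapping normal-speech mel-frequency spectral energies towards their shouted counterparts can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis' actual statistical mapping: it assumes a small parallel set of frame-aligned feature pairs and fits an independent per-band linear map by least squares; the function names and toy data are invented for the example.

    ```python
    import numpy as np

    def fit_band_mapping(normal, shouted):
        """Fit y = a*x + b independently per mel band via least squares.

        normal, shouted: arrays of shape (frames, bands).
        Returns (a, b), each of shape (bands,).
        """
        a = np.empty(normal.shape[1])
        b = np.empty(normal.shape[1])
        for k in range(normal.shape[1]):
            a[k], b[k] = np.polyfit(normal[:, k], shouted[:, k], 1)
        return a, b

    def apply_mapping(normal, a, b):
        """Map normal-speech mel energies towards the shouted domain."""
        return normal * a + b

    # Toy demonstration with synthetic "mel energies"
    rng = np.random.default_rng(0)
    normal = rng.uniform(0.1, 1.0, size=(200, 20))
    shouted = 1.5 * normal + 0.2        # pretend shouting rescales each band
    a, b = fit_band_mapping(normal, shouted)
    mapped = apply_mapping(normal, a, b)
    print(np.allclose(mapped, shouted, atol=1e-6))  # → True
    ```

    In the thesis' setting, such a mapping would be applied to the enrollment utterances before i-vector extraction; a real system would use a richer statistical mapping than a per-band line.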

    Shouted Speech Compensation for Speaker Verification Robust to Vocal Effort Conditions

    The performance of speaker verification systems degrades when the vocal effort conditions of enrollment and test utterances differ (e.g., shouted vs. normal speech), a likely situation in non-cooperative speaker verification tasks. In this paper, we present a study of different methods for the linear compensation of embeddings, making use of Gaussian mixture models to cluster the shouted and normal speech domains. These compensation techniques are borrowed from the area of robustness for automatic speech recognition, and in this work we apply them to compensate for the mismatch between shouted and normal conditions in speaker verification. Before compensation, the shouted condition is automatically detected by means of logistic regression. The process is computationally light and is performed in the back-end of an x-vector system. Experimental results show that applying the proposed approach in the presence of vocal effort mismatch yields up to a 13.8% relative improvement in equal error rate with respect to a system that applies neither shouted speech detection nor compensation.
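    The back-end pipeline described above (detect the shouted condition, then linearly compensate the embedding) can be sketched as follows. This is a hedged simplification, not the paper's exact recipe: domain means stand in for the GMM clusters, compensation is a plain mean shift, and all data and names are synthetic assumptions.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    dim = 16
    normal = rng.normal(0.0, 1.0, size=(300, dim))   # "normal" embeddings
    shouted = rng.normal(2.0, 1.0, size=(300, dim))  # shifted "shouted" domain

    # 1) Train the shouted/normal detector (logistic regression, as in the paper).
    X = np.vstack([normal, shouted])
    y = np.concatenate([np.zeros(300), np.ones(300)])
    detector = LogisticRegression().fit(X, y)

    # 2) Estimate a linear compensation from the domain means
    #    (a stand-in for the paper's GMM-based compensation).
    offset = normal.mean(axis=0) - shouted.mean(axis=0)

    def compensate(emb):
        """Shift an embedding towards the normal domain iff detected as shouted."""
        if detector.predict(emb[None, :])[0] == 1:
            return emb + offset
        return emb

    test_emb = rng.normal(2.0, 1.0, size=dim)        # a shouted test vector
    comp = compensate(test_emb)
    print(np.linalg.norm(comp - normal.mean(axis=0)) <
          np.linalg.norm(test_emb - normal.mean(axis=0)))  # → True
    ```

    The compensated embedding lands closer to the normal-speech domain, which is the effect the paper exploits to reduce the enrollment/test mismatch.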

    Mothers Reveal More of Their Vocal Identity When Talking to Infants

    Voice timbre – the unique acoustic information in a voice by which its speaker can be recognized – is particularly critical in mother-infant interaction. Correct identification of vocal timbre is necessary for infants to recognize their mothers as familiar both before and after birth, providing a basis for social bonding between infant and mother. The exact mechanisms underlying infant voice recognition remain ambiguous and have predominantly been studied in terms of the cognitive voice recognition abilities of the infant. Here, we show – for the first time – that caregivers actively maximize their chances of being correctly recognized by presenting more details of their vocal timbre through adjustments to their voices known as infant-directed speech (IDS) or baby talk, a vocal register that is widespread across most of the world's cultures. Using acoustic modelling (k-means clustering of Mel-frequency cepstral coefficients) of IDS in comparison with adult-directed speech (ADS), we found in two cohorts of speakers – US English and Swiss German mothers – that voice timbre clusters in IDS are significantly larger than comparable clusters in ADS. This effect leads to a more detailed representation of timbre in IDS, with subsequent benefits for recognition. Critically, an automatic speaker identification system using a Gaussian mixture model based on Mel-frequency cepstral coefficients performed significantly better in two experiments when trained with IDS as opposed to ADS. We argue that IDS has evolved as part of an adaptive set of evolutionary strategies that promote indexical signalling by caregivers to their offspring, thereby promoting social bonding via voice and the acquisition of linguistic systems.
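    The cluster-size comparison behind the abstract's key finding can be illustrated on synthetic "MFCC" frames: fit k-means separately to IDS-like and ADS-like features and measure cluster size as the mean distance of frames to their assigned centroid (per-frame inertia). The feature dimensions, cluster count, and spreads below are illustrative assumptions, not the paper's values.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    def mean_cluster_spread(features, k=4, seed=0):
        """Per-frame k-means inertia: a simple proxy for timbre cluster size."""
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
        return km.inertia_ / len(features)

    rng = np.random.default_rng(2)
    centers = rng.normal(size=(4, 12))
    # ADS proxy: tight clusters; IDS proxy: same centers, larger within-cluster spread
    ads = np.vstack([c + 0.3 * rng.normal(size=(100, 12)) for c in centers])
    ids_ = np.vstack([c + 0.8 * rng.normal(size=(100, 12)) for c in centers])

    print(mean_cluster_spread(ids_) > mean_cluster_spread(ads))  # → True
    ```

    Larger per-cluster spread in IDS corresponds to the "more detailed representation of timbre" the paper reports, which in turn benefits a downstream speaker-identification model trained on those features.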

    Real-time noise-robust speech detection

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 87-89). As part of the development of an autonomous forklift at the Agile Robotics Lab of MIT's Computer Science and Artificial Intelligence Lab (CSAIL), this thesis explores the effectiveness and application of various noise-robust techniques for real-time speech detection in real environments. Dynamic noises in the environment (including motor noise, babble noise, and other noises in a warehouse setting) can dramatically alter the speech signal, making speech detection much more difficult. Beyond the noise environments, another issue is the urgent nature of the situation, which leads to the production of shouted speech. Given these constraints, the forklift must be highly accurate in detecting speech at all times, since safety is a major concern in our application. This thesis analyzes speech properties that are useful for distinguishing speech from noise in various noise environments, examining various features in an effort to optimize the overall shout detection system. In addition to identifying speech features, this thesis also uses common signal processing techniques to enhance the speech signals in audio waveforms. Building on the optimal speech features and speech enhancement techniques, we present a shout detection algorithm optimized for the autonomous forklift application. We measure the performance of the resulting system by comparing it to other baseline systems and show a 38% improvement over a baseline system. By Kevin Y. Luu. M.Eng.
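    A common starting point for the kind of noise-robust speech/shout detection the thesis studies is frame-level energy with an adaptive noise-floor threshold. The sketch below is an assumption-laden minimal baseline, not the thesis' system: the frame length, percentile noise-floor estimate, and margin are invented parameters.

    ```python
    import numpy as np

    def frame_energies(signal, frame_len=160):
        """Mean-square energy per non-overlapping frame."""
        n = len(signal) // frame_len
        frames = signal[: n * frame_len].reshape(n, frame_len)
        return (frames ** 2).mean(axis=1)

    def detect_speech(signal, frame_len=160, margin_db=10.0):
        """Flag frames whose energy exceeds the noise floor by margin_db."""
        e = frame_energies(signal, frame_len)
        noise_floor = np.percentile(e, 20)  # assumes >=20% of frames are noise-only
        threshold = noise_floor * 10 ** (margin_db / 10)
        return e > threshold

    # Toy 1 s signal at 16 kHz: low-level noise with a loud burst in the middle
    rng = np.random.default_rng(3)
    sig = 0.01 * rng.normal(size=16000)
    sig[6000:10000] += 0.5 * np.sin(2 * np.pi * 300 * np.arange(4000) / 16000)
    flags = detect_speech(sig)
    print(flags[45], flags[5])  # → True False
    ```

    Real warehouse noise (motors, babble) is non-stationary, which is exactly why the thesis moves beyond such an energy baseline to richer features and enhancement front-ends.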

    Investigations into the Effects of Increased Vocal Effort on Speech, Including the Use Case of Automatic Speaker Recognition

    The goal of this work was to analyze the effects of increased vocal effort on speech and on speech-processing systems. Automatic speaker recognition with mismatched vocal effort between training and test data was chosen as the example scenario, with two focal points. First, the acoustic changes caused by increased vocal effort were analyzed. Then, a speaker verification system for mismatched vocal effort was realized. The investigation of acoustic changes under increased vocal effort dealt with the statistical analysis of spectral changes when vocal effort is raised. Different spectral parameters were analyzed and compared with one another. These parameters were then used for the automatic classification of vocal effort and systematically evaluated. Since it was known that both the spectrum and the fundamental frequency change, an analysis of the relationships between the fundamental frequency and the spectral parameters followed. To realize a speaker verification system for mismatched vocal effort, a framework for automatic speaker recognition was first implemented. To improve speaker recognition under differing vocal effort, various standard features, fundamental-frequency-based features, and newly developed features were evaluated with respect to the influence of increased vocal effort, and the best features and feature combinations were determined. For further improvement, methods for adapting the test data or the models were investigated. The best results of all tests were then presented in comparison. Finally, the results of the work were summarized and possible continuations of the work were outlined
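    One spectral parameter commonly used for the kind of vocal-effort analysis this thesis describes is spectral tilt, since raised vocal effort typically boosts high-frequency energy. The sketch below measures tilt as the dB ratio of energy above vs. below 1 kHz; the cutoff frequency and the toy sinusoidal signals are illustrative assumptions, not the thesis' parameters.

    ```python
    import numpy as np

    def spectral_tilt_db(signal, sr=16000, cutoff_hz=1000):
        """dB ratio of spectral energy above vs. below cutoff_hz."""
        spec = np.abs(np.fft.rfft(signal)) ** 2
        freqs = np.fft.rfftfreq(len(signal), 1 / sr)
        low = spec[freqs < cutoff_hz].sum()
        high = spec[freqs >= cutoff_hz].sum()
        return 10 * np.log10(high / low)

    sr = 16000
    t = np.arange(sr) / sr
    # "Normal effort" proxy: dominant low-frequency component
    normal = np.sin(2 * np.pi * 200 * t) + 0.1 * np.sin(2 * np.pi * 3000 * t)
    # "High effort" proxy: boosted high-frequency component
    loud = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)
    print(spectral_tilt_db(loud) > spectral_tilt_db(normal))  # → True
    ```

    A parameter like this could feed the automatic vocal-effort classification step the abstract mentions, alongside fundamental-frequency-based features.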