
    Intelligibility enhancement of synthetic speech in noise

    EC Seventh Framework Programme (FP7/2007-2013).

    Speech technology can facilitate human-machine interaction and create new communication interfaces. Text-To-Speech (TTS) systems provide speech output for dialogue, notification and reading applications, as well as personalized voices for people who have lost the use of their own. TTS systems are built to produce synthetic voices that should sound as natural, expressive and intelligible as possible and, if necessary, resemble a particular speaker. Although naturalness is an important requirement, providing the correct information in adverse conditions can be crucial to certain applications, and speech that adapts or reacts to different listening conditions can in turn be more expressive and natural.

    In this work we focus on enhancing the intelligibility of TTS voices in additive noise. For that we adopt the statistical parametric paradigm for TTS in the shape of a hidden Markov model (HMM) based speech synthesis system, which allows for flexible enhancement strategies. Little is known about which human speech production mechanisms actually increase intelligibility in noise and how the choice of mechanism relates to noise type, so we approached the problem from another perspective: using mathematical models of hearing speech in noise. To find which models are better at predicting the intelligibility of TTS in noise, we performed listening evaluations to collect subjective intelligibility scores, which we then compared to the models' predictions. In these evaluations we observed that modifications performed on the spectral envelope of speech can increase intelligibility significantly, particularly if the strength of the modification depends on the noise and its level. We used these findings to inform the decision of which model to use when automatically modifying the spectral envelope of the speech according to the noise.

    We devised two methods, both involving cepstral coefficient modifications. The first was applied during feature extraction while training the acoustic models, and the second when generating a voice using pre-trained TTS models; the latter has the advantage of being able to address fluctuating noise. To increase the intelligibility of synthetic speech at generation time, we proposed a method for Mel cepstral coefficient modification based on the glimpse proportion measure, the most promising of the models of speech intelligibility that we evaluated. An extensive series of listening experiments demonstrated that this method brings significant intelligibility gains to TTS voices while not requiring additional recordings of clear or Lombard speech. To further improve intelligibility we combined our method with noise-independent enhancement approaches based on the acoustics of highly intelligible speech. This combined solution was as effective for stationary noise as for the challenging competing-speaker scenario, obtaining up to 4 dB of equivalent intensity gain.

    Finally, we proposed an extension to the speech enhancement paradigm that accounts not only for energetic masking of signals but also for the linguistic confusability of words in sentences. We found that word-level confusability, a challenging value to predict, can be used as an additional prior to increase intelligibility even for simple enhancement methods such as energy reallocation between words. These findings motivate further research into solutions that can tackle the effect of energetic masking on the auditory system as well as on higher levels of processing.
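
    The glimpse proportion measure referenced above scores intelligibility by counting the time-frequency regions where the speech is locally audible above the noise. As a rough illustration only (the published measure operates on auditory excitation patterns from a gammatone filterbank rather than a plain STFT, and the function name, frame length and 3 dB margin below are assumptions), a minimal sketch could look like this:

```python
import numpy as np
from scipy.signal import stft

def glimpse_proportion(speech, noise, fs, threshold_db=3.0):
    """Fraction of time-frequency cells where the speech level exceeds the
    noise level by at least `threshold_db` (a crude stand-in for the
    glimpse proportion measure)."""
    # Align lengths so the two spectrograms share the same framing.
    n = min(len(speech), len(noise))
    _, _, S = stft(speech[:n], fs=fs, nperseg=512)
    _, _, N = stft(noise[:n], fs=fs, nperseg=512)
    eps = 1e-12  # avoid log of zero
    speech_db = 20.0 * np.log10(np.abs(S) + eps)
    noise_db = 20.0 * np.log10(np.abs(N) + eps)
    glimpsed = speech_db > noise_db + threshold_db
    return float(glimpsed.mean())

# Toy example: a tone-like "speech" signal masked by white noise.
fs = 16000
t = np.arange(fs) / fs
speech_like = np.sin(2 * np.pi * 440 * t)
masker = np.random.randn(fs)
print(glimpse_proportion(speech_like, masker, fs))
```

    The enhancement method described above goes further, using a measure of this kind as an objective for modifying the Mel cepstral coefficients; that optimisation step is not shown in this sketch.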

    Why are some languages confused for others? Investigating data from the Great Language Game

    In this paper we explore the results of a large-scale online game called 'the Great Language Game', in which people listen to an audio speech sample and make a forced-choice guess about the identity of the language from two or more alternatives. The data include 15 million guesses from 400 audio recordings of 78 languages. We investigate which languages are confused for which in the game, and whether this correlates with the similarities that linguists identify between languages, including shared lexical items, similar sound inventories and established historical relationships. As expected, we find that players are more likely to confuse two languages that are objectively more similar. We also investigate factors that may affect players' ability to accurately select the target language, such as how many people speak the language, how often the language is mentioned in written materials, and the economic power of the target language community. We see that non-linguistic factors affect players' ability to accurately identify the target; for example, languages with wider 'global reach' are more often identified correctly. This suggests that both linguistic and cultural knowledge influence the perception and recognition of languages and their similarity.
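
    The core analysis described above amounts to building a confusion matrix over language pairs from the guess records and relating it to external similarity measures. A minimal sketch, with invented records and assumed column names (the paper's actual data format and similarity scores are not shown here):

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical guess records: 'target' is the language of the clip,
# 'guess' is the player's forced-choice answer.
guesses = pd.DataFrame({
    "target": ["Swedish", "Swedish", "Norwegian", "Hindi", "Hindi"],
    "guess":  ["Norwegian", "Swedish", "Swedish", "Urdu", "Hindi"],
})

# Row-normalised confusion matrix: an estimate of P(guess | target).
confusion = pd.crosstab(guesses["target"], guesses["guess"], normalize="index")
print(confusion)

# Toy check of whether confusion rates track an external similarity score
# for the same ordered language pairs (all values invented for illustration).
confusion_rates = [0.5, 0.5, 0.0]
lexical_similarity = [0.8, 0.8, 0.1]
print(spearmanr(confusion_rates, lexical_similarity))
```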

    Pronunciation modeling for ASR - knowledge-based and data-derived methods.

    This article focuses on modeling pronunciation variation in two different ways: data-derived and knowledge-based. The knowledge-based approach consists of using phonological rules to generate variants. The data-derived approach consists of performing phone recognition, followed by smoothing using decision trees (D-trees) to alleviate some of the errors in the phone recognition. Using phonological rules led to a small improvement in word error rate (WER); a data-derived approach in which the phone recognition was smoothed using D-trees prior to lexicon generation led to larger improvements over the baseline. The lexicon was employed in two different recognition systems, a hybrid HMM/ANN system and an HMM-based system, to ascertain whether pronunciation variation was truly being modeled. This proved to be the case, as no significant differences were found between the results obtained with the two systems. Furthermore, we found that 10% of the variants generated by the phonological rules were also found using phone recognition, and this increased to 28% when the phone recognition output was smoothed using D-trees. This indicates that the D-trees generalize beyond what has been seen in the training material, whereas when the phone recognition approach is employed directly, unseen pronunciations cannot be predicted. In addition, we propose a metric to measure confusability in the lexicon. Using this confusion metric to prune variants results in roughly the same improvement as using the D-tree method.
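
    The knowledge-based branch applies phonological rules to canonical pronunciations to produce variants for the lexicon. A minimal sketch of that idea, with made-up rules over an ARPAbet-like phone string (the article's actual rule set and notation are not reproduced here):

```python
import re

# Illustrative rules only: each maps a phone-string pattern to a variant
# realisation over space-separated phones.
RULES = [
    (r"ax n", "n"),   # schwa deletion before /n/
    (r" t$", ""),     # word-final /t/ deletion
]

def variants(baseform):
    """Generate pronunciation variants by optionally applying each rule."""
    forms = {baseform}
    for pattern, replacement in RULES:
        for form in list(forms):
            new = re.sub(pattern, replacement, form).strip()
            if new and new != form:
                forms.add(new)
    return sorted(forms)

# Hypothetical baseform for "different".
print(variants("d ih f r ax n t"))
```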

    Language-specificity in auditory perception of Chinese tones

    PL1213. LoC Subject Headings: Auditory perception; Chinese language--Tone; Chinese language--Phonology

    Communicative efficiency in the lexicon

    Thesis (Ph.D. in Linguistics), Massachusetts Institute of Technology, Dept. of Linguistics and Philosophy, 2012. Cataloged from the PDF version of the thesis. Includes bibliographical references (p. 150-160). By Peter Nepomuk Herwig Maria Graff.

    In this dissertation, I argue that a variety of probabilistic patterns in natural language phonology derive from communicative efficiency. I present evidence from phonetically transcribed dictionaries of 60 languages from 25 major language families showing that both the probability distributions over phonological structures licensed by the categorical grammar and the global organization of the phonological lexicon as a whole facilitate the efficient communication of intended messages from speaker to listener. Specifically, I show that the occurrence probabilities of different grammatical structures render natural language phonology an efficient code for communication, given the effort involved in producing different categories and the specific kinds of noise introduced by the human language channel. I also present evidence that co-occurrence restrictions on consonants sharing place features serve a communicative purpose in that they facilitate the identification of words with respect to each other.

    Furthermore, I show that the organization of the phonological lexicon as a whole is subject to communicative efficiency. Concretely, I show that words in human language preferentially rely on highly perceptible contrasts for distinctness, beyond what is expected from the probabilistic patterning of the individual sounds that distinguish them. This shows that redundancy in the phonological code is not randomly distributed, but exists to supplement imperceptible distinctions between larger units as needed. I argue that cross-linguistic biases in the distributions of individual sounds arise from humans using their language in ways that accommodate anticipated mistransmission (Jurafsky et al. 2001, van Son and Pols 2003, Aylett and Turk 2004), thus presenting a serious challenge to theories relegating the emergence of communicative efficiency in phonology to properties of the human language channel only (Ohala 1981, Blevins 2004, 2006). Finally, I present preliminary computational and experimental evidence that the optimization of the lexicon as a whole could have arisen from the aggregate effects of speakers' biases toward using globally distinct word forms over the course of a language's history (cf. Martin, 2007).

    Systematicity, motivatedness, and the structure of the lexicon

    For the majority of the 20th century, one of the central dogmas of linguistics was that, at the level of the lexicon, the relationship between words and meanings is arbitrary: there is nothing about the word 'dog', for example, that makes it a particularly good label for a dog. However, in recent years it has become increasingly recognized that non-arbitrary associations between words and meanings make up a small, but potentially important, portion of the lexicon. This thesis focuses on exploring the effect that non-arbitrary associations between words and meanings have on language learning and the structure of the lexicon. Based on a critical analysis of the existing literature, and the results of a number of experiments presented here, I suggest that the overall prevalence and developmental timing of two forms of non-arbitrariness in the lexicon – systematicity and motivatedness – are shaped by the pressure for languages to be learnable while remaining expressive. The effects of pressures for learnability and expressivity have been recognized to have important implications for the structure of language generally, but have so far not been applied to explain structure at the level of the lexicon.

    The central claim presented in this dissertation is that features of the perceptual and cognitive organization of humans result in specific types of associations between words and meanings being easier for naïve learners to acquire than others, and that the pressure for languages to be learnable results in lexica that leverage these human biases. Taking advantage of these biases, however, induces constraints on the structure of the lexicon that, left unchecked, might limit its expressivity or penalize subsequent learning. Thus, lexica are structured such that early-acquired words are able to leverage these biases while avoiding the limitations imposed by those biases when they are extended past a certain point.

    Verbal learning, phonological processing and reading skills in normal and dyslexic readers.

    Available from the British Library Document Supply Centre (BLDSC), DSC:DX207218. SIGLE record, United Kingdom.