New Nonsense Syllables Database -- Analyses and Preliminary ASR Experiments
The paper presents analyses, modifications, and first experiments with a new nonsense-syllables database. Results of preliminary phoneme-recognition experiments are given and discussed.
Computational Approaches to Modeling Speaker State in the Medical Domain
Recently, researchers in computer science and engineering have begun to explore the possibility of finding speech-based correlates of various medical conditions using automatic, computational methods. If such language cues can be identified and quantified automatically, this information can be used to support the diagnosis and treatment of medical conditions in clinical settings and to further fundamental research in understanding cognition. This chapter reviews computational approaches that explore communicative patterns of patients who suffer from medical conditions such as depression, autism spectrum disorders, schizophrenia, and cancer. Two main approaches are discussed: research that explores features extracted from the acoustic signal, and research that focuses on lexical and semantic features. We also present some applied research that uses computational methods to develop assistive technologies. In the final sections we discuss issues related to, and the future of, this emerging field of research.
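As a concrete illustration of the two feature families the chapter contrasts, here is a minimal, self-contained sketch of one acoustic feature pair (log energy and zero-crossing rate) and one lexical feature (type-token ratio). The signal and transcript are toy data invented for illustration, not from any clinical dataset:

```python
import numpy as np

def short_time_features(signal, frame_len=400, hop=160):
    """Frame a waveform and compute two simple acoustic features per
    frame: log energy and zero-crossing rate."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = np.log(np.sum(frame ** 2) + 1e-10)
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
        feats.append((energy, zcr))
    return np.array(feats)

def type_token_ratio(transcript):
    """A simple lexical feature: vocabulary richness of a transcript."""
    tokens = transcript.lower().split()
    return len(set(tokens)) / len(tokens)

# Toy example: 1 s of noise at 16 kHz and a short invented transcript.
rng = np.random.default_rng(0)
acoustic = short_time_features(rng.standard_normal(16000))
lexical = type_token_ratio("I feel tired I feel very very tired")
```

In practice such low-level features would be aggregated over an utterance and fed to a classifier alongside the lexical measures.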
SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DYSARTHRIC SPEECH RECOGNITION
Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility due to slow, uncoordinated control of the speech production muscles. Automatic Speech Recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers.
In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand differences in the prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking are proposed. For dysarthric speech synthesis, this dissertation introduces a modified neural multi-talker TTS with a dysarthria severity level coefficient and a pause insertion model, so that dysarthric speech can be synthesized at varying severity levels. In addition, we extend this work with a label propagation technique to create more meaningful control variables, such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that only provide discrete dysarthria severity levels. This approach increases the controllability of the system, allowing us to generate dysarthric speech spanning a broader range of characteristics.
To evaluate their effectiveness for synthesizing training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a 12.2% WER improvement over the baseline, and that adding the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of these parameters. Overall results on the TORGO database demonstrate that using synthetic speech to increase the amount of dysarthric-patterned training data has a significant impact on dysarthric ASR performance.
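For readers unfamiliar with the evaluation metric, word error rate (WER) is the word-level Levenshtein distance between a reference and a hypothesis transcript, normalized by reference length. A minimal sketch with toy transcripts (not drawn from TORGO):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions)
    divided by reference length, via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[-1][-1] / len(ref)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```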
Automatic Feedback for L2 Prosody Learning
We have designed automatic feedback for the realisation of the prosody of a foreign language. Besides classical F0 displays, two kinds of feedback are provided to learners, each based on a comparison between a reference and the learner's production. The first, a diagnosis, provided both as a short text and as visual displays such as arrows, comes from an acoustic evaluation of the learner's realisation; it deals with two prosodic cues: the melodic curve and phoneme duration. The second feedback is perceptual and consists of replacing the learner's prosodic cues (duration and F0) with those of the reference. A pilot experiment was undertaken to test the immediate impact of the "advanced" feedback proposed here. We chose to test the production of English lexical accent in isolated words by French speakers. The experiment shows that feedback based on diagnosis and speech modification enables French learners with a low production level to improve their realisations of English lexical accents more than simple auditory feedback does. By contrast, for the advanced learners in this study, auditory feedback appears to be as efficient as the more elaborate feedback.
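The diagnostic half of such feedback can be illustrated with a minimal sketch: comparing a learner's per-phoneme durations against a reference and emitting short textual diagnoses. The alignment format and the 25% tolerance threshold are invented for illustration, not taken from the paper:

```python
def duration_feedback(alignment, tolerance=0.25):
    """alignment: list of (phoneme, reference_duration, learner_duration)
    in seconds. Returns one short diagnosis per phoneme, mimicking the
    short-text feedback described above. The tolerance is arbitrary."""
    messages = []
    for phoneme, ref, learner in alignment:
        ratio = learner / ref
        if ratio > 1 + tolerance:
            messages.append(f"{phoneme}: too long, shorten it")
        elif ratio < 1 - tolerance:
            messages.append(f"{phoneme}: too short, lengthen it")
        else:
            messages.append(f"{phoneme}: duration OK")
    return messages

# Toy alignment: the learner under-realises the stressed vowel.
print(duration_feedback([("r", 0.06, 0.06),
                         ("e", 0.12, 0.05),
                         ("k", 0.07, 0.08)]))
```

A real system would obtain the alignment from forced alignment of learner and reference recordings, and would pair this with the F0 comparison.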
Directions for the future of technology in pronunciation research and teaching
This paper reports on the role of technology in state-of-the-art pronunciation research and instruction, and makes concrete suggestions for future developments. The point of departure for this contribution is that the goal of second language (L2) pronunciation research and teaching should be enhanced comprehensibility and intelligibility, as opposed to native-likeness. Three main areas are covered here. We begin with a presentation of advanced uses of pronunciation technology in research, with a special focus on the expertise required to carry out even small-scale investigations. Next, we discuss the nature of data in pronunciation research, pointing to ways in which future work can build on advances in corpus research and crowdsourcing. Finally, we consider how these insights pave the way for researchers and developers working to create research-informed, computer-assisted pronunciation teaching resources. We conclude with predictions for future developments.
WHERE IS THE LOCUS OF DIFFICULTY IN RECOGNIZING FOREIGN-ACCENTED WORDS? NEIGHBORHOOD DENSITY AND PHONOTACTIC PROBABILITY EFFECTS ON THE RECOGNITION OF FOREIGN-ACCENTED WORDS BY NATIVE ENGLISH LISTENERS
This series of experiments (1) examined whether native listeners experience recognition difficulty with all foreign-accented words or only with a subset of words having certain lexical and sub-lexical characteristics, namely neighborhood density and phonotactic probability; (2) identified the locus of foreign-accented word recognition difficulty; and (3) investigated how accent-induced mismatches impact the lexical retrieval process. Experiments 1 and 4 examined the recognition of native-produced and foreign-accented words varying in neighborhood density with auditory lexical decision and perceptual identification tasks, respectively, which emphasize the lexical level of processing. Findings from Experiment 1 revealed an increased accent-induced processing cost in reaction times, especially for words with many similar-sounding neighbors, implying that native listeners increase their reliance on top-down lexical knowledge during foreign-accented word recognition. Analysis of perception errors from Experiment 4 found the misperceptions in the foreign-accented condition to be more similar to the target words than those in the native-produced condition. This suggests that accent-induced mismatches tend to activate similar-sounding words as alternative word candidates, which likely pose increased lexical competition for the target word and result in greater processing costs for foreign-accented word recognition at the lexical level. Experiments 2 and 3 examined the sub-lexical processing of foreign-accented words varying in neighborhood density and phonotactic probability, respectively, with a same-different matching task, which emphasizes the sub-lexical level of processing. Findings from both experiments revealed no extra processing costs, in either reaction times or accuracy rates, for the foreign-accented stimuli, implying that the sub-lexical processing of foreign-accented words is as good as that of native-produced words.
Taken together, the overall recognition difficulty of foreign-accented stimuli, as well as the differentially increased processing difficulty for accented dense words (observed in Experiment 1), mainly stems from the lexical level, due to the increased lexical competition posed by similar-sounding word candidates.
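Neighborhood density is conventionally operationalized as the number of words that differ from the target by a single phoneme substitution, deletion, or addition. A toy sketch with an invented mini-lexicon:

```python
def one_phoneme_apart(w1, w2):
    """True if w2 can be formed from w1 by substituting, deleting, or
    adding exactly one phoneme (words given as phoneme tuples)."""
    if w1 == w2:
        return False
    la, lb = len(w1), len(w2)
    if abs(la - lb) > 1:
        return False
    if la == lb:  # one substitution
        return sum(a != b for a, b in zip(w1, w2)) == 1
    longer, shorter = (w1, w2) if la > lb else (w2, w1)
    for i in range(len(longer)):  # one deletion from the longer word
        if longer[:i] + longer[i + 1:] == shorter:
            return True
    return False

def neighborhood_density(word, lexicon):
    """Number of lexicon entries exactly one phoneme away from `word`."""
    return sum(one_phoneme_apart(word, other) for other in lexicon)

# Hypothetical mini-lexicon in a rough phonemic notation:
# cat, bat, caught, cats, dog.
lexicon = [("k", "ae", "t"), ("b", "ae", "t"), ("k", "ao", "t"),
           ("k", "ae", "t", "s"), ("d", "ao", "g")]
print(neighborhood_density(("k", "ae", "t"), lexicon))  # 3
```

In real studies the counts come from a full phonemically transcribed lexicon, often weighted by word frequency.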
A Study of Accommodation of Prosodic and Temporal Features in Spoken Dialogues in View of Speech Technology Applications
Inter-speaker accommodation is a well-known property of human speech and human interaction in general. Broadly, it refers to the behavioural patterns of two (or more) interactants and the effect of the (verbal and non-verbal) behaviour of each on that of the other(s). Implementation of this behaviour in spoken dialogue systems is desirable as an improvement on the naturalness of human-machine interaction. However, traditional qualitative descriptions of accommodation phenomena do not provide sufficient information for such an implementation. Therefore, a quantitative description of inter-speaker accommodation is required. This thesis proposes a methodology for monitoring accommodation during a human or human-computer dialogue, which utilizes a moving average filter over sequential frames for each speaker. These frames are time-aligned across the speakers, hence the name Time Aligned Moving Average (TAMA). Analysis of spontaneous human dialogue recordings by means of the TAMA methodology reveals ubiquitous accommodation of prosodic features (pitch, intensity and speech rate) across interlocutors, and allows for statistical (time series) modeling of the behaviour, in a way which is meaningful for implementation in spoken dialogue system (SDS) environments. In addition, a novel dialogue representation is proposed that provides an additional point of view to that of TAMA in monitoring accommodation of temporal features (inter-speaker pause length and overlap frequency). This representation is a percentage turn distribution of individual speaker contributions in a dialogue frame, which circumvents strict attribution of speaker turns by considering both interlocutors as synchronously active. Both TAMA and turn distribution metrics indicate that correlation of average pause length and overlap frequency between speakers can be attributed to accommodation (a debated issue), and point to possible improvements in SDS turn-taking behaviour.
Although the findings of the prosodic and temporal analyses can directly inform SDS implementations, further work is required in order to describe inter-speaker accommodation sufficiently, as well as to develop an adequate testing platform for evaluating the magnitude of perceived improvement in human-machine interaction. Therefore, this thesis constitutes a first step towards a convincingly useful implementation of accommodation in spoken dialogue systems.
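The core of the TAMA idea (a moving average of a prosodic feature over fixed-length frames that are time-aligned across speakers, so the two resulting series can be compared directly) can be sketched as follows. The frame length, hop, and toy pitch data are illustrative assumptions, not the thesis's actual settings:

```python
import numpy as np

def tama_series(times, values, frame_len=20.0, hop=10.0, t_end=None):
    """Time Aligned Moving Average: average a prosodic feature (e.g.
    pitch) over overlapping, fixed-length time frames. Frame length and
    hop (in seconds) are illustrative choices."""
    times, values = np.asarray(times), np.asarray(values)
    t_end = times.max() if t_end is None else t_end
    starts = np.arange(0.0, t_end - frame_len + hop, hop)
    series = []
    for s in starts:
        in_frame = (times >= s) & (times < s + frame_len)
        series.append(values[in_frame].mean() if in_frame.any() else np.nan)
    return np.array(series)

# Two speakers' toy pitch tracks (time in s, pitch in Hz) sharing a slow
# upward drift; because the frames are aligned in time, the two TAMA
# series can be correlated directly as an accommodation indicator.
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 120, 300))
a = tama_series(t, 120 + t * 0.3 + rng.normal(0, 5, t.size))
b = tama_series(t, 180 + t * 0.3 + rng.normal(0, 5, t.size))
r = np.corrcoef(a, b)[0, 1]  # high r suggests convergent behaviour
```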
Truncation Confusion Patterns in Onset Consonants
Confusion matrices and truncation experiments have long been part of psychoacoustic experimentation. However, confusion matrices are seldom used to analyze truncation experiments. A truncation experiment was conducted and the confusion patterns were analyzed for six consonant-vowels (CVs). The confusion patterns show significant structure as the CV is truncated from the onset of the consonant. These confusions correlate with articulatory and acoustic features, as well as with other related CVs. The confusion patterns are shown and explored as they relate to human speech recognition.
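Building a confusion matrix from truncation-experiment trials can be sketched as follows; the CV labels and response counts here are invented for illustration, not the paper's data:

```python
from collections import Counter

def confusion_matrix(trials, labels):
    """Build a row-normalized confusion matrix from (stimulus, response)
    pairs; rows are presented CVs, columns are reported CVs."""
    counts = Counter(trials)
    matrix = {}
    for stim in labels:
        row_total = sum(counts[(stim, r)] for r in labels)
        matrix[stim] = {r: counts[(stim, r)] / row_total if row_total else 0.0
                        for r in labels}
    return matrix

# Toy responses to /pa/ at two truncation points: confusions with other
# stop CVs grow as more of the consonant onset is removed.
labels = ["pa", "ta", "ka"]
mild = confusion_matrix([("pa", "pa")] * 9 + [("pa", "ta")], labels)
severe = confusion_matrix([("pa", "pa")] * 4 + [("pa", "ta")] * 4
                          + [("pa", "ka")] * 2, labels)
print(mild["pa"]["pa"], severe["pa"]["pa"])  # 0.9 0.4
```

Comparing such rows across truncation points is exactly the kind of structure-over-truncation analysis the abstract describes.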
Speech rhythm: the language-specific integration of pitch and duration
Experimental phonetic research on speech rhythm seems to have reached an impasse. Recently, this research field has tended to investigate produced (rather than perceived) rhythm, focussing on timing, i.e. duration as an acoustic cue, and has not considered that rhythm perception might be influenced by native language. Yet evidence from other areas of phonetics, and other disciplines, suggests that an investigation of rhythm is needed which (i) focuses on listeners’ perception, (ii) acknowledges the role of several acoustic cues, and (iii) explores whether the relative significance of these cues differs between languages. This thesis, the originality of which derives from its adoption of these three perspectives combined, indicates new directions for progress. A series of perceptual experiments investigated the interaction of duration and f0 as perceptual cues to prosody in languages with different prosodic structures – Swiss German, Swiss French, and French (i.e. from France). The first experiment demonstrated that a dynamic f0 increases perceived syllable duration in contextually isolated pairs of monosyllables, for all three language groups. The second experiment found that dynamic f0 and increased duration interact as cues to rhythmic groups in series of monosyllabic digits and letters; the two cues were significantly more effective than one when heard simultaneously, but significantly less effective than one when heard in conflicting positions around the rhythmic-group boundary location, and native language influenced whether f0 or duration was the more effective cue.
These two experiments laid the basis for the third, which directly addressed rhythm. Listeners were asked to judge the rhythmicality of sentences with systematic duration and f0 manipulations; the results provide evidence that duration and f0 are interdependent cues in rhythm perception, and that the weighting of each cue varies across languages. A fourth experiment applied the perceptual results to production data, to develop a rhythm metric which captures the multi-dimensional and language-specific nature of perceived rhythm in speech production. These findings have the important implication that if future phonetic research on rhythm follows these new perspectives, it may circumvent the impasse and advance our knowledge and models of speech rhythm. This work was funded by an AHRC doctoral award to the author.
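Rhythm metrics over production data typically quantify durational variability across successive intervals; a standard example from this literature (not necessarily the metric developed in this thesis) is the normalized Pairwise Variability Index (nPVI), sketched here:

```python
def npvi(durations):
    """Normalized Pairwise Variability Index over a sequence of interval
    durations (e.g. vocalic intervals, in seconds): the mean normalized
    difference between successive intervals, scaled by 100."""
    pairs = zip(durations, durations[1:])
    diffs = [abs(a - b) / ((a + b) / 2) for a, b in pairs]
    return 100 * sum(diffs) / len(diffs)

# Perfectly even intervals score 0; long-short alternation scores high.
print(npvi([0.1, 0.1, 0.1, 0.1]))            # 0.0
print(round(npvi([0.2, 0.1, 0.2, 0.1]), 1))  # 66.7
```

A multi-dimensional metric of the kind the thesis argues for would combine such a duration-based index with an analogous measure over f0, weighted per language.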