
    New Nonsense Syllables Database -- Analyses and Preliminary ASR Experiments

    The paper presents analyses, modifications, and first experiments with a new nonsense syllables database. Results of preliminary experiments with phoneme recognition are given and discussed

    SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DYSARTHRIC SPEECH RECOGNITION

    Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of the speech production muscles. Automatic Speech Recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking have been proposed. For dysarthric speech synthesis, this dissertation introduces a modified neural multi-talker TTS with a dysarthria severity level coefficient and a pause insertion model, so that dysarthric speech can be synthesized at varying severity levels. In addition, we extend this work with a label propagation technique that creates more meaningful control variables, such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that provide only discrete dysarthria severity levels. This increases the controllability of the system, allowing us to generate dysarthric speech over a broader severity range. To evaluate the effectiveness of the synthesized training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a 12.2% WER improvement over the baseline, and that adding the severity level and pause insertion controls decreases WER by a further 6.5%, demonstrating the effectiveness of these parameters. 
Overall results on the TORGO database demonstrate that using synthetic dysarthric speech to increase the amount of dysarthric-patterned training data has a significant impact on dysarthric ASR systems
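The evaluation metric behind these results is word error rate (WER). As a minimal, self-contained sketch of how WER is conventionally computed (word-level Levenshtein distance normalised by reference length; an illustration, not the dissertation's DNN-HMM pipeline):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```

A "12.2% WER improvement" is then typically read as a relative reduction, e.g. a baseline WER of 0.50 dropping to 0.50 * (1 - 0.122) = 0.439.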

    Automatic Feedback for L2 Prosody Learning

    We have designed automatic feedback for the realisation of the prosody of a foreign language. Besides classical F0 displays, two kinds of feedback are provided to learners, each based upon a comparison between a reference and the learner's production. The first feedback, a diagnosis, provided both as a short text and as visual displays such as arrows, comes from an acoustic evaluation of the learner's realisation; it deals with two prosodic cues: the melodic curve and phoneme duration. The second feedback is perceptual and consists of replacing the learner's prosodic cues (duration and F0) with those of the reference. A pilot experiment was undertaken to test the immediate impact of the "advanced" feedback proposed here. We chose to test the production of English lexical accent in isolated words by French speakers. The experiment shows that feedback based upon diagnosis and speech modification enables French learners with a low production level to improve their realisations of English lexical accents more than (simple) auditory feedback does. In contrast, for the advanced learners involved in this study, auditory feedback appears to be as efficient as the more elaborate feedback
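The acoustic comparison behind the melodic-curve diagnosis can be approximated as a distance between the reference and learner F0 contours. A minimal sketch, assuming both contours arrive as lists of voiced-frame F0 values in Hz; the linear time normalisation and semitone-scale RMSE used here are illustrative choices, not the authors' actual metric:

```python
import math

def to_semitones(f0_hz, base_hz=100.0):
    """Convert an F0 contour in Hz to semitones relative to base_hz."""
    return [12 * math.log2(f / base_hz) for f in f0_hz]

def resample(xs, n):
    """Linearly interpolate a contour to n points (crude time normalisation)."""
    if n == 1:
        return [xs[0]]
    out = []
    for i in range(n):
        t = i * (len(xs) - 1) / (n - 1)
        lo = int(t)
        hi = min(lo + 1, len(xs) - 1)
        frac = t - lo
        out.append(xs[lo] * (1 - frac) + xs[hi] * frac)
    return out

def melodic_distance(ref_f0, learner_f0, n=50):
    """RMSE in semitones between time-normalised F0 contours."""
    r = resample(to_semitones(ref_f0), n)
    l = resample(to_semitones(learner_f0), n)
    return (sum((a - b) ** 2 for a, b in zip(r, l)) / n) ** 0.5
```

Working in semitones makes the distance speaker-independent: a learner one octave below the reference scores a constant 12-semitone offset rather than a Hz gap that grows with pitch.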

    Directions for the future of technology in pronunciation research and teaching

    This paper reports on the role of technology in state-of-the-art pronunciation research and instruction, and makes concrete suggestions for future developments. The point of departure for this contribution is that the goal of second language (L2) pronunciation research and teaching should be enhanced comprehensibility and intelligibility as opposed to native-likeness. Three main areas are covered here. We begin with a presentation of advanced uses of pronunciation technology in research with a special focus on the expertise required to carry out even small-scale investigations. Next, we discuss the nature of data in pronunciation research, pointing to ways in which future work can build on advances in corpus research and crowdsourcing. Finally, we consider how these insights pave the way for researchers and developers working to create research-informed, computer-assisted pronunciation teaching resources. We conclude with predictions for future developments

    Phonetics of segmental F0 and machine recognition of Korean speech


    WHERE IS THE LOCUS OF DIFFICULTY IN RECOGNIZING FOREIGN-ACCENTED WORDS? NEIGHBORHOOD DENSITY AND PHONOTACTIC PROBABILITY EFFECTS ON THE RECOGNITION OF FOREIGN-ACCENTED WORDS BY NATIVE ENGLISH LISTENERS

    This series of experiments (1) examined whether native listeners experience recognition difficulty with all foreign-accented words or only with a subset of words with certain lexical and sub-lexical characteristics, namely neighborhood density and phonotactic probability; (2) identified the locus of foreign-accented word recognition difficulty; and (3) investigated how accent-induced mismatches impact the lexical retrieval process. Experiments 1 and 4 examined the recognition of native-produced and foreign-accented words varying in neighborhood density with auditory lexical decision and perceptual identification tasks, respectively, which emphasize the lexical level of processing. Findings from Experiment 1 revealed an increased accent-induced processing cost in reaction times, especially for words with many similar-sounding words, implying that native listeners increase their reliance on top-down lexical knowledge during foreign-accented word recognition. Analysis of perception errors from Experiment 4 found the misperceptions in the foreign-accented condition to be more similar to the target words than those in the native-produced condition. This suggests that accent-induced mismatches tend to activate similar-sounding words as alternative word candidates, which likely pose increased lexical competition for the target word and result in greater processing costs for foreign-accented word recognition at the lexical level. Experiments 2 and 3 examined the sub-lexical processing of foreign-accented words varying in neighborhood density and phonotactic probability, respectively, with a same-different matching task, which emphasizes the sub-lexical level of processing. Findings from both experiments revealed no extra processing costs, in either reaction times or accuracy rates, for the foreign-accented stimuli, implying that sub-lexical processing of foreign-accented words is as good as that of native-produced words. 
Taken together, the overall recognition difficulty of foreign-accented stimuli, as well as the differentially increased processing difficulty for accented dense words (observed in Experiment 1), mainly stems from the lexical level, due to the increased lexical competition posed by the similar sounding word candidates
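Neighborhood density, the lexical variable manipulated across these experiments, is conventionally the count of words that differ from a target by a single phoneme (substitution, deletion, or insertion). A minimal sketch, using orthographic strings in place of phoneme transcriptions purely to stay self-contained:

```python
def neighborhood_density(word, lexicon):
    """Count lexicon entries exactly one segment substitution, deletion,
    or insertion away from `word`.
    Real studies compute this over phoneme transcriptions; plain strings
    are used here only to keep the sketch self-contained."""
    def edit_distance_is_one(a, b):
        if a == b or abs(len(a) - len(b)) > 1:
            return False
        if len(a) > len(b):
            a, b = b, a          # make `a` the shorter string
        i = j = diffs = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                i += 1
                j += 1
            else:
                diffs += 1
                if diffs > 1:
                    return False
                if len(a) == len(b):
                    i += 1       # substitution
                j += 1           # extra segment in the longer string
        # any leftover segment in `b` is the single insertion
        return diffs + (len(b) - j) <= 1
    return sum(edit_distance_is_one(word, w) for w in lexicon)
```

For example, neighbors of "cat" in a toy lexicon would include "bat" and "cut" (substitution), "at" (deletion), and "cast" (insertion); dense words are those with many such neighbors competing during retrieval.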

    A Study of Accommodation of Prosodic and Temporal Features in Spoken Dialogues in View of Speech Technology Applications

    Inter-speaker accommodation is a well-known property of human speech and human interaction in general. Broadly, it refers to the behavioural patterns of two (or more) interactants and the effect of the (verbal and non-verbal) behaviour of each on that of the other(s). Implementation of this behaviour in spoken dialogue systems is desirable as an improvement on the naturalness of human-machine interaction. However, traditional qualitative descriptions of accommodation phenomena do not provide sufficient information for such an implementation. Therefore, a quantitative description of inter-speaker accommodation is required. This thesis proposes a methodology for monitoring accommodation during a human-human or human-computer dialogue, which utilizes a moving average filter over sequential frames for each speaker. These frames are time-aligned across the speakers, hence the name Time Aligned Moving Average (TAMA). Analysis of spontaneous human dialogue recordings by means of the TAMA methodology reveals ubiquitous accommodation of prosodic features (pitch, intensity and speech rate) across interlocutors, and allows for statistical (time series) modeling of the behaviour, in a way which is meaningful for implementation in spoken dialogue system (SDS) environments. In addition, a novel dialogue representation is proposed that provides an additional point of view to that of TAMA in monitoring accommodation of temporal features (inter-speaker pause length and overlap frequency). This representation is a percentage turn distribution of individual speaker contributions in a dialogue frame, which circumvents strict attribution of speaker turns by considering both interlocutors as synchronously active. Both TAMA and turn distribution metrics indicate that correlation of average pause length and overlap frequency between speakers can be attributed to accommodation (a debated issue), and point to possible improvements in SDS “turn-taking” behaviour. 
Although the findings of the prosodic and temporal analyses can directly inform SDS implementations, further work is required in order to describe inter-speaker accommodation sufficiently, as well as to develop an adequate testing platform for evaluating the magnitude of perceived improvement in human-machine interaction. Therefore, this thesis constitutes a first step towards a convincingly useful implementation of accommodation in spoken dialogue systems
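The TAMA idea, smoothing each speaker's prosodic feature track with a moving average over sequential frames and then comparing the smoothed tracks across time-aligned speakers, can be reduced to a short sketch. This is an illustrative simplification that assumes the two per-frame feature lists are already time-aligned; the thesis's frame construction and time-series modeling are richer:

```python
def moving_average(xs, win=5):
    """Smooth a per-frame feature track with a simple sliding mean."""
    return [sum(xs[i:i + win]) / win for i in range(len(xs) - win + 1)]

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def accommodation_score(feat_a, feat_b, win=5):
    """Correlate the smoothed feature tracks (e.g. per-frame mean pitch)
    of two time-aligned speakers; a sustained positive correlation is
    read as convergence between the interlocutors."""
    return pearson(moving_average(feat_a, win), moving_average(feat_b, win))
```

The same scaffolding applies to any of the features named above (pitch, intensity, speech rate): only the per-frame feature extraction changes.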

    Truncation Confusion Patterns in Onset Consonants

    Confusion matrices and truncation experiments have long been a part of psychoacoustic experimentation. However, confusion matrices are seldom used to analyze truncation experiments. A truncation experiment was conducted and the confusion patterns were analyzed for 6 consonant-vowel (CV) syllables. The confusion patterns show significant structure as the CV is truncated from the onset of the consonant. These confusions correlate with articulatory and acoustic features, as well as with other related CVs. These confusion patterns are shown and explored as they relate to human speech recognition
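Tabulating such confusion patterns is straightforward: each trial pairs the CV presented with the CV the listener reported, and the matrix counts responses per stimulus. A minimal sketch with hypothetical trial data:

```python
from collections import Counter

def confusion_matrix(trials):
    """Build a confusion matrix from (presented, reported) pairs.
    Returns the sorted label list and one row of response counts
    per presented stimulus."""
    counts = Counter(trials)
    labels = sorted({s for s, _ in counts} | {r for _, r in counts})
    matrix = [[counts[(s, r)] for r in labels] for s in labels]
    return labels, matrix
```

Rows index the presented CV and columns the reported CV, so the off-diagonal cells are exactly the confusions whose structure changes as the consonant onset is truncated.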