2,464 research outputs found

    Improving the Speech Intelligibility By Cochlear Implant Users

    In this thesis, we focus on improving the intelligibility of speech for cochlear implant (CI) users. As an auditory prosthetic device, a CI can restore hearing sensations for most patients with profound hearing loss in both ears in a quiet background. However, CI users still have serious problems understanding speech in noisy and reverberant environments. Bandwidth limitation, missing temporal fine structure, and reduced spectral resolution due to a limited number of electrodes further raise the difficulty of hearing in noisy conditions for CI users, regardless of the type of noise. To mitigate these difficulties for CI listeners, we investigate several contributing factors, such as the effects of low harmonics on tone identification in natural and vocoded speech, the contribution of matched envelope dynamic range to binaural benefits, and the contribution of low-frequency harmonics to tone identification in quiet and in six-talker babble. These results revealed several promising methods for improving speech intelligibility for CI patients. In addition, we investigate the benefits of voice conversion in improving speech intelligibility for CI users, motivated by an earlier study showing that familiarity with a talker’s voice can improve understanding of conversation. Research has shown that when adults are familiar with someone’s voice, they can more accurately, and even more quickly, process and understand what the person is saying. This effect, known as the “familiar talker advantage,” motivated us to examine its benefit for CI patients using voice conversion. In the present research, we propose a new method based on multi-channel voice conversion to improve the intelligibility of transformed speech for CI patients.
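    As a rough illustration of the voice conversion idea (a minimal sketch, not the multi-channel method proposed in the thesis), the Python snippet below maps a source speaker's spectral features toward a target speaker with a learned frame-wise regression; the random feature matrices stand in for time-aligned parallel training data.

        # Minimal, hypothetical voice conversion sketch: learn a frame-wise
        # linear map from source-speaker spectral features to target-speaker
        # features. A real system would use aligned parallel speech (e.g. via
        # DTW) and a richer model; the data here is synthetic.
        import numpy as np
        from sklearn.linear_model import LinearRegression

        rng = np.random.default_rng(0)
        src = rng.normal(size=(500, 25))      # (frames, mel-cepstral dims), synthetic
        tgt = src @ rng.normal(scale=0.2, size=(25, 25)) + rng.normal(scale=0.1, size=(500, 25))

        mapper = LinearRegression().fit(src, tgt)    # one joint linear map
        converted = mapper.predict(src)              # frames shifted toward the target voice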

    Articulatory-WaveNet: Deep Autoregressive Model for Acoustic-to-Articulatory Inversion

    Acoustic-to-Articulatory Inversion, the estimation of articulatory kinematics from speech, is an important problem that has received significant attention in recent years. Estimated articulatory movements from such models can be used in many applications, including speech synthesis, automatic speech recognition, and facial kinematics for talking-head animation. Knowledge of articulator positions can also be extremely useful in speech therapy systems and in Computer-Aided Language Learning (CALL) and Computer-Aided Pronunciation Training (CAPT) systems for second-language learners. Acoustic-to-Articulatory Inversion is a challenging problem due to the complexity of articulation patterns and significant inter-speaker differences, and it is more challenging still when applied to non-native speakers without any kinematic training data. This dissertation addresses these problems through the development of upgraded architectures for articulatory inversion. The proposed Articulatory-WaveNet architecture is based on a stack of dilated causal convolutional layers and improves Acoustic-to-Articulatory Inversion results in both speaker-dependent and speaker-independent scenarios. The system was evaluated on the ElectroMagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE), consisting of 39 speakers, including both native English speakers and Mandarin-accented English speakers. Results show that Articulatory-WaveNet significantly improves the performance of speaker-dependent and speaker-independent Acoustic-to-Articulatory Inversion systems compared to previously reported results.
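    To make the architectural idea concrete, here is a minimal PyTorch sketch of a WaveNet-style stack of dilated causal convolutions mapping acoustic frames to articulatory trajectories; the layer counts and feature dimensions are illustrative assumptions, not those of Articulatory-WaveNet itself.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class DilatedCausalStack(nn.Module):
            """Each layer doubles the dilation; left-only padding keeps the
            convolution causal, so the output at frame t sees inputs <= t."""
            def __init__(self, in_dim=13, hidden=64, out_dim=12, n_layers=6, kernel=2):
                super().__init__()
                self.convs = nn.ModuleList()
                dim = in_dim
                for i in range(n_layers):
                    self.convs.append(nn.Conv1d(dim, hidden, kernel, dilation=2 ** i))
                    dim = hidden
                self.out = nn.Conv1d(hidden, out_dim, 1)

            def forward(self, x):                    # x: (batch, features, time)
                for conv in self.convs:
                    pad = (conv.kernel_size[0] - 1) * conv.dilation[0]
                    x = torch.relu(conv(F.pad(x, (pad, 0))))   # causal left-pad
                return self.out(x)                   # (batch, articulator dims, time)

        model = DilatedCausalStack()
        mfcc = torch.randn(1, 13, 200)               # 200 acoustic frames
        trajectories = model(mfcc)                   # hypothetical EMA-style outputs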

    Exploring the prosodic and syntactic aspects of Mandarin-English code-switching

    Code-switching (CS) is one of the most common natural behaviours among bilinguals. Linguists have been exploring the constraints behind CS to explain this behaviour; while syntactic constraints have been the focus for decades, prosodic constraints have only recently been studied in depth. As a less common language pair in CS research, Mandarin-English CS is understudied with respect to both kinds of constraints. This study therefore explores the prosodic constraints and syntactic patterns of this language pair using a database of natural CS. Prosodically, the study applies the information-based approach and its fundamental unit, the Intonation Unit (IU), to conduct the analysis. The finding that 10.6% of IUs are bilingual (BIU) proves reliable and offers solid evidence that bilinguals tend to code-switch at IU boundaries, supporting the pioneering work of Shenk (2006) with evidence from the previously unexplored Mandarin-English pair. In addition, the study develops solutions to the subjectivity problem and the database-appropriateness problem in this approach to strengthen the validity of the results. Syntactically, the study investigates the syntactic patterns at switch points in the Mandarin-English pair using data collected from a rarely investigated bilingual community. A syntactic pattern specific to this language pair was observed, and the study suggests that it disrupted the final results. The study then analyzes the prosodic and syntactic results together: when the interfering results are eliminated, a more solid outcome emerges that provides greater support for the prosodic-constraint argument.

    Tone classification of syllable-segmented Thai speech based on multilayer perceptron

    Thai is a monosyllabic, tonal language: tone conveys lexical information about the meaning of a syllable. Thai has five distinctive tones, and each tone is well represented by a single F0 contour pattern. In general, a Thai syllable with a different tone has a different lexical meaning. Thus, to completely recognize a spoken Thai syllable, a speech recognition system must not only recognize the base syllable but also correctly identify its tone; tone classification is therefore an essential part of a Thai speech recognition system.

    In this study, a tone classifier for syllable-segmented Thai speech was developed that incorporates the effects of tonal coarticulation, stress, and intonation. Automatic syllable segmentation, which segments the training and test utterances into syllable units, was also developed. Acoustic features, including fundamental frequency (F0), duration, and energy, extracted from the current syllable and its neighboring syllables were used as the main discriminating features. A multilayer perceptron (MLP) trained by backpropagation was employed to classify these features. The proposed system was evaluated on 920 test utterances spoken by five male and three female Thai speakers who also produced the training speech, achieving an average accuracy of 91.36%.
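    As a sketch of the classifier described above (with made-up data in place of the real F0, duration, and energy measurements), an MLP trained by backpropagation can be set up as follows; the layout of 12 features from each of the current and two neighboring syllables is an illustrative assumption.

        import numpy as np
        from sklearn.neural_network import MLPClassifier

        rng = np.random.default_rng(0)
        X = rng.normal(size=(920, 3 * 12))    # current + 2 neighbor syllables, 12 features each
        y = rng.integers(0, 5, size=920)      # the five Thai tones (synthetic labels)

        # MLPClassifier fits by backpropagation, matching the abstract's setup.
        clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
        clf.fit(X, y)
        print(clf.predict(X[:5]))             # predicted tone labels, 0..4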

    Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab

    Articulatory copy synthesis (ACS), a subarea of speech inversion, refers to the reproduction of natural utterances in terms of both the physiological articulatory processes and their corresponding acoustic results. This thesis proposes two novel methods for the ACS of human speech using the articulatory speech synthesizer VocalTractLab (VTL), addressing or mitigating existing problems of speech inversion such as non-unique mapping, acoustic variation among speakers, and the time-consuming nature of the process.

    The first method finds appropriate VTL gestural scores for given natural utterances using a genetic algorithm, in two steps: gestural score initialization and optimization. In the first step, gestural scores were initialized from the given acoustic signals using speech recognition, grapheme-to-phoneme (G2P) conversion, and a VTL rule-based method for converting phoneme sequences to gestural scores. In the second step, the initial gestural scores were optimized by a genetic algorithm via an analysis-by-synthesis (ABS) procedure that minimized the cosine distance between the acoustic features of the synthetic and natural utterances. The articulatory parameters were also regularized during optimization to restrict them to reasonable values. The second method was based on long short-term memory (LSTM) and convolutional neural networks, which captured the temporal dependence and the spatial structure of the acoustic features, respectively. Neural network regression models were trained that took acoustic features as inputs and produced articulatory trajectories as outputs. In addition, to cover as much of the articulatory and acoustic space as possible, the training samples were augmented by manipulating the phonation type, speaking effort, and vocal tract length of the synthetic utterances. Furthermore, two regularization methods were proposed: one based on a smoothness loss over articulatory trajectories and another based on an acoustic loss between original and predicted acoustic features.

    The best-performing genetic algorithm and convolutional LSTM systems (evaluated in terms of the difference between the estimated and reference VTL articulatory parameters) obtained average correlation coefficients of 0.985 and 0.983 on speaker-dependent utterances, respectively, and their reproduced speech achieved recognition accuracies of 86.25% and 64.69% on speaker-independent utterances of German words, respectively. Applied to German sentence utterances, as well as English and Mandarin Chinese word utterances, the neural-network-based ACS systems achieved recognition accuracies of 73.88%, 52.92%, and 52.41%, respectively. The results showed that both methods reproduced not only the articulatory processes but also the acoustic signals of the reference utterances. Moreover, the regularization methods led to more physiologically plausible articulatory processes and to estimated trajectories better matched to VTL's articulatory preferences, thus reproducing more natural and intelligible speech. The study also found that convolutional layers, used in conjunction with batch normalization layers, automatically learned more distinctive features from log power spectrograms. Finally, the neural-network-based ACS systems trained on German data generalized to utterances of other languages.
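    The genetic-algorithm step can be sketched as a simple analysis-by-synthesis loop. In the toy Python below, a fixed random matrix stands in for VocalTractLab's parameters-to-acoustics mapping, and selection plus Gaussian mutation minimize the cosine distance to the reference features; the population size, mutation scale, and surrogate synthesizer are all illustrative assumptions.

        import numpy as np

        rng = np.random.default_rng(0)
        PARAM_DIM, FEAT_DIM, POP = 20, 40, 30
        W = rng.normal(size=(PARAM_DIM, FEAT_DIM))     # stand-in for VTL synthesis

        def synthesize(params):                        # parameters -> "acoustic" features
            return np.tanh(params @ W)

        def cosine_distance(a, b):
            return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

        target = synthesize(rng.normal(size=PARAM_DIM))   # reference utterance features

        pop = rng.normal(size=(POP, PARAM_DIM))
        for generation in range(100):
            fitness = np.array([cosine_distance(synthesize(p), target) for p in pop])
            elite = pop[np.argsort(fitness)[: POP // 2]]                            # selection
            pop = np.vstack([elite, elite + rng.normal(scale=0.1, size=elite.shape)])  # mutation

        fitness = np.array([cosine_distance(synthesize(p), target) for p in pop])
        print("best cosine distance:", fitness.min())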
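    Similarly, the second method's architecture can be sketched as a convolutional front end followed by an LSTM, as in the PyTorch snippet below; the spectrogram size, hidden width, and output dimensionality are illustrative, not the dissertation's actual configuration, and the last line shows the kind of smoothness regularizer the abstract mentions.

        import torch
        import torch.nn as nn

        class ConvLSTMInverter(nn.Module):
            """Convolution (with batch norm, which the abstract reports helped)
            captures local spectral structure; the LSTM models temporal
            dependence across frames; a linear head emits articulatory params."""
            def __init__(self, n_freq=80, hidden=128, n_artic=12):
                super().__init__()
                self.conv = nn.Sequential(
                    nn.Conv1d(n_freq, hidden, kernel_size=5, padding=2),
                    nn.BatchNorm1d(hidden),
                    nn.ReLU(),
                )
                self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
                self.head = nn.Linear(hidden, n_artic)

            def forward(self, spec):                   # spec: (batch, n_freq, time)
                h = self.conv(spec).transpose(1, 2)    # -> (batch, time, hidden)
                h, _ = self.lstm(h)
                return self.head(h)                    # (batch, time, n_artic)

        model = ConvLSTMInverter()
        log_spec = torch.randn(2, 80, 300)             # log power spectrogram frames
        trajectories = model(log_spec)                 # predicted trajectories

        # Sketch of a smoothness loss: penalize frame-to-frame jumps.
        smooth_loss = ((trajectories[:, 1:] - trajectories[:, :-1]) ** 2).mean()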

    An ear for pitch: On the effects of experience and aptitude in processing pitch in language and music
