59 research outputs found

    Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab

    Articulatory copy synthesis (ACS), a subarea of speech inversion, refers to the reproduction of natural utterances and involves both the physiological articulatory processes and their corresponding acoustic results. This thesis proposes two novel methods for the ACS of human speech using the articulatory speech synthesizer VocalTractLab (VTL) to address or mitigate the existing problems of speech inversion, such as non-unique mapping, acoustic variation among different speakers, and the time-consuming nature of the process. The first method involved finding appropriate VTL gestural scores for given natural utterances using a genetic algorithm. It consisted of two steps: gestural score initialization and optimization. In the first step, gestural scores were initialized from the given acoustic signals using speech recognition, grapheme-to-phoneme (G2P) conversion, and a VTL rule-based method for converting phoneme sequences to gestural scores. In the second step, the initial gestural scores were optimized by a genetic algorithm via an analysis-by-synthesis (ABS) procedure that sought to minimize the cosine distance between the acoustic features of the synthetic and natural utterances. The articulatory parameters were also regularized during the optimization process to restrict them to reasonable values. The second method was based on long short-term memory (LSTM) and convolutional neural networks, which were responsible for capturing the temporal dependence and the spatial structure of the acoustic features, respectively. Neural network regression models were trained to take acoustic features as inputs and produce articulatory trajectories as outputs. In addition, to cover as much of the articulatory and acoustic space as possible, the training samples were augmented by manipulating the phonation type, speaking effort, and vocal tract length of the synthetic utterances. 
Furthermore, two regularization methods were proposed: one based on the smoothness loss of articulatory trajectories and another based on the acoustic loss between original and predicted acoustic features. The best-performing genetic algorithm and convolutional LSTM systems (evaluated in terms of the difference between the estimated and reference VTL articulatory parameters) obtained average correlation coefficients of 0.985 and 0.983 for speaker-dependent utterances, respectively, and their reproduced speech achieved recognition accuracies of 86.25% and 64.69% for speaker-independent utterances of German words, respectively. When applied to German sentence utterances, as well as English and Mandarin Chinese word utterances, the neural-network-based ACS systems achieved recognition accuracies of 73.88%, 52.92%, and 52.41%, respectively. The results showed that both methods reproduced not only the articulatory processes but also the acoustic signals of the reference utterances. Moreover, the regularization methods led to more physiologically plausible articulatory processes and brought the estimated articulatory trajectories closer to those preferred by VTL, thus producing more natural and intelligible speech. This study also found that the convolutional layers, when used in conjunction with batch normalization layers, automatically learned more distinctive features from log power spectrograms. Furthermore, the neural-network-based ACS systems trained using German data could be generalized to utterances of other languages.
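
The two quantities named in this abstract, the cosine distance between acoustic feature vectors and a smoothness loss on articulatory trajectories, can be illustrated with a minimal sketch. The function names, the mean-squared-difference form of the smoothness term, and the zero-vector convention are assumptions made for illustration; the abstract does not specify how the thesis defines or weights them.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: an all-zero vector is maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def smoothness_penalty(trajectory):
    """Mean squared frame-to-frame change of one articulatory parameter.

    This is one common way to penalise jittery trajectories; it is an
    assumption here, not the definition used in the thesis.
    """
    if len(trajectory) < 2:
        return 0.0
    diffs = [(nxt - cur) ** 2 for cur, nxt in zip(trajectory, trajectory[1:])]
    return sum(diffs) / len(diffs)
```

A smaller cosine distance means the synthetic frame is spectrally closer to the natural one, and a smaller smoothness penalty means the parameter changes more gradually over time.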

    Perceptual strategies underlying second language acquisition

    The literature suggests that listeners do not pay equal attention to all available acoustic information. Instead, when perceiving speech, they place more importance on some acoustic cues than others (Francis & Nusbaum, 2002). The patterns of weights assigned to different cues appear to change with increased linguistic experience, not only in the first language (L1; Mayo et al., 2003) but also in the second language (L2; Chandrasekaran et al., 2010). However, the role of attention and salience in cue weighting is still under discussion. This thesis presents a series of experiments designed to test this hypothesis in the context of native English speakers and Mandarin Chinese learners of English. First, we compared how prior experience (language background, musical training, and their interaction) shapes cue weighting strategies and tested whether the cue weighting of different cues reflects the direction of attention towards them or their salience. Compared to English speakers, Mandarin speakers showed enhanced attention to and preferential use of pitch across tasks but no increased pitch salience. Effects of musical training were contingent upon participants’ L1. We also demonstrated that perceptual strategies are not consistent across tasks, suggesting they are not driven by domain-general abilities. Second, since acoustic cues play different roles across languages, learning a new language might require listeners to make greater use of L1-irrelevant dimensions. We designed a targeted training focused on redirecting listeners’ attention towards an L2-relevant acoustic cue. Although the observed training effects were not long-lasting, we showed that perceptual strategies in categorizing L2 prosody could be adjusted with as little as three hours of training. This finding has the potential to inform the development of L2 learning paradigms targeting specific auditory challenges experienced by learners. 
Overall, this thesis provides novel insights into the long-debated role of dimension-selective attention and dimensional salience in cue weighting.

    Exploring the adaptive structure of the mental lexicon

    The mental lexicon is a complex structure organised in terms of phonology, semantics and syntax, among other levels. In this thesis I propose that this structure can be explained in terms of the pressures acting on it: every aspect of the organisation of the lexicon is an adaptation ultimately related to the function of language as a tool for human communication, or to the fact that language has to be learned by subsequent generations of people. A collection of methods, most of which are applied to a Spanish speech corpus, reveals structure at different levels of the lexicon.
    • The patterns of intra-word distribution of phonological information may be a consequence of pressures for optimal representation of the lexicon in the brain, and of the pressure to facilitate speech segmentation.
    • An analysis of perceived phonological similarity between words shows that the sharing of different aspects of phonological similarity is related to different functions. Phonological similarity perception sometimes relates to morphology (the stressed final vowel determines verb tense and person) and at other times shows processing biases (similarity in the word-initial and word-final segments is more readily perceived than in word-internal segments).
    • Another similarity analysis focuses on cooccurrence in speech to create a representation of the lexicon where the position of a word is determined by the words that tend to occur in its close vicinity. Variations of context-based lexical space naturally categorise words syntactically and semantically.
    • A higher level of lexicon structure is revealed by examining the relationships between the phonological and the cooccurrence similarity spaces. A study in Spanish supports the universality of the small but significant correlation between these two spaces found in English by Shillcock, Kirby, McDonald and Brew (2001).
    This systematicity across levels of representation adds an extra layer of structure that may help lexical acquisition and recognition. I apply it to a new paradigm to determine the function of parameters of phonological similarity based on their relationships with the syntactic-semantic level. I find that while some aspects of a language's phonology maintain systematicity, others work against it, perhaps responding to the opposed pressure for word identification. This thesis is an exploratory approach to the study of the mental lexicon structure that uses existing and new methodology to deepen our understanding of the relationships between language use and language structure.
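
The idea of correlating a phonological similarity space with a cooccurrence similarity space can be sketched as follows. This is only an illustration under stated assumptions: edit distance over spellings stands in for phonological distance, Euclidean distance over hand-made context vectors stands in for cooccurrence distance, and all function names and toy data are hypothetical. The abstract does not say which measures the thesis uses.

```python
import math
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (assumed proxy for phonological distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def euclidean(u, v):
    """Distance between two context vectors (assumed cooccurrence measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def space_correlation(words, context_vectors):
    """Correlate the two distances over every unordered word pair."""
    pairs = list(combinations(words, 2))
    phon = [levenshtein(a, b) for a, b in pairs]
    ctx = [euclidean(context_vectors[a], context_vectors[b]) for a, b in pairs]
    return pearson(phon, ctx)


# Toy data, invented for illustration only.
toy_vectors = {"casa": [1.0, 0.0], "cosa": [1.0, 1.0], "mesa": [0.0, 1.0]}
r = space_correlation(["casa", "cosa", "mesa"], toy_vectors)
```

Because the pairwise distances are not independent observations, judging whether such a correlation is significant would need a permutation-style test rather than the usual test for Pearson's r; that step is not shown here.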

    The effect of high variability and individual differences on phonetic training of Mandarin tones

    High variability phonetic training (HVPT) has been found to be more effective than low variability phonetic training (LVPT) in learning various non-native phonetic contrasts. However, little research has considered whether this applies to the learning of tone contrasts. Two relevant studies suggested that the effect of high variability training depends on the perceptual aptitude of participants (Perrachione, Lee, Ha, & Wong, 2011; Sadakata & McQueen, 2014). It is also unclear how different types of individual difference measures interact with the learning of a tonal language. What work there is suggests that musical ability is related to discriminating tonal information and that, in general, attention and working memory are linked to language learning. The present study extends these findings by examining the interaction between individual aptitude and input variability and between learning outcomes and individual measures using natural, meaningful L2 input (both previous studies used pseudowords). In Study 1, forty English speakers took part in an eight-session phonetic training paradigm. They were assigned to high or low variability training groups. High variability training used four speakers during the training sessions while low variability training used one. All participants learned real Mandarin tones and words. Individual aptitude was measured using an identification and a categorisation task. Learning was measured using a categorical discrimination task, an identification task and two production tasks. Overall, all groups improved in both production and perception of tones, and this improvement transferred to novel voices and items, demonstrating the effectiveness of training despite the increased complexity of the training material compared with previous research. 
Although the low variability group exhibited better learning during training than the high variability group, there was no evidence that the different variability training conditions led to different performances in any of the tests of generalisation. Moreover, although performance on one of the aptitude tasks significantly predicted overall performance in categorical discrimination, identification and training tasks, it did not predict improvement from pre- to post-test. Critically, there was also no interaction between individual aptitude and variability condition, contradicting previous findings. One possibility was that the high variability condition was too difficult because speakers were randomly presented during training, resulting in low trial-by-trial consistency. This greater difficulty might block any advantage of variability for generalisation. In order to examine this, Study 2 recruited an additional 20 native English speakers and tested them in a further condition, identical to the previous high variability condition except that each speaker was presented in their own block during the training. Although participants performed better in training compared with the high variability group from Study 1, there was again no difference in generalisation compared with the previous conditions, and again no interaction between individual aptitude and variability condition was found. Bayes Factors were also used to assess the null results. There was evidence for the null for the benefits of high variability for generalisation but only ambiguous evidence regarding whether there was an interaction between variability and individual aptitude. The HVPT used in Study 1 and Study 2 did not replicate the interaction between variability condition and aptitude found in previous studies. Moreover, although one of the measures of aptitude did correlate with the baseline measures of performance, there was no evidence that it predicted learning due to training. 
Additionally, the two individual aptitude measures used in Studies 1 and 2 (taken from Perrachione et al. (2011) and Sadakata and McQueen (2013)) are not comprehensive. They are natural language-related tasks which directly measure tone perception itself, rather than the underlying cognitive factors which could underpin this ability. Another interesting question is whether these different cognitive factors might contribute differently to learners at different stages, particularly since language training studies vary as to whether they use current learners of the language or naïve participants, a factor that may contribute towards differing findings in the literature. To explore these issues, Study 3 investigated the relationship between a battery of cognitive individual difference measures and Mandarin tone learning. Sixty native English speakers (forty of whom were currently studying Mandarin at undergraduate level, twenty of whom were naïve learners) took part in a six-session training paradigm. With high-variability training stimuli similar to those used in Study 2 (four speakers blocked), their learning outcomes were assessed by identification, categorical discrimination and production tasks similar to those in Study 1. Their working memory, attention and musical ability were also measured. Overall, both groups showed improvements during training and in the generalisation tasks. Although Mandarin learner participants performed better than naïve participants overall, their improvements were not generally greater than those of naïve participants. Each of the individual difference measures was used to predict participants' performance at pre-test and their improvement due to training. Bayes Factors were used as the key method of inference. For Mandarin learner participants, both performance at pre-test and pre- to post-test improvement were strongly predicted by attention measures, while for naïve speakers, musical ability was the dominant predictor of pre- to post-test improvement. 
This series of studies demonstrates that Mandarin lexical tones can be trained using natural stimuli embedded in a word learning task and that learning generalises to untrained voices and items as well as to production. Although there is no evidence in the current data that the type of training materials affected learning outcomes, tone learning is indeed affected by individual cognitive factors, such as attention and musical ability, with these playing a different role for learners at different stages.

    Theoretical results on a weightless neural classifier and application to computational linguistics

    WiSARD is an n-tuple classifier, historically used for pattern recognition tasks on black-and-white images. Unfortunately, it was seldom used for other tasks because it could not cope with large volumes of data, being sensitive to the learned content. Recently, the bleaching technique was conceived as an improvement to the n-tuple classifier architecture, as a means of curbing WiSARD's sensitivity. Since then, the range of applications built with this learning system has grown. Because it frequently relies on very large corpora, multilingual part-of-speech tagging fits into this group of applications. This thesis improves mWANN-Tagger, a weightless part-of-speech tagger proposed in 2012. It shows that research on multilingual tagging with WiSARD was intensified through the use of quantitative linguistics and that a universal parameter configuration was found for mWANN-Tagger. Analyses and experiments with the Universal Dependencies (UD) treebanks show that mWANN-Tagger has the potential to outperform state-of-the-art taggers given a better word representation. This thesis also aims to assess the advantages of bleaching over the traditional model through the theoretical framework of VC theory. The VC dimensions of both models were calculated, establishing that an n-tuple classifier, whether WiSARD or with bleaching, that has N memories addressed by binary n-tuples has a VC dimension of exactly N(2^n − 1) + 1. A parallel was then drawn between the two models, from which it was deduced that the bleaching technique is an improvement to the n-tuple method that does not harm its learning capacity.
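
As a small worked example of the VC dimension stated in this abstract, the sketch below evaluates N(2^n − 1) + 1 for given N and n. It assumes the extracted "2n" is the exponent 2^n (a binary n-tuple addresses 2^n memory locations); the function name and the sample values of N and n are hypothetical.

```python
def ntuple_vc_dimension(num_memories: int, tuple_size: int) -> int:
    """Evaluate N * (2**n - 1) + 1 for N memories addressed by binary n-tuples.

    Assumes the formula in the abstract reads N(2^n - 1) + 1, with the
    exponent lost in text extraction.
    """
    if num_memories < 1 or tuple_size < 1:
        raise ValueError("num_memories and tuple_size must be positive")
    return num_memories * (2 ** tuple_size - 1) + 1


# Hypothetical configuration: 28 memories, each addressed by a 4-bit tuple.
# 2**4 - 1 = 15, so the result is 28 * 15 + 1 = 421.
example = ntuple_vc_dimension(28, 4)
```

Under this reading the value grows linearly in the number of memories and exponentially in the tuple size.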

    Semantic radical consistency and character transparency effects in Chinese: an ERP study

    BACKGROUND: This event-related potential (ERP) study aims to investigate the representation and temporal dynamics of Chinese orthography-to-semantics mappings by simultaneously manipulating character transparency and semantic radical consistency. Character components, referred to as radicals, make up the building blocks used dur...