152 research outputs found

    Articulatory representations to address acoustic variability in speech

    The past decade has seen phenomenal improvement in the performance of Automatic Speech Recognition (ASR) systems. In spite of this vast improvement, the state of the art still lags significantly behind human speech recognition. Even though certain systems claim super-human performance, this performance is often sub-par across domains and across datasets. The gap is predominantly due to a lack of robustness against speech variability: even clean speech is extremely variable owing to factors such as voice characteristics, speaking style, speaking rate, accent, casualness, and emotion. The goal of this thesis is to investigate the variability of speech from the perspective of speech production, to put forth robust articulatory features that address this variability, and to incorporate these features into state-of-the-art ASR systems in the best way possible.

    ASR systems model speech as a sequence of distinctive phone units, like beads on a string. Although phonemes are distinctive units in the cognitive domain, their physical realizations are extremely varied due to coarticulation and lenition, which are commonly observed in conversational speech. Traditional approaches deal with this issue by performing di-, tri-, or quin-phone based acoustic modeling, but these are insufficient to model longer contextual dependencies. Articulatory phonology analyzes speech as a constellation of coordinated articulatory gestures performed by the articulators in the vocal tract (lips, tongue tip, tongue body, jaw, glottis, and velum). In this framework, acoustic variability is explained by the temporal overlap of gestures and their reduction in space. In order to analyze speech in terms of articulatory gestures, the gestures need to be estimated from the speech signal.

    The first part of the thesis focuses on a speaker-independent acoustic-to-articulatory inversion system developed to estimate vocal tract constriction variables (TVs) from speech. The mapping from acoustics to TVs was learned from the multi-speaker X-ray Microbeam (XRMB) articulatory dataset. Constriction regions in the TV trajectories were defined as articulatory gestures using articulatory kinematics. The speech inversion system, combined with the TV-kinematics-based gesture annotation, provided a system for estimating articulatory gestures from speech.

    The second part of this thesis deals with the analysis of articulatory trajectories under different types of variability, such as multiple speakers, speaking rate, and accents. Speaker variation was observed to degrade the performance of the speech inversion system, so a Vocal Tract Length Normalization (VTLN) based speaker normalization technique was developed to address speaker variability in the acoustic and articulatory domains. The performance of the speech inversion system was analyzed on an articulatory dataset containing speaking rate variations, to assess whether the model could reliably predict TVs in challenging coarticulatory scenarios, and in cross-accent and cross-language scenarios, through experiments on a Dutch and a British English articulatory dataset. These experiments provide a quantitative measure of the robustness of speech inversion systems to different types of speech variability.

    The final part of this thesis deals with the incorporation of articulatory features into state-of-the-art medium-vocabulary ASR systems.
A hybrid convolutional neural network (CNN) architecture was developed to fuse the acoustic and articulatory feature streams in an ASR system. ASR experiments were performed on the Wall Street Journal (WSJ) corpus, and several articulatory feature combinations were explored to determine the best combination. Cross-corpus evaluations were carried out to evaluate the WSJ-trained ASR system on the TIMIT corpus and on another dataset containing speaking rate variability. Results showed that combining articulatory features with acoustic features through the hybrid CNN improved the performance of the ASR system in both matched and mismatched evaluation conditions. The findings of this dissertation indicate that articulatory representations extracted from acoustics can be used to address the acoustic variability in speech observed due to speakers, accents, and speaking rates, and can further be used to improve the performance of Automatic Speech Recognition systems.
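    The abstract does not spell out the fusion architecture, so the following is only a minimal sketch of the general idea in PyTorch: a small inversion network (standing in for the acoustic-to-TV speech inversion system) supplies the articulatory stream, and a two-stream convolutional network fuses it with the acoustic stream for frame-level phone classification. All names, dimensions, and layer sizes (TVInversionNet, HybridFusionCNN, 13 MFCCs, 6 TVs, 40 phone classes) are illustrative assumptions, not the thesis's actual configuration.

    # Illustrative sketch only -- layer sizes, feature dimensions, and class
    # counts are assumptions, not the configuration used in the thesis.
    import torch
    import torch.nn as nn

    class TVInversionNet(nn.Module):
        """Hypothetical acoustic-to-articulatory inversion: MFCC frames -> TVs."""
        def __init__(self, n_mfcc=13, n_tvs=6):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_mfcc, 128), nn.Tanh(),
                nn.Linear(128, 128), nn.Tanh(),
                nn.Linear(128, n_tvs),  # six tract variables (e.g., lip aperture)
            )

        def forward(self, x):            # x: (batch, time, n_mfcc)
            return self.net(x)           # (batch, time, n_tvs)

    class HybridFusionCNN(nn.Module):
        """Two-stream 1-D CNN: acoustic and articulatory streams fused late."""
        def __init__(self, n_mfcc=13, n_tvs=6, n_phones=40):
            super().__init__()
            self.acoustic = nn.Sequential(
                nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            )
            self.articulatory = nn.Sequential(
                nn.Conv1d(n_tvs, 32, kernel_size=5, padding=2), nn.ReLU(),
            )
            self.classifier = nn.Conv1d(64 + 32, n_phones, kernel_size=1)

        def forward(self, mfcc, tvs):    # both: (batch, time, features)
            a = self.acoustic(mfcc.transpose(1, 2))        # (batch, 64, time)
            b = self.articulatory(tvs.transpose(1, 2))     # (batch, 32, time)
            fused = torch.cat([a, b], dim=1)               # channel-wise fusion
            return self.classifier(fused).transpose(1, 2)  # (batch, time, phones)

    # Usage: estimate TVs from acoustics, then fuse the two streams.
    mfcc = torch.randn(8, 200, 13)         # 8 utterances, 200 frames, 13 MFCCs
    tvs = TVInversionNet()(mfcc)           # articulatory stream from inversion
    logits = HybridFusionCNN()(mfcc, tvs)  # frame-level phone scores

    Channel-wise concatenation is just one plausible fusion point; per the abstract, the thesis compares several articulatory feature combinations rather than committing to a single one.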

    Phonetically transparent technique for the automatic transcription of speech


    Sociolinguistic competence and the bilingual's adoption of phonetic variants: auditory and instrumental data from English-Arabic bilinguals

    This study is an auditory and acoustic investigation of the speech production patterns developed by English-Arabic bilingual children. The subjects are three Lebanese children aged five, seven and ten, all born and raised in Yorkshire, England. Monolingual friends of the same age were chosen as controls, and the parents of all bilingual and monolingual children were also taped to obtain a detailed assessment of the sound patterns available in the subjects' environment. The study addresses the question of interaction between the bilingual's phonological systems by calling for a refinement of the notion of a 'phonological system' using insights from recent phonetic and sociolinguistic work on variability in speech (e.g. Docherty, Foulkes, Tillotson, & Watt, 2002; Docherty & Foulkes, 2000; Local, 1983; Pisoni, 1997; Roberts, 1997; Scobbie, 2002). The variables under study include /l/, /r/, and VOT production. These were chosen due to the existence of different patterns in their production in English and Arabic that vary according to contextual and dialectal factors. Data were collected using a variety of picture-naming, story-telling, and free-play activities for the children, and reading lists, story-telling, and interviews for the adults. To control for language mode (Grosjean, 1998), the bilinguals were recorded in different language sessions with different interviewers. Results for the monolingual children and adults underline the importance of including controls in any study of bilingual speech development for a better interpretation of the bilinguals' patterns. Input from the adults proved highly variable and at times conflicted with patterns normally reported in the literature for the variables under study. Results for the bilinguals show that they have developed separate, sociolinguistically appropriate production patterns for each of their languages that are on the whole similar to those of monolinguals but that also reflect the bilinguals' rich socio-phonetic repertoire. The interaction between the bilinguals' languages is mainly restricted to the bilingual mode and is a sign of their developing sociolinguistic competence.

    Dysarthric speech analysis and automatic recognition using phase based representations

    Dysarthria is a neurological speech impairment which usually results in the loss of motor speech control due to muscular atrophy and poor coordination of the articulators. Dysarthric speech is more difficult to model with machine learning algorithms, due to inconsistencies in the acoustic signal and to limited amounts of training data. This study reports a new approach for the analysis and representation of dysarthric speech, and applies it to improve ASR performance. The Zeros of the Z-Transform (ZZT) are investigated for dysarthric vowel segments. The analysis shows evidence of a phase-based acoustic phenomenon that is responsible for the way the distribution of zero patterns relates to speech intelligibility, and it is investigated whether such phase-based artefacts can be systematically exploited to understand their association with intelligibility. A metric, the phase slope deviation (PSD), is introduced based on deviations observed in the unwrapped phase spectrum of dysarthric vowel segments; it compares the differences between the slopes of dysarthric vowels and typical vowels. The PSD shows a strong and nearly linear correspondence with the intelligibility of the speaker, and this is shown to hold for two separate databases of dysarthric speakers. A systematic procedure for correcting the underlying phase deviations results in a significant improvement in ASR performance for speakers with severe and moderate dysarthria. In addition, information encoded in the phase component of the Fourier transform of dysarthric speech is exploited in the group delay spectrum, whose properties are found to represent disordered speech more effectively than the magnitude spectrum. Dysarthric ASR performance was significantly improved using phase-based cepstral features in comparison to conventional MFCCs. A combined approach utilising the benefits of PSD corrections and phase-based features was found to surpass all previously reported performance on the UASPEECH database of dysarthric speech.
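    For readers unfamiliar with phase-based representations, the numpy sketch below shows the standard group-delay computation (via the DFTs of x[n] and n*x[n]) and a crude stand-in for a phase-slope measure: fit a line to the unwrapped phase of a vowel frame and compare its slope against a reference. The frame length, window, and linear-fit formulation are assumptions for illustration, not the study's exact PSD procedure.

    # Minimal sketch of two phase-based representations named in the abstract.
    import numpy as np

    def group_delay_spectrum(frame):
        """Group delay tau(w) = (X_R*Y_R + X_I*Y_I) / |X|^2, with Y = DFT{n*x[n]}."""
        n = np.arange(len(frame))
        X = np.fft.rfft(frame)
        Y = np.fft.rfft(n * frame)
        eps = 1e-10                      # guard against spectral nulls
        return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)

    def phase_slope_deviation(frame, reference_slope):
        """Crude stand-in for the PSD metric: deviation of the unwrapped-phase
        slope of a vowel frame from a reference (typical-vowel) slope."""
        phase = np.unwrap(np.angle(np.fft.rfft(frame)))
        bins = np.arange(len(phase))
        slope = np.polyfit(bins, phase, 1)[0]   # slope of a first-order fit
        return slope - reference_slope

    # Usage on a synthetic "vowel" frame (decaying harmonics of a 120 Hz f0).
    fs = 16000
    t = np.arange(400) / fs                     # 25 ms frame at 16 kHz
    frame = sum(np.exp(-40 * t) * np.sin(2 * np.pi * 120 * k * t)
                for k in range(1, 5))
    frame *= np.hamming(len(frame))
    tau = group_delay_spectrum(frame)
    dev = phase_slope_deviation(frame, reference_slope=-0.5)  # reference is
    # arbitrary here; in the study it would come from typical vowels.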

    Semantic radical consistency and character transparency effects in Chinese: an ERP study

    BACKGROUND: This event-related potential (ERP) study aims to investigate the representation and temporal dynamics of Chinese orthography-to-semantics mappings by simultaneously manipulating character transparency and semantic radical consistency. Character components, referred to as radicals, make up the building blocks used dur...

    Statistical Parametric Methods for Articulatory-Based Foreign Accent Conversion

    Foreign accent conversion seeks to transform utterances from a non-native (L2) speaker to appear as if they had been produced by the same speaker but with a native (L1) accent. Such accent-modified utterances have been suggested to be effective in pronunciation training for adult second language learners. Accent modification involves separating the linguistic gestures and voice-quality cues from the L1 and L2 utterances, then transposing them across the two speakers. However, because of the complex interaction between these two sources of information, their separation in the acoustic domain is not straightforward. As a result, vocoding approaches to accent conversion result in a voice that is different from both the L1 and L2 speakers. In contrast, separation in the articulatory domain is straightforward, since linguistic gestures are readily available via articulatory data. However, because of the difficulty of collecting articulatory data, conventional synthesis techniques based on unit selection are ill-suited for accent conversion, given the small size of articulatory corpora and the inability to interpolate missing native sounds in the L2 corpus. To address these issues, this dissertation presents two statistical parametric methods for accent conversion that operate in the acoustic and articulatory domains, respectively. The acoustic method uses a cross-speaker statistical mapping to generate L2 acoustic features from the trajectories of L1 acoustic features in a reference utterance. Our results show significant reductions in the perceived non-native accent compared to the corresponding L2 utterance, as well as a strong voice similarity between the accent conversions and the original L2 utterance. Our second (articulatory-based) approach consists of building a statistical parametric articulatory synthesizer for the non-native speaker, then driving the synthesizer with articulatory trajectories from the reference L1 speaker. This statistical approach not only has low data requirements but also has the flexibility to interpolate missing sounds in the L2 corpus. In a series of listening tests, articulatory accent conversions were rated as more intelligible and less accented than their L2 counterparts. In a final study, we compare the two approaches and find that the articulatory approach, despite its direct access to the native linguistic gestures, is less effective in reducing perceived non-native accent than the acoustic approach.
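    The abstract describes the acoustic method only as a "cross-speaker statistical mapping". A joint-density Gaussian mixture regression is a common statistical parametric choice for such feature mappings, so the sketch below uses one purely as an illustration; the frame pairing (assumed already time-aligned, e.g., via dynamic time warping), the toy 4-dimensional features, and the component count are all assumptions rather than the dissertation's actual setup.

    # Illustrative joint-GMM regression for a cross-speaker feature mapping.
    import numpy as np
    from scipy.stats import multivariate_normal
    from sklearn.mixture import GaussianMixture

    def fit_joint_gmm(x, y, n_components=4):
        """Fit a GMM on stacked, time-aligned source (x) and target (y) frames."""
        return GaussianMixture(n_components, covariance_type="full",
                               random_state=0).fit(np.hstack([x, y]))

    def convert(gmm, x, dx):
        """Minimum mean-squared-error mapping E[y|x] under the joint GMM."""
        w = gmm.weights_
        mu_x, mu_y = gmm.means_[:, :dx], gmm.means_[:, dx:]
        S_xx = gmm.covariances_[:, :dx, :dx]
        S_yx = gmm.covariances_[:, dx:, :dx]
        K = len(w)
        # Responsibilities p(k | x) from the marginal mixture over x.
        dens = np.stack([w[k] * multivariate_normal.pdf(x, mu_x[k], S_xx[k])
                         for k in range(K)], axis=1)       # (frames, K)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # Per-component conditional means, combined by responsibility.
        y_hat = np.zeros((len(x), mu_y.shape[1]))
        for k in range(K):
            A = S_yx[k] @ np.linalg.inv(S_xx[k])
            y_hat += resp[:, [k]] * (mu_y[k] + (x - mu_x[k]) @ A.T)
        return y_hat

    # Toy usage; real systems would use higher-dimensional spectral features.
    rng = np.random.default_rng(0)
    x_l1 = rng.normal(size=(1000, 4))                   # reference L1 frames
    y_l2 = 0.5 * x_l1 + rng.normal(scale=0.3, size=(1000, 4))  # paired L2 frames
    gmm = fit_joint_gmm(x_l1, y_l2)
    converted = convert(gmm, x_l1, dx=4)  # L2-voice features with L1 dynamics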

    Multi-Sensoriality In Language Acquisition: The Relationship Between Selective Visual Attention Towards The Adult’s Face And Language Skills

    Introduction: Speech is the result of multimodal or multi-sensorial processes. The auditory and visual components of language provide the child with information crucial to the processing of speech. The language acquisition process is influenced by the child's ability to integrate information from multimodal (audio and visual) sources and to focus attention on the relevant cues in the environment; this is selective visual attention. This dissertation explores the relationship between children's selective visual attention and their early language skills. Several recent studies with infant populations have hypothesised or tested the relationship between children's selective visual attention towards specific regions of the talking face (i.e., the eyes or the mouth) and their language skills, trying to show how concomitant or longitudinal language skills can explain looking behaviours. In most cases, these studies have speculated on how this relationship is mediated by the child's level of language expertise (this is known as the language expertise hypothesis). However, no studies until now, to the best of our knowledge, have investigated the child's linguistic skills using spontaneous language measures.

    Aims: The dissertation has one broad aim, within which there are three particular aims. The broad aim is to examine the phenomenon of selective visual attention toward the face in both a laboratory and a naturalistic setting, and its relationship with language development. The first particular aim is to synthesise and analyse the factors that might determine different looking patterns in infants during audiovisual tasks using dynamic faces, and to describe how the literature explains these patterns in relation to aspects of language development. The second aim is to experimentally investigate the child's selective visual attention towards specific regions of the adult's face (the eyes and the mouth) in a task using the eye-tracking method. In particular, the study explores two questions: First, how do age and language condition (exposure to native vs non-native speech) affect looking behaviour in children?
Second, are a child's looking behaviours related to vocal production at the time of the experiment and to vocabulary rates three months later, and if so, how? The third aim is to understand whether selective attention towards the face or other parts of the visual scene (i.e., the object or elsewhere) is influenced or explained by the child's vocal skills at the time of the task, and whether the episodes of fixation towards the adult's face can be predicted by specific phonological and semantic properties (i.e., pre-canonical vocalisations, babbling, words) of the child's speech.

    Method: For the first study, a systematic review of the literature was conducted, exploring four bibliographic databases and using specific inclusion criteria to select the records. For the second study, eye movements towards a dynamic face (on a screen), speaking in the child's native language (Italian) and a non-native language (English), were tracked using an eye-tracker in 26 infants between 6 and 14 months. Two groups were created based on age (G1, M = 7 months, N = 15 infants; G2, M = 12 months, N = 11 infants). Each child's language skill was assessed twice: at the time of the experiment (through direct observation, Time 1) and three months later (through the MB-CDI, Time 2). Two further groups were created based on the child's vocal production (Time 1, latent class cluster analysis): a high class (higher percentage of babbling and words) vs a low class (higher percentage of pre-canonical vocalisations). For the third study, the looking behaviour of 29 children between 12 and 19 months was tracked, using both a stationary video camera and a head-mounted camera on the mother's head during a single object task. During the task, children were exposed to a set of audiovisual stimuli, real words and non-words, chosen based on the parents' reports and their MB-CDI answers. The child's looking behaviour was coded offline second-by-second for a total of 116 sessions; the coding relates to specific areas of interest, i.e., the face, the object, or elsewhere. The vocal production of each child was quantified using a LENA device, and their speech during a play period with their mothers was transcribed phonetically.

    Results: The systematic search of the literature (Chapter 2) identified 19 papers. Some tried to clarify the role played by audiovisual factors in support of speech perception (provided by looking towards the eyes or the mouth of a talking face). Others related selective visual attention towards specific areas of the adult's face to the child's competence in terms of linguistic or social skills, leading to correspondingly different lines of interpretation. The first empirical study (Chapter 3) shows that Italian children older than 12 months displayed a greater interest in the mouth area, especially when exposed to their native language. This accords with the more recent literature but contrasts with the language expertise hypothesis, under which children around one year of age should shift their attentional focus from the mouth to the eyes. The second significant result of Chapter 3 is that children who had a higher level of production in terms of babbling and words at the time of the experiment looked more towards the mouth area.
The study reported in Chapter 3 also demonstrated a positive association between the child's looking to the mouth and their expressive vocabulary as measured (using the MB-CDI) three months after the experiment. The second empirical study (Chapter 4) shows a significant difference in looking time towards the adult's face between children with low and high vocal production in a naturalistic setting. More specifically, this study yields two findings. First, the children who produced more advanced vocal forms (a higher amount of babbling and word production) looked more towards the adult's face, especially when exposed to non-words. Second, a significant relationship exists between the episodes of fixation towards the adult's face and the child's vocal skills (i.e., pre-canonical vocalisations, babbling, words); babbling productions predicted the episodes of face fixation in the task as a whole, for both words and non-words.

    Conclusion: Linguistic and social hypotheses attempting to explain the differences in the selective visual attention phenomenon emerged from the literature review. The empirical studies presented in this thesis bring two original contributions to this research field. First, our findings reinforce the idea that the mouth and, more generally, the face provide crucial visual cues when acquiring a language. Second, our results demonstrate that language knowledge and language skills at the time the child was observed significantly help to explain different looking behaviours. In other words, we can conclude that each child's attention to faces is shaped by their own linguistic characteristics.
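    The "latent class cluster analysis" used to split children into high and low vocal classes is not detailed in the abstract; one simple way to realise the idea is to cluster each child's production profile (proportions of pre-canonical vocalisations, babbling, and words) with a two-component mixture model, as in the hypothetical sketch below. The data values are invented.

    # Stand-in for the latent class cluster analysis: a two-component mixture
    # over per-child production proportions. Values are invented examples.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Each row: one child's (pre-canonical, babbling, words) proportions.
    profiles = np.array([
        [0.80, 0.15, 0.05],
        [0.75, 0.20, 0.05],
        [0.30, 0.50, 0.20],
        [0.25, 0.45, 0.30],
        [0.60, 0.30, 0.10],
        [0.20, 0.40, 0.40],
    ])

    gmm = GaussianMixture(n_components=2, covariance_type="diag",
                          random_state=0).fit(profiles)
    labels = gmm.predict(profiles)

    # Call the class with the higher mean babbling+word proportion "high".
    high = int(np.argmax(gmm.means_[:, 1:].sum(axis=1)))
    classes = np.where(labels == high, "high", "low")
    print(list(classes))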