465 research outputs found

    Subspace Gaussian Mixture Models for Language Identification and Dysarthric Speech Intelligibility Assessment

    In this thesis, we investigated how to efficiently apply subspace Gaussian mixture modeling techniques to two speech technology problems: automatic spoken language identification (LID) and automatic intelligibility assessment of dysarthric speech. One of the most important of these techniques was joint factor analysis (JFA). JFA is essentially a Gaussian mixture model in which the mean of each component is expressed as a sum of low-dimensional factors, each representing a different contribution to the speech signal. This factorization makes it possible to compensate for undesired sources of variability, such as the channel. JFA was investigated both as a final classifier and as a feature extractor. In the latter approach, a single subspace covering all sources of variability is trained, and points in this subspace are known as i-Vectors. An i-Vector is thus a low-dimensional representation of a single utterance, and i-Vectors have proved to be a very powerful feature for a range of machine learning problems.
    We investigated two LID systems that differ in the type of features extracted from speech. In the first, we extracted acoustic features representing short-time spectral information. Here we observed relative improvements of up to 50% with i-Vectors over JFA used as a classifier. We found that the channel subspace in a JFA model also contains language information, whereas i-Vectors do not discard any language information and, moreover, help to reduce mismatches between training and testing data. For classification, we modeled the i-Vectors of each language with a Gaussian distribution whose covariance matrix is shared among languages. This method is simple and fast, and it worked well without any post-processing of the i-Vectors. In the second system, we introduced prosodic and formant information into the i-Vector system. Its performance was below that of the acoustic system, but the two were found to be complementary, and fusing them gave up to a 20% relative improvement over the acoustic system alone.
    Given the success in LID and the fact that i-Vectors, in theory, capture all the information present in the signal, we decided to use them for another task: assessing speech intelligibility in speakers with different types of dysarthria. Speech therapists are very interested in this technology because it would allow them to rate the intelligibility of their patients objectively and consistently. In this case, the input features were extracted from short-term spectral information, and intelligibility was assessed from the i-Vectors computed for a set of words uttered by the tested speaker. Performance was clearly better when training data from the speaker being assessed was available. We think this limitation could be relaxed with larger training databases. However, the recording process is not easy for people with disabilities, and it is difficult to obtain large datasets of dysarthric speakers that are open to the research community.
    Finally, the same i-Vector-based architecture used for intelligibility assessment was used to predict the accuracy that an automatic speech recognition (ASR) system would obtain with dysarthric speakers. The only difference between the two was the set of ground-truth labels used for training. Predicting the performance of an ASR system would increase the confidence of speech therapists in these systems and would reduce health-related costs. The results were not as satisfactory as in the previous case, probably because an ASR system is complex and its accuracy can be very difficult to predict from acoustic information alone. Nonetheless, we think that we have opened the door to an interesting research direction for the two problems.
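The classification step described in the abstract above (one Gaussian per language, with a covariance matrix shared among languages) can be sketched roughly as follows. This is a minimal illustration with NumPy, not the thesis's implementation: the function names, the maximum-likelihood pooled covariance estimate, and the assumption of equal language priors are all choices made here for the example.

```python
import numpy as np

def fit_shared_covariance_gaussians(ivectors, labels):
    """Fit one mean per language and one covariance pooled over all languages.

    ivectors: array of shape (n_utterances, dim); labels: array of shape (n_utterances,).
    Hypothetical helper, named for this sketch only.
    """
    classes = np.unique(labels)  # sorted unique language labels
    means = np.stack([ivectors[labels == c].mean(axis=0) for c in classes])
    # Centre each i-vector on its own language mean, then pool the residuals.
    index = np.searchsorted(classes, labels)
    residuals = ivectors - means[index]
    shared_cov = residuals.T @ residuals / len(ivectors)
    return classes, means, shared_cov

def classify(ivectors, classes, means, shared_cov):
    """Pick the language whose Gaussian gives the highest log-likelihood.

    Assumes equal priors. With a shared covariance the quadratic term is the
    same for every language, so the score reduces to a linear function of x.
    """
    precision = np.linalg.inv(shared_cov)
    weights = means @ precision                      # shape (n_classes, dim)
    biases = -0.5 * np.sum(weights * means, axis=1)  # shape (n_classes,)
    scores = ivectors @ weights.T + biases
    return classes[np.argmax(scores, axis=1)]
```

Because the covariance is shared, scoring an utterance costs one matrix-vector product per language, which fits the abstract's description of the method as simple and fast.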

    NIST 2007 Language Recognition Evaluation: From the Perspective of IIR

    PACLIC / The University of the Philippines Visayas Cebu College Cebu City, Philippines / November 20-22, 200

    L2 speech learning of European Portuguese /l/ and /ɾ/ by L1-Mandarin learners: experimental evidence and theoretical modelling

    It has long been recognized that the poor distinction between /l/ and /ɾ/ is one of the most perceptible characteristics of Chinese-accented Portuguese. Recent empirical research has revealed that this notorious L2 speech learning difficulty goes beyond confusion between two L2 categories, as L1-Mandarin learners’ acquisition of Portuguese /l/ and /ɾ/ seems to be subject to interaction among different prosodic positions, speech modalities and representational levels. This thesis aims to deepen our current understanding of this L2 speech learning process by exploring what constrains the development of L2 phonological categories across syllable positions and how different modalities interact during this process. To achieve this goal, both experimental tasks and theoretical modelling were employed. The first study of this thesis explores the role of cross-linguistic influence and orthography in L2 category formation. In order to elicit cross-linguistic influence directly, a delayed-imitation task was performed with L1-Mandarin naïve listeners. This task examined how Mandarin phonology parses the Portuguese input ([l], [ɾ]) in intervocalic onset and in word-internal coda position. Moreover, whether orthography plays a role in the construction of L2 phonological representations was tested by manipulating the type of input given in the experiment (auditory input alone vs. auditory + written input). Our study shows that naïve Mandarin listeners’ responses were consistent with those of L1-Mandarin learners, suggesting that cross-linguistic influence is responsible for the observed L2 prosodic effects. Moreover, the Mandarin [ɻ] (a repair strategy for /ɾ/) occurred almost exclusively when the written form was given, providing evidence for the cross-linguistic interaction between phonological categorization and orthography during the construction of L2 categories. 
In the second study, we first investigate the interaction between speech perception and production in L2 speech learning, by examining whether the L2 deviant productions stem from misperception and whether the order of acquisition in L2 speech perception mirrors that in production. Secondly, we test whether L2 phonological categories remain malleable at a mid-late stage of L2 speech learning. Two perceptual experiments were performed to test L1-Mandarin learners on their discrimination ability between the target Portuguese form and the deviant form employed in L2 production. Expanding on prior research, in this study, the perceptual motivation for L2 speech difficulties was assessed in different syllable constituents (onset and coda) and at both segmental and suprasegmental levels (structural modification). The results demonstrate that some deviant forms observed in L2 production indeed have a perceptual motivation ([w] for the velarised lateral; [l] and [ɾə] for the tap), while some others cannot be attributed to misperception (deletion of syllable-final tap). Furthermore, learners confused the intervocalic /l/ and /ɾ/ bidirectionally in perception, while in production they never misproduced the lateral (/ɾ/ → [l], */l/ → [ɾ]), revealing a mismatch between two speech modalities. By contrast, the order of acquisition (/ɾ/coda > /ɾ/onset) was shown to be consistent in L2 perception and production. The correspondence and discrepancy between the two speech modalities signal a complex relationship between L2 speech perception and production. To assess the plasticity of L2 categories /l/ and /ɾ/, two groups of L1-Mandarin learners who differ substantially in terms of L2 experience were recruited in the perceptual tasks. Our study shows that both groups behaved similarly in terms of the discrimination performance. No evidence for a role of L2 experience was found. The implication of this null result on L2 phonological development is discussed. 
The third study of the thesis aims to help bridge the gap between L2 experimental evidence and formal theories. Adopting the Bidirectional Phonology and Phonetics Model, we formalise some of the experimental findings that cannot be explained by current L2 speech theories, namely, the between- and within-subject variation in L2 phonological categorization; the interaction between phonological categorization and orthography during L2 category construction; and the asymmetry between L2 perception and production. Overall, this thesis sheds light on the complex nature of L2 phonological acquisition and provides a formal account of how different modalities interact in shaping L2 speech learning. Moreover, it puts forward testable predictions for future research and suggestions for improving foreign language teaching/training methodologies.

    Current trends in multilingual speech processing

    In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years, and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS) as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, these also being barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies at the heart of which lies multilingual speech processing.

    Phonotactic probability and phonotactic constraints: processing and lexical segmentation by Arabic learners of English as a foreign language

    PhD Thesis
    A fundamental skill in listening comprehension is the ability to recognize words. The ability to accurately locate word boundaries (i.e. to lexically segment) is an important contributor to this skill. Research has shown that English native speakers use various cues in the signal in lexical segmentation. One such cue is phonotactic constraints; more specifically, the presence of illegal English consonant sequences such as AV and MY signals word boundaries. It has also been shown that phonotactic probability (i.e. the frequency of segments and sequences of segments in words) affects native speakers' processing of English. However, the role that phonotactic probability and phonotactic constraints play in the EFL classroom has hardly been studied, even though much attention has been devoted to teaching listening comprehension in EFL. This thesis reports on an intervention study which investigated the effect of teaching English phonotactics on Arabic speakers' lexical segmentation of running speech in English. The study involved a native English group (N=12), a non-native speaking control group (N=20), and a non-native speaking experimental group (N=20). Each of the groups took three tests, namely Non-word Rating, Lexical Decision and Word Spotting. These tests probed how sensitive the subjects were to English phonotactic probability and to the presence of illegal sequences of phonemes in English, and investigated whether they used these sequences in the lexical segmentation of English. The non-native groups were post-tested with the same tasks after only the experimental group had been given a treatment which consisted of explicit teaching of relevant English phonotactic constraints and related activities for 8 weeks. The gains made by the experimental group are discussed, with implications for teaching both pronunciation and listening comprehension in an EFL setting.
    Qassim University, Saudi Arabia
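The abstract above defines phonotactic probability as the frequency of segments and sequences of segments in words. A rough way to make that concrete is to count adjacent segment pairs over a word list. The sketch below is only an illustration under simplifying assumptions: the function names are hypothetical, the toy lexicon is invented, and letters stand in for phonemes. It is not the measure used in the thesis.

```python
from collections import Counter

def biphone_probabilities(lexicon):
    """Relative frequency of each adjacent segment pair across a list of words.

    Each word is a sequence of segments; here plain strings are used, so each
    letter is treated as one segment (a simplification for illustration).
    """
    counts = Counter()
    for word in lexicon:
        counts.update(zip(word, word[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def word_score(word, probs):
    """Mean biphone probability of a word; pairs never seen in the lexicon count as zero."""
    pairs = list(zip(word, word[1:]))
    if not pairs:
        return 0.0
    return sum(probs.get(pair, 0.0) for pair in pairs) / len(pairs)

# Toy lexicon, invented for this example.
toy_lexicon = ["stop", "spot", "post", "tops"]
probs = biphone_probabilities(toy_lexicon)
```

With this toy lexicon, a string built from frequent pairs (such as `"stop"`) scores higher than one containing pairs that never occur (such as `"tsop"`), which is the kind of contrast a non-word rating task is designed to probe.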

    Yet Another Model for Arabic Dialect Identification

    In this paper, we describe a spoken Arabic dialect identification (ADI) model that consistently outperforms previously published results on two benchmark datasets: ADI-5 and ADI-17. We explore two architectural variations, ResNet and ECAPA-TDNN, coupled with two types of acoustic features: MFCCs and features extracted from the pre-trained self-supervised model UniSpeech-SAT Large, as well as a fusion of all four variants. We find that, individually, the ECAPA-TDNN network outperforms ResNet, and models with UniSpeech-SAT features outperform models with MFCCs by a large margin. Furthermore, a fusion of all four variants consistently outperforms the individual models. Our best models outperform previously reported results on both datasets, with accuracies of 84.7% and 96.9% on ADI-5 and ADI-17, respectively.
    Comment: Accepted at ArabicNLP 202