186 research outputs found

    Self-imitating Feedback Generation Using GAN for Computer-Assisted Pronunciation Training

    Self-imitating feedback is an effective and learner-friendly method for non-native learners in Computer-Assisted Pronunciation Training. Acoustic characteristics of native utterances are extracted, transplanted onto the learner's own speech input, and given back to the learner as corrective feedback. Previous work focused on speech conversion using prosodic transplantation techniques based on the PSOLA algorithm. Motivated by the visual differences found in spectrograms of native and non-native speech, we investigated applying a GAN to generate self-imitating feedback by exploiting the generator's ability through adversarial training. Because this mapping is highly under-constrained, we also adopt a cycle consistency loss to encourage the output to preserve the global structure, which is shared by native and non-native utterances. Trained on 97,200 spectrogram images of short utterances produced by native and non-native speakers of Korean, the generator is able to successfully transform the non-native spectrogram input into a spectrogram with properties of self-imitating feedback. Furthermore, the transformed spectrogram shows segmental corrections that cannot be obtained by prosodic transplantation. A perceptual test comparing the self-imitating and correcting abilities of our method with the baseline PSOLA method shows that the generative approach with cycle consistency loss is promising.
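The cycle consistency idea in the abstract above can be illustrated with a minimal sketch. It assumes an L1 reconstruction penalty (the usual CycleGAN choice, not confirmed by the abstract) and uses two toy stand-in generators, `G` (non-native to native) and `F` (native to non-native); the function names, array shapes, and generators are all hypothetical and are not taken from the paper.

```python
import numpy as np


def l1_distance(a, b):
    """Mean absolute difference between two arrays of equal shape."""
    return float(np.mean(np.abs(a - b)))


def cycle_consistency_loss(x_nonnative, y_native, G, F):
    """Penalise round trips that fail to reproduce their input.

    G maps non-native spectrograms towards the native domain and
    F maps native spectrograms towards the non-native domain.
    Both are hypothetical callables standing in for trained networks.
    """
    forward_cycle = l1_distance(F(G(x_nonnative)), x_nonnative)
    backward_cycle = l1_distance(G(F(y_native)), y_native)
    return forward_cycle + backward_cycle


# Toy "spectrogram" batches: 4 images of 128 x 128 values (assumed shape).
rng = np.random.default_rng(0)
x = rng.random((4, 128, 128))
y = rng.random((4, 128, 128))

# Toy generators: G adds a constant offset; F_good undoes it, F_bad does not.
G = lambda s: s + 0.5
F_good = lambda s: s - 0.5
F_bad = lambda s: s

loss_good = cycle_consistency_loss(x, y, G, F_good)  # close to zero
loss_bad = cycle_consistency_loss(x, y, G, F_bad)    # clearly larger
```

In this toy setting the loss is near zero only when the two mappings invert each other, which is the constraint that discourages a generator from discarding the structure of its input.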

    Pronunciation Variation Analysis and CycleGAN-Based Feedback Generation for CAPT (CAPT를 위한 발음 변이 분석 및 CycleGAN 기반 피드백 생성)

    Doctoral thesis, Seoul National University Graduate School: Interdisciplinary Program in Cognitive Science, College of Humanities, 2020. 2. 정민화. Despite the growing popularity of learning Korean as a foreign language and the rapid development of language learning applications, the existing computer-assisted pronunciation training (CAPT) systems for Korean do not utilize the linguistic characteristics of non-native Korean speech. Pronunciation variations in non-native speech are far more diverse than those observed in native speech, which may make it difficult to incorporate such knowledge into an automatic system. Moreover, most of the existing methods rely on feature extraction results from signal processing, prosodic analysis, and natural language processing techniques. Such methods are limited because they necessarily depend on finding the right features for the task and on the accuracy of the extraction. This thesis presents a new approach to corrective feedback generation in a CAPT system, in which pronunciation variation patterns and linguistic correlates of accentedness are analyzed and combined with a deep neural network approach, so that feature engineering effort is minimized while the linguistically important factors for the corrective feedback generation task are maintained. Investigations of non-native Korean speech characteristics in contrast with those of native speakers, and of their correlation with accentedness judgements, show that both segmental and prosodic variations are important factors in a Korean CAPT system. The thesis argues that the feedback generation task can be interpreted as a style transfer problem, and proposes to evaluate the idea using a generative adversarial network. A corrective feedback generation model is trained on 65,100 read utterances by 217 non-native speakers of 27 mother tongue backgrounds. The features are learnt automatically in an unsupervised way in an auxiliary classifier CycleGAN setting, in which the generator learns to map foreign-accented speech to native speech distributions. 
In order to inject linguistic knowledge into the network, an auxiliary classifier is trained so that the feedback also identifies the linguistic error types that were defined in the first half of the thesis. The proposed approach generates a corrected version of the speech using the learner's own voice, outperforming the conventional Pitch-Synchronous Overlap-and-Add method.

Interest in teaching Korean as a foreign language is rising and the number of Korean learners has grown considerably, and research on computer-assisted pronunciation training (CAPT) applications that apply spoken language processing technology is also being actively pursued. Nevertheless, existing Korean speaking-education systems do not make sufficient use of the linguistic characteristics of Korean as spoken by foreigners, nor do they apply the latest language processing technology. Possible causes are that phenomena in non-native Korean speech have not been analyzed sufficiently, and that even where relevant studies exist, more advanced research is needed to incorporate them into an automated system. Moreover, CAPT technology in general relies on feature extraction through techniques such as signal processing, prosodic analysis, and natural language processing, so finding suitable features and extracting them accurately takes considerable time and effort. This suggests that there is ample room to improve this process by using recent deep-learning-based language processing technology. This study therefore first analyzed pronunciation variation patterns and their linguistic correlates for CAPT system development. Variation patterns in the read speech of non-native speakers were contrasted with those of native Korean speakers to identify the salient variations, and a correlation analysis was then used to determine how much each affects communication. The results confirmed that coda (final consonant) deletion, confusion of the three-way contrast, and suprasegmental errors need to be given priority in feedback generation. Automatically generating corrected feedback is one of the important tasks of a CAPT system. This study took the view that this task can be interpreted as a speech style transfer problem and proposed modeling it in a cycle-consistent generative adversarial network (CycleGAN) architecture. The generator of the GAN learns the mapping between the distribution of non-native speech and that of native speech, and the cycle consistency loss preserves the overall structure across utterances while preventing over-correction. Because the required features are learned in an unsupervised manner within the CycleGAN framework, without a separate feature extraction step, the method is easy to extend to other languages. The study proposed modeling the priorities among the salient variations revealed by the linguistic analysis in an Auxiliary Classifier CycleGAN architecture. This method grafts knowledge onto the existing CycleGAN so that it generates feedback speech while also classifying which type of error the feedback addresses. Its significance is that domain knowledge is retained through to the corrective feedback generation stage and can be controlled. To evaluate the proposed method, an automatic feedback generation model was trained on 65,100 meaningful-word utterances from 217 speakers of 27 native languages, and a perceptual evaluation of whether and how much pronunciation improved was carried out. With the proposed method, speech can be converted to a corrected pronunciation while keeping the learner's own voice, and a relative improvement of 16.67% over the traditional Pitch-Synchronous Overlap-and-Add approach was confirmed.

Chapter 1. Introduction: 1.1. Motivation (1.1.1. An Overview of CAPT Systems; 1.1.2. Survey of Existing Korean CAPT Systems); 1.2. Problem Statement; 1.3. Thesis Structure
Chapter 2. Pronunciation Analysis of Korean Produced by Chinese: 2.1. Comparison between Korean and Chinese (2.1.1. Phonetic and Syllable Structure Comparisons; 2.1.2. Phonological Comparisons); 2.2. Related Works; 2.3. Proposed Analysis Method (2.3.1. Corpus; 2.3.2. Transcribers and Agreement Rates); 2.4. Salient Pronunciation Variations (2.4.1. Segmental Variation Patterns; 2.4.1.1. Discussions; 2.4.2. Phonological Variation Patterns; 2.4.1.2. Discussions); 2.5. Summary
Chapter 3. Correlation Analysis of Pronunciation Variations and Human Evaluation: 3.1. Related Works (3.1.1. Criteria Used in L2 Speech; 3.1.2. Criteria Used in L2 Korean Speech); 3.2. Proposed Human Evaluation Method (3.2.1. Reading Prompt Design; 3.2.2. Evaluation Criteria Design; 3.2.3. Raters and Agreement Rates); 3.3. Linguistic Factors Affecting L2 Korean Accentedness (3.3.1. Pearson's Correlation Analysis; 3.3.2. Discussions; 3.3.3. Implications for Automatic Feedback Generation); 3.4. Summary
Chapter 4. Corrective Feedback Generation for CAPT: 4.1. Related Works (4.1.1. Prosody Transplantation; 4.1.2. Recent Speech Conversion Methods; 4.1.3. Evaluation of Corrective Feedback); 4.2. Proposed Method: Corrective Feedback as a Style Transfer (4.2.1. Speech Analysis at Spectral Domain; 4.2.2. Self-imitative Learning; 4.2.3. An Analogy: CAPT System and GAN Architecture); 4.3. Generative Adversarial Networks (4.3.1. Conditional GAN; 4.3.2. CycleGAN); 4.4. Experiment (4.4.1. Corpus; 4.4.2. Baseline Implementation; 4.4.3. Adversarial Training Implementation; 4.4.4. Spectrogram-to-Spectrogram Training); 4.5. Results and Evaluation (4.5.1. Spectrogram Generation Results; 4.5.2. Perceptual Evaluation; 4.5.3. Discussions); 4.6. Summary
Chapter 5. Integration of Linguistic Knowledge in an Auxiliary Classifier CycleGAN for Feedback Generation: 5.1. Linguistic Class Selection; 5.2. Auxiliary Classifier CycleGAN Design; 5.3. Experiment and Results (5.3.1. Corpus; 5.3.2. Feature Annotations; 5.3.3. Experiment Setup; 5.3.4. Results); 5.4. Summary
Chapter 6. Conclusion: 6.1. Thesis Results; 6.2. Thesis Contributions; 6.3. Recommendations for Future Work
Bibliography; Appendix; Abstract in Korean; Acknowledgments
Docto

    Design and evaluation of mobile computer-assisted pronunciation training tools for second language learning

    The quality of speech technology (automatic speech recognition, ASR, and text-to-speech, TTS) has considerably improved and, consequently, an increasing number of computer-assisted pronunciation training (CAPT) tools have included it. However, pronunciation is one area of teaching that has not been developed enough, since there is scarce empirical evidence assessing the effectiveness of tools and games that include speech technology in the field of pronunciation training and teaching. This PhD thesis addresses the design and validation of an innovative CAPT system for smart devices for training second language (L2) pronunciation. In particular, it aims to improve learners’ L2 pronunciation at the segmental level with a specific set of methodological choices, such as the connection between the learner’s first and second languages (L1–L2), minimal pairs, a training cycle of exposure–perception–production, individualistic and social approaches, and the inclusion of ASR and TTS technology. The experimental research conducted by applying these methodological choices with real users validates the efficiency of the CAPT prototypes developed for the four main experiments of this dissertation. Data are automatically gathered by the CAPT systems to give immediate, specific feedback to users and to analyze all results. The protocols, metrics, algorithms, and methods necessary to statistically analyze and discuss the results are also detailed. The two main L2s tested during the experimental procedure are American English and Spanish. The different CAPT prototypes designed and validated in this thesis, and the methodological choices that they implement, make it possible to accurately measure the relative pronunciation improvement of the individuals who trained with them. Raters’ subjective scores and the CAPT tools’ objective scores show a strong correlation, which will be useful in the future for assessing large amounts of data and reducing human costs. 
Results also show intensive practice, supported by a significant number of activities carried out. In the controlled experiments, students who worked with the CAPT tool achieved better pronunciation improvement values than their peers in the traditional in-classroom instruction group. In the challenge-based CAPT learning game proposed, the most active players in the competition kept on playing until the end and achieved significant pronunciation improvement results. Departamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos). Doctorado en Informátic

    Learning English as a Foreign Language in an Online Interactive Environment: A Case Study in China

    This case study is designed to examine Chinese university students’ English as a foreign language (EFL) learning in an online interactive context. The investigation focused on the students’ perceptions of and engagement in EFL learning that occurred in a technology-supported context. Informed by sociocultural theory, four theoretical constructs (learner autonomy, interactive learning, the Zone of Proximal Development (ZPD), and scaffolding) form the theoretical framework used to investigate Chinese university students’ EFL learning in a Computer-Assisted Language Learning (CALL) context. This theoretical model informs the adoption of a qualitative case study approach with statistical descriptions. A total of 154 Chinese university EFL students participated in the research. Data were collected via a questionnaire, focus groups, individual face-to-face interviews and online documents. Data analysis revealed that Chinese university EFL students had positive perceptions of interactive online language learning, which promoted learner autonomy. Participants were confident about their ability to find appropriate learning materials and associated well-scaffolded instructional resources that were within their ZPDs. In the learning process, they enjoyed an increasing level of autonomy in language learning. They autonomously selected, organized and engaged with digital resources, including learning materials and tasks as well as learning strategies, that were appropriate to their language levels and catered for their learning needs. They showed the signs of good language learners with a high degree of learner autonomy and indicated a desire to continue their language learning in the future. The participants also regarded the online space as a low-stress context for more interactive learning in an English as a foreign language context. 
Although the participants had developed some degree of learner autonomy via learning in the online mode, their autonomy in language learning, particularly for after-class online EFL learning, was still in development. There was a need for them to expand their language knowledge and skills development, particularly in the area of intercultural learning. Their selection and adoption of learning resources were also expected to improve to suit their current language abilities and their learning needs. Their understanding of and engagement in interactive learning were yet to be enhanced as they became more familiar with learning in this emerging context. Building on these findings, a tentative model of online EFL learning for facilitating learner autonomy is proposed to fulfil Chinese EFL students’ language learning needs in an online context and help them achieve better learning outcomes. It is envisaged that such a model can be replicated in teaching and learning EFL in similar contexts.

    Artificial Intelligence and Education. Guidance for Policy-makers

    Artificial Intelligence (AI) has the potential to address some of the biggest challenges in education today, innovate teaching and learning practices, and ultimately accelerate progress towards SDG 4. However, these rapid technological developments inevitably bring multiple risks and challenges, which have so far outpaced policy debates and regulatory frameworks. This publication offers guidance for policy-makers on how best to leverage the opportunities and address the risks presented by the growing connection between AI and education. It starts with the essentials of AI: definitions, techniques and technologies. It continues with a detailed analysis of the emerging trends and implications of AI for teaching and learning, including how we can ensure the ethical, inclusive and equitable use of AI in education, how education can prepare humans to live and work with AI, and how AI can be applied to enhance education. Finally, it introduces the challenges of harnessing AI to achieve SDG 4 and offers concrete, actionable recommendations for policy-makers to plan policies and programmes for local contexts.

    Learner autonomy: The complexity of control‐shift

    It is generally held that constructing learner autonomy (LA) requires a pedagogical shift of control from teachers to students. It is also understood that the development of learner autonomy relates largely to teacher autonomy (TA), which requires school managers to relinquish some degree of control to teachers. However, from a socio‐political perspective, autonomy is a right that also extends to educational managers (MA). Thus, a problem arises: how can the three levels of control shift co‐exist and survive in harmony and, ideally, each thrive in its own way? Based on a recent case study, this paper aims to explore the complexity of the dynamic interaction between these three types of autonomy within an educational hierarchy. The study was conducted in a private Chinese secondary school which was promoting whole‐person development through a comprehensive innovation project involving all its academic staff members. The participants comprised nine English teachers, the principal, and the school’s executive director. Data were collected through interviews, classroom observations followed by post‐lesson discussions, and the researcher’s field notes. Specifically, the paper addresses three questions, focusing on managers’ perceptions of LA, a classroom instruction model intended to cultivate LA, and an in‐house professional development scheme to facilitate TA, all of which affected teachers’ professional decision‐making. The findings present a complex picture of these issues and point to the importance of a genuinely shared understanding of the nature of autonomy and the need to carefully ensure the optimal balance among the three types of autonomy in the design and implementation of curriculum innovations.

    Brain structural predispositions for music and language processing

    [eng] It has been shown that music and language training can elicit plastic changes in brain structure and function, bringing behavioural benefits. For instance, musicians have been reported to have better auditory discrimination (including pitch and speech-in-noise perception), motor synchronization, verbal memory and general IQ than individuals without a formal musical background. Also, bilinguals have shown higher executive function and attention-related abilities than monolinguals. Furthermore, altered functional and structural connectivity can be tracked to brain areas related to the activities most frequently performed by both musicians (instrumentalists and singers) and linguistic experts (such as bilinguals or professional phoneticians). While research in the last decade has devoted considerable effort to the study of brain plasticity, only a few investigations have addressed the connection between the initial functional or structural properties of brain networks related to auditory-motor function and subsequent language or musical training. Indeed, brain structural markers such as grey matter volume/density or white-matter diffusivity measurements from diffusion tensor imaging (DTI) data, as well as functional measurements from task-related activity or resting-state data from magnetic resonance imaging (MRI) or electroencephalography (EEG), have been demonstrated to correlate with subsequent performance and learning in the auditory-motor domain. The main goal of the present dissertation was twofold: we aimed to further the existing knowledge regarding brain plasticity elicited during putative sensitive periods and after long-term music practice, and to explore the white-matter pathways that predict linguistic or musical skills at baseline. 
Our secondary goals were to confirm previous findings regarding the brain structures involved in music and language processing, as well as to provide evidence of the benefits of using structural measurements and correlational analyses between imaging and behavioural data to study inter-individual differences. Study I compared professional pianists and non-musicians and observed a complex pattern of increases and decreases in grey matter volume. In comparison to non-musicians, pianists showed greater grey matter volume in areas related to motor skill and the automatization of learned movements, as well as reinforcement learning and emotional processing. On the other hand, regions associated with sensorimotor control, score reading, and auditory and musical perception showed a reduction in grey matter volume. Study II explored the relationship between white-matter structural properties of the arcuate fasciculus (AF) and the performance of native German speakers in a foreign-language (Hindi) sentence and word imitation task. We found that a greater left lateralization of the AF volume predicted performance on the imitation task. This result was confirmed using not only a manual deterministic approach but also an automatic atlas-based fibre-reconstruction method, which in addition pointed to a specific region in the anterior half of the left AF as the most related to imitation ability. Study III aimed to investigate whether the white-matter structural connectivity of the pathways previously described as targets for plasticity mechanisms in professional musicians predicted musical abilities in non-musicians. We observed that the white-matter microstructural organization of the right-hemisphere pathways involved in motor control (corticospinal tract) and auditory-motor transformations (AF) correlated with the performance of non-musicians during the initial stages of rhythmic and melodic learning. 
The present work confirmed the involvement, in the first stages of audio-motor learning, of several brain structures previously described as displaying plastic effects associated with music and language training. Furthermore, the results challenge previous views regarding music-induced plasticity by showing that expertise is not always or uniquely correlated with increases in brain tissue. This raises the question of the role of efficiency mechanisms derived from professional-like practice. Most importantly, the results from these three studies converge in showing that a prediction-feedback-feedforward loop for auditory-motor processing may be crucially involved in both musical and language learning and skills. We thus suggest that brain auditory-motor systems previously described as participating in native language processing (cortical areas of the dorsal route for language processing and the AF that connects them) may also be recruited during exposure to new linguistic or musical material, being refined after sustained music practice.

[spa] Previous studies show that musical and linguistic training brings about plastic changes in brain structures and functions, accompanied by behavioural benefits. For example, musicians have been described as having better auditory discrimination skills (including pitch perception and the discrimination of speech in a noisy environment), a greater capacity for motor synchronization, and better verbal memory and general IQ than people without musical training. In parallel, bilinguals show better executive functions and attention-related abilities than monolingual individuals. 
In addition, alterations in functional and structural brain connectivity can be traced by studying the brain areas related to the activities most often performed by musicians (instrumentalists and singers) and linguistic experts (such as bilinguals or professional phoneticians). Although significant effort has been devoted over the last decade to research on brain plasticity, only a few studies have tried to investigate the connection between the initial properties of the brain, in terms of the functions and structures related to auditory-motor functions, and subsequent musical or language learning. However, brain structural markers, such as grey matter volume/density or white-matter diffusivity measures from diffusion tensor imaging data, as well as functional measures of task-related activity or resting-state data obtained by magnetic resonance imaging or electroencephalography, have been shown to correlate with performance and learning in the auditory-motor domain. In this thesis we aimed, on the one hand, to broaden our knowledge of the brain plasticity obtained during the putative "sensitive periods" and after sustained musical practice and, on the other hand, to explore the white-matter pathways that can predict linguistic or musical skills at the start of learning. As secondary objectives, we wanted to confirm previous results regarding the brain structures involved in music and language processing, and to support the use of structural measurements and correlational approaches (between neuroimaging and behavioural data) to study inter-individual differences. 
Study I focused on the comparison between professional pianists and non-musicians, observing a complex pattern of increases and decreases in grey matter volume. Compared with non-musicians, pianists showed greater grey matter volume in areas related to motor skill and the automatization of learned movements, as well as reinforcement learning and emotional processing, whereas regions associated with sensorimotor control, score reading, and auditory and musical perception showed a reduction in grey matter volume. Study II explored the relationship between the white-matter structural properties of the arcuate fasciculus (AF) and the performance of native German speakers in a sentence and word imitation task in a foreign language (Hindi). We found that greater leftward lateralization of AF volume predicted performance on the imitation task. This result was confirmed using not only a manual deterministic approach but also an automatic, atlas-based reconstruction of the white-matter fibres, which in addition pointed to a specific region in the anterior half of the left AF as the most related to imitation abilities. Study III aimed to investigate whether the structural connectivity of white-matter pathways previously described as targets for plasticity mechanisms in professional musicians could predict musical abilities in non-musicians. The white-matter microstructural organization of right-hemisphere pathways involved in motor control (corticospinal tract) and auditory-motor transformations (AF) was found to correlate with the performance of non-musicians in the initial stages of rhythmic and melodic learning. 
The present work has confirmed the involvement, in the first stages of audio-motor learning, of several brain structures that had previously shown plastic effects associated with musical and language learning. In addition, these results challenge earlier views on the plasticity induced by musical experience by demonstrating that expertise does not always or uniquely correlate with an increase in brain tissue, thereby raising questions about the efficiency mechanisms derived from professional-level musical practice. More importantly still, the results of these three studies converge in showing that a prediction-feedback-feedforward loop for auditory-motor processing may be crucially involved in both musical learning and language learning. We therefore suggest that the auditory-motor systems of the brain previously described as participating in native language processing (cortical areas involved in the dorsal route for language processing, and the AF that connects them) may also be recruited during exposure to new linguistic or musical material, being refined after years of active musical practice.
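Study II in the abstract above relates left lateralization of AF volume to imitation performance, but the abstract does not say how lateralization was quantified. As a rough illustration only, the sketch below uses one common convention, the laterality index (L - R) / (L + R); the function name and the example volumes are hypothetical and are not taken from the thesis.

```python
def laterality_index(left_volume, right_volume):
    """Return (L - R) / (L + R) for two non-negative tract volumes.

    Positive values indicate leftward lateralization, negative values
    rightward lateralization, and 0.0 a symmetric tract. This formula is
    an assumed convention, not the measure reported in the thesis.
    """
    if left_volume < 0 or right_volume < 0:
        raise ValueError("volumes must be non-negative")
    total = left_volume + right_volume
    if total == 0:
        raise ValueError("at least one volume must be positive")
    return (left_volume - right_volume) / total


# Hypothetical volumes in arbitrary units, chosen only for illustration.
example_left_dominant = laterality_index(3.0, 1.0)
example_symmetric = laterality_index(2.0, 2.0)
```

An index of this kind is bounded between -1 and 1, which makes values comparable across participants whose overall tract volumes differ.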