23 research outputs found

    PLASER: Pronunciation Learning via Automatic Speech Recognition

    PLASER is a multimedia tool with instant feedback designed to teach English pronunciation to high-school students in Hong Kong whose mother tongue is Cantonese Chinese. The objective is to teach correct pronunciation, not to assess a student's overall pronunciation quality. Major challenges related to speech recognition technology include allowance for non-native accents, reliable and corrective feedback, and visualization of errors.

    I hear you eat and speak: automatic recognition of eating condition and food type, use-cases, and impact on ASR performance

    We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database, which is made publicly available for research purposes. It features 1.6k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech. We start by demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing of low-level acoustic features and by higher-level features related to intelligibility, obtained from an Automatic Speech Recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i.e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with a determination coefficient of up to 56.2%.

    Comparación de dos métodos basados en la intensidad para el cálculo automático de la velocidad de habla [Comparison of two intensity-based methods for the automatic computation of speech rate]

    Automatic computation of speech rate is a necessary task in a wide range of applications that require this prosodic feature but for which manual transcriptions and time alignments are not available. Several tools have been developed to this end, but not enough research has yet been conducted to see to what extent they scale to other languages. In the present work, we take two off-the-shelf tools designed for automatic speech rate computation and already tested for Dutch and English (v1, which relies on intensity peaks preceded by an intensity dip to find syllable nuclei, and v3, which relies on intensity peaks surrounded by dips) and apply them to read and spontaneous Spanish speech. We then test which of them offers the best performance. The results obtained with precision and normalized mean squared error metrics show that v3 performs better than v1. However, recall shows better performance for v1, which suggests that a more fine-grained analysis of sensitivity and specificity is needed to select the best option for a given application.

    New Perspectives in Teaching Pronunciation

    pp.165-18

    The Effect of Using Authentic Videos on English Major Students' Prosodic Competence

    This study aims to investigate the effect of using authentic videos on the prosodic competence of foreign language learners. It is widely hypothesized that authentic videos have a positive effect on EFL learners' suprasegmental competence. The population of the study included 32 students majoring in English Language at Taibah University in KSA during the academic year 2011/2012. The sample consisted of two sections, a control group and an experimental one. A pretest was administered to both groups to ensure that they were homogeneous. The control group was taught suprasegmental aspects of language using a traditional approach, while the experimental group was taught using authentic videos. About four months later, a posttest was administered. The results of the study showed that there was much progress in the experimental group, which significantly outperformed the control group in the different aspects of prosody. These findings confirm the hypothesis that authentic videos can have a positive effect on EFL learners' suprasegmental competence. Keywords: suprasegmental competence, authentic videos, Saudi English major students as EFL learners, intonation, pronunciation, stress, pause, juncture, rhyme, and prosodic aspects of language

    Directions for the future of technology in pronunciation research and teaching

    This paper reports on the role of technology in state-of-the-art pronunciation research and instruction, and makes concrete suggestions for future developments. The point of departure for this contribution is that the goal of second language (L2) pronunciation research and teaching should be enhanced comprehensibility and intelligibility as opposed to native-likeness. Three main areas are covered here. We begin with a presentation of advanced uses of pronunciation technology in research, with a special focus on the expertise required to carry out even small-scale investigations. Next, we discuss the nature of data in pronunciation research, pointing to ways in which future work can build on advances in corpus research and crowdsourcing. Finally, we consider how these insights pave the way for researchers and developers working to create research-informed, computer-assisted pronunciation teaching resources. We conclude with predictions for future developments.

    Automatic Pronunciation Assessment of Korean Spoken by L2 Learners Using Best Feature Set Selection

    This paper proposes a method for automatic pronunciation assessment of Korean spoken by L2 learners by selecting the best feature set from a collection of the most well-known features in the literature. The L2 Korean Speech Corpus is used for assessment modeling, where the native languages of the L2 learners are English, Chinese, Japanese, Russian, and Mongolian. In our system, learners' speech is force-aligned and recognized using a native Korean acoustic model. Based on these results, various features for pronunciation assessment are computed and divided into four categories: RATE, SEGMENT, SILENCE, and GOP. Pronunciation scores produced by combining categories of features by multiple linear regression are used as a baseline. To improve on the baseline performance, relevant features are selected using Principal Component Regression (PCR) and Best Subset Selection (BSS), respectively. The results show that the BSS model outperforms the baseline and the PCR model, and that features corresponding to speech segment and rate are selected as the relevant ones for automatic pronunciation assessment. The observed tendency of salient features will be useful for further improvement of automatic pronunciation assessment models for Korean language learners.

    Assessment of non-native speech using vowel space characteristics

    In this paper, we propose the idea of using the characteristics of a speaker’s vowel space for automated assessment of second language (L2) proficiency. Specifically, we adopt features that were shown in previous studies to be good indicators of native speaker intelligibility and clarity and apply them to L2 speech from non-native speakers. The features focus on three peripheral vowels (IY, AA, and OW) and measure a speaker’s coverage of the vowel space. A pilot study and a large-scale corpus study involving read speech produced by native and non-native speakers were conducted, in which the vowel space features were rank correlated with pronunciation scores provided by human listeners for the non-native speech and an assumed higher score for the native speech. The results of the studies show that several of the features achieve moderately high correlations with the pronunciation scores, supporting their usefulness for automated assessment of non-native speech. The feature with the best performance in the large-scale study was the F2 − F1 distance for IY, which achieved a correlation of 0.78 with pronunciation proficiency scores. Index Terms: speech assessment, phonetics, vowel quality