This paper presents the results of a joint project between an engineering team at one university and an educational team at another to develop an online fluency assessment system for Japanese learners of English. A picture description corpus of English spoken by 90 learners and 10 native speakers was used, in which the fluency of each speaker was rated manually by 10 other native raters. The assessment system was built to predict the averaged manual scores. System development focused on two separate purposes. The system was trained in an analytical way, so that teachers can see and discuss which speech features contribute more to fluency prediction, and in a technical way, so that teachers' knowledge can be incorporated when training the system, which can be further optimized using an interpretable network. Experiments showed that quality-of-pronunciation features are much more helpful than quantity-of-phonation features, and the optimized system reached a correlation of 0.956 with the averaged manual scores, which is higher than the maximum of the inter-rater correlations (0.910).

Minematsu, N

Saito, D

Saito, K

Shen, Y

Yasukagawa, A

English

UCL Discovery

Improved Prediction of Perceived Fluency of Japanese English using Quantity of Phonation and Quality of Pronunciation ∗

☆Yang SHEN, Ayano YASUKAGAWA, Daisuke SAITO, Nobuaki MINEMATSU (UTokyo), Kazuya SAITO (UCL)

1 Introduction

To support people learning a new language, various types of technical aids have been examined [1, 2] and realized as commercial products or services [3, 4]. This paper presents research results of a joint project between UTokyo and UCL, in which teachers asked engineers to automate their fluency scoring strategy. In this study, prediction is conducted with Elastic Net regression on speech features. Experimental results demonstrate that posteriorgrams with multiple granularities are effective for prediction: a correlation of 0.925 is obtained between the machine scores and the perceived-fluency scores averaged over 10 native raters. This value is higher than the average inter-rater (one-to-others) correlation of 0.873.

2 Related works

2.1 Picture description corpus with fluency rating [5]

90 native Japanese students as well as 10 native speakers of English participated in data collection. The task was picture description: three independent photos were presented to the participants with three keywords per photo, as shown in Figure 1, and the participants were asked to describe the pictures orally using the keywords. Their utterances were recorded at 16 bits with a sampling frequency of 44.1 kHz.

10 native raters, who did not participate in data collection, were recruited for manual fluency assessment of the 100 utterances. They are native speakers, but not teachers or researchers of language education. Scores ranged from 1 (= least fluent) to 9 (= extremely fluent). Before rating, the definition of fluency in [5] was explained to the raters, who showed a high consensus on that definition.

Each rater assigned 100 scores to the 100 utterances. Correlations were calculated between every pair of raters. The minimum, average, and maximum of these one-to-one correlations are 0.677, 0.786, and 0.897.
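As a rough illustration of how inter-rater agreement figures of this kind can be computed, the sketch below derives pairwise (one-to-one) Pearson correlations from a raters-by-utterances score table, together with a rater-versus-rest variant in which each rater is compared with the mean score of all remaining raters. The function names and the toy scores are hypothetical; the paper does not state which correlation coefficient was used, so Pearson correlation is an assumption here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def summarize(values):
    """Return (minimum, average, maximum) of a list of correlations."""
    return min(values), sum(values) / len(values), max(values)


def one_to_one(scores):
    """Correlation for every pair of raters; scores[r][u] is rater r's score for utterance u."""
    return [pearson(a, b) for a, b in combinations(scores, 2)]


def one_to_others(scores):
    """Correlation of each rater with the mean score of all remaining raters."""
    result = []
    for r, own in enumerate(scores):
        others = [s for i, s in enumerate(scores) if i != r]
        mean_others = [sum(col) / len(col) for col in zip(*others)]
        result.append(pearson(own, mean_others))
    return result


# Toy data (made up, not from the corpus): 3 raters x 5 utterances on the 1-9 scale.
toy = [
    [1, 3, 5, 7, 9],
    [2, 3, 6, 6, 8],
    [1, 4, 4, 8, 9],
]
print(summarize(one_to_one(toy)))
print(summarize(one_to_others(toy)))
```

With 10 raters, `one_to_one` yields 45 pairwise values and `one_to_others` yields 10 values, one per rater.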
Correlations were also calculated between each rater and the averaged scores of the other nine raters. The minimum, average, and maximum of these one-to-others correlations are 0.798, 0.873, and 0.910. These values are used as references when assessing the performance of automatic fluency prediction.

Fig. 1 Three photos used for data collection

∗ Japanese title (in translation): Improving the prediction of perceived fluency of Japanese English using features related to quantity of phonation and quality of pronunciation, by Yang Shen, Ayano Yasukagawa, Daisuke Saito, Nobuaki Minematsu (UTokyo), and Kazuya Saito (UCL).

2.2 Manual extraction of features related to fluency [5]

The following features were manually extracted with Praat [6]: 1) the number of breakdowns, i.e. filled or unfilled pauses, per unit time, 2) speaking rate, i.e. the number of syllables per unit time, and 3) the number of repairs per unit time. The number of breakdowns was counted separately for two cases, within and between clauses. Repairs can also be divided into repetitions and self-corrections, which gives five features in total. On the basis of an extensive review of the related literature, these five features were expected to affect raters' judgements.

We regard the above features as related to quantity of phonation per unit time, and applied Elastic Net regression to them to predict the fluency scores averaged over the 10 raters. 5-fold cross-validation showed that the predicted scores had a correlation of 0.788 with the human scores, which is comparable to the average of the one-to-one correlations (0.786). This value can also be used as a reference when assessing the performance of automatic fluency prediction.

2.3 Clustering of phonemic classes using posteriors [7]

Besides features related to quantity of phonation, those related to quality of pronunciation are also examined. To this end, all the utterances are converted to posteriorgrams. Posteriorgrams generally use a set of phoneme classes numbering several thousand. These classes can be viewed as finely defined context-dependent phonemes, but they may be too fine to be used for assessment. We reduce the number of classes using bottom-up clustering with Ward's method [8], which requires the distance matrix between any two classes.
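The clustering step can be sketched as follows, assuming a symmetric class-by-class distance matrix has already been computed. The sketch applies Ward's criterion through the Lance-Williams update on squared distances. Ward's method is strictly defined for Euclidean distances, so treating another distance measure this way is a heuristic assumption made here. Merging classes by summing their posteriors frame by frame is likewise an assumption about how the reduced posteriorgram is formed, and all function names and toy numbers are hypothetical.

```python
def ward_clusters(dist, n_clusters):
    """Agglomerate items into n_clusters groups with Ward's criterion.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    """
    size = len(dist)
    members = {i: [i] for i in range(size)}
    # Squared distances between active clusters, keyed by (smaller id, larger id).
    d2 = {(i, j): dist[i][j] ** 2 for i in range(size) for j in range(i + 1, size)}
    next_id = size
    while len(members) > n_clusters:
        i, j = min(d2, key=d2.get)  # pair whose merge costs least
        dij = d2[(i, j)]
        ni, nj = len(members[i]), len(members[j])
        updated = {}
        for k, items in members.items():
            if k in (i, j):
                continue
            nk = len(items)
            dik = d2[(min(i, k), max(i, k))]
            djk = d2[(min(j, k), max(j, k))]
            # Lance-Williams update for Ward's method (squared distances).
            updated[k] = ((ni + nk) * dik + (nj + nk) * djk - nk * dij) / (ni + nj + nk)
        # Drop every distance that involves the two merged clusters.
        d2 = {pair: v for pair, v in d2.items() if i not in pair and j not in pair}
        members[next_id] = members.pop(i) + members.pop(j)
        for k, v in updated.items():
            d2[(k, next_id)] = v  # next_id is always the larger id
        next_id += 1
    return sorted(sorted(group) for group in members.values())


def reduce_posteriorgram(posteriorgram, clusters):
    """Sum the posteriors of the classes merged into each cluster, frame by frame."""
    return [[sum(frame[c] for c in group) for group in clusters] for frame in posteriorgram]


# Toy example (made-up numbers): four classes, where 0/1 and 2/3 are close.
toy_dist = [
    [0.0, 0.1, 1.0, 1.1],
    [0.1, 0.0, 0.9, 1.0],
    [1.0, 0.9, 0.0, 0.2],
    [1.1, 1.0, 0.2, 0.0],
]
toy_clusters = ward_clusters(toy_dist, 2)
toy_post = [[0.1, 0.2, 0.3, 0.4], [0.7, 0.1, 0.1, 0.1]]
toy_reduced = reduce_posteriorgram(toy_post, toy_clusters)
print(toy_clusters)
print(toy_reduced)
```

Because each frame of a posteriorgram sums to one, summing within clusters keeps every reduced frame a valid distribution over the n merged classes.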
The Bhattacharyya distance between two classes a and b is rewritten in terms of class posteriors through Bayes' theorem [7] as

$$
\begin{aligned}
BD(a, b) &= -\ln \int \sqrt{p(x|a)\,p(x|b)}\,dx \\
&= -\ln \int \sqrt{\frac{p(a|x)\,p(x)}{p(a)} \cdot \frac{p(b|x)\,p(x)}{p(b)}}\,dx \\
&= -\ln \int p(x) \sqrt{p(a|x)\,p(b|x)}\,dx + \frac{1}{2}\ln p(a) + \frac{1}{2}\ln p(b).
\end{aligned}
$$

Here p(x) is the probability density of the input x, which can be calculated using a universal background model. p(a|x) and p(b|x) are class posteriors, which are the outputs of DNN-based acoustic models for an input vector x. p(a) and p(b) are the prior probabilities of the two classes, which can be obtained as normalized frequencies from the training corpus. Once the DNN models are trained, any speech sample can be converted to its posteriorgram, which is a sequence of vectors composed of the probabilities of the phoneme classes. With the above formulation, a given posteriorgram can be reduced to a smaller number of classes. In the current study, the baseline number of classes is 2,000, and n-class posteriorgrams can be calculated for any n (2≤n≤2,000).

2.4 Phonotactic modeling of languages [9]

A classical approach to language identification is applied to quantify native-likeness. In the classical approach, a continuous phoneme recognizer of a specific language, e.g. English, is applied to a given utterance of any language, so that the utterance is forcibly represented as a sequence of English phonemes. Languages of interest are then modeled separately as phoneme N-grams over these forced English phonemes. In the special case of N=1, the model reduces to a phoneme distribution. After converting the 30-second utterance of each participant into its posteriorgram, we can calculate the averaged posterior probability of the n classes (2≤n≤2,000), which directly corresponds to the distribution of the n classes.

2.5 Elastic Net regression [10]

In this study, Elastic Net regression is used both for feature selection and for prediction. Mathematically speaking, Elastic Net regression is a combination of Ridge regression [11] and Lasso regression [12], i.e.
a combined use of the L1 norm and the L2 norm as regularization terms for the weights. The values of each feature are also normalized. Because of these, the weight coefficients attached to features with little predictive power become zero, which is why Elastic Net is said to perform prediction based on feature selection.

3 Speech features extracted for prediction

We introduce three types of features for automatic prediction: 1) those derived only from speech acoustics with signal processing techniques, 2) those derived from posteriorgrams of the utterances, and 3) those derived from ASR results for the utterances. They are related to quantity of phonation and/or quality of pronunciation. Since the utterances in the corpus [5] contain non-negligible noise, two versions of WSJ-KALDI-based English speech recognizers [13] were trained: one with the WSJ corpus only, and the other with WSJ and its noisy versions, where three levels of noise (SNR = 10, 30, 40 [dB]) were added and all the clean and noisy utterances were used together to train a noise-robust speech recognizer. Results will be shown separately for the baseline recognizer and the noise-robust recognizer.

Even with the noise-robust recognizer, the recognition accuracy was very low, and we decided not to use the recognized words as they were. However, we extracted some statistics from the recognized words, which were tested for prediction.

3.1 Features derived with signal processing

Following a previous study [14], envelope-based syllable detection was used, which is provided as a Praat script [6].
Then, speaking rate was calculated as

$$\text{speaking rate} = \frac{\#\text{syllables}}{\text{total duration of phonation}}$$

The denominator is defined as the utterance length minus the total duration of its pauses. Speaking rate does not tell us anything about how many silent frames are found in the utterance. We therefore introduced a similar but different feature, phonation ratio [15], as

$$\text{phonation ratio} = \frac{\text{total duration of phonation}}{\text{utterance length}}$$

3.2 Features derived from posteriorgrams

From the posteriorgram of each utterance, after pause removal, the following three types of features are calculated automatically.

3.2.1 Average of maximum posterior probabilities [15]

Here, from a given posteriorgram, we detect the maximum posterior probability at each time frame and average it over time. The higher the average, the more distinctly the utterance is pronounced. The number of phoneme classes n can vary from 2 to 2,000.

3.2.2 Averaged posterior distribution as fine phoneme distribution

As discussed in Section 2.4, the averaged posterior vector can be viewed as the distribution of phonemes. Since the utterance from each participant is as long as 30 seconds, a sufficient variety of phonemes is expected to occur, and the averaged posterior vector can characterize the native-likeness of each participant. The number of phoneme classes n can vary from 2 to 2,000, and the averaged posterior vector is used directly for prediction.

3.2.3 Posterior gap between a participant and native speakers

For each participant, we calculate the averaged posterior vector. Since there are 10 native speakers among the participants, we calculate the distance from a participant to each native speaker, giving 10 gaps in total. The Bhattacharyya distance is used again as the metric, with variable dimension n. These gaps quantify the native-likeness of each participant more directly, and the averaged gap is used for prediction of fluency. Figure 2 visualizes the averaged posterior and the posterior gap. The former characterizes quality of pronunciation as a location in the feature space, and the latter characterizes relative distances to the 10 native speakers.

Fig. 2 Averaged posterior and posterior gap (legend: native speakers, learner1 to learner4; averaged posterior; posterior gap)

Table 1 Prediction with quantity features (weights of features 1) to 4) and resulting correlation)

ASR    DNN     1)      2)      3)      4)      corr.
w/o    -       0.768   1.333   -       -       0.819
with   clean   0.744   1.245   0.000   0.182   0.821
with   noise   0.765   1.281   0.000   0.097   0.817

3.3 Features derived from ASR results

We tested two versions of WSJ-KALDI-based speech recognizers, i.e. the clean model and the noise-robust model, on all the 100 utterances. The correct recognition rates were 29.5 % and 32.1 % for the two models, respectively. Since these rates are very low, we did not use any features that characterize the lexical identity of the recognized results. However, some statistics can be expected to be calculated fairly adequately, and they are used as features for fluency prediction.

3.3.1 Correct recognition rate

For this study, the fifth author provided correct transcripts of all the 100 utterances, and with them we can calculate the correct recognition rate for each participant. Although transcripts of spontaneous utterances are generally unavailable, we tentatively use the correct recognition rate as a feature for prediction. The prediction performance obtained with transcripts is only for reference and may be regarded as an upper limit of prediction.

3.3.2 Total number of words in ASR results

Phonation ratio characterizes how continuously a participant speaks, and does so acoustically.
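As a small worked example of the acoustic measure referred to here, the sketch below computes speaking rate and phonation ratio as defined in Section 3.1 from a syllable count and a list of pause intervals. How the syllable count and the pause intervals are obtained (for instance from the Praat script) is outside the sketch; the function name, the input format, and the numbers are hypothetical.

```python
def quantity_features(n_syllables, utterance_length, pauses):
    """Speaking rate and phonation ratio as defined in Section 3.1.

    utterance_length is in seconds; pauses is a list of (start, end)
    pause intervals in seconds, assumed to be non-overlapping.
    """
    pause_total = sum(end - start for start, end in pauses)
    phonation = utterance_length - pause_total  # total duration of phonation
    return {
        "speaking_rate": n_syllables / phonation,        # syllables per second of phonation
        "phonation_ratio": phonation / utterance_length,  # fraction of the utterance that is voiced speech
    }


# Toy example (made-up numbers): 60 syllables in a 30-second utterance
# that contains three pauses totalling 10 seconds.
toy_features = quantity_features(60, 30.0, [(2.0, 4.0), (10.0, 13.0), (20.0, 25.0)])
print(toy_features)
```

In this toy case the speaker phonates for 20 of the 30 seconds, so the speaking rate is 3 syllables per second and the phonation ratio is 2/3; two speakers with the same speaking rate can therefore still differ in phonation ratio.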
It is possible to derive a similar but different feature lexically from the recognition results. Although the recognition performance is certainly low, the total number of words in the recognition results may still be effective for prediction.

3.3.3 Size of vocabulary in ASR results

It can easily be expected that less proficient participants may utter the same words repeatedly. This expectation led us to use the number of different words, i.e. the size of vocabulary, for prediction [16].

Fig. 3 Correlation as a function of posterior dimension

Table 2 Prediction with quality features (weights of features a) to d) and resulting correlation)

ASR    DNN     a)      b)      c)       d)      corr.
w/o    clean   0.124   0.365   -0.624   -       0.903
w/o    noise   0.233   0.272   -0.753   -       0.917
with   clean   0.000   0.254   -0.491   0.628   0.922
with   noise   0.045   0.214   -0.549   0.537   0.921

4 Automatic prediction of fluency

4.1 Prediction with quantity features

In [5], features related to smoothness or quantity of phonation were manually extracted. Among the automatically extracted features, we regard 1) speaking rate, 2) phonation ratio, 3) total number of words, and 4) size of vocabulary as features related to quantity of phonation. Table 1 shows the results of Elastic Net regression with these features for fluency prediction, where correlations between the fluency scores averaged over the 10 native raters and the machine scores are calculated based on 5-fold cross-validation. In the table, clean and noise denote the two types of ASR models, and the three values listed for each feature are the weight coefficients calculated for that feature. As the table clearly shows, phonation ratio and speaking rate are very effective for prediction. The performance is higher than the average of the one-to-one inter-rater correlations (0.786) but much lower than the average of the one-to-others correlations (0.873).

4.2 Prediction with quality features

The other features, a) average of maximum posteriors, b) averaged distribution of posteriors, and c) posterior gap to natives, are tested with Elastic Net regression. d) correct recognition rate is also tentatively considered.
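The quality features a) to c) listed above can be sketched as follows for posteriorgrams stored as lists of per-frame probability vectors. The sketch assumes that the gap in c) is the discrete Bhattacharyya distance between averaged posterior vectors, which the text suggests but does not spell out. The function names and toy numbers are hypothetical, and pause removal is assumed to have been done beforehand.

```python
from math import log, sqrt


def average_max_posterior(posteriorgram):
    """a) Maximum posterior of each frame, averaged over time."""
    return sum(max(frame) for frame in posteriorgram) / len(posteriorgram)


def averaged_posterior(posteriorgram):
    """b) Posterior vector averaged over time (a distribution over the n classes)."""
    n_frames = len(posteriorgram)
    return [sum(col) / n_frames for col in zip(*posteriorgram)]


def bhattacharyya(p, q):
    """Discrete Bhattacharyya distance; assumes p and q share some probability mass."""
    coefficient = sum(sqrt(a * b) for a, b in zip(p, q))
    return -log(coefficient)


def posterior_gap(participant, natives):
    """c) Distance from a participant to each native speaker, averaged over natives."""
    p = averaged_posterior(participant)
    gaps = [bhattacharyya(p, averaged_posterior(native)) for native in natives]
    return sum(gaps) / len(gaps)


# Toy posteriorgrams (made-up numbers): 2 frames x 2 classes each.
learner = [[0.9, 0.1], [0.6, 0.4]]
native_a = [[0.5, 0.5], [0.5, 0.5]]
native_b = [[0.8, 0.2], [0.7, 0.3]]
print(average_max_posterior(learner))
print(averaged_posterior(learner))
print(posterior_gap(learner, [native_a, native_b]))
```

In this toy case the learner's averaged posterior coincides with that of `native_b`, so that individual gap is zero and only `native_a` contributes to the averaged gap.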
Figure 3 shows correlations as a function of the dimension n of the posterior probabilities calculated with the noisy DNN model. For a) and c), the correlation between the feature itself and the fluency scores is plotted, while for b), the correlation of the Elastic Net prediction (model correlation) is shown. Correlations with b) and c) are maximized around n=50, while those with a) seem to be higher with larger n, but are still lower than those with b) and c). From these results, we select n=50 and use it for testing all the quality features.

Table 3 Prediction with all the features (weights of the top four features and of d), and resulting correlation)

ASR    DNN     c)       1)      a)      2)      d)      corr.
w/o    clean   -0.589   0.224   0.112   0.364   -       0.906
w/o    noise   -0.748   0.264   0.255   0.231   -       0.925
with   clean   -0.580   0.194   0.131   0.334   -       0.906
with   noise   -0.715   0.242   0.233   0.200   -       0.923
with   clean   -0.476   0.192   0.000   0.311   0.602   0.923
with   noise   -0.543   0.276   0.033   0.239   0.561   0.928

Table 2 shows the results of Elastic Net regression with the quality features for fluency prediction. As b) is a multivariate feature, its weight in the table is the largest weight among the n dimensions. As the table clearly shows, c) and b) are very effective for prediction. It is very surprising to us that the correlation obtained with the quality features only, even without ASR, exceeds the average of the one-to-others correlations (0.873) and is comparable to their maximum (0.910). This suggests that the trained model is comparable to the most stable and reliable human rater.

4.3 Prediction with all the features

Table 3 shows the results of Elastic Net regression with all the features. Only the four features with the largest weights in the case of the noisy DNN without ASR are listed, and the same four are listed for the other cases, with or without d) correct recognition rate. In the table, the top four features are c) averaged posterior gap to natives, 1) speaking rate, a) average of maximum posteriors, and 2) phonation ratio,
i.e. two quality features and two quantity features. The table again shows the very high usefulness of the quality features: even without ASR, the trained model gives a correlation of 0.925, which is higher than the maximum of the one-to-others correlations (0.910).

4.4 Discussion and future directions

In this paper, we tried to predict subjective scores of fluency. What we found is that the fluency scores can be predicted much more accurately with quality features than with quantity features. This result implies that 1) the judgments of the 10 native raters were rather biased toward the quality of pronunciation, which is logically independent of smoothness and fluidity in utterances, or 2) quantity features and quality features are highly correlated and the latter were extracted with higher accuracy. We are also interested in another kind of fluency score, namely scores given by expert raters. With expert rating, we may obtain different results.

5 Conclusions

This paper presented research results of a joint project between UTokyo and UCL, in which automated scoring of fluency was investigated. Since the L2 corpus prepared for development was not large, we tested classical machine learning techniques with recently proposed speech representations such as posteriorgrams with variable granularity. Experiments showed a correlation of 0.925 with the perceived fluency, which was higher than the maximum inter-rater (one-to-others) correlation (0.910).

References

[1] M. Eskenazi, “An overview of spoken language technology for education,” Speech Communication, vol. 51, no. 10, pp. 832–844, 2009.
[2] T. Kawahara and N. Minematsu, “Computer-Assisted Language Learning (CALL) based on speech technologies,” IEICE Trans. Info. Sys., vol. J96-D, no. 7, pp. 1549–1565, 2013.
[3] L. Chen, L. Davis, K. Zechner, C. M. Lee, S.-Y. Yoon, M. Ma, K. Evanini, R. Mundkowsky, X. Wang, C. Lu, A. Loukina, C. W. Leong, J. Tao, and B. Gyawali, “Automated scoring of nonnative speech using the SpeechRater v.5.0 engine,” ETS Research Report Series, vol. RR-18, no.
10, pp. 1–31, 2018.
[4] T. Isaacs, “Fully automated speaking assessments: changes to proficiency testing and the role of pronunciation,” in The Routledge handbook of contemporary English pronunciation, O. Kang, R. I. Thomson, and J. Murphy, Eds. Routledge, 2018, pp. 570–584.
[5] K. Saito, M. Ilkan, V. Magne, M. N. Tran, and S. Suzuki, “Acoustic characteristics and learner profiles of low-, mid- and high-level second language fluency,” Applied Psycholinguistics, vol. 39, no. 3, pp. 593–617, 2018.
[6] P. Boersma and D. Weenink, Praat: doing phonetics by computer (Version 6.1.03) [computer software], 2019. [Online]. Available: http://www.praat.org
[7] Y. Kashiwagi, C. Zhang, D. Saito, and N. Minematsu, “Divergence estimation based on deep neural networks and its use for language identification,” in Proc. ICASSP, 2016, pp. 5435–5439.
[8] J. H. Ward, “Hierarchical grouping to optimize an objective function,” Journal of the American Statistical Association, vol. 58, pp. 236–244, 1963.
[9] P. Matějka, P. Schwarz, J. Černocký, and P. Chytil, “Phonotactic language identification using high quality phoneme recognition,” in Proc. INTERSPEECH, 2005, pp. 2237–2240.
[10] H. Zou and T. Hastie, “Regularization and variable selection via the Elastic Net,” Journal of the Royal Statistical Society, Series B, vol. 67, pp. 301–320, 2005.
[11] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67, 1970.
[12] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
[13] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in Proc. ASRU, 2011.
[14] L. Fontan, M. Le Coz, and S.
Detey, “Automatically measuring L2 speech fluency without the need of ASR: a proof-of-concept study with Japanese learners of French,” in Proc. INTERSPEECH, 2018, pp. 2544–2548.
[15] A. Yasukagawa, S. Ando, E. Konno, Z. Lin, Y. Inoue, D. Saito, N. Minematsu, and K. Saito, “An experimental study of automatic scoring of fluency of spontaneous English utterances by Japanese learners,” Proc. Spring Meeting of Acoustical Society of Japan, 2020.
[16] H. Hilton, “The link between vocabulary knowledge and spoken L2 fluency,” Language Learning Journal, vol. 36, no. 2, pp. 153–166, 2008.

Optimized Prediction of Fluency of L2 English Based on Interpretable Network Using Quantity of Phonation and Quality of Pronunciation

https://discovery.ucl.ac.uk/id/eprint/10126742/1/%E6%96%B0%E3%81%97%E3%81%84%E5%AD%A6%E4%BC%9AAbstract%20July%202020.pdf
