Transcribing lectures is a challenging task, both in acoustic and in language modeling. In this work, we present our first results on the automatic transcription of lectures from the TED corpus, recently released by ELRA and LDC. In particular, we concentrated our effort on language modeling. Baseline acoustic and language models were developed using respectively 8 hours of TED transcripts and various types of texts: conference proceedings, lecture transcripts, and conversational speech transcripts. Then, adaptation of the language model to single speakers was investigated by exploiting different kinds of information: automatic transcripts of the talk, the title of the talk, the abstract and, finally, the paper. In the last case, a 39.2% WER was achieved.

Cettolo, M.

Federico, M.

Leeuwis, E.

English

Archivio della ricerca - Fondazione Bruno Kessler

Language Modeling and Transcription of the TED Corpus Lectures

NARCIS 

University of Twente Research Information

LANGUAGE MODELING AND TRANSCRIPTION OF THE TED CORPUS LECTURES

Erwin Leeuwis (1), Marcello Federico (2) and Mauro Cettolo (2)

(1) University of Twente, Department of Computer Science, P.O. Box 217, 7500 AE Enschede, The Netherlands
(2) ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, I-38010 Povo di Trento, Italy

ICASSP 2003. 0-7803-7663-3/03/$17.00 ©2003 IEEE

ABSTRACT

Transcribing lectures is a challenging task, both in acoustic and in language modeling. In this work, we present our first results on the automatic transcription of lectures from the TED corpus, recently released by ELRA and LDC. In particular, we concentrated our effort on language modeling. Baseline acoustic and language models were developed using respectively 8 hours of TED transcripts and various types of texts: conference proceedings, lecture transcripts, and conversational speech transcripts. Then, adaptation of the language model to single speakers was investigated by exploiting different kinds of information: automatic transcripts of the talk, the title of the talk, the abstract and, finally, the paper. In the last case, a 39.2% WER was achieved.

1. INTRODUCTION

Automatic lecture transcription is emerging as an important task both for research and applications [1, 2]. It is a challenge for speech recognition as, in contrast to broadcast news, lectures typically present a higher variability in terms of speaking style, linguistic domain, and speech fluency. From the application point of view, spoken document retrieval based on automatic transcripts has been shown to be a promising means for accessing content in audiovisual digital libraries [3]. Hence, envisaging digital repositories of recorded speeches and lectures, which can be searched and browsed through the net, is quite natural now.

A useful and publicly available resource for investigating automatic lecture transcription is given by the TED corpus, which was issued in 2002 by ELRA and LDC. Briefly, the Translanguage English Database contains 188 recordings of talks in English at Eurospeech '93, a part of which has been manually transcribed.

The lectures in TED present several kinds of problems to cope with. Speakers are often non-native, have a strong accent, and, sometimes, are not even fluent. Despite the speaking style being in general planned, spontaneous speech phenomena occur quite frequently. Recordings were made with a lapel microphone, hence the signal often contains some noise from the auditorium and from the speaker as well. Finally, relatively little supervised data is available for acoustic and language model training. For the sake of language modeling, the lack of transcripts is compensated by the availability of electronic texts of that conference.

This work describes the development of a TED baseline system at ITC-irst. Acoustic models were estimated starting from an existing [...]

The 39 manually transcribed lectures were divided into a test set of 8 speakers (2 hours of speech) and a training set of 31 speakers (8 hours of speech). Test speakers were selected by taking into account the proportion of each native language group and gender (Table 1). The test set speakers are listed in Table 2.

Table 1. [Column headers lost in extraction. Surviving values: transcribed 5, 12, 12, 6, 4 and 7, 32; in test set 1, 3, 3, 1, 0 and 2, 6.]

speaker   native language   gender
cj29s3    english           male
dc57s2    italian           male
fd29s5    french            male
hb64s4    french            female
ld29s2    danish            female
ph50s2    german            male
ro31s4    dutch             male
yi59s5    japanese          male

Table 2. Test set speaker identifier, mother tongue, and gender.

3. BASELINE SYSTEM

The ITC-irst transcription system (Fig. 1) features a Viterbi decoder, context-dependent cross-word continuous-density HMMs, MLLR adaptation, and a trigram LM.

The system has been applied to several large vocabulary tasks: Italian broadcast news [4], American English broadcast news (HUB4) and newspaper dictation (Wall Street Journal, WSJ).

Fig. 1. Architecture of the ITC-irst transcription system.

3.1. Acoustic Model

The acoustic model (AM) for TED was developed starting from a WSJ baseline, featuring 27K triphone units and 71K Gaussians trained on 66.5h of speech. By using the standard 20K-word trigram LM, the WSJ baseline scores a 12.9% WER on the 1993 DARPA evaluation test set. The WSJ AM was adapted on the TED training data (8 hours) through MLLR adaptation. In this step, spontaneous speech phenomena were mapped into a single filler model.

4. LM ESTIMATION AND ADAPTATION

For LM estimation, three different types of data were used:

Lect  55Kw of lecture transcripts from the TED training data;
Proc  15Mw of scientific papers from speech conferences and workshops (Eurospeech, ICASSP, ICSLP, etc.);
Conv  300Kw of transcripts of conversational speech (Verbmobil, HUB5).

The Lect corpus has the most suitable data, but unfortunately is rather small. Therefore bigger corpora are also used that are less suitable, but have useful qualities: Proc does not have the required style, but has suitable content (speech research); Conv, on the contrary, does not have suitable content, but has the required style (conversational).

LMs estimated for the TED task make use of trigram statistics and are based on a recursive interpolation scheme and non-linear smoothing [5]. Three different LM adaptation methods have been investigated.

Mixture Model (MIX). Given two or more interpolated language models, a mixture model can be derived which applies a convex combination at the level of discounted relative frequencies [5]. The mixture model can be used to combine one or more general background (BG) LMs with a foreground (FG) LM representing new features of the language we want to include. In this case, the mixture weights can be estimated on the foreground data by applying a cross-validation scheme that simulates the occurrence of new n-grams [5].
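The convex combination behind MIX can be illustrated with a minimal sketch. It is not the authors' implementation: the paper combines discounted trigram relative frequencies and estimates the weights by cross-validation on the foreground data, whereas the sketch below assumes add-alpha smoothed unigram models and plain EM on a small held-out sample. The function names (`unigram_lm`, `em_mixture_weights`, `mixture_prob`) and the toy texts are hypothetical.

```python
from collections import Counter


def unigram_lm(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def em_mixture_weights(models, heldout, iterations=50):
    """Estimate convex mixture weights by EM on held-out tokens.

    Assumption: every held-out token is in the vocabulary of all models.
    """
    k = len(models)
    weights = [1.0 / k] * k
    for _ in range(iterations):
        expected = [0.0] * k
        for w in heldout:
            # E-step: posterior probability that each component produced w.
            joint = [weights[i] * models[i][w] for i in range(k)]
            norm = sum(joint)
            for i in range(k):
                expected[i] += joint[i] / norm
        # M-step: new weights are the average posteriors.
        weights = [e / len(heldout) for e in expected]
    return weights


def mixture_prob(word, models, weights):
    """Probability of `word` under the convex combination of the models."""
    return sum(lam * m[word] for lam, m in zip(weights, models))


if __name__ == "__main__":
    # Toy stand-ins for a background corpus and a foreground corpus.
    bg_text = "stocks fell the market rose the market".split()
    fg_text = "the model uh adapts the model".split()
    heldout = "the model adapts".split()
    vocab = sorted(set(bg_text + fg_text + heldout))

    models = [unigram_lm(bg_text, vocab), unigram_lm(fg_text, vocab)]
    weights = em_mixture_weights(models, heldout)
    print("weights (BG, FG):", weights)
    print("P(model):", mixture_prob("model", models, weights))
```

Because the weights are non-negative and sum to one, the mixture is itself a proper distribution over the vocabulary; the component that explains the held-out data best receives the largest weight.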
Minimum Discrimination Information (MDI). Assuming a small adaptation text sample, one may reasonably assume that only unigram statistics can be reliably estimated. These statistics can be used as constraints when estimating the adapted LM as the one minimizing the Kullback-Leibler distance from a background trigram model. Practically speaking, the adapted n-gram conditional probability is obtained by scaling and normalizing the background LM distribution. As shown in [6], an empirically estimated exponent (adaptation rate) can be applied to the scaling factor to improve the effect of adaptation. This adaptation rate has a value between 0 and 1, with 0 corresponding to no adaptation and 1 to full adaptation.

Probabilistic Latent Semantic Analysis (PLSA). PLSA can be interpreted as the problem of estimating a kernel of T unigram distributions which better fits the word distribution of each document, in a collection D, through a suitable convex combination [6]. Assuming that D contains documents talking about different topics, the compression effect induced by the model should force semantically related words, e.g. words associated with a specific topic, to have meaningful probabilities concentrated in one or few basis distributions. An appealing feature of PLSA is that a document/topic word distribution can be estimated from a small amount of adaptation data relatively easily. Combination of MDI with PLSA naturally follows given that the PLSA distribution estimated from the adaptation data can be used to constrain a higher-order background LM [6]. In this way, statistically sound constraints about a trigram LM can be derived from very little data.

5. EXPERIMENTS

5.1. Baseline Development

The baseline system for transcribing the TED lectures is that of Fig. 1, with the AM developed as explained in Section 3. Interpolated LMs estimated on corpora Lect, Proc and Conv, described in Section 4, have been mixed in different combinations in order to explore the relationship between their characteristics and transcription performance.

In Table 3, results in terms of perplexity (PP), out-of-vocabulary rate (OOV) and word error rate (WER) are reported for different mixture models. In particular, for each mixture model, the foreground and background models are indicated. For the sake of comparison, the first two rows show the performance of the recognizer developed for the WSJ task, and of the recognizer using the TED AM and the WSJ LM.

AM    LM                    PP     OOV (%)   WER (%)
      FG     BG1    BG2
WSJ   WSJ    -      -       1240   5.33      93.2
TED   WSJ    -      -       1240   5.33      59.7
TED   Lect   -      -       634    8.07      56.3
TED   Proc   -      -       288    1.51      46.3
TED   Proc   Conv   -       239    0.93      45.1
TED   Proc   Lect   -       218    0.55      45.2
TED   Lect   Proc   -       202    0.55      43.9
TED   Lect   Proc   Conv    197    0.53      44.0

Table 3. Baseline recognizer performance by using various LMs.

Since in terms of PP and OOV rate its results are the best, and its recognition accuracy is not worse than the best one in a statistically significant way, the LM of the last row was selected as baseline LM. Intuitively, we assume that it adapts the style of Conv and the content of Proc to suit Lect, which is the most proper data for this task. The baseline LM has a dictionary of 36Kw; the 44.0% WER was achieved using the basic transcriber with a real time ratio of 65 on a Pentium III 933 MHz processor.

Fig. 2. PP as function of the LM estimation corpus size.

In Fig. 2, the relationship of the PP of the baseline LM with the size of the Proc corpus is plotted. It shows that increasing the amount of proceedings used decreases perplexity significantly. Thus, we expect PP to go further down when more proceedings are used.

5.2. Unsupervised LM adaptation

A first set of experiments aimed at improving the baseline performance by adapting the LM on each single test lecture.
In particular, unsupervised LM adaptation was carried out on the automatic transcripts output by the baseline [7]. AM adaptation was also performed again, which leads to the adaptation scheme depicted in Fig. 3.

MIX adaptation was applied by extending the baseline mixture with a new component estimated on the automatic transcript. For estimating the mixture weights, the new component was taken as foreground model.

MDI adaptation was performed in the same way by only extracting unigram statistics from the transcript. In order to smooth the effect of recognition errors, words in the transcripts with frequency below 2 were mapped into the out-of-vocabulary word class [5]. The best performance was achieved with an adaptation rate of 0.7.

PLSA adaptation was based on a set of 100 kernel distributions estimated on the Proc corpus, which includes over 6,000 documents. As adaptation data the 10 most frequent non-stop words in the transcript were used. The unigram mixture estimated from the kernel distributions and the adaptation data was then used for MDI adaptation. This time the optimal adaptation rate was 0.2.

Fig. 3. Unsupervised LM adaptation experiments scheme.

In order to reduce the bias of perplexity measures after unsupervised adaptation, perplexity computation of MIX and MDI was not performed on the whole transcript, but using a leaving-one-out scheme. The transcript was split at sentence level: iteratively, a sentence was left out of the adaptation data and perplexity was computed on that sentence. Finally, the resulting perplexities were combined. Results of the experiments are reported in Table 4.

       Base   MIX    MDI    PLSA
PP     197    157    170    190
WER    44.0   44.3   43.9   43.8

Table 4. Unsupervised LM adaptation per speaker.

Even though the leaving-one-out strategy should reduce the bias, there is a decrease in PP for MIX and MDI that is not reflected in the WER.
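The leaving-one-out perplexity estimate can be sketched as follows. This is an illustration under simplifying assumptions, not the procedure used in the experiments: an add-alpha unigram model stands in for the adapted trigram LM, and the per-sentence results are combined by pooling log-probabilities over all words, which is only one plausible reading of "combined". All function names are hypothetical.

```python
import math
from collections import Counter


def sentence_logprob(train_tokens, sentence, vocab_size, alpha=1.0):
    """Natural-log probability of `sentence` under an add-alpha unigram model."""
    counts = Counter(train_tokens)
    denom = len(train_tokens) + alpha * vocab_size
    return sum(math.log((counts[w] + alpha) / denom) for w in sentence)


def biased_perplexity(sentences, vocab_size, alpha=1.0):
    """Perplexity when the model is estimated on the very text it is scored on."""
    all_tokens = [w for s in sentences for w in s]
    logprob = sum(
        sentence_logprob(all_tokens, s, vocab_size, alpha) for s in sentences
    )
    return math.exp(-logprob / len(all_tokens))


def leave_one_out_perplexity(sentences, vocab_size, alpha=1.0):
    """Score each sentence with a model estimated on all the other sentences."""
    logprob = 0.0
    n_words = 0
    for i, held_out in enumerate(sentences):
        # Adaptation data: every sentence except the one being scored.
        rest = [w for j, s in enumerate(sentences) if j != i for w in s]
        logprob += sentence_logprob(rest, held_out, vocab_size, alpha)
        n_words += len(held_out)
    return math.exp(-logprob / n_words)


if __name__ == "__main__":
    transcript = [["a", "b"], ["a", "c"], ["b", "d"]]  # toy sentences
    print("biased PP:       ", biased_perplexity(transcript, vocab_size=4))
    print("leave-one-out PP:", leave_one_out_perplexity(transcript, vocab_size=4))
```

On the toy transcript the leaving-one-out value is higher than the biased one, since each sentence no longer contributes its own counts to the model that scores it.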
Perhaps the PPs are still biased at the sentence level, but the discrepancy is probably due to the significantly higher probability assigned to recognized n-grams. From the WER point of view, performance does not change substantially, as the LM is suggesting the same n-grams the recognizer produced in the previous step. Hence, a reduction of the bias could be achieved by filtering out less frequent words from the transcript or by using only unigram statistics, as is done by the MDI and PLSA adaptation methods.

In general, we expect that the availability of more transcribed material or, alternatively, of multiple quite independently produced transcripts of the same data should help to reduce the bias.

5.3. Supervised LM adaptation

Supervised LM adaptation was performed using instead the presented paper or parts of it to adapt the baseline LM. In order to assume an increasing amount of supervision, adaptation was performed just on the title (PLSA), on the abstract (PLSA), or on the full paper (PLSA, MDI, MIX). PLSA adaptation was applied by using the same kernel distributions estimated for the unsupervised adaptation experiments. MIX adaptation extended the baseline components with an additional LM estimated on the adaptation data and used as foreground model.

Results for each approach are given in Table 5. As expected, performance became better when the amount of supervision increased. Very marginal improvement is achieved with PLSA adaptation, probably due to the fact that papers in the collection are not easily decomposed into very distinct topics.

       Base   MIX    MDI    PLSA
                            Paper   Abstract   Title
PP     197    133    166    188     190        193
WER    44.0   39.2   42.3   43.8    43.9       44.2

Table 5. Supervised LM adaptation.

The other two methods instead gave reasonable improvements in terms of PP and WER. Fig. 4 and 5 show the PP and WER respectively for each speaker using the baseline LM and using both MIX and MDI supervised adaptation.
For each speaker and each method both PP and WER decrease significantly. There is a strong correlation between the difference in PP and in WER. Speakers CJ and YI show bigger improvements with mixture adaptation than the other speakers, since they held lectures in a style similar to their papers.

Fig. 4. PP after supervised adaptation per speaker (speakers: FD, CJ, YI, DC, HB, LD, PH, RO).

Fig. 5. WER after supervised adaptation per speaker (speakers: FD, CJ, YI, DC, HB, LD, PH, RO).

6. CONCLUSION

Lecture transcription is a difficult task, both from an acoustic and a linguistic point of view. Non-native speech, background noise, different and varying speaking rates and many spontaneous speech phenomena are all characteristics of lecture speech that make acoustic modeling difficult. Language modeling is hampered by the sparseness of suitable data and the mixed style of lecture spoken language, combining colloquial expressions with formal jargon.

In this work, we concentrated our effort on language modeling. A baseline LM was estimated using various types of data, which were all flawed, but used in such a way that their qualities were highlighted and not their deficiencies. Using the ITC-irst WSJ AM adapted on 8h of TED training data, it resulted in a WER of 44.0%. Unsupervised LM adaptation did not show mentionable improvements in WER, but the decreases in perplexity indicate that future research could prove beneficial. Significant improvements were obtained by adapting the baseline LM on the papers of the speakers: 39.2% WER. That represents a good starting point for further research developments. Future work will be devoted to investigating acoustic and lexical modeling for non-native speech, and unsupervised adaptation/training methods for acoustic and language modeling, for which there are 38 hours of untranscribed speech available in the TED corpus.

7. REFERENCES

[1] M. Novak and R. Mammone, "Use of non-negative matrix factorization for language model adaptation in a lecture transcription task," in Proc. ICASSP, Salt Lake City, UT, USA, 2001.

[2] H. Nanjo and T. Kawahara, "Speaking-rate dependent decoding and adaptation for spontaneous lecture speech recognition," in Proc. ICASSP, Orlando, FL, USA, 2002.

[3] F. Kubala, S. Colbath, D. Liu, A. Srivastava, and J. Makhoul, "Integrated technologies for indexing spoken language," Communications of the ACM, vol. 43, no. 2, pp. 48-56, 2000.

[4] N. Bertoldi, F. Brugnara, M. Cettolo, M. Federico, and D. Giuliani, "From broadcast news to spontaneous dialogue transcription: Portability issues," in Proc. ICASSP, Salt Lake City, UT, 2001.

[5] M. Federico and N. Bertoldi, "Broadcast news LM adaptation using contemporary texts," in Proc. Eurospeech, Aalborg, Denmark, 2001.

[6] M. Federico, "Language model adaptation through topic decomposition and MDI estimation," in Proc. ICASSP, Orlando, FL, USA, 2002.

[7] D. Giuliani and M. Federico, "Unsupervised language and acoustic model adaptation for cross domain portability," in Proc. ISCA Workshop on Adaptation Methods for Speech Recognition, Sophia-Antipolis, France, 2001.

https://ris.utwente.nl/ws/files/6198266/01198760.pdf
