Cross-language retrieval of spontaneous speech combines the challenges of working with noisy automated transcription and language translation. The CLEF 2005 Cross-Language Speech Retrieval (CL-SR) task provides a standard test collection to investigate these challenges. We show that we can improve retrieval performance: by careful selection of the term weighting scheme; by decomposing automated transcripts into

phonetic substrings to help ameliorate transcription

errors; and by combining automatic transcriptions with manually-assigned metadata. We further show that topic translation with online machine translation resources

yields effective CL-SR

Alzghool, Muath

Inkpen, Diana

Jones, Gareth J.F.

Oard, Douglas W.

English

Crossref

Investigating cross-language speech retrieval for a spontaneous conversational speech collection

Cross-language retrieval of spontaneous speech combines the challenges of working with noisy automated transcription and language translation. The CLEF 2005 Cross-Language Speech Retrieval (CL-SR) task provides a standard test collection to investigate these challenges. We show that we can improve retrieval performance: by careful selection of the term weighting scheme; by decomposing automated transcripts into
phonetic substrings to help ameliorate transcription
errors; and by combining automatic transcriptions with manually-assigned metadata. We further show that topic translation with online machine translation resources
yields effective CL-SR

Irish Universities

Cross-language retrieval of spontaneous speech combines the challenges of working with noisy automated transcription and language translation. The CLEF 2005 Cross-Language Speech Retrieval (CL-SR) task provides a standard test collection to investigate these challenges. We show that we can improve retrieval performance: by careful selection of the term weighting scheme; by decomposing automated transcripts into\ud
phonetic substrings to help ameliorate transcription\ud
errors; and by combining automatic transcriptions with manually-assigned metadata. We further show that topic translation with online machine translation resources\ud
yields effective CL-SR

Name not available

Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 61–64,New York, June 2006. c©2006 Association for Computational LinguisticsInvestigating Cross-Language Speech Retrieval for a  Spontaneous Conversational Speech Collection   Diana Inkpen, Muath Alzghool Gareth J.F. Jones Douglas W. Oard School of Info. Technology and Eng. School of Computing College of Info. Studies/UMIACS  University of Ottawa Dublin City University University of Maryland Ottawa, Ontario, Canada, K1N 6N5 Dublin 9, Ireland College Park, MD 20742, USA {diana,alzghool}@site.uottawa.ca Gareth.Jones@computing.dcu.ie oard@umd.edu    Abstract Cross-language retrieval of spontaneous speech combines the challenges of working with noisy automated transcription and lan-guage translation. The CLEF 2005 Cross-Language Speech Retrieval (CL-SR) task provides a standard test collection to inves-tigate these challenges. We show that we can improve retrieval performance: by care-ful selection of the term weighting scheme; by decomposing automated transcripts into phonetic substrings to help ameliorate tran-scription errors; and by combining auto-matic transcriptions with manually-assigned metadata. We further show that topic trans-lation with online machine translation re-sources yields effective CL-SR. 1 Introduction The emergence of large collections of digitized spoken data has encouraged research in speech re-trieval. Previous studies, notably those at TREC (Garafolo et al, 2000), have focused mainly on well-structured news documents. In this paper we report on work carried out for the Cross-Language Evaluation Forum (CLEF) 2005 Cross-Language Speech Retrieval (CL-SR) track (White et al, 2005). The document collection for the CL-SR task is a part of the oral testimonies collected by the USC Shoah Foundation Institute for Visual History and Education (VHI) for which some Automatic Speech Recognition (ASR) transcriptions are available (Oard et al., 2004). The data is conversional spon-taneous speech lacking clear topic boundaries; it is thus a more challenging speech retrieval task than those explored previously. The CLEF data is also annotated with a range of automatic and manually generated sets of metadata. While the complete VHI dataset contains interviews in many languages, the CLEF 2005 CL-SR task focuses on English speech. Cross-language searching is evaluated by making the topic statements (from which queries are auto-matically formed) available in several languages. This task raises many interesting research ques-tions; in this paper we explore alternative term weighting methods and content indexing strategies.  The remainder of this paper is structured as fol-lows: Section 2 briefly reviews details of the CLEF 2005 CL-SR task; Section 3 describes the system we used to investigate this task; Section 4 reports our experimental results; and Section 5 gives con-clusions and details for our ongoing work.   2 Task description The CLEF-2005 CL-SR collection includes 8,104 manually-determined topically-coherent segments from 272 interviews with Holocaust survivors, wit-nesses and rescuers, totaling 589 hours of speech. Two ASR transcripts are available for this data, in this work we use transcripts provided by IBM Re-search in 2004 for which a mean word error rate of 38% was computed on held out data. Additional, metadata fields for each segment include: two sets of 20 automatically assigned thesaurus terms from different kNN classifiers (AK1 and AK2), an aver-age of 5 manually-assigned thesaurus terms (MK), and a 3-sentence summary written by a subject mat-ter expert. A set of 38 training topics and 25 test topics were generated in English from actual user requests. Topics were structured as Title, Descrip-tion and Narrative fields, which correspond roughly to a 2-3 word Web query, what someone might first say to a librarian, and what that librarian might ul-timately understand after a brief reference inter-view. To support CL-SR experiments the topics were re-expressed in Czech, German, French, and Spanish by native speakers in a manner reflecting 61the way questions would be posed in those lan-guages. Relevance judgments were manually gener-ated using by augmenting an interactive search-guided procedure and purposive sampling designed to identify additional relevant segments. See (Oard et al, 2004) and (White et al, 2005) for details.  3 System Overview Our Information Retrieval (IR) system was built with off-the-shelf components.  Topics were trans-lated from French, Spanish, and German into Eng-lish using seven free online machine translation (MT) tools. Their output was merged in order to allow for variety in lexical choices. All the transla-tions of a topic Title field were combined in a merged Title field of the translated topics; the same procedure was adopted for the Description and Nar-rative fields. Czech language topics were translated using InterTrans, the only web-based MT system available to us for this language pair. Retrieval was carried out using the SMART IR system (Buckley et al, 1993) applying its standard stop word list and stemming algorithm.  In system development using the training topics we tested SMART with many different term weighting schemes combining collection frequency, document frequency and length normalization for the indexed collection and topics (Salton and Buckley, 1988). In this paper we employ the notation used in SMART to describe the combined schemes: xxx.xxx. The first three characters refer to the weighting scheme used to index the document collection and the last three characters refer to the weighting scheme used to index the topic fields. For example, lpc.atc means that lpc was used for documents and atc for queries. lpc would apply log term frequency weighting (l) and probabilistic collection frequency weighting (p) with cosine normalization to the document collec-tion (c). atc would apply augmented normalized term frequency (a), inverse document frequency weight (t) with cosine normalization (c). One scheme in particular (mpc.ntn) proved to have much better performance than other combina-tions. For weighting document terms we used term frequency normalized by the maximum value (m) and probabilistic collection frequency weighting (p) with cosine normalization (c). For topics we used non-normalized term frequency (n) and inverse document frequency weighting (t) without vector normalization (n). This combination worked very well when all the fields of the query were used; it also worked well with Title plus Description, but slightly less well with the Title field alone. 4 Experimental Investigation In this section we report results from our experi-mental investigation of the CLEF 2005 CL-SR task. For each set of experiments we report Mean unin-terpolated Average Precision (MAP) computed us-ing the trec_eval script. The topic fields used are indicated as: T for title only, TD for title + descrip-tion, TDN for title + description + narrative. The first experiment shows results for different term weighting schemes; we then give cross-language retrieval results. For both sets of experiments, “documents” are represented by combining the ASR transcription with the AK1 and AK2 fields. Thus each document representation is generated completely automatically. Later experiments ex-plore two alternative indexing strategies. 4.1 Comparison of Term Weighting Schemes The CLEF 2005 CL-SR collection is quite small by IR standards, and it is well known that collection size matters when selecting term weighting schemes (Salton and Buckley, 1988).  Moreover, the docu-ments in this case are relatively short, averaging about 500 words (about 4 minutes of speech), and that factor may affect the optimal choice of weight-ing schemes as well.  We therefore used the training topics to explore the space of available SMART term weighting schemes.  Table 1 presents results for various weighting schemes with  English topics. There are 3,600 possible combinations of weighting schemes available: 60 schemes (5 x 4 x 3) for documents and 60 for queries. We tested a total of 240 combinations. In Table 1 we present the results for 15 combinations (the best ones, plus some oth-ers to illustate  the diversity of the results). mpc.ntn is still the best for the test topic set; but, as shown, a few other weighting schemes achieve similar per-formance. Some of the weighting schemes perform better when indexing all the topic fields (TDN), some on TD, and some on title only (T). npn.ntn was best for TD and lsn.ntn and lsn.atn are best for T. The mpc.ntn weighting scheme is used for all other experiments in this section.  We are investi-gating the reasons for the effectiveness of this weighting scheme in our experiments. 62TDN TD T  Weighting scheme Map Map Map 1 Mpc.mts 0.2175 0.1651 0.1175 2 Mpc.nts 0.2175 0.1651 0.1175 3 Mpc.ntn  0.2176 0.1653 0.1174 4 npc.ntn 0.2176 0.1653 0.1174 5 Mpc.mtc 0.2176 0.1653 0.1174 6 Mpc.ntc 0.2176 0.1653 0.1174 7 Mpc.mtn 0.2176 0.1653 0.1174 8 Npn.ntn 0.2116 0.1681 0.1181 9 lsn.ntn 0.1195 0.1233 0.1227 10 lsn.atn 0.0919 0.1115 0.1227 11 asn.ntn 0.0912 0.0923 0.1062 12 snn.ntn 0.0693 0.0592 0.0729 13 sps.ntn 0.0349 0.0377 0.0383 14 nps.ntn 0.0517 0.0416 0.0474 15 Mtc.atc 0.1138 0.1151 0.1108 Table 1. MAP, 25 English test topics. Bold=best scores. 4.2 Cross-Language Experiments Table 2 shows our results for the merged ASR, AK1 and AK2 documents with multi-system topic translations for French, German and Spanish, and single-system Czech translation. We can see that Spanish topics perform well compared to monolin-gual English. However, results for German and Czech are much poorer. This is perhaps not surpris-ing for the Czech topics where only a single transla-tion is available. For German, the quality of translation was sometimes low and some German words were retained untranslated. For French, only TD topic fields were available.  In this case we can see that cross-language retrieval effectiveness is almost identical to monolingual English. Every re-search team participating in the CLEF 2005 CL-SR task submitted at least one TD English run, and among those our mpc.ntn system yielded the best MAP (Wilcoxon signed rank test for paired sam-ples, p<0.05). However, as we show in Table 4, manual metadata can yield better retrieval effec-tiveness than automatic description.   Topic Language System Map Fields English Our system 0.1653 TD English Our system 0.2176 TDN Spanish Our system 0.1863 TDN French Our system 0.1685 TD German Our system 0.1281 TDN Czech Our system 0.1166 TDN Table 2. MAP, cross-language, 25 test topics Language Map Fields Description English 0.1276 T Phonetic English 0.2550 TD Phonetic English 0.1245 T Phonetic+Text English 0.2590 TD Phonetic+Text Spanish 0.1395 T Phonetic Spanish 0.2653 TD Phonetic Spanish 0.1443 T Phonetic+Text Spanish 0.2669 TD Phonetic+Text French 0.1251 T Phonetic French 0.2726 TD Phonetic French 0.1254 T Phonetic+Text French 0.2833 TD Phonetic+Text German 0.1163 T Phonetic German 0.2356 TD Phonetic German 0.1187 T Phonetic+Text German 0.2324 TD Phonetic+Text Czech 0.0776 T Phonetic Czech 0.1647 TD Phonetic Czech 0.0805 T Phonetic+Text Czech 0.1695 TD Phonetic+Text Table 3. MAP, phonetic 4-grams, 25 test topics. 4.3 Results on Phonetic Transcriptions In Table 3 we present results for an experiment where the text of the collection and topics, without stemming, is transformed into a phonetic transcrip-tion. Consecutive phones are then grouped into overlapping n-gram sequences (groups of n sounds, n=4 in our case) that we used for indexing. The phonetic n-grams were provided by Clarke (2005), using NIST’s text-to-phone tool1. For example, the phonetic form for the query fragment child survi-vors is: ch_ay_l_d s_ax_r_v ax_r_v_ay r_v_ay_v v_ay_v_ax ay_v_ax_r v_ax_r_z. The phonetic form helps compensate for the speech recognition errors. With TD queries, the re-sults improve substantially compared with the text form of the documents and queries (9% relative). Combining phonetic and text forms (by simply in-dexing both phonetic n-grams and text) yields little additional improvement. 4.4 Manual summaries and keywords Manually prepared transcripts are not available for this test collection, so we chose to use manually assigned metadata as a reference condition.  To ex-plore the effect of merging automatic and manual fields, Table 4 presents the results combining man-                                                          1 http://www.nist.gov/speech/tools/ 63ual keywords and manual summaries with ASR transcripts, AK1, and AK2. Retrieval effectiveness increased substantially for all topic languages. The MAP score improved with 25% relative when add-ing the manual metadata for English TDN.  Table 4 also shows comparative results between and our results and results reported by the Univer-sity of Maryland at CLEF 2005 using a widely used IR system (InQuery) that has a standard term weighting algorithm optimized for large collections. For English TD, our system is 6% (relative) better and for French TD 10% (relative) better.  The Uni-versity of Maryland results with only automated fields are also lower than the results we report in Table 2 for the same fields.  Table 4. MAP, indexing all fields (MK, summaries, ASR transcripts, AK1 and AK2), 25 test topics. Language System Map Fields English Our system 0.4647 TDN English Our system 0.3689 TD English InQuery 0.3129 TD English Our system 0.2861 T Spanish Our system 0.3811 TDN French Our system 0.3496 TD French InQuery 0.2480 TD French Our system 0.3496 TD German Our system 0.2513 TDN Czech Our system 0.2338 TDN 5 Conclusions and Further Investigation The system described in this paper obtained the best results among the seven teams that participated in the CLEF 2005 CL-SR track. We believe that this results from our use of the 38 training topics to find a term weighting scheme that is particularly suitable for this collection. Relevance judgments are typi-cally not available for training until the second year of an IR evaluation; using a search-guided process that does not require system results to be available before judgments can be performed made it possi-ble to accelerate that timetable in this case.  Table 2 shows that performance varies markedly with the choice of weighting scheme.  Indeed, some of the classic weighting schemes yielded much poorer results than the one  we ultimately selected. In this paper we presented results on the test queries, but we observed similar effects on the training queries. On combined manual and automatic data, the best MAP score we obtained for English topics is 0.4647. On automatic data, the best MAP is 0.2176. This difference could result from ASR errors or from terms added by human indexers that were not available to the ASR system to be recognized. In future work we plan to investigate methods of re-moving or correcting some of the speech recogni-tion errors in the ASR transcripts using semantic coherence measures. In ongoing further work we are exploring the re-lationship between properties of the collection and the weighting schemes in order to better understand the underlying reasons for the demonstrated effec-tiveness of the mpc.ntn weighting scheme.  The challenges of CLEF CL-SR task will con-tinue to expand in subsequent years as new collec-tions are introduced (e.g., Czech interviews in 2006). Because manually assigned segment bounda-ries are available only for English interviews, this will yield an unknown topic boundary condition that is similar to previous experiments with auto-matically transcribed broadcast news the Text Re-trieval Conference (Garafolo et al, 2000), but with the additional caveat that topic boundaries are not known for the ground truth relevance judgments.    References Chris Buckley, Gerard Salton, and James Allan. 1993. Automatic retrieval with locality information using SMART. In Proceedings of the First Text REtrieval Conference (TREC-1), pages 59–72. Charles L. A. Clarke. 2005. Waterloo Experiments for the CLEF05 SDR Track, in Working Notes for the CLEF 2005 Workshop, Vienna, Austria John S. Garofolo, Cedric G.P. Auzanne and Ellen M. Voorhees. 2000. The TREC Spoken Document Re-trieval Track: A Success Story. In Proceedings of the RIAO Conference: Content-Based Multimedia Infor-mation Access, Paris, France, pages 1-20. Douglas W. Oard, Dagobert Soergel, David Doermann, Xiaoli Huang, G. Craig Murray, Jianqiang Wang, Bhuvana Ramabhadran, Martin Franz and Samuel Gustman. 2004. Building an Information Retrieval Test Collection for Spontaneous Conversational Speech, in  Proceedings of SIGIR, pages 41-48. Gerard Salton and Chris Buckley. 1988. Term-weighting approaches in automatic retrieval. Information Proc-essing and Management, 24(5):513-523. Ryen W. White, Douglas W. Oard, Gareth J. F. Jones, Dagobert Soergel and Xiaoli Huang. 2005. Overview of the CLEF-2005 Cross-Language Speech Retrieval Track, in Working Notes for the CLEF 2005 Work-shop, Vienna, Austria 64

Investigating cross-language speech retrieval for a spontaneous conversational speech collection

Abstract

Similar works

Full text

Available Versions

Crossref

Irish Universities

Name not available

DCU Online Research Access Service