722 research outputs found

    Validating multilingual hybrid automatic term extraction for search engine optimisation: the use case of EBM-GUIDELINES

    Get PDF
    Tools that automatically extract terms and their equivalents in other languages from parallel corpora can contribute to multilingual professional communication in more than one way. By means of a use case with data from a medical website with point-of-care evidence summaries (Ebpracticenet), we illustrate how hybrid multilingual automatic term extraction from parallel corpora works and how it can be used in a practical application such as search engine optimisation. The original aim was to use the extraction results to improve the recall of a search engine by allowing automated multilingual searches. Two additional possible applications emerged while examining the data: searching via related forms and searching via strongly semantically related words. The second stage of this research was to find the most suitable format for the required manual validation of the raw extraction results, and to compare the validation process as performed by a domain expert versus a terminologist.
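    The query-expansion idea behind this use case can be sketched in a few lines of Python. This is a minimal illustration, not the Ebpracticenet implementation: the term table, its entries and the function name are all invented for the example; a real table would hold the manually validated output of the hybrid extractor.

        # Sketch: expand a search query with validated multilingual term data.
        # The table below is hypothetical; real entries would come from the
        # validated output of the hybrid term extractor.
        term_table = {
            "myocardial infarction": {
                "equivalents": ["myocardinfarct", "infarctus du myocarde"],
                "related_forms": ["myocardial infarctions"],
                "related_words": ["heart attack"],
            },
        }

        def expand_query(query, table):
            """Return the query plus every validated variant, so the search
            engine can also match documents in the other languages."""
            variants = [query]
            entry = table.get(query.lower())
            if entry:
                variants += entry["equivalents"]     # multilingual search
                variants += entry["related_forms"]   # related forms
                variants += entry["related_words"]   # semantically related words
            return variants

        print(expand_query("myocardial infarction", term_table))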

    Speech Synthesis Based on Hidden Markov Models

    Get PDF

    In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora

    Get PDF
    Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a great need for well-documented, manually validated datasets, especially in the rising field of multilingual term extraction from comparable corpora, which presents a unique new set of challenges. In this paper, a new approach is presented for both monolingual and multilingual term annotation in comparable corpora. The detailed guidelines with different term labels, the domain- and language-independent methodology, and the large volumes annotated in three different languages and four different domains make this a rich resource. The resulting datasets are not just suited for evaluation purposes but can also serve as a general source of information about terms, and even as training data for supervised methods. Moreover, the gold standard for multilingual term extraction from comparable corpora contains information about term variants and translation equivalents, which allows an in-depth, nuanced evaluation.
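    To make the evaluation use concrete, the following sketch scores an extractor's output against a gold-standard term list with plain precision, recall and F1; both term sets are invented for the illustration and do not come from the dataset.

        # Sketch: evaluate automatic term extraction against a manually
        # annotated gold standard (both term sets are illustrative).
        gold = {"neural network", "loss function", "gradient descent"}
        extracted = {"neural network", "gradient descent", "training data"}

        true_positives = gold & extracted
        precision = len(true_positives) / len(extracted)
        recall = len(true_positives) / len(gold)
        f1 = 2 * precision * recall / (precision + recall)  # assumes P + R > 0

        print(f"P={precision:.2f}  R={recall:.2f}  F1={f1:.2f}")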

    Towards Machine Speech-to-speech Translation

    Get PDF
    There has been a good deal of research on machine speech-to-speech translation (S2ST) in Japan; this article surveys that work and presents our own recent research on automatic simultaneous speech translation. The S2ST system is basically composed of three modules: large-vocabulary continuous automatic speech recognition (ASR), machine text-to-text translation (MT) and text-to-speech synthesis (TTS). All these modules need to be multilingual in nature and thus require multilingual speech and text corpora for training their models. S2ST performance has been drastically improved by deep learning and large training corpora, but many issues still remain, such as simultaneity, paralinguistics, context and situation dependency, intention, and cultural dependency. This article presents current ongoing research and discusses these issues with a view to next-generation speech-to-speech translation.
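    The cascaded three-module architecture described in the abstract can be summarised in a short sketch. The functions below are stubs standing in for trained multilingual ASR, MT and TTS models; none of them reflect a particular system.

        # Sketch of the cascaded S2ST pipeline: ASR -> MT -> TTS.
        def asr(audio):
            """Large-vocabulary continuous speech recognition (stub)."""
            return "recognised source-language text"

        def mt(text, src, tgt):
            """Text-to-text machine translation (stub)."""
            return f"[{src}->{tgt}] {text}"

        def tts(text):
            """Text-to-speech synthesis (stub); returns placeholder audio."""
            return text.encode("utf-8")

        def speech_to_speech(audio, src, tgt):
            # Each module must be multilingual for the pipeline to cover
            # arbitrary language pairs.
            return tts(mt(asr(audio), src=src, tgt=tgt))

        output_audio = speech_to_speech(b"\x00\x01", src="ja", tgt="en")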

    Proceedings of the COLING 2004 Post Conference Workshop on Multilingual Linguistic Resources MLR2004

    No full text
    In an ever-expanding information society, most information systems are now facing the "multilingual challenge". Multilingual language resources play an essential role in modern information systems. Such resources need to provide information on many languages in a common framework and should be (re)usable in many applications (for automatic or human use). Many centres have been involved in national and international projects dedicated to building harmonised language resources and creating expertise in the maintenance and further development of standardised linguistic data. These resources include dictionaries, lexicons, thesauri, wordnets, and annotated corpora developed along the lines of best practices and recommendations. However, since the late 1990s, most efforts in scaling up these resources have remained the responsibility of local authorities, usually with very low funding (if any) and few opportunities for academic recognition of this work. Hence, it is not surprising that many resource holders and developers have become reluctant to give free access to the latest versions of their resources, and their actual status is therefore currently rather unclear. The goal of this workshop is to study problems involved in the development, management and reuse of lexical resources in a multilingual context. Moreover, this workshop provides a forum for reviewing the present state of language resources. The workshop is meant to bring to the international community qualitative and quantitative information about the most recent developments in the area of linguistic resources and their use in applications. The impressive number of submissions (38) to this workshop, and to other workshops and conferences dedicated to similar topics, proves that dealing with multilingual linguistic resources has become a pressing problem in the Natural Language Processing community. To cope with the number of submissions, the workshop organising committee decided to accept 16 papers from 10 countries, based on the reviewers' recommendations. Six of these papers will be presented in a poster session. The papers constitute a representative selection of current trends in research on multilingual language resources, such as multilingual aligned corpora, bilingual and multilingual lexicons, and multilingual speech resources. The papers also represent a characteristic set of approaches to the development of multilingual language resources, such as automatic extraction of information from corpora, combination and re-use of existing resources, online collaborative development of multilingual lexicons, and use of the Web as a multilingual language resource. The development and management of multilingual language resources is a long-term activity in which collaboration among researchers is essential. We hope that this workshop will gather many researchers involved in such developments and will give them the opportunity to discuss, exchange and compare their approaches, and strengthen their collaborations in the field. The organisation of this workshop would have been impossible without the hard work of the programme committee, who managed to provide accurate reviews on time, on a rather tight schedule. We would also like to thank the COLING 2004 organising committee that made this workshop possible. Finally, we hope that this workshop will yield fruitful results for all participants.

    La traducción interactiva del habla [Interactive Speech Translation]

    Get PDF

    NusaCrowd: Open Source Initiative for Indonesian NLP Resources

    Full text link
    We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark for Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
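    NusaCrowd's standardized data loaders follow the Hugging Face datasets convention, so loading one of the unified corpora may look roughly like the sketch below. The loader name and configuration are placeholders, not confirmed identifiers; consult the NusaCrowd catalogue for the actual ones.

        # Hedged sketch: loading a corpus through a standardized loader.
        # The identifiers below are placeholders, not confirmed dataset names.
        from datasets import load_dataset  # pip install datasets

        ds = load_dataset("nusacrowd/some_dataset", name="some_config",
                          split="train")
        print(ds[0])  # one example in the standardized schema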