5,903 research outputs found
Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration
Cross-language information retrieval (CLIR), where queries and documents are
in different languages, has of late become one of the major topics within the
information retrieval community. This paper proposes a Japanese/English CLIR
system, where we combine a query translation and retrieval modules. We
currently target the retrieval of technical documents, and therefore the
performance of our system is highly dependent on the quality of the translation
of technical terms. However, the technical term translation is still
problematic in that technical terms are often compound words, and thus new
terms are progressively created by combining existing base words. In addition,
Japanese often represents loanwords based on its special phonogram.
Consequently, existing dictionaries find it difficult to achieve sufficient
coverage. To counter the first problem, we produce a Japanese/English
dictionary for base words, and translate compound words on a word-by-word
basis. We also use a probabilistic method to resolve translation ambiguity. For
the second problem, we use a transliteration method, which corresponds words
unlisted in the base word dictionary to their phonetic equivalents in the
target language. We evaluate our system using a test collection for CLIR, and
show that both the compound word translation and transliteration methods
improve the system performance
Proceedings of the COLING 2004 Post Conference Workshop on Multilingual Linguistic Ressources MLR2004
International audienceIn an ever expanding information society, most information systems are now facing the "multilingual challenge". Multilingual language resources play an essential role in modern information systems. Such resources need to provide information on many languages in a common framework and should be (re)usable in many applications (for automatic or human use). Many centres have been involved in national and international projects dedicated to building har- monised language resources and creating expertise in the maintenance and further development of standardised linguistic data. These resources include dictionaries, lexicons, thesauri, word-nets, and annotated corpora developed along the lines of best practices and recommendations. However, since the late 90's, most efforts in scaling up these resources remain the responsibility of the local authorities, usually, with very low funding (if any) and few opportunities for academic recognition of this work. Hence, it is not surprising that many of the resource holders and developers have become reluctant to give free access to the latest versions of their resources, and their actual status is therefore currently rather unclear. The goal of this workshop is to study problems involved in the development, management and reuse of lexical resources in a multilingual context. Moreover, this workshop provides a forum for reviewing the present state of language resources. The workshop is meant to bring to the international community qualitative and quantitative information about the most recent developments in the area of linguistic resources and their use in applications. The impressive number of submissions (38) to this workshop and in other workshops and conferences dedicated to similar topics proves that dealing with multilingual linguistic ressources has become a very hot problem in the Natural Language Processing community. To cope with the number of submissions, the workshop organising committee decided to accept 16 papers from 10 countries based on the reviewers' recommendations. Six of these papers will be presented in a poster session. The papers constitute a representative selection of current trends in research on Multilingual Language Resources, such as multilingual aligned corpora, bilingual and multilingual lexicons, and multilingual speech resources. The papers also represent a characteristic set of approaches to the development of multilingual language resources, such as automatic extraction of information from corpora, combination and re-use of existing resources, online collaborative development of multilingual lexicons, and use of the Web as a multilingual language resource. The development and management of multilingual language resources is a long-term activity in which collaboration among researchers is essential. We hope that this workshop will gather many researchers involved in such developments and will give them the opportunity to discuss, exchange, compare their approaches and strengthen their collaborations in the field. The organisation of this workshop would have been impossible without the hard work of the program committee who managed to provide accurate reviews on time, on a rather tight schedule. We would also like to thank the Coling 2004 organising committee that made this workshop possible. Finally, we hope that this workshop will yield fruitful results for all participants
Cross-lingual Distillation for Text Classification
Cross-lingual text classification(CLTC) is the task of classifying documents
written in different languages into the same taxonomy of categories. This paper
presents a novel approach to CLTC that builds on model distillation, which
adapts and extends a framework originally proposed for model compression. Using
soft probabilistic predictions for the documents in a label-rich language as
the (induced) supervisory labels in a parallel corpus of documents, we train
classifiers successfully for new languages in which labeled training data are
not available. An adversarial feature adaptation technique is also applied
during the model training to reduce distribution mismatch. We conducted
experiments on two benchmark CLTC datasets, treating English as the source
language and German, French, Japan and Chinese as the unlabeled target
languages. The proposed approach had the advantageous or comparable performance
of the other state-of-art methods.Comment: Accepted at ACL 2017; Code available at
https://github.com/xrc10/cross-distil
MultiNews: a web collection of an aligned multimodal and multilingual corpus
Integrating Natural Language Processing
(NLP) and computer vision is a promising
effort. However, the applicability of these
methods directly depends on the availability of a specific multimodal data that includes images and texts. In this paper, we
present a collection of a Multimodal corpus of comparable document and their images in 9 languages from the web news articles of Euronews website.1 This corpus
has found widespread use in the NLP community in Multilingual and multimodal
tasks. Here, we focus on its acquisition
of the images and text data and their multilingual alignment
- …