Cross-lingual document retrieval categorisation and navigation based on distributed services
The widespread use of the Internet across countries has increased the need for access to document collections
that are often written in languages different from a user’s native language. In this paper we describe Clarity, a
Cross Language Information Retrieval (CLIR) system for English, Finnish, Swedish, Latvian and Lithuanian.
Clarity is a fully-fledged retrieval system that supports the user during the whole process of query formulation,
text retrieval and document browsing. We address four of the major aspects of Clarity: (i) the user-driven
methodology that formed the basis for the iterative design cycle and framework in the project, (ii) the system
architecture that was developed to support the interaction and coordination of Clarity’s distributed services, (iii)
the data resources and methods for query translation, and (iv) the support for Baltic languages. Clarity is an
example of a distributed CLIR system built with minimal translation resources and, to our knowledge, the only
such system that currently supports Baltic languages.
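One common way to build CLIR from minimal translation resources, as the abstract describes, is dictionary-based query translation. The sketch below illustrates that general technique only; the toy English–Finnish lexicon and the function name are assumptions for illustration, not Clarity's actual data or code.

```python
# Illustrative dictionary-based query translation (not Clarity's implementation).
# A real system would use a full bilingual lexicon plus disambiguation.
LEXICON = {"dog": ["koira"], "house": ["talo"], "big": ["suuri", "iso"]}

def translate_query(query, lexicon=LEXICON):
    """Replace each source-language term with all known target-language
    translations; untranslated terms are kept as-is, which helps with
    proper names and cognates."""
    translated = []
    for term in query.lower().split():
        translated.extend(lexicon.get(term, [term]))
    return translated

print(translate_query("big dog"))  # ['suuri', 'iso', 'koira']
```

Keeping every translation alternative in the target query (structured or weighted in practice) is a standard way to cope with ambiguity when no parallel corpus is available for disambiguation.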
MIRACLE’s hybrid approach to bilingual and monolingual Information Retrieval
The main goal of the bilingual and monolingual participation of the MIRACLE team at CLEF 2004 was to test the effect of combination approaches to information retrieval. The starting point is a set of basic components: stemming, transformation, filtering, generation of n-grams, weighting and relevance feedback. Some of these basic components are used in different combinations and orders of application for document indexing and for query processing. Beyond this, a second-order combination is performed, mainly by averaging or by selective combination of the documents retrieved by the different approaches for a particular query.
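Averaging the results of several retrieval approaches, as described above, is a standard score-fusion technique (CombSUM-style). The sketch below shows the general idea only; the run contents and scores are illustrative assumptions, not MIRACLE's actual runs.

```python
# Hedged sketch of second-order combination by score averaging.
# Each "run" maps document IDs to retrieval scores from one approach;
# documents absent from a run implicitly contribute a score of 0.
from collections import defaultdict

def average_fusion(runs):
    """Merge ranked lists from several retrieval approaches by
    averaging each document's score across all runs."""
    totals = defaultdict(float)
    for run in runs:
        for doc_id, score in run.items():
            totals[doc_id] += score
    n = len(runs)
    averaged = {doc_id: total / n for doc_id, total in totals.items()}
    # Rank documents by averaged score, best first.
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical runs from a stemming-based and an n-gram-based approach.
stemmed_run = {"d1": 0.9, "d2": 0.4, "d3": 0.2}
ngram_run = {"d1": 0.6, "d2": 0.7}
print(average_fusion([stemmed_run, ngram_run]))
```

In practice such fusion often also normalizes scores per run before averaging, since different approaches produce scores on different scales.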
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
The driving factors behind the development of large language models (LLMs)
with impressive learning capabilities are their colossal model sizes and
extensive training datasets. Along with the progress in natural language
processing, LLMs have been frequently made accessible to the public to foster
deeper investigation and applications. However, when it comes to training
datasets for these LLMs, especially the recent state-of-the-art models, they
are often not fully disclosed. Creating training data for high-performing LLMs
involves extensive cleaning and deduplication to ensure the necessary level of
quality. The lack of transparency for training data has thus hampered research
on attributing and addressing hallucination and bias issues in LLMs, hindering
replication efforts and further advancements in the community. These challenges
become even more pronounced in multilingual learning scenarios, where the
available multilingual text datasets are often inadequately collected and
cleaned. Consequently, there is a lack of open-source and readily usable
dataset to effectively train LLMs in multiple languages. To overcome this
issue, we present CulturaX, a substantial multilingual dataset with 6.3
trillion tokens in 167 languages, tailored for LLM development. Our dataset
undergoes meticulous cleaning and deduplication through a rigorous pipeline of
multiple stages to accomplish the best quality for model training, including
language identification, URL-based filtering, metric-based cleaning, document
refinement, and data deduplication. CulturaX is fully released to the public in
HuggingFace to facilitate research and advancements in multilingual LLMs:
https://huggingface.co/datasets/uonlp/CulturaX
Comment: Ongoing Work
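The multi-stage pipeline the abstract lists (language identification, URL-based filtering, metric-based cleaning, deduplication) can be sketched as a chain of simple filters. Everything below is an illustrative assumption: the blocklist, the thresholds, and the helper names are invented for the sketch and are not CulturaX's actual code or settings.

```python
# Toy sketch of a multi-stage corpus-cleaning pipeline; thresholds and
# helpers are illustrative assumptions, not the CulturaX implementation.
BLOCKED_DOMAINS = {"spam.example.com"}  # hypothetical URL blocklist

def url_ok(url):
    # URL-based filtering: drop documents from blocklisted domains.
    return not any(domain in url for domain in BLOCKED_DOMAINS)

def metric_ok(text):
    # Metric-based cleaning: reject very short or low-diversity documents.
    words = text.split()
    return len(words) >= 5 and len(set(words)) / len(words) > 0.3

def dedup(docs):
    # Exact deduplication on whitespace-normalized, lowercased text
    # (large-scale pipelines also use near-duplicate methods like MinHash).
    seen, kept = set(), []
    for doc in docs:
        key = " ".join(doc["text"].split()).lower()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

def clean(docs, lang="en"):
    docs = [d for d in docs if d["lang"] == lang]   # language identification
    docs = [d for d in docs if url_ok(d["url"])]    # URL-based filtering
    docs = [d for d in docs if metric_ok(d["text"])]  # metric-based cleaning
    return dedup(docs)                              # deduplication
```

Ordering the cheap filters first (language ID, URL checks) before the more expensive deduplication stage is a common design choice at corpus scale.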