Cross-lingual document retrieval categorisation and navigation based on distributed services
The widespread use of the Internet across countries has increased the need for access to document collections
that are often written in languages different from a user’s native language. In this paper we describe Clarity, a
Cross Language Information Retrieval (CLIR) system for English, Finnish, Swedish, Latvian and Lithuanian.
Clarity is a fully-fledged retrieval system that supports the user during the whole process of query formulation,
text retrieval and document browsing. We address four of the major aspects of Clarity: (i) the user-driven
methodology that formed the basis for the iterative design cycle and framework in the project, (ii) the system
architecture that was developed to support the interaction and coordination of Clarity’s distributed services, (iii)
the data resources and methods for query translation, and (iv) the support for Baltic languages. Clarity is an
example of a distributed CLIR system built with minimal translation resources and, to our knowledge, the only
such system that currently supports Baltic languages.
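One common way to build CLIR from minimal translation resources, as the abstract describes, is dictionary-based query translation. The sketch below illustrates that general technique only; the toy English–Finnish lexicon and the function name are assumptions for illustration, not Clarity's actual data or code.

```python
# Illustrative dictionary-based query translation (not Clarity's implementation).
# A real system would use a full bilingual lexicon plus disambiguation.
LEXICON = {"dog": ["koira"], "house": ["talo"], "big": ["suuri", "iso"]}

def translate_query(query, lexicon=LEXICON):
    """Replace each source-language term with all known target-language
    translations; untranslated terms are kept as-is, which helps with
    proper names and cognates."""
    translated = []
    for term in query.lower().split():
        translated.extend(lexicon.get(term, [term]))
    return translated

print(translate_query("big dog"))  # ['suuri', 'iso', 'koira']
```

Keeping every translation alternative in the target query (structured or weighted in practice) is a standard way to cope with ambiguity when no parallel corpus is available for disambiguation.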
MIRACLE’s hybrid approach to bilingual and monolingual Information Retrieval
The main goal of the bilingual and monolingual participation of the MIRACLE team at CLEF 2004 was to test the effect of combination approaches to information retrieval. The starting point is a set of basic components: stemming, transformation, filtering, generation of n-grams, weighting and relevance feedback. Some of these basic components are used in different combinations and orders of application for document indexing and for query processing. Beyond this, a second-order combination is performed, mainly by averaging or by selective combination of the documents retrieved by the different approaches for a particular query.
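Averaging the results of several retrieval approaches, as described above, is a standard score-fusion technique (CombSUM-style). The sketch below shows the general idea only; the run contents and scores are illustrative assumptions, not MIRACLE's actual runs.

```python
# Hedged sketch of second-order combination by score averaging.
# Each "run" maps document IDs to retrieval scores from one approach;
# documents absent from a run implicitly contribute a score of 0.
from collections import defaultdict

def average_fusion(runs):
    """Merge ranked lists from several retrieval approaches by
    averaging each document's score across all runs."""
    totals = defaultdict(float)
    for run in runs:
        for doc_id, score in run.items():
            totals[doc_id] += score
    n = len(runs)
    averaged = {doc_id: total / n for doc_id, total in totals.items()}
    # Rank documents by averaged score, best first.
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical runs from a stemming-based and an n-gram-based approach.
stemmed_run = {"d1": 0.9, "d2": 0.4, "d3": 0.2}
ngram_run = {"d1": 0.6, "d2": 0.7}
print(average_fusion([stemmed_run, ngram_run]))
```

In practice such fusion often also normalizes scores per run before averaging, since different approaches produce scores on different scales.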
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
The driving factors behind the development of large language models (LLMs)
with impressive learning capabilities are their colossal model sizes and
extensive training datasets. Along with the progress in natural language
processing, LLMs have been frequently made accessible to the public to foster
deeper investigation and applications. However, when it comes to training
datasets for these LLMs, especially the recent state-of-the-art models, they
are often not fully disclosed. Creating training data for high-performing LLMs
involves extensive cleaning and deduplication to ensure the necessary level of
quality. The lack of transparency for training data has thus hampered research
on attributing and addressing hallucination and bias issues in LLMs, hindering
replication efforts and further advancements in the community. These challenges
become even more pronounced in multilingual learning scenarios, where the
available multilingual text datasets are often inadequately collected and
cleaned. Consequently, there is a lack of open-source and readily usable
dataset to effectively train LLMs in multiple languages. To overcome this
issue, we present CulturaX, a substantial multilingual dataset with 6.3
trillion tokens in 167 languages, tailored for LLM development. Our dataset
undergoes meticulous cleaning and deduplication through a rigorous pipeline of
multiple stages to accomplish the best quality for model training, including
language identification, URL-based filtering, metric-based cleaning, document
refinement, and data deduplication. CulturaX is fully released to the public in
HuggingFace to facilitate research and advancements in multilingual LLMs:
https://huggingface.co/datasets/uonlp/CulturaX
Comment: Ongoing Work
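The multi-stage pipeline the abstract lists (language identification, URL-based filtering, metric-based cleaning, deduplication) can be sketched as a chain of simple filters. Everything below is an illustrative assumption: the blocklist, the thresholds, and the helper names are invented for the sketch and are not CulturaX's actual code or settings.

```python
# Toy sketch of a multi-stage corpus-cleaning pipeline; thresholds and
# helpers are illustrative assumptions, not the CulturaX implementation.
BLOCKED_DOMAINS = {"spam.example.com"}  # hypothetical URL blocklist

def url_ok(url):
    # URL-based filtering: drop documents from blocklisted domains.
    return not any(domain in url for domain in BLOCKED_DOMAINS)

def metric_ok(text):
    # Metric-based cleaning: reject very short or low-diversity documents.
    words = text.split()
    return len(words) >= 5 and len(set(words)) / len(words) > 0.3

def dedup(docs):
    # Exact deduplication on whitespace-normalized, lowercased text
    # (large-scale pipelines also use near-duplicate methods like MinHash).
    seen, kept = set(), []
    for doc in docs:
        key = " ".join(doc["text"].split()).lower()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

def clean(docs, lang="en"):
    docs = [d for d in docs if d["lang"] == lang]   # language identification
    docs = [d for d in docs if url_ok(d["url"])]    # URL-based filtering
    docs = [d for d in docs if metric_ok(d["text"])]  # metric-based cleaning
    return dedup(docs)                              # deduplication
```

Ordering the cheap filters first (language ID, URL checks) before the more expensive deduplication stage is a common design choice at corpus scale.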