314 research outputs found

    Enabling Cross-lingual Information Retrieval for African Languages

    Language diversity in NLP is critical in enabling the development of tools for a wide range of users. However, there are limited resources for building such tools for many languages, particularly those spoken in Africa. For search, most existing datasets feature few to no African languages, directly impacting researchers’ ability to build and improve information access capabilities in those languages. Motivated by this, we created AfriCLIRMatrix, a test collection for cross-lingual information retrieval research in 15 diverse African languages automatically created from Wikipedia. The dataset comprises 6 million queries in English and 23 million relevance judgments automatically extracted from Wikipedia inter-language links. We extract 13,050 test queries with relevance judgments across 15 languages, covering a significantly broader range of African languages than other existing information retrieval test collections. In addition to providing a much-needed resource for researchers, we also release BM25, dense retrieval, and sparse-dense hybrid baselines to establish a starting point for the development of future systems. We hope that our efforts will stimulate further research in information retrieval for African languages and lead to the creation of more effective tools for the benefit of users.

    Geographic information extraction from texts

    A large volume of unstructured texts, containing valuable geographic information, is available online. This information – provided implicitly or explicitly – is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although substantial progress has been made in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data to applications and privacy. Therefore, this workshop will provide a timely opportunity to discuss recent advances, new ideas, and concepts, and also to identify research gaps in geographic information extraction.

    Building cross-language corpora for human understanding of privacy policies

    Making sure that users understand privacy policies that impact them is a key challenge for a real GDPR deployment. Research studies are mostly carried out in English, but in Europe and elsewhere, users speak a language that is not English. Replicating studies in different languages requires the availability of comparable cross-language privacy policy corpora. This work provides a methodology for building comparable cross-language corpora in a national language and a reference study language. We provide an application example of our methodology comparing English and Italian, extending the corpus of one of the first studies about users' understanding of technical terms in privacy policies. We also investigate other open issues that can make replication harder.

    Enhancing Plagiarism Detection: The Role of Artificial Intelligence in Upholding Academic Integrity

    Plagiarism poses a significant threat to academic integrity, requiring effective measures for its detection and prevention. This paper explores the efficacy of plagiarism detection tools in upholding academic integrity, with a specific focus on the use of artificial intelligence (AI) technologies. The abstract introduces the concept of plagiarism and its impact on scholarly work. It highlights the importance of reliable and accurate plagiarism detection methods and emphasizes the role of AI in enhancing the effectiveness of such tools. The abstract briefly outlines the main points covered in the paper, including the use of AI techniques such as text matching algorithms and natural language processing, the application of machine learning in plagiarism detection, and the challenges and advancements in cross-language detection. The abstract concludes by emphasizing the importance of promoting ethical scholarship and academic integrity in educational institutions.

    Facilitating Information Access for Heterogeneous Data Across Many Languages

    Information access, which enables people to identify, retrieve, and use information freely and effectively, has attracted interest from academia and industry. Systems for document retrieval and question answering have helped people access information in powerful and useful ways. Recently, natural language technologies based on neural networks have been applied to various tasks for information access. Specifically, transformer-based pre-trained models have pushed tasks such as document and passage retrieval to new state-of-the-art effectiveness. (1) Most of the research has focused on helping people access passages and documents on the web. However, there is abundant information stored in other formats such as semi-structured tables and domain-specific relational databases in companies. Development of models and frameworks that support accessing information in these data formats is also essential. (2) Moreover, most of the advances in information access research are based on English, leaving other languages less explored. It is insufficient and inequitable in our globalized and connected world to serve only speakers of English. In this thesis, we explore and develop models and frameworks that could alleviate the aforementioned challenges. This dissertation consists of three parts. We begin with a discussion on developing models designed for accessing data in formats other than passages and documents. We mainly focus on two data formats, namely semi-structured tables and relational databases. In the second part, we discuss methods that can enhance the user experience for non-English speakers when using information access systems. Specifically, we first introduce model development for multilingual knowledge graph integration, which can benefit many information access applications such as cross-lingual question answering systems and other knowledge-driven cross-lingual NLP applications. We further focus on multilingual dense document retrieval and reranking that boost the effectiveness of search engines for non-English information access. Last but not least, we take a step further based on the aforementioned two parts by investigating models and frameworks that can help non-English speakers access structured data. In detail, we present cross-lingual Text-to-SQL semantic parsing systems that enable non-English speakers to query relational databases with queries in their languages.

    Bridging Language Gaps in Health Information Access: Konkani-English CLIR System for Medical Knowledge

    This paper addresses the challenges posed by linguistic diversity in terms of medical information by introducing a Cross-Language Information Retrieval System attuned to the needs of Konkani language information seekers. The proposed system takes Konkani queries entered by the user, translates them to English, and retrieves the documents using a thesaurus-based approach. Various strategies have also been considered to address the challenges posed by the source language, Konkani, which is a minority language spoken in the Indian subcontinent. The proposed approach showcases the potential of combining language technology, information retrieval, and medical domain expertise to bridge linguistic barriers. As healthcare information remains a critical societal need, this work holds promise in facilitating equitable access to medical knowledge.

    Digitale d'autore. Macchine, archivi, letterature

    The volume begins with an introductory survey of the relationship between writers and computers; it defines born-digital literary archives, provides some examples from the international scene, and outlines a first mapping of Italian experiences, focusing in particular on the case of the Franco Fortini archive held at the Università degli Studi di Siena. It offers a summary of the first Italian project dedicated to born-digital literature, PAD (Pavia Archivi Digitali), analyzing the processes of acquisition and management of the collections, now held at the Centro Manoscritti di Pavia. Finally, it proposes a critical analysis of the first three works of Francesco Pecoraro in light of the digital archive held in Pavia.

    Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

    The recent popularity of large language models (LLMs) has had a significant impact on many fields, particularly through their open-ended ecosystem of APIs, open-sourced models, and plugins. However, despite their widespread deployment, there is a general lack of research that thoroughly discusses and analyzes the potential risks they conceal. We therefore conduct a preliminary but pioneering study covering the robustness, consistency, and credibility of LLM systems. With most of the related literature in the era of LLMs uncharted, we propose an automated workflow that copes with an upscaled number of queries/responses. Overall, we conduct over a million queries to the mainstream LLMs including ChatGPT, LLaMA, and OPT. The core of our workflow consists of a data primitive, followed by an automated interpreter that evaluates these LLMs under different adversarial metrical systems. As a result, we draw several, and perhaps unfortunate, conclusions that are quite uncommon in this trendy community. Briefly, they are: (i) minor but inevitable errors in user-generated query input may, by chance, cause the LLM to respond unexpectedly; (ii) LLMs possess poor consistency when processing semantically similar query input. In addition, as a side finding, we find that ChatGPT is still capable of yielding the correct answer even when the input is polluted at an extreme level. While this phenomenon demonstrates the powerful memorization of the LLMs, it raises serious concerns about using such data for LLM-involved evaluation in academic development. To deal with it, we propose a novel index associated with a dataset that roughly decides the feasibility of using such data for LLM-involved evaluation. Extensive empirical studies are provided to support the aforementioned claims.

    Navigating Copyright for Libraries

    Much of the information that libraries make available is protected by copyright or subject to the terms of license agreements. This reader presents an overview of current issues in copyright law reform. The chapters present salient points, overviews of the law and legal concepts, selected comparisons of approaches around the world, significance of the topic, and opportunities for reform, advocacy, and other related resources