    From text to knowledge: multilingual information extraction for knowledge graph construction

    In the era of Large Language Models (LLMs), Information Extraction (IE) may seem like a “Chronicle of a Death Foretold”. Between 2020 and 2023, it ranked among the top three most popular topics at conferences like ACL, yet by 2024, it had dropped to tenth place. The advent of Transformer Language Models (LMs), emerging just before work on this dissertation began, has transformed the field of Natural Language Processing (NLP), enabling unprecedented performance across a broad range of Natural Language Understanding (NLU) tasks. Surprisingly, scaling these models into LLMs has not led to diminishing returns but has instead further expanded their capabilities. However, there remains a need for efficient methods suitable for real-world applications that require low latency or the ability to process large volumes of real-time data—domains where large models are often impractical. Additionally, tasks reliant on LLMs’ parametric memory face limitations due to neural inference, where accuracy and recency of information cannot always be guaranteed. While LLMs show great promise, they increasingly require grounding in external knowledge sources for reliable results. This is where IE becomes indispensable. Rather than being replaced, IE complements and strengthens LLMs, supporting their reasoning with accurate, grounded information. Knowledge Graphs (KGs) serve as structured frameworks that bridge unstructured text and structured knowledge, enabling scalable, interpretable organization of vast amounts of information. Essential for applications like semantic search, recommendation systems, and question-answering, KGs rely heavily on robust IE techniques. In this thesis, we focus on advancing multilingual IE methods to enhance KG construction and address limitations in existing IE systems

    Dissecting Biases in Relation Extraction: A Cross-Dataset Analysis on People’s Gender and Origin

    Relation Extraction (RE) is at the core of many Natural Language Understanding tasks, including knowledge-base population and Question Answering. However, any Natural Language Processing system is exposed to biases, and the analysis of these has not received much attention in RE. We propose a new method for inspecting bias in the RE pipeline, which is completely transparent in terms of interpretability. Specifically, in this work we analyze biases related to gender and place of birth. Our methodology includes (i) obtaining semantic triplets (subject, object, semantic relation) involving ‘person’ entities from RE resources, (ii) collecting meta-information (‘gender’ and ‘place of birth’) using Entity Linking technologies, and then (iii) analyzing the distribution of triplets across different groups (e.g., men versus women). We investigate bias at two levels: in the training data of three commonly used RE datasets (SREDFM, CrossRE, NYT), and in the predictions of a state-of-the-art RE approach (ReLiK). To enable cross-dataset analysis, we introduce a taxonomy of relation types mapping the label sets of different RE datasets to a unified label space. Our findings reveal that bias is a compounded issue affecting underrepresented groups within data and predictions for RE.
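
    Step (iii) of the methodology above, comparing how triplets are distributed across groups, can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the authors' code: the function name `relation_share_by_group`, the (subject, relation, object) tuple layout, the toy identifiers, and the choice to skip subjects with no linked metadata.

```python
from collections import Counter, defaultdict


def relation_share_by_group(triplets, group_of):
    """Return {relation: {group: share}} over the subjects of the triplets.

    triplets: iterable of (subject_id, relation, object_id) tuples.
    group_of: dict mapping a subject_id to a group label (e.g. gender).
    Subjects without a group label are skipped (an assumption of this sketch).
    """
    counts = defaultdict(Counter)
    for subject, relation, _obj in triplets:
        group = group_of.get(subject)
        if group is None:
            continue  # no metadata was linked for this entity
        counts[relation][group] += 1

    shares = {}
    for relation, by_group in counts.items():
        total = sum(by_group.values())
        shares[relation] = {g: n / total for g, n in by_group.items()}
    return shares


# Hypothetical toy data: identifiers and labels are made up.
triplets = [
    ("Q1", "occupation", "Q100"),
    ("Q2", "occupation", "Q101"),
    ("Q3", "occupation", "Q102"),
    ("Q1", "spouse", "Q2"),
    ("Q4", "occupation", "Q103"),  # Q4 has no metadata and is skipped
]
gender = {"Q1": "female", "Q2": "male", "Q3": "male"}

shares = relation_share_by_group(triplets, gender)
for relation, by_group in sorted(shares.items()):
    print(relation, by_group)
```

    A skew in these shares for a given relation type is the kind of signal such an analysis would look for; the same function could be applied to model predictions instead of training triplets.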

    MOSAICo: a Multilingual Open-text Semantically Annotated Interlinked Corpus

    Several Natural Language Understanding (NLU) tasks focus on linking text to explicit knowledge, including Word Sense Disambiguation, Semantic Role Labeling, Semantic Parsing, and Relation Extraction. In addition to the importance of connecting raw text with explicit knowledge bases, the integration of such carefully curated knowledge into deep learning models has been shown to be beneficial across a diverse range of applications, including Language Modeling and Machine Translation. Nevertheless, the scarcity of semantically-annotated corpora across various tasks and languages limits the potential advantages significantly. To address this issue, we put forward MOSAICo, the first endeavor aimed at equipping the research community with the key ingredients to model explicit semantic knowledge at a large scale, providing hundreds of millions of silver yet high-quality annotations for four NLU tasks across five languages. We describe the creation process of MOSAICo, demonstrate its quality and variety, and analyze the interplay between different types of semantic information. MOSAICo, available at https://github.com/SapienzaNLP/mosaico, aims to drop the requirement of closed, licensed datasets and represents a step towards a level playing field across languages and tasks in NLU

    ReLiK: Retrieve and LinK, Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget

    Entity Linking (EL) and Relation Extraction (RE) are fundamental tasks in Natural Language Processing, serving as critical components in a wide range of applications. In this paper, we propose ReLiK, a Retriever-Reader architecture for both EL and RE, where, given an input text, the Retriever module identifies candidate entities or relations that could potentially appear within the text. Subsequently, the Reader module discerns the pertinent retrieved entities or relations and aligns them with the corresponding textual spans. Notably, we put forward an innovative input representation that incorporates the candidate entities or relations alongside the text, making it possible to link entities or extract relations in a single forward pass and to fully leverage the contextualization capabilities of pre-trained language models, in contrast with previous Retriever-Reader-based methods, which require a forward pass for each candidate. Our formulation of EL and RE achieves state-of-the-art performance in both in-domain and out-of-domain benchmarks while using academic budget training and with up to 40x faster inference than competitors. Finally, we show how our architecture can be used seamlessly for closed Information Extraction (cIE), i.e. EL + RE, setting a new state of the art by employing a shared Reader that simultaneously extracts entities and relations.
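
    The contrast drawn above, one forward pass over text plus all candidates versus one pass per candidate, can be sketched at the level of input construction. This is only an illustration under stated assumptions: the marker format `<C0>`, the `[SEP]` separator, and both function names are hypothetical and are not taken from the ReLiK implementation.

```python
def build_reader_input(text, candidates, sep="[SEP]"):
    """Pack the text and every retrieved candidate into ONE sequence.

    Each candidate is prefixed by a hypothetical marker token (<C0>, <C1>, ...)
    so a reader could score spans against all candidates in a single pass.
    """
    parts = [text, sep]
    for i, cand in enumerate(candidates):
        parts.append(f"<C{i}> {cand}")
    return " ".join(parts)


def per_candidate_inputs(text, candidates, sep="[SEP]"):
    """Baseline layout: one sequence, hence one forward pass, per candidate."""
    return [f"{text} {sep} {cand}" for cand in candidates]


# Hypothetical example text and retrieved candidates.
text = "Rome is the capital of Italy."
candidates = ["Rome", "Italy", "Roma (football club)"]

single = build_reader_input(text, candidates)
baseline = per_candidate_inputs(text, candidates)

print(single)
print(len(baseline), "sequences in the per-candidate layout")
```

    With N retrieved candidates, the packed layout yields one sequence to encode while the per-candidate layout yields N, which is where a speed-up of the kind reported above would come from.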

    Us vs. Them: A Dataset of Populist Attitudes, News Bias and Emotions

    Computational modelling of political discourse tasks has become an increasingly important area of research in natural language processing. Populist rhetoric has risen across the political sphere in recent years; however, computational approaches to it have been scarce due to its complex nature. In this paper, we present the new Us vs. Them dataset, consisting of 6861 Reddit comments annotated for populist attitudes and the first large-scale computational models of this phenomenon. We investigate the relationship between populist mindsets and social groups, as well as a range of emotions typically associated with these. We set a baseline for two tasks related to populist attitudes and present a set of multi-task learning models that leverage and demonstrate the importance of emotion and group identification as auxiliary tasks

    Socio-Economic or Emotional Predictors of Populist Attitudes across Europe

    Previous research on predictors of populism has predominantly focused on socio-economic (e.g., education, employment, social status) and socio-cultural factors (e.g., social identity and social status). However, in recent years, the role of negative emotions has become increasingly prominent in the study of populism. We conducted a cross-national survey in 15 European countries (N=8059), measuring emotions towards the government and the elites, perceptions of threats about the future, and socio-economic factors as predictors of populist attitudes (the latter operationalized via three existing scales, anti-elitism, Manichaean outlook, people-centrism, and a newly developed scale on nativism). We tested the role of emotional factors in a deductive research design based on a structural model. Our results show that negative emotions (anger, contempt and anxiety) are better predictors of populist attitudes than mere socio-economic and socio-cultural factors. An inductive machine learning algorithm, Random Forest (RF), reaffirmed the importance of emotions across our survey dataset.

    REDFM: a Filtered and Multilingual Relation Extraction Dataset

    Relation Extraction (RE) is a task that identifies relationships between entities in a text, enabling the acquisition of relational facts and bridging the gap between natural language and structured knowledge. However, current RE models often rely on small datasets with low coverage of relation types, particularly when working with languages other than English. In this paper, we address the above issue and provide two new resources that enable the training and evaluation of multilingual RE systems. First, we present SREDFM, an automatically annotated dataset covering 18 languages, 400 relation types, 13 entity types, totaling more than 40 million triplet instances. Second, we propose REDFM, a smaller, human-revised dataset for seven languages that allows for the evaluation of multilingual RE systems. To demonstrate the utility of these novel datasets, we experiment with the first end-to-end multilingual RE model, mREBEL, that extracts triplets, including entity types, in multiple languages. We release our resources and model checkpoints at https://www.github.com/babelscape/rebel.