
    SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity

    Instruction-tuned Large Language Models (LLMs) have recently showcased remarkable advancements in their ability to generate fitting responses to natural language instructions. However, many current works rely on manual evaluation to judge the quality of generated responses. Since such manual evaluation is time-consuming, it does not easily scale to the evaluation of multiple models and model variants. In this short paper, we propose a straightforward but remarkably effective evaluation metric called SemScore, in which we directly compare model outputs to gold target responses using semantic textual similarity (STS). We conduct a comparative evaluation of the model outputs of 12 prominent instruction-tuned LLMs using 8 widely used evaluation metrics for text generation. We find that our proposed SemScore metric outperforms all other, in many cases more complex, evaluation metrics in terms of correlation with human evaluation. These findings indicate the utility of our proposed metric for the evaluation of instruction-tuned LLMs.
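
    The core idea is simple enough to sketch in a few lines. Below is a minimal illustration of STS-based scoring using the sentence-transformers library; the embedding model name is an assumption for illustration, not necessarily the one used in the paper.

        from sentence_transformers import SentenceTransformer, util

        # Assumed embedding model; swap in the checkpoint of your choice.
        model = SentenceTransformer("all-mpnet-base-v2")

        def sem_score(prediction: str, gold: str) -> float:
            # Cosine similarity between embeddings of model output and gold response.
            emb = model.encode([prediction, gold], convert_to_tensor=True)
            return util.cos_sim(emb[0], emb[1]).item()

        print(sem_score("Paris is the capital of France.",
                        "The capital of France is Paris."))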

    CleanCoNLL: A Nearly Noise-Free Named Entity Recognition Dataset

    The CoNLL-03 corpus is arguably the most well-known and utilized benchmark dataset for named entity recognition (NER). However, prior works found significant numbers of annotation errors, incompleteness, and inconsistencies in the data. This poses challenges to objectively comparing NER approaches and analyzing their errors, as current state-of-the-art models achieve F1-scores that are comparable to or even exceed the estimated noise level in CoNLL-03. To address this issue, we present a comprehensive relabeling effort assisted by automatic consistency checking that corrects 7.0% of all labels in the English CoNLL-03. Our effort adds a layer of entity linking annotation, both for better explainability of NER labels and as an additional safeguard of annotation quality. Our experimental evaluation finds not only that state-of-the-art approaches reach significantly higher F1-scores (97.1%) on our data, but crucially that the share of correct predictions falsely counted as errors due to annotation noise drops from 47% to 6%. This indicates that our resource is well suited to analyze the remaining errors made by state-of-the-art models, and that the theoretical upper bound even on high-resource, coarse-grained NER is not yet reached. To facilitate such analysis, we make CleanCoNLL publicly available to the research community. Comment: EMNLP 2023 camera-ready version.
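
    Assuming CleanCoNLL is distributed in the standard CoNLL column format, it can be loaded with generic corpus readers. A minimal sketch using flair's ColumnCorpus follows; the folder layout, file names, and column indices are assumptions based on the usual CoNLL-03 conventions.

        from flair.datasets import ColumnCorpus

        # Assumed column layout following CoNLL-03 conventions.
        columns = {0: "text", 1: "pos", 2: "chunk", 3: "ner"}
        corpus = ColumnCorpus(
            "resources/cleanconll",          # hypothetical data folder
            columns,
            train_file="cleanconll.train",   # hypothetical file names
            dev_file="cleanconll.dev",
            test_file="cleanconll.test",
        )
        print(corpus)  # prints train/dev/test sentence counts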

    BEAR: A Unified Framework for Evaluating Relational Knowledge in Causal and Masked Language Models

    Knowledge probing assesses the degree to which a language model (LM) has successfully learned relational knowledge during pre-training. Probing is an inexpensive way to compare LMs of different sizes and training configurations. However, previous approaches rely on the objective function used in pre-training LMs and are thus applicable only to masked or causal LMs. As a result, comparing different types of LMs becomes impossible. To address this, we propose an approach that uses an LM's inherent ability to estimate the log-likelihood of any given textual statement. We carefully design an evaluation dataset of 7,731 instances (40,916 in a larger variant) from which we produce alternative statements for each relational fact, one of which is correct. We then evaluate whether an LM correctly assigns the highest log-likelihood to the correct statement. Our experimental evaluation of 22 common LMs shows that our proposed framework, BEAR, can effectively probe for knowledge across different LM types. We release the BEAR datasets and an open-source framework that implements the probing approach to the research community to facilitate the evaluation and development of LMs. Comment: NAACL 2024.
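
    The probing mechanics can be sketched directly with Hugging Face transformers: score each alternative statement by its total log-likelihood under the LM and check whether the correct one ranks highest. The model name and example statements below are illustrative assumptions, not the BEAR data.

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        model_name = "gpt2"  # any causal LM; illustrative choice
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        model.eval()

        def log_likelihood(statement: str) -> float:
            # Total log-probability of the statement's tokens under the LM.
            ids = tok(statement, return_tensors="pt").input_ids
            with torch.no_grad():
                out = model(ids, labels=ids)
            # out.loss is the mean negative log-likelihood per predicted token.
            return -out.loss.item() * (ids.size(1) - 1)

        statements = [
            "The capital of France is Paris.",   # correct alternative
            "The capital of France is Berlin.",  # distractor
        ]
        print(max(statements, key=log_likelihood))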

    Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher LLMs

    Most NLP tasks are modeled as supervised learning and thus require labeled training data to train effective models. However, manually producing such data at sufficient quality and quantity is known to be costly and time-intensive. Current research addresses this bottleneck by exploring a novel paradigm called zero-shot learning via dataset generation. Here, a powerful LLM is prompted with a task description to generate labeled data that can be used to train a downstream NLP model. For instance, an LLM might be prompted to "generate 500 movie reviews with positive overall sentiment, and another 500 with negative sentiment." The generated data could then be used to train a binary sentiment classifier, effectively leveraging an LLM as a teacher to a smaller student model. With this demo, we introduce Fabricator, an open-source Python toolkit for dataset generation. Fabricator implements common dataset generation workflows, supports a wide range of downstream NLP tasks (such as text classification, question answering, and entity recognition), and is integrated with well-known libraries to facilitate quick experimentation. With Fabricator, we aim to support researchers in conducting reproducible dataset generation experiments using LLMs and help practitioners apply this approach to train models for downstream tasks. Comment: 3 figures and 2 tables.
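
    The underlying teacher-student paradigm is easy to sketch without the toolkit itself. Below is a minimal illustration using the OpenAI client as the teacher; the model name, prompt, and line-based parsing are assumptions for illustration and do not reflect Fabricator's actual API.

        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment

        prompt = ("Generate 5 short movie reviews with positive overall "
                  "sentiment, one review per line.")
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed teacher model
            messages=[{"role": "user", "content": prompt}],
        )
        # Pair each generated review with its prompted label.
        dataset = [(line.strip(), "positive")
                   for line in response.choices[0].message.content.splitlines()
                   if line.strip()]
        print(dataset[:2])  # labeled examples for training a student classifier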

    Automatic preservation watch using information extraction on the Web: a case study on semantic extraction of natural language for digital preservation

    The ability to recognize when digital content is becoming endangered is essential for maintaining long-term, continuous and authentic access to digital assets. To achieve this ability, knowledge about aspects of the world that might hinder the preservation of content is needed. However, the processes of gathering, managing and reasoning on knowledge can become manually infeasible when the volume and heterogeneity of content increases, multiplying the aspects to monitor. Automation of these processes is possible [11,21], but its usefulness is limited by the data it is able to gather. Up to now, automatic digital preservation processes have been restricted to knowledge expressed in a machine-understandable language, ignoring a plethora of data expressed in natural language, such as the DPC Technology Watch Reports, which could greatly contribute to the completeness and freshness of data about aspects of the world related to digital preservation. This paper presents a real case scenario from the National Library of the Netherlands, where the monitoring of publishers and journals is needed. This knowledge is mostly represented in natural language on the websites of publishers and is therefore difficult to monitor automatically. In this paper, we demonstrate how we use information extraction technologies to find and extract machine-readable information on publishers and journals for ingestion into automatic digital preservation watch tools. We show that the results of automatic semantic extraction are a good complement to existing knowledge bases on publishers [9, 20], finding newer and more complete data. We demonstrate the viability of the approach as an alternative or auxiliary method for automatically gathering information on preservation risks in digital content.
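
    As an illustration of the general pipeline, the sketch below fetches a publisher page and extracts organization mentions with an off-the-shelf NER model; spaCy is a stand-in for the extraction technology used in the paper, and the URL is hypothetical.

        import requests
        import spacy
        from bs4 import BeautifulSoup

        nlp = spacy.load("en_core_web_sm")  # generic English NER model

        # Hypothetical publisher page; the real system monitors many such sites.
        html = requests.get("https://example-publisher.org/about").text
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

        doc = nlp(text)
        orgs = {ent.text for ent in doc.ents if ent.label_ == "ORG"}
        print(orgs)  # candidate publisher names for a preservation watch tool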

    HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools

    With the exponential growth of the life science literature, biomedical text mining (BTM) has become an essential technology for accelerating the extraction of insights from publications. Identifying named entities (e.g., diseases, drugs, or genes) in texts and linking them to reference knowledge bases are crucial steps in BTM pipelines that enable information aggregation across documents. However, tools for these two steps are rarely applied in the same context in which they were developed. Instead, they are applied in the wild, i.e., on application-dependent text collections different from those used for the tools' training, varying, e.g., in focus, genre, style, and text type. This raises the question of whether the reported performance of BTM tools can be trusted for downstream applications. Here, we report the results of a carefully designed cross-corpus benchmark for named entity extraction, where tools were applied systematically to corpora not used during their training. Based on a survey of 28 published systems, we selected five for an in-depth analysis on three publicly available corpora encompassing four different entity types. The comparison yields a mixed picture and shows that, in a cross-corpus setting, performance is significantly lower than that reported in an in-corpus setting. HunFlair2 showed the best performance on average, closely followed by PubTator. Our results indicate that users of BTM tools should expect diminished performance when applying them in the wild compared to the original publications, and show that further research is necessary to make BTM tools more robust.
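
    For context, applying such a tagger typically takes only a few lines with the flair library. A minimal usage sketch follows; the model identifier "hunflair2" is an assumption, so check the flair documentation for the exact name available in your version.

        from flair.data import Sentence
        from flair.nn import Classifier

        tagger = Classifier.load("hunflair2")  # assumed model identifier

        sentence = Sentence("Mutations in BRCA1 increase the risk of breast cancer.")
        tagger.predict(sentence)

        for label in sentence.get_labels():
            print(label)  # tagged genes, diseases, etc.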

    Exploratory Relation Extraction in Large Multilingual Data

    The task of Relation Extraction (RE) is concerned with creating extractors that automatically find structured, relational information in unstructured data such as natural language text. Motivated by an explosion of sources of readily available text data such as the Web, RE offers intriguing possibilities for querying, organizing, and analyzing information by drawing upon the clean semantics of structured databases and the abundance of unstructured data. However, practical applications of RE are often characterized by vague and shifting information needs on the one hand and large multilingual datasets of unknown content on the other. Classical RE approaches are unable to handle such scenarios since they require a careful, upfront definition of extraction tasks before extractors can be created in an effort-intensive, time-consuming process. With this thesis, I propose the paradigm of Exploratory Relation Extraction (ERE), a user-driven but data-guided process of exploration for relations of interest in unknown data. I show how distributional evidence and an informed linguistic abstraction can be employed to allow users to openly explore a dataset for relations of interest and rapidly prototype extractors for discovered relations at minimal effort. Furthermore, I propose the use of a language-neutral representation of shallow semantics to address the issue of multilingual data. This representation enables a shared feature space for different languages against which extractors can be developed. I present a method that expands English-language Semantic Role Labeling (SRL) to other languages and use it to generate multilingual SRL resources for seven distinct languages from different language groups, namely Arabic, Chinese, French, German, Hindi, Russian and Spanish, in order to bootstrap semantic parsers for these languages. Together, the researched approaches represent a novel way for data scientists to work with large multilingual datasets of unknown content.
