ExTaSem! Extending, Taxonomizing and Semantifying Domain Terminologies
We introduce EXTASEM!, a novel approach for the automatic learning of lexical taxonomies from domain terminologies. First, we exploit a very large semantic network to collect thousands of in-domain textual definitions. Second, we extract (hyponym, hypernym) pairs from each definition with a CRF-based algorithm trained on manually validated data. Finally, we introduce a graph induction procedure which constructs a full-fledged taxonomy where each edge is weighted according to its domain pertinence. EXTASEM! achieves state-of-the-art results in the following taxonomy evaluation experiments: (1) hypernym discovery, (2) reconstructing gold standard taxonomies, and (3) taxonomy quality according to structural measures. We release weighted taxonomies for six domains for the use and scrutiny of the community.
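For illustration, a minimal sketch of the graph-induction idea (not the authors' implementation): build a directed graph from extracted (hyponym, hypernym) pairs, attach a domain-pertinence weight to each edge, and keep only the strongest hypernym per term. The pairs and weights below are invented.

```python
# Toy sketch: weighted taxonomy induction from (hyponym, hypernym) pairs.
# Pairs and domain-pertinence scores are made up for illustration.
import networkx as nx

pairs = [
    ("espresso", "coffee", 0.92),
    ("espresso", "beverage", 0.41),
    ("coffee", "beverage", 0.88),
]

g = nx.DiGraph()
for hypo, hyper, weight in pairs:
    g.add_edge(hypo, hyper, weight=weight)

# Keep only the highest-weighted hypernym for each term (a simple stand-in
# for the paper's more elaborate graph induction procedure).
taxonomy = nx.DiGraph()
for node in g.nodes:
    out_edges = list(g.out_edges(node, data=True))
    if out_edges:
        src, dst, data = max(out_edges, key=lambda e: e[2]["weight"])
        taxonomy.add_edge(src, dst, weight=data["weight"])

print(list(taxonomy.edges(data=True)))
```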
Description and Evaluation of a Definition Extraction System for Catalan
Automatic Definition Extraction (DE) consists of identifying definitions in naturally-occurring text. This paper presents a method for the identification of definitions in Catalan in the encyclopedic domain. The training and test corpora come from the Catalan Wikipedia (Viquipèdia), and the test set has been manually validated. We approach the task as a supervised classification problem, using the Conditional Random Fields algorithm. In addition to the common linguistic features, we introduce features that exploit the frequency of a word in general and specific domains, in definitional and non-definitional sentences, and in definiendum (term to be defined) and definiens (cluster of words that defines the definiendum) position. We obtain promising results that suggest that combining linguistic and statistical features can prove useful for developing DE systems for under-resourced languages.
This work was partially funded by project TIN2012-38584-C06-03 of the Ministerio de Economía y Competitividad, Secretaría de Estado de Investigación, Desarrollo e Innovación, Spain.
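As a rough illustration of the kind of CRF classifier described above, here is a minimal sketch using sklearn-crfsuite. The feature set and the frequency table are hypothetical stand-ins for the paper's linguistic and statistical features, and the sentence and labels are invented.

```python
# Minimal sketch of CRF-based definition extraction (not the paper's system).
import sklearn_crfsuite

FREQ = {"és": 0.9, "un": 0.8, "una": 0.8}  # hypothetical corpus frequencies

def token_features(sent, i):
    word = sent[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "freq_general": FREQ.get(word.lower(), 0.0),  # stand-in statistical feature
        "prev": sent[i - 1].lower() if i > 0 else "BOS",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "EOS",
    }

sents = [["Barcelona", "és", "una", "ciutat", "de", "Catalunya", "."]]
labels = [["B-TERM", "O", "B-DEF", "I-DEF", "I-DEF", "I-DEF", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))
```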
WIKITIDE: A Wikipedia-Based Timestamped Definition Pairs Dataset
A fundamental challenge in the current NLP context, dominated by language
models, comes from the inflexibility of current architectures to 'learn' new
information. While model-centric solutions like continual learning or
parameter-efficient fine-tuning are available, the question still remains of
how to reliably identify changes in language or in the world. In this paper, we
propose WikiTiDe, a dataset derived from pairs of timestamped definitions
extracted from Wikipedia. We argue that such a resource can be helpful for
accelerating diachronic NLP, specifically for training models able to scan
knowledge resources for core updates concerning a concept, an event, or a named
entity. Our proposed end-to-end method is fully automatic, and leverages a
bootstrapping algorithm for gradually creating a high-quality dataset. Our
results suggest that bootstrapping the seed version of WikiTiDe leads to better
fine-tuned models. We also leverage fine-tuned models in a number of downstream
tasks, showing promising results with respect to competitive baselines.
Comment: Accepted at the RANLP 2023 main conference.
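The bootstrapping step can be pictured as a standard self-training loop. The sketch below is an assumption-laden toy (invented definition pairs, a TF-IDF plus logistic-regression scorer, an arbitrary confidence threshold), not the released pipeline.

```python
# Toy self-training loop in the spirit of the bootstrapping step:
# seed set -> train -> label high-confidence pairs -> retrain.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

seed_texts = [
    "old: a web search engine [SEP] new: a web search engine",         # no core update
    "old: a Roman general [SEP] new: a Roman general and statesman",   # core update
]
seed_labels = [0, 1]
unlabelled = [
    "old: a small town in Italy [SEP] new: a small town in Italy",
    "old: a programming language [SEP] new: a programming language for the JVM",
]

vec = TfidfVectorizer()
texts, labels = list(seed_texts), list(seed_labels)
for _ in range(3):  # a few bootstrapping rounds
    clf = LogisticRegression().fit(vec.fit_transform(texts), labels)
    probs = clf.predict_proba(vec.transform(unlabelled))
    for text, p in zip(unlabelled, probs):
        if p.max() > 0.6 and text not in texts:  # keep only confident predictions
            texts.append(text)
            labels.append(int(p.argmax()))

print(len(texts), "training pairs after bootstrapping")
```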
Probing Pre-Trained Language Models for Disease Knowledge
Pre-trained language models such as ClinicalBERT have achieved impressive
results on tasks such as medical Natural Language Inference. At first glance,
this may suggest that these models are able to perform medical reasoning tasks,
such as mapping symptoms to diseases. However, we find that standard benchmarks
such as MedNLI contain relatively few examples that require such forms of
reasoning. To better understand the medical reasoning capabilities of existing
language models, in this paper we introduce DisKnE, a new benchmark for Disease
Knowledge Evaluation. To construct this benchmark, we annotated each positive
MedNLI example with the types of medical reasoning that are needed. We then
created negative examples by corrupting these positive examples in an
adversarial way. Furthermore, we define training-test splits per disease,
ensuring that no knowledge about test diseases can be learned from the training
data, and we canonicalize the formulation of the hypotheses to avoid the
presence of artefacts. This leads to a number of binary classification
problems, one for each type of reasoning and each disease. When analysing
pre-trained models for the clinical/biomedical domain on the proposed
benchmark, we find that their performance drops considerably.
Comment: Accepted to ACL 2021 Findings.
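The per-disease split can be sketched as follows; the fields and examples are invented, and this is not the released benchmark code.

```python
# Sketch of per-disease train/test splits: examples mentioning the held-out
# disease never appear in the corresponding training set.
examples = [
    {"disease": "pneumonia", "hypothesis": "...", "label": 1},
    {"disease": "asthma", "hypothesis": "...", "label": 0},
    {"disease": "sepsis", "hypothesis": "...", "label": 1},
]

def split_for(test_disease, data):
    train = [ex for ex in data if ex["disease"] != test_disease]
    test = [ex for ex in data if ex["disease"] == test_disease]
    return train, test

for disease in sorted({ex["disease"] for ex in examples}):
    train, test = split_for(disease, examples)
    print(disease, "->", len(train), "train /", len(test), "test")
```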
Knowledge base unification via sense embeddings and disambiguation
We present KB-UNIFY, a novel approach for integrating the output of different Open Information Extraction systems into a single unified and fully disambiguated knowledge repository. KB-UNIFY consists of three main steps: (1) disambiguation of relation argument pairs via a sense-based vector representation and a large unified sense inventory; (2) ranking of semantic relations according to their degree of specificity; (3) cross-resource relation alignment and merging based on the semantic similarity of domains and ranges. We tested KB-UNIFY on a set of four heterogeneous knowledge bases, obtaining high-quality results. We discuss and provide evaluations at each stage, and release output and evaluation data for the use and scrutiny of the community.
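Step (3) can be illustrated with a toy alignment based on cosine similarity between relation embeddings. The vectors below are random stand-ins for the sense embeddings used in the paper, so the resulting alignments are meaningless; only the mechanics are shown.

```python
# Toy cross-resource relation alignment via cosine similarity of embeddings.
import numpy as np

rng = np.random.default_rng(0)
rels_a = {"bornIn": rng.normal(size=8), "worksFor": rng.normal(size=8)}
rels_b = {"placeOfBirth": rng.normal(size=8), "employer": rng.normal(size=8)}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

for name_a, vec_a in rels_a.items():
    best_name, best_vec = max(rels_b.items(), key=lambda kv: cosine(vec_a, kv[1]))
    print(name_a, "->", best_name, round(cosine(vec_a, best_vec), 3))
```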
Don't Patronize Me! An Annotated Dataset with Patronizing and Condescending Language towards Vulnerable Communities
In this paper, we introduce a new annotated dataset which is aimed at
supporting the development of NLP models to identify and categorize language
that is patronizing or condescending towards vulnerable communities (e.g.
refugees, homeless people, poor families). While the prevalence of such
language in the general media has long been shown to have harmful effects, it
differs from other types of harmful language, in that it is generally used
unconsciously and with good intentions. We furthermore believe that the often
subtle nature of patronizing and condescending language (PCL) presents an
interesting technical challenge for the NLP community. Our analysis of the
proposed dataset shows that identifying PCL is hard for standard NLP models,
with language models such as BERT achieving the best results.
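A minimal sketch of the kind of BERT baseline mentioned above, using Hugging Face transformers; the example texts and labels are invented and do not come from the released dataset.

```python
# Toy BERT binary classifier for PCL detection (illustrative only).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

texts = [
    "These poor families simply need our guidance.",  # invented positive example
    "Rents rose five percent last year.",              # invented negative example
]
labels = torch.tensor([1, 0])  # 1 = patronizing/condescending, 0 = not

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # an optimizer step would follow in real fine-tuning
print(outputs.logits.argmax(dim=-1))
```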