12 research outputs found
Data-efficient methods for information extraction
Structured knowledge representation systems such as knowledge bases or knowledge graphs provide insights into real-world entities and the relationships among them. Such systems can be employed in various natural language processing applications, such as semantic search, question answering and text summarization. Populating these knowledge representation systems manually is infeasible and inefficient. In this work, we develop methods to automatically extract named entities and the relationships among them from plain text. Our methods can therefore be used either to complete existing, incomplete knowledge representation systems or to create a new structured knowledge representation system from scratch. Unlike mainstream supervised methods for information extraction, our methods focus on the low-data scenario and do not require a large amount of annotated data.
In the first part of the thesis, we focused on the problem of named entity recognition. We participated in the Bacteria Biotope 2019 shared task, which consists of recognizing and normalizing biomedical entity mentions. Our linguistically informed named entity recognition system consists of a deep-learning-based model that can extract both nested and flat entities; the model employs several linguistic features and auxiliary training objectives to enable efficient learning in data-scarce scenarios. Our entity normalization system employs string matching, fuzzy search and semantic search to link the extracted named entities to biomedical databases. Our named entity recognition and entity normalization system achieved the lowest slot error rate, 0.715, and ranked first in the shared task. We also participated in two further shared tasks, Adverse Drug Effect Span Detection (English) and Profession Span Detection (Spanish); both tasks use data collected from the social media platform Twitter. We developed a named entity recognition model that improves its input representation by stacking heterogeneous embeddings from diverse domains; our empirical results demonstrate complementary learning from these heterogeneous embeddings. Our submission ranked third in both shared tasks.
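The cascade of string matching followed by fuzzy search can be illustrated with a minimal Python sketch. It is not the system described in the abstract: the vocabulary, the identifiers, the `normalize` function and the 0.8 similarity cutoff are all hypothetical, the fuzzy stage uses the standard-library `difflib` module as a stand-in, and the semantic-search stage is omitted.

```python
import difflib

# Hypothetical toy vocabulary: surface form -> identifier. The names and
# identifiers are invented for illustration and come from no real database.
VOCABULARY = {
    "escherichia coli": "ID:0001",
    "bacillus subtilis": "ID:0002",
    "human gut": "ID:0003",
}


def normalize(mention, cutoff=0.8):
    """Link a mention to a vocabulary identifier, or return None."""
    key = mention.strip().lower()
    # Stage 1: exact string match on the lower-cased mention.
    if key in VOCABULARY:
        return VOCABULARY[key]
    # Stage 2: fuzzy search; difflib ranks candidates by similarity ratio
    # and keeps only those at or above the (assumed) cutoff.
    candidates = difflib.get_close_matches(key, list(VOCABULARY), n=1, cutoff=cutoff)
    if candidates:
        return VOCABULARY[candidates[0]]
    # Stage 3 (semantic search over embeddings) is omitted in this sketch.
    return None
```

With this toy vocabulary, a misspelled mention such as "Eschericia coli" falls through the exact match and is recovered by the fuzzy stage, while an unrelated mention such as "soil" is left unlinked.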
In the second part of the thesis, we explored synthetic data augmentation strategies to address low-resource information extraction in specialized domains. Specifically, we adapted backtranslation to the token-level task of named entity recognition and the sentence-level task of relation extraction. We demonstrate that backtranslation can generate linguistically diverse and grammatically coherent synthetic sentences and can serve as a competitive augmentation strategy for both named entity recognition and relation extraction.
In most real-world relation extraction tasks, annotated data is not available; quite often, however, a large unannotated text corpus is. Bootstrapping methods for relation extraction can operate on such a corpus because they require only a handful of seed instances. However, bootstrapping methods tend to accumulate noise over time (known as semantic drift), and this phenomenon has a drastic negative impact on the final precision of the extractions. We develop two methods that constrain the bootstrapping process to minimise semantic drift in relation extraction; our methods leverage graph theory and pre-trained language models to explicitly identify and remove noisy extraction patterns. We report experimental results on the TACRED dataset for four relations.
In the last part of the thesis, we demonstrate the application of domain adaptation to the challenging task of multilingual acronym extraction. Our experiments demonstrate that domain adaptation can improve acronym extraction within the scientific and legal domains in six languages, including low-resource languages such as Persian and Vietnamese.
Proceedings of the XXIV Workshop de Investigadores en Ciencias de la Computación: WICC 2022
Compilation of the papers presented at the XXIV Workshop de Investigadores en Ciencias de la Computación (WICC), held in Mendoza in April 2022. Red de Universidades con Carreras en Informática
Federated knowledge base debugging in DL-Lite A
Due to the continuously growing amount of data, the federation of different and distributed data sources has gained increasing attention. A variety of approaches have been proposed to tackle the challenge of federating heterogeneous sources. Especially in the context of the Semantic Web, the application of Description Logics is one of the preferred methods for modelling federated knowledge based on a well-defined syntax and semantics. However, the more data are available from heterogeneous sources, the higher the risk of inconsistency, which is a serious obstacle to performing reasoning tasks and query answering over a federated knowledge base. For a single knowledge base, the process of knowledge base debugging, which comprises the identification and resolution of conflicting statements, has been widely studied, while federated settings that integrate a network of loosely coupled data sources (such as LOD sources) have mostly been neglected.
In this thesis, we tackle the challenging problem of debugging federated knowledge bases and focus on a lightweight Description Logic language called DL-LiteA, which is aimed at applications requiring efficient and scalable reasoning. After introducing formal foundations such as Description Logics and Semantic Web technologies, we clarify the motivating context of this work and discuss the general problem of information integration based on Description Logics.
The main part of this thesis is subdivided into three subjects. First, we discuss the specific characteristics of federated knowledge bases and provide an appropriate approach for detecting and explaining contradictory statements in a federated DL-LiteA knowledge base. Second, we study the representation of the identified conflicts and their relationships as a conflict graph and propose an approach for repair generation based on majority voting and statistical evidence. Third, to provide an alternative way of handling inconsistency in federated DL-LiteA knowledge bases, we propose an automated approach for assessing adequate trust values (i.e., probabilities) at different levels of granularity by leveraging probabilistic inference over a graphical model.
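The idea of generating a repair from a conflict graph by majority voting can be illustrated with a small Python sketch. It is a deliberately simplified stand-in, not the algorithm developed in the thesis: it assumes that every conflict is a pair of statements, that a vote is simply one source asserting a statement, and that ties stay unresolved. The function name, the statements and the source names are hypothetical.

```python
def repair_by_majority(support, conflicts):
    """Return (removed, unresolved) for a set of pairwise conflicts.

    support:   dict mapping a statement to the set of sources asserting it
    conflicts: list of (statement_a, statement_b) pairs, i.e. the edges of
               a conflict graph
    """
    removed = set()
    unresolved = []
    # Handle the clearest cases first: largest difference in votes.
    ordered = sorted(
        conflicts,
        key=lambda edge: -abs(len(support[edge[0]]) - len(support[edge[1]])),
    )
    for a, b in ordered:
        if a in removed or b in removed:
            continue  # this conflict vanished with an earlier removal
        votes_a, votes_b = len(support[a]), len(support[b])
        if votes_a == votes_b:
            unresolved.append((a, b))  # no majority, leave for other evidence
        elif votes_a < votes_b:
            removed.add(a)
        else:
            removed.add(b)
    return removed, unresolved


# Hypothetical federated assertions and the sources that state them.
support = {
    "Person(x)": {"s1", "s2", "s3"},
    "Organization(x)": {"s4"},
    "bornIn(x, Berlin)": {"s1"},
    "bornIn(x, Paris)": {"s2"},
}
conflicts = [
    ("bornIn(x, Berlin)", "bornIn(x, Paris)"),
    ("Person(x)", "Organization(x)"),
]
removed, unresolved = repair_by_majority(support, conflicts)
```

In this toy input, `Organization(x)` is outvoted three sources to one and is removed, while the two `bornIn` statements tie and remain unresolved. Ties of this kind are where additional evidence, such as the statistical evidence mentioned above, would be needed.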
In the last part of this thesis, we evaluate the previously developed algorithms against a set of large distributed LOD sources. The experimental results show that the proposed approaches are sufficient, efficient and scalable with respect to real-world scenarios. Moreover, because our algorithms exploit the federated structure, it becomes apparent that the number of identified wrong statements, the quality of the generated repair and the fineness of the assessed trust values all profit from an increasing number of integrated sources.
Scalable software platform for searching chemical and biological repositories
This dissertation presents SpecINT (Spectral Integration), a scalable software platform that draws on the Semantic Web and on results from spectral graph theory to integrate data and explore semantic-based repositories. To save time and resources, institutions need a comprehensive overview of the relevant and most recently published data worldwide. According to statistics, only one substance in thousands satisfies the criteria of preclinical and clinical tests and can be used as a medicament. Testing the effects of a large number of substances under laboratory conditions requires a lot of time and resources, which is why the application of in silico models is necessary. The methodology applied to achieve this goal is based on the coordinates of graph eigenvectors, which are used to join sub-queries automatically in a federated SPARQL query, so that only the most relevant data sources within the repositories are taken into consideration. This approach reduces the number of duplicates in the results while still providing useful results for researchers. In this way, the repositories can be integrated without a common ontology between them, giving the impression of a single searchable, virtual central store. The platform was developed in collaboration with the Laboratory for Cell and Molecular Biology of the Faculty of Science, University of Kragujevac. However, the methodology can be applied more broadly, since it is based on "Open Data" standards and concepts.
Data quality mining: application of data mining techniques for the evaluation of data quality
Ensuring the quality of the data one works with is crucial for making sound, effective and timely decisions. Achieving good data quality means not only working with error-free data, but also covers characteristics such as completeness (having as much data as possible), currency (the data being as up to date as possible), usability (the data being adequate and understandable), and availability (the data being accessible when needed), among many others. Data mining, on the other hand, makes it possible to discover information hidden in the data using a paradigm that is the inverse of the usual one: whereas one normally starts by posing a hypothesis and then tries to confirm it, data mining proposes to identify, in an automated way, patterns that may be interesting and that analysts may not have imagined. Although both areas are highly relevant in today's academic and industrial world, where computing provides appropriate technological support, the existing literature and some practical experience show that there is little or no integration between data quality and data mining. In general, work in one area tends to be unaware of work in the other.
This work presents an in-depth study of the two areas introduced above, then analyzes the mechanisms that would allow them to be linked, and finally implements techniques for analyzing the quality of data sets by exploiting the inherent capabilities of data mining. The work presents two new proposals for applying data mining techniques to the evaluation of data quality, both of which were presented at specialized international events. One addresses determining whether a data set is sufficiently up to date, and the other addresses the analysis of missing data. A third proposal, still being formulated, is also presented for evaluating how usable a data set is based on its characteristics. Keywords: data quality, data mining, data quality mining.
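The completeness dimension mentioned above can be made concrete with a short Python sketch. It only measures the fraction of non-missing values and is not one of the data mining techniques proposed in the work. The `completeness` function, the rule that `None` and the empty string count as missing, and the sample records are assumptions made for illustration.

```python
def completeness(records, fields):
    """Fraction of non-missing cells over the given fields (1.0 = complete)."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0  # nothing to check, treated here as trivially complete
    filled = sum(
        1
        for record in records
        for field in fields
        # Assumption: None, an empty string or an absent key counts as missing.
        if record.get(field) not in (None, "")
    )
    return filled / total


# Hypothetical records with two missing cells out of six.
records = [
    {"name": "Ana", "city": "Mendoza"},
    {"name": "Luis", "city": None},
    {"name": "", "city": "La Plata"},
]
score = completeness(records, ["name", "city"])
```

Here four of the six cells are filled, so `score` is 4/6. A per-field variant of the same ratio would show which attributes contribute most to the missing data.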
Ontology Network for Social Network Analysis in a Knowledge Management Context
Organizational knowledge is one of the most valuable assets that companies own today. For several decades, organizations have been developing strategies to manage knowledge, with particular emphasis on tacit knowledge discovery. The particular dynamics of the evolution and transfer of tacit knowledge are closely tied to the relations between people. For this reason, Social Network Analysis (SNA) can be a powerful tool to support a Knowledge Management (KM) initiative. Although the usefulness of SNA techniques within KM processes is recognized, the initial problem of data collection and representation remains (a problem shared by both initiatives). The aim of this paper is to analyze the usefulness of an ontology network for obtaining the knowledge structure needed to feed the proposed SNA-KM integration architecture. Sociedad Argentina de Informática e Investigación Operativa (SADIO)
WICC 2016: XVIII Workshop de Investigadores en Ciencias de la Computación
Proceedings of the XVIII Workshop de Investigadores en Ciencias de la Computación (WICC 2016), held at the Universidad Nacional de Entre Ríos on 14 and 15 April 2016. Red de Universidades con Carreras en Informática (RedUNCI)