
    Documenting Knowledge Graph Embedding and Link Prediction using Knowledge Graphs

    In recent years, sub-symbolic learning, i.e., Knowledge Graph Embedding (KGE) applied to Knowledge Graphs (KGs), has gained significant attention in various downstream tasks (e.g., Link Prediction (LP)). These techniques learn a latent vector representation of a KG's semantic structure to infer missing links. Nonetheless, KGE models remain black boxes, and the decision-making process behind them is not clear. Thus, the trustworthiness and reliability of the models' outcomes have been challenged. While many state-of-the-art approaches provide data-driven frameworks to address these issues, they do not always provide a complete understanding, and their interpretations are not machine-readable. In this work, we therefore extend a hybrid interpretable framework, InterpretME, to the field of KGE models, in particular the translation distance models TransE, TransH, TransR, and TransD. The experimental evaluation on various benchmark KGs supports the validity of this approach, which we term Trace KGE. In particular, Trace KGE contributes to increased interpretability and understanding of the otherwise perplexing behavior of KGE models.
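    As a hedged illustration of the translation distance family named above (TransE, TransH, TransR, TransD), the following minimal sketch shows the TransE scoring idea; the toy embeddings and dimension are invented, and this is not the authors' Trace KGE code.

    import numpy as np

    rng = np.random.default_rng(0)
    dim = 50  # illustrative embedding size

    # Toy embeddings; in a real KGE model these are learned from a benchmark KG.
    head = rng.normal(size=dim)
    relation = rng.normal(size=dim)
    tail = rng.normal(size=dim)

    def transe_score(h, r, t, norm=1):
        # Lower is better: for a plausible triple, h + r should lie close to t.
        return np.linalg.norm(h + r - t, ord=norm)

    print(transe_score(head, relation, tail))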

    Location Reference Recognition from Texts: A Survey and Comparison

    A vast amount of location information exists in unstructured texts, such as social media posts, news stories, scientific articles, web pages, travel blogs, and historical archives. Geoparsing refers to recognizing location references from texts and identifying their geospatial representations. While geoparsing can benefit many domains, a summary of its specific applications is still missing. Further, there is a lack of a comprehensive review and comparison of existing approaches for location reference recognition, which is the first and core step of geoparsing. To fill these research gaps, this review first summarizes seven typical application domains of geoparsing: geographic information retrieval, disaster management, disease surveillance, traffic management, spatial humanities, tourism management, and crime management. We then review existing approaches for location reference recognition by categorizing these approaches into four groups based on their underlying functional principle: rule-based, gazetteer matching-based, statistical learning-based, and hybrid approaches. Next, we thoroughly evaluate the correctness and computational efficiency of the 27 most widely used approaches for location reference recognition based on 26 public datasets with different types of texts (e.g., social media posts and news stories) containing 39,736 location references worldwide. Results from this thorough evaluation can help inform future methodological developments and can help guide the selection of proper approaches based on application needs.
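    To make the gazetteer matching-based category concrete, here is a minimal, hedged sketch of that family of approaches; the gazetteer entries, coordinates, and matching rule are illustrative and do not reproduce any specific surveyed system.

    import re

    # Toy gazetteer mapping place names to (latitude, longitude).
    gazetteer = {
        "new york": (40.7128, -74.0060),
        "paris": (48.8566, 2.3522),
        "cape town": (-33.9249, 18.4241),
    }

    def recognize_locations(text):
        # Return (mention, coordinates) pairs found by case-insensitive lookup.
        hits = []
        lowered = text.lower()
        for name, coords in gazetteer.items():
            for match in re.finditer(r"\b" + re.escape(name) + r"\b", lowered):
                hits.append((text[match.start():match.end()], coords))
        return hits

    print(recognize_locations("Flights from New York to Cape Town were delayed."))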

    Unifying context with labeled property graph: A pipeline-based system for comprehensive text representation in NLP

    Extracting valuable insights from vast amounts of unstructured digital text presents significant challenges across diverse domains. This research addresses this challenge by proposing a novel pipeline-based system that generates domain-agnostic and task-agnostic text representations. The proposed approach leverages labeled property graphs (LPG) to encode contextual information, facilitating the integration of diverse linguistic elements into a unified representation. The proposed system enables efficient graph-based querying and manipulation by addressing the crucial aspect of comprehensive context modeling and fine-grained semantics. The effectiveness of the proposed system is demonstrated through the implementation of NLP components that operate on LPG-based representations. Additionally, the proposed approach introduces specialized patterns and algorithms to enhance specific NLP tasks, including nominal mention detection, named entity disambiguation, event enrichments, event participant detection, and temporal link detection. The evaluation of the proposed approach, using the MEANTIME corpus comprising manually annotated documents, provides encouraging results and valuable insights into the system's strengths. The proposed pipeline-based framework serves as a solid foundation for future research, aiming to refine and optimize LPG-based graph structures to generate comprehensive and semantically rich text representations, addressing the challenges associated with efficient information extraction and analysis in NLP.
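    As a rough illustration of the labeled property graph (LPG) representation the abstract builds on, the sketch below encodes a short sentence with networkx, where nodes and edges carry labels and key-value properties; the node identifiers, labels, and properties are invented for illustration and do not reproduce the paper's schema.

    import networkx as nx

    g = nx.MultiDiGraph()

    # Token and mention nodes, each with a label and properties.
    g.add_node("t1", label="Token", text="Alice", pos="PROPN")
    g.add_node("t2", label="Token", text="visited", pos="VERB")
    g.add_node("t3", label="Token", text="Rome", pos="PROPN")
    g.add_node("m1", label="Mention", kind="PERSON")
    g.add_node("m2", label="Mention", kind="LOCATION")

    # Edges also carry labels, e.g., which tokens realize a mention.
    g.add_edge("m1", "t1", label="REALIZED_BY")
    g.add_edge("m2", "t3", label="REALIZED_BY")
    g.add_edge("t1", "t2", label="NEXT")
    g.add_edge("t2", "t3", label="NEXT")

    # Graph-based querying: list all mentions with their surface tokens.
    for m, t, data in g.edges(data=True):
        if data.get("label") == "REALIZED_BY":
            print(g.nodes[m]["kind"], "->", g.nodes[t]["text"])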

    Zemi: Learning Zero-Shot Semi-Parametric Language Models from Multiple Tasks

    Although large language models have achieved impressive zero-shot ability, their huge model size generally incurs high cost. Recently, semi-parametric language models, which augment a smaller language model with an external retriever, have demonstrated promising language modeling capabilities. However, it remains unclear whether such semi-parametric language models can perform as competitively as their fully-parametric counterparts in zero-shot generalization to downstream tasks. In this work, we introduce Zemi, a zero-shot semi-parametric language model. To the best of our knowledge, this is the first semi-parametric language model that demonstrates strong zero-shot performance on a wide range of held-out unseen tasks. We train Zemi with a novel semi-parametric multitask prompted training paradigm, which shows significant improvement over the parametric multitask training proposed by T0. Specifically, we augment the multitask training and zero-shot evaluation with retrieval from a large-scale task-agnostic unlabeled corpus. In order to incorporate multiple potentially noisy retrieved augmentations, we further propose a novel augmentation fusion module leveraging a perceiver resampler and gated cross-attention. Notably, our proposed Zemi-LARGE outperforms T0-3B by 16% on all seven evaluation tasks while being 3.9x smaller in model size. Comment: Accepted as a conference paper at Findings of ACL 202
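    The augmentation fusion module is described as combining a perceiver resampler with gated cross-attention; the hedged sketch below shows only the gated cross-attention part in PyTorch, with invented dimensions, and is not the released Zemi code.

    import torch
    import torch.nn as nn

    class GatedCrossAttention(nn.Module):
        def __init__(self, d_model=256, n_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            # Gate starts at zero, so the layer initially passes hidden states through unchanged.
            self.gate = nn.Parameter(torch.zeros(1))

        def forward(self, hidden, retrieved):
            # hidden: (batch, seq, d_model); retrieved: (batch, n_augmentations, d_model)
            attended, _ = self.attn(hidden, retrieved, retrieved)
            return hidden + torch.tanh(self.gate) * attended

    layer = GatedCrossAttention()
    hidden = torch.randn(2, 10, 256)
    retrieved = torch.randn(2, 5, 256)
    print(layer(hidden, retrieved).shape)  # torch.Size([2, 10, 256])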

    GitTables: A Large-Scale Corpus of Relational Tables

    The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus of 1M relational tables extracted from GitHub. Our continuing curation aims at growing the corpus to at least 10M tables. Analyses of GitTables show that its structure, content, and topical coverage differ significantly from existing table corpora. We annotate table columns in GitTables with semantic types, hierarchical relations and descriptions from Schema.org and DBpedia. The evaluation of our annotation pipeline on the T2Dv2 benchmark illustrates that our approach provides results on par with human annotations. We present three applications of GitTables, demonstrating its value for learned semantic type detection models, schema completion methods, and benchmarks for table-to-KG matching, data search, and preparation. We make the corpus and code available at https://gittables.github.io.
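    As a loose illustration of semantic column annotation of the kind applied to GitTables, the sketch below fuzzy-matches a column header against a handful of Schema.org-style type names; the type list, threshold, and matching method are assumptions for illustration, not the paper's annotation pipeline.

    import difflib

    # Illustrative subset of Schema.org-style semantic types.
    semantic_types = ["name", "address", "telephone", "birthDate", "price"]

    def annotate_column(header):
        # Case-insensitive fuzzy match of the header against the type names.
        lowered = {t.lower(): t for t in semantic_types}
        match = difflib.get_close_matches(header.lower(), list(lowered), n=1, cutoff=0.6)
        return lowered[match[0]] if match else None

    print(annotate_column("Birth_date"))  # birthDate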

    PEJL: A path-enhanced joint learning approach for knowledge graph completion

    Knowledge graphs (KGs) often suffer from incompleteness. Knowledge graph completion (KGC) is proposed to complete missing components in a KG. Most KGC methods focus on direct relations and fail to leverage the rich semantic information in multi-hop paths. In contrast, path-based embedding methods can capture path information and utilize extra semantics to improve KGC. However, most path-based methods cannot take advantage of full multi-hop information and neglect to capture multiple semantic associations between single and multi-hop triples. To bridge the gap, we propose a novel path-enhanced joint learning approach called PEJL for KGC. Rather than learning multi-hop representations, PEJL can recover multi-hop embeddings by encoding full multi-hop components. Meanwhile, PEJL extends the definition of translation energy functions and generates new semantic representations for each multi-hop component, which is rarely considered in path-based methods. Specifically, we first use the path constraint resource allocation (PCRA) algorithm to extract multi-hop triples. Then we use an embedding recovering module consisting of a bidirectional gated recurrent unit (GRU) layer and a fully connected layer to obtain multi-hop embeddings. Next, we employ a KG modeling module to leverage various kinds of semantic information and model the whole knowledge graph based on translation methods. Finally, we define a joint learning approach to train our proposed PEJL. We evaluate our model on two KGC datasets: FB15K-237 and NELL-995. Experiments show the effectiveness and superiority of PEJL.
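    A hedged sketch of the kind of embedding-recovering module described above: a bidirectional GRU reads the embeddings of the relations along a multi-hop path, a linear layer maps the summary back to a single relation-sized vector, and the result can be scored with a TransE-style energy. Dimensions and tensors are invented; this is not the authors' PEJL implementation.

    import torch
    import torch.nn as nn

    dim = 100  # illustrative embedding size

    class PathEncoder(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.gru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * dim, dim)

        def forward(self, path_rel_embs):
            # path_rel_embs: (batch, path_length, dim) relation embeddings along a path.
            out, _ = self.gru(path_rel_embs)
            return self.proj(out[:, -1, :])  # one composite relation vector per path

    encoder = PathEncoder(dim)
    head, tail = torch.randn(1, dim), torch.randn(1, dim)
    path = torch.randn(1, 3, dim)  # a toy 3-hop relation path
    r_path = encoder(path)
    energy = torch.norm(head + r_path - tail, p=1)  # TransE-style plausibility score
    print(energy.item())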

    Less is More: Restricted Representations for Better Interpretability and Generalizability

    Deep neural networks are prevalent in supervised learning for a wide range of tasks such as image classification, machine translation, and even scientific discovery. Their success often comes at the sacrifice of interpretability and generalizability. The increasing complexity of models and the involvement of the pre-training process make this lack of explainability ever more pressing. Outstanding performance when labeled data are abundant, combined with a tendency to overfit when labeled data are limited, demonstrates how difficult it is for deep neural networks to generalize across datasets. This thesis aims to improve interpretability and generalizability by restricting representations. We approach interpretability through attribution analysis, to understand which features contribute to BERT's predictions, and generalizability through methods that are effective in a low-data regime. We consider two strategies for restricting representations: (1) adding a bottleneck, and (2) introducing compression. Given input x, suppose we want to learn y with the latent representation z (i.e., x → z → y); adding a bottleneck means adding a function R such that L(R(z)) < L(z), and introducing compression means adding a function R such that L(R(y)) < L(y), where L refers to the number of bits. In other words, the restriction is added either in the middle of the pipeline or at its end. We first introduce how adding an information bottleneck can help attribution analysis and apply it to investigate BERT's behavior on text classification in Chapter 3. We then extend this attribution method to analyze passage reranking in Chapter 4, where we conduct a detailed analysis of cross-layer and cross-passage behavior. Adding a bottleneck not only provides insight into deep neural networks but can also be used to increase generalizability. In Chapter 5, we demonstrate the equivalence between adding a bottleneck and performing neural compression. We then leverage this finding in a framework called Non-Parametric learning by Compression with Latent Variables (NPC-LV) and show how optimized neural compressors can be used for non-parametric image classification with few labeled data. To further investigate how compression alone helps non-parametric learning without latent variables (NPC), we carry out experiments with the universal compressor gzip on text classification in Chapter 6. In Chapter 7, we elucidate methods that adopt the perspective of compression without the actual process of compression, using T5. Using experimental results in passage reranking, we show that our method is highly effective in a low-data regime where only one thousand query-passage pairs are available. Beyond the weakly supervised scenario, we also extend our method to large language models like GPT under almost no supervision, in one-shot and zero-shot settings. The experiments show that, without extra parameters or in-context learning, GPT can be used for semantic similarity, text classification, and text ranking and can outperform strong baselines, as presented in Chapter 8. The thesis proposes to tackle two big challenges in machine learning, "interpretability" and "generalizability", through restricted representations. We provide both theoretical derivations and empirical results to show the effectiveness of information-theoretic approaches. We not only design new algorithms but also provide numerous insights into why and how "compression" is so important for understanding deep neural networks and improving generalizability.
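    The gzip experiments mentioned for Chapter 6 follow the non-parametric-learning-by-compression idea; a minimal sketch of that idea, normalized compression distance plus nearest-neighbour labeling, is given below with an invented toy dataset.

    import gzip

    # Toy labeled texts; real experiments use standard text classification datasets.
    train = [("the team won the football match", "sports"),
             ("parliament passed the new budget", "politics"),
             ("the striker scored twice", "sports")]

    def clen(s):
        # Compressed length in bytes, used as a proxy for information content.
        return len(gzip.compress(s.encode("utf-8")))

    def ncd(a, b):
        # Normalized compression distance between two strings.
        ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
        return (cab - min(ca, cb)) / max(ca, cb)

    def classify(text):
        # Label the text with the class of its nearest training example under NCD.
        return min(train, key=lambda pair: ncd(text, pair[0]))[1]

    print(classify("the goalkeeper saved a penalty"))  # label of the nearest training text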

    Defining Safe Training Datasets for Machine Learning Models Using Ontologies

    Machine Learning (ML) models have been gaining popularity in recent years in a wide variety of domains, including safety-critical domains. While ML models have shown high accuracy in their predictions, they are still considered black boxes, meaning that developers and users do not know how the models make their decisions. While this is simply a nuisance in some domains, in safety-critical domains it makes ML models difficult to trust. To fully utilize ML models in safety-critical domains, there needs to be a method to improve trust in their safety and accuracy without human experts checking each decision. This research proposes a method to increase trust in ML models used in safety-critical domains by ensuring the safety and completeness of the model’s training dataset. Since most of the complexity of the model is built through training, ensuring the safety of the training dataset could help to increase trust in the safety of the model. The method proposed in this research uses a domain ontology and an image quality characteristic ontology to validate the domain completeness and image quality robustness of a training dataset. This research also presents an experiment as a proof of concept for this method, in which ontologies are built for the emergency road vehicle domain.
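    A very small, hedged sketch of the domain-completeness check described above: every concept required by the domain ontology must be represented in the training dataset's labels. The concept names below are invented stand-ins, not taken from the paper's emergency road vehicle ontology.

    # Concepts the domain ontology says the dataset must cover (illustrative).
    required_concepts = {"ambulance", "fire_truck", "police_car"}
    # Labels actually present in the training dataset (illustrative).
    dataset_labels = {"ambulance", "police_car"}

    missing = required_concepts - dataset_labels
    if missing:
        print("Dataset is incomplete; missing concepts:", sorted(missing))
    else:
        print("Dataset covers all required domain concepts.")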

    Evaluating automated and hybrid neural disambiguation for African historical named entities

    Documents detailing South African history contain ambiguous names. Ambiguous names may be due to people having the same name or the same person being referred to by multiple different names. Thus, when searching for or attempting to extract information about a particular person, the name used may affect the results. This problem may be alleviated by using a Named Entity Disambiguation (NED) system to disambiguate names by linking them to a knowledge base. In recent years, transformer-based language models have led to improvements in NED systems. Furthermore, multilingual language models have shown the ability to learn concepts across languages, reducing the amount of training data required in low-resource languages. A multilingual language model-based NED system was therefore developed to disambiguate people's names within a historical South African context using documents written in English and isiZulu from the 500 Year Archive (FHYA). The multilingual language model-based system substantially improved on a probability-based baseline and achieved a micro F1-score of 0.726, while the entity linking component was able to link 81.9% of the mentions to the correct entity. However, the system's performance on documents written in isiZulu was significantly lower than on documents written in English. The system was therefore augmented with handcrafted rules to improve its performance. The addition of handcrafted rules resulted in a small but significant improvement in performance when compared to the unaugmented NED system.
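    As a hedged sketch of the linking step in such a NED system, the snippet below compares a mention in context against knowledge-base entity descriptions and links it to the closest one. The real system uses a multilingual transformer encoder; TF-IDF similarity stands in here so the sketch runs without model downloads, and the entity identifiers and descriptions are invented for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Illustrative knowledge-base entries: entity identifier -> short description.
    knowledge_base = {
        "Person/ShakaKaSenzangakhona": "Zulu king, founder of the Zulu kingdom",
        "Person/ShakaRadebe": "twentieth-century teacher and writer",
    }

    mention_context = "Shaka led his regiments and expanded the kingdom"

    vectorizer = TfidfVectorizer().fit(list(knowledge_base.values()) + [mention_context])
    entity_vecs = vectorizer.transform(knowledge_base.values())
    mention_vec = vectorizer.transform([mention_context])

    # Link the mention to the entity whose description is most similar.
    scores = cosine_similarity(mention_vec, entity_vecs)[0]
    best = max(zip(knowledge_base, scores), key=lambda pair: pair[1])
    print(best)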

    Hierarchical Classification of Design Decisions using pre-trained Language Models

    Software architecture documentation (SAD) is an integral artifact of a software project. SAD contributes to the continued success of a software project by ensuring a shared understanding of the software architecture, documenting important design decisions, and preventing erosion of the software. To improve the quality of SADs and support downstream tasks, automatic classification of these design decisions is desirable. In this work, we implement and evaluate an approach for the automatic identification and classification of design decisions based on a fine-grained taxonomy, combining a hierarchical classification strategy with transfer learning through pre-trained language models. The contribution of this work is to investigate the benefit of a hierarchical classification strategy for the automatic classification of design decisions compared to a non-hierarchical approach. We also investigate and compare the effectiveness of the pre-trained language models RoBERTa, XLNet, BERTOverflow, and GPT-3 for this task. In our evaluation, the approaches using pre-trained language models generally performed better than the baseline approaches. However, we could not establish a clear advantage of the hierarchical approaches over the non-hierarchical approaches in combination with the language models. Finally, the results of this work are limited by the size and imbalance of our dataset and therefore call for further research with a larger dataset.
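    A hedged sketch of the hierarchical classification strategy investigated in the thesis: a first classifier decides whether a sentence contains a design decision at all, and a second classifier then assigns the fine-grained class. The TF-IDF and logistic-regression components below stand in for the fine-tuned pre-trained language models, and all sentences and labels are invented for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    level1 = make_pipeline(TfidfVectorizer(), LogisticRegression())
    level2 = make_pipeline(TfidfVectorizer(), LogisticRegression())

    # Level 1: design decision vs. other documentation text.
    level1.fit(["we chose PostgreSQL as the database",
                "the meeting starts at ten",
                "the service exposes a REST interface",
                "lunch was served in the cafeteria"],
               ["decision", "other", "decision", "other"])

    # Level 2: fine-grained class, trained only on sentences labeled as decisions.
    level2.fit(["we chose PostgreSQL as the database",
                "the service exposes a REST interface"],
               ["technology", "structural"])

    sentence = "we selected MariaDB for persistence"
    top = level1.predict([sentence])[0]
    print("level 1:", top)
    if top == "decision":
        print("level 2:", level2.predict([sentence])[0])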