
    Short Text Categorization using World Knowledge

    The content of the World Wide Web is growing drastically, and thus the amount of available online text data increases every day. Today, many users contribute to this massive global network via online platforms by sharing information in the form of short texts. Such an immense amount of data covers subjects from all existing domains (e.g., Sports, Economy, Biology, etc.), and manually processing it is beyond human capabilities. As a result, Natural Language Processing (NLP) tasks, which aim to automatically analyze and process natural language documents, have gained significant attention. Among these tasks, due to its applications in various domains, text categorization has become one of the most fundamental and crucial ones. However, standard text categorization models face major challenges when performing short text categorization, due to the unique characteristics of short texts: insufficient text length, sparsity, ambiguity, etc. In other words, conventional approaches deliver substandard performance when applied directly to the short text categorization task. Furthermore, in the case of short text, standard feature extraction techniques such as bag-of-words suffer from limited contextual information. Hence, it is essential to enrich the text representations with an external knowledge source. Moreover, traditional models require a significant amount of manually labeled data, and obtaining labeled data is a costly and time-consuming task. Therefore, although recently proposed supervised methods, especially deep neural network approaches, have demonstrated notable performance, the requirement for labeled data remains the main bottleneck of these approaches. In this thesis, we investigate the main research question of how to perform short text categorization effectively without requiring any labeled data, using knowledge bases as an external source. In this regard, two novel short text categorization models, Knowledge-Based Short Text Categorization (KBSTC) and Weakly Supervised Short Text Categorization using World Knowledge (WESSTEC), are introduced and evaluated in this thesis. These models do not require any hand-labeled data to perform short text categorization; instead, they leverage the semantic similarity between short texts and the predefined categories. To quantify this semantic similarity, low-dimensional representations of entities and categories are learned by exploiting a large knowledge base. To this end, a novel entity and category embedding model is also proposed in this thesis. Extensive experiments have been conducted to assess the performance of the proposed short text categorization models and the embedding model on several standard benchmark datasets.
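    The abstract does not spell out the scoring function, but the core idea of categorizing by embedding similarity can be sketched as follows. This is a minimal illustration, not the thesis's actual model: the toy embeddings are random, and the entity list is assumed to come from an entity linker run over the short text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretrained embeddings: name -> vector (toy random values here).
entity_vecs = {"FC Barcelona": rng.standard_normal(100),
               "Lionel Messi": rng.standard_normal(100)}
category_vecs = {"Sports": rng.standard_normal(100),
                 "Economy": rng.standard_normal(100)}

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def categorize(entities):
    """Pick the category whose embedding is closest (cosine) to the
    mean embedding of the entities detected in a short text."""
    vecs = [entity_vecs[e] for e in entities if e in entity_vecs]
    if not vecs:
        return None  # no linked entities, no prediction
    text_vec = np.mean(vecs, axis=0)
    return max(category_vecs, key=lambda c: cosine(text_vec, category_vecs[c]))

# Entities here would normally be produced by an entity linker.
print(categorize(["FC Barcelona", "Lionel Messi"]))
```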

    Automatic lexicon acquisition from encyclopedia.

    Lo, Ka Kan. Thesis (M.Phil.)--Chinese University of Hong Kong, 2007. Includes bibliographical references (leaves 97-104). Abstracts in English and Chinese.
    Chapter 1 --- Introduction --- p.1
    Chapter 1.1 --- Motivation --- p.3
    Chapter 1.2 --- New paradigm in language learning --- p.5
    Chapter 1.3 --- Semantic Relations --- p.7
    Chapter 1.4 --- Contribution of this thesis --- p.9
    Chapter 2 --- Related Work --- p.13
    Chapter 2.1 --- Theoretical Linguistics --- p.13
    Chapter 2.1.1 --- Overview --- p.13
    Chapter 2.1.2 --- Analysis --- p.15
    Chapter 2.2 --- Computational Linguistics - General Learning --- p.17
    Chapter 2.3 --- Computational Linguistics - HPSG Lexical Acquisition --- p.20
    Chapter 2.4 --- Learning approach --- p.22
    Chapter 3 --- Background --- p.25
    Chapter 3.1 --- Modeling primitives --- p.26
    Chapter 3.1.1 --- Feature Structure --- p.26
    Chapter 3.1.2 --- Word --- p.28
    Chapter 3.1.3 --- Phrase --- p.35
    Chapter 3.1.4 --- Clause --- p.36
    Chapter 3.2 --- Wikipedia Resource --- p.38
    Chapter 3.2.1 --- Encyclopedia Text --- p.40
    Chapter 3.3 --- Semantic Relations --- p.40
    Chapter 4 --- Learning Framework - Syntactic and Semantic --- p.46
    Chapter 4.1 --- Type feature scoring function --- p.48
    Chapter 4.2 --- Confidence score of lexical entry --- p.50
    Chapter 4.3 --- Specialization and Generalization --- p.52
    Chapter 4.3.1 --- Further Processing --- p.54
    Chapter 4.3.2 --- Algorithm Outline --- p.54
    Chapter 4.3.3 --- Algorithm Analysis --- p.55
    Chapter 4.4 --- Semantic Information --- p.57
    Chapter 4.4.1 --- Extraction --- p.58
    Chapter 4.4.2 --- Induction --- p.60
    Chapter 4.4.3 --- Generalization --- p.63
    Chapter 4.5 --- Extension with new text documents --- p.65
    Chapter 4.6 --- Integrating the syntactic and semantic acquisition framework --- p.65
    Chapter 5 --- Evaluation --- p.68
    Chapter 5.1 --- Evaluation Metric - English Resource Grammar --- p.68
    Chapter 5.1.1 --- English Resource Grammar --- p.69
    Chapter 5.2 --- Experiments --- p.71
    Chapter 5.2.1 --- Tasks --- p.71
    Chapter 5.2.2 --- Evaluation Measures --- p.77
    Chapter 5.2.3 --- Methodologies --- p.78
    Chapter 5.2.4 --- Corpus Preparation --- p.79
    Chapter 5.2.5 --- Results --- p.81
    Chapter 5.3 --- Result Analysis --- p.85
    Chapter 6 --- Conclusions --- p.95
    Bibliography --- p.97

    Knowledge Graphs Meet Multi-Modal Learning: A Comprehensive Survey

    Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the semantic web community's exploration into multi-modal dimensions unlocking new avenues for innovation. In this survey, we carefully review over 300 articles, focusing on KG-aware research in two principal aspects: KG-driven Multi-Modal (KG4MM) learning, where KGs support multi-modal tasks, and Multi-Modal Knowledge Graph (MM4KG), which extends KG studies into the MMKG realm. We begin by defining KGs and MMKGs, then explore their construction progress. Our review includes two primary task categories: KG-aware multi-modal learning tasks, such as Image Classification and Visual Question Answering, and intrinsic MMKG tasks like Multi-modal Knowledge Graph Completion and Entity Alignment, highlighting specific research trajectories. For most of these tasks, we provide definitions and evaluation benchmarks, and additionally outline essential insights for conducting relevant research. Finally, we discuss current challenges and identify emerging trends, such as progress in Large Language Modeling and Multi-modal Pre-training strategies. This survey aims to serve as a comprehensive reference for researchers already involved in or considering delving into KG and multi-modal learning research, offering insights into the evolving landscape of MMKG research and supporting future work.
    Comment: Ongoing work; 41 pages (Main Text), 55 pages (Total), 11 Tables, 13 Figures, 619 citations; paper list is available at https://github.com/zjukg/KG-MM-Survey
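    To make the KG/MMKG distinction concrete: an MMKG keeps the usual (head, relation, tail) triples of an ordinary KG but additionally attaches other modalities, such as images, to its entities. The sketch below is only an illustrative data model with invented entities and file names, not a structure defined by the survey.

```python
from dataclasses import dataclass, field

@dataclass
class MMEntity:
    """An MMKG entity: a symbolic node plus attached modalities."""
    name: str
    images: list[str] = field(default_factory=list)  # visual modality
    description: str = ""                            # textual modality

# Relational triples (head, relation, tail), as in an ordinary KG.
triples = [("Mona_Lisa", "created_by", "Leonardo_da_Vinci")]

# Multi-modal attributes attached to the entity nodes.
entities = {
    "Mona_Lisa": MMEntity("Mona_Lisa",
                          images=["mona_lisa.jpg"],
                          description="A 16th-century portrait painting."),
}
```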

    Towards the extraction of cross-sentence relations through event extraction and entity coreference

    Cross-sentence relation extraction deals with the extraction of relations beyond the sentence boundary. This thesis focuses on two NLP tasks that are important to the successful extraction of cross-sentence relation mentions: event extraction and coreference resolution. The first part of the thesis addresses data sparsity issues in event extraction. We propose a self-training approach for obtaining additional labeled examples for the task. The process starts with a Bi-LSTM event tagger trained on a small labeled data set, which is used to discover new event instances in a large collection of unstructured text. The high-confidence model predictions are selected to construct a data set of automatically labeled training examples. We present several ways in which the resulting data set can be used for re-training the event tagger in conjunction with the initial labeled data. The best configuration achieves a statistically significant improvement over the baseline on the ACE 2005 test set (macro-F1), as well as in a 10-fold cross-validation (micro- and macro-F1) evaluation. Our error analysis reveals that the augmentation approach is especially beneficial for the classification of the most under-represented event types in the original data set. The second part of the thesis focuses on the problem of coreference resolution. While a certain level of precision can be reached by modeling surface information about entity mentions, their successful resolution often depends on semantic or world knowledge. This thesis investigates an unsupervised source of such knowledge, namely distributed word representations. We present several ways in which word embeddings can be utilized to extract features for a supervised coreference resolver. Our evaluation results and error analysis show that each of these features helps improve over the baseline coreference system's performance, with a statistically significant improvement (CoNLL F1) achieved when the proposed features are used jointly. Moreover, all features lead to a reduction in the number of precision errors in resolving references between common nouns, demonstrating that they successfully incorporate semantic information into the process.
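    The self-training scheme summarized above follows a standard bootstrap pattern. A schematic version is sketched below; the `predict`/`fit` interface and the confidence threshold are assumptions for illustration, not the thesis's actual implementation.

```python
CONFIDENCE_THRESHOLD = 0.95  # hypothetical cutoff for "high confidence"

def self_train(tagger, labeled_data, unlabeled_texts, rounds=1):
    """Augment a small gold-labeled set with the tagger's own
    confident predictions, then re-train on the union."""
    for _ in range(rounds):
        auto_labeled = []
        for text in unlabeled_texts:
            prediction, confidence = tagger.predict(text)  # assumed interface
            if confidence >= CONFIDENCE_THRESHOLD:
                auto_labeled.append((text, prediction))
        # Re-train on gold plus automatically labeled examples.
        tagger.fit(labeled_data + auto_labeled)
    return tagger
```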

    On Generative Models and Joint Architectures for Document-level Relation Extraction

    Biomedical text is being generated at a high rate in scientific literature publications and electronic health records. Within these documents lies a wealth of potentially useful information in biomedicine. Relation extraction (RE), the process of automating the identification of structured relationships between entities within text, represents a highly sought-after goal in biomedical informatics, offering the potential to unlock deeper insights and connections from this vast corpus of data. In this dissertation, we tackle this problem with a variety of approaches. We review the recent history of the field of document-level RE. Several themes emerge. First, graph neural networks dominate the methods for constructing entity and relation representations. Second, clever uses of attention allow these constructions to focus on particularly relevant tokens and object representations (such as mentions and entities). Third, aggregation of signal across mentions in entity-level RE is a key focus of research. Fourth, the injection of additional signal, by adding tokens to the text prior to encoding via a language model (LM) or through additional learning tasks, boosts performance. Last, we explore an assortment of strategies for the challenging task of end-to-end entity-level RE. Of particular note are sequence-to-sequence (seq2seq) methods, which have become particularly popular in the past few years. With the success of general-domain generative LMs, biomedical NLP researchers have trained a variety of these models on biomedical text under the assumption that they would be superior for biomedical tasks. As training such models is computationally expensive, we investigate whether they outperform generic models. We test this assumption rigorously by comparing the performance of all major biomedical generative language models to the performance of their generic counterparts across multiple biomedical RE datasets, in the traditional finetuning setting as well as in the few-shot setting. Surprisingly, we found that biomedical models tended to underperform compared to their generic counterparts. However, we found that small-scale biomedical instruction finetuning improved performance to a similar degree as larger-scale generic instruction finetuning. Zero-shot natural language processing (NLP) offers savings on the expenses associated with annotating datasets and the specialized knowledge required for applying NLP methods. Large generative LMs trained to align with human objectives have demonstrated impressive zero-shot capabilities over a broad range of tasks. However, the effectiveness of these models in biomedical RE remains uncertain. To bridge this gap in understanding, we investigate how GPT-4 performs across several RE datasets. We experiment with the recent JSON generation features to generate structured output, which we use alternately by defining an explicit schema describing the relation structure and by inferring the structure from the prompt itself. Our work is the first to study zero-shot biomedical RE across a variety of datasets. Overall, performance was lower than that of fully finetuned methods. Recall suffered in examples with more than a few relations. Entity mention boundaries were a major source of error, which future work could fruitfully address. In our previous work with generative LMs, we noted that RE performance decreased with the number of gold relations in an example. This observation aligns with the general pattern that recurrent neural network and transformer-based model performance tends to decrease with sequence length. Generative LMs also do not identify textual mentions or group them into entities, which are valuable information extraction tasks unto themselves. Therefore, in this age of generative methods, we revisit non-seq2seq methodology for biomedical RE. We adopt a sequential framework of named entity recognition (NER), clustering mentions into entities, followed by relation classification (RC). As errors early in the pipeline necessarily cause downstream errors, and NER performance is near its ceiling, we focus on improving clustering. We match state-of-the-art (SOTA) performance in NER, and substantially improve mention clustering performance by incorporating dependency parsing and gating string dissimilarity embeddings. Overall, we advance the field of biomedical RE in a few ways. In our experiments with finetuned LMs, we show that biomedicine-specific models are unnecessary, freeing researchers to make use of SOTA generic LMs. The relatively high few-shot performance in these experiments also suggests that biomedical RE can be reasonably accessible, as it is not so difficult to construct small datasets. Our investigation into zero-shot RE shows that SOTA LMs can compete with fully finetuned smaller LMs. Together, these studies also demonstrate weaknesses of generative RE. Last, we show that non-generative RE methods still outperform generative methods in the fully finetuned setting.
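    The zero-shot experiments above mention constraining the model's output with an explicit JSON schema describing the relation structure. A minimal illustration of that idea follows; the prompt wording, schema fields, and the `call_llm` client are assumptions for illustration, not the dissertation's actual setup.

```python
import json

# An explicit schema describing the relation structure the model should emit.
RELATION_SCHEMA = {
    "type": "object",
    "properties": {
        "relations": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "head": {"type": "string"},      # e.g., a chemical mention
                    "relation": {"type": "string"},  # e.g., "induces"
                    "tail": {"type": "string"},      # e.g., a disease mention
                },
                "required": ["head", "relation", "tail"],
            },
        }
    },
    "required": ["relations"],
}

def build_prompt(passage: str) -> str:
    return ("Extract all biomedical relations from the passage below. "
            "Respond only with JSON matching this schema:\n"
            + json.dumps(RELATION_SCHEMA)
            + "\n\nPassage:\n" + passage)

# `call_llm` stands in for whatever chat-completion client is used; with
# JSON output enforced, the reply parses directly into (head, relation, tail) triples.
# reply = call_llm(build_prompt(passage))
# triples = json.loads(reply)["relations"]
```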

    Deep Neural Architectures for End-to-End Relation Extraction

    The rapid pace of scientific and technological advancement has led to a meteoric growth in knowledge, as evidenced by a sharp increase in the number of scholarly publications in recent years. PubMed, for example, archives more than 30 million biomedical articles across various domains and covers a wide range of topics including medicine, pharmacy, biology, and healthcare. Social media and digital journalism have similarly experienced their own accelerated growth in the age of big data. Hence, there is a compelling need for ways to organize and distill the vast, fragmented body of information (often unstructured, in the form of natural human language) so that it can be assimilated, reasoned about, and ultimately harnessed. Relation extraction is an important natural language task toward that end. In relation extraction, semantic relationships are extracted from natural human language in the form of (subject, object, predicate) triples such that subject and object are mentions of discrete concepts and predicate indicates the type of relation between them. The difficulty of relation extraction becomes clear when we consider the myriad ways the same relation can be expressed in natural language. Much of the current work in relation extraction assumes that entities are known at extraction time, thus treating entity recognition as an entirely separate and independent task. However, recent studies have shown that entity recognition and relation extraction, when modeled together as interdependent tasks, can lead to overall improvements in extraction accuracy. When modeled in such a manner, the task is referred to as end-to-end relation extraction. In this work, we present four studies that introduce incrementally sophisticated architectures designed to tackle the task of end-to-end relation extraction. In the first study, we present a pipeline approach for extracting protein-protein interactions as affected by particular mutations. The pipeline system makes use of recurrent neural networks for protein detection, lexicons for gene normalization, and convolutional neural networks for relation extraction. In the second study, we show that a multi-task learning framework, with parameter sharing, can achieve state-of-the-art results for drug-drug interaction extraction. At its core, the model uses graph convolutions, with a novel attention-gating mechanism, over dependency parse trees. In the third study, we present a more efficient and general-purpose end-to-end neural architecture designed around the idea of the table-filling paradigm: for an input sentence of length n, all entities and relations are extracted in a single pass of the network, in an indirect fashion, by populating the cells of a corresponding n by n table using metric-based features. We show that this approach excels in both the general English and biomedical domains, with extraction times that are up to an order of magnitude faster compared to the prior best. In the fourth and last study, we present an architecture for relation extraction that, in addition to being end-to-end, is able to handle cross-sentence and N-ary relations. Overall, our work contributes to the advancement of modern information extraction by exploring end-to-end solutions that are fast, accurate, and generalizable to many high-value domains.
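    The table-filling paradigm in the third study can be pictured as labeling an n by n grid over the sentence's tokens. The toy sketch below uses an invented label scheme and hand-filled cells purely to show the encoding; in the actual architecture each cell's label would be predicted from learned pairwise features.

```python
import numpy as np

tokens = ["Aspirin", "treats", "headaches"]
n = len(tokens)

# One cell per token pair; diagonal cells hold entity labels,
# off-diagonal cells hold relation labels between token pairs.
table = np.full((n, n), "O", dtype=object)

# Hand-filled for illustration only (a model would predict these).
table[0, 0] = "DRUG"     # "Aspirin" is a drug entity
table[2, 2] = "DISEASE"  # "headaches" is a disease entity
table[0, 2] = "TREATS"   # relation between token 0 and token 2

# Decode the non-empty cells back into entities and relations.
for i in range(n):
    for j in range(n):
        if table[i, j] != "O":
            print(tokens[i], "->", tokens[j], ":", table[i, j])
```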

    Automatic Question Generation to Support Reading Comprehension of Learners - Content Selection, Neural Question Generation, and Educational Evaluation

    Simply reading texts passively, without actively engaging with their content, is suboptimal for text comprehension, since learners may miss crucial concepts or misunderstand essential ideas. In contrast, engaging learners actively by asking questions fosters text comprehension. However, educational resources frequently lack questions: textbooks often contain only a few at the end of a chapter, and informal learning resources such as Wikipedia lack them entirely. Thus, in this thesis, we study to what extent questions about educational science texts can be generated automatically, tackling two research questions. The first concerns selecting learning-relevant passages to guide the generation process. The second investigates the generated questions' potential effects and applicability in reading comprehension scenarios. Our first contribution improves the understanding of neural question generation's quality in education. We find that the generators' high linguistic quality transfers to educational texts, but that they require guidance by educational content selection. In consequence, we study multiple educational context and answer selection mechanisms. In our second contribution, we propose novel context selection approaches which target question-worthy sentences in texts. In contrast to previous works, our context selectors are guided by educational theory. The proposed methods perform competitively with related work while operating with educationally motivated decision criteria that are easier for educational experts to understand. The third contribution addresses answer selection methods to guide neural question generation with expected answers. Our experiments highlight the need for educational corpora for the task: models trained on non-educational corpora do not transfer well to the educational domain. Given this discrepancy, we propose a novel corpus construction approach that automatically derives educational answer selection corpora from textbooks. We verify the approach's usefulness by showing that neural models trained on the constructed corpora learn to detect learning-relevant concepts. In our last contribution, we use the insights from the previous experiments to design, implement, and evaluate an automatic question generator for educational use. We evaluate the proposed generator intrinsically with an expert annotation study and extrinsically with an empirical reading comprehension study. The two evaluation scenarios provide a nuanced view of the generated questions' strengths and weaknesses. Expert annotations attribute an educational value to roughly 60% of the questions, but also reveal various ways in which the questions still fall short of the quality experts desire. Furthermore, the reader-based evaluation indicates that the proposed educational question generator improves learning outcomes compared to a no-question control group. In summary, the results of the thesis improve the understanding of content selection tasks in educational question generation and provide evidence that it can improve reading comprehension. As such, the proposed approaches are promising tools for authors and learners to promote active reading and thus foster text comprehension.
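    Answer-aware neural question generation of the kind described above is commonly implemented by conditioning a sequence-to-sequence LM on a passage with the selected answer span marked. A rough sketch using the Hugging Face transformers library follows; the checkpoint name and the <hl> highlighting convention are placeholders, not the thesis's actual system.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder checkpoint; any answer-aware seq2seq QG model would do.
MODEL_NAME = "some-org/t5-question-generation"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

passage = "Photosynthesis converts light energy into chemical energy."
answer = "chemical energy"  # chosen by an upstream answer-selection step

# A common convention: mark the selected answer span in the input text.
source = passage.replace(answer, f"<hl> {answer} <hl>")
inputs = tokenizer("generate question: " + source, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```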