10 research outputs found

    Semantic Pivoting Model for Effective Event Detection

    Full text link
    Event Detection, which aims to identify and classify mentions of event instances in unstructured articles, is an important task in Natural Language Processing (NLP). Existing techniques for event detection use only homogeneous one-hot vectors to represent the event type classes, ignoring the fact that the semantic meaning of the types is important to the task. Such an approach is inefficient and prone to overfitting. In this paper, we propose a Semantic Pivoting Model for Effective Event Detection (SPEED), which explicitly incorporates prior information during training and captures semantically meaningful correlations between the input and the event types. Experimental results show that our proposed model achieves state-of-the-art performance and outperforms the baselines in multiple settings without using any external resources. Comment: 11 pages, 4 figures; Accepted to ACIIDS 202
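
    The contrast this abstract draws is between one-hot type vectors and vectors that encode the meaning of the type names themselves. The following minimal sketch illustrates that general idea only (it is not the SPEED architecture; the label set and encoder choice are illustrative assumptions), scoring a mention against mean-pooled BERT encodings of the type names:

```python
# Illustrative sketch: represent event types by encoding their *names*
# rather than one-hot vectors, then score a mention by similarity.
# Not the SPEED model itself; labels and encoder are assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pooled encoder embeddings for a list of strings."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # (B, H)

event_types = ["attack", "meeting", "transport", "election"]  # hypothetical set
type_vecs = embed(event_types)          # semantic type vectors, not one-hot

mention = embed(["Rebels stormed the capital overnight."])
scores = mention @ type_vecs.T          # similarity of mention to each type
print(event_types[int(scores.argmax())])
```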

    Named entity recognition in chemical patents using ensemble of contextual language models

    Full text link
    Chemical patent documents describe a broad range of applications holding key reaction and compound information, such as chemical structure, reaction formulas, and molecular properties. These informational entities must first be identified in text passages before they can be utilized in downstream tasks. Text mining provides means to extract relevant information from chemical patents through information extraction techniques. As part of the Information Extraction task of the Cheminformatics Elsevier Melbourne University challenge, in this work we study the effectiveness of contextualized language models in extracting reaction information from chemical patents. We assess transformer architectures trained on generic and specialised corpora to propose a new ensemble model. Our best model, based on a majority ensemble approach, achieves an exact F1-score of 92.30% and a relaxed F1-score of 96.24%. The results show that an ensemble of contextualized language models can provide an effective method to extract information from chemical patents.
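
    The mechanism behind the best model here, a majority ensemble, is simple to sketch: each member model tags every token, and the most common label per token wins. The snippet below is a hedged illustration only; the member models and tag set are invented placeholders, not the paper's exact systems:

```python
# Token-level majority voting across several NER models' predictions.
# Member predictions below are invented placeholders for illustration.
from collections import Counter

def majority_ensemble(per_model_labels):
    """per_model_labels: one label sequence per model, all aligned
    to the same tokens. Returns the majority-voted sequence."""
    voted = []
    for token_labels in zip(*per_model_labels):
        label, _ = Counter(token_labels).most_common(1)[0]
        voted.append(label)
    return voted

predictions = [
    ["O", "B-REACTION", "I-REACTION", "O"],  # e.g. generic-corpus model
    ["O", "B-REACTION", "O",          "O"],  # e.g. specialised-corpus model
    ["O", "B-REACTION", "I-REACTION", "O"],  # e.g. a third member model
]
print(majority_ensemble(predictions))
# ['O', 'B-REACTION', 'I-REACTION', 'O']
```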

    A Two-Step Approach for Explainable Relation Extraction

    Get PDF
    Knowledge Graphs (KG) offer easy-to-process information. An important step in building a KG from texts is the Relation Extraction (RE) task, which identifies and labels relationships between entity mentions. In this paper, to address the RE problem, we propose to combine a deep learning approach for relation detection with a symbolic method for relation classification. This combination offers both the performance of deep learning methods and the interpretability of symbolic methods. The method has been evaluated and compared with state-of-the-art methods on TACRED, a relation extraction benchmark, and has shown interesting quantitative and qualitative results.
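
    A rough sketch can make the two-step division of labour concrete: a learned detector decides whether any relation holds, and human-readable rules decide which one, so the final label comes with an explanation. Both components below are toy stand-ins, not the paper's models or rules:

```python
# Two-step explainable RE: neural detection, then symbolic classification.
# Detector and rules are toy stand-ins for illustration only.
import re

def neural_detector(sentence, head, tail):
    """Placeholder for a deep relation-detection model; returns the
    probability that *some* relation links head and tail."""
    return 0.9 if head in sentence and tail in sentence else 0.1

SYMBOLIC_RULES = [  # interpretable rules: the matched rule explains the label
    (re.compile(r"\bfounded\b"), "org:founded_by"),
    (re.compile(r"\bborn in\b"), "per:city_of_birth"),
]

def extract(sentence, head, tail, threshold=0.5):
    if neural_detector(sentence, head, tail) < threshold:
        return "no_relation"
    for pattern, relation in SYMBOLIC_RULES:
        if pattern.search(sentence):
            return relation
    return "no_relation"

print(extract("Steve Jobs founded Apple.", "Steve Jobs", "Apple"))
# org:founded_by -- justified by the matched 'founded' rule
```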

    Extracting Relations in Texts with Concepts of Neighbours

    Get PDF
    During the last decade, the need for reliable and massive Knowledge Graphs (KG) has increased. KGs can be created in several ways: manually with forms, or automatically with Information Extraction (IE), a natural language processing task for extracting knowledge from text. Relation Extraction is the part of IE that focuses on identifying relations between named entities in texts, which amounts to finding new edges in a KG. Most recent approaches rely on deep learning, achieving state-of-the-art performance. However, this performance is still too low to fully automate the construction of reliable KGs, and human interaction remains necessary. That interaction is made difficult by the statistical nature of deep learning methods, which makes their predictions hard to interpret. In this paper, we present a new symbolic and interpretable approach for Relation Extraction in texts. It is based on a modeling of the lexical and syntactic structure of text as a knowledge graph, and it exploits Concepts of Neighbours, a method based on Graph-FCA for computing similarities in knowledge graphs. An evaluation has been performed on a subset of TACRED (a relation extraction benchmark), showing promising results.
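
    The first step described above, modelling text structure as a knowledge graph, can be sketched in a few lines: tokens become nodes and dependency edges become labelled triples. The Concepts of Neighbours computation itself (Graph-FCA) is omitted; this toy version assumes spaCy with its small English model installed:

```python
# Toy version of the text-to-graph modelling step: dependency edges
# become (head, label, child) triples of a small knowledge graph.
# The Concepts of Neighbours / Graph-FCA similarity is not shown.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def sentence_graph(text):
    """Return a sentence as (head, dependency_label, child) triples."""
    doc = nlp(text)
    return [(tok.head.text, tok.dep_, tok.text)
            for tok in doc if tok.dep_ != "ROOT"]

for triple in sentence_graph("Marie Curie was born in Warsaw."):
    print(triple)  # e.g. ('born', 'nsubjpass', 'Curie'), ...
```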

    The value of numbers in clinical text classification

    Get PDF
    Clinical text often includes numbers of various types and formats. However, most current text classification approaches do not take advantage of these numbers. This study aims to demonstrate that using numbers as features can significantly improve the performance of text classification models. This study also demonstrates the feasibility of extracting such features from clinical text. Unsupervised learning was used to identify patterns of number usage in clinical text. These patterns were analyzed manually and converted into pattern-matching rules. Information extraction was used to incorporate numbers as features into a document representation model. We evaluated text classification models trained on such representations. Our experiments were performed with two document representation models (vector space model and word embedding model) and two classification models (support vector machines and neural networks). The results showed that even a handful of numerical features can significantly improve text classification performance. We conclude that commonly used document representations do not represent numbers in a way that machine learning algorithms can effectively utilize as features. Although we demonstrated that traditional information extraction can be effective in converting numbers into features, further community-wide research is required to systematically incorporate number representation into the word embedding process.
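
    The rule-based step described here, turning number patterns into typed features, is easy to illustrate. The patterns and feature names below are invented examples, not the study's actual rules; the resulting dictionary could then be appended to a vector-space or embedding representation:

```python
# Convert numbers in clinical text into typed features via regex rules.
# Patterns and feature names are invented examples, not the study's.
import re

NUMBER_PATTERNS = {
    "blood_pressure_systolic": re.compile(r"\b(\d{2,3})\s*/\s*\d{2,3}\b"),
    "temperature_c":           re.compile(r"\b(3[5-9](?:\.\d)?)\s*°?C\b"),
    "dose_mg":                 re.compile(r"\b(\d+(?:\.\d+)?)\s*mg\b", re.I),
}

def numeric_features(text):
    """Map a clinical note to a dict of numeric features."""
    feats = {}
    for name, pattern in NUMBER_PATTERNS.items():
        match = pattern.search(text)
        if match:
            feats[name] = float(match.group(1))
    return feats

note = "BP 142/88, temp 37.9 C, started metformin 500 mg daily."
print(numeric_features(note))
# {'blood_pressure_systolic': 142.0, 'temperature_c': 37.9, 'dose_mg': 500.0}
```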

    Biographical information extraction: A language-agnostic methodology for datasets and models

    Get PDF
    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.
    Information extraction (IE) refers to the task of detecting and linking information contained in written texts. While it includes various subtasks, relation extraction (RE) is used to link two entities in a text via a common relation. RE can therefore be used to build linked databases of knowledge across a wide range of topics. Today, the task of RE is treated as a supervised machine learning (ML) task, where a model is trained using a specific architecture and a specific annotated dataset. These specific datasets typically aim to represent common patterns that the model is to learn, albeit at the cost of manual annotation, which can be costly and time-consuming. In addition, due to the nature of the training process, the models can be sensitive to a specific genre or topic, and are generally monolingual. It therefore stands to reason that certain genres and topics have better models, as they are treated with higher priority due, for instance, to financial interests. This in turn leads to RE models not being available to every area of research, leaving linked databases of knowledge incomplete. For instance, if the birthplace of a person is not correctly extracted, the place and the person cannot be linked correctly, leaving the database incomplete. To address this problem, this thesis explores aspects of RE that could be adapted in ways which require little human effort, therefore making RE models more widely available. The first aspect is the annotated data. During the course of this thesis, Wikipedia and its subsidiaries are used as sources to automatically annotate sentences for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), information is matched with relatively high precision in order to compile annotated relation pairs for ten different relations that are important in the DH domain: birthdate, birthplace, deathdate, deathplace, occupation, parent, educated, child, sibling and other (all other relations). Furthermore, the effectiveness of the dataset is demonstrated by training a state-of-the-art neural model to classify relation pairs. For its evaluation, a manually annotated gold standard set is used. An investigation of the necessary adaptations to recreate the automatic process in a multilingual setting is also undertaken, looking specifically at English and German, for which similar neural models are trained and evaluated on a gold standard dataset. While the process is aimed here at training neural models for RE within the domain of digital humanities and history, it may be transferable to other domains.
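
    The automatic annotation idea at the heart of the dataset work, aligning article sentences with structured facts, can be sketched as a simple distant-supervision loop. The stub below stands in for the Wikidata/Pantheon lookups and NER; the names and data are illustrative only:

```python
# Simplified distant-supervision sketch: label a sentence with a relation
# when it mentions both the subject and a known fact's object.
# The facts dict stands in for Wikidata/Pantheon; data is illustrative.
def distant_annotate(sentences, subject, facts):
    """facts: {relation: object_string} for one subject entity."""
    examples = []
    for sent in sentences:
        for relation, obj in facts.items():
            if subject in sent and obj in sent:
                examples.append((sent, subject, obj, relation))
    return examples

sentences = [
    "Ada Lovelace was born in London in 1815.",
    "Ada Lovelace worked on the Analytical Engine.",
]
facts = {"birthplace": "London", "birthdate": "1815"}  # toy structured data
for example in distant_annotate(sentences, "Ada Lovelace", facts):
    print(example)
```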

    The Mechanical Psychologist: How Computational Techniques Can Aid Social Researchers in the Analysis of High-Stakes Conversation

    Get PDF
    Qualitative coding is an essential observational tool for describing behaviour in the social sciences. However, it traditionally relies on manual, time-consuming, and error-prone methods performed by humans. To overcome these issues, cross-disciplinary researchers are increasingly exploring computational methods such as Natural Language Processing (NLP) and Machine Learning (ML) to annotate behaviour automatically. Automated methods offer scalability, error reduction, and the discovery of increasingly subtle patterns in data compared to human effort alone (N. C. Chen et al., 2018). Despite promising advancements, concerns regarding generalisability, mistrust of automation, and value alignment between humans and machines persist (Friedberg et al., 2012; Grimmer et al., 2021; Jiang et al., 2021; R. Levitan & Hirschberg, 2011; Mills, 2019; Nenkova et al., 2008; Rahimi et al., 2017; Yarkoni et al., 2021). This thesis investigates the potential of computational techniques, such as social signal processing, text mining, and machine learning, to streamline qualitative coding in the social sciences, focusing on two high-stakes conversational case studies. The first case study analyses political interviewing using a corpus of 691 interview transcripts from US news networks. Psychological behaviours associated with effective interviewing are measured and used to predict conversational quality through supervised machine learning. Feature engineering employs a Social Signal Processing (SSP) approach to extract latent behaviours from low-level social signals (Vinciarelli, Salamin, et al., 2009). Conversational quality, calculated from desired characteristics of interviewee speech, is validated by a human-rater study. The findings support the potential of computational approaches in qualitative coding while acknowledging challenges in interpreting low-level social signals. The second case study investigates the ability of machines to learn expert-defined behaviours from human annotation, specifically in detecting predatory behaviour in known cases of online child grooming. In this section, the author utilises 623 chat logs obtained from a US-based online watchdog, with expert annotators labelling a subset of these chat logs to train a large language model. The goal was to investigate the machine’s ability to detect eleven predatory behaviours based on expert annotations. The results show that the machine could detect several behaviours with as few as fifty labelled instances, but rare behaviours were frequently over-predicted. The author next implemented a collaborative human-AI approach to investigate the trade-off between human accuracy and machine efficiency. The results suggested that a human-in-the-loop approach could improve human efficiency and machine accuracy, achieving near-human performance on several behaviours approximately fifteen times faster than human effort alone. The conclusion emphasises the value of increased automation in social sciences while recognising the importance of social scientific expertise in cross-disciplinary research, especially when addressing real-world problems. It advocates for technology that augments and enhances human effort and expertise without replacing it entirely. This thesis acknowledges the challenges in interpreting computational signals and the importance of preserving human insight in qualitative coding.
    The thesis also highlights potential avenues for future research, such as refining computational methods for qualitative coding and exploring collaborative human-AI approaches to address the limitations of automated methods.
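
    The human-in-the-loop trade-off described above can be sketched as confidence-based triage: the model keeps its confident predictions and routes ambiguous segments to a human coder. The thresholds and data below are illustrative assumptions, not the thesis's actual setup:

```python
# Confidence-based triage: auto-label confident cases, route uncertain
# ones to a human annotator. Thresholds and data are illustrative.
def triage(segments, scores, low=0.2, high=0.8):
    """Split segments into auto-labelled and human-review queues."""
    auto, review = [], []
    for segment, p in zip(segments, scores):
        if p >= high:
            auto.append((segment, 1))   # confidently positive behaviour
        elif p <= low:
            auto.append((segment, 0))   # confidently negative
        else:
            review.append(segment)      # ambiguous -> human coder
    return auto, review

auto, review = triage(["seg A", "seg B", "seg C"], [0.95, 0.50, 0.05])
print(len(auto), "auto-labelled;", len(review), "for human review")
```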

    Twenty-five years of information extraction

    No full text