
    Semantics-driven Abstractive Document Summarization

    The evolution of the Web over the last three decades has led to a deluge of scientific and news articles on the Internet. Harnessing these publications across different fields of study is critical to effective end-user information consumption. Similarly, in the healthcare domain, one of the key challenges with the adoption of Electronic Health Records (EHRs) in clinical practice has been the tremendous volume of clinical notes generated; without summarization, clinical decision making and communication become inefficient and costly. Despite rapid advances in information retrieval and deep learning techniques for abstractive document summarization, the results of these efforts continue to resemble extractive summaries, achieving promising results predominantly on lexical metrics but performing poorly on semantic metrics. Thus, abstractive summarization driven by the intrinsic and extrinsic semantics of documents is not adequately explored. Resources that can be used for generating semantics-driven abstractive summaries include:
    • Abstracts of multiple scientific articles published in a given technical field of study, to generate an abstractive summary for topically related abstracts within the field, thus reducing the load of having to read semantically duplicate abstracts on a given topic.
    • Citation contexts from different authoritative papers citing a reference paper, to generate a utility-oriented abstractive summary of a scientific article.
    • Biomedical articles and the named entities characterizing them, along with background knowledge bases, to generate entity- and fact-aware abstractive summaries.
    • Clinical notes of patients and clinical knowledge bases, for abstractive clinical text summarization using knowledge-driven multi-objective optimization.
    In this dissertation, we develop semantics-driven abstractive models based on intra-document and inter-document semantic analyses, along with facts about named entities retrieved from domain-specific knowledge bases, to produce summaries. Concretely, we propose a sequence of frameworks leveraging semantics at various levels of granularity (e.g., word, sentence, document, topic, citations, and named entities) by utilizing external resources. The proposed frameworks have been applied to a range of tasks, including:
    1. Abstractive summarization of topic-centric multi-document scientific articles and news articles.
    2. Abstractive summarization of scientific articles using crowd-sourced citation contexts.
    3. Abstractive summarization of biomedical articles clustered based on entity-relatedness.
    4. Abstractive summarization of clinical notes of patients with heart failure and chest X-ray recordings.
    The proposed approaches achieve impressive performance in terms of preserving semantics in abstractive summarization while paraphrasing. For summarization of topic-centric multiple scientific/news articles, we propose a three-stage approach in which abstracts of scientific or news articles are clustered based on their topical similarity, determined from topics generated using Latent Dirichlet Allocation (LDA), followed by an extractive phase and an abstractive phase (a sketch of the topic-clustering stage is given after this abstract). Then, in the next stage, we focus on abstractive summarization of biomedical literature, where we leverage named entities in biomedical articles to 1) cluster related articles and 2) guide abstractive summarization. Finally, in the last stage, we turn to external resources: citation contexts pointing to a scientific article, to generate a comprehensive and utility-centric abstractive summary of that article; domain-specific knowledge bases, to fill gaps in information about entities in a biomedical article being summarized; and clinical notes, to guide abstractive summarization of clinical text. Thus, the bottom-up progression of exploring semantics towards abstractive summarization in this dissertation starts with (i) Semantic Analysis of Latent Topics; builds on (ii) Internal and External Knowledge-I (gleaned from abstracts and citation contexts); and extends it to make it comprehensive using (iii) Internal and External Knowledge-II (Named Entities and Knowledge Bases).
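
    The sketch referenced above is a minimal, hypothetical illustration of the first (topic-clustering) stage, assuming gensim's LdaModel; the toy abstracts, whitespace tokenization and number of topics are placeholders, not the dissertation's actual data or configuration.

```python
# Minimal sketch of the topic-clustering stage: group abstracts by their
# dominant LDA topic before the extractive and abstractive phases.
# The toy abstracts, whitespace tokenization and num_topics are placeholders.
from collections import defaultdict
from gensim import corpora, models

abstracts = [
    "neural abstractive summarization of scientific articles",
    "topic models for grouping related news articles",
    "entity aware summarization of biomedical literature",
]

tokenized = [doc.lower().split() for doc in abstracts]      # placeholder tokenization
dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)

clusters = defaultdict(list)
for idx, bow in enumerate(bow_corpus):
    # Assign each abstract to its most probable topic.
    topic_id = max(lda.get_document_topics(bow), key=lambda t: t[1])[0]
    clusters[topic_id].append(idx)

# Each topical cluster would then feed the extractive and abstractive phases.
print(dict(clusters))
```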

    The MeSH-gram Neural Network Model: Extending Word Embedding Vectors with MeSH Concepts for UMLS Semantic Similarity and Relatedness in the Biomedical Domain

    Eliciting semantic similarity between concepts in the biomedical domain remains a challenging task. Recent approaches founded on embedding vectors have gained in popularity, as they have proven able to capture semantic relationships efficiently. The underlying idea is that two words with close meanings gather similar contexts. In this study, we propose a new neural network model named MeSH-gram, which relies on a straightforward approach that extends the skip-gram neural network model by considering MeSH (Medical Subject Headings) descriptors instead of words. Trained on the publicly available PubMed MEDLINE corpus, MeSH-gram is evaluated on reference standards manually annotated for semantic similarity. MeSH-gram is first compared to skip-gram with vectors of size 300 and at several context window sizes. A deeper comparison is performed with twenty existing models. All the obtained Spearman's rank correlations between human scores and computed similarities show that MeSH-gram outperforms the skip-gram model and is comparable to the best methods, which, however, need more computation and external resources. Comment: 6 pages, 2 tables
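
    For context, the evaluation protocol described above (Spearman's rank correlation between human similarity ratings and model similarities) can be sketched as follows; the vector file and benchmark triples are hypothetical placeholders, and this is not the authors' released code.

```python
# Sketch of the evaluation protocol only: Spearman's rank correlation between
# human similarity ratings and cosine similarities from trained vectors.
# The vector file and benchmark triples below are hypothetical placeholders.
from gensim.models import KeyedVectors
from scipy.stats import spearmanr

vectors = KeyedVectors.load("mesh_gram_vectors.kv")   # hypothetical path

# (concept_1, concept_2, human_score) triples from a reference standard
benchmark = [
    ("myocardial_infarction", "heart_attack", 9.2),
    ("diabetes", "hypertension", 4.5),
    ("aspirin", "renal_failure", 1.3),
]

human_scores, model_scores = [], []
for w1, w2, score in benchmark:
    if w1 in vectors and w2 in vectors:
        human_scores.append(score)
        model_scores.append(vectors.similarity(w1, w2))   # cosine similarity

rho, p_value = spearmanr(human_scores, model_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```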

    Extractive Summarization : Experimental work on nursing notes in Finnish

    Natural Language Processing (NLP) is a subfield of artificial intelligence and linguistics concerned with how computers interact with human language. With increasing computational power and advances in technology, researchers have successfully proposed various NLP tasks that have already been implemented as real-world applications today. Automatic text summarization is one of the many tasks that has not yet fully matured, particularly in the health sector. Success in this task would enable healthcare professionals to grasp a patient's history in minimal time, resulting in the faster decisions required for better care. Automatic text summarization is a process that helps shorten a large text without sacrificing important information. This can be achieved by paraphrasing the content, known as the abstractive method, or by concatenating relevant extracted sentences, namely the extractive method. In general, this process requires converting the text into numerical form, after which a method is executed to identify and extract relevant text. This thesis explores NLP techniques used in extractive text summarization, particularly in the health domain. The work includes a comparison of basic summarization models implemented on a corpus of patient notes written by nurses in the Finnish language. Concepts and research studies required to understand the implementation have been documented along with a description of the code. A Python-based project is structured to build a corpus and execute multiple summarization models. For this thesis, we observe the performance of two textual embeddings, namely Term Frequency-Inverse Document Frequency (TF-IDF), which is based on a simple statistical measure, and Word2Vec, which is based on neural networks. For both models, LexRank, an unsupervised stochastic graph-based sentence scoring algorithm, is used for sentence extraction, and a random selection method is used as a baseline for evaluation. To evaluate and compare the performance of the models, summaries of 15 patient care episodes from each model were provided to two human evaluators for manual evaluation. According to the results on this small sample, both evaluators agree in preferring the summaries produced by Word2Vec LexRank over those generated by TF-IDF LexRank. Both models were also observed, by both evaluators, to perform better than the random-selection baseline.
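
    The TF-IDF + LexRank pipeline described above can be approximated with a short sketch, assuming scikit-learn and networkx; the sentences are illustrative English placeholders rather than the thesis's Finnish nursing notes, and the thesis's own preprocessing and Word2Vec variant are omitted.

```python
# Sketch of the TF-IDF + LexRank pipeline: TF-IDF sentence vectors, a cosine
# similarity graph, and PageRank-style scoring to pick the top sentences.
# The example sentences are illustrative placeholders, not real nursing notes.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The patient slept well during the night.",
    "Blood pressure remained stable throughout the shift.",
    "The patient reported mild pain in the lower back.",
    "Pain medication was administered in the evening.",
]

# Sentence vectors (the Word2Vec variant would average word vectors instead).
tfidf = TfidfVectorizer().fit_transform(sentences)
similarity = cosine_similarity(tfidf)

# LexRank core idea: run PageRank over the sentence-similarity graph.
graph = nx.from_numpy_array(similarity)
scores = nx.pagerank(graph)

# Select the top-2 sentences and restore document order.
top = sorted(scores, key=scores.get, reverse=True)[:2]
print(" ".join(sentences[i] for i in sorted(top)))
```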

    Towards Personalized and Human-in-the-Loop Document Summarization

    The ubiquitous availability of computing devices and the widespread use of the internet continuously generate large amounts of data. As a result, the amount of available information on any given topic is far beyond humans' capacity to process it properly, causing what is known as information overload. To cope efficiently with large amounts of information and generate content of significant value to users, we need to identify, merge and summarise information. Data summaries can help gather related information and collect it into a shorter format that enables answering complicated questions, gaining new insight and discovering conceptual boundaries. This thesis focuses on three main challenges in alleviating information overload using novel summarisation techniques. It further intends to facilitate the analysis of documents to support personalised information extraction. The thesis separates the research issues into four areas, covering (i) feature engineering in document summarisation, (ii) traditional static and inflexible summaries, (iii) traditional generic summarisation approaches, and (iv) the need for reference summaries. We propose novel approaches to tackle these challenges by: (i) enabling automatic intelligent feature engineering, (ii) enabling flexible and interactive summarisation, and (iii) utilising intelligent and personalised summarisation approaches. The experimental results prove the efficiency of the proposed approaches compared to other state-of-the-art models. We further propose solutions to the information overload problem in different domains through summarisation, covering network traffic data, health data and business process data. Comment: PhD thesis

    A framework for the Comparative analysis of text summarization techniques

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science.
    With the boom of information technology and the IoT (Internet of Things), the volume of information, which is fundamentally data, is increasing at an alarming rate. This information can be harnessed and, if channeled in the right direction, can yield meaningful insight. The problem is that this data is not always numerical; there are problems where the data is completely textual and some meaning has to be derived from it. If one had to go through these texts manually, it would take hours or even days to extract concise and meaningful information. This is where the need for an automatic summarizer arises, easing manual intervention and reducing time and cost while retaining the key information held by these texts. In recent years, new methods and approaches have been developed to do so. These approaches are implemented in many domains; for example, search engines provide snippets as document previews, while news websites produce shortened descriptions of news subjects, usually as headlines, to make browsing easier. Broadly speaking, there are two main approaches to text summarization: extractive and abstractive. Extractive summarization filters out the important sections of the whole text to form a condensed version of the text, whereas abstractive summarization interprets and examines the text as a whole and, after discerning its meaning, generates sentences that describe the important points in a concise way.
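
    To make the extractive/abstractive distinction above concrete, the following sketch contrasts the two families, assuming the Hugging Face transformers library; the model name, the crude length-based sentence scoring and the example text are illustrative assumptions, not techniques compared in the dissertation.

```python
# Sketch contrasting the two families: an extractive summary reuses original
# sentences, while an abstractive model generates new paraphrased sentences.
# The model name and the crude length-based scoring are illustrative only.
from transformers import pipeline

text = (
    "The growth of connected devices has produced enormous volumes of text. "
    "Reading these documents manually takes hours or even days. "
    "Automatic summarizers shorten text while retaining the key information."
)

# Extractive: keep the most "important" original sentences (here, sentence
# length is a crude importance proxy; real systems score sentences properly).
sentences = [s.strip() for s in text.split(". ") if s.strip()]
extractive = sorted(sentences, key=len, reverse=True)[0]

# Abstractive: a sequence-to-sequence model writes new sentences.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
abstractive = summarizer(text, max_length=30, min_length=10, do_sample=False)[0]["summary_text"]

print("Extractive :", extractive)
print("Abstractive:", abstractive)
```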

    Innovations in Domain Knowledge Augmentation of Contextual Models

    The digital transformation of our society is creating a tremendous amount of data at an unprecedented rate. A large part of this data is in unstructured text format. While enjoying the benefit of instantaneous data access, we are also burdened by information overload. In healthcare, clinicians have to spend a significant portion of their time reading, writing and synthesizing data in electronic patient record systems. Information overload is reported as one of the main factors contributing to physician burnout; however, information overload is not unique to healthcare. We need better practical tools to help us access the right information at the right time. This has led to a heightened interest in high-performing Natural Language Processing research and solutions. Natural Language Processing (NLP), or Computational Linguistics, is a sub-field of computer science that focuses on analyzing and representing human language. The most recent advancements in NLP are large pre-trained contextual language models (e.g., transformer-based models), which are pre-trained on massive corpora, and their context-sensitive embeddings (i.e., learned representations of words) are used in downstream tasks. The introduction of these models has led to significant performance gains in various downstream tasks, including sentiment analysis, entity recognition, and question answering. Such models have the ability to change the embedding of a word based on its imputed meaning, which is derived from the surrounding context. Contextual models can only encode the knowledge available in raw text corpora. Injecting structured domain-specific knowledge into these contextual models could further improve their performance and efficiency. However, this is not a trivial task. It requires a deep understanding of the model’s architecture and the nature and structure of the domain knowledge incorporated into the model. Another challenge facing NLP is the “low-resource” problem, arising from a shortage of publicly available large (domain-specific) datasets for training purposes. The low-resource challenge is especially acute in the biomedical domain, where strict regulation for privacy protection prohibits many datasets from being publicly available to the NLP community. The severe shortage of clinical experts further exacerbates the lack of labeled training datasets for clinical NLP research. We approach these challenges from the knowledge augmentation angle. This thesis explores how knowledge found in structured knowledge bases, either in general-purpose lexical databases (e.g., WordNet) or domain-specific knowledge bases (e.g., the Unified Medical Language System or the International Classification of Diseases), can be used to address the low-resource problem. We show that by incorporating domain-specific prior knowledge into a deep learning NLP architecture, we can force an NLP model to learn the associations between distinctive terminologies that it otherwise may not have the opportunity to learn due to the scarcity of domain-specific datasets. Four distinct yet complementary strategies have been pursued. First, we investigate how contextual models can use structured knowledge contained in the lexical database WordNet to distinguish between semantically similar words. We update the input policy of a contextual model by introducing a new mix-up embedding strategy for the input embedding of the target word.
We also introduce additional information, such as the degree of similarity between the definitions of the target and the candidate words. We demonstrate that this supplemental information has enabled the model to select candidate words that are semantically similar to the target word rather than those that are only appropriate for the sentence’s context. Having successfully proven that lexical knowledge can aid a contextual model in distinguishing between semantically similar words, we extend this approach to highly specialized vocabularies such as those found in medical text. We explore whether using domain-specific (medical) knowledge from a clinical Metathesaurus (UMLS Metathesaurus) in the architecture of a transformer-based encoder model can aid the model in building ‘semantically enriched’ contextual representations that benefit from both the contextual learning and the domain knowledge. We also investigate whether incorporating structured medical knowledge into the pre-training phase of a transformer-based model can incentivize the model to learn the associations between distinctive terminologies more accurately. This strategy is proven to be effective through a series of benchmark comparisons with other related models. After demonstrating the effect of structured domain (medical) knowledge on the performance of a transformer-based encoder model, we extend the medical features and illustrate that structured medical knowledge can also boost the performance of a transformer-based sequence-to-sequence model for medical summarization. We introduce a guidance signal consisting of the medical terminologies in the input sequence. Moreover, the input policy is modified by utilizing the semantic types from UMLS, and we also propose a novel weighted loss function. Our study demonstrates the benefit of these strategies in providing a stronger incentive for the model to include relevant medical facts in the summarized output. We further examine whether an NLP model can take advantage of both the relational information between different labels and contextual embedding information by introducing a novel attention mechanism (instead of augmenting the architecture of contextual models with structured information as described above). We tackle the challenge of automatic ICD coding, which is the task of assigning codes of the International Classification of Diseases (ICD) system to medical notes. Through a novel attention mechanism, we integrate information from a Graph Convolutional Network (GCN), which considers the relationships between the various codes, with the contextual sentence embeddings of the medical notes. Our experiments reveal that this enhancement effectively boosts the model’s performance in the automatic ICD coding task. The main contribution of this thesis is two-fold: (1) this thesis contributes to the computer science literature by demonstrating how domain-specific knowledge can be effectively incorporated into contextual models to improve model performance in NLP tasks that lack helpful training resources; and (2) the knowledge augmentation strategies and the contextual models developed in this research are shown to improve NLP performance in the biomedical field, where publicly available training datasets are scarce but domain-specific knowledge bases and data standards have achieved wide adoption in electronic medical record systems.
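
    As an illustration of the last strategy, here is a minimal, hypothetical sketch of label-wise attention that combines GCN-derived ICD-code embeddings with the contextual sentence embeddings of a note, assuming PyTorch; the random placeholder tensors, dimensions and this exact formulation are assumptions for illustration, not the thesis's published architecture.

```python
# Sketch of label-wise attention: GCN-derived ICD-code embeddings attend over
# the contextual sentence embeddings of a medical note to produce per-code
# logits. Dimensions and random placeholder tensors are illustrative only.
import torch
import torch.nn.functional as F

batch, n_sent, hidden, n_codes = 2, 16, 768, 50

# Contextual sentence embeddings of a note (e.g., from a transformer encoder).
sentence_emb = torch.randn(batch, n_sent, hidden)

# Code embeddings produced by a GCN over the ICD code graph (placeholder).
code_emb = torch.randn(n_codes, hidden)

# Each ICD code attends over the note's sentences.
attn_logits = torch.einsum("bsh,ch->bcs", sentence_emb, code_emb)
attn = F.softmax(attn_logits, dim=-1)                  # (batch, n_codes, n_sent)

# Code-specific document representation and a logit per code.
code_doc = torch.einsum("bcs,bsh->bch", attn, sentence_emb)
logits = (code_doc * code_emb).sum(-1)                 # (batch, n_codes)
probs = torch.sigmoid(logits)                          # multi-label ICD predictions
print(probs.shape)
```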