3,795 research outputs found

    A Graph-Based Approach for the Summarization of Scientific Articles

    Automatic text summarization is one of the prominent applications of Natural Language Processing. Text summarization is the process of generating a gist from text documents. The task is to produce a summary that contains important, diverse and coherent information, i.e., a summary should be self-contained. Approaches to text summarization are conventionally extractive: they select a subset of sentences from an input document to form the summary. In this thesis, we introduce a novel graph-based extractive summarization approach. With the progressive advancement of research across the sciences, the summarization of scientific articles has become an essential requirement for researchers. This is our prime motivation for selecting scientific articles as our dataset. The newly formed dataset contains scientific articles from the PLOS Medicine journal, a high-impact journal in the field of biomedicine. The summarization of scientific articles is a single-document summarization task. It is complex for several reasons: the important information in a scientific article is scattered throughout it, and scientific articles contain a great deal of redundant information. In our approach, we address the three important factors of summarization: importance, non-redundancy and coherence. To deal with these factors, we use graphs, as they alleviate data sparsity problems and are computationally inexpensive. We employ a bipartite graph representation exclusively for the summarization task: an input document is represented as a bipartite graph consisting of sentence nodes and entity nodes. This representation captures entity-transition information, which is beneficial for selecting relevant sentences for a summary. We use a graph-based ranking algorithm to rank the sentences in a document.
The ranks serve as relevance scores for the sentences and are used throughout our approach. Scientific articles contain a considerable amount of redundant information; for example, the Introduction and Methodology sections contain similar information about the motivation and approach. Our approach therefore ensures that the summary contains non-redundant sentences. Besides containing the important, non-redundant information of the input document, a summary's sentences should be connected to one another so that it is coherent, understandable and easy to read. If coherence is not ensured, the sentences may not be properly connected, which leads to an obscure summary. Until now, only a few summarization approaches have taken coherence into account. We address coherence in two ways: through a graph measure and through structural information; we employ outdegree as the graph measure and coherence patterns for the structural information. We use integer programming as an optimization technique to select the best subset of sentences for a summary. Sentences are selected on the basis of relevance, diversity and coherence measures, whose computation is tightly integrated and handled simultaneously. We use human judgements to evaluate the coherence of summaries, and we compare ROUGE scores and human judgements of different systems on the PLOS Medicine dataset. Our approach performs considerably better than the other systems on this dataset. We also apply our approach to the standard DUC 2002 dataset to compare against recent state-of-the-art systems; the results show that our graph-based approach outperforms them on DUC 2002 as well. In conclusion, our approach is robust, i.e., it works on both scientific and news articles, and has the further advantage of being semi-supervised.
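The bipartite sentence-entity ranking this abstract describes can be illustrated with a minimal sketch. This is a toy, not the thesis implementation: the HITS-style mutual-reinforcement iteration, the entity sets, and all names are illustrative assumptions.

```python
# Toy sketch of ranking sentences over a bipartite sentence-entity graph.
# A sentence scores highly if it mentions highly-scored entities, and an
# entity scores highly if highly-scored sentences mention it (HITS-style).

def rank_sentences(sentence_entities, iterations=20):
    """sentence_entities: dict mapping sentence id -> set of entity strings."""
    entities = {e for ents in sentence_entities.values() for e in ents}
    s_score = {s: 1.0 for s in sentence_entities}
    for _ in range(iterations):
        # entity score: total score of the sentences that mention it
        e_score = {e: sum(s_score[s] for s, ents in sentence_entities.items()
                          if e in ents)
                   for e in entities}
        # sentence score: total score of the entities it mentions
        s_score = {s: sum(e_score[e] for e in ents) or 1e-9
                   for s, ents in sentence_entities.items()}
        norm = sum(s_score.values())
        s_score = {s: v / norm for s, v in s_score.items()}
    # sentence ids in descending order of relevance
    return sorted(s_score, key=s_score.get, reverse=True)

sents = {
    0: {"summarization", "graph"},
    1: {"graph", "entity"},
    2: {"dataset"},
}
order = rank_sentences(sents)
```

Sentences 0 and 1 share the well-connected entity "graph" and therefore outrank the isolated sentence 2; a full system would then hand these scores to the integer-programming selection step.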

    Thematic Annotation: extracting concepts out of documents

    Contrary to standard approaches to topic annotation, the technique used in this work does not centrally rely on some sort of -- possibly statistical -- keyword extraction. Instead, the proposed annotation algorithm uses a large-scale semantic database -- the EDR Electronic Dictionary -- that provides a concept hierarchy based on hyponym and hypernym relations. This concept hierarchy is used to generate a synthetic representation of the document by aggregating the words present in topically homogeneous document segments into a set of concepts that best preserves the document's content. This extraction technique takes an unexplored approach to topic selection: instead of using semantic similarity measures based on a semantic resource, the latter is processed to extract the part of the conceptual hierarchy relevant to the document content. This conceptual hierarchy is then searched to extract the most relevant set of concepts to represent the topics discussed in the document. Notice that this algorithm is able to extract generic concepts that are not directly present in the document. Comment: Technical report EPFL/LIA. 81 pages, 16 figures.
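The key idea above, that a hypernym hierarchy can yield concepts never mentioned in the text, can be sketched as follows. The toy hierarchy and all names are illustrative stand-ins for the EDR dictionary, not the paper's algorithm.

```python
# Toy sketch: climb a hypernym hierarchy to find a concept that covers
# the document's words, even when that concept never occurs in the text.

PARENT = {"dog": "animal", "cat": "animal", "animal": "entity",
          "rose": "plant", "plant": "entity"}  # illustrative hierarchy

def ancestors(word):
    """Hypernym chain from most specific ancestor to the root."""
    chain = []
    while word in PARENT:
        word = PARENT[word]
        chain.append(word)
    return chain

def covering_concept(words):
    """Most specific concept that is an ancestor of every word."""
    common = None
    for w in words:
        chain = ancestors(w)
        common = chain if common is None else [c for c in common if c in chain]
    return common[0] if common else None
```

For a segment mentioning only "dog" and "cat", the extracted concept is "animal", which is absent from the segment itself; a segment mixing "dog" and "rose" generalizes further, to "entity".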

    An improved method for text summarization using lexical chains

    This work is directed toward the creation of a system for automatically summarizing documents by extracting selected sentences. Several heuristics, including position, cue words, and title words, are used in conjunction with lexical chain information to create a salience function that ranks sentences for extraction. Compiler technology, including the Flex and Bison tools, is used to create the AutoExtract summarizer, which extracts and combines this information from the raw text. The WordNet database is used to create the lexical chains. The AutoExtract summarizer performed better than the Microsoft Word97 AutoSummarize tool and the Sinope commercial summarizer in tests against ideal extracts and in tests judged by humans.
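A salience function combining the heuristics named above might look like the sketch below. The weights, cue-word list, and tokenization are illustrative assumptions, not the paper's values, and the lexical-chain term is omitted for brevity.

```python
# Toy salience function mixing position, cue-word, and title-word
# heuristics; higher scores mark better extraction candidates.

CUE_WORDS = {"significant", "conclusion", "results"}  # illustrative list

def salience(sentence, index, total, title_words):
    words = set(sentence.lower().split())
    position = 1.0 - index / max(total - 1, 1)   # earlier sentences score higher
    cue = len(words & CUE_WORDS) / max(len(words), 1)
    title = len(words & title_words) / max(len(title_words), 1)
    return 0.4 * position + 0.3 * cue + 0.3 * title  # weights are assumptions

doc = ["Text summarization extracts key sentences.",
       "Many heuristics exist.",
       "In conclusion the method works."]
title = {"text", "summarization"}
scores = [salience(s, i, len(doc), title) for i, s in enumerate(doc)]
```

Here the opening sentence wins on both position and title overlap; in a real system the weights would be tuned against ideal extracts, as the abstract's evaluation suggests.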

    State of the art document clustering algorithms based on semantic similarity

    The continuing growth of the Internet has caused the number of text documents in electronic form to increase enormously, making techniques for grouping these documents into meaningful clusters a critical need. Traditional clustering methods were based on statistical features, and clustering was done syntactically rather than semantically. As a result, these techniques could place dissimilar documents in the same group because of polysemy and synonymy. An important solution to this issue is document clustering based on semantic similarity, in which documents are grouped according to meaning rather than keywords. In this research, eighty papers that use semantic similarity in different fields have been reviewed; forty of them, published between 2014 and 2020 and applying semantic similarity to document clustering, were selected for deep study. A comprehensive literature review of all the selected papers is given, with detailed comparison of their clustering algorithms, the tools they use, and their evaluation methods. This supports the implementation and evaluation of document clustering, and the reviewed research informs the direction of our proposed work. Finally, an intensive discussion comparing the works is presented, and the results of our research are shown in figures.
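The polysemy/synonymy problem the survey highlights can be shown with a tiny example: mapping synonyms to a shared concept before grouping merges documents that keyword matching would keep apart. The synonym table and all names are illustrative stand-ins for a resource such as WordNet.

```python
# Toy sketch: concept-level grouping vs. raw keyword matching.
# Synonyms are normalized to a shared concept before clustering.

SYNONYMS = {"car": "vehicle", "automobile": "vehicle", "truck": "vehicle"}

def concepts(doc):
    """Replace each word with its concept, if one is known."""
    return frozenset(SYNONYMS.get(w, w) for w in doc.lower().split())

def cluster(docs):
    """Group documents that share exactly the same concept set."""
    groups = {}
    for d in docs:
        groups.setdefault(concepts(d), []).append(d)
    return list(groups.values())

docs = ["car repair", "automobile repair", "doctor visit"]
clusters = cluster(docs)
```

"car repair" and "automobile repair" share no keyword, yet land in one cluster because both map to the concept set {vehicle, repair}; a keyword-based grouping would have split them.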

    Proceedings of the First Workshop on Computing News Storylines (CNewsStory 2015)

    This volume contains the proceedings of the 1st Workshop on Computing News Storylines (CNewsStory 2015) held in conjunction with the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2015) at the China National Convention Center in Beijing, on July 31st 2015. Narratives are at the heart of information sharing. Ever since people began to share their experiences, they have connected them to form narratives. The study of storytelling and the field of literary theory called narratology have developed complex frameworks and models related to various aspects of narrative such as plot structures, narrative embeddings, characters' perspectives, reader response, point of view, narrative voice, narrative goals, and many others. These notions from narratology have been applied mainly in Artificial Intelligence and to model formal semantic approaches to narratives (e.g. Plot Units developed by Lehnert (1981)). In recent years, computational narratology has qualified as an autonomous field of study and research. Narrative has been the focus of a number of workshops and conferences (AAAI Symposia, Interactive Storytelling Conference (ICIDS), Computational Models of Narrative). Furthermore, reference annotation schemes for narratives have been proposed (NarrativeML by Mani (2013)). The workshop aimed at bringing together researchers from different communities working on representing and extracting narrative structures in news, a text genre which is widely used in NLP but which has received little attention with respect to narrative structure, representation and analysis. Currently, advances in NLP technology have made it feasible to look beyond scenario-driven, atomic extraction of events from single documents and work towards extracting story structures from multiple documents, while these documents are published over time as news streams.
Policy makers, NGOs, and information specialists (such as journalists and librarians) are increasingly in need of tools that support them in finding salient stories in large amounts of information, in order to implement policies more effectively, monitor the actions of "big players" in society, and check facts. Their tasks often revolve around reconstructing cases either with respect to specific entities (e.g. persons or organizations) or events (e.g. hurricane Katrina). Storylines represent explanatory schemas that enable us to make better selections of relevant information but also projections into the future. They hold valuable potential for exploiting news data in innovative ways. JRC.G.2 - Global security and crisis management

    Automatic bilingual text document summarization.

    Lo Sau-Han Silvia. Thesis (M.Phil.)--Chinese University of Hong Kong, 2002. Includes bibliographical references (leaves 137-143). Abstracts in English and Chinese.
    Chapter 1 --- Introduction --- p.1
    1.1 --- Definition of a summary --- p.2
    1.2 --- Definition of text summarization --- p.3
    1.3 --- Previous work --- p.4
    1.3.1 --- Extract-based text summarization --- p.5
    1.3.2 --- Abstract-based text summarization --- p.8
    1.3.3 --- Sophisticated text summarization --- p.9
    1.4 --- Summarization evaluation methods --- p.10
    1.4.1 --- Intrinsic evaluation --- p.10
    1.4.2 --- Extrinsic evaluation --- p.11
    1.4.3 --- The TIPSTER SUMMAC text summarization evaluation --- p.11
    1.4.4 --- Text Summarization Challenge (TSC) --- p.13
    1.5 --- Research contributions --- p.14
    1.5.1 --- Text summarization based on thematic term approach --- p.14
    1.5.2 --- Bilingual news summarization based on an event-driven approach --- p.15
    1.6 --- Thesis organization --- p.16
    Chapter 2 --- Text Summarization based on a Thematic Term Approach --- p.17
    2.1 --- System overview --- p.18
    2.2 --- Document preprocessor --- p.20
    2.2.1 --- English corpus --- p.20
    2.2.2 --- English corpus preprocessor --- p.22
    2.2.3 --- Chinese corpus --- p.23
    2.2.4 --- Chinese corpus preprocessor --- p.24
    2.3 --- Corpus thematic term extractor --- p.24
    2.4 --- Article thematic term extractor --- p.26
    2.5 --- Sentence score generator --- p.29
    2.6 --- Chapter summary --- p.30
    Chapter 3 --- Evaluation for Summarization using the Thematic Term Approach --- p.32
    3.1 --- Content-based similarity measure --- p.33
    3.2 --- Experiments using content-based similarity measure --- p.36
    3.2.1 --- English corpus and parameter training --- p.36
    3.2.2 --- Experimental results using content-based similarity measure --- p.38
    3.3 --- Average inverse rank (AIR) method --- p.59
    3.4 --- Experiments using average inverse rank method --- p.60
    3.4.1 --- Corpora and parameter training --- p.61
    3.4.2 --- Experimental results using AIR method --- p.62
    3.5 --- Comparison between the content-based similarity measure and the average inverse rank method --- p.69
    3.6 --- Chapter summary --- p.73
    Chapter 4 --- Bilingual Event-Driven News Summarization --- p.74
    4.1 --- Corpora --- p.75
    4.2 --- Topic and event definitions --- p.76
    4.3 --- Architecture of bilingual event-driven news summarization system --- p.77
    4.4 --- Bilingual event-driven approach summarization --- p.80
    4.4.1 --- Dictionary-based term translation applying on English news articles --- p.80
    4.4.2 --- Preprocessing for Chinese news articles --- p.89
    4.4.3 --- Event clusters generation --- p.89
    4.4.4 --- Cluster selection and summary generation --- p.96
    4.5 --- Evaluation for summarization based on event-driven approach --- p.101
    4.6 --- Experimental results on event-driven summarization --- p.103
    4.6.1 --- Experimental settings --- p.103
    4.6.2 --- Results and analysis --- p.105
    4.7 --- Chapter summary --- p.113
    Chapter 5 --- Applying Event-Driven Summarization to a Parallel Corpus --- p.114
    5.1 --- Parallel corpus --- p.115
    5.2 --- Parallel documents preparation --- p.116
    5.3 --- Evaluation methods for the event-driven summaries generated from the parallel corpus --- p.118
    5.4 --- Experimental results and analysis --- p.121
    5.4.1 --- Experimental settings --- p.121
    5.4.2 --- Results and analysis --- p.123
    5.5 --- Chapter summary --- p.132
    Chapter 6 --- Conclusions and Future Work --- p.133
    6.1 --- Conclusions --- p.133
    6.2 --- Future work --- p.135
    Bibliography --- p.137
    Appendix A --- English Stop Word List --- p.144
    Appendix B --- Chinese Stop Word List --- p.149
    Appendix C --- Event List Items on the Corpora --- p.151
    C.1 --- Event list items for the topic "Upcoming Philippine election" --- p.151
    C.2 --- Event list items for the topic "German train derail" --- p.153
    C.3 --- Event list items for the topic "Electronic service delivery (ESD) scheme" --- p.154
    Appendix D --- The sample of an English article (9505001.xml) --- p.15

    NLP Driven Models for Automatically Generating Survey Articles for Scientific Topics.

    Full text link
    This thesis presents new methods that use natural language processing (NLP) driven models for summarizing research in scientific fields. Given a topic query in the form of a text string, we present methods for finding research articles relevant to the topic as well as summarization algorithms that use lexical and discourse information present in the text of these articles to generate coherent and readable extractive summaries of past research on the topic. In addition to summarizing prior research, good survey articles should also forecast future trends. With this motivation, we present work on forecasting future impact of scientific publications using NLP driven features.
    PhD. Computer Science and Engineering. University of Michigan, Horace H. Rackham School of Graduate Studies.
    http://deepblue.lib.umich.edu/bitstream/2027.42/113407/1/rahuljha_1.pd