
    Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

    On the trade-off between redundancy and cohesiveness in extractive summarization

    Extractive summaries are usually presented as lists of sentences with no expected cohesion between them and, if not explicitly accounted for, with plenty of redundant information. In this paper, we investigate the trade-offs incurred when aiming to control for inter-sentential cohesion and redundancy in extracted summaries, and the impact of this control on informativeness. As a case study, we focus on the summarization of long, highly redundant documents and consider two optimization scenarios: reward-guided and unsupervised. In the reward-guided scenario, we compare systems that control for redundancy and cohesiveness during sentence scoring. In the unsupervised scenario, we introduce two systems that aim to control all three properties (informativeness, redundancy, and cohesiveness) in a principled way. Both systems implement a psycholinguistic theory that simulates how humans keep track of relevant content units, and how cohesiveness and non-redundancy constraints are applied in short-term memory during reading. Extensive automatic and human evaluations reveal that systems optimizing for cohesiveness, among other properties, organize content in summaries better than systems that optimize only for redundancy, while maintaining comparable informativeness. We find that the proposed unsupervised systems extract highly cohesive summaries across varying levels of document redundancy, although they sacrifice informativeness in the process. Finally, we provide evidence of how the simulated cognitive processes affect the trade-off between the analysed summary properties.
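
    As a rough illustration of the trade-off described above, the sketch below greedily selects sentences by combining an informativeness term with a redundancy penalty and a cohesion bonus. It is a minimal sketch, not the paper's reward formulation: the TF-IDF representation, the centrality proxy for informativeness, and the lambda weights are all illustrative assumptions.

```python
# Greedy extractive selection balancing informativeness, redundancy, and cohesion.
# Illustrative sketch only: the scoring terms and weights are assumptions, not the
# systems described in the abstract.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_summary(sentences, budget=3, lam_inf=1.0, lam_red=0.7, lam_coh=0.3):
    X = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(X)                       # sentence-to-sentence similarity
    centrality = sim.mean(axis=1)                    # crude proxy for informativeness
    selected = []
    while len(selected) < min(budget, len(sentences)):
        best, best_score = None, -np.inf
        for i in range(len(sentences)):
            if i in selected:
                continue
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            # cohesion: similarity to the sentence that would precede i in the summary
            prev = max((j for j in selected if j < i), default=None)
            cohesion = sim[i, prev] if prev is not None else 0.0
            score = lam_inf * centrality[i] - lam_red * redundancy + lam_coh * cohesion
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return [sentences[i] for i in sorted(selected)]
```

    Raising lam_red trades informativeness for less repetition, while raising lam_coh favours sentences that connect to their summary neighbours, mirroring the tension the paper studies.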

    An enhanced binary bat and Markov clustering algorithms to improve event detection for heterogeneous news text documents

    Event Detection (ED) works on identifying events from various types of data. Building an ED model for news text documents greatly helps decision-makers in various disciplines improve their strategies. However, identifying and summarizing events from such data is a non-trivial task due to the large volume of published heterogeneous news text documents. Such documents create a high-dimensional feature space that influences the overall performance of the baseline methods in the ED model. To address this problem, this research presents an enhanced ED model with improved methods for the crucial phases of the ED pipeline: Feature Selection (FS), ED, and summarization. This work focuses on the FS problem by automatically detecting events through a novel wrapper FS method based on an Adapted Binary Bat Algorithm (ABBA) and an Adapted Markov Clustering Algorithm (AMCL), termed ABBA-AMCL. These adaptive techniques were developed to overcome the premature convergence of BBA and the fast convergence rate of MCL. Furthermore, this study proposes four summarization methods to generate informative summaries. The enhanced ED model was tested on 10 benchmark datasets and 2 Facebook news datasets. The effectiveness of ABBA-AMCL was compared to 8 FS methods based on meta-heuristic algorithms and 6 graph-based ED methods. The empirical and statistical results proved that ABBA-AMCL surpassed the other methods on most datasets. The key representative features demonstrate that ABBA-AMCL successfully detects real-world events from the Facebook news datasets, with a precision of 0.96 and a recall of 1 on dataset 11, and a precision of 1 and a recall of 0.76 on dataset 12. To conclude, the novel ABBA-AMCL method presented in this research bridges the research gap and addresses the curse of high-dimensional feature spaces for heterogeneous news text documents. Hence, the enhanced ED model can organize news documents into distinct events and provide policymakers with valuable information for decision making.
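
    The wrapper idea behind this kind of feature selection can be pictured with a plain binary bat search over feature masks, sketched below. This is a generic, simplified BBA loop under stated assumptions; the adaptive mechanisms that counter premature convergence and the Markov-clustering-based evaluation used in ABBA-AMCL are not reproduced here, and fitness_fn is assumed to score a binary mask (higher is better), e.g. clustering quality on the selected features.

```python
# Simplified binary bat wrapper for feature selection (generic sketch, not ABBA-AMCL).
import numpy as np

def binary_bat_fs(n_features, fitness_fn, n_bats=20, n_iter=50,
                  f_min=0.0, f_max=2.0, loudness=0.9, pulse_rate=0.5, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.integers(0, 2, size=(n_bats, n_features))       # binary feature masks
    vel = np.zeros((n_bats, n_features))
    fit = np.array([fitness_fn(p) for p in pos], dtype=float)
    best_idx = int(fit.argmax())
    best, best_fit = pos[best_idx].copy(), fit[best_idx]

    for _ in range(n_iter):
        for i in range(n_bats):
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] += (pos[i] - best) * freq
            prob = 1.0 / (1.0 + np.exp(-vel[i]))               # sigmoid transfer to [0, 1]
            cand = (rng.random(n_features) < prob).astype(int)
            if rng.random() > pulse_rate:                      # occasional local search near the best bat
                cand = best ^ (rng.random(n_features) < 0.05).astype(int)
            cand_fit = fitness_fn(cand)
            if cand_fit > fit[i] and rng.random() < loudness:  # accept improving candidates
                pos[i], fit[i] = cand, cand_fit
                if cand_fit > best_fit:
                    best, best_fit = cand.copy(), cand_fit
    return best

# Toy usage: reward masks that keep the first five features and drop the rest.
mask = binary_bat_fs(10, lambda m: float(m[:5].sum() - m[5:].sum()))
```

    In the full model described above, the fitness would come from the quality of the event clusters produced on the selected features rather than from this toy objective.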

    Computing and Exploiting Document Structure to Improve Unsupervised Extractive Summarization of Legal Case Decisions

    Though many algorithms can be used to automatically summarize legal case decisions, most fail to incorporate domain knowledge about how important sentences in a legal decision relate to a representation of its document structure. For example, analysis of a legal case summarization dataset demonstrates that sentences serving different types of argumentative roles in the decision appear in different sections of the document. In this work, we propose an unsupervised graph-based ranking model that uses a reweighting algorithm to exploit properties of the document structure of legal case decisions. We also explore the impact of using different methods to compute the document structure. Results on the Canadian Legal Case Law dataset show that our proposed method outperforms several strong baselines.
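
    The reweighting idea can be illustrated with a TextRank-style ranker in which edges touching structurally important sections receive extra weight, as in the sketch below. The section labels, the similarity cut-off, and the boost factor are illustrative assumptions rather than the paper's actual scheme.

```python
# Unsupervised graph-based sentence ranking with structure-aware edge reweighting.
# Illustrative sketch; not the paper's exact reweighting algorithm.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(sentences, sections, boosted_sections=("reasons", "conclusion"), boost=1.5):
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    g = nx.Graph()
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if sim[i, j] <= 0.05:                              # drop near-zero edges
                continue
            w = sim[i, j]
            if sections[i] in boosted_sections or sections[j] in boosted_sections:
                w *= boost                                     # structural reweighting
            g.add_edge(i, j, weight=w)
    scores = nx.pagerank(g, weight="weight")
    return sorted(scores, key=scores.get, reverse=True)        # sentence indices, best first
```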

    The influence of specialized domains on keyphrase extraction

    Keyphrases are the words or multi-word expressions that represent the main content of a document. They are useful for various applications, such as automatic indexing or automatic summarization, but are not always available. We therefore focus on automatic keyphrase extraction and, more specifically, on the difficulty of this task when processing documents from certain scientific disciplines. Using five corpora representing five different disciplines (archaeology, linguistics, information science, psychology, and chemistry), we derive a scale of disciplinary difficulty and analyse the factors that influence it.

    Cognitive structures of content for controlled summarization

    In the current information age, where over 1 petabyte of data is created every day on the web, demand continues to rise for effective technological tools that help end-users consume information in a timely way. Automatic summarization is the task of consuming a text document (or a collection of documents) and presenting the user with a shorter text, the summary, that retains the gist of the information consumed. In general, a good summary should present content bits that are relevant (informative), non-redundant (non-repetitive), organized in a sensible way (coherent), and that read as a unified thematic whole (cohesive). The particular information needs of each user have prompted many variations of the summarization task. Among them, extractive summarization consists of extracting spans of text (usually sentences) from the input document(s), concatenating them, and presenting them as the final summary. Traditionally, extractive systems focus on presenting highly informative content, regardless of whether content bits are repeated or presented in an incoherent, non-cohesive manner. How to balance these properties remains an understudied problem, even though understanding the trade-offs between them could enable a system to produce text with relevant content that is also more readable to humans. This thesis argues that extractive summaries can be presented in a non-redundant, cohesive way and still be informative. We investigate the interaction between these summary properties and develop models that balance their trade-off during document understanding and during summary production. At the core of these models, an algorithm inspired by psycholinguistic models of memory simulates how humans keep track of relevant content in short-term memory, and how cohesion and non-redundancy constraints are applied among content bits in memory. The results are encouraging. When modeling the trade-off during document understanding in an unsupervised scenario, we find that our models are able to detect relevant content, reduce redundancy, and significantly improve cohesion in summaries, especially when the input document exhibits high redundancy. Furthermore, we show that this balance can be controlled through specific, interpretable hyper-parameters. In a similar reinforcement learning scenario, we find that informativeness and cohesion can influence each other positively. Finally, when modeling the trade-off during summary extraction, our models are able to better enforce cohesive ties between semantically similar text spans in neighboring sentences. Our approach produces summaries that are perceived by humans as more cohesive and as informative as summaries built only for informativeness. Catering to the need to process extremely long and redundant input, we design this system to consume text sequences of arbitrary length and test it in scenarios with single long documents and with multiple documents.
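
    A schematic version of the memory-based idea described above is sketched below: a bounded short-term memory of content units is updated sentence by sentence, reinforcing units that recur (redundancy control) and recording ties between new content and retained units (cohesion). Sentence-level units, cosine similarity, and the capacity, decay, and threshold values are illustrative assumptions, not the thesis's actual model.

```python
# Schematic bounded short-term memory over content units; purely illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def simulate_memory(sentences, capacity=5, decay=0.8, sim_threshold=0.3):
    vecs = TfidfVectorizer().fit_transform(sentences)
    memory = []                                     # list of (sentence_idx, activation)
    links = []                                      # cohesive ties (new_idx, retained_idx)
    for i in range(len(sentences)):
        sims = cosine_similarity(vecs[i], vecs).ravel()
        matched = False
        for k, (j, act) in enumerate(memory):
            if sims[j] >= sim_threshold:            # unit already tracked: reinforce instead of re-adding
                memory[k] = (j, act + sims[j])
                links.append((i, j))
                matched = True
        if not matched:
            memory.append((i, 1.0))                 # new relevant unit enters memory
        memory = [(j, act * decay) for j, act in memory]
        memory = sorted(memory, key=lambda x: x[1], reverse=True)[:capacity]  # evict weakest units
    return memory, links
```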

    Applied Deep Learning: Case Studies in Computer Vision and Natural Language Processing

    Deep learning has proved to be successful for many computer vision and natural language processing applications. In this dissertation, three studies have been conducted to show the efficacy of deep learning models for computer vision and natural language processing. In the first study, an efficient deep learning model was proposed for seagrass scar detection in multispectral images, producing robust, accurate scar mappings. In the second study, an arithmetic deep learning model was developed to fuse multispectral images collected at different times with different resolutions to generate high-resolution images for downstream tasks including change detection, object detection, and land cover classification. In addition, a super-resolution deep learning model was implemented to further enhance remote sensing images. In the third study, a deep learning-based framework was proposed for fact-checking on social media to spot fake scientific news. The framework leveraged deep learning, information retrieval, and natural language processing techniques to retrieve pertinent scholarly papers for a given piece of scientific news and evaluate the credibility of the news.

    Comments-oriented document summarization: Understanding documents with readers' feedback

    Comments left by readers on Web documents contain valuable information that can be utilized in different information retrieval tasks, including document search, visualization, and summarization. In this paper, we study the problem of comments-oriented document summarization and aim to summarize a Web document (e.g., a blog post) by considering not only its content but also the comments left by its readers. We identify three relations (namely topic, quotation, and mention) by which comments can be linked to one another, and model the relations in three graphs. The importance of each comment is then scored by (i) a graph-based method, where the three graphs are merged into a multi-relation graph, or (ii) a tensor-based method, where the three graphs are used to construct a 3rd-order tensor. To generate a comments-oriented summary, we extract sentences from the given Web document using either a feature-biased approach or a uniform-document approach. The former biases sentence scoring toward keywords derived from comments, while the latter scores sentences uniformly with comments. In our experiments on a set of blog posts with manually labeled sentences, our proposed summarization methods utilizing comments showed significant improvement over those not using comments. The methods using the feature-biased sentence extraction approach were observed to outperform those using the uniform-document approach.
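
    The graph-based scoring step can be pictured as merging the three relation graphs into one weighted graph and ranking comments with PageRank, as in the sketch below. The relation weights are illustrative assumptions; the tensor-based variant and the feature-biased sentence extraction step are not shown.

```python
# Merge topic, quotation, and mention graphs over the same comments and score
# comment importance with PageRank. Illustrative sketch only.
import numpy as np
import networkx as nx

def score_comments(topic_adj, quote_adj, mention_adj, weights=(0.5, 0.3, 0.2)):
    merged = (weights[0] * np.asarray(topic_adj)
              + weights[1] * np.asarray(quote_adj)
              + weights[2] * np.asarray(mention_adj))
    g = nx.from_numpy_array(merged)                 # weighted multi-relation graph
    return nx.pagerank(g, weight="weight")          # comment index -> importance score
```

    A feature-biased extractor would then up-weight document sentences containing keywords drawn from the top-ranked comments.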

    Argument mining with graph representation learning

    Argument Mining (AM) is a unique task in Natural Language Processing (NLP) that targets arguments: a meaningful logical structure in human language. Since the argument plays a significant role in the legal field, the interdisciplinary study of AM on legal texts has significant promise. For years, a pipeline architecture has been used as the standard paradigm in this area. Although this simplifies the development and management of AM systems, the connection between different parts of the pipeline causes inevitable shortcomings such as cascading error propagation. This paper presents an alternative perspective on the AM task, whereby legal documents are represented as graph structures and the AM task is undertaken as a hybrid approach incorporating Graph Neural Networks (GNNs), graph augmentation, and collective classification. GNNs have been demonstrated to be an effective method for representation learning on graphs, and they have been successfully applied to many other NLP tasks. In contrast to previous pipeline-based architectures, our approach results in a single end-to-end classifier for the identification and classification of argumentative text segments. Experiments based on corpora from both the European Court of Human Rights (ECHR) and the Court of Justice of the European Union (CJEU) show that our approach achieves strong results compared to state-of-the-art baselines. Both the graph augmentation and collective classification steps are shown to improve performance on both datasets when compared to using GNNs alone.
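
    As a schematic of the graph-based formulation, the sketch below runs two rounds of GCN-style neighbourhood aggregation over a sentence graph and reads off a label per node. It is untrained and purely illustrative: the node features, the augmented edges, the collective classification step, and all layer sizes are assumptions, not the paper's architecture.

```python
# Two-layer GCN-style propagation for node (sentence) classification; schematic only.
import numpy as np

def gcn_propagate(adj, h, w):
    a_hat = adj + np.eye(adj.shape[0])              # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    return d_inv_sqrt @ a_hat @ d_inv_sqrt @ h @ w  # normalized aggregation: D^-1/2 A D^-1/2 H W

def classify_nodes(adj, features, n_classes=3, hidden=16, seed=0):
    rng = np.random.default_rng(seed)
    w1 = rng.normal(scale=0.1, size=(features.shape[1], hidden))
    w2 = rng.normal(scale=0.1, size=(hidden, n_classes))
    h = np.maximum(0.0, gcn_propagate(adj, features, w1))   # ReLU after the first layer
    logits = gcn_propagate(adj, h, w2)
    return logits.argmax(axis=1)                    # predicted argument-role label per node
```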