560 research outputs found

    NLP Driven Models for Automatically Generating Survey Articles for Scientific Topics.

    This thesis presents new methods that use natural language processing (NLP) driven models for summarizing research in scientific fields. Given a topic query in the form of a text string, we present methods for finding research articles relevant to the topic as well as summarization algorithms that use lexical and discourse information present in the text of these articles to generate coherent and readable extractive summaries of past research on the topic. In addition to summarizing prior research, good survey articles should also forecast future trends. With this motivation, we present work on forecasting future impact of scientific publications using NLP driven features. PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113407/1/rahuljha_1.pd

    Semantic Segmentation of Legal Documents via Rhetorical Roles

    Legal documents are unstructured, use legal jargon, and have considerable length, making them difficult to process automatically via conventional text processing techniques. A legal document processing system would benefit substantially if the documents could be segmented into coherent information units. This paper proposes a new corpus of legal documents annotated (with the help of legal experts) with a set of 13 semantically coherent unit labels (referred to as Rhetorical Roles), e.g., facts, arguments, statute, issue, precedent, ruling, and ratio. We perform a thorough analysis of the corpus and the annotations. For automatically segmenting the legal documents, we experiment with the task of rhetorical role prediction: given a document, predict the text segments corresponding to various roles. Using the created corpus, we experiment extensively with various deep learning-based baseline models for the task. Further, we develop a multitask learning (MTL) based deep model with document rhetorical role label shift as an auxiliary task for segmenting a legal document. The proposed model shows superior performance over the existing models. We also experiment with domain transfer and model distillation techniques to see how the model performs in limited data conditions. Comment: 19 pages, Accepted at Natural Legal Language Processing Workshop, EMNLP 202

    Rhetorical relations for information retrieval

    Typically, every part of a coherent text has some plausible reason for its presence, some function that it performs within the overall semantics of the text. Rhetorical relations, e.g. contrast, cause, explanation, describe how the parts of a text are linked to each other. Knowledge about this so-called discourse structure has been applied successfully to several natural language processing tasks. This work studies the use of rhetorical relations for Information Retrieval (IR): Is there a correlation between certain rhetorical relations and retrieval performance? Can knowledge about a document's rhetorical relations be useful to IR? We present a language model modification that considers rhetorical relations when estimating the relevance of a document to a query. Empirical evaluation of different versions of our model on TREC settings shows that certain rhetorical relations can benefit retrieval effectiveness notably (>10% in mean average precision over a state-of-the-art baseline).

    Computing and Exploiting Document Structure to Improve Unsupervised Extractive Summarization of Legal Case Decisions

    Though many algorithms can be used to automatically summarize legal case decisions, most fail to incorporate domain knowledge about how important sentences in a legal decision relate to a representation of its document structure. For example, analysis of a legal case summarization dataset demonstrates that sentences serving different types of argumentative roles in the decision appear in different sections of the document. In this work, we propose an unsupervised graph-based ranking model that uses a reweighting algorithm to exploit properties of the document structure of legal case decisions. We also explore the impact of using different methods to compute the document structure. Results on the Canadian Legal Case Law dataset show that our proposed method outperforms several strong baselines. Comment: NLLP Workshop Camera Ready in EMNLP 202

    Thirty years of Artificial Intelligence and Law: the second decade

    The first issue of the Artificial Intelligence and Law journal was published in 1992. This paper provides commentaries on nine significant papers drawn from the Journal’s second decade. Four of the papers relate to reasoning with legal cases: introducing contextual considerations, predicting outcomes on the basis of natural language descriptions of the cases, comparing different ways of representing cases, and formalising precedential reasoning. One introduces a method of analysing arguments that was to become very widely used in AI and Law, namely argumentation schemes. Two relate to ontologies for the representation of legal concepts, and two take advantage of the increasing availability of legal corpora in this decade to automate document summarisation and the mining of arguments.

    Investigating the role of argumentation in the rhetorical analysis of scientific publications with neural multi-task learning models

    Exponential growth in the number of scientific publications creates a need for effective automatic analysis of rhetorical aspects of scientific writing. Acknowledging the argumentative nature of scientific text, in this work we investigate the link between the argumentative structure of scientific publications and rhetorical aspects such as discourse categories or citation contexts. To this end, we (1) augment a corpus of scientific publications annotated with four layers of rhetoric annotations with argumentation annotations and (2) investigate neural multi-task learning architectures combining argument extraction with a set of rhetorical classification tasks. By coupling rhetorical classifiers with the extraction of argumentative components in a joint multi-task learning setting, we obtain significant performance gains for different rhetorical analysis tasks.

    A Graph-Based Approach for the Summarization of Scientific Articles

    Automatic text summarization is one of the eminent applications in the field of Natural Language Processing. Text summarization is the process of generating a gist from text documents. The task is to produce a summary that contains important, diverse and coherent information, i.e., a summary should be self-contained. Approaches to text summarization are conventionally extractive: they select a subset of sentences from an input document for a summary. In this thesis, we introduce a novel graph-based extractive summarization approach. With the progressive advancement of research in the various fields of science, the summarization of scientific articles has become an essential requirement for researchers. This is our prime motivation in selecting scientific articles as our dataset. This newly formed dataset contains scientific articles from the PLOS Medicine journal, a high-impact journal in the field of biomedicine. The summarization of scientific articles is a single-document summarization task. It is a complex task for several reasons: one is that the important information in a scientific article is scattered throughout it, and another is that scientific articles contain a great deal of redundant information. In our approach, we deal with the three important factors of summarization: importance, non-redundancy and coherence. To deal with these factors, we use graphs, as they solve data sparsity problems and are computationally less complex. We exclusively employ a bipartite graph representation for the summarization task. We represent input documents through a bipartite graph that consists of sentence nodes and entity nodes. This bipartite graph representation contains entity transition information, which is beneficial for selecting the relevant sentences for a summary. We use a graph-based ranking algorithm to rank the sentences in a document.
    The ranks are treated as relevance scores of the sentences, which are used further in our approach. Scientific articles contain a considerable amount of redundant information; for example, the Introduction and Methodology sections contain similar information regarding the motivation and approach. In our approach, we ensure that the summary contains sentences that are non-redundant. Though the summary should contain the important and non-redundant information of the input document, its sentences should also be connected to one another so that it is coherent, understandable and simple to read. If we do not ensure that a summary is coherent, its sentences may not be properly connected, which leads to an obscure summary. Until now, only a few summarization approaches have taken care of coherence. In our approach, we take care of coherence in two different ways: by using a graph measure and by using structural information. We employ outdegree as the graph measure and coherence patterns for the structural information. We use integer programming as an optimization technique to select the best subset of sentences for a summary. The sentences are selected on the basis of relevance, diversity and coherence measures. The computation of these measures is tightly integrated and handled simultaneously. We use human judgements to evaluate the coherence of summaries. We compare ROUGE scores and human judgements of different systems on the PLOS Medicine dataset. Our approach performs considerably better than other systems on this dataset. We also apply our approach to the standard DUC 2002 dataset to compare the results with recent state-of-the-art systems. The results show that our graph-based approach outperforms other systems on DUC 2002. In conclusion, our approach is robust, i.e., it works on both scientific and news articles. Our approach has the further advantage of being semi-supervised.
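    The bipartite sentence-entity ranking described in this abstract can be illustrated with a minimal sketch. The abstract does not name the ranking algorithm, so the HITS-style mutual reinforcement below is an assumption, as are the function name `rank_sentences` and the input format (one set of entity strings per sentence, with entity extraction done elsewhere). It is not the thesis's implementation.

```python
import math


def rank_sentences(sentence_entities, iterations=50):
    """Score sentences on a bipartite sentence-entity graph.

    sentence_entities: list of sets; sentence_entities[i] holds the
    entities mentioned in sentence i (hypothetical input format).
    Returns one relevance score per sentence.
    """
    entities = sorted(set().union(*sentence_entities))
    sent_score = [1.0] * len(sentence_entities)

    for _ in range(iterations):
        # An entity is important if important sentences mention it.
        ent_score = {e: 0.0 for e in entities}
        for i, ents in enumerate(sentence_entities):
            for e in ents:
                ent_score[e] += sent_score[i]
        norm = math.sqrt(sum(v * v for v in ent_score.values())) or 1.0
        ent_score = {e: v / norm for e, v in ent_score.items()}

        # A sentence is important if it mentions important entities.
        sent_score = [sum(ent_score[e] for e in ents)
                      for ents in sentence_entities]
        norm = math.sqrt(sum(v * v for v in sent_score)) or 1.0
        sent_score = [v / norm for v in sent_score]

    return sent_score


# Toy document: sentence 0 shares an entity with sentence 1 and
# mentions two more; sentence 2 is disconnected from the rest.
scores = rank_sentences([{"a", "b", "c"}, {"a"}, {"d"}])
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
```

    In this toy input, the sentence that mentions the most shared entities receives the highest score. The redundancy and coherence terms, and the integer programming step that selects the final sentence subset, are outside the scope of this sketch.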

    Pre-training Transformers on Indian Legal Text

    Natural Language Processing in the legal domain has benefited hugely from the emergence of Transformer-based Pre-trained Language Models (PLMs) pre-trained on legal text. There exist PLMs trained over European and US legal text, most notably LegalBERT. However, with the rapidly increasing volume of NLP applications on Indian legal documents, and the distinguishing characteristics of Indian legal text, it has become necessary to pre-train LMs over Indian legal text as well. In this work, we introduce transformer-based PLMs pre-trained over a large corpus of Indian legal documents. We also apply these PLMs to several benchmark legal NLP tasks, over both Indian legal text and legal text belonging to other domains (countries). The NLP tasks with which we experiment include Legal Statute Identification from facts, Semantic segmentation of court judgements, and Court Judgement Prediction. Our experiments demonstrate the utility of the India-specific PLMs developed in this work.