NLP Driven Models for Automatically Generating Survey Articles for Scientific Topics.
This thesis presents new methods that use natural language processing (NLP) driven models for summarizing research in scientific fields. Given a topic query in the form of a text string, we present methods for finding research articles relevant to the topic, as well as summarization algorithms that use lexical and discourse information present in the text of these articles to generate coherent and readable extractive summaries of past research on the topic. In addition to summarizing prior research, good survey articles should also forecast future trends. With this motivation, we present work on forecasting the future impact of scientific publications using NLP-driven features.
PhD thesis, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies.
http://deepblue.lib.umich.edu/bitstream/2027.42/113407/1/rahuljha_1.pd
Semantic Segmentation of Legal Documents via Rhetorical Roles
Legal documents are unstructured, use legal jargon, and are often very long, making them difficult to process automatically via conventional text processing techniques. A legal document processing system would benefit substantially if the documents could be segmented into coherent information units. This paper proposes a new corpus of legal documents annotated (with the help of legal experts) with a set of 13 semantically coherent unit labels (referred to as Rhetorical Roles), e.g., facts, arguments, statute, issue, precedent, ruling, and ratio. We perform a thorough analysis of the corpus and the annotations. For automatically segmenting the legal documents, we experiment with the task of rhetorical role prediction: given a document, predict the text segments corresponding to the various roles. Using the created corpus, we experiment extensively with various deep learning-based baseline models for the task. Further, we develop a multitask learning (MTL) based deep model, with document rhetorical role label shift as an auxiliary task, for segmenting a legal document (a toy sketch of these auxiliary targets follows this entry). The proposed model shows superior performance over the existing models. We also evaluate the model under domain transfer and apply model distillation techniques to assess its performance in limited-data conditions.
Comment: 19 pages. Accepted at the Natural Legal Language Processing Workshop, EMNLP 202
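The auxiliary label-shift task lends itself to a compact illustration. Below is a minimal, hypothetical Python sketch of how per-sentence shift targets could be derived from rhetorical role labels; the convention of marking the first sentence as a shift is an assumption for illustration, not taken from the paper.

from typing import List

def label_shift_targets(roles: List[str]) -> List[int]:
    # Return 1 where a sentence starts a new rhetorical role segment, else 0.
    # Marking the first sentence as a segment start is an assumed convention.
    return [1 if i == 0 or roles[i] != roles[i - 1] else 0
            for i in range(len(roles))]

# Example: a document whose sentences move from facts to arguments to ruling.
roles = ["facts", "facts", "arguments", "arguments", "arguments", "ruling"]
print(label_shift_targets(roles))  # [1, 0, 1, 0, 0, 1]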
Rhetorical relations for information retrieval
Typically, every part of a coherent text has some plausible reason for its presence, some function that it performs for the overall semantics of the text. Rhetorical relations, e.g. contrast, cause, explanation, describe how the parts of a text are linked to each other. Knowledge about this so-called discourse structure has been applied successfully to several natural language processing tasks. This work studies the use of rhetorical relations for Information Retrieval (IR): Is there a correlation between certain rhetorical relations and retrieval performance? Can knowledge about a document's rhetorical relations be useful to IR? We present a language model modification that considers rhetorical relations when estimating the relevance of a document to a query (sketched below). Empirical evaluation of different versions of our model in TREC settings shows that certain rhetorical relations can benefit retrieval effectiveness notably (> 10% in mean average precision over a state-of-the-art baseline).
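To make the idea concrete, here is an illustrative Python sketch of a Dirichlet-smoothed query-likelihood scorer whose term counts are reweighted by the rhetorical relation of the span they occur in. The relation weights and the exact weighting scheme are assumptions for illustration, not the paper's actual model.

import math
from collections import Counter

# Hypothetical weights: spans in "useful" relations count slightly more.
RELATION_WEIGHT = {"contrast": 1.2, "cause": 1.1, "explanation": 1.1}

def score(query_terms, doc_spans, collection_prob, mu=2000.0):
    # doc_spans: list of (relation, tokens) pairs for one document.
    weighted_tf = Counter()
    doc_len = 0.0
    for relation, tokens in doc_spans:
        w = RELATION_WEIGHT.get(relation, 1.0)
        for t in tokens:
            weighted_tf[t] += w
        doc_len += w * len(tokens)
    # Dirichlet-smoothed log query likelihood over the reweighted counts.
    s = 0.0
    for t in query_terms:
        p = (weighted_tf[t] + mu * collection_prob.get(t, 1e-9)) / (doc_len + mu)
        s += math.log(p)
    return s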
Computing and Exploiting Document Structure to Improve Unsupervised Extractive Summarization of Legal Case Decisions
Though many algorithms can be used to automatically summarize legal case decisions, most fail to incorporate domain knowledge about how important sentences in a legal decision relate to a representation of its document structure. For example, analysis of a legal case summarization dataset demonstrates that sentences serving different types of argumentative roles in the decision appear in different sections of the document. In this work, we propose an unsupervised graph-based ranking model that uses a reweighting algorithm to exploit properties of the document structure of legal case decisions (a toy reweighting sketch follows this entry). We also explore the impact of using different methods to compute the document structure. Results on the Canadian Legal Case Law dataset show that our proposed method outperforms several strong baselines.
Comment: NLLP Workshop Camera Ready in EMNLP 202
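As a rough illustration of structure-aware reweighting (not the paper's actual algorithm), the sketch below biases a personalized PageRank over a sentence-similarity graph toward sentences from certain sections, using the networkx library; the section weights and similarity inputs are hypothetical.

import networkx as nx

def rank_sentences(similarities, sections, section_weight):
    # similarities: dict {(i, j): sim}; sections: section label per sentence.
    g = nx.Graph()
    g.add_nodes_from(range(len(sections)))
    for (i, j), sim in similarities.items():
        if sim > 0:
            g.add_edge(i, j, weight=sim)
    # Bias the random walk toward sentences in, e.g., argumentative sections.
    personalization = {i: section_weight.get(sections[i], 1.0)
                       for i in range(len(sections))}
    return nx.pagerank(g, alpha=0.85, personalization=personalization,
                       weight="weight")

# Hypothetical usage: upweight the "analysis" section of a decision.
scores = rank_sentences({(0, 1): 0.4, (1, 2): 0.7},
                        ["facts", "analysis", "analysis"],
                        {"analysis": 2.0})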
Thirty years of Artificial Intelligence and Law: the second decade
The first issue of the Artificial Intelligence and Law journal was published in 1992. This paper provides commentaries on nine significant papers drawn from the Journal's second decade. Four of the papers relate to reasoning with legal cases: introducing contextual considerations, predicting outcomes on the basis of natural language descriptions of the cases, comparing different ways of representing cases, and formalising precedential reasoning. One introduces a method of analysing arguments that was to become very widely used in AI and Law, namely argumentation schemes. Two relate to ontologies for the representation of legal concepts, and two take advantage of the increasing availability of legal corpora in this decade, to automate document summarisation and to mine arguments.
Investigating the role of argumentation in the rhetorical analysis of scientific publications with neural multi-task learning models
Exponential growth in the number of scientific publications creates the need for effective automatic analysis of rhetorical aspects of scientific writing. Acknowledging the argumentative nature of scientific text, in this work we investigate the link between the argumentative structure of scientific publications and rhetorical aspects such as discourse categories or citation contexts. To this end, we (1) augment a corpus of scientific publications annotated with four layers of rhetoric annotations with argumentation annotations and (2) investigate neural multi-task learning architectures combining argument extraction with a set of rhetorical classification tasks. By coupling rhetorical classifiers with the extraction of argumentative components in a joint multi-task learning setting, we obtain significant performance gains for different rhetorical analysis tasks (a schematic of such a joint setup follows below).
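A joint setup of this kind is often realized as a shared encoder with task-specific heads whose losses are summed. The PyTorch sketch below is schematic: the dimensions, pooling, and unweighted loss sum are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, emb_dim=300, hidden=256, n_arg_tags=5, n_rhet_classes=4):
        super().__init__()
        # Shared encoder over token embeddings.
        self.encoder = nn.LSTM(input_size=emb_dim, hidden_size=hidden,
                               batch_first=True, bidirectional=True)
        self.arg_head = nn.Linear(2 * hidden, n_arg_tags)       # token-level BIO tags
        self.rhet_head = nn.Linear(2 * hidden, n_rhet_classes)  # sentence-level class

    def forward(self, embeddings):
        states, _ = self.encoder(embeddings)              # (batch, seq, 2*hidden)
        arg_logits = self.arg_head(states)                # per-token predictions
        rhet_logits = self.rhet_head(states.mean(dim=1))  # mean-pooled per sentence
        return arg_logits, rhet_logits

# Joint training would sum the two cross-entropy losses; an unweighted sum
# is assumed here:
#   loss = ce_token(arg_logits, arg_tags) + ce_sent(rhet_logits, rhet_labels)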
A Graph-Based Approach for the Summarization of Scientific Articles
Automatic text summarization is one of the prominent applications in the field of Natural Language Processing. Text summarization is the process of generating a gist from text documents. The task is to produce a summary which contains important, diverse and coherent information, i.e., a summary should be self-contained. The approaches to text summarization are conventionally extractive: they select a subset of sentences from an input document to form a summary. In this thesis, we introduce a novel graph-based extractive summarization approach.
With the progressive advancement of research in the various fields of science, the summarization of scientific articles has become an essential requirement for researchers. This is our prime motivation for selecting scientific articles as our dataset. This newly formed dataset contains scientific articles from the PLOS Medicine journal, a high-impact journal in the field of biomedicine. The summarization of scientific articles is a single-document summarization task. It is complex for several reasons: the important information in a scientific article is scattered throughout it, and scientific articles contain much redundant information. In our approach, we deal with three important factors of summarization: importance, non-redundancy and coherence. To handle these factors we use graphs, as they mitigate data sparsity problems and are computationally inexpensive.
We employ a bipartite graphical representation for the summarization task. We represent an input document as a bipartite graph consisting of sentence nodes and entity nodes. This representation captures entity transition information, which is beneficial for selecting the relevant sentences for a summary. We use a graph-based ranking algorithm to rank the sentences in a document; the resulting ranks serve as relevance scores for the sentences and are used in the later stages of our approach (a minimal ranking sketch follows).
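One simple way to rank on such a bipartite graph is a HITS-style mutual-reinforcement iteration between sentence and entity scores. The sketch below is a minimal stand-in under that assumption; the thesis's actual ranking algorithm may differ.

import numpy as np

def bipartite_rank(adjacency: np.ndarray, iters: int = 50) -> np.ndarray:
    # adjacency[i, k] = 1 if sentence i mentions entity k.
    n_sents, n_ents = adjacency.shape
    sent_scores = np.ones(n_sents)
    ent_scores = np.ones(n_ents)
    for _ in range(iters):
        sent_scores = adjacency @ ent_scores    # sentences score via their entities
        ent_scores = adjacency.T @ sent_scores  # entities score via their sentences
        sent_scores /= np.linalg.norm(sent_scores) or 1.0
        ent_scores /= np.linalg.norm(ent_scores) or 1.0
    return sent_scores  # used as sentence relevance scores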
Scientific articles contain a considerable amount of redundant information; for example, the Introduction and Methodology sections contain similar information regarding the motivation and the approach. Our approach therefore ensures that the summary contains non-redundant sentences.
Though a summary should contain the important, non-redundant information of the input document, its sentences should also be connected to one another so that the summary is coherent, understandable and simple to read. If coherence is not ensured, the sentences of a summary may not be properly connected, which leads to an obscure summary. Until now, only a few summarization approaches have taken coherence into account. Our approach addresses coherence in two ways: through a graph measure and through structural information. We employ outdegree as the graph measure and coherence patterns as the structural information (a toy illustration of the outdegree idea follows).
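As a toy illustration of the outdegree idea, assume the bipartite graph is projected onto sentences, linking two sentences when they share an entity; a sentence's degree in this projection then reflects how well it connects to the rest of the document. The projection itself is an assumption for illustration.

from collections import defaultdict

def sentence_outdegree(sentence_entities):
    # sentence_entities: list of entity sets, one per sentence.
    degree = defaultdict(int)
    n = len(sentence_entities)
    for i in range(n):
        for j in range(n):
            # Link i -> j when the two sentences share at least one entity.
            if i != j and sentence_entities[i] & sentence_entities[j]:
                degree[i] += 1
    return degree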
We use integer programming as an optimization technique to select the best subset of sentences for a summary. Sentences are selected on the basis of relevance, diversity and coherence measures, whose computation is tightly integrated and handled simultaneously (a simplified sketch of such a selection ILP follows).
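A drastically simplified version of such a selection ILP can be written with the PuLP library. The sketch below maximizes relevance under a length budget with a linearized pairwise-redundancy penalty; the thesis's actual program also encodes coherence, which is omitted here, and all inputs are hypothetical.

import pulp

def select_sentences(relevance, lengths, redundancy, budget):
    # relevance: score per sentence; redundancy: dict {(i, j): penalty} for i < j.
    n = len(relevance)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    x = pulp.LpVariable.dicts("x", range(n), cat="Binary")  # sentence i selected
    y = pulp.LpVariable.dicts("y", pairs, cat="Binary")     # i and j both selected
    prob = pulp.LpProblem("summary_selection", pulp.LpMaximize)
    # Objective: total relevance minus pairwise redundancy of co-selected pairs.
    prob += (pulp.lpSum(relevance[i] * x[i] for i in range(n))
             - pulp.lpSum(redundancy.get((i, j), 0.0) * y[(i, j)]
                          for i, j in pairs))
    # Length budget on the selected sentences.
    prob += pulp.lpSum(lengths[i] * x[i] for i in range(n)) <= budget
    for i, j in pairs:
        prob += y[(i, j)] >= x[i] + x[j] - 1  # linearization of x_i * x_j
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [i for i in range(n) if x[i].value() == 1]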
We use human judgements to evaluate the coherence of summaries. We compare ROUGE scores and human judgements of different systems on the PLOS Medicine dataset, where our approach performs considerably better than the other systems. We also apply our approach to the standard DUC 2002 dataset to compare against recent state-of-the-art systems; the results show that our graph-based approach outperforms them on DUC 2002 as well. In conclusion, our approach is robust, i.e., it works on both scientific and news articles, and it has the further advantage of being semi-supervised.
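For reference, ROUGE comparisons of this kind can be computed with the rouge-score package; this is a common choice, and the thesis may have used a different implementation.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
# Placeholder texts; in practice these are the gold and system summaries.
scores = scorer.score(
    target="the reference summary written by a human",
    prediction="the system summary produced by the graph-based approach",
)
print(scores["rouge2"].fmeasure)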
Pre-training Transformers on Indian Legal Text
Natural Language Processing in the legal domain has benefited hugely from the emergence of Transformer-based Pre-trained Language Models (PLMs) pre-trained on legal text. There exist PLMs trained over European and US legal text, most notably LegalBERT. However, with the rapidly increasing volume of NLP applications on Indian legal documents, and the distinguishing characteristics of Indian legal text, it has become necessary to pre-train LMs over Indian legal text as well. In this work, we introduce transformer-based PLMs pre-trained over a large corpus of Indian legal documents (a minimal pre-training sketch follows this entry). We also apply these PLMs to several benchmark legal NLP tasks, over both Indian legal text and legal text belonging to other domains (countries). The NLP tasks with which we experiment include Legal Statute Identification from facts, semantic segmentation of court judgements, and Court Judgement Prediction. Our experiments demonstrate the utility of the India-specific PLMs developed in this work.
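For illustration, domain-adaptive masked-language-model pre-training of this kind is commonly run with the Hugging Face transformers Trainer. The sketch below uses a generic BERT checkpoint and a placeholder corpus file; the model choice, data path, and hyperparameters are assumptions, not the authors' setup.

from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical plain-text corpus of Indian legal documents, one passage per line.
raw = load_dataset("text", data_files={"train": "indian_legal_corpus.txt"})
tokenized = raw["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="legal-plm",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=tokenized,
    # Randomly masks 15% of tokens for the MLM objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()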