Semantic Segmentation of Legal Documents via Rhetorical Roles
Legal documents are unstructured, use legal jargon, and have considerable
length, making them difficult to process automatically via conventional text
processing techniques. A legal document processing system would benefit
substantially if the documents could be segmented into coherent information
units. This paper proposes a new corpus of legal documents annotated (with the
help of legal experts) with a set of 13 semantically coherent unit labels
(referred to as Rhetorical Roles), e.g., facts, arguments, statute, issue,
precedent, ruling, and ratio. We perform a thorough analysis of the corpus and
the annotations. For automatically segmenting the legal documents, we
experiment with the task of rhetorical role prediction: given a document,
predict the text segments corresponding to various roles. Using the created
corpus, we experiment extensively with various deep learning-based baseline
models for the task. Further, we develop a multitask learning (MTL) based deep
model with document rhetorical role label shift as an auxiliary task for
segmenting a legal document. The proposed model shows superior performance over
the existing models. We also experiment with domain transfer and model
distillation techniques to assess model performance in limited-data
conditions.
Comment: 19 pages, Accepted at Natural Legal Language Processing Workshop,
EMNLP 202
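The segmentation task described above can be viewed as sentence-level labeling, with the auxiliary label-shift task marking where the rhetorical role changes. The sketch below is illustrative only (not the paper's implementation); the role names are examples drawn from the label set mentioned in the abstract, and the helper functions are hypothetical.

```python
# Minimal sketch of the two task targets, assuming per-sentence role labels.
# Role names ("facts", "issue", "ruling") are illustrative examples only.

def label_shift_targets(roles):
    """Auxiliary-task targets: 1 where the rhetorical role changes
    between consecutive sentences, else 0 (first sentence gets 0)."""
    return [0] + [int(a != b) for a, b in zip(roles, roles[1:])]

def roles_to_segments(roles):
    """Collapse per-sentence role labels into (role, start, end) spans,
    i.e., the coherent units a segmenter should recover."""
    segments = []
    start = 0
    for i in range(1, len(roles) + 1):
        if i == len(roles) or roles[i] != roles[i - 1]:
            segments.append((roles[start], start, i - 1))
            start = i
    return segments

roles = ["facts", "facts", "issue", "ruling", "ruling"]
print(label_shift_targets(roles))  # [0, 0, 1, 1, 0]
print(roles_to_segments(roles))    # [('facts', 0, 1), ('issue', 2, 2), ('ruling', 3, 4)]
```

In a multitask setup, a shared encoder would predict both the role sequence and these shift indicators; the shift signal encourages the model to place segment boundaries consistently.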
Pre-training Transformers on Indian Legal Text
Natural Language Processing in the legal domain has benefited hugely from the
emergence of Transformer-based Pre-trained Language Models (PLMs) pre-trained
on legal text. There exist PLMs trained over European and US legal text, most
notably LegalBERT. However, with the rapidly increasing volume of NLP
applications on Indian legal documents, and the distinguishing characteristics
of Indian legal text, it has become necessary to pre-train LMs over Indian
legal text as well. In this work, we introduce transformer-based PLMs
pre-trained over a large corpus of Indian legal documents. We also apply these
PLMs to several benchmark legal NLP tasks, both on Indian legal text and on
legal text from other domains (countries). The NLP tasks
with which we experiment include Legal Statute Identification from facts,
Semantic segmentation of court judgements, and Court Judgement Prediction. Our
experiments demonstrate the utility of the India-specific PLMs developed in
this work.
Thematic Annotation: extracting concepts out of documents
Contrary to standard approaches to topic annotation, the technique used in
this work does not centrally rely on some sort of -- possibly statistical --
keyword extraction. In fact, the proposed annotation algorithm uses a large
scale semantic database -- the EDR Electronic Dictionary -- that provides a
concept hierarchy based on hyponym and hypernym relations. This concept
hierarchy is used to generate a synthetic representation of the document by
aggregating the words present in topically homogeneous document segments into a
set of concepts best preserving the document's content.
This new extraction technique uses an unexplored approach to topic selection.
Instead of using semantic similarity measures based on a semantic resource, the
latter is processed to extract the part of the conceptual hierarchy relevant to
the document content. Then this conceptual hierarchy is searched to extract the
most relevant set of concepts to represent the topics discussed in the
document. Notice that this algorithm is able to extract generic concepts that
are not directly present in the document.
Comment: Technical report EPFL/LIA. 81 pages, 16 figures
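The aggregation step described above can be sketched with a toy hypernym hierarchy. This is a hand-made stand-in for the EDR Electronic Dictionary, and the counting heuristic is an assumption for illustration, not the report's actual concept-selection algorithm.

```python
# Toy sketch of hypernym-based concept aggregation. The PARENT map is a
# hypothetical miniature hierarchy standing in for the EDR dictionary.

PARENT = {                         # child -> hypernym (None marks the root)
    "dog": "mammal", "cat": "mammal", "mammal": "animal",
    "sparrow": "bird", "bird": "animal", "animal": None,
}

def ancestors(concept):
    """Yield the concept and all of its hypernyms up to the root."""
    while concept is not None:
        yield concept
        concept = PARENT.get(concept)

def aggregate(words):
    """Count how many document words each concept (directly or via its
    hyponyms) covers -- a crude proxy for selecting a small set of
    concepts that best preserves the document's content."""
    counts = {}
    for w in words:
        for c in ancestors(w):
            counts[c] = counts.get(c, 0) + 1
    return counts

counts = aggregate(["dog", "cat", "sparrow"])
# "mammal" covers dog and cat (count 2); "animal" covers all three words,
# illustrating how a generic concept absent from the text can surface.
```

Picking concepts with high coverage but low generality (here, preferring "mammal" over "animal" for a dog/cat segment) mirrors the trade-off between specificity and coverage that such a selection must balance.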