27 research outputs found

    LEGAL-BERT: The Muppets straight out of law school

    BERT has achieved impressive performance in several NLP tasks. However, there has been limited investigation on its adaptation guidelines in specialised domains. Here we focus on the legal domain, where we explore several approaches for applying BERT models to downstream legal tasks, evaluating on multiple datasets. Our findings indicate that the previous guidelines for pre-training and fine-tuning, often blindly followed, do not always generalize well in the legal domain. Thus we propose a systematic investigation of the available strategies when applying BERT in specialised domains. These are: (a) use the original BERT out of the box, (b) adapt BERT by additional pre-training on domain-specific corpora, and (c) pre-train BERT from scratch on domain-specific corpora. We also propose a broader hyper-parameter search space when fine-tuning for downstream tasks and we release LEGAL-BERT, a family of BERT models intended to assist legal NLP research, computational law, and legal technology applications

    Neural legal judgment prediction in English

    Legal judgment prediction is the task of automatically predicting the outcome of a court case, given a text describing the case's facts. Previous work on using neural models for this task has focused on Chinese; only feature-based models (e.g., using bags of words and topics) have been considered in English. We release a new English legal judgment prediction dataset, containing cases from the European Court of Human Rights. We evaluate a broad variety of neural models on the new dataset, establishing strong baselines that surpass previous feature-based models in three tasks: (1) binary violation classification; (2) multi-label classification; (3) case importance prediction. We also explore if models are biased towards demographic information via data anonymization. As a side-product, we propose a hierarchical version of BERT, which bypasses BERT's length limitation

    An empirical study on large-scale multi-label text classification including few and zero-shot labels

    Large-scale Multi-label Text Classification (LMTC) has a wide range of Natural Language Processing (NLP) applications and presents interesting challenges. First, not all labels are well represented in the training set, due to the very large label set and the skewed label distributions of LMTC datasets. Also, label hierarchies and differences in human labelling guidelines may affect graph-aware annotation proximity. Finally, the label hierarchies are periodically updated, requiring LMTC models capable of zero-shot generalization. Current state-of-the-art LMTC models employ Label-Wise Attention Networks (LWANs), which (1) typically treat LMTC as flat multi-label classification; (2) may use the label hierarchy to improve zero-shot learning, although this practice is vastly understudied; and (3) have not been combined with pre-trained Transformers (e.g. BERT), which have led to state-of-the-art results in several NLP benchmarks. Here, for the first time, we empirically evaluate a battery of LMTC methods from vanilla LWANs to hierarchical classification approaches and transfer learning, on frequent, few, and zero-shot learning on three datasets from different domains. We show that hierarchical methods based on Probabilistic Label Trees (PLTs) outperform LWANs. Furthermore, we show that Transformer-based approaches outperform the state-of-the-art in two of the datasets, and we propose a new state-of-the-art method which combines BERT with LWANs. Finally, we propose new models that leverage the label hierarchy to improve few and zero-shot learning, considering on each dataset a graph-aware annotation proximity measure that we introduce

    Extreme multi-label legal text classification: a case study in EU legislation

    We consider the task of Extreme Multi-Label Text Classification (XMTC) in the legal domain. We release a new dataset of 57k legislative documents from EURLEX, the European Union’s public document database, annotated with concepts from EUROVOC, a multidisciplinary thesaurus. The dataset is substantially larger than previous EURLEX datasets and suitable for XMTC, few-shot and zero-shot learning. Experimenting with several neural classifiers, we show that BIGRUs with self-attention outperform the current multi-label state-of-the-art methods, which employ label-wise attention. Replacing CNNs with BIGRUs in label-wise attention networks leads to the best overall performance

    Paragraph-level rationale extraction through regularization: a case study on European Court of Human Rights cases

    Interpretability or explainability is an emerging research field in NLP. From a user-centric point of view, the goal is to build models that provide proper justification for their decisions, similar to those of humans, by requiring the models to satisfy additional constraints. To this end, we introduce a new application on legal text where, contrary to mainstream literature targeting word-level rationales, we conceive rationales as selected paragraphs in multi-paragraph structured court cases. We also release a new dataset comprising European Court of Human Rights cases, including annotations for paragraph-level rationales. We use this dataset to study the effect of already proposed rationale constraints, i.e., sparsity, continuity, and comprehensiveness, formulated as regularizers. Our findings indicate that some of these constraints are not beneficial in paragraph-level rationale extraction, while others need re-formulation to better handle the multi-label nature of the task we consider. We also introduce a new constraint, singularity, which further improves the quality of rationales, even compared with noisy rationale supervision. Experimental results indicate that the newly introduced task is very challenging and there is a large scope for further research
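
    As a rough illustration of how rationale constraints can be written as regularizers, the sketch below scores a paragraph-selection mask for sparsity and continuity. It assumes the commonly used formulations (fraction of selected units, and average change between neighbouring units); the abstract does not give the paper's exact paragraph-level definitions, so the function names and formulas here are illustrative assumptions only. Comprehensiveness and singularity are omitted because they depend on model outputs that are not described here.

```python
def sparsity_penalty(mask):
    """Fraction of paragraphs selected as rationale (assumed formulation).

    `mask` holds one value in [0, 1] per paragraph; a lower penalty
    means a shorter rationale.
    """
    if not mask:
        return 0.0
    return sum(mask) / len(mask)


def continuity_penalty(mask):
    """Average absolute change between neighbouring paragraphs (assumed).

    A lower penalty means the selected paragraphs form contiguous blocks
    rather than being scattered through the case.
    """
    if len(mask) < 2:
        return 0.0
    jumps = sum(abs(mask[i] - mask[i - 1]) for i in range(1, len(mask)))
    return jumps / (len(mask) - 1)


if __name__ == "__main__":
    # Hypothetical masks over four paragraphs, not taken from the dataset.
    contiguous = [1, 1, 0, 0]
    scattered = [1, 0, 1, 0]
    print(sparsity_penalty(contiguous), continuity_penalty(contiguous))
    print(sparsity_penalty(scattered), continuity_penalty(scattered))
```

    In training, such penalties would typically be added to the task loss with tunable weights; whether each one helps at the paragraph level is exactly what the abstract reports as an open question.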

    LexGLUE: a benchmark dataset for legal language understanding in English

    Law, interpretations of law, legal arguments, agreements, etc. are typically expressed in writing, leading to the production of vast corpora of legal text. Their analysis, which is at the center of legal practice, becomes increasingly elaborate as these collections grow in size. Natural language understanding (NLU) technologies can be a valuable tool to support legal practitioners in these endeavors. Their usefulness, however, largely depends on whether current state-of-the-art models can generalize across various tasks in the legal domain. To answer this currently open question, we introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks in a standardized way. We also provide an evaluation and analysis of several generic and legal-oriented models demonstrating that the latter consistently offer performance improvements across multiple tasks

    Which bills are lobbied? Predicting and interpreting lobbying activity in the US

    Using lobbying data from OpenSecrets.org, we offer several experiments applying machine learning techniques to predict if a piece of legislation (US bill) has been subjected to lobbying activities or not. We also investigate the influence of the intensity of the lobbying activity on how discernible a lobbied bill is from one that was not subject to lobbying. We compare the performance of a number of different models (logistic regression, random forest, CNN and LSTM) and text embedding representations (BOW, TF-IDF, GloVe, Law2Vec). We report ROC AUC scores above 0.85 and 78% accuracy. Model performance significantly improves (95% ROC AUC and 88% accuracy) when bills with higher lobbying intensity are looked at. We also propose a method that could be used for unlabelled data. Through this we show that there is a considerably large number of previously unlabelled US bills where our predictions suggest that some lobbying activity took place. We believe our method could potentially contribute to the enforcement of the US Lobbying Disclosure Act (LDA) by indicating the bills that were likely to have been affected by lobbying but were not filed as such

    Growth curve mixed nonlinear models in quails.

    Our aim was to evaluate the use and application of different nonlinear mixed models, and to compare them with nonlinear fixed models, for describing the growth curve of meat-type quails according to gender. A total of 15,002 records of males and 15,408 records of females were used. The body weights were regressed on age of the animals using nonlinear models (Brody, Gompertz, Logistic, Morgan-Mercer-Flodin, Richards and Von Bertalanffy). All model parameters were considered fixed, whereas parameters related to asymptotic weight and maturity rate were fitted as random effects. The Bayesian Information Criterion was used to find the model of best fit. For both genders, the model that used the Morgan-Mercer-Flodin function with the inclusion of asymptotic weight as a random effect was considered the best-fitting model because it reduced the residual variance and increased the accuracy. Based on the lower absolute growth rate and growth velocity of male quails compared to that of females, it can be inferred that males should be slaughtered later. The results of this study can contribute to current knowledge about animal yield, specifically regarding the best moment to slaughter, and, in this sense, improve the genetic quality of the populations over time
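
    To make the quantities in this abstract concrete, here is a minimal sketch of one of the listed growth functions (Gompertz), its absolute growth rate, and the Bayesian Information Criterion used for model comparison. The parameterization W(t) = a * exp(-b * exp(-k * t)) is a common textbook form and is an assumption here, since the abstract does not state which form the authors used. The parameter values in the example are hypothetical and not taken from the study, and the random effects of the mixed models are not represented.

```python
import math


def gompertz(t, a, b, k):
    """Gompertz curve W(t) = a * exp(-b * exp(-k * t)).

    a: asymptotic weight, b: integration constant, k: maturity rate.
    """
    return a * math.exp(-b * math.exp(-k * t))


def gompertz_growth_rate(t, a, b, k):
    """Absolute growth rate dW/dt = W(t) * b * k * exp(-k * t)."""
    return gompertz(t, a, b, k) * b * k * math.exp(-k * t)


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; the lower value indicates the better fit."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


if __name__ == "__main__":
    # Hypothetical parameters, for illustration only.
    a, b, k = 300.0, 4.0, 0.1
    for age in (0, 7, 14, 21, 28, 35):
        weight = gompertz(age, a, b, k)
        rate = gompertz_growth_rate(age, a, b, k)
        print(f"age={age:2d}  weight={weight:7.2f}  growth_rate={rate:5.2f}")
```

    Under this form the growth rate peaks at age ln(b) / k, where the weight equals a / e, which is one way to reason about a suitable slaughter age from fitted parameters.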

    Is the karyotype of neotropical boid snakes really conserved? Cytotaxonomy, chromosomal rearrangements and karyotype organization in the Boidae family

    Boids are primitive snakes from a basal lineage that is widely distributed in the Neotropical region. Many of these species are both morphologically and biogeographically divergent, and the relationship among some species remains uncertain even with evolutionary and phylogenetic studies being proposed for the group. For a better understanding of the evolutionary relationship between these snakes, we cytogenetically analysed 7 species and 3 subspecies of Neotropical snakes from the Boidae family using different chromosomal markers. The karyotypes of Boa constrictor occidentalis, Corallus hortulanus, Eunectes notaeus, Epicrates cenchria and Epicrates assisi are presented here for the first time, with redescriptions of the karyotypes of Boa constrictor constrictor, B. c. amarali, Eunectes murinus and Epicrates crassus. The three subspecies of Boa, two species of Eunectes and three species of Epicrates exhibit 2n = 36 chromosomes. In contrast, C. hortulanus presented a totally different karyotype composition for the Boidae family, showing 2n = 40 chromosomes with a greater number of macrochromosomes. Furthermore, chromosomal mapping of telomeric sequences revealed the presence of interstitial telomeric sites (ITSs) on many chromosomes in addition to the terminal markings on all chromosomes of all taxa analysed, with the exception of E. notaeus. Thus, we demonstrate that the karyotypes of these snakes are not as highly conserved as previously thought. Moreover, we provide an overview of the current cytotaxonomy of the group. © 2016 Viana et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

    Named entity recognition, linking and generation for Greek legislation

    We investigate named entity recognition in Greek legislation using state-of-the-art deep neural network architectures. The recognized entities are used to enrich the Greek legislation knowledge graph with more detailed information about persons, organizations, geopolitical entities, legislation references, geographical landmarks and public document references. We also interlink the textual references of the recognized entities to the corresponding entities represented in other open public datasets and, in this way, enable new sophisticated ways of querying Greek legislation. Relying on the results of the aforementioned methods, we generate and publish a new dataset of geographical landmarks mentioned in Greek legislation. We make publicly available all datasets and other resources used in our study. Our work is the first of its kind for the Greek language in such an extended form and one of the few that examines legal text in a full spectrum, for both entity recognition and linking. © 2018 The authors and IOS Press