Anonymity at Risk? Assessing Re-Identification Capabilities of Large Language Models
Anonymity of both natural and legal persons in court rulings is a critical
aspect of privacy protection in the European Union and Switzerland. With the
advent of LLMs, concerns about large-scale re-identification of anonymized
persons are growing. In accordance with the Federal Supreme Court of
Switzerland, we explore the potential of LLMs to re-identify individuals in
court rulings by constructing a proof-of-concept using actual legal data from
the Swiss Federal Supreme Court. Following the initial experiment, we
constructed an anonymized Wikipedia dataset as a more rigorous testing ground
to further investigate the findings. Introducing and applying the new task of
re-identifying people in texts, we also propose new metrics to measure
performance. We systematically analyze the factors that influence
successful re-identifications, identifying model size, input length, and
instruction tuning among the most critical determinants. Despite high
re-identification rates on Wikipedia, even the best LLMs struggled with court
decisions. The complexity is attributed to the lack of test datasets, the
necessity for substantial training resources, and data sparsity in the
information used for re-identification. In conclusion, this study demonstrates
that re-identification using LLMs may not be feasible for now, but as the
proof-of-concept on Wikipedia showed, it might become possible in the future.
We hope that our system can help enhance confidence in the security of
anonymized decisions, thus encouraging courts to publish more of them.
MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset
Sentence Boundary Detection (SBD) is one of the foundational building blocks
of Natural Language Processing (NLP), with incorrectly split sentences heavily
influencing the output quality of downstream tasks. It is a challenging task
for algorithms, especially in the legal domain, considering the complex and
varied sentence structures used. In this work, we curated a diverse
multilingual legal dataset consisting of over 130'000 annotated sentences in 6
languages. Our experimental results indicate that the performance of existing
SBD models is subpar on multilingual legal data. We trained and tested
monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers,
demonstrating state-of-the-art performance. We also show that our multilingual
models outperform all baselines in the zero-shot setting on a Portuguese test
set. To encourage further research and development by the community, we have
made our dataset, models, and code publicly available.
Comment: Accepted at ICAIL 202
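The difficulty that simple rule-based splitters face on legal text can be illustrated with a minimal sketch (the regex and the example sentence are ours, purely illustrative, not from the dataset):

```python
import re

# Naive splitter: break after ., !, or ? when followed by whitespace
# and an uppercase letter. Abbreviated legal citations defeat this heuristic,
# which is why trained SBD models are needed for legal text.
NAIVE_SBD = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

text = "The motion was dismissed under Fed. R. Civ. P. 12. The court awarded costs."
segments = NAIVE_SBD.split(text)
print(len(segments))  # 5 segments instead of the correct 2
```

Each period in the citation "Fed. R. Civ. P. 12." that precedes a capitalized token triggers a spurious split, fragmenting one sentence into four pieces.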
Resolving Legalese: A Multilingual Exploration of Negation Scope Resolution in Legal Documents
Resolving the scope of a negation within a sentence is a challenging NLP
task. The complexity of legal texts and the lack of annotated in-domain
negation corpora pose challenges for state-of-the-art (SotA) models when
performing negation scope resolution on multilingual legal data. Our
experiments demonstrate that models pre-trained without legal data underperform
on this task: language models fine-tuned exclusively on out-of-domain corpora
such as literary texts and medical data yield inferior results compared to
prior cross-domain experiments. We release a new set of annotated court decisions in
German, French, and Italian and use it to improve negation scope resolution in
both zero-shot and multilingual settings. We achieve token-level F1-scores of
up to 86.7% in our zero-shot cross-lingual experiments, where the models are
trained on two languages of our legal datasets and evaluated on the third. Our
multilingual experiments, where the models were trained on all available
negation data and evaluated on our legal datasets, resulted in F1-scores of up
to 91.1%.
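The token-level F1 metric reported here can be sketched as follows (the binary in-scope labels and the toy example are our own illustration, not the paper's evaluation code):

```python
def token_f1(gold, pred):
    """Token-level F1 over binary labels (1 = token inside the negation scope)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# "The contract is [not valid]": the last two tokens are in scope.
gold = [0, 0, 0, 1, 1]
pred = [0, 0, 1, 1, 1]  # the model over-predicts one extra token
print(round(token_f1(gold, pred), 2))  # → 0.8
```

Scoring per token, rather than requiring the whole scope span to match exactly, gives partial credit for near-miss predictions like the one above.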
Survey of Artificial Intelligence for Card Games and Its Application to the Swiss Game Jass
In recent decades, we have witnessed the successful application of
Artificial Intelligence to game playing. In this work we address the
challenging field of games with hidden information and card games in
particular. Jass is a very popular card game in Switzerland and is closely
connected with Swiss culture. To the best of our knowledge, Artificial
Intelligence agents do not yet outperform top human players in Jass. Our
contribution to the community is two-fold. First, we provide
an overview of the current state-of-the-art of Artificial Intelligence methods
for card games in general. Second, we discuss their application to the use-case
of the Swiss card game Jass. This paper aims to be an entry point for both
seasoned researchers and new practitioners who want to join in the Jass
challenge.
ClassActionPrediction: A Challenging Benchmark for Legal Judgment Prediction of Class Action Cases in the US
The research field of Legal Natural Language Processing (NLP) has been very
active recently, with Legal Judgment Prediction (LJP) becoming one of the most
extensively studied tasks. To date, most publicly released LJP datasets
originate from countries with civil law. In this work, we release, for the
first time, a challenging LJP dataset focused on class action cases in the US.
It is the first dataset in the common law system that focuses on the harder and
more realistic task of using the complaints as input instead of the often-used
facts summary written by the court. Additionally, we study the difficulty of
the task by collecting expert human predictions, showing that even human
experts can only reach 53% accuracy on this dataset. Our Longformer model,
reaching 63% accuracy despite only considering the first 2,048 tokens, clearly
outperforms the human baseline. Furthermore, we perform a detailed error analysis and find
that the Longformer model is significantly better calibrated than the human
experts. Finally, we publicly release the dataset and the code used for the
experiments.
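"Calibration" here means how well a predictor's confidence matches its actual accuracy. A standard way to quantify it is Expected Calibration Error; the sketch below is a generic illustration of that metric, not the paper's actual analysis code:

```python
def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error: weighted average of |accuracy - confidence|
    over equal-width confidence bins."""
    n = len(confidences)
    total = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        total += len(idx) / n * abs(acc - conf)
    return total

# Perfectly calibrated toy case: 75% confidence, 3 of 4 predictions correct.
print(ece([0.75, 0.75, 0.75, 0.75], [1, 1, 1, 0]))  # → 0.0
```

A well-calibrated model that says "90% confident" should be right about 90% of the time; lower ECE means the stated confidences can be trusted more.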
SCALE: Scaling up the Complexity for Advanced Language Model Evaluation
Recent strides in Large Language Models (LLMs) have saturated many NLP
benchmarks (even professional domain-specific ones), emphasizing the need for
novel, more challenging ones to properly assess LLM capabilities. In this
paper, we introduce a novel NLP benchmark that poses challenges to current LLMs
across four key dimensions: processing long documents (up to 50K tokens),
utilizing domain-specific knowledge (embodied in legal texts), multilingual
understanding (covering five languages), and multitasking (comprising legal
document-to-document Information Retrieval, Court View Generation, Leading
Decision Summarization, Citation Extraction, and eight challenging Text
Classification tasks). Our benchmark comprises diverse legal NLP datasets from
the Swiss legal system, allowing for a comprehensive study of the underlying
Non-English, inherently multilingual, federal legal system. Despite recent
advances, efficiently processing long documents for intense review/analysis
tasks remains an open challenge for language models. Also, comprehensive,
domain-specific benchmarks requiring high expertise to develop are rare, as are
multilingual benchmarks. This scarcity underscores our contribution's value,
considering most public models are trained predominantly on English corpora,
while other languages remain understudied, particularly for practical
domain-specific NLP tasks. Our benchmark allows for testing and advancing
state-of-the-art LLMs. As part of our study, we evaluate several pre-trained
multilingual language models on our benchmark to establish strong baselines as
a point of reference. Despite the large size of our datasets (tens to hundreds
of thousands of examples), existing publicly available models struggle with
most tasks, even after in-domain pretraining. We publish all resources
(benchmark suite, pre-trained models, code) under a fully permissive open CC
BY-SA license.
Maintenance of leaf N controls the photosynthetic CO₂ response of grassland species exposed to 9 years of free-air CO₂ enrichment
Determining the underlying physiological patterns governing plant productivity and diversity in grasslands is critical to evaluating species responses to future environmental conditions of elevated CO₂ and nitrogen (N) deposition. In a 9-year experiment, N was added to monocultures of seven C₃ grassland species exposed to elevated atmospheric CO₂ (560 μmol CO₂ mol⁻¹) to evaluate how N addition affects CO₂ responsiveness in species of contrasting functional groups. Functional groups differed in their responses to elevated CO₂ and N treatments. Forb species exhibited strong down-regulation of leaf N mass concentrations (−26%) and photosynthetic capacity (−28%) in response to elevated CO₂, especially at high N supply, whereas C₃ grasses did not. Hence, achieved photosynthetic performance was markedly enhanced for C₃ grasses (+68%) in elevated CO₂, but not significantly for forbs. Differences in access to soil resources between forbs and grasses may distinguish their responses to elevated CO₂ and N addition. Forbs had lesser root biomass, a lower distribution of biomass to roots, and lower specific root length than grasses. Maintenance of leaf N, possibly through increased root foraging in this nutrient-poor grassland, was necessary to sustain stimulation of photosynthesis under long-term elevated CO₂. Dilution of leaf N and associated photosynthetic down-regulation in forbs under elevated [CO₂], relative to the C₃ grasses, illustrates the potential for shifts in species composition and diversity in grassland ecosystems that have significant forb and grass components.
LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models
The advent of large language models (LLMs) and their adoption by the legal
community has given rise to the question: what types of legal reasoning can
LLMs perform? To enable greater study of this question, we present LegalBench:
a collaboratively constructed legal reasoning benchmark consisting of 162 tasks
covering six different types of legal reasoning. LegalBench was built through
an interdisciplinary process, in which we collected tasks designed and
hand-crafted by legal professionals. Because these subject matter experts took
a leading role in construction, tasks either measure legal reasoning
capabilities that are practically useful, or measure reasoning skills that
lawyers find interesting. To enable cross-disciplinary conversations about LLMs
in the law, we additionally show how popular legal frameworks for describing
legal reasoning -- which distinguish between its many forms -- correspond to
LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary.
This paper describes LegalBench, presents an empirical evaluation of 20
open-source and commercial LLMs, and illustrates the types of research
explorations LegalBench enables.
Comment: 143 pages, 79 tables, 4 figures
Re-Identification in Court Rulings with Simap Data
The digital transformation is gradually reaching more and more areas of the judiciary. Many courts already publish their rulings online in anonymized form. At the same time, technical tools that can also be used to re-identify these rulings are becoming ever more powerful and sophisticated. In the present study, in the field of public procurement, a comparatively simple string matching with Simap project numbers achieved a re-identification of parties to proceedings of up to 81.2 percent.
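The core idea, as described, is a comparatively simple string match on project numbers that appear both in anonymized rulings and in public tender records. A hypothetical sketch follows; the six-digit ID format, the field names, and the toy data are our assumptions, not the study's actual pattern or corpus:

```python
import re

# Assumed format of a Simap project number for illustration only.
PROJECT_ID = re.compile(r"\b\d{6}\b")

def match_rulings_to_tenders(rulings, tenders):
    """Link each ruling to the bidders of tenders sharing a project number."""
    # Index tenders by every project number they mention.
    index = {}
    for tender in tenders:
        for pid in PROJECT_ID.findall(tender["text"]):
            index.setdefault(pid, []).append(tender["bidder"])
    # A ruling citing an indexed project number is matched to those bidders.
    matches = {}
    for ruling in rulings:
        for pid in PROJECT_ID.findall(ruling["text"]):
            if pid in index:
                matches.setdefault(ruling["id"], set()).update(index[pid])
    return matches

rulings = [{"id": "ruling_1", "text": "The award in tender 123456 was contested."}]
tenders = [{"bidder": "Acme AG", "text": "Project 123456 awarded after evaluation."}]
print(match_rulings_to_tenders(rulings, tenders))  # → {'ruling_1': {'Acme AG'}}
```

Because the identifier survives anonymization verbatim, no model is needed at all: a single exact-match join can re-link a redacted ruling to the named parties in the public record.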