1,328 research outputs found
Lynx: A knowledge-based AI service platform for content processing, enrichment and analysis for the legal domain
The EU-funded project Lynx focuses on the creation of a knowledge graph for the legal domain (Legal Knowledge Graph, LKG) and its use for the semantic processing, analysis and enrichment of documents from the legal domain. This article describes the use cases covered in the project, the platform developed, and the semantic analysis services that operate on the documents.
Interactive document summarisation.
This paper describes the Interactive Document Summariser (IDS), a dynamic document summarisation system, which can help users of digital libraries to access on-line documents more effectively. IDS provides dynamic control over summary characteristics, such as length and topic focus, so that changes made by the user are instantly reflected in an on-screen summary. A range of 'summary-in-context' views support seamless transitions between summaries and their source documents. IDS creates summaries by extracting keyphrases from a document with the Kea system, scoring sentences according to the keyphrases that they contain, and then extracting the highest-scoring sentences. We report an evaluation of IDS summaries, in which human assessors identified suitable summary sentences in source documents, against which IDS summaries were judged. We found that IDS summaries were better than baseline summaries, and we identify the characteristics of Kea keyphrases that lead to the best summaries.
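The extraction step the abstract describes can be sketched in a few lines: score each sentence by how many keyphrases it contains, then keep the top sentences in document order. This is a minimal sketch assuming a precomputed keyphrase list (in IDS these come from the Kea system) and naive sentence splitting; the keyphrases and document below are invented for illustration.

```python
def summarise(text, keyphrases, n_sentences=2):
    """Score each sentence by the keyphrases it contains and
    return the top-scoring sentences in document order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    scored = []
    for i, sent in enumerate(sentences):
        score = sum(1 for kp in keyphrases if kp.lower() in sent.lower())
        scored.append((score, i, sent))
    # keep the n highest-scoring sentences, then restore document order
    top = sorted(scored, key=lambda t: -t[0])[:n_sentences]
    return ". ".join(s for _, _, s in sorted(top, key=lambda t: t[1])) + "."

doc = ("Digital libraries hold many documents. Keyphrase extraction finds "
       "topical phrases. Summaries built from keyphrase-rich sentences help "
       "readers. The weather was pleasant that day.")
print(summarise(doc, ["keyphrase", "summaries", "digital libraries"], 2))
```

Restoring document order after selection matters: extracted sentences read far more coherently in their original sequence than ranked by score.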
Investigating Hallucinations in Pruned Large Language Models for Abstractive Summarization
Despite the remarkable performance of generative large language models (LLMs)
on abstractive summarization, they face two significant challenges: their
considerable size and tendency to hallucinate. Hallucinations are concerning
because they erode reliability and raise safety issues. Pruning is a technique
that reduces model size by removing redundant weights, enabling more efficient
sparse inference. Pruned models yield downstream task performance comparable to
the original, making them ideal alternatives when operating on a limited
budget. However, the effect that pruning has upon hallucinations in abstractive
summarization with LLMs has yet to be explored. In this paper, we provide an
extensive empirical study across five summarization datasets, two
state-of-the-art pruning methods, and five instruction-tuned LLMs.
Surprisingly, we find that hallucinations from pruned LLMs are less prevalent
than those from the original models. Our analysis suggests that pruned models tend to
depend more on the source document for summary generation. This leads to a
higher lexical overlap between the generated summary and the source document,
which could be a reason for the reduction in hallucination risk.
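The lexical overlap the authors point to can be made concrete with a simple token-level statistic: the fraction of summary tokens that also occur in the source document. This is a crude sketch for illustration only; the paper's actual metric may use n-grams or other variants, and the example strings below are invented.

```python
def lexical_overlap(summary, source):
    """Fraction of summary tokens that also appear in the source
    document (a crude proxy for how source-grounded a summary is)."""
    src_tokens = set(source.lower().split())
    toks = summary.lower().split()
    if not toks:
        return 0.0
    return sum(t in src_tokens for t in toks) / len(toks)

source = "the court ruled that the contract clause was invalid"
faithful = "the court ruled the clause invalid"
hallucinated = "the judge awarded damages yesterday"
print(lexical_overlap(faithful, source))      # high overlap
print(lexical_overlap(hallucinated, source))  # low overlap
```

Under this reading, a higher score means the summary copies more of its wording from the source, which is the behaviour the study associates with fewer hallucinations in pruned models.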
Bloated Disclosures: Can ChatGPT Help Investors Process Financial Information?
Generative AI tools such as ChatGPT can fundamentally change the way
investors process information. We probe the economic usefulness of these tools
in summarizing complex corporate disclosures using the stock market as a
laboratory. The unconstrained summaries are dramatically shorter, often by more
than 70% compared to the originals, whereas their information content is
amplified. When a document has a positive (negative) sentiment, its summary
becomes more positive (negative). More importantly, the summaries are more
effective at explaining stock market reactions to the disclosed information.
Motivated by these findings, we propose a measure of information "bloat." We
show that bloated disclosure is associated with adverse capital markets
consequences, such as lower price efficiency and higher information asymmetry.
Finally, we show that the model is effective at constructing targeted summaries
that identify firms' (non-)financial performance and risks. Collectively, our
results indicate that generative language modeling adds considerable value for
investors with information-processing constraints.
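The abstract does not define the "bloat" measure, so the following is only one plausible reading, stated here as an assumption: treat bloat as the share of the original disclosure that a faithful summary strips away. The function name and the 70% figure below echo the abstract's "often by more than 70%" observation.

```python
def information_bloat(doc_len, summary_len):
    """Hedged reading of 'bloat': the share of the original document
    that the summary removes. The paper's exact definition may differ."""
    return (doc_len - summary_len) / doc_len

# A 10,000-word disclosure condensed to 2,800 words:
print(information_bloat(10_000, 2_800))  # 0.72, i.e. the summary is 72% shorter
```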
Automated Semantic Analysis, Legal Assessment, and Summarization of Standard Form Contracts
Consumers are confronted with standard form contracts on a daily basis, for example, when shopping online, registering for online platforms, or opening bank accounts. With an expected revenue of more than 343 billion euros in 2020, e-commerce is an ever more important branch of the European economy. Accepting standard form contracts is often a prerequisite to access products or services, and consumers frequently do so without reading, let alone understanding, them. Consumer protection organizations can advise and represent consumers in such situations of power imbalance. However, with increasing demand, limited budgets, and ever more complex regulations, they struggle to provide the necessary support. This thesis investigates techniques for the automated semantic analysis, legal assessment, and summarization of standard form contracts in German and English, which can be used to support consumers and those who protect them. We focus on Terms and Conditions from the fast-growing market of European e-commerce, but also show that the developed techniques can in part be applied to other types of standard form contracts. We elicited requirements from consumers and consumer advocates to understand their needs, identified the most relevant clause topics, and analyzed the processes in consumer protection organizations concerning the handling of standard form contracts. Based on these insights, a pipeline for the automated semantic analysis, legal assessment, and summarization of standard form contracts was developed. The components of this pipeline can automatically identify and extract standard form contracts from the internet and hierarchically structure them into their individual clauses. Clause topics can be automatically identified, and relevant information can be extracted. Clauses can then be legally assessed, either using a knowledge base we constructed or through binary classification by a transformer model.
This information is then used to create summaries that are tailored to the needs of the different user groups. For each step of the pipeline, different approaches were developed and compared, from classical rule-based systems to deep learning techniques. Each approach was evaluated on German and English corpora containing more than 10,000 clauses, which were annotated as part of this thesis. The developed pipeline was prototypically implemented as part of a web-based tool to support consumer advocates in analyzing and assessing standard form contracts. The implementation was evaluated with experts from two German consumer protection organizations with questionnaires and task-based evaluations. The results of the evaluation show that our system can identify over 50 different types of clauses, which cover more than 90% of the clauses typically occurring in Terms and Conditions from online shops, with an accuracy of 0.80 to 0.84. The system can also automatically extract 21 relevant data points from these clauses with a precision of 0.91 and a recall of 0.86. On a corpus of more than 200 German clauses, the system was also able to assess the legality of clauses with an accuracy of 0.90. The expert evaluation has shown that the system is indeed able to support consumer advocates in their daily work by reducing the time they need to analyze and assess clauses in standard form contracts.
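The clause-topic identification step can be illustrated with the simplest of the approaches the thesis compares, a rule-based keyword matcher. The topics and keywords below are hypothetical examples, not the 50+ clause types or rules from the thesis itself.

```python
# Hypothetical keyword rules for a few clause topics; the thesis compares
# such classical rule-based systems against transformer classifiers.
CLAUSE_RULES = {
    "liability": ["liable", "liability", "damages"],
    "withdrawal": ["withdraw", "cancel", "return the goods"],
    "jurisdiction": ["governed by", "jurisdiction", "courts of"],
}

def classify_clause(clause):
    """Return every topic whose keywords appear in the clause text."""
    text = clause.lower()
    return [topic for topic, keywords in CLAUSE_RULES.items()
            if any(kw in text for kw in keywords)]

print(classify_clause("You may withdraw from this contract within 14 days."))
# → ['withdrawal']
```

A matcher like this is transparent and easy for domain experts to audit, which is one reason rule-based baselines remain worth comparing against learned classifiers.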
Bringing order into the realm of Transformer-based language models for artificial intelligence and law
Transformer-based language models (TLMs) have widely been recognized to be a
cutting-edge technology for the successful development of deep-learning-based
solutions to problems and applications that require natural language processing
and understanding. Like for other textual domains, TLMs have indeed pushed the
state-of-the-art of AI approaches for many tasks of interest in the legal
domain. Although the first Transformer model was proposed only about six years
ago, this technology has progressed at an unprecedented rate,
whereby BERT and related models represent a major reference, also in the legal
domain. This article provides the first systematic overview of TLM-based
methods for AI-driven problems and tasks in the legal sphere. A major goal is
to highlight research advances in this field so as to understand, on the one
hand, how the Transformers have contributed to the success of AI in supporting
legal processes, and on the other hand, what are the current limitations and
opportunities for further research development.
Comment: Please refer to the published version: Greco, C.M., Tagarelli, A. (2023) Bringing order into the realm of Transformer-based language models for artificial intelligence and law. Artif Intell Law, Springer Nature. November 2023. https://doi.org/10.1007/s10506-023-09374-
JURI SAYS: An Automatic Judgement Prediction System for the European Court of Human Rights
In this paper we present the web platform JURI SAYS, which automatically predicts decisions of the European Court of Human Rights based on communicated cases, which are published by the court early in the proceedings and are often available many years before the final decision is made. Our system therefore predicts future judgements of the court. The platform is available at jurisays.com and shows the predictions compared to the actual decisions of the court. It is automatically updated every month by including predictions for the new cases. Additionally, the system highlights the sentences and paragraphs that are most important for the prediction (i.e. violation vs. no violation of human rights).
A New Multilingual Authoring Tool of Semistructured Legal Documents
Current approaches to multilingual document management rely on human translation, machine translation (MT), and computer-assisted translation (CAT) to produce versions of a single document in several languages. However, recent advances in natural language generation (NLG) indicate that it is possible to implement language-independent systems that produce documents in several languages, independent of any source language, more efficiently and cost-effectively. In this article we present GenTur, an authoring support tool for producing tourist contracts in several languages. Particular attention is paid to two core elements of its implementation: on the one hand, the interlingua xgtling used for the discursive representation of the contracts, and on the other, the development of an architecture that allows this interlingua to generate tourist contracts by means of the GT-Mth generation algorithm.
Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities
With the advent of large language models, methods for abstractive
summarization have made great strides, creating potential for use in
applications to aid knowledge workers processing unwieldy document collections.
One such setting is the Civil Rights Litigation Clearinghouse (CRLC)
(https://clearinghouse.net), which posts information about large-scale civil
rights lawsuits, serving lawyers, scholars, and the general public. Today,
summarization in the CRLC requires extensive training of lawyers and law
students who spend hours per case understanding multiple relevant documents in
order to produce high-quality summaries of key events and outcomes. Motivated
by this ongoing real-world summarization effort, we introduce Multi-LexSum, a
collection of 9,280 expert-authored summaries drawn from ongoing CRLC writing.
Multi-LexSum presents a challenging multi-document summarization task given the
length of the source documents, often exceeding two hundred pages per case.
Furthermore, Multi-LexSum is distinct from other datasets in its multiple
target summaries, each at a different granularity (ranging from one-sentence
"extreme" summaries to multi-paragraph narrations of over five hundred words).
We present extensive analysis demonstrating that despite the high-quality
summaries in the training data (adhering to strict content and style
guidelines), state-of-the-art summarization models perform poorly on this task.
We release Multi-LexSum for further research in summarization methods as well
as to facilitate development of applications to assist in the CRLC's mission at
https://multilexsum.github.io.
Comment: 37 pages, 2 figures, 9 tables