Towards an Open Platform for Legal Information
Recent advances in the area of legal information systems have led to a
variety of applications that promise support in processing and accessing legal
documents. Unfortunately, these applications have various limitations, e.g.,
regarding scope or extensibility. Furthermore, we do not observe a trend
towards open access in digital libraries in the legal domain as we observe in
other domains, e.g., economics or computer science. To improve open access in
the legal domain, we present our approach for an open source platform to
transparently process and access Legal Open Data. This enables the sustainable
development of legal applications by offering a single technology stack.
Moreover, the approach facilitates the development and deployment of new
technologies. As proof of concept, we implemented six technologies and
generated metadata for more than 250,000 German laws and court decisions. Thus,
we can give users of our platform access not only to legal documents but also
to the information they contain.
Comment: Accepted at ACM/IEEE Joint Conference on Digital Libraries (JCDL) 202
Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning
Most Transformer language models are primarily pretrained on English text,
limiting their use for other languages. As model sizes grow, the performance
gap between English and languages with fewer compute and data resources widens
even further. Consequently, more resource-efficient
training methods are needed to bridge the gap for languages with fewer
resources available. To address this problem, we introduce a cross-lingual and
progressive transfer learning approach, called CLP-Transfer, that transfers
models from a source language, for which pretrained models are publicly
available, like English, to a new target language. As opposed to prior work,
which focused on the cross-lingual transfer between two languages, we extend
the transfer to the model size. Given a pretrained model in a source language,
we aim for a same-sized model in a target language. Instead of training a model
from scratch, we exploit a smaller model in the target language that requires
far fewer resources. Both small and source models are then used to
initialize the token embeddings of the larger model based on the overlapping
vocabulary of the source and target language. All remaining weights are reused
from the model in the source language. This approach outperforms sole
cross-lingual transfer and can save up to 80% of the training steps compared to
random initialization.
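The embedding initialization described above can be sketched in NumPy. This is a minimal sketch under assumptions: the function name is hypothetical, and the similarity-weighted combination for target-only tokens is one plausible reading of "both small and source models are used", not necessarily the paper's exact procedure.

```python
import numpy as np

def clp_init_embeddings(src_emb, src_vocab, small_tgt_emb, tgt_vocab):
    """Sketch of a CLP-Transfer-style embedding initialization.

    src_emb:       (|V_src|, d) token embeddings of the large source-language model
    small_tgt_emb: (|V_tgt|, d') token embeddings of the small target-language model
    *_vocab:       dicts mapping token string -> row index
    Returns a (|V_tgt|, d) embedding matrix for the large target-language model.
    """
    d = src_emb.shape[1]
    tgt_emb = np.empty((len(tgt_vocab), d))
    overlap = [t for t in tgt_vocab if t in src_vocab]

    # Overlapping tokens: copy the source model's embedding directly.
    for tok in overlap:
        tgt_emb[tgt_vocab[tok]] = src_emb[src_vocab[tok]]

    # Target-only tokens: combine the source embeddings of overlapping tokens,
    # weighted by similarity in the small target model's embedding space
    # (assumed detail; the abstract does not specify the combination rule).
    ov_small = np.stack([small_tgt_emb[tgt_vocab[t]] for t in overlap])
    ov_src = np.stack([src_emb[src_vocab[t]] for t in overlap])
    for tok, i in tgt_vocab.items():
        if tok in src_vocab:
            continue
        sims = ov_small @ small_tgt_emb[i]
        w = np.exp(sims - sims.max())        # softmax weights over overlap tokens
        tgt_emb[i] = (w / w.sum()) @ ov_src
    return tgt_emb
```

Per the abstract, all remaining (non-embedding) weights would simply be copied from the source-language model.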
AspectCSE: Sentence Embeddings for Aspect-based Semantic Textual Similarity using Contrastive Learning and Structured Knowledge
Generic sentence embeddings provide a coarse-grained approximation of
semantic textual similarity but ignore specific aspects that make texts
similar. Conversely, aspect-based sentence embeddings provide similarities
between texts based on certain predefined aspects. Thus, similarity predictions
of texts are more targeted to specific requirements and more easily
explainable. In this paper, we present AspectCSE, an approach for aspect-based
contrastive learning of sentence embeddings. Results indicate that AspectCSE
achieves an average improvement of 3.97% on information retrieval tasks across
multiple aspects compared to the previous best results. We also propose using
Wikidata knowledge graph properties to train models of multi-aspect sentence
embeddings in which multiple specific aspects are simultaneously considered
during similarity predictions. We demonstrate that multi-aspect embeddings
outperform single-aspect embeddings on aspect-specific information retrieval
tasks. Finally, we examine the aspect-based sentence embedding space and
demonstrate that embeddings of semantically similar aspect labels are often
close, even without explicit similarity training between different aspect
labels.
Comment: Accepted to the 14th International Conference on Recent Advances in
Natural Language Processing (RANLP 2023)
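The kind of contrastive objective such an approach builds on can be illustrated with a SimCSE-style InfoNCE loss. This is a generic sketch, not AspectCSE's exact formulation: the function name is hypothetical, and the pairing scheme, treating a text that matches the anchor on the chosen aspect as its positive, is an assumption.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.05):
    """InfoNCE loss over a batch of (anchor, positive) embedding pairs.

    In an aspect-based setup, row i of `positives` would be a text that
    matches anchor i on the chosen aspect; all other rows in the batch
    act as in-batch negatives.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = a @ p.T / temperature                   # (B, B) cosine similarities
    # Cross-entropy with the matching pair (the diagonal) as the target class.
    logits = sims - sims.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss pulls each anchor toward its aspect-matched positive and pushes it away from the other texts in the batch.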
Tokenizer Choice For LLM Training: Negligible or Crucial?
The recent success of LLMs has been predominantly driven by curating the
training dataset composition, scaling model architectures and dataset sizes,
and advancing pretraining objectives, leaving tokenizer influence as a
blind spot. Shedding light on this underexplored area, we conduct a
comprehensive study on the influence of tokenizer choice on LLM downstream
performance by training 24 mono- and multilingual LLMs at a 2.6B parameter
scale, ablating different tokenizer algorithms and parameterizations. Our
studies highlight that the tokenizer choice can significantly impact the
model's downstream performance and its training and inference costs. In
particular, we find that the common tokenizer evaluation metrics, fertility and
parity, are not always predictive of downstream performance, rendering these
metrics questionable proxies for it. Furthermore, we show
that multilingual tokenizers trained on the five most frequent European
languages require vocabulary size increases of factor three in comparison to
English. While English-only tokenizers have been applied to the training of
multilingual LLMs, we find that this approach results in severe downstream
performance degradation and additional training costs of up to 68%, due to an
inefficient tokenization vocabulary.
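The fertility metric mentioned above is commonly defined as the average number of subword tokens per word; a minimal sketch (the paper's exact setup, e.g. the word-segmentation convention, may differ):

```python
def fertility(tokenize, texts):
    """Tokenizer fertility: average number of subword tokens produced per
    whitespace-separated word. Values near 1.0 mean words are rarely split;
    higher values mean longer sequences for the same text.
    """
    n_tokens = sum(len(tokenize(text)) for text in texts)
    n_words = sum(len(text.split()) for text in texts)
    return n_tokens / n_words

# Toy tokenizer that splits every word into two-character pieces:
bigrams = lambda s: [w[i:i + 2] for w in s.split() for i in range(0, len(w), 2)]
```

A tokenizer with high fertility on a language inflates sequence lengths and thus training and inference costs, which is one mechanism behind the cost overhead reported above.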