Towards an Open Platform for Legal Information
Recent advances in the area of legal information systems have led to a
variety of applications that promise support in processing and accessing legal
documents. Unfortunately, these applications have various limitations, e.g.,
regarding scope or extensibility. Furthermore, we do not observe a trend
towards open access in digital libraries in the legal domain as we observe in
other domains, e.g., economics or computer science. To improve open access in
the legal domain, we present our approach for an open source platform to
transparently process and access Legal Open Data. This enables the sustainable
development of legal applications by offering a single technology stack.
Moreover, the approach facilitates the development and deployment of new
technologies. As proof of concept, we implemented six technologies and
generated metadata for more than 250,000 German laws and court decisions. Thus,
we can give users of our platform access not only to legal documents but also
to the information they contain.
Comment: Accepted at ACM/IEEE Joint Conference on Digital Libraries (JCDL) 202
Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning
Most Transformer language models are primarily pretrained on English text,
limiting their use for other languages. As model sizes grow, the performance
gap between English and languages with fewer compute and data resources widens
even further. Consequently, more resource-efficient
training methods are needed to bridge the gap for languages with fewer
resources available. To address this problem, we introduce a cross-lingual and
progressive transfer learning approach, called CLP-Transfer, that transfers
models from a source language, for which pretrained models are publicly
available, like English, to a new target language. As opposed to prior work,
which focused on the cross-lingual transfer between two languages, we extend
the transfer to the model size. Given a pretrained model in a source language,
we aim for a same-sized model in a target language. Instead of training a model
from scratch, we exploit a smaller model in the target language that requires
far fewer resources. Both small and source models are then used to
initialize the token embeddings of the larger model based on the overlapping
vocabulary of the source and target language. All remaining weights are reused
from the model in the source language. This approach outperforms sole
cross-lingual transfer and can save up to 80% of the training steps compared to
random initialization.
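The embedding initialization described above can be sketched in NumPy. This is a minimal sketch under assumptions: the function name is hypothetical, and the similarity-weighted combination for target-only tokens is one plausible reading of "both small and source models are used", not necessarily the paper's exact procedure.

```python
import numpy as np

def clp_init_embeddings(src_emb, src_vocab, small_tgt_emb, tgt_vocab):
    """Sketch of a CLP-Transfer-style embedding initialization.

    src_emb:       (|V_src|, d) token embeddings of the large source-language model
    small_tgt_emb: (|V_tgt|, d') token embeddings of the small target-language model
    *_vocab:       dicts mapping token string -> row index
    Returns a (|V_tgt|, d) embedding matrix for the large target-language model.
    """
    d = src_emb.shape[1]
    tgt_emb = np.empty((len(tgt_vocab), d))
    overlap = [t for t in tgt_vocab if t in src_vocab]

    # Overlapping tokens: copy the source model's embedding directly.
    for tok in overlap:
        tgt_emb[tgt_vocab[tok]] = src_emb[src_vocab[tok]]

    # Target-only tokens: combine the source embeddings of overlapping tokens,
    # weighted by similarity in the small target model's embedding space
    # (assumed detail; the abstract does not specify the combination rule).
    ov_small = np.stack([small_tgt_emb[tgt_vocab[t]] for t in overlap])
    ov_src = np.stack([src_emb[src_vocab[t]] for t in overlap])
    for tok, i in tgt_vocab.items():
        if tok in src_vocab:
            continue
        sims = ov_small @ small_tgt_emb[i]
        w = np.exp(sims - sims.max())        # softmax weights over overlap tokens
        tgt_emb[i] = (w / w.sum()) @ ov_src
    return tgt_emb
```

Per the abstract, all remaining (non-embedding) weights would simply be copied from the source-language model.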
AspectCSE: Sentence Embeddings for Aspect-based Semantic Textual Similarity using Contrastive Learning and Structured Knowledge
Generic sentence embeddings provide a coarse-grained approximation of
semantic textual similarity but ignore specific aspects that make texts
similar. Conversely, aspect-based sentence embeddings provide similarities
between texts based on certain predefined aspects. Thus, similarity predictions
of texts are more targeted to specific requirements and more easily
explainable. In this paper, we present AspectCSE, an approach for aspect-based
contrastive learning of sentence embeddings. Results indicate that AspectCSE
achieves an average improvement of 3.97% on information retrieval tasks across
multiple aspects compared to the previous best results. We also propose using
Wikidata knowledge graph properties to train models of multi-aspect sentence
embeddings in which multiple specific aspects are simultaneously considered
during similarity predictions. We demonstrate that multi-aspect embeddings
outperform single-aspect embeddings on aspect-specific information retrieval
tasks. Finally, we examine the aspect-based sentence embedding space and
demonstrate that embeddings of semantically similar aspect labels are often
close, even without explicit similarity training between different aspect
labels.
Comment: Accepted to the 14th International Conference on Recent Advances in
Natural Language Processing (RANLP 2023)
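The kind of contrastive objective such an approach builds on can be illustrated with a SimCSE-style InfoNCE loss. This is a generic sketch, not AspectCSE's exact formulation: the function name is hypothetical, and the pairing scheme, treating a text that matches the anchor on the chosen aspect as its positive, is an assumption.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.05):
    """InfoNCE loss over a batch of (anchor, positive) embedding pairs.

    In an aspect-based setup, row i of `positives` would be a text that
    matches anchor i on the chosen aspect; all other rows in the batch
    act as in-batch negatives.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = a @ p.T / temperature                   # (B, B) cosine similarities
    # Cross-entropy with the matching pair (the diagonal) as the target class.
    logits = sims - sims.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss pulls each anchor toward its aspect-matched positive and pushes it away from the other texts in the batch.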
Tokenizer Choice For LLM Training: Negligible or Crucial?
The recent success of LLMs has been predominantly driven by curating the
training dataset composition, scaling model architectures and dataset sizes,
and advancing pretraining objectives, leaving tokenizer influence as a
blind spot. Shedding light on this underexplored area, we conduct a
comprehensive study on the influence of tokenizer choice on LLM downstream
performance by training 24 mono- and multilingual LLMs at a 2.6B parameter
scale, ablating different tokenizer algorithms and parameterizations. Our
studies highlight that the tokenizer choice can significantly impact the
model's downstream performance and its training and inference costs. In
particular, we find that the common tokenizer evaluation metrics, fertility and
parity, are not always predictive of downstream performance, rendering these
metrics questionable proxies for it. Furthermore, we show
that multilingual tokenizers trained on the five most frequent European
languages require vocabulary size increases of factor three in comparison to
English. While English-only tokenizers have been applied to the training of
multilingual LLMs, we find that this approach results in severe downstream
performance degradation and additional training costs of up to 68%, due to an
inefficient tokenization vocabulary.
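The fertility metric mentioned above is commonly defined as the average number of subword tokens per word; a minimal sketch (the paper's exact setup, e.g. the word-segmentation convention, may differ):

```python
def fertility(tokenize, texts):
    """Tokenizer fertility: average number of subword tokens produced per
    whitespace-separated word. Values near 1.0 mean words are rarely split;
    higher values mean longer sequences for the same text.
    """
    n_tokens = sum(len(tokenize(text)) for text in texts)
    n_words = sum(len(text.split()) for text in texts)
    return n_tokens / n_words

# Toy tokenizer that splits every word into two-character pieces:
bigrams = lambda s: [w[i:i + 2] for w in s.split() for i in range(0, len(w), 2)]
```

A tokenizer with high fertility on a language inflates sequence lengths and thus training and inference costs, which is one mechanism behind the cost overhead reported above.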