
    Introducing the NLU Showroom: A NLU Demonstrator for the German Language

    We present the NLU Showroom, a platform for interactively demonstrating the functionality of natural language understanding models through easy-to-use visual interfaces. The NLU Showroom focuses primarily on the German language, as few German NLU resources exist. However, it also serves corresponding English models to reach a broader audience. With the NLU Showroom we demonstrate and compare the capabilities and limitations of a variety of NLP/NLU models. The four initial demonstrators include (a) a comparison of how different word representations capture semantic similarity, (b) a comparison of how different sentence representations interpret sentence similarity, (c) a showcase on analyzing reviews with NLU, and (d) a showcase on finding links between entities. The NLU Showroom is built on state-of-the-art architectures for model serving and data processing. It targets a broad audience, from newcomers to researchers, with a focus on placing the presented models in the context of industrial applications.
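
    Demonstrator (a) compares how word representations capture semantic similarity, which is commonly quantified as the cosine similarity between embedding vectors. The following is a minimal sketch of that idea, not the Showroom's own code; the toy vectors and the cosine helper are illustrative assumptions.

        import numpy as np

        def cosine(u: np.ndarray, v: np.ndarray) -> float:
            """Cosine similarity between two embedding vectors."""
            return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

        # Toy 4-dimensional embeddings standing in for real German word vectors
        # (e.g. from fastText or a German BERT model).
        embeddings = {
            "Hund":  np.array([0.8, 0.1, 0.3, 0.0]),
            "Katze": np.array([0.7, 0.2, 0.4, 0.1]),
            "Auto":  np.array([0.0, 0.9, 0.1, 0.6]),
        }

        # Semantically related words should score higher than unrelated ones.
        print(cosine(embeddings["Hund"], embeddings["Katze"]))  # high: both animals
        print(cosine(embeddings["Hund"], embeddings["Auto"]))   # low: unrelated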

    Tokenizer Choice For LLM Training: Negligible or Crucial?

    The recent success of LLMs has been predominantly driven by curating the training dataset composition, scaling model architectures and dataset sizes, and advancing pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study of the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model's downstream performance as well as training and inference costs. In particular, we find that the common tokenizer evaluation metrics, fertility and parity, are not always predictive of model downstream performance, rendering these metrics a questionable proxy. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require a threefold increase in vocabulary size compared to English. While English-only tokenizers have been applied to the training of multilingual LLMs, we find that this approach results in severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
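
    Fertility, one of the metrics the abstract calls into question, is conventionally measured as the average number of subword tokens a tokenizer produces per word. Below is a minimal sketch of that computation, assuming the Hugging Face transformers library and the pretrained "gpt2" tokenizer; the model choice and sample sentences are illustrative, not taken from the paper.

        from transformers import AutoTokenizer

        def fertility(tokenizer, texts):
            """Average number of subword tokens per whitespace-delimited word."""
            total_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
            total_words = sum(len(t.split()) for t in texts)
            return total_tokens / total_words

        tokenizer = AutoTokenizer.from_pretrained("gpt2")  # English-centric BPE tokenizer
        samples_en = ["The quick brown fox jumps over the lazy dog."]
        samples_de = ["Der schnelle braune Fuchs springt über den faulen Hund."]

        # An English-only tokenizer typically shows higher fertility on German,
        # i.e. it fragments German words into more subword pieces.
        print("fertility EN:", fertility(tokenizer, samples_en))
        print("fertility DE:", fertility(tokenizer, samples_de))

    Parity, the second metric, roughly compares the token counts a tokenizer assigns to parallel sentences across languages; a tokenizer with high fertility on a language inflates sequence lengths and thus training and inference costs for that language.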

    Myth or possibility - Institutional reforms and change management for mode shift in freight transport: Summary report 1 - LowCarb-RFC - European Rail Freight Corridors Going Carbon Neutral

    The LowCarb-RFC study builds on current work on shifting freight transport to rail. That work uniformly neglects the core questions: (1) what are the actual effects on regions and their transport systems, and (2) whether and how the key institutions can be made fit for this task. The study investigates these questions for two European freight corridors as well as the logistics hub of Nordrhein-Westfalen. It applies three different models for traffic simulation and sustainability assessment, examines institutional reform processes within and outside the transport sector, and operates a platform for the exchange of knowledge and experience among representatives of transport companies, the shipping industry, and policymakers. The study is co-funded by the Mercator-Stiftung in cooperation with the European Climate Foundation over the period 2015 to 2018.