
4 research outputs found

Tokenizer Choice For LLM Training: Negligible or Crucial?

Author: Abdelwahab Hammam
Ali Mehdi
Buschhoff Jasper Schulze
Doll Niclas
Ebert Jan
Flores-Herr Nicolas
Fromm Michael
Jain Charvi
John Chelsea
Jurkschat Lena
Kesselheim Stefan
Klug Katrin
Leveling Johannes
Lübbering Max
Ostendorff Malte
Rutmann Richard
Sifa Rafet
Suarez Pedro Ortiz
Thellmann Klaudia
Weber Alexander Arno
Weinbach Samuel
Publication date: 18/10/2023

The recent success of LLMs has been predominantly driven by curating the training dataset composition, scaling model architectures and dataset sizes, and advancing pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study on the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model's downstream performance as well as its training and inference costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of model downstream performance, rendering these metrics a questionable proxy for it. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require a vocabulary three times larger than English-only tokenizers. While English-only tokenizers have been applied to the training of multilingual LLMs, we find that this approach results in severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
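The abstract names fertility and parity without defining them. The sketch below assumes the definitions commonly used for these metrics: fertility as the average number of tokens per whitespace-delimited word, and parity as the ratio of token counts that a tokenizer produces for parallel texts in two languages. The paper's exact formulation may differ. `toy_tokenizer` is a hypothetical stand-in for a trained tokenizer and is not taken from the paper.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def toy_tokenizer(text: str, chunk: int = 4) -> List[str]:
    """Hypothetical tokenizer: cut each whitespace word into 4-character chunks."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[i:i + chunk] for i in range(0, len(word), chunk))
    return tokens


def fertility(tokenize: Tokenizer, texts: List[str]) -> float:
    """Average number of tokens per whitespace word (assumed definition)."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words


def parity(tokenize: Tokenizer, texts_a: List[str], texts_b: List[str]) -> float:
    """Token-count ratio for parallel texts; 1.0 means both languages cost the same."""
    tokens_a = sum(len(tokenize(t)) for t in texts_a)
    tokens_b = sum(len(tokenize(t)) for t in texts_b)
    return tokens_a / tokens_b


if __name__ == "__main__":
    english = ["the cat sat"]
    german = ["die Katze sass"]  # parallel sentence, illustrative only
    print(fertility(toy_tokenizer, english))
    print(fertility(toy_tokenizer, german))
    print(parity(toy_tokenizer, german, english))
```

Under these assumed definitions, a fertility close to 1 and a parity close to 1 indicate a compact and balanced tokenization. The abstract's point is that such values do not reliably predict downstream performance.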

arXiv.org e-Print Archive

Interactive Visualisation Techniques for the Web of Data

Author: Berners-Lee Tim
Pietriga Emmanuel
Thellmann Klaudia
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 13/05/2019

The RDF format offers powerful possibilities for machines, such as reasoning or federated queries over interlinked datasets. However, presenting RDF data to humans is very challenging: its very structure defeats traditional approaches, as it separates information into small pieces, making it difficult for users to make sense of it. My PhD work proposes an approach that presents RDF data in context, to make it understandable by humans. We first describe S-Paths, a system to support set-based exploration of a dataset's content. We show that it works well on simple models, but that its efficiency is limited by performance issues on very abstract models. Then we lay the basis for a second project, whose aim is to take one more step back and put these sets of entities in a broader context, to give a structural overview of Linked Datasets.

HAL-CentraleSupelec

Crossref

INRIA a CCSD electronic archive server

HAL-Rennes 1

LinDA - Visualising and Exploring Linked Data

Author: Auer Sören
Orlandi Fabrizio
Thellmann Klaudia

The main goal of our work in the context of the LinDA (Linked Data Analytics) project is to offer small and medium-sized enterprises (SMEs) possibilities for integrating and consuming data by using Linked Data technologies. One of the major challenges of this project is providing user-friendly means of exploring and visualising Linked Data. To achieve this, a Semantic Web application has been created, based on state-of-the-art Linked Data visualisation approaches, which allows a largely automatic matching and binding of data to visualisations. Hence, in this demo paper we demonstrate the potential of a visualisation framework that is capable of dealing with different data formats, serialisations and Semantic Web ontologies.

Fraunhofer-ePrints

Linked Geospatial Data

Author: C Gutierrez
C Lutz
C Nikolaou
Charalampos Nikolaou
D Calvanese
D Espinoza-Molina
Eliseo Clementini
I Meiri
J Klímek
J Mylopoulos
JF Allen
Klaudia Thellmann
Konstantina Bereta
Kostis Kyzirakos
M Perry
M Wessel
Manolis Koubarakis
MJ Egenhofer
Stefan Brüggemann
Tomasz Imieliński
Publication venue: Springer Nature
Publication date: 01/01/2019

Huge amounts of geospatial data have recently been made freely available on the Web. Examples include maps from geospatial search engines like Google Maps, images from satellites, open geospatial data from national cartographic agencies and user-contributed geospatial content from social networks. This article surveys the state of the art in the area of linked geospatial data, i.e., geospatial data made available on the Web using linked data technologies such as RDF and SPARQL, and interlinked with other Web data to increase its value for users and applications.

Crossref

Oxford University Research Archive