2 research outputs found

Tokenizer Choice For LLM Training: Negligible or Crucial?

Authors: Abdelwahab Hammam
Ali Mehdi
Buschhoff Jasper Schulze
Doll Niclas
Ebert Jan
Flores-Herr Nicolas
Fromm Michael
Jain Charvi
John Chelsea
Jurkschat Lena
Kesselheim Stefan
Klug Katrin
Leveling Johannes
Lübbering Max
Ostendorff Malte
Rutmann Richard
Sifa Rafet
Suarez Pedro Ortiz
Thellmann Klaudia
Weber Alexander Arno
Weinbach Samuel
Publication date: 18/10/2023

The recent success of LLMs has been predominantly driven by curating the training dataset composition, scaling model architectures and dataset sizes, and advancing pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study of the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model's downstream performance as well as its training and inference costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of model downstream performance, rendering these metrics a questionable proxy for it. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require vocabulary size increases by a factor of three in comparison to English. While English-only tokenizers have been applied to the training of multilingual LLMs, we find that this approach results in severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
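The abstract names fertility and parity without defining them. The sketch below uses the definitions commonly given for these metrics, which is an assumption rather than something stated in this record: fertility as the mean number of tokens per whitespace-separated word, and parity as the ratio of token counts over parallel texts in two languages. The `toy_tokenize` function is a hypothetical stand-in for a real tokenizer and is not the authors' setup.

```python
def fertility(tokenize, texts):
    """Mean number of tokens per whitespace-separated word (assumed definition)."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words


def parity(tokenize, texts_a, texts_b):
    """Ratio of total token counts over parallel texts (assumed definition).

    A value of 1.0 means both languages cost the same number of tokens.
    """
    tokens_a = sum(len(tokenize(t)) for t in texts_a)
    tokens_b = sum(len(tokenize(t)) for t in texts_b)
    return tokens_a / tokens_b


def toy_tokenize(text, piece_len=4):
    """Hypothetical tokenizer: cut each word into chunks of at most piece_len characters."""
    return [
        word[i:i + piece_len]
        for word in text.split()
        for i in range(0, len(word), piece_len)
    ]


if __name__ == "__main__":
    # "tokenizer" -> 3 pieces, "choice" -> 2 pieces, so 5 tokens over 2 words.
    print(fertility(toy_tokenize, ["tokenizer choice"]))
    # English side: 2 tokens; German side: 3 tokens.
    print(parity(toy_tokenize, ["the cat"], ["die Katze"]))
```

Lower fertility means fewer tokens per word and therefore shorter sequences, which is why it is often used as a cost proxy; the abstract's point is that such proxies do not reliably predict downstream performance.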

arXiv.org e-Print Archive

Dosimetry and optimal scan time of [18F]SiTATE-PET/CT in patients with neuroendocrine tumours

Authors: Auernhammer Christoph J.
Bartenstein Peter
Beyer Leonie
Brendel Matthias
Böning Guido
Cyran Clemens C.
Gildehaus F. J.
Gosewisch Astrid
Ilhan Harun
Jurkschat Klaus
Lindner Simon
Mittlmeier Lena M.
Rübenthaler Johannes
Schirrmacher Ralf
Spitzweg Christine
Tiling Reinhold
Todica Andrei
Unterrainer Marcus
Völter Friederike
Wenter Vera
Wängler Björn
Wängler Carmen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2021

PURPOSE: Radiolabelled somatostatin analogues targeting somatostatin receptors (SSR) are well established for combined positron emission tomography/computed tomography (PET/CT) imaging of neuroendocrine tumours (NET). [18F]SiTATE has recently been introduced, showing high image quality, promising clinical performance and improved logistics compared to the clinical reference standard 68Ga-DOTA-TOC. Here we present the first dosimetry and optimal scan time analysis. METHODS: Eight NET patients received a [18F]SiTATE-PET/CT (250 ± 66 MBq) with repeated emission scans (10, 30, 60, 120 and 180 min after injection). Biodistribution in normal organs and SSR-positive tumour uptake were assessed. Dosimetry estimates for risk organs were determined using a combined linear-monoexponential model and by applying 18F S-values and reference target masses for the ICRP89 adult male or female (OLINDA 2.0). Tumour-to-background ratios were compared quantitatively and visually between different scan times. RESULTS: After 1 h, normal organs showed similar tracer uptake with only negligible changes until 3 h post-injection. In contrast, tracer uptake by tumours increased progressively for almost all types of metastases, thus increasing tumour-to-background ratios over time. Dosimetry yielded a total effective dose of 0.015 ± 0.004 mSv/MBq. Visual evaluation revealed no clinically relevant discrepancies between later scan times, but image quality was rated highest in the 60 and 120 min images. CONCLUSION: [18F]SiTATE-PET/CT in NET shows overall high tumour-to-background ratios from 60 to 180 min after injection and an effective dose comparable to 68Ga-labelled alternatives. For clinical use of [18F]SiTATE, the best compromise between image quality and tumour-to-background contrast is reached at 120 min, followed by 60 min after injection.
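As a rough illustration of what the reported dose coefficient implies, the sketch below multiplies the mean effective dose coefficient (0.015 mSv/MBq) by the mean injected activity (250 MBq), both taken from the abstract. The per-scan figure it produces is a derived illustration, not a result stated in this record, and it covers the tracer only; any dose from the CT component is not included. The function and constant names are hypothetical.

```python
# Values quoted in the abstract.
MEAN_ACTIVITY_MBQ = 250.0           # mean injected activity
DOSE_COEFF_MSV_PER_MBQ = 0.015      # mean total effective dose coefficient


def effective_dose_msv(activity_mbq, coeff=DOSE_COEFF_MSV_PER_MBQ):
    """Tracer-only effective dose in mSv for a given injected activity in MBq."""
    return activity_mbq * coeff


if __name__ == "__main__":
    dose = effective_dose_msv(MEAN_ACTIVITY_MBQ)
    print(f"Tracer-only effective dose at {MEAN_ACTIVITY_MBQ:.0f} MBq: {dose:.2f} mSv")
```

At the mean injected activity this works out to about 3.75 mSv from the tracer.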

Open Access LMU

PubMed Central

Hochschulbibliothekszentrum des Landes Nordrhein-Westfalen (hbz)