    What company does my news article refer to? Tackling multiclass problems with topic modeling

    While it is technically trivial to search for the company name to predict the company a news article refers to, this often leads to incorrect results. In this article, we compare two approaches, bag-of-words with k-nearest neighbors and Latent Dirichlet Allocation with k-nearest neighbors, by assessing their applicability for predicting the S&P 500 company mentioned in a business news article or press release. Both approaches are evaluated on a corpus of 13k documents containing 84% news articles and 16% press releases. While the bag-of-words approach yields accurate predictions, it is highly inefficient due to its gigantic feature space. The Latent Dirichlet Allocation approach, on the other hand, achieves roughly the same prediction accuracy (0.58 instead of 0.62) while reducing the feature space by a factor of seven.
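
    As a rough illustration of the comparison (not the authors' exact pipeline), the sketch below builds both variants with scikit-learn: raw bag-of-words counts versus an LDA topic distribution, each feeding a k-nearest-neighbors classifier that predicts a company label. The mini-corpus, company names, and hyperparameters are hypothetical placeholders.

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    # Hypothetical mini-corpus and labels (stand-ins for the paper's 13k
    # documents of news articles and press releases).
    docs = [
        "Acme Corp reported record quarterly earnings and raised guidance",
        "Acme Corp announced a share buyback program",
        "Globex completed its merger with a logistics startup",
        "Globex shares fell after a disappointing product launch",
    ]
    labels = ["ACME", "ACME", "GLOBEX", "GLOBEX"]

    # Approach 1: raw bag-of-words counts feed kNN directly; the feature
    # space is as large as the vocabulary (sparse, high-dimensional).
    bow_knn = make_pipeline(CountVectorizer(),
                            KNeighborsClassifier(n_neighbors=1))

    # Approach 2: the same counts are compressed into a small number of
    # LDA topics before kNN, shrinking the feature space drastically.
    lda_knn = make_pipeline(
        CountVectorizer(),
        LatentDirichletAllocation(n_components=10, random_state=0),
        KNeighborsClassifier(n_neighbors=1),
    )

    bow_knn.fit(docs, labels)
    lda_knn.fit(docs, labels)

    query = ["Acme Corp stock rose on strong earnings"]
    print("bag-of-words + kNN:", bow_knn.predict(query))
    print("LDA topics + kNN:  ", lda_knn.predict(query))
    ```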

    Tokenizer Choice For LLM Training: Negligible or Crucial?

    The recent success of LLMs has been predominantly driven by curating the training dataset composition, scaling model architectures and dataset sizes, and advancing pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study of the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at the 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model's downstream performance as well as training and inference costs. In particular, we find that the common tokenizer evaluation metrics, fertility and parity, are not always predictive of downstream performance, rendering them a questionable proxy. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require vocabulary sizes about three times larger than English tokenizers. While English-only tokenizers have been applied to the training of multilingual LLMs, we find that this approach results in severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
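
    For concreteness, here is a minimal sketch of the two metrics the abstract names, written against an abstract tokenize() callable. The toy splitter is only a stand-in for a real subword tokenizer (e.g. BPE), and the parallel English/German sentences are hypothetical; the metric definitions follow the common usage (fertility = tokens per word, parity = token-count ratio on parallel text).

    ```python
    from typing import Callable, List

    def fertility(tokenize: Callable[[str], List[str]], text: str) -> float:
        """Average number of tokens produced per whitespace word."""
        return len(tokenize(text)) / len(text.split())

    def parity(tokenize: Callable[[str], List[str]],
               text_a: str, text_b: str) -> float:
        """Token-count ratio on parallel text; a value near 1.0 means the
        tokenizer treats both languages about equally efficiently."""
        return len(tokenize(text_a)) / len(tokenize(text_b))

    # Toy stand-in tokenizer: keep short words whole, split longer words
    # into 3-character chunks (mimicking subword fragmentation).
    def toy_tokenize(text: str) -> List[str]:
        out: List[str] = []
        for w in text.split():
            if len(w) <= 6:
                out.append(w)
            else:
                out.extend(w[i:i + 3] for i in range(0, len(w), 3))
        return out

    en = "the model was trained on multilingual data"           # hypothetical
    de = "das Modell wurde auf mehrsprachigen Daten trainiert"  # parallel German
    print(f"fertility(en)  = {fertility(toy_tokenize, en):.2f}")
    print(f"parity(en, de) = {parity(toy_tokenize, en, de):.2f}")
    ```

    A fertility well above 1.0 or a parity far from 1.0 signals that the tokenizer fragments a language into many pieces, which the study finds is not always predictive of downstream model quality.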

    Failure of thymic deletion and instability of autoreactive Tregs drive autoimmunity in immune-privileged liver

    The liver is an immune-privileged organ that can deactivate autoreactive T cells. Yet in autoimmune hepatitis (AIH), autoreactive T cells can defy hepatic control and attack the liver. To elucidate how tolerance to self-antigens is lost during AIH pathogenesis, we generated a spontaneous mouse model of AIH based on recognition of an MHC class II–restricted model peptide in hepatocytes by autoreactive CD4+ T cells. We found that the hepatic peptide was not expressed in the thymus, leading to deficient thymic deletion and a resulting peripheral abundance of autoreactive CD4+ T cells. In the liver, autoreactive CD4+ effector T cells accumulated within portal ectopic lymphoid structures and matured into pathogenic IFN-γ- and TNF-coproducing cells. Expansion and pathogenic maturation of autoreactive effector T cells were enabled by a selective increase in the plasticity and instability of autoantigen-specific Tregs, but not of nonspecific Tregs. Indeed, antigen-specific Tregs were reduced in frequency and showed increased IL-17 production, reduced epigenetic demethylation, and reduced expression of Foxp3. As a consequence, autoantigen-specific Tregs had reduced suppressive capacity compared with nonspecific Tregs. In conclusion, loss of tolerance and the pathogenesis of AIH were enabled by the combined failure of thymic deletion and peripheral regulation.