    What company does my news article refer to? Tackling multiclass problems with topic modeling

    While it is technically trivial to search for the company name to predict the company a news article refers to, this often leads to incorrect results. In this article, we compare two approaches, bag-of-words with k-nearest neighbors and Latent Dirichlet Allocation with k-nearest neighbors, by assessing their applicability for predicting the S&P 500 company mentioned in a business news article or press release. Both approaches are evaluated on a corpus of 13k documents containing 84% news articles and 16% press releases. While the bag-of-words approach yields accurate predictions, it is highly inefficient due to its gigantic feature space. The Latent Dirichlet Allocation approach, on the other hand, achieves roughly the same prediction accuracy (0.58 instead of 0.62) while reducing the feature space by a factor of seven.
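
    As a rough illustration of the comparison (not the authors' exact pipeline), the sketch below builds both variants with scikit-learn: raw bag-of-words counts versus an LDA topic distribution, each feeding a k-nearest-neighbors classifier that predicts a company label. The mini-corpus, company names, and hyperparameters are hypothetical placeholders.

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    # Hypothetical mini-corpus and labels (stand-ins for the paper's 13k
    # documents of news articles and press releases).
    docs = [
        "Acme Corp reported record quarterly earnings and raised guidance",
        "Acme Corp announced a share buyback program",
        "Globex completed its merger with a logistics startup",
        "Globex shares fell after a disappointing product launch",
    ]
    labels = ["ACME", "ACME", "GLOBEX", "GLOBEX"]

    # Approach 1: raw bag-of-words counts feed kNN directly; the feature
    # space is as large as the vocabulary (sparse, high-dimensional).
    bow_knn = make_pipeline(CountVectorizer(),
                            KNeighborsClassifier(n_neighbors=1))

    # Approach 2: the same counts are compressed into a small number of
    # LDA topics before kNN, shrinking the feature space drastically.
    lda_knn = make_pipeline(
        CountVectorizer(),
        LatentDirichletAllocation(n_components=10, random_state=0),
        KNeighborsClassifier(n_neighbors=1),
    )

    bow_knn.fit(docs, labels)
    lda_knn.fit(docs, labels)

    query = ["Acme Corp stock rose on strong earnings"]
    print("bag-of-words + kNN:", bow_knn.predict(query))
    print("LDA topics + kNN:  ", lda_knn.predict(query))
    ```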

    Tokenizer Choice For LLM Training: Negligible or Crucial?

    The recent success of LLMs has been predominantly driven by curating the training dataset composition, scaling model architectures and dataset sizes, and advancing pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study of the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at the 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model's downstream performance as well as training and inference costs. In particular, we find that the common tokenizer evaluation metrics, fertility and parity, are not always predictive of downstream performance, rendering them a questionable proxy. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require vocabulary sizes about three times larger than English tokenizers. While English-only tokenizers have been applied to the training of multilingual LLMs, we find that this approach results in severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
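
    For concreteness, here is a minimal sketch of the two metrics the abstract names, written against an abstract tokenize() callable. The toy splitter is only a stand-in for a real subword tokenizer (e.g. BPE), and the parallel English/German sentences are hypothetical; the metric definitions follow the common usage (fertility = tokens per word, parity = token-count ratio on parallel text).

    ```python
    from typing import Callable, List

    def fertility(tokenize: Callable[[str], List[str]], text: str) -> float:
        """Average number of tokens produced per whitespace word."""
        return len(tokenize(text)) / len(text.split())

    def parity(tokenize: Callable[[str], List[str]],
               text_a: str, text_b: str) -> float:
        """Token-count ratio on parallel text; a value near 1.0 means the
        tokenizer treats both languages about equally efficiently."""
        return len(tokenize(text_a)) / len(tokenize(text_b))

    # Toy stand-in tokenizer: keep short words whole, split longer words
    # into 3-character chunks (mimicking subword fragmentation).
    def toy_tokenize(text: str) -> List[str]:
        out: List[str] = []
        for w in text.split():
            if len(w) <= 6:
                out.append(w)
            else:
                out.extend(w[i:i + 3] for i in range(0, len(w), 3))
        return out

    en = "the model was trained on multilingual data"           # hypothetical
    de = "das Modell wurde auf mehrsprachigen Daten trainiert"  # parallel German
    print(f"fertility(en)  = {fertility(toy_tokenize, en):.2f}")
    print(f"parity(en, de) = {parity(toy_tokenize, en, de):.2f}")
    ```

    A fertility well above 1.0 or a parity far from 1.0 signals that the tokenizer fragments a language into many pieces, which the study finds is not always predictive of downstream model quality.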

    Failure of thymic deletion and instability of autoreactive Tregs drive autoimmunity in immune-privileged liver

    The liver is an immune-privileged organ that can deactivate autoreactive T cells. Yet in autoimmune hepatitis (AIH), autoreactive T cells can defy hepatic control and attack the liver. To elucidate how tolerance to self-antigens is lost during AIH pathogenesis, we generated a spontaneous mouse model of AIH based on recognition of an MHC class II–restricted model peptide in hepatocytes by autoreactive CD4+ T cells. We found that the hepatic peptide was not expressed in the thymus, leading to deficient thymic deletion and a resulting peripheral abundance of autoreactive CD4+ T cells. In the liver, autoreactive CD4+ effector T cells accumulated within portal ectopic lymphoid structures and matured into pathogenic IFN-γ- and TNF-coproducing cells. Expansion and pathogenic maturation of autoreactive effector T cells were enabled by a selective increase in the plasticity and instability of autoantigen-specific Tregs, but not of nonspecific Tregs. Indeed, antigen-specific Tregs were reduced in frequency and showed increased IL-17 production, reduced epigenetic demethylation, and reduced expression of Foxp3. As a consequence, autoantigen-specific Tregs had reduced suppressive capacity compared with nonspecific Tregs. In conclusion, loss of tolerance and the pathogenesis of AIH were enabled by the combined failure of thymic deletion and peripheral regulation.