
    Introducing the NLU Showroom: A NLU Demonstrator for the German Language

    We present the NLU Showroom, a platform for interactively demonstrating the functionality of natural language understanding models through easy-to-use visual interfaces. The NLU Showroom focuses primarily on the German language, as few German NLU resources exist. However, it also serves corresponding English models to reach a broader audience. With the NLU Showroom we demonstrate and compare the capabilities and limitations of a variety of NLP/NLU models. The four initial demonstrators include (a) a comparison of how different word representations capture semantic similarity, (b) a comparison of how different sentence representations interpret sentence similarity, (c) a showcase on analyzing reviews with NLU, and (d) a showcase on finding links between entities. The NLU Showroom is built on state-of-the-art architectures for model serving and data processing. It targets a broad audience, from newcomers to researchers, with a focus on placing the presented models in the context of industrial applications.
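
    Demonstrator (a) compares how word representations capture semantic similarity, which is commonly quantified as the cosine similarity between embedding vectors. The following is a minimal sketch of that idea, not the Showroom's own code; the toy vectors and the cosine helper are illustrative assumptions.

        import numpy as np

        def cosine(u: np.ndarray, v: np.ndarray) -> float:
            """Cosine similarity between two embedding vectors."""
            return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

        # Toy 4-dimensional embeddings standing in for real German word vectors
        # (e.g. from fastText or a German BERT model).
        embeddings = {
            "Hund":  np.array([0.8, 0.1, 0.3, 0.0]),
            "Katze": np.array([0.7, 0.2, 0.4, 0.1]),
            "Auto":  np.array([0.0, 0.9, 0.1, 0.6]),
        }

        # Semantically related words should score higher than unrelated ones.
        print(cosine(embeddings["Hund"], embeddings["Katze"]))  # high: both animals
        print(cosine(embeddings["Hund"], embeddings["Auto"]))   # low: unrelated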

    Tokenizer Choice For LLM Training: Negligible or Crucial?

    The recent success of LLMs has been predominantly driven by curating the training dataset composition, scaling model architectures and dataset sizes, and advancing pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study of the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model's downstream performance as well as training and inference costs. In particular, we find that the common tokenizer evaluation metrics, fertility and parity, are not always predictive of model downstream performance, rendering these metrics a questionable proxy. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require a threefold increase in vocabulary size compared to English. While English-only tokenizers have been applied to the training of multilingual LLMs, we find that this approach results in severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
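
    Fertility, one of the metrics the abstract calls into question, is conventionally measured as the average number of subword tokens a tokenizer produces per word. Below is a minimal sketch of that computation, assuming the Hugging Face transformers library and the pretrained "gpt2" tokenizer; the model choice and sample sentences are illustrative, not taken from the paper.

        from transformers import AutoTokenizer

        def fertility(tokenizer, texts):
            """Average number of subword tokens per whitespace-delimited word."""
            total_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
            total_words = sum(len(t.split()) for t in texts)
            return total_tokens / total_words

        tokenizer = AutoTokenizer.from_pretrained("gpt2")  # English-centric BPE tokenizer
        samples_en = ["The quick brown fox jumps over the lazy dog."]
        samples_de = ["Der schnelle braune Fuchs springt über den faulen Hund."]

        # An English-only tokenizer typically shows higher fertility on German,
        # i.e. it fragments German words into more subword pieces.
        print("fertility EN:", fertility(tokenizer, samples_en))
        print("fertility DE:", fertility(tokenizer, samples_de))

    Parity, the second metric, roughly compares the token counts a tokenizer assigns to parallel sentences across languages; a tokenizer with high fertility on a language inflates sequence lengths and thus training and inference costs for that language.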

    Myth or possibility - Institutional reforms and change management for mode shift in freight transport: Summary report 1 - LowCarb-RFC - European Rail Freight Corridors Going Carbon Neutral

    The LowCarb-RFC study builds on current work on shifting freight transport to rail. That work uniformly neglects the core questions: (1) what are the actual effects on regions and their transport systems, and (2) whether and how the key institutions can be made fit for this task. The study investigates these questions for two European freight corridors as well as the logistics hub of Nordrhein-Westfalen. It applies three different models for traffic simulation and sustainability assessment, examines institutional reform processes within and outside the transport sector, and operates a platform for the exchange of knowledge and experience among representatives of transport companies, the shipping industry, and policymakers. The study is co-funded by the Mercator-Stiftung in cooperation with the European Climate Foundation over the period 2015 to 2018.