
    TwistBytes - identification of Cuneiform languages and German dialects at VarDial 2019

    We describe our approaches for the German Dialect Identification (GDI) and the Cuneiform Language Identification (CLI) tasks at the VarDial Evaluation Campaign 2019. The goal was to identify dialects of Swiss German in GDI, and Sumerian and Akkadian in CLI. In GDI, the system should distinguish four dialects from the German-speaking part of Switzerland. Our system for GDI achieved third place out of 6 teams, with a macro-averaged F1 of 74.6%. In CLI, the system should distinguish seven languages written in cuneiform script. Our system achieved third place out of 8 teams, with a macro-averaged F1 of 74.7%.
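    The macro-averaged F1 used for ranking weights every dialect class equally, regardless of how frequent it is. A minimal sketch of the metric (this is the evaluation measure, not the submitted system; label names are hypothetical):

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: compute F1 per class, then take the unweighted mean."""
    f1s = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

    A class that is never predicted contributes an F1 of 0 to the average, which is why macro F1 penalizes systems that ignore rare classes.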

    TRANSLIT : a large-scale name transliteration resource

    Transliteration is the process of expressing a proper name from a source language in the characters of a target language (e.g. from Cyrillic to Latin characters). We present TRANSLIT, a large-scale corpus with approx. 1.6 million entries in more than 180 languages and about 3 million variants of person and geolocation names. The corpus is based on various public data sources, which have been transformed into a unified format to simplify their usage, plus a newly compiled dataset from Wikipedia. In addition, we apply several machine learning methods to establish baselines for automatically detecting transliterated names in various languages. Our best systems achieve an accuracy of 92% in identifying transliterated pairs.
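    To illustrate the pair-identification task itself (this is not one of the paper's machine-learning baselines), a simple heuristic romanizes the Cyrillic side with a hand-written mapping and thresholds the normalized edit distance. The mapping and threshold below are deliberately tiny and hypothetical:

```python
# Tiny, illustrative Cyrillic->Latin mapping -- far from complete.
CYR2LAT = {"а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
           "и": "i", "к": "k", "л": "l", "м": "m", "н": "n", "о": "o",
           "п": "p", "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f"}

def romanize(s):
    """Map known Cyrillic characters to Latin; leave everything else as-is."""
    return "".join(CYR2LAT.get(ch, ch) for ch in s.lower())

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_transliteration_pair(a, b, threshold=0.35):
    """Accept a pair if the normalized edit distance of the romanized forms is small."""
    ra, rb = romanize(a), romanize(b)
    return edit_distance(ra, rb) / max(len(ra), len(rb), 1) <= threshold
```

    The learned systems in the paper replace this hand-crafted mapping with models trained on the corpus itself.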

    Twist Bytes : German dialect identification with data mining optimization

    We describe our approaches used in the German Dialect Identification (GDI) task at the VarDial Evaluation Campaign 2018. The goal was to identify which of four dialects spoken in the German-speaking part of Switzerland a sentence belongs to. We adopted two different meta-classifier approaches and used data mining insights to improve the preprocessing and the meta-classifier parameters. In particular, we focused on different feature extraction methods and on how to combine them, since they influenced system performance very differently. Our system achieved second place out of 8 teams, with a macro-averaged F1 of 64.6%. We also participated in the surprise dialect task with a multi-label approach.
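    The meta-classifiers described above are trained on the outputs of several base systems; a much simpler combination scheme in the same spirit is plain majority voting over the base systems' predicted labels (an illustrative sketch, not the submitted approach; the dialect labels are hypothetical):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-system predictions: the most common label per instance wins.

    predictions: list of lists, one inner list of labels per base system,
    all covering the same sequence of instances.
    """
    combined = []
    for labels in zip(*predictions):  # one tuple of labels per instance
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined
```

    A trained meta-classifier generalizes this by learning, from held-out data, how much to trust each base system instead of counting votes equally.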

    ZHAW-InIT : social media geolocation at VarDial 2020

    We describe our approaches for the Social Media Geolocation (SMG) task at the VarDial Evaluation Campaign 2020. The goal was to predict the geographical location (latitude and longitude) of a given input text. There were three subtasks corresponding to German-speaking Switzerland (CH), Germany and Austria (DE-AT), and Croatia, Bosnia and Herzegovina, Montenegro and Serbia (BCMS). We submitted solutions to all subtasks but focused our development efforts on the CH subtask, where we achieved third place out of 16 submissions with a median distance of 15.93 km, the best result among the 14 unconstrained systems. In the DE-AT subtask, we ranked sixth out of ten submissions (fourth of 8 unconstrained systems), and for BCMS we achieved fourth place out of 13 submissions (second of 11 unconstrained systems).
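    The median-distance score reported above can be computed from great-circle distances between gold and predicted coordinates; a self-contained sketch using the haversine formula (the coordinate pairs in the test are invented, not task data):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance on a sphere of Earth's mean radius (~6371 km)."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def median_distance_km(gold, pred):
    """Median of per-instance distances between gold and predicted (lat, lon) pairs."""
    dists = sorted(haversine_km(*g, *p) for g, p in zip(gold, pred))
    n, mid = len(dists), len(dists) // 2
    return dists[mid] if n % 2 else (dists[mid - 1] + dists[mid]) / 2
```

    Using the median rather than the mean keeps a few wildly mislocated texts from dominating the score.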

    ZHAW-InIT at GermEval 2020 task 4 : low-resource speech-to-text

    This paper presents the contribution of ZHAW-InIT to Task 4 "Low-Resource STT" at GermEval 2020. The goal of the task is to develop a system for translating Swiss German dialect speech into Standard German text in the domain of parliamentary debates. Our approach is based on Jasper, a CNN-based acoustic model, which we fine-tune on the task data. We enhance the base system with an extended language model containing in-domain data, apply speed perturbation, and run further experiments with post-processing. Our submission achieved first place with a final Word Error Rate of 40.29%.
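    Word Error Rate, the metric behind that ranking, is the word-level edit distance between hypothesis and reference divided by the reference length; a minimal sketch (the example sentences in the test are invented):

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    prev = list(range(len(hyp) + 1))
    for i, rw in enumerate(ref, 1):
        cur = [i]
        for j, hw in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (rw != hw)))  # substitution
        prev = cur
    return prev[-1] / len(ref)
```

    Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.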

    Design patterns for resource-constrained automated deep-learning methods

    We present an extensive evaluation of a wide variety of promising design patterns for automated deep-learning (AutoDL) methods, organized according to the problem categories of the 2019 AutoDL challenges, which set the task of optimizing both model accuracy and search efficiency under tight time and computing constraints. We propose structured empirical evaluations as the most promising avenue to obtain design principles for deep-learning systems, given the absence of strong theoretical support. From these evaluations, we distill relevant patterns which give rise to neural network design recommendations. In particular, we establish (a) that very wide fully connected layers learn meaningful features faster; we illustrate (b) how the lack of pretraining in audio processing can be compensated by architecture search; we show (c) that in text processing, deep-learning-based methods only pull ahead of traditional methods for short texts of fewer than a thousand characters under tight resource limitations; and lastly we present (d) evidence that in very data- and computing-constrained settings, hyperparameter tuning of more traditional machine-learning methods outperforms deep-learning systems.

    Short-time dynamic patterns of bioaerosol generation and displacement in an indoor environment

    Short-time dynamics and distribution of airborne biological and total particles were assessed in a large university hallway by particle counting using laser particle counters and impaction air samplers. Particle numbers of four different size ranges were determined every 2 minutes over several hours. Bioaerosols (culturable bacteria and fungi determined as colony-forming units) were directly collected every 5 minutes on Petri dishes containing the corresponding growth medium. Results clearly show distinct short-time dynamics of particulate aerosols, both of biological and non-biological origin. These reproducible periodic patterns are closely related to the periods when lectures are held in lecture rooms and the intermissions in between, when students are present in the hallway. Peaks of airborne culturable bacteria were observed with a periodicity of 1 hour. Bioaerosol concentrations synchronously follow the variation in the total number of particles. These highly reproducible temporal dynamics have to be considered when monitoring indoor environments with respect to air quality.

    Fantastically reasonable: ambivalence in the representation of science and technology in super-hero comics

    A long-standing contrast in academic discussions of science concerns its perceived disenchanting or enchanting public impact. In one image, science displaces magical belief in unknowable entities with belief in knowable forces and processes and reduces all things to a single technical measure. In the other, science is itself magically transcendent, expressed in technological adulation and an image of scientists as wizards or priests. This paper shows that these contrasting images are also found in representations of science in super-hero comics, which, given their lowly status in Anglo-American culture, would seem an unlikely place to find such commonality with academic discourse. It is argued that this is evidence that the contrast constitutes an ambivalence arising from the dilemmas that science poses; they are shared rhetorics arising from, and reflexively feeding, a set of broad cultural concerns. This is explored through consideration of representations of science at a number of levels in the comics, with particular focus on the science-magic constellation, and on enchanted and disenchanted imagery in representations of technology and scientists. It is concluded that super-hero comics are one cultural arena where the public meaning of science is actively worked out, an activity that unites “expert” and “non-expert” alike.

    spMMMP at GermEval 2018 shared task : classification of offensive content in tweets using convolutional neural networks and gated recurrent units

    In this paper, we propose two different systems for classifying offensive language in micro-blog messages from Twitter ("tweets"). The first system uses an ensemble of convolutional neural networks (CNNs), whose outputs are fed to a meta-classifier for the final prediction. The second system combines a CNN and a gated recurrent unit (GRU) with a transfer-learning approach based on pretraining on a large, automatically translated dataset.