18 research outputs found
TwistBytes - identification of Cuneiform languages and German dialects at VarDial 2019
We describe our approaches for the German Dialect Identification (GDI) and the Cuneiform Language Identification (CLI) tasks at the VarDial Evaluation Campaign 2019. The goal was to identify dialects of Swiss German in GDI and Sumerian and Akkadian in CLI. In GDI, the system should distinguish four dialects from the German-speaking part of Switzerland. Our system for GDI achieved third place out of 6 teams, with a macro-averaged F-1 of 74.6%. In CLI, the system should distinguish seven languages written in cuneiform script. Our system achieved third place out of 8 teams, with a macro-averaged F-1 of 74.7%.
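Both tasks are ranked by macro-averaged F-1, i.e. the unweighted mean of the per-class F-1 scores, so every dialect or language counts equally regardless of its frequency. A minimal sketch of how such a score can be computed with scikit-learn; the labels and predictions below are made-up placeholders, not task data:

```python
# Minimal sketch: macro-averaged F-1 as used for ranking in GDI/CLI.
# The labels and predictions are illustrative placeholders only.
from sklearn.metrics import f1_score

y_true = ["BE", "BS", "LU", "ZH", "BE", "ZH"]   # hypothetical gold dialect labels
y_pred = ["BE", "ZH", "LU", "ZH", "BE", "BS"]   # hypothetical system predictions

# Macro averaging gives every class the same weight,
# regardless of how many instances it has.
print(f1_score(y_true, y_pred, average="macro"))
```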
TRANSLIT : a large-scale name transliteration resource
Transliteration is the process of expressing a proper name from a source language in the characters of a target language (e.g. from Cyrillic to Latin characters). We present TRANSLIT, a large-scale corpus with approximately 1.6 million entries in more than 180 languages, covering about 3 million variants of person and geolocation names. The corpus is based on various public data sources, which have been transformed into a unified format to simplify their usage, plus a newly compiled dataset from Wikipedia. In addition, we apply several machine learning methods to establish baselines for automatically detecting transliterated names in various languages. Our best systems achieve an accuracy of 92% on the identification of transliterated pairs.
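To illustrate the kind of baseline such a task admits, the sketch below scores a candidate name pair with a character-bigram Dice coefficient; this is an illustration under our own assumptions, not the feature set or classifier actually used in the paper:

```python
# Hypothetical baseline sketch: score a candidate pair of names with a
# character-bigram Dice coefficient. Illustration only, not the paper's method.
def char_bigrams(name: str) -> set[str]:
    s = name.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a: str, b: str) -> float:
    ga, gb = char_bigrams(a), char_bigrams(b)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

# A pair would be accepted as a transliteration candidate above a tuned threshold.
print(dice("Tchaikovsky", "Chaykovskiy"))   # higher score -> more likely a pair
```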
Methods of NLP in arts management
The growth of digital archives and libraries in art, literature, and music; the shift of cultural marketing and cultural criticism to social media and online platforms; and the emergence of new digital art and cultural products lead to an enormous increase in digital data, creating challenges as well as opportunities for arts management practitioners and researchers. For arts practitioners, NLP can support marketing and communication, for example in target group analysis, event evaluation, (social) media analysis, pricing, social media optimization, advertisement targeting, or search engine optimization. In the field of archives, collections, and libraries, NLP can contribute to the improvement of indexing, consistency, and quality of databases as well as to the development of suitable search algorithms. In the distribution of cultural products, online platforms can be improved and markets analyzed.
ZHAW-InIT : social media geolocation at VarDial 2020
We describe our approaches for the Social Media Geolocation (SMG) task at the VarDial Evaluation Campaign 2020. The goal was to predict the geographical location (latitude and longitude) of an input text. There were three subtasks corresponding to German-speaking Switzerland (CH), Germany and Austria (DE-AT), and Croatia, Bosnia and Herzegovina, Montenegro and Serbia (BCMS). We submitted solutions to all subtasks but focused our development efforts on the CH subtask, where we achieved third place out of 16 submissions with a median distance of 15.93 km and had the best result of the 14 unconstrained systems. In the DE-AT subtask, we ranked sixth out of ten submissions (fourth of eight unconstrained systems), and for BCMS we achieved fourth place out of 13 submissions (second of 11 unconstrained systems).
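The ranking metric is the median great-circle distance in kilometres between predicted and gold coordinates. A minimal sketch of that evaluation; the coordinate pairs are placeholders, not task data:

```python
# Minimal sketch of the SMG evaluation metric: median haversine distance (km)
# between predicted and gold coordinates. The coordinates are placeholders.
import math
from statistics import median

def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

gold = [(47.37, 8.54), (46.95, 7.45)]   # e.g. Zurich, Bern
pred = [(47.05, 8.31), (46.52, 6.63)]   # e.g. Lucerne, Lausanne

print(median(haversine_km(g[0], g[1], p[0], p[1]) for g, p in zip(gold, pred)))
```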
Twist Bytes : German dialect identification with data mining optimization
We describe our approaches used in the German Dialect Identification (GDI) task at the VarDial Evaluation Campaign 2018. The goal was to identify which of four dialects spoken in the German-speaking part of Switzerland a given sentence belonged to. We adopted two different meta-classifier approaches and used data mining insights to improve the preprocessing and the meta-classifier parameters. In particular, we focused on different feature extraction methods and on how to combine them, since they influenced the system's performance very differently. Our system achieved second place out of 8 teams, with a macro-averaged F-1 of 64.6%. We also participated in the surprise dialect task with a multi-label approach.
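A simplified sketch of combining several feature extraction methods in front of a single classifier, in the spirit of (but not identical to) the described system; the n-gram ranges, the SVM, and the toy data are illustrative assumptions:

```python
# Sketch: combine word- and character-level n-gram features for dialect
# identification. Feature ranges and classifier are illustrative choices,
# not the exact configuration of the submitted system.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

clf = Pipeline([
    ("features", FeatureUnion([
        ("word_ngrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        ("char_ngrams", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))),
    ])),
    ("svm", LinearSVC()),
])

# Hypothetical toy data: sentences paired with dialect labels.
X = ["grüezi mitenand", "sali zäme", "hoi du"]
y = ["ZH", "BS", "LU"]
clf.fit(X, y)
print(clf.predict(["grüezi wohl"]))
```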
CEASR : a corpus for evaluating automatic speech recognition
In this paper, we present CEASR, a Corpus for Evaluating Automatic Speech Recognition (ASR) quality. It is a data set derived from public speech corpora, containing manual transcripts enriched with metadata, along with transcripts generated by several modern state-of-the-art ASR systems. CEASR provides this data in a unified structure, consistent across all corpora and systems, with normalised transcript texts and metadata.
We then use CEASR to evaluate the quality of ASR systems on the basis of their Word Error Rate (WER). Our experiments show, among other results, a substantial difference in quality between commercial and open-source ASR tools, and differences of up to a factor of ten for individual systems on different corpora. Using CEASR, we obtained these results efficiently and with little effort, showing that our corpus enables researchers to perform ASR-related evaluations and various in-depth analyses with noticeably reduced effort, without the need to collect, process and transcribe the speech data themselves.
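WER is the word-level edit distance between a hypothesis and its reference transcript, normalised by the reference length. A minimal sketch of the metric itself (not the CEASR tooling):

```python
# Minimal WER sketch: Levenshtein distance over words divided by the
# reference length. This illustrates the metric, not the CEASR code base.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for word-level edit distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words
```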
ZHAW-InIT at GermEval 2020 task 4 : low-resource speech-to-text
This paper presents the contribution of ZHAW-InIT to Task 4, "Low-Resource STT", at GermEval 2020. The goal of the task is to develop a system for translating Swiss German dialect speech into Standard German text in the domain of parliamentary debates. Our approach is based on Jasper, a CNN acoustic model, which we fine-tune on the task data. We enhance the base system with an extended language model containing in-domain data and with speed perturbation, and run further experiments with post-processing. Our submission achieved first place with a final Word Error Rate of 40.29%.
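Speed perturbation is a common data-augmentation step for acoustic-model training: the waveform is resampled at slightly different rates to create extra training variants. A minimal sketch using plain NumPy interpolation; the factors 0.9/1.0/1.1 are the commonly used values and an assumption here, not necessarily those of the submission:

```python
# Sketch of speed perturbation for acoustic-model training data.
# Resampling by a factor changes both duration and pitch; the factors
# 0.9/1.0/1.1 are typical choices and an assumption for this illustration.
import numpy as np

def speed_perturb(waveform: np.ndarray, factor: float) -> np.ndarray:
    """Resample a 1-D waveform so it plays `factor` times faster."""
    n_out = int(round(len(waveform) / factor))
    old_idx = np.arange(len(waveform))
    new_idx = np.linspace(0, len(waveform) - 1, n_out)
    return np.interp(new_idx, old_idx, waveform)

audio = np.random.randn(16000)  # one second of dummy audio at 16 kHz
augmented = [speed_perturb(audio, f) for f in (0.9, 1.0, 1.1)]
print([len(a) for a in augmented])
```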
Design patterns for resource-constrained automated deep-learning methods
We present an extensive evaluation of a wide variety of promising design patterns for automated deep-learning (AutoDL) methods, organized according to the problem categories of the 2019 AutoDL challenges, which set the task of optimizing both model accuracy and search efficiency under tight time and computing constraints. We propose structured empirical evaluations as the most promising avenue to obtain design principles for deep-learning systems due to the absence of strong theoretical support. From these evaluations, we distill relevant patterns which give rise to neural network design recommendations. In particular, we establish (a) that very wide fully connected layers learn meaningful features faster; we illustrate (b) how the lack of pretraining in audio processing can be compensated by architecture search; we show (c) that in text processing deep-learning-based methods only pull ahead of traditional methods for short text lengths with less than a thousand characters under tight resource limitations; and lastly we present (d) evidence that in very data- and computing-constrained settings, hyperparameter tuning of more traditional machine-learning methods outperforms deep-learning systems.
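As an illustration of the architecture behind finding (a), a minimal PyTorch sketch of a classifier with a single very wide fully connected layer; the width and dimensions are illustrative assumptions, not the evaluated search space:

```python
# Illustrative sketch for finding (a): a shallow network with one very wide
# fully connected layer. Width and dimensions are assumptions for illustration.
import torch
from torch import nn

def wide_fc_classifier(n_features: int, n_classes: int, width: int = 4096) -> nn.Module:
    return nn.Sequential(
        nn.Linear(n_features, width),  # single, very wide hidden layer
        nn.ReLU(),
        nn.Linear(width, n_classes),
    )

model = wide_fc_classifier(n_features=256, n_classes=10)
print(model(torch.randn(8, 256)).shape)  # torch.Size([8, 10])
```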