12 research outputs found

Evaluation of language identification methods using 285 languages

Author: Jauhiainen Heidi Annika
Jauhiainen Tommi Sakari
Linden Bo Krister Johan
Publication venue: Linköping University Electronic Press
Publication date: 01/01/2017

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Evaluating HeLI with non-linear mappings

Author: Jauhiainen Heidi Annika
Jauhiainen Tommi Sakari
Linden Bo Krister Johan
Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2017

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Discriminating between Mandarin Chinese and Swiss-German varieties using adaptive language models

Author: Jauhiainen Heidi Annika
Jauhiainen Tommi Sakari
Linden Bo Krister Johan
Publication venue: The Association for Computational Linguistics
Publication date: 30/04/2019

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Language Set Identification in Noisy Synthetic Multilingual Documents

Author: Jauhiainen Heidi Annika
Jauhiainen Tommi Sakari
Linden Krister
Publication venue: Springer International Publishing AG
Publication date: 01/01/2015

Proceeding volume: Part I

In this paper, we reconsider the problem of language identification of multilingual documents. Automated language identification algorithms have been improving steadily from the seventies until recent years. The current state-of-the-art language identifiers are quite efficient even with only a few characters and this gives us enough reason to again evaluate the possibility to use existing language identifiers for monolingual text to detect the language set of a multilingual document. We are using a previously developed language identifier for monolingual documents with the multilingual documents from the WikipediaMulti dataset published in a recent study. Our method outperforms previous methods tested with the same data, achieving an F1-score of 97.6 when classifying between 44 languages.

Peer reviewed

Crossref

Helsingin yliopiston digitaalinen arkisto

Language and Dialect Identification of Cuneiform Texts

Author: Alstola Tero
Jauhiainen Heidi Annika
Jauhiainen Tommi Sakari
Linden Bo Krister Johan
Publication venue: The Association for Computational Linguistics
Publication date: 30/04/2019

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

HeLI, a Word-Based Backoff Method for Language Identification

Author: Jauhiainen Heidi Annika
Jauhiainen Tommi Sakari
Linden Bo Krister Johan
Publication venue
Publication date: 01/01/2016

In this paper we describe the Helsinki language identification method, HeLI, and the resources we created for and used in the 3rd edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial 2016 workshop. The shared task comprised a total of 8 tracks, of which we participated in 7. The shared task had a record number of participants, with 17 teams providing results for the closed track of the test set A. Our system reached the 2nd position in 4 tracks (A closed and open, B1 open and B2 open) and in this paper we are focusing on the methods and data used for those tracks. We describe our word-based back-off method in mathematical notation. We also describe how we selected the corpus we used in the open tracks.

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

HeLI-based Experiments in Swiss German Dialect Identification

Author: Jauhiainen Heidi Annika
Jauhiainen Tommi Sakari
Linden Bo Krister Johan
Publication venue: The Association for Computational Linguistics
Publication date: 01/08/2018

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

HeLI-based Experiments in Discriminating Between Dutch and Flemish Subtitles

Author: Jauhiainen Heidi Annika
Jauhiainen Tommi Sakari
Linden Bo Krister Johan
Publication venue: The Association for Computational Linguistics
Publication date: 01/08/2018

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Iterative Language Model Adaptation for Indo-Aryan Language Identification

Author: Jauhiainen Heidi Annika
Jauhiainen Tommi Sakari
Linden Bo Krister Johan
Publication venue: The Association for Computational Linguistics
Publication date: 01/08/2018

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Improving OCR of historical newspapers and journals published in Finland

Author: Breuel Thomas M
Drobac Senka
Jauhiainen Tommi Sakari
Levenshtein Vladimir I.
Publication venue: American College of Medical Physics (ACMP)
Publication date: 01/01/2019

Peer reviewed

Crossref

Helsingin yliopiston digitaalinen arkisto