Few-Shot and Zero-Shot Learning for Historical Text Normalization

Authors: Bollmann Marcel; Korchagina Natalia; Søgaard Anders
Publication date: 01/01/2019

Historical text normalization often relies on small training datasets. Recent work has shown that multi-task learning can lead to significant improvements by exploiting synergies with related datasets, but there has been no systematic study of different multi-task learning architectures. This paper evaluates 63 multi-task learning configurations for sequence-to-sequence-based historical text normalization across ten datasets from eight languages, using autoencoding, grapheme-to-phoneme mapping, and lemmatization as auxiliary tasks. We observe consistent, significant improvements across languages when training data for the target task is limited, but minimal or no improvements when training data is abundant. We also show that zero-shot learning outperforms the simple, but relatively strong, identity baseline.

Comment: Accepted at DeepLo-201

arXiv.org e-Print Archive

Crossref

Publikationer från Linköpings universitet

Copenhagen University Research Information System

ZORA

Digitala Vetenskapliga Arkivet - Academic Archive On-line

A Large-Scale Comparison of Historical Text Normalization Systems

Author: Bollmann Marcel
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2019

There is no consensus on the state-of-the-art approach to historical text normalization. Many techniques have been proposed, including rule-based methods, distance metrics, character-based statistical machine translation, and neural encoder-decoder models, but studies have used different datasets and different evaluation methods, and have come to different conclusions. This paper presents the largest study of historical text normalization done so far. We critically survey the existing literature and report experiments on eight languages, comparing systems spanning all categories of proposed normalization techniques, analysing the effect of training data quantity, and using different evaluation methods. The datasets and scripts are made publicly available.

Comment: Accepted at NAACL 201

arXiv.org e-Print Archive

Crossref

Publikationer från Linköpings universitet

Copenhagen University Research Information System

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Few-Shot and Zero-Shot Learning for Historical Text Normalization

Authors: Bollmann Marcel; Korchagina Natalia; Søgaard Anders
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2019

Crossref

Copenhagen University Research Information System

ZORA

Learning attention for historical text normalization by learning to pronounce

Authors: Bingel Joachim; Bollmann Marcel; Søgaard Anders
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2017

Crossref

Copenhagen University Research Information System

LL(O)D and NLP perspectives on semantic change for humanities research

Authors: Apostol Elena-Simona; Armaselu Florentina; Cimiano Philipp; Khan Anas Fahad; Liebeskind Chaya; McGillivray Barbara; Truică Ciprian-Octavian; Utka Andrius; Valūnaitė Oleškevičienė Giedrė; van Erp Marieke
Publication date: 01/01/2022

CC BY 4.0

This paper presents an overview of the LL(O)D and NLP methods, tools and data for detecting and representing semantic change, with its main application in humanities research. The paper’s aim is to provide the starting point for the construction of a workflow and set of multilingual diachronic ontologies within the humanities use case of the COST Action Nexus Linguarum, European network for Web-centred linguistic data science, CA18209. The survey focuses on the essential aspects needed to understand the current trends and to build applications in this area of study.

Mykolas Romeris University Institutional Repository