Few-Shot and Zero-Shot Learning for Historical Text Normalization

Authors: Bollmann Marcel, Korchagina Natalia, Søgaard Anders
Publication date: 01/01/2019

Historical text normalization often relies on small training datasets. Recent work has shown that multi-task learning can lead to significant improvements by exploiting synergies with related datasets, but there has been no systematic study of different multi-task learning architectures. This paper evaluates 63 multi-task learning configurations for sequence-to-sequence-based historical text normalization across ten datasets from eight languages, using autoencoding, grapheme-to-phoneme mapping, and lemmatization as auxiliary tasks. We observe consistent, significant improvements across languages when training data for the target task is limited, but minimal or no improvements when training data is abundant. We also show that zero-shot learning outperforms the simple, but relatively strong, identity baseline.

Comment: Accepted at DeepLo-201

arXiv.org e-Print Archive

Crossref

Publikationer från Linköpings universitet

Copenhagen University Research Information System

ZORA

Digitala Vetenskapliga Arkivet - Academic Archive On-line

A Large-Scale Comparison of Historical Text Normalization Systems

Author: Bollmann Marcel
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2019

There is no consensus on the state-of-the-art approach to historical text normalization. Many techniques have been proposed, including rule-based methods, distance metrics, character-based statistical machine translation, and neural encoder-decoder models, but studies have used different datasets, different evaluation methods, and have come to different conclusions. This paper presents the largest study of historical text normalization done so far. We critically survey the existing literature and report experiments on eight languages, comparing systems spanning all categories of proposed normalization techniques, analysing the effect of training data quantity, and using different evaluation methods. The datasets and scripts are made publicly available.

Comment: Accepted at NAACL 201

arXiv.org e-Print Archive

Crossref

Publikationer från Linköpings universitet

Copenhagen University Research Information System

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Towards a machine-learning architecture for lexical functional grammar parsing

Author: Chrupała Grzegorz
Publication venue: Dublin City University. School of Computing
Publication date: 01/11/2008

Data-driven grammar induction aims at producing wide-coverage grammars of human languages. Initial efforts in this field produced relatively shallow linguistic representations such as phrase-structure trees, which only encode constituent structure. Recent work on inducing deep grammars from treebanks addresses this shortcoming by also recovering non-local dependencies and grammatical relations. My aim is to investigate the issues arising when adapting an existing Lexical Functional Grammar (LFG) induction method to a new language and treebank, and to find solutions which will generalize robustly across multiple languages. The research hypothesis is that by exploiting machine-learning algorithms to learn morphological features, lemmatization classes and grammatical functions from treebanks, we can reduce the amount of manual specification and improve robustness, accuracy, and domain- and language-independence for LFG parsing systems. Function labels can often be mapped relatively straightforwardly to LFG grammatical functions. Learning them reliably permits grammar induction to depend less on language-specific LFG annotation rules. I therefore propose ways to improve acquisition of function labels from treebanks and translate those improvements into better-quality f-structure parsing. In a lexicalized grammatical formalism such as LFG, a large amount of syntactically relevant information comes from lexical entries. It is, therefore, important to be able to perform morphological analysis in an accurate and robust way for morphologically rich languages. I propose a fully data-driven supervised method to simultaneously lemmatize and morphologically analyze text and obtain competitive or improved results on a range of typologically diverse languages.

Irish Universities

DCU Online Research Access Service

Few-Shot and Zero-Shot Learning for Historical Text Normalization

Authors: Bollmann Marcel, Korchagina Natalia, Søgaard Anders
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2019

Crossref

Copenhagen University Research Information System

ZORA

Collating Medieval Vernacular Texts. Aligning Witnesses, Classifying Variants

Authors: Camps Jean-Baptiste, Ing Lucence, Spadini Elena
Publication venue: HAL CCSD
Publication date: 09/07/2019

International audienc