54 research outputs found
Synthetic Treebanking for Cross-Lingual Dependency Parsing
Accepted to appear in the special issue on Cross-Language Algorithms and Applications. Peer reviewed.
Cross-Lingual Dependency Parsing for Closely Related Languages - Helsinki's Submission to VarDial 2017
This paper describes the submission from the University of Helsinki to the
shared task on cross-lingual dependency parsing at VarDial 2017. We present
work on annotation projection and treebank translation that gave good results
for all three target languages in the test set. In particular, Slovak seems to
work well with information coming from the Czech treebank, which is in line
with related work. The attachment scores of the cross-lingual models even
surpass those of fully supervised models trained on the target-language
treebank. Croatian is the most difficult language in the test set, and the
improvements over the baseline are rather modest. Norwegian works best with
information coming from Swedish, whereas Danish contributes surprisingly little.
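The paper describes annotation projection at a high level but includes no code. A minimal sketch of the core idea, projecting labels from an annotated source sentence onto an aligned target sentence, might look like this (the function name and data shapes are illustrative assumptions, not the authors' implementation):

```python
def project_annotations(source_tags, alignment):
    """Project token-level annotations (e.g. POS tags) from a source
    sentence onto a target sentence through a word alignment.

    source_tags: list of tags, one per source token.
    alignment:   list of (source_index, target_index) pairs.
    Returns a dict mapping target indices to projected tags; when a
    target token is aligned to several source tokens, the first
    alignment wins (a deliberately simple tie-breaking choice).
    """
    projected = {}
    for src_i, tgt_i in alignment:
        projected.setdefault(tgt_i, source_tags[src_i])
    return projected
```

In practice, projection pipelines add filtering heuristics (alignment confidence, one-to-one constraints) on top of this bare transfer step.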
Introduction to the special issue on cross-language algorithms and applications
With the increasingly global nature of our everyday interactions, the need for multilingual technologies to support efficient and effective information access and communication cannot be overemphasized. Computational modeling of language has been the focus of
Natural Language Processing, a subdiscipline of Artificial Intelligence. One of the current challenges for this discipline is to design methodologies and algorithms that are cross-language in order to create multilingual technologies rapidly. The goal of this JAIR special
issue on Cross-Language Algorithms and Applications (CLAA) is to present leading research in this area, with emphasis on developing unifying themes that could lead to the development of the science of multi- and cross-lingualism. In this introduction, we provide the reader with the motivation for this special issue and summarize the contributions of the papers that have been included. The selected papers cover a broad range of cross-lingual technologies including machine translation, domain and language adaptation for sentiment
analysis, cross-language lexical resources, dependency parsing, information retrieval and knowledge representation. We anticipate that this special issue will serve as an invaluable resource for researchers interested in topics of cross-lingual natural language processing.
Irish treebanking and parsing: a preliminary evaluation
Language resources are essential for linguistic research and the development of NLP applications. Low-density languages, such as Irish, therefore lack significant research in this area. This paper describes the early stages in the development of new language resources for Irish – namely the first Irish dependency treebank and the first Irish statistical dependency parser. We present the methodology behind building our new treebank and the steps we take to leverage the few existing resources. We discuss language-specific choices made when defining our dependency labelling scheme, and describe interesting Irish language characteristics such as prepositional attachment, the copula and clefting. We manually develop a small treebank of 300 sentences based on an existing POS-tagged corpus and report an inter-annotator agreement of 0.7902. We train MaltParser to achieve preliminary parsing results for Irish and describe a bootstrapping approach for further stages of development.
Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources
Translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system mostly depends on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.
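The abstract describes generating new phrase associations from morphological variants filtered by a string similarity score. A toy sketch of that idea, using a generic similarity ratio in place of the paper's morphosyntactically informed score (all names, data structures, and the threshold here are illustrative assumptions):

```python
from difflib import SequenceMatcher


def expand_phrase_table(phrase_table, morph_lexicon, threshold=0.8):
    """Add phrase-table entries for morphological variants of known
    source phrases.

    phrase_table:  dict mapping source phrases to target phrases.
    morph_lexicon: dict mapping a lemma to its inflected forms.
    threshold:     minimum surface similarity for a variant to be kept,
                   standing in for the paper's morphosyntactic score.
    """
    new_entries = {}
    for src, tgt in phrase_table.items():
        for lemma, forms in morph_lexicon.items():
            if src != lemma and src not in forms:
                continue  # lexicon entry unrelated to this phrase
            for variant in forms:
                if variant == src:
                    continue
                sim = SequenceMatcher(None, src, variant).ratio()
                if sim >= threshold:
                    # Reuse the known translation for the new variant.
                    new_entries[variant] = tgt
    return new_entries
```

A real PBSMT system would also assign translation and reordering scores to the new entries rather than copying the target phrase verbatim.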
Improving Zero-shot Cross-lingual Transfer between Closely Related Languages by Injecting Character-level Noise
Cross-lingual transfer between a high-resource language and its dialects or
closely related language varieties should be facilitated by their similarity.
However, current approaches that operate in the embedding space do not take
surface similarity into account. This work presents a simple yet effective
strategy to improve cross-lingual transfer between closely related varieties. We
propose to augment the data of the high-resource source language with
character-level noise to make the model more robust towards spelling
variations. Our strategy shows consistent improvements across several languages
and tasks: zero-shot transfer of POS tagging and topic identification between
language varieties from the Finnic, West and North Germanic, and Western
Romance language branches. Our work provides evidence for the usefulness of
simple surface-level noise in improving transfer between language varieties.
Comment: ACL 202
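The augmentation the abstract describes is deliberately simple: perturb source-language training text with random character-level edits. A minimal sketch of such a noise function (function name, edit mix, and probability are illustrative, not the authors' exact recipe):

```python
import random


def add_char_noise(text, noise_prob=0.1,
                   alphabet="abcdefghijklmnopqrstuvwxyz", seed=None):
    """Randomly delete, substitute, or insert characters.

    Each character is perturbed with probability noise_prob, split
    evenly between the three edit types, to mimic the spelling
    variation found in closely related language varieties.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        r = rng.random()
        if r < noise_prob / 3:
            continue                          # delete this character
        elif r < 2 * noise_prob / 3:
            out.append(rng.choice(alphabet))  # substitute it
        elif r < noise_prob:
            out.append(ch)
            out.append(rng.choice(alphabet))  # insert a character after it
        else:
            out.append(ch)                    # keep it unchanged
    return "".join(out)
```

Applying this to the high-resource source data before fine-tuning is what makes the downstream model more robust to surface variation in the related target variety.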