9 research outputs found
Crossings as a side effect of dependency lengths
The syntactic structure of sentences exhibits a striking regularity:
dependencies tend to not cross when drawn above the sentence. We investigate
two competing explanations. The traditional hypothesis is that this trend
arises from an independent principle of syntax that reduces crossings
practically to zero. An alternative to this view is the hypothesis that
crossings are a side effect of dependency lengths, i.e. sentences with shorter
dependency lengths should tend to have fewer crossings. We are able to reject
the traditional view in the majority of languages considered. The alternative
hypothesis can lead to a more parsimonious theory of language.
Comment: the discussion section has been expanded significantly; in press in Complexity (Wiley)
A Dependency Treebank for Telugu
In this paper, we describe the annotation and development of a Telugu treebank following the Universal Dependencies framework. We manually annotated 1328 sentences from a Telugu grammar textbook, and the treebank is freely available from Universal Dependencies version 2.1. We discuss some language-specific annotation issues and decisions, and report preliminary experiments with POS tagging and dependency parsing. To the best of our knowledge, this is the first freely accessible and open dependency treebank for Telugu.
HamleDT 2.0: Thirty Dependency Treebanks Stanfordized
We present HamleDT 2.0 (HArmonized Multi-LanguagE Dependency Treebank). HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a treebank annotation style that became popular recently.
We use the newest basic Universal Stanford Dependencies, without added language-specific subtypes. We describe both annotation styles, including the adjustments that were necessary, and provide details about the conversion process. We also discuss the differences between the two styles, evaluating their advantages and disadvantages, and note the effects of the differences on the conversion.
We regard the stanfordization as generally successful, although we admit several shortcomings, especially in the distinction between direct and indirect objects, that have to be addressed in the future.
We release part of HamleDT 2.0 freely; we are not allowed to redistribute the whole dataset, but we do provide the conversion pipeline.
CLARA: A New Generation of Researchers in Common Language Resources and Their Applications
CLARA (Common Language Resources and Their Applications) is a Marie Curie Initial Training Network which ran from 2009 until 2014 with the aim of providing researcher training in crucial areas related to language resources and infrastructure. The scope of the project was broad and included infrastructure design, lexical semantic modeling, domain modeling, multimedia and multi-modal communication, applications, and parsing technologies and grammar models. An international consortium of 9 partners and 12 associate partners employed researchers in 19 new positions and organized a training program consisting of 10 thematic courses and summer/winter schools. The project has resulted in new theoretical insights as well as new resources and tools. Most importantly, the project has trained a new generation of researchers who can perform advanced research and development in language resources and technologies.
Peer reviewed
Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLP
Transfer learning, particularly approaches that combine multi-task learning
with pre-trained contextualized embeddings and fine-tuning, has advanced the
field of Natural Language Processing tremendously in recent years. In this
paper we present MaChAmp, a toolkit for easy fine-tuning of contextualized
embeddings in multi-task settings. The benefits of MaChAmp are its flexible
configuration options, and the support of a variety of natural language
processing tasks in a uniform toolkit, from text classification and sequence
labeling to dependency parsing, masked language modeling, and text generation.
Comment: https://machamp-nlp.github.io
Prague Dependency Style Treebank for Tamil
Annotated corpora such as treebanks are important for the development of parsers and language applications, as well as for understanding the
language itself. Only very few languages possess these scarce resources. In this paper, we describe our efforts in syntactically annotating
a small corpus (600 sentences) of the Tamil language. Our annotation is similar to that of the Prague Dependency Treebank (PDT) and consists of
annotation at 2 levels or layers: (i) morphological layer (m-layer) and (ii) analytical layer (a-layer). For both layers, we introduce
annotation schemes, i.e. positional tagging for the m-layer and dependency relations for the a-layer. Finally, we discuss some of the issues in
treebank development for Tamil.