TamilTB: An Effort Towards Building a Dependency Treebank for Tamil
Annotated corpora such as treebanks are important for the development
of parsers, language applications as well as understanding of the language itself.
Only very few languages possess these scarce resources. In this paper, we describe
our effort in syntactically annotating a small corpus (600 sentences) of the Tamil
language. Our annotation is similar to that of the Prague Dependency Treebank (PDT 2.0)
and consists of 2 levels or layers: (i) morphological layer (m-layer) and (ii) analytical
layer (a-layer). For both layers, we introduce annotation schemes, i.e. positional
tagging for the m-layer and dependency relations (and how dependency structures
should be drawn) for the a-layer. Finally, we evaluate our corpus in the tagging and
parsing tasks using well-known taggers and parsers and discuss some general issues
in annotation for the Tamil language.
Crossings as a side effect of dependency lengths
The syntactic structure of sentences exhibits a striking regularity:
dependencies tend to not cross when drawn above the sentence. We investigate
two competing explanations. The traditional hypothesis is that this trend
arises from an independent principle of syntax that reduces crossings
practically to zero. An alternative to this view is the hypothesis that
crossings are a side effect of dependency lengths, i.e. sentences with shorter
dependency lengths should tend to have fewer crossings. We are able to reject
the traditional view in the majority of languages considered. The alternative
hypothesis can lead to a more parsimonious theory of language.
Comment: the discussion section has been expanded significantly; in press in
Complexity (Wiley)
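
The two quantities this abstract contrasts, crossings and dependency lengths, can be illustrated with a minimal sketch. It is not the authors' method: the head-list encoding (1-based head positions, 0 for the root) and the function names are assumptions made for illustration only.

```python
from itertools import combinations


def dependency_arcs(heads):
    """Return arcs as (left, right) word positions.

    Assumed encoding: heads[i] is the 1-based position of the head of
    word i+1, and 0 marks the root (which contributes no arc).
    """
    return [(min(i, h), max(i, h))
            for i, h in enumerate(heads, start=1) if h != 0]


def total_dependency_length(heads):
    """Sum of linear distances between each word and its head."""
    return sum(right - left for left, right in dependency_arcs(heads))


def count_crossings(heads):
    """Count pairs of arcs that cross when drawn above the sentence.

    Two arcs (a, b) and (c, d) cross when exactly one endpoint of one
    arc lies strictly inside the span of the other.
    """
    crossings = 0
    for (a, b), (c, d) in combinations(dependency_arcs(heads), 2):
        if a < c < b < d or c < a < d < b:
            crossings += 1
    return crossings


# Hypothetical toy sentences, not taken from any treebank.
projective = [2, 0, 2]          # words 1 and 3 both depend on word 2
non_projective = [3, 4, 0, 3]   # arcs 1-3 and 2-4 cross

print(count_crossings(projective), total_dependency_length(projective))
print(count_crossings(non_projective), total_dependency_length(non_projective))
```

In these toy inputs the sentence with the crossing also has the larger total dependency length, which is the direction of association the alternative hypothesis predicts; a real test would need treebank data and a baseline.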
Cross-lingual RST Discourse Parsing
Discourse parsing is an integral part of understanding information flow and
argumentative structure in documents. Most previous research has focused on
inducing and evaluating models from the English RST Discourse Treebank.
However, discourse treebanks for other languages exist, including Spanish,
German, Basque, Dutch and Brazilian Portuguese. The treebanks share the same
underlying linguistic theory, but differ slightly in the way documents are
annotated. In this paper, we present (a) a new discourse parser which is
simpler than, yet competitive with, the state of the art for English
(significantly better on 2/3 metrics), (b) a harmonization of discourse
treebanks across languages, enabling us to present (c) what to the best of
our knowledge are the first experiments on cross-lingual discourse parsing.
Comment: To be published in EACL 2017, 13 pages