166 research outputs found
Babel Treebank of Public Messages in Croatian
The paper presents the process of constructing a publicly available treebank of public messages written in Croatian. The messages were collected from various electronic sources – e-mail, blog, Facebook and SMS – and published on the Zagreb Museum of Contemporary Art LED facade within the Babel art project. The project aimed to use the facade as an open-space blog or social interface enabling citizens to publicly express their views. The construction and current state of the treebank are presented along with future work plans. A comparison of the Babel Treebank with the Croatian Dependency Treebank and the SETimes.HR treebank regarding differing domains and annotation schemes is briefly sketched. The treebank is used as a test platform for introducing a new standard for syntactic annotation of Croatian texts. An experiment with morphosyntactic tagging and dependency parsing of the treebank is conducted, providing a first insight into the computational processing of non-standard text in Croatian.
Cross-lingual Dependency Parsing of Related Languages with Rich Morphosyntactic Tagsets
This paper addresses cross-lingual dependency parsing using rich morphosyntactic tagsets. In our case study, we experiment with three related Slavic languages: Croatian, Serbian and Slovene. Four different dependency treebanks are used for monolingual parsing, direct cross-lingual parsing, and a recently introduced cross-lingual parsing approach that utilizes statistical machine translation and annotation projection. We argue for the benefits of using rich morphosyntactic tagsets in cross-lingual parsing and empirically support the claim by showing large improvements over an impoverished common feature representation in the form of a reduced part-of-speech tagset. In the process, we improve over the previous state-of-the-art scores in dependency parsing for all three languages.
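The contrast between a rich morphosyntactic tag and a reduced part-of-speech tag can be made concrete with a minimal sketch. It assumes positional tags in the MULTEXT-East style, where the first character encodes the word category; the helper `reduce_msd` and the sample tags are illustrative and are not taken from the paper, whose reduced tagset may be defined differently.

```python
# Hypothetical helper: collapse a positional morphosyntactic tag (MSD)
# to its first character, which encodes the word category.
def reduce_msd(msd: str) -> str:
    """Return the category letter of an MSD tag (empty string for empty input)."""
    return msd[:1]


# Illustrative tags written in the MULTEXT-East positional style (assumed).
rich_tags = ["Ncmsn", "Vmr3s", "Agpfsny"]

# The reduced representation drops gender, number, case, person, etc.
reduced_tags = [reduce_msd(tag) for tag in rich_tags]
print(reduced_tags)
```

The reduced tags keep only the category, so distinctions such as case and number, which the abstract argues are useful for parsing, are no longer visible to the parser.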
hr500k – A Reference Training Corpus of Croatian.
In this paper we present hr500k, a Croatian reference training corpus of 500 thousand tokens, segmented at document, sentence and word level, and annotated for morphosyntax, lemmas, dependency syntax, named entities, and semantic roles. We present each annotation layer via basic label statistics and describe the final encoding of the resource in CoNLL and TEI formats. We also describe the rather turbulent history of the resource and give insights into the topic and genre distribution in the corpus. Finally, we discuss further enrichments of the corpus with additional layers, which are already underway.
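As a rough illustration of what a CoNLL-style encoding looks like to a consumer, the sketch below reads tab-separated token lines into sentences. It assumes the ten-column CoNLL-U layout; the abstract only says "CoNLL", so the exact columns used by hr500k may differ. The function `read_conll` and the toy sentence are made up for this example and do not come from the corpus.

```python
from typing import Dict, List

# Column names follow the CoNLL-U convention (an assumption, not from the paper).
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]


def read_conll(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-style text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Comment or metadata line: skipped in this sketch.
            continue
        else:
            current.append(dict(zip(COLUMNS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


# Toy sentence invented for illustration; not taken from hr500k.
SAMPLE = (
    "# text = Pas spava.\n"
    "1\tPas\tpas\tNOUN\tNcmsn\t_\t2\tnsubj\t_\t_\n"
    "2\tspava\tspavati\tVERB\tVmr3s\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\tZ\t_\t2\tpunct\t_\t_\n"
)
sentences = read_conll(SAMPLE)
```

Each token becomes a dictionary keyed by column name, so the lemma, morphosyntactic tag and dependency head can be looked up directly, for example `sentences[0][1]["lemma"]`.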
Overview of the SPMRL 2013 shared task: cross-framework evaluation of parsing morphologically rich languages
This paper reports on the first shared task on statistical parsing of morphologically rich languages (MRLs). The task features data sets from nine languages, each available both in constituency and dependency annotation. We report on the preparation of the data sets, on the proposed parsing scenarios, and on the evaluation metrics for parsing MRLs given different representation types. We present and analyze parsing results obtained by the task participants, and then provide an analysis and comparison of the parsers across languages and frameworks, reported for gold input as well as more realistic parsing scenarios.
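The abstract does not list its evaluation metrics. As a generic illustration of how dependency output is commonly scored, the sketch below computes unlabeled and labeled attachment scores (UAS and LAS) for one sentence. The function name, the arc representation and the toy data are assumptions made for this example, not the shared task's evaluation code.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for token-aligned gold and predicted arcs."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS counts tokens whose head is correct, regardless of label.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens whose head and label are both correct.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return uas_hits / len(gold), las_hits / len(gold)


# Toy four-token sentence, invented for illustration.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
print(uas, las)  # 0.75 0.5
```

The third token has the right head but the wrong label, so it counts towards UAS only; the fourth token has the wrong head and counts towards neither score.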
ANNOTATED DISJUNCT FOR MACHINE TRANSLATION
Most information found on the Internet is available in English. However, most people in the world are not English speakers. Hence, a reliable Machine Translation (MT) tool would be of great advantage to them. There are many approaches to developing MT systems, among them direct, rule-based/transfer, interlingua, and statistical approaches. This thesis focuses on developing an MT system for less-resourced languages, i.e. languages that lack an available grammar formalism, parser, and corpus, such as some languages in South East Asia. The nonexistence of bilingual corpora motivates us to use direct or transfer approaches. Moreover, the unavailability of a grammar formalism and parser for the target languages motivates us to develop a hybrid between the direct and transfer approaches, referred to as a hybrid transfer approach. This approach uses the Annotated Disjunct (ADJ) method. This method, based on the Link Grammar (LG) formalism, can theoretically handle one-to-one, many-to-one, and many-to-many word translations. It consists of a transfer rules module which maps source words in a source sentence (SS) to target words in the correct position in a target sentence (TS). The developed transfer rules are demonstrated on English → Indonesian translation tasks. An experimental evaluation is conducted to measure the performance of the developed system against available English-Indonesian MT systems. The developed ADJ-based MT system translated simple, compound, and complex English sentences in the present, present continuous, present perfect, past, past perfect, and future tenses with better precision than other systems, with an accuracy of 71.17% in the Subjective Sentence Error Rate metric.
Formal Linguistic Models and Knowledge Processing. A Structuralist Approach to Rule-Based Ontology Learning and Population
2013 - 2014. The main aim of this research is to propose a structuralist approach to knowledge processing by means of ontology learning and population, starting from unstructured and structured texts. The suggested method includes distributional semantic approaches and NL formalization theories, in order to develop a framework which relies upon deep linguistic analysis... [edited by author] XIII n.s.
Extensible Dependency Grammar: a modular grammar formalism based on multigraph description
This thesis develops Extensible Dependency Grammar (XDG), a new grammar formalism combining dependency grammar, model-theoretic syntax, and Jackendoff's parallel grammar architecture. The design of XDG is strongly geared towards modularity: grammars can be modularly extended by any linguistic aspect such as grammatical functions, word order, predicate-argument structure, scope, information structure and prosody, where each aspect is modeled largely independently on a separate dimension. The intersective demands of the dimensions make many complex linguistic phenomena such as extraction in syntax, scope ambiguities in the semantics, and control and raising in the syntax-semantics interface simply fall out as by-products without further stipulation. This thesis makes three main contributions: 1. The first formalization of XDG as a multigraph description language in higher order logic, and investigations of its expressivity and computational complexity. 2. The first implementation of XDG, the XDG Development Kit (XDK), an extensive grammar development environment built around a constraint parser for XDG. 3. The first application of XDG to natural language, modularly modeling a fragment of English.