Marrying Universal Dependencies and Universal Morphology
The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects
each present schemata for annotating the morphosyntactic details of language.
Each project also provides corpora of annotated text in many languages - UD at
the token level and UniMorph at the type level. As each corpus is built by
different annotators, language-specific decisions hinder the goal of universal
schemata. With compatibility of tags, each project's annotations could be used
to validate the other's. Additionally, the availability of both type- and
token-level resources would be a boon to tasks such as parsing and homograph
disambiguation. To ease this interoperability, we present a deterministic
mapping from Universal Dependencies v2 features into the UniMorph schema. We
validate our approach by lookup in the UniMorph corpora and find a
macro-average of 64.13% recall. We also note incompatibilities due to paucity
of data on either side. Finally, we present a critical evaluation of the
foundations, strengths, and weaknesses of the two annotation projects. Comment: UDW1
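To make the idea of a deterministic feature mapping concrete, here is a minimal sketch in Python. It is not the authors' mapping: the tables `UD_TO_UNIMORPH` and `UPOS_TO_UNIMORPH` are hypothetical and cover only a handful of well-known correspondences, and the recall helper assumes that each language contributes a pair of counts (converted tokens found in the type-level resource, tokens looked up), which is one plausible reading of "macro-average recall", not a definition taken from the abstract.

```python
# Hypothetical, heavily abbreviated lookup tables (an assumption for
# illustration; a real mapping covers far more features and special cases).
UD_TO_UNIMORPH = {
    ("Number", "Sing"): "SG",
    ("Number", "Plur"): "PL",
    ("Case", "Nom"): "NOM",
    ("Case", "Acc"): "ACC",
    ("Tense", "Past"): "PST",
    ("Tense", "Pres"): "PRS",
}
UPOS_TO_UNIMORPH = {"NOUN": "N", "VERB": "V", "ADJ": "ADJ"}


def convert(upos, feats):
    """Map a UD part-of-speech tag and FEATS string to a set of UniMorph-style tags.

    `feats` uses the CoNLL-U convention: "Name=Value" pairs joined by "|",
    or "_" when the token has no features. Unknown features are skipped.
    """
    tags = set()
    if upos in UPOS_TO_UNIMORPH:
        tags.add(UPOS_TO_UNIMORPH[upos])
    if feats and feats != "_":
        for pair in feats.split("|"):
            name, _, value = pair.partition("=")
            tag = UD_TO_UNIMORPH.get((name, value))
            if tag is not None:
                tags.add(tag)
    return frozenset(tags)


def macro_average_recall(per_language):
    """Average per-language recall, giving every language equal weight.

    `per_language` maps a language name to a (hits, total) pair.
    Languages with total == 0 are ignored.
    """
    recalls = [hits / total for hits, total in per_language.values() if total]
    return sum(recalls) / len(recalls) if recalls else 0.0
```

Because the conversion is a pure table lookup, the same input always yields the same tag set, which is what makes validation by lookup against a type-level resource possible. The macro average weights small and large languages equally, so a few low-resource languages can pull the figure down noticeably.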
Automated Implementation Process of Machine Translation System for Related Languages
The paper presents an attempt to automate all data creation processes of a rule-based shallow-transfer machine translation system. The presented methods were tested on four fully functional translation systems covering the language pairs Slovenian-Serbian, Slovenian-Czech, Slovenian-English and Slovenian-Estonian. An extensive range of evaluation tests was performed to assess the applicability of the methods.
Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan languages
Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan Languages publishes 17 papers that were presented at the conference organised in Dubrovnik, Croatia, 4-6 October 2010.
A Computational Lexicon and Representational Model for Arabic Multiword Expressions
The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these many areas has emphasised the primary role of MWEs in the process of analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high precision outputs. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language resource building, and lexical representations.
This research aims to remedy this deficiency by extending knowledge of AMWEs and making noteworthy contributions to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of an adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, in the use of AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, this thesis presents a representative system for AMWEs which consists of multilayer encoding of extensive linguistic descriptions.
This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in standard Arabic. The implications of this research are related to the vital role of the AMWE lexicon, as a new lexical resource, in the improvement of various ANLP tasks and the potential opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena.
Evaluation of a Substitution Method for Idiom Transformation in Statistical Machine Translation
We evaluate a substitution-based technique for improving Statistical Machine Translation performance on idiomatic multiword expressions. The method substitutes each idiom in the source sentence with its literal meaning before translation; a second substitution step after translation replaces literal meanings in the output with idioms. We detail our approach, outline our implementation and provide an evaluation of the method for the language pair English/Brazilian-Portuguese. Our results show improvements in translation accuracy on sentences containing either morphosyntactically constrained or unconstrained idioms. We discuss the consequences of our results and outline potential extensions to this process.
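The two substitution steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the dictionaries `IDIOM_TO_LITERAL` and `LITERAL_TO_TARGET_IDIOM` are hypothetical single-entry examples, the `translate` argument stands in for any translation system, and the sketch matches exact strings only, so it ignores the morphosyntactic variation that the evaluation addresses.

```python
import re

# Hypothetical example entries (assumptions for illustration only).
IDIOM_TO_LITERAL = {"kick the bucket": "die"}             # source side (English)
LITERAL_TO_TARGET_IDIOM = {"morrer": "bater as botas"}    # target side (Portuguese)


def substitute(text, table):
    """Replace every whole-phrase occurrence of a key in `table` with its value."""
    for phrase, replacement in table.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        # A function replacement avoids backslash interpretation in the value.
        text = re.sub(pattern, lambda m, r=replacement: r, text, flags=re.IGNORECASE)
    return text


def translate_with_idiom_substitution(sentence, translate):
    """Apply idiom -> literal, then translate, then literal -> idiom."""
    literal_source = substitute(sentence, IDIOM_TO_LITERAL)
    literal_target = translate(literal_source)
    return substitute(literal_target, LITERAL_TO_TARGET_IDIOM)


def toy_translate(sentence):
    """Stand-in for a real MT system: word-by-word lookup in a tiny lexicon."""
    lexicon = {"he": "ele", "will": "vai", "die": "morrer"}
    return " ".join(lexicon.get(word, word) for word in sentence.lower().split())
```

Calling `translate_with_idiom_substitution("He will kick the bucket", toy_translate)` first rewrites the sentence to "He will die", passes that to the stand-in translator, and then restores an idiom on the target side. A practical system would also need to handle inflected forms of both the idiom and its literal paraphrase.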