Proceedings
Proceedings of the NODALIDA 2011 Workshop
Constraint Grammar Applications.
Editors: Eckhard Bick, Kristin Hagen, Kaili Müürisep, Trond Trosterud.
NEALT Proceedings Series, Vol. 14 (2011), vi+69 pp.
© 2011 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/19231
English vs. Esperanto: A comparative study of clausal word order in a Minimalist framework
Both English and Esperanto are international auxiliary languages, but English is deemed an SVO language with rigid word order, while Esperanto, although considered predominantly SVO, allows for relatively free constituent order according to some scholars. The goal of this thesis is to determine whether this is the case and to identify whether this difference in constituent-order leniency can be attributed to parametric differences between English and Esperanto. To answer this, the thesis seeks to uncover the underlying syntactic structure of Esperanto in transitive constructions and compare it to the syntactic structure of English.
This thesis studies the order of the subject, object, and verb in both main and embedded clause types to identify potential parametric differences and analyse the patterns through the Minimalist framework, and the Principles and Parameters model.
To identify which transitive word order patterns are common in English and Esperanto, corpus studies were conducted for both languages to determine which word order patterns were used and how often they occurred. The English data were retrieved from the Georgetown University Multilayer corpus, while Arbobanko was used for the Esperanto data. In addition to the corpus study, a survey was conducted for the Esperanto data to test the acceptability of each word order.
My data reflect less word order variety in Esperanto than a previous study conducted by Gledhill (2000). My data do, however, reflect a greater word order variety in Esperanto than in English, as stated by other scholars. These differences in word order patterns between the two languages could not, however, be accounted for by significant parametric differences. Instead, a greater variation in non-obligatory constituent movements
Universal Discourse Representation Structure Parsing
We consider the task of crosslingual semantic parsing in the style of Discourse Representation Theory (DRT) where knowledge from annotated corpora in a resource-rich language is transferred via bitext to guide learning in other languages. We introduce Universal Discourse Representation Theory (UDRT), a variant of DRT that explicitly anchors semantic representations to tokens in the linguistic input. We develop a semantic parsing framework based on the Transformer architecture and utilize it to obtain semantic resources in multiple languages following two learning schemes. The many-to-one approach translates non-English text to English, and then runs a relatively accurate English parser on the translated text, while the one-to-many approach translates gold standard English to non-English text and trains multiple parsers (one per language) on the translations. Experimental results on the Parallel Meaning Bank show that our proposal outperforms strong baselines by a wide margin and can be used to construct (silver-standard) meaning banks for 99 languages
Language Processing and the Artificial Mind: Teaching Code Literacy in the Humanities
Humanities majors often find themselves in jobs where they either manage programmers or work with them in close collaboration. These interactions often pose difficulties because specialists in literature, history, philosophy, and so on are not usually code literate. They do not understand what tasks computers are best suited to, or how programmers solve problems. Learning code literacy would be a great benefit to humanities majors, but the traditional computer science curriculum is heavily math oriented, and students outside of science and technology majors are often math averse. Yet they are often interested in language, linguistics, and science fiction. This thesis is a case study to explore whether computational linguistics and artificial intelligence provide a suitable setting for teaching basic code literacy. I researched, designed, and taught a course called “Language Processing and the Artificial Mind.” Instead of math, it focuses on language processing, artificial intelligence, and the formidable challenges that programmers face when trying to create machines that understand natural language. This thesis is a detailed description of the material, how the material was chosen, and the outcome for student learning. Student performance on exams indicates that students learned code literacy basics and important linguistics issues in natural language processing. An exit survey indicates that students found the course to be valuable, though a minority reacted negatively to the material on programming. Future studies should explore teaching code literacy with less programming and new ways to make coding more interesting to the target audience
Cross-Lingual Link Discovery for Under-Resourced Languages
CC BY-NC 4.0
In this paper, we provide an overview of current technologies for cross-lingual link discovery, and we discuss challenges,
experiences and prospects of their application to under-resourced languages. We first introduce the goals of cross-lingual
linking and associated technologies, and in particular, the role that the Linked Data paradigm (Bizer et al., 2011) applied
to language data can play in this context. We define under-resourced languages with a specific focus on languages actively
used on the internet, i.e., languages with a digitally versatile speaker community, but limited support in terms of language
technology. We argue that for languages for which considerable amounts of textual data and (at least) a bilingual word list are
available, techniques for cross-lingual linking can be readily applied, and that these enable the implementation of downstream
applications for under-resourced languages via the localisation and adaptation of existing technologies and resources
Semantic chunking
Long sentences pose a challenge for natural language processing (NLP) applications. They are associated with a complex information structure leading to increased requirements for processing resources. Although the issue is present in many areas of research, there is little uniformity in the solutions used by research communities dedicated to individual NLP applications. Different aspects of the problem are addressed by different tasks, such as sentence simplification or shallow chunking.
The main contribution of this thesis is the introduction of the task of semantic chunking as a general approach to reducing the cost of processing long sentences. The goal of semantic chunking is to find semantically contained fragments of a sentence representation that can be processed independently and recombined without loss of information. We anchor its principles in established concepts of semantic theory, in particular event and situation semantics. Most of the experiments in this thesis focus on semantic chunking defined on complex semantic representations in Dependency Minimal Recursion Semantics (DMRS), but we also demonstrate that the task can be performed on sentence strings. We present three chunking models: a) a rule-based proof-of-concept DMRS chunking system; b) a semi-supervised sequence labelling neural model for surface semantic chunking; c) a system capable of finding semantic chunk boundaries based on the inherent structure of DMRS graphs, generalisable in the form of descriptive templates. We show how semantic chunking can be applied within a divide-and-conquer processing paradigm, using as an example the task of realization from DMRS. The application of semantic chunking yields noticeable efficiency gains without decreasing the quality of results
NLP for Language Varieties of Italy: Challenges and the Path Forward
Italy is characterized by a one-of-a-kind linguistic diversity landscape in
Europe, which implicitly encodes local knowledge, cultural traditions, artistic
expression, and history of its speakers. However, over 30 language varieties in
Italy are at risk of disappearing within a few generations. Language technology
has a major role in preserving endangered languages, but it currently struggles
with such varieties as they are under-resourced and mostly lack standardized
orthography, being mainly used in spoken settings. In this paper, we introduce
the linguistic context of Italy and discuss challenges facing the development
of NLP technologies for Italy's language varieties. We provide potential
directions and advocate for a shift in the paradigm from machine-centric to
speaker-centric NLP. Finally, we propose building a local community towards
responsible, participatory development of speech and language technologies for
languages and dialects of Italy.
Comment: 16 pages, 3 figures, 4 tables