329 research outputs found
Towards a machine-learning architecture for lexical functional grammar parsing
Data-driven grammar induction aims at producing wide-coverage grammars of human languages. Initial efforts in this field produced relatively shallow linguistic representations such as phrase-structure trees, which only encode constituent structure. Recent work on inducing deep grammars from treebanks addresses this shortcoming by also recovering non-local dependencies and grammatical relations. My aim is to investigate the issues arising when adapting an existing Lexical Functional Grammar (LFG) induction method to a new language and treebank, and find solutions which will generalize robustly across multiple languages.
The research hypothesis is that by exploiting machine-learning algorithms to learn morphological features, lemmatization classes and grammatical functions from treebanks we can reduce the amount of manual specification and improve robustness, accuracy, and domain- and language-independence for LFG parsing systems. Function labels can often be mapped relatively straightforwardly to LFG grammatical functions. Learning them reliably permits grammar induction to depend less on language-specific LFG annotation rules. I therefore propose ways to improve acquisition of function labels from treebanks and translate those improvements into better-quality f-structure parsing.
In a lexicalized grammatical formalism such as LFG a large amount of syntactically relevant information comes from lexical entries. It is, therefore, important to be able to perform morphological analysis in an accurate and robust way for morphologically rich languages. I propose a fully data-driven supervised method to simultaneously lemmatize and morphologically analyze text and obtain competitive or improved results on a range of typologically diverse languages.
Deep machine learning for syntactic annotation projection
The purpose of this thesis was to explore cross-lingual transfer learning for dependency parsing, with the goal of enabling syntactic analysis for low-resource languages. The best approaches involve annotation projection: the transfer of dependency structures via parallel texts, from resource-rich to low-resource languages. The first chapter describes the basic concepts of part-of-speech tagging and dependency parsing, as well as how syntactic dependencies are annotated. The second chapter describes the first approach to the annotation projection problem, which is based on the algorithm presented in the paper Multilingual Projection for Parsing Truly Low-Resource Languages [10]. We propose adjustments to the existing algorithm that improve the results. The third chapter presents the idea of using neural networks for annotation projection, along with several ideas for extending this work in the future.
"Algún" indefinite is not bound by adverbs of quantification
Some indefinites cannot be bound by adverbs of quantification or the generic operator. I argue that this datum follows from the internal syntax of indefinites: only those indefinites consisting of a minimal structure can be bound; bigger indefinites cannot. I present evidence from Spanish, Russian and English to support this claim. Two theoretical consequences follow. The first one is about wh-dependencies: I argue that wh-phrases cannot be regarded as noun phrases with an extra [wh] feature, but rather as very small indefinites without additional features. The second one involves exceptional scope: choice function approaches seem to run into a paradox that alternative approaches, such as Schwarzschild's Singleton Indefinite approach, avoid. I also argue that an alternative semantic approach to binding resistance yields no fruit. Finally, I show that only small indefinites can be used as predicates, thus bolstering the approach taken in these pages.
West Flemish verb-based discourse markers and the articulation of the Speech Act layer
This paper focuses on the West Flemish discourse markers located at the edge of the clause. After a brief survey of the distribution of discourse markers in WF, the paper proposes a syntactic analysis of the discourse markers ne and we. Based on the distribution of these discourse markers, of vocatives and of dislocated DPs, an articulated speech act layer is elaborated which corroborates the proposals in Hill (). It is postulated that there is a syntactic relation between particles used as discourse markers and vocatives. The paper offers further support for the grammaticalization of pragmatic features at the interface between syntax and discourse and for the hypothesis that the relevant computation at the interface is of the same nature as that in Narrow Syntax.
Tree Adjoining Grammar at the Interfaces
This thesis constitutes an exploration of the applications of tree adjoining grammar (TAG) to natural language syntax. Perhaps more than any of its major competitors such as HPSG and LFG, however, TAG has never strayed too far from the guiding principles of generative syntax. Indeed, following the pioneering work of Frank (2004), TAG has been successfully incorporated into Chomsky’s (1995) Minimalist Program (MP). In large part, however, Frank (2004) leaves unexplored the issue of how TAG applies at the PF and LF interfaces. Given the fundamental importance of interfaces within the MP, no minimalist syntactic theory is complete without at least some notion of the means by which syntactic structure relates to pronunciation and interpretation. In this thesis we attempt to provide insight on this very issue: we address how TAG interfaces with the articulatory and interpretive components of the language faculty, and what insights it provides to minimalist conceptions of these interfaces. Ultimately, our aim is both to reaffirm the viability of TAG as a minimalist syntactic theory as well as to demonstrate that TAG makes clear otherwise arcane facts in natural language syntax. The central proposal of this thesis is twofold. First, TAG may be naturally extended to interface with the articulatory and interpretive components of the language faculty by making recourse to synchronous TAG (STAG). Second, once such a framework has been adopted, minimalist ideas regarding the interaction between syntax and linear order can be applied to deal with certain problematic examples in the TAG framework. TAG thus offers confirmation that in at least some cases, certain aspects of linear order are dependent on post-syntactic operations, so that syntax does not always wholly determine linear order. 
As a corollary of our proposal, we also demonstrate, through a case study in Niuean raising, that the TAG system makes clear predictions on phenomena that are difficult to describe in mainstream minimalist theories. Our argumentation for these proposals proceeds in three major stages. First, we formalize the synchronous TAG system that has to date been applied in a mostly piecemeal way by various researchers (see Shieber & Nesson 2006, Frank & Storoshenko 2012 for some examples). As a part of this formalization, we argue that the derivation of the LF object, but not the PF object, should make recourse to a more expressive version of the TAG system: multicomponent TAG, a variant that relaxes some constraints on the primitive units in the TAG system to yield greater expressive power. Second, we argue that the STAG system lends credence to the view that at least some word order is determined post-syntactically. In the past, researchers have presented ad hoc extensions of the expressive power of TAG to handle various difficult examples such as subject-to-subject raising in English questions and Irish and Welsh main clauses. We demonstrate that these extensions are both theoretically suspect and ultimately unnecessary given minimalist notions of the derivation: for many of the data motivating these extensions, there is independent evidence that their derivation in fact relies on post-syntactic rearrangements of certain verbal heads. Such examples are therefore well within the generative capacity of a framework with a TAG-based syntactic component that allows certain specific and well motivated post-syntactic rearrangements. Third, we demonstrate that not only is our particular system well motivated within the theoretical bounds of the MP, but also that it makes surprising and accurate empirical predictions in cases that have otherwise defied analysis.
Specifically, the Austronesian language Niuean features a peculiar instance of raising that has defied a satisfactory analysis since its discovery by Seiter (1980, 1983). We show that TAG makes the clear prediction that there is no raising in Niuean, then argue that this prediction is borne out under a careful examination of the facts. Given that the framework was developed almost exclusively based on the Indo-European language family, its ability to capture confounding behavior in a typologically dissimilar Austronesian language is a strong confirmation of its status as a reasonable alternative to mainstream minimalist syntactic theories.
MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African languages
In this paper, we present MasakhaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the Universal Dependencies (UD) guidelines. We conducted extensive POS baseline experiments using both conditional random fields and several multilingual pre-trained language models. We applied various cross-lingual transfer models trained with data available in UD. Evaluating on the MasakhaPOS dataset, we show that choosing the best transfer language(s) in both single-source and multi-source setups greatly improves the POS tagging performance of the target languages, in particular when combined with parameter-efficient fine-tuning methods. Crucially, transferring knowledge from a language that matches the language family and morphosyntactic properties seems to be more effective for POS tagging in unseen languages.
Focus and Focus Structures in the Romance Languages
The archived version is a draft of a chapter/article that has been accepted for publication by Oxford University Press in the Oxford Research Encyclopedia of Linguistics. Peer reviewed.
Copy theory in wh-in-situ languages: Sluicing in Hindi-Urdu
Hindi-Urdu is known to be one of the wh-in-situ languages exhibiting a sluicing-like construction. Although many have proposed alternative accounts of such strings in wh-in-situ languages (e.g. Kizu 1997, Toosarvandani 2009, Gribanova 2011, Hankamer 2010), I argue that apparent sluicing in Hindi-Urdu can be analyzed in a manner consistent with the notion that the syntax of a sluice is the syntax of a regular wh-question (Ross 1969, Merchant 2001). Assuming the copy theory of movement (Chomsky & Lasnik 1993, Chomsky 1993, i.a.), we can understand sluicing in Hindi-Urdu as an exceptional instance of the pronunciation of the top copy in a wh-chain, correctly predicting that Hindi-Urdu sluiced structures have properties similar to genuine sluices in languages like English. This article pursues a continued refinement in the implementation of copy theory in wh-in-situ languages and, importantly, contributes to the current line of work investigating intra-linguistic variation among wh-in-situ languages and the ways in which constellations of properties of wh-dependencies and ellipsis processes in these languages are best understood.