987 research outputs found

    Inquiries into the lexicon-syntax relations in Basque

    Index:- Foreword. B. Oyharçabal.- Morphosyntactic disambiguation and shallow parsing in computational processing in Basque. I. Aduriz, A. Díaz de Ilarraza.- The transitivity of borrowed verbs in Basque: an outline. X. Alberdi.- Patrixa: a unification-based parser for Basque and its application to the automatic analysis of verbs. I. Aldezabal, M. J. Aranzabe, A. Atutxa, K. Gojenola, K. Sarasola.- Learning argument/adjunct distinction for Basque. I. Aldezabal, M. J. Aranzabe, K. Gojenola, K. Sarasola, A. Atutxa.- Analyzing verbal subcategorization aimed at its computational application. I. Aldezabal, P. Goenaga.- Automatic extraction of verb patterns from “hauta-lanerako euskal hiztegia”. J. M. Arriola, X. Artola, A. Soroa.- The case of an enlightening, provoking and admirable Basque derivational suffix with implications for the theory of argument structure. X. Artiagoitia.- Verb-deriving processes in Basque. J. C. Odriozola.- Lexical causatives and causative alternation in Basque. B. Oyharçabal.- Causation and semantic control; diagnosis of incorrect use in minorized languages. I. Zabala.- Subject index.- Contributions.

    On Spanish Prepositional Prefixes and the Cartography of Prepositions

    Despite its potential appeal, the possibility of analyzing prefixes as prepositions (and thus as syntactic objects) faces several problems related to selection, headedness and semantic isomorphism. In this article, I try to understand and solve these problems. I focus on prefixed nouns, and more specifically on the fact that some of them have their bases interpreted as grounds (precoma, 'something before a coma'), while others have them interpreted as figures (pre-cognition, 'cognition before something'). I propose that in the structures where the base can be interpreted as figure or ground, the prefix is a very low prepositional modifier of the noun, and the two readings depend on the interpretation of a pronominal category introduced by the preposition. This configuration is forced by the absence of a functional category from the preposition's structure; when independent conditions force this functional category to be present, the figure reading is impossible and the prefix behaves as a preposition.
    Although initially attractive, the analysis of prefixes as prepositions (and thus as syntactic objects) raises several problems related to selection, the nature of the head and semantic isomorphism. In this article we address these problems. We concentrate on prefixed nouns, and more precisely on the fact that some of them interpret the base as ground (precoma), while others interpret it as figure (precognició). We propose that in the structures where the base can be interpreted as ground or as figure, the prefix is a PP introduced as a modifier that takes a pronominal category as its complement, whereas in the cases where the figure reading is obligatory, the prefix is a pP with the syntactic effects of a full preposition. The second configuration is obligatory under certain conditions, which are explored in detail in the article.

    Overview of the SPMRL 2013 shared task: cross-framework evaluation of parsing morphologically rich languages

    This paper reports on the first shared task on statistical parsing of morphologically rich languages (MRLs). The task features data sets from nine languages, each available both in constituency and dependency annotation. We report on the preparation of the data sets, on the proposed parsing scenarios, and on the evaluation metrics for parsing MRLs given different representation types. We present and analyze parsing results obtained by the task participants, and then provide an analysis and comparison of the parsers across languages and frameworks, reported for gold input as well as more realistic parsing scenarios

    Overview of the SPMRL 2013 Shared Task: A Cross-Framework Evaluation of Parsing Morphologically Rich Languages

    This paper reports on the first shared task on statistical parsing of morphologically rich languages (MRLs). The task features data sets from nine languages, each available both in constituency and dependency annotation. We report on the preparation of the data sets, on the proposed parsing scenarios, and on the evaluation metrics for parsing MRLs given different representation types. We present and analyze parsing results obtained by the task participants, and then provide an analysis and comparison of the parsers across languages and frameworks, reported for gold input as well as more realistic parsing scenarios.


    Proceedings of the NODALIDA 2011 Workshop Constraint Grammar Applications. Editors: Eckhard Bick, Kristin Hagen, Kaili Müürisep, Trond Trosterud. NEALT Proceedings Series, Vol. 14 (2011), vi+69 pp. © 2011 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/19231

    Remarks on Modern Standard Arabic Construct State and Quantification

    This thesis investigates the interpretations of genitive and quantificational forms that Modern Standard Arabic (MSA) unifies under a complex DP, namely the Construct State (CS). Despite the linguistic differences between these phenomena, the PF form of this structure neutralizes all of these types and their sub-types into a head-complement form (possessum-possessor or quantifier-domain restriction), where the definiteness of the whole structure is recovered from the complement, which is overtly marked for this value. However, the internal syntactic and semantic components that give the CS its various interpretations, such as the source of relations and the definiteness value of the whole structure, are always concealed at PF. This neutralization makes it hard to see the differences between CS types as well as the causes of their various semantic ambiguities. This project analyzes the Nominal and Quantificational CSs of MSA to uncover the hidden syntactic and semantic factors that distinguish their semantic contributions. To approach these two forms, the thesis consists of four main discussion chapters. Two of these chapters (2 and 3) are devoted to genitive nominals and their syntactic and semantic aspects. Chapter 2 looks at (in)definiteness: its marking, agreement (inheritance), and interpretation on either component at LF. In this chapter, I argue that the Nominal-CS D head inherits its covert definiteness feature specification from its complement, whose definiteness is overtly marked. This inheritance takes place at the syntactic level via the operation of syntactic agreement (following the framework of Pesetsky and Torrego, 2007), which feeds the semantic interpretation of this form, apart from some exceptional cases of this inheritance. Chapter 3 investigates the semantic ambiguities of a nominal CS. One type of ambiguity distinguishes a possessive from a modificational CS based on the relation between the head and the complement.
Following Borer (2009), these interpretations are caused by the referentiality of the complement, which is associated with its syntactic category: a referential DP for the possessive type and a non-referential NP for the modificational type. Another ambiguity is caused by the relation between the nominals in these types, which is contributed by a Relator Phrase (RP) projection (cf. Den Dikken, 2006 and Ouhalla, 2011). The head of this projection denotes a free variable over contextual relations (possessiveness, agent, control, or other pragmatic relations), or its relation can be contributed lexically by the head noun when that noun is semantically relational. However, the lexical relation may or may not feed the RP projection, depending on the context. The quantificational side of the investigation focuses on quantificational determiners and their domain restriction (DR) nouns, which form the quantificational construct state (QCS), with some additional notes on scope-taking ambiguities. Chapter 4 approaches the quantifiers kul: “every/each or all”, dʒami:ʔ “all”, muʕðˤam “most” and baʕdˤ “some”, and their DR nouns in the CS. All of these quantifiers are restricted by definite plural DPs without a partitive preposition, except for kul: in its distributive interpretation, which has to be restricted by an indefinite bare noun. Regarding these issues, this chapter argues that the quantifiers of Arabic are not syntactic determiners, since they are marked for (in)definiteness overtly in non-CS structures or covertly in the QCS. The account proposed for the quantifiers with a definite DR is that they are partitive quantifiers whose partitive relation is established by a null PartP (partitive phrase) (cf. Fehri, 2018). PartP allows them to quantify over parts of the individual sum denoted by their definite plural DR noun.
On the other hand, the inherited definiteness on the quantifiers is semantically vacuous, since the domain of quantification is restricted by the definiteness of the DR noun. For the distributive interpretation of the universal kul: “every/each”, the DR is a bare NP whose number contributes the (non)atomic granularity for distributivity; it is not categorized as indefinite, since this language lacks an indefinite determiner. The following chapter shifts the discussion toward scope taking, examining the possibility of covert inverse scope and inverse linking readings at LF in SVO and VSO word orders. For inverse scope at the clause level, the analysis in this chapter suggests that scope is fluid in VSO order, while SVO order shows some exceptions. The subject of SVO occurs in the left periphery as a topic or focus (Soltan, 2007; Albuhayri, 2019), a boundary that QR does not cross (cf. May, 1977, 1985; Heim and Kratzer, 1998). Specifically, a clitic-left-dislocated topic can freeze scope by retaining the wide-scope interpretation, while a focused subject can show scope ambiguity because, as an element moved to the left periphery, it is able to reconstruct. Regarding scope linking within DP, MSA allows this type of QR movement at LF, but the left-periphery boundary is still respected.

    Parsing and Evaluation. Improving Dependency Grammars Accuracy. Anàlisi Sintàctica Automàtica i Avaluació. Millora de qualitat per a Gramàtiques de Dependències

    Because parsers are still limited in analysing specific ambiguous constructions, the research presented in this thesis aims mainly to improve parsing performance by integrating knowledge that helps deal with ambiguous linguistic phenomena. More precisely, this thesis intends to provide empirical solutions to the disambiguation of prepositional phrase attachment and argument recognition in order to assist parsers in generating a more accurate syntactic analysis. The disambiguation of these two highly ambiguous linguistic phenomena by the integration of knowledge about the language necessarily relies on linguistic and statistical strategies for knowledge acquisition. The starting point of this research is the development of a rule-based grammar for Spanish and for Catalan following the theoretical basis of Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988) in order to carry out two experiments on the integration of automatically acquired knowledge. In order to build two robust grammars that understand a sentence, the FreeLing pipeline (Padró et al., 2010) has been used as a framework. In addition, an eclectic repertoire of criteria about the nature of syntactic heads is proposed by reviewing the postulates of Generative Grammar (Chomsky, 1981; Bonet and Solà, 1986; Haegeman, 1991) and Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988). A set of dependency relations is also provided and mapped to Universal Dependencies (McDonald et al., 2013). Furthermore, an empirical evaluation method has been designed in order to carry out both a quantitative and a qualitative analysis. In particular, the dependency parsed trees generated by the grammars are compared to real linguistic data.
The quantitative evaluation is based on the Spanish Tibidabo Treebank (Marimon et al., 2014), which is large enough to carry out a real analysis of the grammars' performance and which has been annotated with the same formalism as the grammars, syntactic dependencies. Since the criteria of the two resources differ, a harmonization process has been applied, developing a set of rules that automatically adapt the criteria of the corpus to the grammar criteria. With regard to qualitative evaluation, there are no available resources to evaluate Spanish and Catalan dependency grammars qualitatively. For this reason, a test suite of syntactic phenomena about structure and word order has been built. In order to create a representative repertoire of the languages observed, descriptive grammars (Bosque and Demonte, 1999; Solà et al., 2002) and the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015) have been used for capturing relevant structures and word order patterns, respectively. Thanks to these two tools, two experiments have been carried out in order to prove that knowledge integration improves the parsing accuracy. On the one hand, the automatic learning of language models has been explored by means of statistical methods in order to disambiguate PP-attachment. More precisely, a model has been learned with a supervised classifier using Weka (Witten and Frank, 2005). Furthermore, an unsupervised model based on word embeddings has been applied (Mikolov et al., 2013a,b). The results of the experiment show that the supervised method is limited in predicting solutions for unseen data, a limitation that the unsupervised method resolves since it provides a solution for any case. However, the unsupervised method is limited if it only learns from lexical data. For this reason, training data needs to be enriched with the lexical value of the preposition, as well as semantic and syntactic features.
In addition, the number of patterns used to learn language models has to be extended in order to have an impact on the grammars. On the other hand, another experiment is carried out in order to improve argument recognition in the grammars by the acquisition of linguistic knowledge. In this experiment, knowledge is acquired automatically by extracting verb subcategorization frames from the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015), which contains the verb predicate and its arguments annotated syntactically. Based on the information extracted, subcategorization frames have been classified into subcategorization classes according to the patterns observed in the corpus. The results of integrating the subcategorization classes into the grammars prove that this information increases the accuracy of argument recognition in the grammars. The results of the research of this thesis show that the grammars' rules on their own are not expressive enough to resolve complex ambiguities. However, the integration of knowledge about these ambiguities in the grammars may be decisive in the disambiguation. On the one hand, statistical knowledge about PP-attachment can improve the grammars' accuracy, but syntactic and semantic information, and new patterns of PP-attachment, need to be included in the language models in order to help disambiguate this phenomenon. On the other hand, linguistic knowledge about verb subcategorization acquired from annotated linguistic resources shows a positive influence on the grammars' accuracy.
This thesis addresses the limitations that automatic syntactic parsers currently face.
Despite the progress made in the area of Natural Language Processing in recent years, language technologies and, in particular, automatic syntactic parsers have not been able to get past the threshold of certain structural ambiguities such as prepositional phrase attachment and argument recognition. For this reason, the research carried out in this thesis aims to bring significant quality improvements to automatic syntactic analysis by integrating linguistic and statistical knowledge to disambiguate ambiguous syntactic constructions. The starting point of the research was the development of one rule-based grammar for Spanish and another for Catalan, following the postulates of Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988), in order to carry out the experiments on automatic knowledge acquisition. To create two robust grammars that analyse and understand the sentence in depth, we relied on the architecture of FreeLing (Padró et al., 2010), a Natural Language Processing library that provides automatic linguistic analysis of the sentence. In addition, an eclectic proposal of linguistic criteria for determining the formation of phrases and clauses in the grammar was drawn up by reviewing the theoretical proposals of Generative Grammar (Chomsky, 1981; Bonet and Solà, 1986; Haegeman, 1991) and Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988). This proposal is accompanied by a list of the dependency relation labels used by the grammars' rules. Besides drawing up this list, correspondences were established with the Universal Dependencies annotation standard (McDonald et al., 2013).
At the same time, an empirical evaluation system was designed that takes both quantitative and qualitative analysis into account in order to assess the results of the experiments fully. Specifically, the task is empirical because the analyses generated by the grammars are compared with real language data. For the quantitative evaluation, the Tibidabo corpus (Marimon et al., 2014) was used; it is available only in Spanish, is large enough to build a real analysis of the grammars, and has been annotated with the same formalism as the grammars. In particular, since the criteria of the grammars and of the corpus do not coincide, a criteria harmonization process was carried out by means of manually created rules that automatically adapt the structure and dependency relation of the corpus to the grammars' criteria. As for the qualitative evaluation, since no resources are available for Spanish and Catalan, we designed a test suite of syntactic phenomena concerning structure and sentence order. To create a representative repertoire of the languages studied, descriptive grammars were used to supply the repertoire of syntactic structures (Bosque and Demonte, 1999; Solà et al., 2002), and the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015) was used to capture sentence order automatically. Thanks to these two tools, two experiments could be carried out to show that integrating knowledge into automatic syntactic analysis improves its quality. On the one hand, the learning of language models by means of statistical models was explored in order to propose solutions for prepositional phrase attachment.
More specifically, a language model was developed with a supervised learning classifier from Weka (Witten and Frank, 2005). In addition, a language model was learned with an unsupervised method based on the distributional approach known as word embeddings (Mikolov et al., 2013a,b). The results of the experiment show that the supervised method has serious limitations in giving an answer for data it has not seen before, which the unsupervised method overcomes because it is able to classify any case. Even so, the unsupervised method studied is limited if it learns from lexical data alone. For this reason, the data used to train the model need to contain the value of the preposition as well as syntactic and semantic features. Furthermore, the number of patterns learned needs to be increased in order to extend the coverage of the models and have an impact on the grammars' results. On the other hand, a way of improving argument recognition in the grammars through the acquisition of linguistic knowledge was proposed. In this experiment, knowledge was extracted automatically, in the form of verb subcategorization classes, from the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015), in which the verbal predicate and its arguments are syntactically annotated. From the extracted information, the various verbal diatheses were classified into verb subcategorization classes according to the patterns observed in the corpus. The results of integrating the subcategorization classes into the grammars show that this information has a positive effect on argument recognition. The results of the research carried out in this doctoral thesis show that the grammars' rules are not expressive enough on their own to resolve complex ambiguities of language.
Nevertheless, integrating knowledge about these ambiguities can be decisive in proposing a solution. On the one hand, statistical knowledge about prepositional phrase attachment can improve the quality of the grammars, but to confirm this, syntactic and semantic information must be included in the machine learning models and more patterns must be captured to help disambiguate complex phenomena. On the other hand, linguistic knowledge about verb subcategorization acquired from annotated linguistic resources has a decisive influence on the quality of the grammars for automatic syntactic analysis.

    Description of the Chinese-to-Spanish rule-based machine translation system developed with a hybrid combination of human annotation and statistical techniques

    Two of the most popular Machine Translation (MT) paradigms are rule-based (RBMT) and corpus-based, the latter including statistical systems (SMT). When parallel corpora are scarce, RBMT becomes particularly attractive. This is the case for the Chinese-Spanish language pair. This article presents the first RBMT system for Chinese to Spanish. We describe a hybrid method for constructing this system that takes advantage of available resources, such as parallel corpora, which are used to extract dictionaries and lexical and structural transfer rules. The final system is freely available online and open source. Although performance lags behind standard SMT systems on an in-domain test set, the results show that the RBMT system's coverage is competitive and that it outperforms the SMT system on an out-of-domain test set. This RBMT system is available to the general public, it can be further enhanced, and it opens up the possibility of creating future hybrid MT systems. Peer Reviewed. Postprint (author's final draft).

    Minimizing dependencies across languages and speakers. Evidence from Basque, Polish and Spanish and native and non-native bilinguals.

    223 p. In recent years, evidence for a general preference towards grammars reducing the linear distance between elements in a dependency has been accumulating (e.g., Futrell, Mahowald, and Gibson, 2015b; Gildea and Temperley, 2010). This cognitive bias towards dependency length minimization has been argued to result from communicative and cognitive pressures at play during language production. Although corpus evidence supporting this claim is quite broad insofar as grammaticalized structures are concerned (e.g., Futrell et al., 2015b; Liu, 2008; Temperley, 2007, among others), its validity rests on shakier foundations regarding production preferences (Stallings, MacDonald, and O'Seaghdha, 1998; Wasow, 1997; Yamashita and Chang, 2001, among others). This dissertation intends to address this gap. It examines whether dependency length minimization is an active mechanism shaping language production preferences, and explores the specific nature of this principle and its interplay with linguistic specifications and architectural properties of the human memory system. In a series of 5 cued-recall production experiments and 2 complex memory span tasks, I investigate the effect of dependency length in modulating production preferences across languages with differing grammatical properties (e.g., head position and case marking) and across speakers (e.g., natives and non-natives, and speakers with variable working memory capacity). I begin by showing that the preference for short dependencies is better accounted for by a general cognitive preference for minimizing the distance across dependents than by conceptual availability. I then show how languages as diverse as Basque, Spanish and Polish tend to choose the communicatively more efficient structures when there is more than one available alternative to express the same meaning. Crucially, I confirm that there is consistent variation regarding this tendency both across languages and across speakers.
I argue that language-specific (e.g., pluripersonal agreement) and general cognitive mechanisms (e.g., word-order-based expectations) interact with the preference towards dependency length minimization. Also, I show that the degree of communicative efficiency achieved by highly proficient and early non-native bilingual speakers is lower than that reached by their native peers. Finally, I find that the bias towards shifted orders that yield shorter dependencies correlates positively with working memory. Based on these findings, I conclude that there is strong evidence supporting the claim that dependency length minimization is a pervasive force in human language production, resulting from a general cognitive constraint towards efficient communication, and also that its strength varies depending on grammatical and individual specifications compatible with information-theoretic considerations.
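
The notion of dependency length used in this abstract, the linear distance between a word and its head, can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function name `total_dependency_length`, the head-index encoding (1-based head positions, 0 for the root) and the two toy English word orders are all hypothetical and are not taken from the dissertation or its experimental materials.

```python
from typing import List


def total_dependency_length(heads: List[int]) -> int:
    """Sum the linear distances between each word and its syntactic head.

    heads[i] is the 1-based position of the head of the word at position
    i + 1; a value of 0 marks the root, which contributes no distance.
    """
    return sum(abs((i + 1) - h) for i, h in enumerate(heads) if h != 0)


# Toy example (hypothetical annotation): two orders of the same sentence.
# Order A: "John threw out the trash"
#   John->threw, threw=root, out->threw, the->trash, trash->threw
order_a = [2, 0, 2, 5, 2]
# Order B: "John threw the trash out"
#   John->threw, threw=root, the->trash, trash->threw, out->threw
order_b = [2, 0, 4, 2, 2]

print(total_dependency_length(order_a))  # 6
print(total_dependency_length(order_b))  # 7
```

Under this encoding, order A yields the smaller total, so a preference for minimizing dependency length would favour it; with a longer object noun phrase the gap between the two orders would widen.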