132 research outputs found

    Towards a machine-learning architecture for lexical functional grammar parsing

    Data-driven grammar induction aims at producing wide-coverage grammars of human languages. Initial efforts in this field produced relatively shallow linguistic representations such as phrase-structure trees, which only encode constituent structure. Recent work on inducing deep grammars from treebanks addresses this shortcoming by also recovering non-local dependencies and grammatical relations. My aim is to investigate the issues arising when adapting an existing Lexical Functional Grammar (LFG) induction method to a new language and treebank, and to find solutions which will generalize robustly across multiple languages. The research hypothesis is that by exploiting machine-learning algorithms to learn morphological features, lemmatization classes and grammatical functions from treebanks, we can reduce the amount of manual specification and improve the robustness, accuracy, and domain and language independence of LFG parsing systems. Function labels can often be mapped fairly directly to LFG grammatical functions. Learning them reliably permits grammar induction to depend less on language-specific LFG annotation rules. I therefore propose ways to improve acquisition of function labels from treebanks and translate those improvements into better-quality f-structure parsing. In a lexicalized grammatical formalism such as LFG, a large amount of syntactically relevant information comes from lexical entries. It is, therefore, important to be able to perform morphological analysis in an accurate and robust way for morphologically rich languages. I propose a fully data-driven supervised method to simultaneously lemmatize and morphologically analyze text, and obtain competitive or improved results on a range of typologically diverse languages.

    First International Workshop on Lexical Resources

    Lexical resources are one of the main sources of linguistic information for research and applications in Natural Language Processing and related fields. In recent years advances have been achieved in both symbolic aspects of lexical resource development (lexical formalisms, rule-based tools) and statistical techniques for the acquisition and enrichment of lexical resources, both monolingual and multilingual. The latter have allowed for faster development of large-scale morphological, syntactic and/or semantic resources, for widely used as well as resource-scarce languages. Moreover, the notion of a dynamic lexicon is increasingly used to take into account the fact that the lexicon is permanently evolving. This workshop aims at sketching a broad picture of the state of the art in the domain of lexical resource modeling and development. It is also dedicated to research on the application of lexical resources for improving corpus-based studies and language processing tools, both in NLP and in other language-related fields, such as linguistics, translation studies, and didactics.

    The Typological Diversity of Morphomes: A Cross-Linguistic Study of Unnatural Morphology

    This is the first typologically oriented book-length treatment of morphomes: systematic morphological identities, usually within inflectional paradigms, that do not map onto syntactic or semantic natural classes. In the first half of the book, Borja Herce outlines the theoretical and empirical challenges associated with the identification and definition of morphomes, and surveys their links with related notions such as syncretism, homophony, segmentation, and economy, among others. He also presents the different ways in which morphomic structures in a language have been observed to emerge, change, and disappear. The second part of the book contains its core contribution: a database of 120 morphomes across 79 languages from a range of families, which are presented and analysed in detail. A range of findings emerge as a result, including the idiosyncratic nature of morphomes in the Romance languages, the existence of cross-linguistically recurrent unnatural patterns, and the preference for more natural structures even among morphomes. The database also allows further explorations of other issues such as the effect of learnability and communicative efficiency on morphological structures, and the lexical and grammatical informativity of morphs and their distribution.

    A typological approach to the morphome

    407 pp. This thesis is the first monograph on morphomes with a predominantly typological orientation. The term denotes systematic morphological structures whose paradigmatic extension does not correspond to semantic or morphosyntactic distinctions such as 'plural', 'genitive singular', etc. Chapter 1 presents and discusses the previous literature and terminological issues, and Chapter 2 clarifies questions concerning the definition and identification of morphomes in specific cases. The discussion then moves to a more empirical plane. Chapter 3 discusses the notions of 'natural class' and 'economy' and explores the relationship between morphomicity and other morphological deviations. Diachrony takes centre stage in Chapter 4, which presents and discusses the different ways in which morphomes can emerge, change, or disappear in languages. Chapter 5 is the core of the thesis and presents 110 morphomes identified by the author in languages from around the world. All of these structures are presented in detail, in many cases together with their history. On the basis of the variety observed among morphomes, a dozen independent variables have been defined around which this variation is structured. After operationalizing these variables and establishing their values for the 110 morphomes, their correlation is explored statistically. Another result derived from this synchronic database concerns the cross-linguistic recurrence of particular morphomes. Some structures that are arbitrary from a morphosyntactic or semantic point of view (SG+3PL, 1SG+3, PL+1SG, etc.) are found in independent languages, that is, languages that are neither genetically related nor in areal contact. This is a novelty with respect to the earlier literature. The thesis concludes in Chapter 6 by restating the main results of the research and exploring their implications for our knowledge of morphomes in particular and of the fields of typology and morphology in general.

    Poland, Slovenia, the World : Challenges of present-day education

    Peer-reviewed publication. The transformations of education in a changing Europe are multifaceted. One facet is the strengthening of cooperation among universities in this part of the world. This cooperation takes many forms, from joint projects and research to joint analyses, discussions and publications. This monograph is one fruit of the cooperation between Andrzej Frycz Modrzewski Cracow University and the University of Ljubljana: a collection of reflections, thoughts and polemics arising from theoretical and empirical research carried out in a joint project, "Problems and challenges of modern education", undertaken simultaneously at both universities.

    CAMling 2010


    Proceedings of the Sixth International Conference Formal Approaches to South Slavic and Balkan languages

    Proceedings of the Sixth International Conference Formal Approaches to South Slavic and Balkan Languages contains 22 papers presented at the conference held in Dubrovnik, Croatia, 25-28 September 2008.

    Unsupervised grammar induction with Combinatory Categorial Grammars

    Language is a highly structured medium for communication. An idea starts in the speaker's mind (semantics) and is transformed into a well-formed, intelligible sentence via the specific syntactic rules of a language. We aim to discover the fingerprints of this process in the choice and location of words used in the final utterance. What is unclear is how much of this latent process can be discovered from the linguistic signal alone and how much requires shared non-linguistic context, knowledge, or cues. Unsupervised grammar induction is the task of analyzing strings in a language to discover the latent syntactic structure of the language without access to labeled training data. Successes in unsupervised grammar induction shed light on the amount of syntactic structure that is discoverable from raw or part-of-speech tagged text. In this thesis, we present a state-of-the-art grammar induction system based on Combinatory Categorial Grammars. Our choice of syntactic formalism enables the first labeled evaluation of an unsupervised system. This allows us to perform an in-depth analysis of the system’s linguistic strengths and weaknesses. In order to completely eliminate reliance on any supervised systems, we also examine how performance is affected when we use induced word clusters instead of gold-standard POS tags. Finally, we perform a semantic evaluation of induced grammars, providing unique insights into future directions for unsupervised grammar induction systems.