    A Study on Learnability for Rigid Lambek Grammars

    We present the basic notions of Gold's "learnability in the limit" paradigm, first introduced in 1967, a formalization of the cognitive process by which a native speaker comes to grasp the underlying grammar of his or her native language through exposure to well-formed sentences generated by that grammar. We then present Lambek grammars, a formalism derived from categorial grammars which, although not as expressive as a full formalization of natural languages requires, is particularly well suited to implementing a natural interface between syntax and semantics. In the last part of this work, we present a learnability result for rigid Lambek grammars from structured examples.
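    To make the paradigm concrete, here is a minimal sketch of Gold-style identification in the limit, under strong simplifying assumptions: the hypothesis class is a tiny, explicitly listed family of finite string sets (Gold's results, and this work's, concern classes of grammars such as rigid Lambek grammars), and the learner conjectures by enumeration. All names and data are illustrative.

```python
# Identification by enumeration over a toy, inclusion-ordered hypothesis class.
# HYPOTHESES is a hypothetical stand-in for a class of languages; a real
# learner would conjecture grammars, not named string sets.
HYPOTHESES = {
    "L1": {"a", "ab"},
    "L2": {"a", "ab", "abb"},
    "L3": {"a", "ab", "abb", "abbb"},
}

def learner(examples):
    """Conjecture the first (smallest) language containing every example."""
    for name, lang in HYPOTHESES.items():
        if set(examples) <= lang:
            return name
    return None

# A "text" for L2: an enumeration of its sentences, repetitions allowed.
text = ["a", "ab", "ab", "abb", "a"]
for n in range(1, len(text) + 1):
    print(n, learner(text[:n]))
# Once "abb" has appeared, the conjecture is "L2" and never changes again:
# the learner has identified the target language in the limit.
```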

    The Logic of Categorial Grammars: Lecture Notes

    These lecture notes present categorial grammars as deductive systems and include detailed proofs of their main properties. The first chapter deals with Ajdukiewicz and Bar-Hillel categorial grammars (AB grammars), their relation to context-free grammars, and their learning algorithms. The second chapter is devoted to the Lambek calculus as a deductive system; the weak equivalence with context-free grammars is proved, and we define the mapping from a syntactic analysis to a higher-order logical formula that describes the semantics of the parsed sentence. The third and last chapter is about proof nets as parse structures for Lambek grammars; we show the linguistic relevance of these graphs, in particular through the study of a performance question. Although definitions, theorems, and proofs have been reformulated for pedagogical reasons, these notes contain no personal results except in the proof-net chapter.
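    As a concrete illustration of the AB reductions treated in the first chapter, here is a minimal sketch assuming a toy lexicon of our own: categories are primitives or fractions A/B (seeking a B to the right) and B\A (seeking a B to the left), and the only rules are forward and backward application.

```python
# AB-grammar derivability. Categories: a primitive is a string; the tuple
# (A, "/", B) is A/B and (B, "\\", A) is B\A.

def combine(x, y):
    """All categories obtainable from the adjacent pair x, y."""
    out = []
    if isinstance(x, tuple) and x[1] == "/" and x[2] == y:
        out.append(x[0])                       # forward:  A/B , B  => A
    if isinstance(y, tuple) and y[1] == "\\" and y[0] == x:
        out.append(y[2])                       # backward: B , B\A  => A
    return out

def derives(cats, goal="s"):
    """Try every adjacent reduction (exponential, but fine for toy inputs)."""
    if len(cats) == 1:
        return cats[0] == goal
    return any(
        derives(cats[:i] + [c] + cats[i + 2:], goal)
        for i in range(len(cats) - 1)
        for c in combine(cats[i], cats[i + 1])
    )

# Toy lexicon (an assumption): a transitive verb gets (np\s)/np.
LEX = {"John": "np", "Mary": "np", "loves": (("np", "\\", "s"), "/", "np")}
print(derives([LEX[w] for w in "John loves Mary".split()]))   # True
```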

    Learning categorial grammars

    In 1967, E. M. Gold published a paper in which the language classes of the Chomsky hierarchy were analyzed in terms of learnability, in the technical sense of identification in the limit. His results were mostly negative, and perhaps because of this his work had little impact on linguistics. In the early eighties there was renewed interest in the paradigm, mainly because of work by Angluin and Wright. Around the same time, Arikawa and his co-workers refined the paradigm by applying it to so-called Elementary Formal Systems. Using this approach, Takeshi Shinohara was able to obtain an impressive result: any class of context-sensitive grammars with a bound on its number of rules is learnable. Some linguistically motivated work on learnability also appeared from this point on, most notably Wexler & Culicover 1980 and Kanazawa 1994. The latter investigates the learnability of various classes of categorial grammar, inspired by work by Buszkowski and Penn, and raises some interesting questions. We follow up on this work by exploring complexity issues relevant to learning these classes, answering an open question from Kanazawa 1994, and applying the same kind of approach to obtain (non)learnable classes of Combinatory Categorial Grammars, Tree Adjoining Grammars, Minimalist Grammars, Generalized Quantifiers, and some variants of Lambek Grammars. We also discuss work on learning tree languages and its application to learning Dependency Grammars. Our main conclusions are:
    - formal learning theory is relevant to linguistics;
    - identification in the limit is feasible for non-trivial classes;
    - the 'Shinohara approach', i.e. placing a numerical bound on the complexity of a grammar, can lead to a learnable class, but this depends entirely on the specific nature of the formalism and the notion of complexity; we give examples of natural classes of commonly used linguistic formalisms that resist this kind of approach;
    - learning is hard work: our results indicate that learning even 'simple' classes of languages requires a lot of computational effort;
    - dealing with structure (derivation or dependency) languages instead of string languages offers a useful and promising approach to learnability in a linguistic context.
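    As a hint of how the Buszkowski and Penn style of learning underlying Kanazawa's results operates, here is a minimal sketch with toy structures and an encoding of our own (not the original notation): from functor-argument structures, assign fresh type variables top-down, then unify the types collected for each word into a rigid (one type per word) lexicon.

```python
# Rigid-grammar learning from functor-argument structures, sketched after
# Buszkowski and Penn. Categories: primitive strings, (A, "/", B) = A/B,
# (B, "\\", A) = B\A; type variables are the strings "x0", "x1", ...
import itertools

fresh = (f"x{i}" for i in itertools.count())

def assign(struct, typ, lex):
    """("fa", f, a) is forward application; ("ba", a, f) is backward."""
    if isinstance(struct, str):                # a word: record its type
        lex.setdefault(struct, []).append(typ)
    elif struct[0] == "fa":
        v = next(fresh)
        assign(struct[1], (typ, "/", v), lex)  # functor gets typ/v
        assign(struct[2], v, lex)              # argument gets the variable v
    else:
        v = next(fresh)
        assign(struct[1], v, lex)
        assign(struct[2], (v, "\\", typ), lex)

def walk(t, sub):
    while isinstance(t, str) and t in sub:
        t = sub[t]
    return t

def unify(a, b, sub):
    """First-order unification; returns an extended substitution or None."""
    a, b = walk(a, sub), walk(b, sub)
    if a == b:
        return sub
    if isinstance(a, str) and a.startswith("x"):
        return {**sub, a: b}
    if isinstance(b, str) and b.startswith("x"):
        return {**sub, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and a[1] == b[1]:
        sub = unify(a[0], b[0], sub)
        return None if sub is None else unify(a[2], b[2], sub)
    return None

def resolve(t, sub):
    t = walk(t, sub)
    if isinstance(t, tuple):
        return (resolve(t[0], sub), t[1], resolve(t[2], sub))
    return t

lex, sub = {}, {}
for s in [("ba", "John", ("fa", "loves", "Mary")),
          ("ba", "Mary", ("fa", "loves", "John"))]:
    assign(s, "s", lex)                        # every root is typed s
for types in lex.values():
    for t in types[1:]:
        sub = unify(types[0], t, sub)
        assert sub is not None                 # this toy sample is unifiable
print({w: resolve(ts[0], sub) for w, ts in lex.items()})
# {'John': 'x2', 'loves': (('x2', '\\', 's'), '/', 'x2'), 'Mary': 'x2'}
# i.e. a rigid lexicon in which the variable x2 plays the role of np.
```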

    A Compositional Vector Space Model of Ellipsis and Anaphora.

    This thesis discusses research in compositional distributional semantics: if words are defined by their use in language and represented as high-dimensional vectors reflecting their co-occurrence behaviour in textual corpora, how should words be composed to produce a similar numerical representation for sentences, paragraphs, and documents? Neural methods learn a task-dependent composition by generalising over large datasets, whereas type-driven approaches stipulate that composition is given by a functional view on words, leaving open the question of what those functions should do, concretely. We take the type-driven approach to compositional distributional semantics and focus on the categorical framework of Coecke, Grefenstette, and Sadrzadeh [CGS13], which models composition as an interpretation of syntactic structures as linear maps on vector spaces using the language of category theory, as well as the two-step approach of Muskens and Sadrzadeh [MS16], where syntactic structures map to lambda logical forms that are instantiated by a concrete composition model. We develop the theory behind these approaches to cover phenomena not dealt with in previous work, evaluate the models on sentence-level tasks, and implement a tensor learning method that generalises to arbitrary sentences. This thesis reports three main contributions. The first, theoretical in nature, discusses the ability of categorical and lambda-based models of compositional distributional semantics to model ellipsis, anaphora, and parasitic gaps: phenomena that challenge the linearity of previous compositional models. Secondly, we perform an evaluation study on verb phrase ellipsis in which we introduce three novel sentence evaluation datasets and compare algebraic, neural, and tensor-based composition models, showing that models that resolve ellipsis achieve higher correlation with humans. Finally, we generalise the skipgram model [Mik+13] to a tensor-based setting and implement it for transitive verbs, showing that neural methods for learning tensor representations of words can outperform previous tensor-based methods on compositional tasks.
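    The transitive-verb case of such tensor-based composition can be sketched as follows, with toy dimensions and random arrays standing in for corpus-derived representations (all names below are our assumptions, not the thesis's code): the verb is a rank-3 tensor, i.e. a bilinear map sending the subject and object vectors to a sentence vector.

```python
# Type-driven composition of "subject verb object" with a rank-3 verb tensor.
import numpy as np

d = 4                                  # toy dimensionality of the noun space
rng = np.random.default_rng(0)
subj = rng.normal(size=d)              # vector for the subject noun
obj = rng.normal(size=d)               # vector for the object noun
verb = rng.normal(size=(d, d, d))      # transitive verb as a bilinear map

# s_i = sum_{j,k} verb[j, i, k] * subj[j] * obj[k]
sentence = np.einsum("jik,j,k->i", verb, subj, obj)
print(sentence.shape)                  # (4,): a vector in sentence space

# Composed sentences can then be compared by cosine similarity, as in the
# sentence-level evaluation tasks.
def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
```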

    Apprentissage de grammaires catégorielles (transducteurs d'arbres et clustering pour induction de grammaires catégorielles)

    Nowadays we have become familiar with software interacting with us in natural language (for example in question-answering systems for after-sales services, in human-computer interaction, or in simple discussion bots). These tools have to either react to keywords or, more ambitiously, try to understand the sentence in its context. Though the simplest of these programs merely match recognized keywords against a set of pre-programmed sentences (such systems include Eliza but also more modern ones like Siri), more sophisticated systems make an effort to understand the structure and the meaning of sentences (systems like Watson), allowing them to generate consistent answers, both with respect to the meaning of the sentence (semantics) and with respect to its form (syntax). In this thesis we focus on syntax and on how to model it with categorial grammars. Our goal is to generate syntactically accurate sentences (without the semantic aspect) and to verify that a given sentence belongs to a language, here French. We note that AB grammars can, with the exception of some phenomena like quantification and extraction, also serve as a basis for semantics by extracting lambda-terms. We cover both grammar extraction from treebanks and parsing with the extracted grammars. For this purpose, we present two extraction methods and a parsing method that lets us test our grammars. The first method consists in creating a generalized tree transducer which transforms syntactic trees into derivation trees of an AB grammar. Applied to the French treebanks at our disposal, it yields a fairly complete grammar of French together with a wide-coverage lexicon. The transducer, even though it departs only slightly from the usual definition of a top-down transducer, offers a new, compact way of writing transduction rules. We currently transduce 92.5% of the sentences in the treebanks into derivation trees. For our second method, we use a unification algorithm, guiding it with a preliminary clustering step that gathers words according to their context in the sentence. Comparison with the trees extracted by the transducer gives the encouraging result of 91.3% similarity. Finally, we implement a probabilistic version of the CYK algorithm, with a formula-assignment step done by a supertagger, to test the efficiency of our grammars in parsing. The coverage obtained lies between 84.6% and 92.6%, depending on the input sentence set. The probabilities, applied both to the types of ambiguous words and to the rules, let us select only the best derivation tree. All our software is available for download under the GNU GPL licence.
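    The parsing step can be illustrated with a Viterbi-style probabilistic CYK over a toy grammar in Chomsky normal form; the thesis's actual grammar, lexicon, and probabilities are estimated from treebanks, so everything below is a placeholder sketch.

```python
# Probabilistic CYK: best[i, j, A] is the best probability of deriving
# words[i:j] from the nonterminal A.
from collections import defaultdict

LEX = {"John": [("NP", 1.0)], "Mary": [("NP", 1.0)], "loves": [("V", 1.0)]}
RULES = [("S", "NP", "VP", 1.0), ("VP", "V", "NP", 1.0)]  # A -> B C : p

def cyk_best(words):
    n = len(words)
    best = defaultdict(float)
    for i, w in enumerate(words):                   # fill in lexical cells
        for cat, p in LEX[w]:
            best[i, i + 1, cat] = p
    for span in range(2, n + 1):                    # larger spans from smaller
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for a, b, c, p in RULES:
                    q = p * best[i, k, b] * best[k, j, c]
                    best[i, j, a] = max(best[i, j, a], q)
    return best[0, n, "S"]

print(cyk_best("John loves Mary".split()))          # 1.0 in this toy grammar
```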

    Meaning versus Grammar

    This volume investigates the complicated relationship between grammar, computation, and meaning in natural languages. It details conditions under which meaning-driven processing of natural language is feasible, discusses an operational and accessible implementation of the grammatical cycle for Dutch, and offers analyses of a number of further conjectures about constituency and entailment in natural language.

    Algebraic dependency grammar

    We propose a mathematical formalism called Algebraic Dependency Grammar, with applications to formal linguistics and to formal language theory. Regarding formal linguistics, we aim to address the problem of grammaticality, with special attention to cross-linguistic cases. In the field of formal language theory, this formalism provides a new perspective allowing an algebraic classification of languages. Notably, our approach suggests the existence of so-called anti-classes of languages associated with certain classes of languages. Our notion of a dependency grammar is that of a definition of a set of well-constructed dependency trees (we call this algebraic governance) together with a relation which associates word orders to dependency trees (we call this algebraic linearization). In relation to algebraic governance, we define a manifold, which is a set of dependency trees satisfying an agreement condition over a pattern, a pattern being the algebraic form of a collection of syntactic addresses over the dependency tree. A boolean condition on the words formalizes the notion of agreement. In relation to algebraic linearization, we first observe that the essence of the notion of projectivity is that certain substructures of a dependency tree always form an interval in its linearization. So we must first establish what a substructure is; once again, patterns provide the key, generalizing the notion of projectivity with recursive linearization procedures. Combining the two modules gives the formalism: an algebraic dependency grammar is a manifold together with a linearization. Notice that patterns underlie both manifolds and linearizations; we study their interrelation in terms of a new algebraic classification of classes of languages. We highlight the main contributions of the thesis. Regarding mathematical linguistics, algebraic dependency grammar treats trees and word order as different modules in the architecture, which allows the description of languages with varied word order. Ellipses are permitted; this issue is usually avoided because it makes some formalisms undecidable. We differentiate linguistic phenomena structurally by their algebraic description, and the formalism permits the observation of affinities between linguistic constructions that seem superficially different. Regarding formal language theory, a new system for understanding a very large family of languages is presented, which permits viewing languages in broader contexts. We identify a new class, named anti-context-free languages, containing constructions structurally symmetric to context-free languages. Informally, we could say that context-free languages are well-parenthesized, while anti-context-free languages are cross-serial-parenthesized. For example, copy languages and 'respectively' languages are anti-context-free.
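    The interval property at the heart of the linearization module can be made concrete in a small sketch; the head-array encoding and the example trees below are our own illustration, not the thesis's notation.

```python
# A dependency tree is projective iff every subtree's yield is a contiguous
# interval of positions in the sentence.

def yields(heads):
    """heads[i] is the position of word i's head, or -1 for the root.
    Returns, for each word, the set of positions in its subtree."""
    out = [{i} for i in range(len(heads))]
    for i in range(len(heads)):
        j = i
        while heads[j] != -1:          # walk up, adding i to every ancestor
            j = heads[j]
            out[j].add(i)
    return out

def is_projective(heads):
    return all(max(y) - min(y) + 1 == len(y) for y in yields(heads))

print(is_projective([1, -1, 1]))       # True: both dependents flank the root
# Word 3 hangs from word 0 across the root's other dependent, so word 0's
# yield {0, 3} skips positions 1 and 2: a cross-serial, non-projective tree.
print(is_projective([1, -1, 1, 0]))    # False
```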

    Mathematical linguistics

    This is still an early draft, version 0.56, August 1, 2001.

    Rigid Grammars in the Associative-Commutative Lambek Calculus are not Learnable

    In (Kanazawa, 1998) it was shown that rigid Classical Categorial Grammars are learnable (in the sense of (Gold, 1967)) from strings. Surprisingly, there are recent negative results for, among others, rigid associative Lambek (L) grammars.