    A Study on Learnability for Rigid Lambek Grammars

    We present basic notions of Gold's "learnability in the limit" paradigm, first presented in 1967, a formalization of the cognitive process by which native speakers come to grasp the underlying grammar of their native language by being exposed to well-formed sentences generated by that grammar. Then we present Lambek grammars, a formalism derived from categorial grammars which, although not as expressive as needed for a full formalization of natural languages, is particularly well suited to implementing a natural interface between syntax and semantics. In the last part of this work, we present a learnability result for Rigid Lambek grammars from structured examples.

    Learnability of type-logical grammars

    A procedure for learning a lexical assignment together with a system of syntactic and semantic categories given a fixed type-logical grammar is briefly described. The logic underlying the grammar can be any cut-free decidable modally enriched extension of the Lambek calculus, but the correspondence between syntactic and semantic categories must be constrained so that no infinite set of categories is ultimately used to generate the language. It is shown that under these conditions various linguistically valuable subsets of the range of the algorithm are classes identifiable in the limit from data consisting of sentences labeled by simply typed lambda calculus meaning terms in normal form. The entire range of the algorithm is shown to be not a learnable class, contrary to a mistaken result reported in a preliminary version of this paper. It is informally argued that, given the right type logic, the learnable classes of grammars include members which generate natural languages, and thus that natural languages are learnable in this way.

    Learning categorial grammars

    In 1967 E. M. Gold published a paper in which the language classes from the Chomsky hierarchy were analyzed in terms of learnability, in the technical sense of identification in the limit. His results were mostly negative, and perhaps because of this his work had little impact on linguistics. In the early eighties there was renewed interest in the paradigm, mainly because of work by Angluin and Wright. Around the same time, Arikawa and his co-workers refined the paradigm by applying it to so-called Elementary Formal Systems. By making use of this approach Takeshi Shinohara was able to come up with an impressive result: any class of context-sensitive grammars with a bound on its number of rules is learnable. Some linguistically motivated work on learnability also appeared from this point on, most notably Wexler & Culicover 1980 and Kanazawa 1994. The latter investigates the learnability of various classes of categorial grammar, inspired by work by Buszkowski and Penn, and raises some interesting questions. We follow up on this work by exploring complexity issues relevant to learning these classes, answering an open question from Kanazawa 1994, and applying the same kind of approach to obtain (non)learnable classes of Combinatory Categorial Grammars, Tree Adjoining Grammars, Minimalist grammars, Generalized Quantifiers, and some variants of Lambek Grammars. We also discuss work on learning tree languages and its application to learning Dependency Grammars. Our main conclusions are: (1) formal learning theory is relevant to linguistics; (2) identification in the limit is feasible for non-trivial classes; (3) the `Shinohara approach' (i.e., placing a numerical bound on the complexity of a grammar) can lead to a learnable class, but this depends entirely on the specific nature of the formalism and the notion of complexity, and we give examples of natural classes of commonly used linguistic formalisms that resist this kind of approach; (4) learning is hard work: our results indicate that learning even `simple' classes of languages requires a lot of computational effort; (5) dealing with structure (derivation-, dependency-) languages instead of string languages offers a useful and promising approach to learnability in a linguistic context.

    Structures Abstract

    This paper is concerned with learning categorial grammars in the model of Gold. We show that rigid and k-valued non-associative Lambek grammars are learnable from function-argument structured sentences. Function-argument structures are natural syntactic decompositions of sentences into sub-components, with an indication of the head of each sub-component. This result is interesting and surprising because, for every k, the class of k-valued NL grammars has infinite elasticity, so one might expect it not to be learnable, which is not the case. Moreover, these classes are very close to unlearnable classes such as k-valued associative Lambek grammars learned from function-argument sentences, or k-valued non-associative Lambek calculus grammars learned from well-bracketed lists of words or from strings. Thus, the class of k-valued non-associative Lambek grammars learned from function-argument sentences is at the frontier between learnable and unlearnable classes of languages.

    The Logic of Categorial Grammars: Lecture Notes

    These lecture notes present categorial grammars as deductive systems, and include detailed proofs of their main properties. The first chapter deals with Ajdukiewicz and Bar-Hillel categorial grammars (AB grammars), their relation to context-free grammars and their learning algorithms. The second chapter is devoted to the Lambek calculus as a deductive system; the weak equivalence with context-free grammars is proved; we also define the mapping from a syntactic analysis to a higher-order logical formula, which describes the semantics of the parsed sentence. The third and last chapter is about proof-nets as parse structures for Lambek grammars; we show the linguistic relevance of these graphs, in particular through the study of a performance question. Although definitions, theorems and proofs have been reformulated for pedagogical reasons, these notes contain no personal results except in the proof-net chapter.

    Category-Theoretic Quantitative Compositional Distributional Models of Natural Language Semantics

    This thesis is about the problem of compositionality in distributional semantics. Distributional semantics presupposes that the meanings of words are a function of their occurrences in textual contexts. It models words as distributions over these contexts and represents them as vectors in high dimensional spaces. The problem of compositionality for such models concerns itself with how to produce representations for larger units of text by composing the representations of smaller units of text. This thesis focuses on a particular approach to this compositionality problem, namely using the categorical framework developed by Coecke, Sadrzadeh, and Clark, which combines syntactic analysis formalisms with distributional semantic representations of meaning to produce syntactically motivated composition operations. This thesis shows how this approach can be theoretically extended and practically implemented to produce concrete compositional distributional models of natural language semantics. It furthermore demonstrates that such models can perform on par with, or better than, other competing approaches in the field of natural language processing. There are three principal contributions to computational linguistics in this thesis. The first is to extend the DisCoCat framework on the syntactic front and semantic front, incorporating a number of syntactic analysis formalisms and providing learning procedures allowing for the generation of concrete compositional distributional models. The second contribution is to evaluate the models developed from the procedures presented here, showing that they outperform other compositional distributional models present in the literature. The third contribution is to show how using category theory to solve linguistic problems forms a sound basis for research, illustrated by examples of work on this topic, that also suggest directions for future research. Comment: DPhil Thesis, University of Oxford, Submitted and accepted in 201

    Apprentissage de grammaires catégorielles (transducteurs d'arbres et clustering pour induction de grammaires catégorielles)

    Nowadays, we have become familiar with software that interacts with us in natural language (for example, question-answering systems for after-sales service, human-computer interfaces, or simple discussion bots). These tools either react through keyword extraction or, more ambitiously, try to understand the sentence in its context. The simplest of these programs only have a set of pre-programmed sentences with which to react to recognized keywords (these systems include Eliza but also more modern systems like Siri), whereas more sophisticated systems make an effort to understand the structure and the meaning of sentences (these include systems like Watson), allowing them to generate consistent answers, both with respect to the meaning of the sentence (semantics) and with respect to its form (syntax). In this thesis, we focus on syntax and on how to model it using categorial grammars. Our goal is both to generate syntactically correct sentence skeletons (without the semantic aspect) and to verify that a given sentence belongs to a language, here French. We note that AB grammars, with the exception of some phenomena like quantification and extraction, can also serve as a basis for semantics through the extraction of λ-terms. We cover both grammar extraction from treebanks and parsing with the extracted grammars. For this purpose, we present two extraction methods and test the resulting grammars using standard parsing algorithms. The first method consists of creating a generalized tree transducer, which transforms syntactic trees into derivation trees of an AB grammar. Applied to the various French treebanks, the transducer's output gives us a wide-coverage lexicon and a grammar suitable for parsing. The transducer, even though it differs only slightly from the usual definition of a top-down transducer, offers several new, compact ways to express transduction rules. We currently transduce 92.5% of all sentences in the treebanks into derivation trees. For our second method, we use a unification algorithm, guided by a preliminary clustering step that groups words according to their context in the sentence. Comparing its output with the transduced trees gives the promising result of 91.3% similarity. Finally, we have tested our grammars on sentence analysis with a probabilistic CYK algorithm and a formula assignment step performed by a supertagger. The coverage obtained lies between 84.6% and 92.6%, depending on the input corpus. The probabilities, estimated both for the types of words (when a word has several) and for the rules, enable us to select only the best derivation tree. All our software is available for download under the GNU GPL licence.

    Prospects for Declarative Mathematical Modeling of Complex Biological Systems

    Declarative modeling uses symbolic expressions to represent models. With such expressions one can formalize high-level mathematical computations on models that would be difficult or impossible to perform directly on a lower-level simulation program written in a general-purpose programming language. Examples of such computations on models include model analysis, relatively general-purpose model-reduction maps, and the initial phases of model implementation, all of which should preserve or approximate the mathematical semantics of a complex biological model. The potential advantages are particularly relevant in the case of developmental modeling, wherein complex spatial structures exhibit dynamics at molecular, cellular, and organogenic levels to relate genotype to multicellular phenotype. Multiscale modeling can benefit from both the expressive power of declarative modeling languages and the application of model reduction methods to link models across scales. Based on previous work, here we define declarative modeling of complex biological systems by defining the operator algebra semantics of an increasingly powerful series of declarative modeling languages including reaction-like dynamics of parameterized and extended objects; we define semantics-preserving implementation and semantics-approximating model reduction transformations; and we outline a "meta-hierarchy" for organizing declarative models and the mathematical methods that can fruitfully manipulate them.

    Meaning versus Grammar

    This volume investigates the complicated relationship between grammar, computation, and meaning in natural languages. It details conditions under which meaning-driven processing of natural language is feasible, discusses an operational and accessible implementation of the grammatical cycle for Dutch, and offers analyses of a number of further conjectures about constituency and entailment in natural language.