
    A Generalised Quantifier Theory of Natural Language in Categorical Compositional Distributional Semantics with Bialgebras

    Categorical compositional distributional semantics is a model of natural language; it combines the statistical vector space models of words with the compositional models of grammar. We formalise in this model the generalised quantifier theory of natural language, due to Barwise and Cooper. The underlying setting is a compact closed category with bialgebras. We start from a generative grammar formalisation and develop an abstract categorical compositional semantics for it, then instantiate the abstract setting to sets and relations and to finite dimensional vector spaces and linear maps. We prove the equivalence of the relational instantiation to the truth-theoretic semantics of generalised quantifiers. The vector space instantiation formalises the statistical usages of words and enables us, for the first time, to reason about quantified phrases and sentences compositionally in distributional semantics.
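
    To make the relational instantiation concrete, here is a minimal Python sketch of the truth-theoretic semantics of generalised quantifiers in the Barwise-Cooper style. The universe and the word denotations are invented for illustration and are not drawn from the paper.

```python
# A minimal sketch of the relational (truth-theoretic) reading of
# generalised quantifiers. The universe, nouns, and verbs below are
# illustrative assumptions, not examples from the paper.

UNIVERSE = {"fido", "rex", "felix"}
dogs = {"fido", "rex"}          # denotation of the noun "dog"
barkers = {"fido", "rex"}       # denotation of the verb phrase "barks"

# A generalised quantifier Q relates a noun denotation A to the
# verb-phrase denotations B that make "Q A B" true.
def every(A, B):
    return A <= B               # "every A B" is true iff A is a subset of B

def some(A, B):
    return bool(A & B)          # "some A B" is true iff A and B intersect

print(every(dogs, barkers))            # True: every dog barks
print(some(dogs, UNIVERSE - barkers))  # False: no dog fails to bark
```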

    Computer-aided exploration of architectural design spaces: a digital sketchbook

    The design process of architects rarely follows a linear path from design brief to final result; rather, it is characterised by exploration, the traversal of multiple alternatives in a (conceptual) design space. In practice this process is often supported by manual sketching, where the designer's sketchbook can be read as a series of explorations. This kind of interaction with the design space is supported to a far lesser extent by contemporary computer-aided design systems. The metaphor of a digital sketchbook, in which human exploration is amplified by the computational power of a computer, is the central research theme of this dissertation. Although the notion of a design space at first sight appears indebted to the research field of artificial intelligence (AI), design is interpreted here more broadly than problem solving. Shape grammars are employed as the research methodology: on the one hand they align closely with AI and offer a formal framework for the exploration of design spaces, while at the same time they also resist AI and admit a form of visual thinking and ambiguity. The two accompanying research questions are how these shape grammars can be represented digitally, and how the designer-computer interaction should take place. The results of these two research questions form the basis of a new tool for architects: the digital sketchbook.
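
    As a rough illustration of the kind of grammar-driven exploration the dissertation studies, the sketch below enumerates a design space by exhaustively applying rewrite rules, with shapes simplified to strings; real shape grammars rewrite geometry, and the rules here are hypothetical.

```python
# A minimal sketch of grammar-driven design-space exploration. Shapes are
# simplified to strings so the exploration loop stands out; the rewrite
# rules are invented for illustration.
from collections import deque

RULES = [("A", "AB"), ("B", "A")]   # hypothetical rewrite rules

def neighbours(design):
    """All designs reachable by one rule application at one position."""
    for lhs, rhs in RULES:
        start = 0
        while (i := design.find(lhs, start)) != -1:
            yield design[:i] + rhs + design[i + len(lhs):]
            start = i + 1

def explore(seed, depth):
    """Breadth-first enumeration of the design space up to a given depth."""
    seen, frontier = {seed}, deque([(seed, 0)])
    while frontier:
        design, d = frontier.popleft()
        if d < depth:
            for nxt in neighbours(design):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, d + 1))
    return seen

print(sorted(explore("A", 3)))
```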

    Protecting Systems From Exploits Using Language-Theoretic Security

    Any computer program processing input from the user or network must validate the input. Input-handling vulnerabilities occur in programs when the software component responsible for filtering malicious input---the parser---does not perform validation adequately. Consequently, parsers are among the most targeted components, since they defend the rest of the program from malicious input. This thesis adopts the Language-Theoretic Security (LangSec) principle to understand what tools and research are needed to prevent exploits that target parsers. LangSec proposes specifying the syntactic structure of the input format as a formal grammar. We then build a recognizer for this formal grammar to validate any input before the rest of the program acts on it. To ensure that these recognizers represent the data format, programmers often rely on parser generator or parser combinator tools to build the parsers. This thesis propels several sub-fields in LangSec by proposing new techniques to find bugs in implementations, novel categorizations of vulnerabilities, and new parsing algorithms and tools to handle practical data formats. To this end, the thesis comprises five parts that tackle various tenets of LangSec. First, I categorize various input-handling vulnerabilities and exploits using two frameworks: the mismorphisms framework, which helps us reason about the root causes leading to various vulnerabilities, and a categorization framework we built from LangSec anti-patterns such as parser differentials and insufficient input validation. We also built a catalog of more than 30 popular vulnerabilities to demonstrate the two frameworks. Second, I built parsers for various Internet of Things and power grid network protocols and the iccMAX file format using parser combinator libraries. The parsers I built for power grid protocols were deployed and tested on power grid substation networks as an intrusion detection tool. The parser I built for the iccMAX file format led to several corrections and modifications to the iccMAX specifications and reference implementations. Third, I present SPARTA, a novel tool I built that generates Rust code to type-check Portable Document Format (PDF) files. The type checker I helped build strictly enforces the constraints in the PDF specification to find deviations. Our checker has contributed to at least four significant clarifications and corrections to the PDF 2.0 specification and various open-source PDF tools. In addition to our checker, we also built a practical tool, PDFFixer, to dynamically patch type errors in PDF files. Fourth, I present ParseSmith, a tool to build verified parsers for real-world data formats. Most parsing tools available for data formats are insufficient to handle practical formats or have not been verified for their correctness. I built a verified parsing tool in Dafny that builds on ideas from attribute grammars, data-dependent grammars, and parsing expression grammars to tackle constructs commonly seen in network formats. I prove that our parsers run in linear time and always terminate for well-formed grammars. Finally, I provide the earliest systematic comparison of various data description languages (DDLs) and their parser generation tools. DDLs are used to describe and parse commonly used data formats, such as image formats. I conducted an expert-elicitation qualitative study to derive the metrics I use to compare the DDLs, and I systematically compare the DDLs on the sample data descriptions that ship with them, checking for correctness and resilience.
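
    As a minimal sketch of the LangSec recipe the thesis builds on, the following toy recognizer specifies an input format with parser combinators and accepts input only if it matches the grammar completely; it is illustrative only and unrelated to the thesis's actual tools.

```python
# A minimal parser-combinator sketch in the LangSec spirit: specify the
# input format as a grammar and reject anything that does not match before
# the rest of the program acts on it. Format and combinators are invented.

def char(c):
    """Match one literal character."""
    def parse(s, i):
        return (i + 1,) if i < len(s) and s[i] == c else None
    return parse

def digit():
    """Match one decimal digit."""
    def parse(s, i):
        return (i + 1,) if i < len(s) and s[i].isdigit() else None
    return parse

def seq(*parsers):
    """Match each parser in order."""
    def parse(s, i):
        for p in parsers:
            r = p(s, i)
            if r is None:
                return None
            (i,) = r
        return (i,)
    return parse

def many1(p):
    """Match one or more repetitions of p."""
    def parse(s, i):
        r = p(s, i)
        if r is None:
            return None
        while r is not None:
            (i,) = r
            r = p(s, i)
        return (i,)
    return parse

# Toy format: "<number>:<number>". Input is accepted only if the
# recognizer consumes it entirely.
version = seq(many1(digit()), char(":"), many1(digit()))

def accepts(s):
    r = version(s, 0)
    return r is not None and r[0] == len(s)

print(accepts("10:42"))               # True
print(accepts("10:42; DROP TABLE"))   # False: malformed input is rejected
```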

    An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities

    We describe an extension of Earley's parser for stochastic context-free grammars that computes the following quantities given a stochastic context-free grammar and an input string: a) probabilities of successive prefixes being generated by the grammar; b) probabilities of substrings being generated by the nonterminals, including the entire string being generated by the grammar; c) most likely (Viterbi) parse of the string; d) posterior expected number of applications of each grammar production, as required for reestimating rule probabilities. (a) and (b) are computed incrementally in a single left-to-right pass over the input. Our algorithm compares favorably to standard bottom-up parsing methods for SCFGs in that it works efficiently on sparse grammars by making use of Earley's top-down control structure. It can process any context-free rule format without conversion to some normal form, and combines computations for (a) through (d) in a single algorithm. Finally, the algorithm has simple extensions for processing partially bracketed inputs, and for finding partial parses and their likelihoods on ungrammatical inputs.
    Comment: 45 pages. Slightly shortened version to appear in Computational Linguistics 2
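
    The paper computes quantities (a) through (d) in a single Earley-style pass; as a simpler self-contained illustration of quantity (b), the sketch below computes substring probabilities with the classic inside algorithm for a toy stochastic CFG in Chomsky normal form (note the paper's algorithm specifically avoids such normal-form conversion). The grammar and sentence are invented.

```python
# Quantity (b) via the classic inside algorithm: the probability that each
# nonterminal generates each substring, for a toy SCFG in Chomsky normal
# form. This is an illustration, not the paper's Earley-based method.
from collections import defaultdict

# Rules: (lhs, rhs, probability); rhs is one terminal or two nonterminals.
LEXICAL = [("N", "time", 0.7), ("N", "flies", 0.3),
           ("V", "flies", 1.0)]
BINARY = [("S", ("N", "V"), 1.0)]

def inside(words):
    """beta[(i, j, X)] = P(X derives words[i:j])."""
    n = len(words)
    beta = defaultdict(float)
    for i, w in enumerate(words):                 # width-1 spans
        for lhs, term, p in LEXICAL:
            if term == w:
                beta[(i, i + 1, lhs)] += p
    for width in range(2, n + 1):                 # wider spans, bottom-up
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):             # split point
                for lhs, (b, c), p in BINARY:
                    beta[(i, j, lhs)] += p * beta[(i, k, b)] * beta[(k, j, c)]
    return beta

words = ["time", "flies"]
beta = inside(words)
print(beta[(0, 2, "S")])   # P(S derives "time flies") = 1.0 * 0.7 * 1.0 = 0.7
```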

    What Does a Grammar Formalism Say About a Language?

    Over the last ten or fifteen years there has been a shift in generative linguistics away from formalisms based on a procedural interpretation of grammars towards constraint-based formalisms—formalisms that define languages by specifying a set of constraints that characterize the set of well-formed structures analyzing the strings in the language. A natural extension of this trend is to define this set of structures model-theoretically—to define it as the set of mathematical structures that satisfy some set of logical axioms. This approach raises a number of questions about the nature of linguistic theories and the role of grammar formalisms in expressing them. We argue here that the crux of what theories of syntax have to say about language lies in the abstract properties of the sets of structures they license. This is the level that is most directly connected to the empirical basis of these theories and it is the level at which it is possible to make meaningful comparisons between the approaches. From this point of view, grammar formalisms (or formal frameworks) are primarily means of presenting these properties. Many of the apparent distinctions between formalisms, then, may well be artifacts of their presentation rather than substantive distinctions between the properties of the structures they license. The model-theoretic approach offers a way in which to abstract away from the idiosyncrasies of these presentations. Having said that, we must distinguish between the class of sets of structures licensed by a linguistic theory and the set of structures licensed by a specific instance of the theory—by a grammar expressing that theory. Theories of syntax are not simply accounts of the structure of individual languages in isolation, but rather include assertions about the organization of the structure of human languages in general. These universal aspects of the theories present two challenges for the model-theoretic approach. First, they frequently are not properties of individual structures, but are rather properties of sets of structures. Thus, in capturing these model-theoretically one is not defining sets of structures but is rather defining classes of sets of structures; these are not first-order properties. Secondly, the universal aspects of linguistic theories are frequently not explicit, but are consequences of the nature of the formalism that embodies the theory. In capturing these one must develop an explicit axiomatic treatment of the formalism. This is both a challenge and a powerful benefit of the approach. Such re-interpretations tend to raise a variety of issues that are often overlooked in the original formalization. In this report we examine these issues within the context of a model-theoretic reinterpretation of Generalized Phrase-Structure Grammar. While there is little current active research on GPSG, it provides an ideal laboratory for exploring these issues. First, the formalism of GPSG is expressly intended to embody a great deal of the accompanying linguistic theory. Thus it provides a variety of opportunities for examining principles expressed as restrictions on the formalism from a model-theoretic point of view. At the same time, the fact that these restrictions embody universal grammar principles provides us with a variety of opportunities to explore the way in which the linguistic theory expressed by a grammar can transcend the mathematical theory of the structures it licenses. Finally, GPSG, although defined declaratively, is a formalism with restricted generative capacity, a characteristic more typical of the earlier procedural formalisms. As such, one component of the theory it embodies is a claim about the language-theoretic complexity of natural languages. Such claims are difficult to establish for any of the constraint-based approaches to grammar. We can show, however, that the class of sets of trees that are definable within the logical language we employ in reformalizing GPSG is nearly exactly the class of sets of trees definable within the basic GPSG formalism. Thus we are able to capture the language-theoretic consequences of GPSG's restricted formalism by employing a restricted logical language.
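
    To make the model-theoretic perspective concrete, the sketch below treats a grammar as a set of constraints over trees and the licensed language as the set of trees satisfying all of them; the trees and constraints are invented illustrations, not GPSG's actual principles.

```python
# A minimal sketch of the model-theoretic view: a "grammar" is a set of
# axioms over trees, and the language it licenses is the set of trees
# satisfying all of them. Labels and constraints are invented.

# A tree is (label, children); a leaf has an empty child tuple.
tree = ("S", (("NP", ()), ("VP", (("V", ()), ("NP", ())))))

def nodes(t):
    """Yield every subtree, root first."""
    yield t
    for child in t[1]:
        yield from nodes(child)

# Constraints are predicates over whole trees.
def rooted_in_s(t):
    return t[0] == "S"

def branching_bounded(t, k=2):
    return all(len(n[1]) <= k for n in nodes(t))

def vp_contains_verb(t):
    return all(any(c[0] == "V" for c in n[1])
               for n in nodes(t) if n[0] == "VP")

AXIOMS = [rooted_in_s, branching_bounded, vp_contains_verb]

def licensed(t):
    return all(axiom(t) for axiom in AXIOMS)

print(licensed(tree))   # True: the tree satisfies all three axioms
```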