
    FlashProfile: A Framework for Synthesizing Data Profiles

    We address the problem of learning a syntactic profile for a collection of strings, i.e., a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify the different formats is infeasible in standard big-data scenarios. Prior techniques are restricted to a small set of pre-defined patterns (e.g., digits, letters, words, etc.) and provide no control over the granularity of profiles. We define syntactic profiling as a problem of clustering strings based on syntactic similarity, followed by identifying patterns that succinctly describe each cluster. We present a technique for synthesizing such profiles over a given language of patterns, which also allows for interactive refinement by requesting a desired number of clusters. Using a state-of-the-art inductive synthesis framework, PROSE, we have implemented our technique as FlashProfile. Across 153 tasks over 75 large real datasets, we observe a median profiling time of only ∼0.7 s. Furthermore, we show that access to syntactic profiles may allow for more accurate synthesis of programs, i.e., using fewer examples, in programming-by-example (PBE) workflows such as FlashFill. Comment: 28 pages; SPLASH (OOPSLA) 2018
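
    The pipeline described above — cluster strings by syntactic similarity, then describe each cluster with a pattern — can be conveyed with a deliberately simplified sketch. The version below replaces FlashProfile's learned dissimilarity and pattern search with exact equality of character-class "shapes", so the clustering is far cruder than the actual system; the function names and the toy pattern language are assumptions made purely for illustration.

```python
# A minimal sketch of syntactic profiling in the spirit of FlashProfile:
# map each string to a coarse "shape" over character classes, group strings
# with identical shapes, and emit one regex-like pattern per group. The real
# system clusters by a learned syntactic dissimilarity and searches a much
# richer pattern language; this toy uses exact shape equality instead.
import re
from collections import defaultdict

def shape(s: str) -> tuple:
    """Collapse runs of digits/letters into class tokens; keep other chars."""
    toks = []
    for c in s:
        k = 'd' if c.isdigit() else 'a' if c.isalpha() else c
        if k in 'da' and toks and toks[-1] == k:
            continue                      # collapse runs of one class
        toks.append(k)
    return tuple(toks)

PATTERN = {'d': r'\d+', 'a': r'[A-Za-z]+'}

def profile(strings):
    """Cluster by shape, then render each cluster's shape as a pattern."""
    clusters = defaultdict(list)
    for s in strings:
        clusters[shape(s)].append(s)
    return {''.join(PATTERN.get(t, re.escape(t)) for t in sig): members
            for sig, members in clusters.items()}

data = ['2018-01-02', '1999-12-31', 'N/A', 'n/a', 'Jan 3, 2018']
for pat, members in profile(data).items():
    assert all(re.fullmatch(pat, m) for m in members)
    print(f'{pat!r:40} covers {members}')
```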

    Inducing Probabilistic Grammars by Bayesian Model Merging

    We describe a framework for inducing probabilistic grammars from corpora of positive samples. First, samples are incorporated by adding ad-hoc rules to a working grammar; subsequently, elements of the model (such as states or nonterminals) are merged to achieve generalization and a more compact representation. The choice of what to merge and when to stop is governed by the Bayesian posterior probability of the grammar given the data, which formalizes a trade-off between a close fit to the data and a default preference for simpler models ('Occam's Razor'). The general scheme is illustrated using three types of probabilistic grammars: hidden Markov models, class-based n-grams, and stochastic context-free grammars. Comment: To appear in Grammatical Inference and Applications, Second International Colloquium on Grammatical Inference; Springer Verlag, 1994. 13 pages
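
    A toy rendition of the scheme for the class-based n-gram case may help make the posterior trade-off concrete. In this hedged sketch, each word starts in its own class (the incorporation step), and classes are greedily merged as long as log P(G) + log P(D|G) improves; the flat per-class penalty below is a crude stand-in for the paper's structural prior, and all names are invented for illustration.

```python
# Toy Bayesian model merging for class-based bigrams: merge word classes
# while the posterior (likelihood + a simplicity prior) keeps improving.
import math
from collections import Counter
from itertools import combinations

def log_posterior(tokens, cls, class_penalty=4.0):
    """log P(G) + log P(D|G) for a class-bigram model:
    P(w'|w) = P(C(w')|C(w)) * P(w'|C(w'))."""
    word = Counter(tokens)
    csize = Counter(cls[w] for w in tokens)
    pairs = list(zip(tokens, tokens[1:]))
    bigram = Counter((cls[a], cls[b]) for a, b in pairs)
    hist = Counter(cls[a] for a, _ in pairs)
    ll = sum(math.log(bigram[cls[a], cls[b]] / hist[cls[a]])
             + math.log(word[b] / csize[cls[b]])
             for a, b in pairs)
    return ll - class_penalty * len(set(cls.values()))   # crude 'Occam' prior

def merge_classes(tokens):
    cls = {w: w for w in set(tokens)}      # incorporation: one class per word
    score = log_posterior(tokens, cls)
    while len(set(cls.values())) > 1:
        cands = []
        for c1, c2 in combinations(sorted(set(cls.values())), 2):
            trial = {w: c1 if c == c2 else c for w, c in cls.items()}
            cands.append((log_posterior(tokens, trial), trial))
        best_score, best_cls = max(cands, key=lambda x: x[0])
        if best_score <= score:
            break                          # no merge improves the posterior
        score, cls = best_score, best_cls
    return cls, score

corpus = 'the cat sat on the mat the dog sat on the rug'.split()
classes, score = merge_classes(corpus)
print(score, classes)   # cat/dog and mat/rug typically end up sharing classes
```

    The prior term is what drives generalization: merging, say, 'cat' and 'dog' barely changes the likelihood on this corpus but removes a class, so the penalized posterior prefers the merged, more compact grammar.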

    Analyzing and Interpreting Neural Networks for NLP: A Report on the First BlackboxNLP Workshop

    The EMNLP 2018 workshop BlackboxNLP was dedicated to resources and techniques specifically developed for analyzing and understanding the inner workings of, and representations acquired by, neural models of language. Approaches included: systematically manipulating the input to neural networks and investigating the impact on their performance; testing whether interpretable knowledge can be decoded from intermediate representations acquired by neural networks; proposing modifications to neural network architectures to make their knowledge state or generated output more explainable; and examining the performance of networks on simplified or formal languages. Here we review a number of representative studies in each category.
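
    One of the categories above — decoding interpretable knowledge from intermediate representations — is commonly operationalized as a diagnostic (probing) classifier. The self-contained sketch below substitutes random vectors with a planted signal for real hidden states, so it illustrates only the methodology, not any particular study from the workshop.

```python
# A minimal "diagnostic classifier" probe: train a linear classifier on
# hidden representations to test whether a property is linearly decodable.
# Real studies probe actual hidden states (e.g. from an LSTM or Transformer);
# here random vectors with a planted signal stand in for them.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "hidden states": 500 vectors of width 64, where dimension 3
# secretly encodes a binary property (e.g. subject number).
H = rng.normal(size=(500, 64))
y = (H[:, 3] > 0).astype(float)

train, test = slice(0, 400), slice(400, 500)

# Logistic-regression probe trained with plain gradient descent.
w = np.zeros(64)
for _ in range(500):
    p = 1 / (1 + np.exp(-H[train] @ w))
    w -= 0.1 * H[train].T @ (p - y[train]) / 400

acc = (((H[test] @ w) > 0) == y[test]).mean()
print(f'probe accuracy: {acc:.2f}')  # far above chance => property is decodable
```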

    TRX: A Formally Verified Parser Interpreter

    Parsing is an important problem in computer science, and yet surprisingly little attention has been devoted to its formal verification. In this paper, we present TRX: a parser interpreter formally developed in the proof assistant Coq, capable of producing formally correct parsers. We use parsing expression grammars (PEGs), a formalism essentially representing recursive descent parsing, which we consider an attractive alternative to context-free grammars (CFGs). From this formalization we can extract a parser for an arbitrary PEG with a guarantee of total correctness, i.e., the resulting parser is terminating and correct with respect to its grammar and the semantics of PEGs; both properties are formally proven in Coq. Comment: 26 pages; LMCS
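
    TRX itself is developed and verified in Coq, but the PEG formalism it treats is easy to convey with a small unverified interpreter: a parser maps an input position to a new position on success or a failure marker, and choice is ordered, unlike the nondeterministic choice of CFGs. The Python sketch below illustrates PEG semantics only; it carries none of TRX's correctness guarantees.

```python
# A minimal PEG interpreter: parsers take (input, position) and return the
# new position on success or None on failure.
def lit(c):                      # match a single literal character
    return lambda s, i: i + 1 if i < len(s) and s[i] == c else None

def seq(*ps):                    # e1 e2 ... : all must succeed, in order
    def parse(s, i):
        for p in ps:
            i = p(s, i)
            if i is None:
                return None
        return i
    return parse

def alt(p1, p2):                 # e1 / e2 : *ordered* choice, the PEG hallmark
    def parse(s, i):
        r = p1(s, i)
        return r if r is not None else p2(s, i)
    return parse

empty = lambda s, i: i           # epsilon: always succeeds, consumes nothing

# Balanced parentheses: B <- '(' B ')' B / epsilon
def balanced(s, i):
    return alt(seq(lit('('), balanced, lit(')'), balanced), empty)(s, i)

assert balanced('(()())', 0) == 6   # consumed the whole input
assert balanced('(()', 0) == 0      # first alternative fails; epsilon matches
```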

    Generalizing input-driven languages: theoretical and practical benefits

    Regular languages (RL) are the simplest family in Chomsky's hierarchy. Thanks to their simplicity, they enjoy various nice algebraic and logical properties that have been successfully exploited in many application fields. Practically all of their related problems are decidable, so that they support automatic verification algorithms. Also, they can be recognized in real time. Context-free languages (CFL) are another major family well suited to formalizing programming, natural, and many other classes of languages; their increased generative power w.r.t. RL, however, causes the loss of several closure properties and of the decidability of important problems; furthermore, they need complex parsing algorithms. Thus, various subclasses thereof have been defined with different goals, ranging from efficient, deterministic parsing to closure properties, logical characterization, and automatic verification techniques. Among CFL subclasses, so-called structured ones, i.e., those where the typical tree structure is visible in the sentences, exhibit many of the algebraic and logical properties of RL, whereas deterministic CFL have been thoroughly exploited in compiler construction and other application fields. After surveying and comparing the main properties of those various language families, we go back to operator precedence languages (OPL), an old family through which R. Floyd pioneered deterministic parsing, and we show that they offer unexpected properties in two fields so far investigated in totally independent ways: they enable parsing parallelization in a more effective way than traditional sequential parsers, and they exhibit the same algebraic and logical properties so far obtained only for less expressive language families.
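
    To make the operator precedence machinery concrete, here is a minimal sequential sketch for a toy arithmetic grammar: a hand-written table of Floyd precedence relations (less-than, equal, greater-than) between terminals drives a shift-reduce loop. Because the relations depend only on the grammar and not on parser state, substrings can in principle be parsed independently, which is the property the parallelization results build on; the sketch shows only the classic sequential scheme, and its table and names are assumptions for this toy grammar.

```python
# Floyd-style operator-precedence evaluation for E -> E+E | E*E | (E) | n.
LT, EQ, GT = '<', '=', '>'

REL = {  # (terminal on stack, lookahead) -> precedence relation
    ('+','+'):GT, ('+','*'):LT, ('+','n'):LT, ('+','('):LT, ('+',')'):GT, ('+','$'):GT,
    ('*','+'):GT, ('*','*'):GT, ('*','n'):LT, ('*','('):LT, ('*',')'):GT, ('*','$'):GT,
    ('n','+'):GT, ('n','*'):GT,               ('n',')'):GT, ('n','$'):GT,
    ('(','+'):LT, ('(','*'):LT, ('(','n'):LT, ('(','('):LT, ('(',')'):EQ,
    (')','+'):GT, (')','*'):GT,               (')',')'):GT, (')','$'):GT,
    ('$','+'):LT, ('$','*'):LT, ('$','n'):LT, ('$','('):LT,
}

def parse(tokens):
    """Shift-reduce evaluation driven only by REL; a missing table entry or
    an unknown handle shape signals a syntax error (KeyError).
    Stack items: ['T', terminal, value, relation] or ['V', value]."""
    toks = [('n', t) if isinstance(t, int) else (t, t) for t in tokens] + [('$', None)]
    stack, i = [['T', '$', None, None]], 0
    while True:
        term, val = toks[i]
        top = next(it for it in reversed(stack) if it[0] == 'T')
        if top[1] == '$' and term == '$':
            return stack[-1][1]                  # single remaining value
        if REL[(top[1], term)] in (LT, EQ):      # shift
            stack.append(['T', term, val, REL[(top[1], term)]])
            i += 1
        else:                                    # reduce the topmost handle
            handle = []
            while True:
                it = stack.pop()
                handle.append(it)
                if it[0] == 'T' and it[3] == LT:
                    break
            if stack[-1][0] == 'V':              # operand left of the handle
                handle.append(stack.pop())
            handle.reverse()
            shape = ''.join('V' if it[0] == 'V' else it[1] for it in handle)
            v = {'n':   lambda h: h[0][2],
                 'V+V': lambda h: h[0][1] + h[2][1],
                 'V*V': lambda h: h[0][1] * h[2][1],
                 '(V)': lambda h: h[1][1]}[shape](handle)
            stack.append(['V', v])

print(parse([2, '+', 3, '*', 4]))             # 14
print(parse(['(', 2, '+', 3, ')', '*', 4]))   # 20
```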

    Hierarchical Syntactic Models for Human Activity Recognition through Mobility Traces

    Recognizing users’ daily life activities without disrupting their lifestyle is a key functionality for enabling a broad variety of advanced services in a Smart City, from energy-efficient management of urban spaces to mobility optimization. In this paper, we propose a novel method for human activity recognition from a collection of outdoor mobility traces acquired through wearable devices. Our method exploits the regularities naturally present in human mobility patterns to construct syntactic models in the form of finite state automata, using an approach known as grammatical inference. We also introduce a measure of similarity that accounts for the intrinsic hierarchical nature of such models and allows us to identify the common traits in the paths induced by different activities at various granularity levels. Our method has been validated on a dataset of real traces representing movements of users in a large metropolitan area. The experimental results show the effectiveness of our similarity measure in correctly identifying a set of common coarse-grained activities, as well as their refinement at a finer level of granularity.
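
    The paper's hierarchical models and similarity measure are its own, but the underlying automaton-learning step can be illustrated generically. The sketch below encodes traces as sequences of place-type symbols, builds a prefix-tree acceptor, and merges states with identical k-tails (the classic Biermann–Feldman heuristic, used here as a stand-in for the paper's actual construction); the parameter k plays the role of a granularity knob.

```python
# Grammatical inference over symbolic traces: prefix-tree acceptor + k-tails.
def build_pta(traces):
    """Prefix-tree acceptor: one state per distinct trace prefix."""
    delta, final, fresh = {}, set(), 1
    for trace in traces:
        q = 0
        for sym in trace:
            if (q, sym) not in delta:
                delta[(q, sym)] = fresh
                fresh += 1
            q = delta[(q, sym)]
        final.add(q)
    return delta, final

def tails(q, delta, final, k):
    """All symbol strings of length <= k leading from q to an accepting state."""
    out = {()} if q in final else set()
    if k > 0:
        for (p, sym), r in delta.items():
            if p == q:
                out |= {(sym,) + t for t in tails(r, delta, final, k - 1)}
    return frozenset(out)

def k_tails(traces, k=3):
    """Quotient the PTA by 'same k-tails'; k controls how aggressively the
    automaton generalizes. (The quotient can be nondeterministic in general;
    this toy keeps the last transition seen per (state, symbol) pair.)"""
    delta, final = build_pta(traces)
    states = {0} | {q for q, _ in delta} | set(delta.values())
    sig = {q: tails(q, delta, final, k) for q in states}
    cls_of = {}
    cls = lambda q: cls_of.setdefault(sig[q], len(cls_of))
    merged = {(cls(q), sym): cls(r) for (q, sym), r in delta.items()}
    return merged, {cls(q) for q in final}, cls(0)

# Toy weekday traces over place types: h=home, t=transit, w=work, l=lunch.
traces = [t.split() for t in
          ['h t w t h', 'h t w l w t h', 'h t w l w l w t h']]
merged, final, start = k_tails(traces, k=3)
print('start:', start, 'final:', final)
for (q, sym), r in sorted(merged.items()):
    print(f'  {q} --{sym}--> {r}')
```

    On these traces the merged automaton folds the repeated lunch excursion into a loop, generalizing the three flat traces into "home, transit, work with zero or more lunch breaks, transit, home".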

    Symbolic and connectionist learning techniques for grammatical inference

    This thesis is structured in four parts, for a total of ten chapters. The first part, introduction and review (Chapters 1 to 4), presents an extensive state-of-the-art review of both symbolic and connectionist GI methods, which also provides most of the basic material needed to describe the contributions of the thesis. These contributions constitute the contents of the remaining parts (Chapters 5 to 10). The second part, contributions on symbolic and connectionist techniques for regular grammatical inference (Chapters 5 to 7), describes the contributions related to the theory and methods for regular GI, including lateral subjects such as the representation of finite-state machines (FSMs) in recurrent neural networks (RNNs). The third part of the thesis, augmented regular expressions and their inductive inference, comprises Chapters 8 and 9. Augmented regular expressions (AREs) are defined and proposed as a new representation for a subclass of context-sensitive languages that does not contain all the context-free languages but does cover a large class of languages capable of describing patterns with symmetries and other (context-sensitive) structures of interest in pattern recognition problems. The fourth part of the thesis comprises only Chapter 10: conclusions and future research. Chapter 10 summarizes the main results obtained and points out lines of further research, both to deepen the theoretical aspects raised and to facilitate the application of the developed GI tools to real-world problems in the area of computer vision.
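
    The flavor of AREs, as described, can be suggested with a one-function sketch: a plain regex skeleton is augmented with an arithmetic constraint over the lengths of its starred subexpressions, which suffices for context-sensitive patterns such as a^n b^n. This is only an informal illustration of the general idea, not the thesis's formal definition or its inference algorithms.

```python
# Regex skeleton plus a constraint over repetition counts: the constraint is
# what lifts the expressive power beyond regular languages.
import re

def matches_anbn(s: str) -> bool:
    """a^n b^n: skeleton (a*)(b*) plus the constraint |a-run| == |b-run|."""
    m = re.fullmatch(r'(a*)(b*)', s)
    return bool(m) and len(m.group(1)) == len(m.group(2))

assert matches_anbn('aaabbb')
assert not matches_anbn('aabbb')
```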

    Complexity Theory and the Operational Structure of Algebraic Programming Systems

    An algebraic programming system is a language built from a fixed algebraic data abstraction and a selection of deterministic and non-deterministic assignment and control constructs. First, we give a detailed analysis of the operational structure of an algebraic data type, designed to classify programming systems in terms of the complexity of their implementations. Second, we test our operational description by comparing the computations in deterministic and non-deterministic programming systems under certain space and time restrictions.