FlashProfile: A Framework for Synthesizing Data Profiles
We address the problem of learning a syntactic profile for a collection of
strings, i.e. a set of regex-like patterns that succinctly describe the
syntactic variations in the strings. Real-world datasets, typically curated
from multiple sources, often contain data in various syntactic formats. Thus,
any data processing task is preceded by the critical step of data format
identification. However, manual inspection of data to identify the different
formats is infeasible in standard big-data scenarios.
Prior techniques are restricted to a small set of pre-defined patterns (e.g.
digits, letters, words, etc.), and provide no control over granularity of
profiles. We define syntactic profiling as a problem of clustering strings
based on syntactic similarity, followed by identifying patterns that succinctly
describe each cluster. We present a technique for synthesizing such profiles
over a given language of patterns, that also allows for interactive refinement
by requesting a desired number of clusters.
Using a state-of-the-art inductive synthesis framework, PROSE, we have
implemented our technique as FlashProfile. Across tasks over large
real datasets, we observe a median profiling time of only s.
Furthermore, we show that access to syntactic profiles may allow for more
accurate synthesis of programs, i.e. using fewer examples, in
programming-by-example (PBE) workflows such as FlashFill.
Comment: 28 pages, SPLASH (OOPSLA) 201
Inducing Probabilistic Grammars by Bayesian Model Merging
We describe a framework for inducing probabilistic grammars from corpora of
positive samples. First, samples are incorporated by adding ad-hoc rules
to a working grammar; subsequently, elements of the model (such as states or
nonterminals) are merged to achieve generalization and a more compact
representation. The choice of what to merge and when to stop is governed by the
Bayesian posterior probability of the grammar given the data, which formalizes
a trade-off between a close fit to the data and a default preference for
simpler models ("Occam's Razor"). The general scheme is illustrated using three
types of probabilistic grammars: hidden Markov models, class-based n-grams,
and stochastic context-free grammars.
Comment: To appear in Grammatical Inference and Applications, Second
International Colloquium on Grammatical Inference; Springer Verlag, 1994. 13
page
Analyzing and Interpreting Neural Networks for NLP: A Report on the First BlackboxNLP Workshop
The EMNLP 2018 workshop BlackboxNLP was dedicated to resources and techniques
specifically developed for analyzing and understanding the inner workings and
representations acquired by neural models of language. Approaches included:
systematic manipulation of input to neural networks and investigating the
impact on their performance, testing whether interpretable knowledge can be
decoded from intermediate representations acquired by neural networks,
proposing modifications to neural network architectures to make their knowledge
state or generated output more explainable, and examining the performance of
networks on simplified or formal languages. Here we review a number of
representative studies in each category
TRX: A Formally Verified Parser Interpreter
Parsing is an important problem in computer science and yet surprisingly
little attention has been devoted to its formal verification. In this paper, we
present TRX: a parser interpreter formally developed in the proof assistant
Coq, capable of producing formally correct parsers. We use parsing
expression grammars (PEGs), a formalism essentially representing recursive
descent parsing, which we consider an attractive alternative to context-free
grammars (CFGs). From this formalization we can extract a parser for an
arbitrary PEG grammar with the guarantee of total correctness, i.e., the
resulting parser is terminating and correct with respect to its grammar and the
semantics of PEGs; both properties are formally proven in Coq.
Comment: 26 pages, LMC
Generalizing input-driven languages: theoretical and practical benefits
Regular languages (RL) are the simplest family in Chomsky's hierarchy. Thanks
to their simplicity they enjoy various nice algebraic and logic properties that
have been successfully exploited in many application fields. Practically all of
their related problems are decidable, so that they support automatic
verification algorithms. Also, they can be recognized in real-time.
Context-free languages (CFL) are another major family well-suited to
formalize programming, natural, and many other classes of languages; their
increased generative power w.r.t. RL, however, causes the loss of several
closure properties and of the decidability of important problems; furthermore
they need complex parsing algorithms. Thus, various subclasses thereof have
been defined with different goals, spanning from efficient, deterministic
parsing to closure properties, logic characterization and automatic
verification techniques.
Among CFL subclasses, so-called structured ones, i.e., those where the
typical tree-structure is visible in the sentences, exhibit many of the
algebraic and logic properties of RL, whereas deterministic CFL have been
thoroughly exploited in compiler construction and other application fields.
After surveying and comparing the main properties of those various language
families, we go back to operator precedence languages (OPL), an old family
through which R. Floyd pioneered deterministic parsing, and we show that they
offer unexpected properties in two fields so far investigated in totally
independent ways: they enable parsing parallelization in a more effective way
than traditional sequential parsers, and exhibit the same algebraic and logic
properties so far obtained only for less expressive language families
Hierarchical Syntactic Models for Human Activity Recognition through Mobility Traces
Recognizing users’ daily life activities without disrupting their lifestyle is a key functionality to enable a broad variety of advanced services for a Smart City, from energy-efficient management of urban spaces to mobility optimization. In this paper, we propose a novel method for human activity recognition from a collection of outdoor mobility traces acquired through wearable devices. Our method exploits the regularities naturally present in human mobility patterns to construct syntactic models in the form of finite state automata, using an approach known as grammatical inference. We also introduce a measure of similarity that accounts for the intrinsic hierarchical nature of such models and makes it possible to identify the common traits in the paths induced by different activities at various granularity levels. Our method has been validated on a dataset of real traces representing movements of users in a large metropolitan area. The experimental results show the effectiveness of our similarity measure in correctly identifying a set of common coarse-grained activities, as well as their refinement at a finer level of granularity
Symbolic and connectionist learning techniques for grammatical inference
This thesis is structured in four parts for a total of ten chapters. The first part, introduction and review (Chapters 1 to 4), presents an extensive state-of-the-art review of both symbolic and connectionist GI methods, which also serves to state most of the basic material needed later to describe the contributions of the thesis. These contributions constitute the contents of the remaining parts (Chapters 5 to 10). The second part, contributions on symbolic and connectionist techniques for regular grammatical inference (Chapters 5 to 7), describes the contributions related to the theory and methods for regular GI, which include other lateral subjects such as the representation of finite-state machines (FSMs) in recurrent neural networks (RNNs). The third part of the thesis, augmented regular expressions and their inductive inference, comprises Chapters 8 and 9. The augmented regular expressions (or AREs) are defined and proposed as a new representation for a subclass of CSLs that does not contain all the context-free languages but a large class of languages capable of describing patterns with symmetries and other (context-sensitive) structures of interest in pattern recognition problems. The fourth part of the thesis consists only of Chapter 10: conclusions and future research. Chapter 10 summarizes the main results obtained and points out the lines of further research that should be followed both to explore in more depth some of the theoretical aspects raised and to facilitate the application of the developed GI tools to real-world problems in the area of computer vision
Complexity Theory and the Operational Structure of Algebraic Programming Systems
An algebraic programming system is a language built from a fixed algebraic data abstraction and a selection of deterministic, and non-deterministic, assignment and control constructs. First, we give a detailed analysis of the operational structure of an algebraic data type, one which is designed to classify programming systems in terms of the complexity of their implementations. Secondly, we test our operational description by comparing the computations in deterministic and non-deterministic programming systems under certain space and time restrictions