80 research outputs found

Stream Processing using Grammars and Regular Expressions

Author: Rasmussen Ulrik Terp
Publication date: 01/01/2016

In this dissertation we study regular-expression-based parsing and the use of grammatical specifications for the synthesis of fast, streaming string-processing programs. In the first part we develop two linear-time algorithms for regular-expression-based parsing with Perl-style greedy disambiguation. The first algorithm operates in two passes in a semi-streaming fashion, using a constant amount of working memory and an auxiliary tape storage which is written in the first pass and consumed by the second. The second algorithm is a single-pass, optimally streaming algorithm which outputs as much of the parse tree as is semantically possible based on the input prefix read so far, and resorts to buffering as many symbols as are required to resolve the next choice. Optimality is obtained by performing a PSPACE-complete pre-analysis on the regular expression. In the second part we present Kleenex, a language for expressing high-performance streaming string-processing programs as regular grammars with embedded semantic actions, and its compilation to streaming string transducers with worst-case linear-time performance. Its underlying theory is based on transducer decomposition into oracle and action machines, and a finite-state specialization of the streaming parsing algorithm presented in the first part. In the second part we also develop a new linear-time streaming parsing algorithm for parsing expression grammars (PEG), which generalizes the regular grammars of Kleenex. The algorithm is based on a bottom-up tabulation algorithm reformulated using least fixed points and evaluated using an instance of the chaotic iteration scheme by Cousot and Cousot.
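
To make "Perl-style greedy disambiguation" concrete, the following is a minimal sketch of a backtracking parser that returns the greedy parse tree: alternation prefers its left branch and iteration prefers one more round. It is not one of the dissertation's linear-time algorithms (naive backtracking can take exponential time), and the names `Lit`, `Seq`, `Alt`, `Star`, `parses`, and `greedy_parse` are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    ch: str          # a single literal character


@dataclass(frozen=True)
class Seq:
    left: object     # match left, then right
    right: object


@dataclass(frozen=True)
class Alt:
    left: object     # greedy: try left before right
    right: object


@dataclass(frozen=True)
class Star:
    inner: object    # greedy: try another iteration before stopping


def parses(e, s, i):
    """Yield (tree, end) for every way e matches s[i:end], best (greedy) first."""
    if isinstance(e, Lit):
        if i < len(s) and s[i] == e.ch:
            yield e.ch, i + 1
    elif isinstance(e, Seq):
        for lt, j in parses(e.left, s, i):
            for rt, k in parses(e.right, s, j):
                yield ("seq", lt, rt), k
    elif isinstance(e, Alt):
        for t, j in parses(e.left, s, i):
            yield ("left", t), j
        for t, j in parses(e.right, s, i):
            yield ("right", t), j
    elif isinstance(e, Star):
        for t, j in parses(e.inner, s, i):
            if j > i:  # require progress so the recursion terminates
                for rest, k in parses(e, s, j):
                    yield [t] + rest, k
        yield [], i    # lowest priority: stop iterating


def greedy_parse(e, s):
    """Return the highest-priority parse tree that consumes all of s, or None."""
    for tree, end in parses(e, s, 0):
        if end == len(s):
            return tree
    return None


# (a|ab)* on "ab": the left branch "a" is tried first, fails to cover the
# input, and the parser backtracks to the right branch "ab".
print(greedy_parse(Star(Alt(Lit("a"), Seq(Lit("a"), Lit("b")))), "ab"))
# → [('right', ('seq', 'a', 'b'))]
```

The tree records which branch each alternation took, which is the information a greedy parser must output; the abstract's streaming algorithms produce the same choices without backtracking.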

arXiv.org e-Print Archive

Copenhagen University Research Information System

Comparison of two recent algorithms for grammatical inference of regular languages using nondeterministic automata

Author: García Pedro, Ruiz José, Álvarez Gloria I.
Publication date: 09/06/2011

The development of new algorithms that are both convergent and efficient is a necessary step toward making productive use of grammatical inference in solving real, larger-scale problems. This work presents two algorithms, DeLeTe2 and MRIA, which implement grammatical inference by means of nondeterministic automata, in contrast to the most commonly used algorithms, which use deterministic automata. The advantages and disadvantages of this change in representation model are considered through a detailed description and comparison of the two inference algorithms with respect to the approach used in their implementation, their computational complexity, their termination criteria, and their performance on a body of synthetic data.

Publicaciones Académicas de la Universidad del Valle

Biblioteca Digital de la Universidad del Valle

Symbolic Model Learning: New Algorithms and Applications

Author: Argyros Georgios
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2019

In this thesis, we study algorithms that can be used to extract, or learn, formal mathematical models from software systems, and then use these models to test whether the given software systems satisfy certain security properties, such as robustness against code injection attacks. Specifically, we focus on learning algorithms for automata and transducers and on the symbolic extensions of these models, namely symbolic finite automata (SFAs). At a high level, this thesis contributes the following results:
1. In the first part of the thesis, we present a unified treatment of many common variations of the seminal L* algorithm for learning deterministic finite automata (DFAs) as a congruence learning algorithm for the underlying Nerode congruence, which forms the basis of automata theory. Under this formulation, the basic data structures used by different variations are unified as different ways to implement the Nerode congruence using queries.
2. Next, building on the new formulation of L*-style algorithms, we develop new algorithms for learning transducer models. First, we present the first algorithm for learning deterministic partial transducers. Furthermore, we extend our algorithm to non-deterministic models by introducing a novel, generalized congruence relation over string transformations which is able to capture a subclass of string transformations with regular lookahead. We demonstrate that this class is able to capture many practical string transformations from the domain of string sanitizers in Web applications.
3. Classical learning algorithms for automata and transducers operate over finite alphabets and have a query complexity that scales linearly with the size of the alphabet. In practice, this dependence on the alphabet size hinders the performance of the algorithms. To address this issue, we develop the MAT* algorithm for learning symbolic finite state automata (SFAs), which operate over infinite alphabets. In practice, the MAT* learning algorithm allows us to plug in custom transition learning algorithms which efficiently infer the predicates in the transitions of the SFA without querying the whole alphabet set.
4. Finally, we use our learning algorithm toolbox as the basis for the development of a set of black-box testing algorithms. More specifically, we present Grammar Oriented Filter Auditing (GOFA), a novel technique which allows one to utilize our learning algorithms to evaluate the robustness of a string sanitizer or filter against a set of attack strings given as a context-free grammar. Furthermore, because such grammars are often unavailable, we developed sfadiff, a differential testing technique based on symbolic automata learning which can be used to perform differential testing of two different parser implementations using SFA learning algorithms, and we demonstrate how our algorithm can be used to develop program fingerprints. We evaluate our algorithms against state-of-the-art Web Application Firewalls and discover over 15 previously unknown vulnerabilities which result in evading the firewalls and performing code injection attacks in the backend Web application. Finally, we show how our learning algorithms can uncover vulnerabilities which are missed by other black-box methods such as fuzzing and grammar-based testing.
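
As a rough illustration of what a symbolic finite automaton is, the sketch below labels transitions with predicates over characters instead of individual symbols, so one transition can cover an unbounded alphabet such as all Unicode letters. It only shows how such an automaton is evaluated; it is not the MAT* learning algorithm, and the class name `SFA`, its fields, and the identifier example are assumptions made for this sketch.

```python
from typing import Callable, Dict, List, Set, Tuple

Predicate = Callable[[str], bool]


class SFA:
    """A deterministic symbolic finite automaton (illustrative sketch only).

    transitions maps a state to an ordered list of (predicate, target) pairs.
    The first predicate that accepts the current character is taken, so the
    predicates leaving a state are assumed to be mutually exclusive.
    """

    def __init__(self, initial: int, accepting: Set[int],
                 transitions: Dict[int, List[Tuple[Predicate, int]]]):
        self.initial = initial
        self.accepting = accepting
        self.transitions = transitions

    def accepts(self, word: str) -> bool:
        state = self.initial
        for ch in word:
            for pred, target in self.transitions.get(state, []):
                if pred(ch):
                    state = target
                    break
            else:
                return False  # no predicate matched: implicit rejecting sink
        return state in self.accepting


# Hypothetical example: a letter followed by any number of letters or digits.
# Two predicate-labelled transitions replace one transition per character.
identifier = SFA(
    initial=0,
    accepting={1},
    transitions={
        0: [(str.isalpha, 1)],
        1: [(str.isalnum, 1)],
    },
)

print(identifier.accepts("x1"))  # → True
print(identifier.accepts("1x"))  # → False
```

A learner for such a model has to infer the predicates themselves, which is why the abstract emphasizes avoiding queries over the whole alphabet.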

Columbia University Academic Commons

Mining Multiple Web Sources Using Non-Deterministic Finite State Automata

Author: Harun-Or-Rashid Mohammad
Publication venue: 'University of Windsor Leddy Library'
Publication date: 01/01/2012

Existing web content extraction systems use unsupervised, supervised, and semi-supervised approaches. The WebOMiner system is an automatic web content data extraction system that models a specific Business to Customer (B2C) web site, such as bestbuy.com, using an object-oriented database schema. The WebOMiner system extracts different web page content types, such as product, list, and text, using a manually generated non-deterministic finite automaton (NFA). This thesis extends the automatic web content data extraction techniques proposed in the WebOMiner system to handle multiple web sites and to generate an integrated data warehouse automatically. We develop WebOMiner-2, which generates NFAs of specific domain classes from regular expressions extracted from web page DOM trees' frequent patterns. Our algorithm can also handle NFA epsilon (ε) transitions and convert the NFA to a deterministic finite automaton (DFA) to identify different content tuples from a list of tuples. Experimental results show that our system is highly effective and performs the content extraction task with 100% precision and 98.35% recall.
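
The conversion of an NFA with epsilon transitions to a DFA mentioned above is commonly done with the textbook subset construction. The sketch below shows that standard construction, not WebOMiner-2's own code; the function names, the use of the empty string as the epsilon label, and the tiny `a*b` automaton are assumptions made for illustration.

```python
from collections import deque

EPS = ""  # label used for epsilon transitions in this sketch (an assumption)


def epsilon_closure(states, delta):
    """Return all states reachable from `states` using only epsilon moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, EPS), ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)


def nfa_to_dfa(start, accepting, delta, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    d_start = epsilon_closure({start}, delta)
    d_delta, d_accepting = {}, set()
    seen, queue = {d_start}, deque([d_start])
    while queue:
        subset = queue.popleft()
        if subset & accepting:
            d_accepting.add(subset)
        for a in alphabet:
            moved = set()
            for q in subset:
                moved |= set(delta.get((q, a), ()))
            if not moved:
                continue  # missing entry means an implicit dead state
            target = epsilon_closure(moved, delta)
            d_delta[(subset, a)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return d_start, d_accepting, d_delta


def dfa_accepts(dfa, word):
    state, accepting, delta = dfa
    for a in word:
        state = delta.get((state, a))
        if state is None:
            return False
    return state in accepting


# Hypothetical NFA for a*b: state 0 loops on "a", then an epsilon move to
# state 1, which reads "b" and reaches the accepting state 2.
nfa_delta = {(0, "a"): {0}, (0, EPS): {1}, (1, "b"): {2}}
dfa = nfa_to_dfa(0, {2}, nfa_delta, alphabet="ab")

print(dfa_accepts(dfa, "aab"))  # → True
print(dfa_accepts(dfa, "ba"))   # → False
```

The resulting DFA can be run over a token sequence in a single pass, which is the property that makes it useful for recognizing tuples in a list.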

CiteSeerX

Scholarship at UWindsor

QUALITY IMPROVEMENT AND VALIDATION TECHNIQUES ON SOFTWARE SPECIFICATION AND DESIGN

Author: LIU SHUANG
Publication date: 24/03/2015

Ph.D. (Doctor of Philosophy)

ScholarBank@NUS

Methods for Structural Pattern Recognition: Complexity and Applications

Author: Průša Daniel
Publication date: 01/01/2018

Department of Cybernetics

Digital Library of the Czech Technical University in Prague

Improving Programming Support for Hardware Accelerators Through Automata Processing Abstractions

Author: Angstadt Kevin

The adoption of hardware accelerators, such as Field-Programmable Gate Arrays, into general-purpose computation pipelines continues to rise, driven by recent trends in data collection and analysis as well as pressure from challenging physical design constraints in hardware. The architectural designs of many of these accelerators stand in stark contrast to the traditional von Neumann model of CPUs. Consequently, existing programming languages, maintenance tools, and techniques are not directly applicable to these devices, meaning that additional architectural knowledge is required for effective programming and configuration. Current programming models and techniques are akin to assembly-level programming on a CPU, thus placing significant burden on developers tasked with using these architectures. Because programming is currently performed at such low levels of abstraction, the software development process is tedious and challenging and hinders the adoption of hardware accelerators. This dissertation explores the thesis that theoretical finite automata provide a suitable abstraction for bridging the gap between high-level programming models and maintenance tools familiar to developers and the low-level hardware representations that enable high-performance execution on hardware accelerators. We adopt a principled hardware/software co-design methodology to develop a programming model providing the key properties that we observe are necessary for success, namely performance and scalability, ease of use, expressive power, and legacy support. First, we develop a framework that allows developers to port existing, legacy code to run on hardware accelerators by leveraging automata learning algorithms in a novel composition with software verification, string solvers, and high-performance automata architectures. 
Next, we design a domain-specific programming language to aid programmers writing pattern-searching algorithms and develop compilation algorithms to produce finite automata, which support efficient execution on a wide variety of processing architectures. Then, we develop an interactive debugger for our new language, which allows developers to accurately identify the locations of bugs in software while maintaining support for high-throughput data processing. Finally, we develop two new automata-derived accelerator architectures to support additional applications, including the detection of security attacks and the parsing of recursive and tree-structured data. Using empirical studies, logical reasoning, and statistical analyses, we demonstrate that our prototype artifacts scale to real-world applications, maintain manageable overheads, and support developers' use of hardware accelerators. Collectively, the research efforts detailed in this dissertation help ease the adoption and use of hardware accelerators for data analysis applications, while supporting high-performance computation.

PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/155224/1/angstadt_1.pd

Deep Blue Documents at the University of Michigan

Optimization and Parallelization of RegEx Based Information Extraction

Author: Doleschal Johannes
Publication date: 01/01/2021

EPub Bayreuth

Formal synthesis of control and communication schemes

Author: Chen Yushan
Publication venue: Boston University
Publication date: 01/01/2013

Thesis (Ph.D.)--Boston University

In traditional motion planning, the problem is simply specified as "go from A to B while avoiding obstacles", where A and B are two configurations or regions of interest in the robot workspace. However, a large number of robotic applications require more expressive specification languages, which allow for logical and temporal statements about the satisfaction of properties of interest. Examples include "visit A and B infinitely often, always avoid C, and do not visit D unless E was visited before". Such task specifications cannot be trivially converted to a sequence of "go from A to B" primitives. This thesis establishes theoretical and computational frameworks for automatic synthesis of robot control and communication schemes that are correct-by-construction from task specifications given in expressive languages. We consider a purely discrete scenario, in which the dynamics of each robot is modeled as a finite discrete system. The first problem addressed in this thesis is the generation of provably-correct individual control and communication strategies for a team of robots from rich task specifications in the case when the workspace is static. The second problem relaxes this assumption and considers a scenario in which the environment changes according to some unknown patterns. We propose a combined learning and formal synthesis approach to generate correct control policies. To tackle the first problem, we draw inspiration from the research fields of formal verification and synthesis, distributed formal synthesis, and concurrency theory. We consider a team of robots that can move among the regions of a partitioned environment and have known capabilities of servicing a set of requests that can occur in the regions of the partition. Some of these requests can be serviced by a robot individually, while some require the cooperation of groups of robots.
We propose a top-down approach, in which global specifications given as Regular Expressions (RE) or Linear Temporal Logic (LTL) formulas can be decomposed into local (individual) specifications, which can then be used to automatically synthesize robot control and communication strategies. To address the second problem, we bring together automata learning methods from the field of theoretical linguistics and techniques from temporal logic games and probabilistic model checking, to develop a provably-correct control strategy for robots moving in an environment with unknown dynamics. The robots are required to achieve a surveillance mission, in which a certain request needs to be serviced repeatedly, while the expected time between consecutive services is minimized and additional temporal logic constraints are satisfied. We define a fragment of Linear Temporal Logic (LTL) to describe such a mission. We consider a single-agent case at first and then extend the results to multi-agent systems. To this end, we apply approximate dynamic programming to our computational framework, which leads to a significant reduction of computational time. To demonstrate the proposed theoretical and computational frameworks, we implement the derived algorithms on two experimental platforms, the Robotic Urban-Like Environment (RULE) and the Robotic InDoor-like Environment (RIDE). We assign tasks to the team using Regular Expressions or Linear Temporal Logic formulas over requests occurring at regions in the environment. The robots are automatically deployed to complete the missions.
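
As a small illustration of a task given as a regular expression over requests, the sketch below checks whether a finite sequence of serviced requests satisfies a specification. It only checks a finished trace and does not perform the decomposition or synthesis described in the abstract; the single-letter encoding of requests, the function name `satisfies`, and the example specification are assumptions made for this sketch.

```python
import re


def satisfies(spec: str, trace: list) -> bool:
    """Check a finite trace of single-letter requests against a regex spec."""
    return re.fullmatch(spec, "".join(trace)) is not None


# Hypothetical task: service request A, later service request B, and never
# service request C at any point.
spec = r"[^C]*A[^C]*B[^C]*"

print(satisfies(spec, ["D", "A", "D", "B"]))  # → True
print(satisfies(spec, ["A", "C", "B"]))       # → False
print(satisfies(spec, ["B", "A"]))            # → False
```

Specifications over infinite behaviors, such as "visit A and B infinitely often", cannot be expressed this way and are the reason the thesis also uses Linear Temporal Logic.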

Boston University Institutional Repository (OpenBU)

Quantitative Verification and Synthesis of Resilient Networks

Author: Schou Morten Konggaard
Publication venue: Aalborg Universitetsforlag
Publication date: 01/01/2023

VBN