Regular Expressions in a CS Formal Languages Course
Regular expressions in an Automata Theory and Formal Languages course are
mostly treated as a theoretical topic. That is, their mathematical properties
and their role in describing languages are discussed to some degree. This
approach fails to capture the interest of most Computer Science students. It is
a missed opportunity to engage Computer Science students, who are far more
motivated by practical applications of theory. To this end, regular expressions
may be discussed as the description of an easily programmed algorithm that
generates words in a language. This article describes a programming-based
methodology to introduce students to regular expressions in an Automata Theory
and Formal Languages course. The language of instruction is FSM, which has
a regular expression type, thus facilitating the study of regular
expressions and of algorithms based on regular expressions. Comment: In Proceedings TFPIE 2023, arXiv:2308.0611
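The idea of reading a regular expression as an algorithm that generates words can be sketched as follows. This is a minimal illustration in Python, not the FSM language used in the article; the class names (`Empty`, `Sym`, `Concat`, `Union`, `Star`), the function `gen_word`, and the `max_reps` bound are all assumptions made for the example.

```python
import random
from dataclasses import dataclass

# Hypothetical regular-expression AST. These names are illustrative only
# and are not FSM's regular expression type.
@dataclass(frozen=True)
class Empty:
    """Matches only the empty word."""

@dataclass(frozen=True)
class Sym:
    char: str

@dataclass(frozen=True)
class Concat:
    left: object
    right: object

@dataclass(frozen=True)
class Union:
    left: object
    right: object

@dataclass(frozen=True)
class Star:
    inner: object

def gen_word(regex, rng, max_reps=3):
    """Return one randomly chosen word from the language of regex.

    max_reps bounds the number of Kleene-star repetitions so that
    generation always terminates (an assumption of this sketch).
    """
    if isinstance(regex, Empty):
        return ""
    if isinstance(regex, Sym):
        return regex.char
    if isinstance(regex, Concat):
        # Generate a word for each side and join them.
        return gen_word(regex.left, rng, max_reps) + gen_word(regex.right, rng, max_reps)
    if isinstance(regex, Union):
        # Pick one branch at random.
        return gen_word(rng.choice([regex.left, regex.right]), rng, max_reps)
    if isinstance(regex, Star):
        # Repeat the inner expression between 0 and max_reps times.
        reps = rng.randint(0, max_reps)
        return "".join(gen_word(regex.inner, rng, max_reps) for _ in range(reps))
    raise TypeError(f"not a regular expression: {regex!r}")

# (a|b)*c
AB_STAR_C = Concat(Star(Union(Sym("a"), Sym("b"))), Sym("c"))
```

Calling `gen_word(AB_STAR_C, random.Random())` yields a word such as a run of `a` and `b` symbols followed by `c`. Each constructor maps to one generation rule, which is what makes the regular expression readable as a program.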
On the Expressive Power of Regular Expressions with Backreferences
A rewb is a regular expression extended with a feature called backreference.
It is broadly known that backreference is a practical extension of regular
expressions, and is supported by most modern regular expression engines, such
as those in the standard libraries of Java, Python, and more. Meanwhile,
indexed languages are the languages generated by indexed grammars, a formal
grammar class proposed by A. V. Aho. We show that the expressive powers of
these two models are related in the following way: every language described by a rewb is
an indexed language. As the smallest formal grammar class previously known to
contain rewbs is the class of context-sensitive languages, our result strictly
improves the known upper bound. Moreover, we prove the following two claims:
there exists a rewb whose language does not belong to the class of stack
languages, which is a proper subclass of indexed languages, and the language
described by a rewb without a captured reference is in the class of nonerasing
stack languages, which is a proper subclass of stack languages. Finally, we
show that the hierarchy investigated in a prior study, which separates the
expressive power of rewbs by the notion of nested levels, is within the class
of nonerasing stack languages. Comment: 20 pages, the full version of the paper to appear in MFCS 202
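As a small illustration of what a backreference adds, the sketch below uses Python's standard `re` module, one of the engines the abstract mentions. The pattern and the helper name `is_square` are chosen for this example and do not come from the paper.

```python
import re

# \1 must repeat exactly the text captured by group 1, so this pattern
# describes { ww : w is a non-empty word over {a, b} }. That language is
# not regular (and not context-free), yet one backreference expresses it.
SQUARE = re.compile(r"([ab]+)\1")

def is_square(word):
    """Return True if word consists of some non-empty w repeated twice."""
    return SQUARE.fullmatch(word) is not None
```

For instance, `is_square("abab")` holds because the captured `ab` is repeated, while `is_square("abba")` does not.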
A Grammatical Inference Approach to Language-Based Anomaly Detection in XML
False positives are a problem in anomaly-based intrusion detection systems.
To counter this issue, we discuss anomaly detection for the eXtensible Markup
Language (XML) from a language-theoretic view. We argue that many XML-based
attacks target the syntactic level, i.e. the tree structure or element content,
and syntax validation of XML documents reduces the attack surface. XML offers
so-called schemas for validation, but in the real world, schemas are often
unavailable, ignored or too general. In this work-in-progress paper we describe
a grammatical inference approach to learn an automaton from example XML
documents for detecting documents with anomalous syntax.
We discuss properties and expressiveness of XML to understand limits of
learnability. Our contributions are an XML Schema compatible lexical datatype
system to abstract content in XML and an algorithm to learn visibly pushdown
automata (VPA) directly from a set of examples. The proposed algorithm does not
require the tree representation of XML, so it can process large documents or
streams. The resulting deterministic VPA then allows stream validation of
documents to recognize deviations in the underlying tree structure or
datatypes. Comment: Paper accepted at First Int. Workshop on Emerging Cyberthreats and
Countermeasures ECTCM 201
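The stream validation described above can be pictured with a hand-written sketch. It is not the learned VPA from the paper and does not model its datatype system; it only shows the visibly pushdown discipline, where an opening tag always pushes and a closing tag always pops, so a document can be checked in one pass without building a tree. The event format and the `allowed_children` table are assumptions made for the example.

```python
def validate_stream(events, allowed_children):
    """Check a stream of ("open", tag) / ("close", tag) events in one pass.

    allowed_children maps a parent tag (None for the top level) to the set
    of tags that may open directly inside it. Both the event format and
    this table are hypothetical.
    """
    stack = []
    for kind, tag in events:
        if kind == "open":
            # Call symbol: check it against the current parent, then push.
            parent = stack[-1] if stack else None
            if tag not in allowed_children.get(parent, set()):
                return False
            stack.append(tag)
        elif kind == "close":
            # Return symbol: pop and require a matching opening tag.
            if not stack or stack.pop() != tag:
                return False
        else:
            return False
    # Every opened element must have been closed.
    return not stack
```

With rules such as `{None: {"order"}, "order": {"item"}, "item": set()}`, a stream that opens an unexpected element inside `order` is rejected as soon as that event arrives, which is the kind of structural deviation the abstract refers to.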
Broadening the Scope of Nanopublications
In this paper, we present an approach for extending the existing concept of
nanopublications (tiny entities of scientific results in RDF representation)
to broaden their application range. The proposed extension uses English
sentences to represent informal and underspecified scientific claims. These
sentences follow a syntactic and semantic scheme that we call AIDA (Atomic,
Independent, Declarative, Absolute), which provides a uniform and succinct
representation of scientific assertions. Such AIDA nanopublications are
compatible with the existing nanopublication concept and enjoy most of its
advantages such as information sharing, interlinking of scientific findings,
and detailed attribution, while being more flexible and applicable to a much
wider range of scientific results. We show that users are able to create AIDA
sentences for given scientific results quickly and at high quality, and that it
is feasible to automatically extract and interlink AIDA nanopublications from
existing unstructured data sources. To demonstrate our approach, a web-based
interface is introduced, which also exemplifies the use of nanopublications for
non-scientific content, including meta-nanopublications that describe other
nanopublications. Comment: To appear in the Proceedings of the 10th Extended Semantic Web
Conference (ESWC 2013)
On the Expressiveness of Languages for Complex Event Recognition
Complex Event Recognition (CER for short) has recently gained attention as a mechanism for detecting patterns in streams of continuously arriving event data. Numerous CER systems and languages have been proposed in the literature, commonly based on combining operations from regular expressions (sequencing, iteration, and disjunction) and relational algebra (e.g., joins and filters). While these languages are naturally first-order, meaning that variables can only bind single elements, they also provide capabilities for filtering sets of events that occur inside iterative patterns, for example requiring sequences of numbers to be increasing. Unfortunately, these types of filters usually present ad-hoc syntax and under-defined semantics, precisely because variables cannot bind sets of events. As a result, CER languages that provide filtering of sequences commonly lack rigorous semantics and their expressive power is not understood.
In this paper we embark on two tasks: first, to define a denotational semantics for CER that naturally allows binding and filtering sets of events; and second, to compare the expressive power of this semantics with that of CER languages that only allow for binding single events. Concretely, we introduce Set-Oriented Complex Event Logic (SO-CEL for short), a variation of the CER language introduced in [Grez et al., 2019] in which all variables bind to sets of matched events. We then compare SO-CEL with CEL, the CER language of [Grez et al., 2019] where variables bind single events. We show that they are equivalent in expressive power when restricted to unary predicates but, surprisingly, incomparable in general. Nevertheless, we show that if we restrict to sets of binary predicates, then SO-CEL is strictly more expressive than CEL. To get a better understanding of the expressive power, computational capabilities, and limitations of SO-CEL, we also investigate the relationship between SO-CEL and Complex Event Automata (CEA), a natural computational model for CER languages. We define a property on CEA called the *-property and show that, under unary predicates, SO-CEL captures precisely the subclass of CEA that satisfy this property. Finally, we identify the operations that SO-CEL is lacking to characterize CEA and introduce a natural extension of the language that captures the complete class of CEA under unary predicates.
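The contrast between filtering single events and filtering a set of matched events can be made concrete with a small sketch. It is plain Python over lists of numbers, not SO-CEL or CEL syntax or semantics, and the function names are hypothetical.

```python
def unary_filter(events, pred):
    """Keep the events that individually satisfy pred.

    Each event is judged on its own, as when a variable binds one event.
    """
    return [e for e in events if pred(e)]

def is_increasing(match):
    """Predicate over a whole matched sequence of numbers.

    The answer depends on how the events relate to each other, so it
    cannot be decided by looking at any single event in isolation.
    """
    return all(a < b for a, b in zip(match, match[1:]))
```

Here `unary_filter([1, -2, 3], lambda e: e > 0)` keeps `1` and `3` independently of each other, whereas `is_increasing([1, 3, 2])` fails only because of how `3` and `2` relate, which is the "increasing sequence" requirement mentioned in the abstract.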