Which Regular Expression Patterns are Hard to Match?
Regular expressions constitute a fundamental notion in formal language theory
and are frequently used in computer science to define search patterns. A
classic algorithm for these problems constructs and simulates a
non-deterministic finite automaton corresponding to the expression, resulting
in an $O(mn)$ running time (where $m$ is the length of the pattern and $n$ is
the length of the text). This running time can be improved slightly (by a
polylogarithmic factor), but no significantly faster solutions are known. At
the same time, much faster algorithms exist for various special cases of
regular expressions, including dictionary matching, wildcard matching, subset
matching, and the word break problem.
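The automaton simulation described above can be sketched roughly as follows. This is a minimal illustration, not the paper's construction: the NFA is written by hand for the hypothetical pattern `ab*`, and the names `epsilon_closure`, `nfa_accepts`, `EXAMPLE_DELTA` and `EXAMPLE_EPS` are assumptions made for this example. Each text character triggers one pass over the current state set, which is where the product of pattern length and text length in the running time comes from.

```python
def epsilon_closure(states, eps):
    """Return all states reachable from `states` using only epsilon moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def nfa_accepts(text, start, accept, delta, eps):
    """Membership test: does the NFA accept the whole of `text`?"""
    current = epsilon_closure({start}, eps)
    for ch in text:
        # Follow every transition labelled `ch` from every active state.
        moved = {t for s in current for t in delta.get((s, ch), ())}
        current = epsilon_closure(moved, eps)
        if not current:
            return False  # no active states left, so no run can accept
    return accept in current


# Hand-built NFA for the hypothetical pattern ab* (start state 0, accept state 4).
EXAMPLE_DELTA = {(0, "a"): {1}, (2, "b"): {3}}
EXAMPLE_EPS = {1: {2, 4}, 3: {2, 4}}

if __name__ == "__main__":
    for candidate in ["a", "abb", "ba", ""]:
        print(repr(candidate), nfa_accepts(candidate, 0, 4, EXAMPLE_DELTA, EXAMPLE_EPS))
```

The state set is kept as a Python `set` for readability; a bit-vector representation is the usual choice when the per-character cost matters.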
In this paper, we show that the complexity of regular expression matching can
be characterized based on its depth (when interpreted as a formula). Our
results hold for expressions involving concatenation, OR, Kleene star and
Kleene plus. For regular expressions of depth two (involving any combination of
the above operators), we show the following dichotomy: matching and membership
testing can be solved in near-linear time, except for "concatenations of
stars", which cannot be solved in strongly sub-quadratic time assuming the
Strong Exponential Time Hypothesis (SETH). For regular expressions of depth
three the picture is more complex. Nevertheless, we show that all problems can
either be solved in strongly sub-quadratic time, or cannot be solved in
strongly sub-quadratic time assuming SETH.
An intriguing special case of membership testing involves regular expressions
of the form "a star of an OR of concatenations", e.g., $[a|ab|bc]^*$. This
corresponds to the so-called word break problem, for which a dynamic
programming algorithm with a runtime of (roughly) $O(n\sqrt{m})$ is known. We
show that the latter bound is not tight and improve the runtime to
$O(nm^{0.44\ldots})$.
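For orientation, the textbook dynamic program for the word break problem can be sketched as below. This is the simple baseline only, not the improved algorithm announced in the abstract; the function name `word_break` and the dictionary `{a, ab, bc}` used in the usage lines are assumptions chosen for illustration. The table entry `reachable[i]` records whether the first `i` characters of the text can be split into dictionary words.

```python
def word_break(text, words):
    """Return True if `text` is a concatenation of zero or more words from `words`."""
    word_set = {w for w in words if w}          # ignore empty words
    lengths = sorted({len(w) for w in word_set})  # distinct word lengths only
    reachable = [False] * (len(text) + 1)
    reachable[0] = True                          # the empty prefix is always splittable
    for end in range(1, len(text) + 1):
        for k in lengths:
            if k > end:
                break                            # longer words cannot end at this position
            if reachable[end - k] and text[end - k:end] in word_set:
                reachable[end] = True
                break
    return reachable[-1]


if __name__ == "__main__":
    dictionary = ["a", "ab", "bc"]               # hypothetical dictionary
    print(word_break("abcab", dictionary))
    print(word_break("abcb", dictionary))
```

Looping over distinct word lengths rather than over every word keeps the inner loop short when many words share a length.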
Processing SPARQL queries with regular expressions in RDF databases
Background: As the Resource Description Framework (RDF) data model is widely used for modeling and sharing many online bioinformatics resources such as Uniprot (dev.isb-sib.ch/projects/uniprot-rdf) or Bio2RDF (bio2rdf.org), SPARQL, a W3C-recommended query language for RDF databases, has become an important language for querying bioinformatics knowledge bases. Moreover, due to the diversity of users' requests for extracting information from RDF data, as well as users' lack of knowledge about the exact value of each fact in the RDF databases, it is desirable to use SPARQL queries with regular expression patterns for querying the RDF data. To the best of our knowledge, there is currently no work that efficiently supports regular expression processing in SPARQL over RDF databases. Most of the existing techniques for processing regular expressions are designed for querying a text corpus, or only for supporting matching over the paths in an RDF graph.
Results: In this paper, we propose a novel framework for supporting regular expression processing in SPARQL queries. Our contributions can be summarized as follows. 1) We propose an efficient framework for processing SPARQL queries with regular expression patterns in RDF databases. 2) We propose a cost model in order to adapt the proposed framework to existing query optimizers. 3) We build a prototype of the proposed framework in C++ and conduct extensive experiments demonstrating the efficiency and effectiveness of our technique.
Conclusions: Experiments with a full-blown RDF engine show that our framework outperforms the existing ones by up to two orders of magnitude in processing SPARQL queries with regular expression patterns.
Using regular expressions to express bowing patterns for string players
The study of bowing is critically important for string players. Traditional bowing annotations are a valuable part of orchestral and individual documentation, but they do not help the performer to search a piece for other passages that should be bowed the same way, or to identify alternative bowing styles. We introduce a notation based on regular expressions that describes patterns of notes in the music, as well as the bowing to be applied to the pattern. These expressions support complex bowings, and not just single annotations without musical context. The notation is simpler than general tools for regular expressions used in some software, and is suitable for use by students and musicians. We have developed a music editor that implements the notation and edits documents in Lilypond. The approach has been evaluated by experimenting with the editor on six violin sonatas by Mozart. The experiments demonstrate that the regular expression notation
is successful at finding passages and inserting the bowings; that the patterns occur a number of times; and that the bowings can be inserted automatically and consistently.
Tree pattern matching from regular tree expressions
In this work we deal with tree pattern matching over ranked trees, where the pattern set to be matched against is defined by a regular tree expression. We present a new method that uses a tree automaton constructed inductively from a regular tree expression. First we construct a special tree automaton for the regular tree expression of the pattern $E$, which is in some sense a generalization of the Thompson automaton for strings. Then we run the constructed automaton on the subject tree $t$. The pattern matching algorithm requires $O(|t||E|)$ time, where $|t|$ is the number of nodes of $t$ and $|E|$ is the size of the regular tree expression $E$. The novelty of this contribution, besides the low time complexity, is that the set of patterns can be infinite, since we use regular tree expressions to represent patterns.
PaREM: A Novel Approach for Parallel Regular Expression Matching
Regular expression matching is essential for many applications, such as
finding patterns in text, exploring substrings in large DNA sequences, or
lexical analysis. However, sequential regular expression matching may be
time-prohibitive for large problem sizes. In this paper, we describe a novel
algorithm for parallel regular expression matching via deterministic finite
automata. Furthermore, we present our tool PaREM that accepts regular
expressions and finite automata as input and automatically generates the
corresponding code for our algorithm that is amenable for parallel execution on
shared-memory systems. We evaluate our parallel algorithm empirically by
comparing it with a commonly used algorithm for sequential regular expression
matching. Experiments on a dual-socket shared-memory system with 24 physical
cores show speed-ups of up to 21x for 48 threads. Comment: CSE-2014, Dec. 19th - 21st, 2014, Chengdu, Sichuan, China
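One common way to parallelize DFA matching, which may or may not coincide with what PaREM generates, is to split the input into chunks, let each worker compute where its chunk would lead from every possible start state, and then stitch the per-chunk mappings together sequentially. The sketch below illustrates that decomposition only; the names `run_chunk`, `parallel_dfa_accepts` and `EXAMPLE_TABLE` are assumptions, the DFA is hand-written for the hypothetical pattern `(a|b)*abb`, and Python threads are used for brevity even though they do not give real CPU parallelism in CPython.

```python
from concurrent.futures import ThreadPoolExecutor


def run_chunk(chunk, table):
    """For every possible start state, return the state reached after reading `chunk`."""
    ends = list(range(len(table)))
    for ch in chunk:
        ends = [table[s][ch] for s in ends]
    return ends


def parallel_dfa_accepts(text, table, start, accepting, workers=4):
    """Membership test for a complete DFA, with chunks processed by a thread pool."""
    size = max(1, -(-len(text) // workers))      # ceiling division, at least 1
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mappings = list(pool.map(lambda c: run_chunk(c, table), chunks))
    # Sequential stitching: the true end state of one chunk selects the
    # entry to follow in the next chunk's mapping.
    state = start
    for mapping in mappings:
        state = mapping[state]
    return state in accepting


# Complete DFA over {a, b} for the hypothetical pattern (a|b)*abb.
EXAMPLE_TABLE = [
    {"a": 1, "b": 0},
    {"a": 1, "b": 2},
    {"a": 1, "b": 3},
    {"a": 1, "b": 0},
]

if __name__ == "__main__":
    print(parallel_dfa_accepts("aababb", EXAMPLE_TABLE, start=0, accepting={3}))
    print(parallel_dfa_accepts("abab", EXAMPLE_TABLE, start=0, accepting={3}))
```

Each worker does work proportional to the number of DFA states times its chunk length, so this scheme pays off mainly when the automaton is small relative to the input.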
A regular expression generator based on CSS selectors for efficient extraction from HTML pages
Cascading style sheets (CSS) selectors are patterns used to select HTML elements. They are often preferred in web data extraction because they are easy to prepare and have short expressions. To extract data from a web page using these patterns, an HTML parser first constructs a document object model (DOM) tree for the page. Constructing this tree and extracting data from it increase time and memory costs, depending on the number of HTML elements and their hierarchies. Regular expressions can be considered as a way to reduce these costs. However, preparing regular expression patterns is a laborious task. In this study, a heuristic approach, namely Regex Generator (REGEXN), that automatically generates these patterns from CSS selectors is introduced, and the performance gains are analyzed on a web crawler. The analysis shows that regular expression patterns generated by this approach can significantly reduce the average extraction time from 743.31 ms to 1.03 ms when compared with extraction from a DOM tree. Similarly, the average memory usage drops from 1054.01 B to 1.59 B. Moreover, REGEXN can be easily adapted to the existing frameworks and tools for this task.