Sequential Circuits from Regular Expressions Revisited
Dogan Ulus
August 24, 2018
Abstract
We revisit the long-neglected problem of sequential circuit construc-
tions from regular expressions. The class of languages that are recognized
by sequential circuits is equivalent to the class of regular languages. This
fact is shown in [5] together with an inductive construction technique
from regular expressions. In this note, we present an alternative
algorithm, called the trigger-set approach, obtained by reversing the
well-known follow-set approach to constructing automata. We use our
algorithm to obtain a regular expression matcher based on sequential
circuits. Finally, we report our performance results in comparison with
existing automata-based matchers.
1 introduction
A sequential circuit receives an input value at each step, updates its (internal)
state, and yields an output value. The current state of a circuit represents
the history of input values received during its operation. Circuits store their
current state inside memory elements (e.g. flip-flops or registers) and update
it with respect to the newly received input value. Then, the circuit yields an
output value at each step, which in general is determined by the current state
and input. Clearly this is a simple but powerful model of computation, and it
has apparently impressed many people in electrical engineering and computer
science over the years. While engineers were building modern computers upon
sequential circuits, computer scientists studied abstract models of them.
Today we call these abstract models finite automata but their original relation
to sequential circuits is largely forgotten.
In this note, we study sequential circuits with the goal of obtaining
regular expression matchers, an application that has historically employed
automata and related techniques [14, 1]. Interestingly enough, the use of
automata for pattern matching was also forgotten in favor of search-based
(backtracking) methods until it was revived in a blog post¹ by Russ Cox.
Eventually this effort has provided us with industrial-grade pattern
matchers that use automata (again), such as Google’s re2 engine². Moreover,
recent programming languages such as Go and Rust implement automata
inside their regular expression engines, unlike Perl and Python, which
implement backtracking. It is worth noting that existing backtracking and
automata-based matchers are mostly comparable in performance. However,
there are (pathological) cases in which backtracking fails miserably and
requires exponential time for matching, whereas automata have a formal
guarantee of linear execution time.
So what can we gain from sequential circuits in this already-quite-developed
world of pattern matching? First, we do not want a worst-case exponential
time algorithm, which can be more critical as we will eventually go beyond
text processing as in [16, 15]. Therefore, we eliminate backtracking
matchers from our discussion and focus on the other class. Automata-based
matchers usually implement two types³ of automata, namely deterministic
(dfas) and non-deterministic (nfas). Roughly speaking, for a given
expression of size m, dfas tend to be large (exponentially bounded by m) and
fast (constant in m) whereas nfas are small (linearly bounded by m) and
slow (quadratic in m). Note that actual performance depends considerably
on the expression as well as the input string to be matched. Sequential
circuits would reside on the non-deterministic side of this spectrum
but they seem to be better at handling non-determinism than nfas. It is not
wrong to consider sequential circuits to be yet another way to implement
non-deterministic automata (remember that automata are abstract models
of sequential circuits) but we argue that this view is missing some essence.

1 https://swtch.com/~rsc/regexp/regexp1.html
2 https://github.com/google/re2

arXiv:1801.08979v1 [cs.FL] 26 Jan 2018
There is one important operational contrast between automata and
sequential circuits regarding their update mechanisms. For automata, the
update mechanism operates by pushing information forward. The information
about being in a state is propagated to the next states with respect to the
current input, so we compute reachable successors. In contrast, the update
mechanism for sequential circuits operates by pulling information from
behind. This time we independently compute the reachability of each state
from its predecessors with respect to the current input. By analogy, we can
say that automata run on rear-wheel drive whereas sequential circuits run on
front-wheel drive. Both computations go in exactly the same direction
but the difference in update mechanisms creates some interesting trade-offs
in practice. Maybe a four-wheel drive is what we want.
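The contrast between the two update mechanisms can be made concrete with a minimal sketch. The `Positions` structure and the names `step_push` and `step_pull` are hypothetical and chosen only for illustration; the sketch assumes that every state carries the letter that must be read to enter it and that both the successor and the predecessor lists are available.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical representation (not from the paper's implementation):
// label[i] is the letter read when entering state i, follow[j] lists the
// successors of state j, and trigger[i] lists the predecessors of state i.
struct Positions {
    std::vector<char> label;
    std::vector<std::vector<std::size_t>> follow;
    std::vector<std::vector<std::size_t>> trigger;
};

// Push (automaton style): each active state propagates to its successors.
std::vector<bool> step_push(const Positions& p, const std::vector<bool>& cur, char x) {
    std::vector<bool> next(cur.size(), false);
    for (std::size_t j = 0; j < cur.size(); ++j) {
        if (!cur[j]) continue;
        for (std::size_t i : p.follow[j])
            if (p.label[i] == x) next[i] = true;
    }
    return next;
}

// Pull (circuit style): each state inspects its own predecessors,
// independently of all other states.
std::vector<bool> step_pull(const Positions& p, const std::vector<bool>& cur, char x) {
    std::vector<bool> next(cur.size(), false);
    for (std::size_t i = 0; i < cur.size(); ++i) {
        if (p.label[i] != x) continue;
        for (std::size_t j : p.trigger[i])
            if (cur[j]) { next[i] = true; break; }
    }
    return next;
}
```

Both functions compute the same next valuation when `follow` and `trigger` describe the same relation. The pull version writes each entry of `next` from exactly one loop iteration, which is what makes the per-state computations independent.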
In the following, we first review sequential circuits briefly and then
propose an algorithm to directly construct sequential circuits from
(classical) regular expressions. The last section reports the implementation,
test cases, and some comparison with re2’s dfa and nfa implementations.
2 sequential circuits
A sequential circuit reads a sequence $X_1 X_2 \dots$ of $k$-dimensional
Boolean vectors such that $X_n \in \mathbb{B}^k$ for $n = 1, 2, \dots$.
Clearly, any finite alphabet $\Sigma$ can be encoded as a set of
$k$-dimensional Boolean vectors for a suitable $k$, with ascii and Unicode
as typical examples. Therefore, in this note, we use Boolean vectors and
letters interchangeably as inputs to circuits. We characterize a sequential
circuit by three elements as follows.

$V$ : an $(m+1)$-dimensional Boolean state vector
$F_i(V, X)$ : next-state functions for each position $i = 0, 1, \dots, m$
$Y(V, X)$ : the output function of the circuit

where $X \in \mathbb{B}^k$ is the current input value. The functions $F_i$
and $Y$ are Boolean functions from $\mathbb{B}^{m+1} \times \mathbb{B}^k$.
We denote by $V_n$ the valuation (content) of the vector $V$ at the time
step $n$ and we call $V_0$ the initial valuation of the circuit. The circuit
operates by reading the current input value $X_n$, updating the state vector
$V$ such that

\[ V_n(i) = F_i(V_{n-1}, X_n) \quad \text{for } i = 0, \dots, m \tag{1} \]

and yielding a Boolean value $Y_n$ such that

\[ Y_n = Y(V_{n-1}, X_n) \tag{2} \]

at each time step $n$. The output of a sequential circuit is completely
determined by the initial valuation $V_0$ and the sequence of input values
read so far. This is called the sequential model of computation, which
relates input sequences to output sequences as illustrated in Figure 1. It
is customary to say that any sequence of input values that makes the circuit
yield 1 is accepted by the circuit for a fixed initial valuation. For
example, the input sequences $a_1$, $a_1 a_2 a_3$, and $a_1 a_2 a_3 a_4 a_5$
are accepted by the circuit in Figure 1 whereas $a_1 a_2$ and
$a_1 a_2 a_3 a_4$ are rejected. The set of all input sequences accepted by
a sequential circuit is called the language of the circuit. It is shown in
[5] that all and only regular languages are recognized by sequential
circuits, as an analogue of Kleene’s theorem [7].

V0 −a1→ V1 −a2→ V2 −a3→ V3 −a4→ V4 −a5→ V5
    ↓       ↓       ↓       ↓       ↓
    1       0       1       0       1
Figure 1: A sequential circuit computation.

3 Some can have more. For example, re2 implements an intermediate form called one-pass nfa
for patterns having limited non-determinism.
These concepts are of course familiar to anyone who has studied automata
theory, which was established in classical papers [9, 10, 12] as an abstract
treatment of sequential circuits. However, the circuit approach reinstated
here makes it clearly visible that checking acceptance or matching a word
can be reduced to repeatedly computing a set of Boolean functions. In
particular, we emphasize that the (more) expensive list processing
operations in nfa simulations can be avoided if we construct circuits
instead of automata. Besides, the efficiency can easily be increased by
exploiting the parallelism of the hardware if needed. On fully parallel
hardware⁴, circuits are probably the most efficient representations of
regular languages since they would have a small (linear) size and operate
very fast (one clock period per letter of the input). This fact is rather
well-known in hardware communities [13, 3] but their methods involve a
translation into a dfa or an nfa, which brings additional work, if not
complexity (e.g. ε-transitions).
3 sequential circuits from regular expressions
In this section, we present an algorithm to construct sequential circuits
directly from regular expressions, which are syntactic representations of
regular languages. We give the syntax of regular expressions in this note by
the following grammar.
E := a | E1 ∪ E2 | E1 · E2 | E∗ | E+ | E?
where a is a letter of an alphabet Σ and E is a regular expression. The
operators · and ∪ denote concatenation and union as usual. Other operators
∗, +, and ? are repetition operators and sometimes called zero-or-more,
one-or-more, and zero-or-one repetition, respectively.
Our algorithm falls into the family of position-based construction algo-
rithms that associate each letter (or sub-expression) of the expression with a
unique position value. Previous works [8, 6, 1, 2] in this family⁵ are
interested in computing succeeding positions (follow sets) for each position in
the expression. The rough idea here is that the acceptance procedure for an
input word would resemble the act of going from position to position on the
expression. Consequently, these positions become states of an nfa and follow
sets make the transition function. On the other hand, we are interested in
computing trigger conditions (or preceding positions) for each position to
obtain a sequential circuit. To this end, we closely follow the work of Berry
and Sethi [2] but reverse the follow set approach. The proposed treatment
is mostly symmetric and yields a sequential circuit instead of an nfa as the
outcome. This result can be taken as yet another demonstration of the close
relation between automata and sequential circuits.
4 We say so because our circuit definition is still more mathematical than electrical.
5 Inductive constructions such as Thompson’s [14] are also in this family as positions are implicitly
assumed. The other family is called derivative-based constructions, stemming from Brzozowski’s
derivatives [4]. Berry and Sethi show that these two happy families are alike [2].
We start by marking all letters of a regular expression E to make
them distinct. The marked version of a regular expression E is obtained by
associating each letter in E with a number called its position. Positions
are denoted by subscripts on letters. In the strict sense, marked letters
$a_i$ and $a_j$ are distinct if $i \neq j$. We designate the position 0 as
the initial position of the expression and then enumerate letters from left
to right. The leftmost letter in E is associated with the position 1 and so
on. For example, the expression $(a \cdot b \cup b)^* \cdot b \cdot a$
becomes $(a_1 \cdot b_2 \cup b_3)^* \cdot b_4 \cdot a_5$ after the marking
process. Finally, we say that the size #(E) of an expression E is the number
of letters in the expression.
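The marking step amounts to a single left-to-right scan. The sketch below is only an illustration under a simplifying assumption: the expression is given as a string in which letters are alphabetic characters and every other character is an operator or a parenthesis. The function name `mark` is hypothetical.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Returns the letters of `expr` indexed by position. Index 0 is the reserved
// initial position and holds '\0', so the size #(E) is letters.size() - 1.
// Assumption: alphabetic characters are letters; anything else is syntax.
std::vector<char> mark(const std::string& expr) {
    std::vector<char> letters{'\0'};
    for (char c : expr)
        if (std::isalpha(static_cast<unsigned char>(c)))
            letters.push_back(c);
    return letters;
}
```

For the running example written as `(ab|b)*ba`, the result holds the letters a, b, b, b, a at positions 1 to 5.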
In the following, we present three functions, skip, out, and trig, that are
used to construct a sequential circuit from a marked regular expression.
First, the function skip checks whether the language of an expression E
contains the empty word (thus whether E is skippable). This function can be
computed inductively for regular expressions by the following rules.
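\[
\begin{aligned}
\mathrm{skip}(a_i) &= 0 \\
\mathrm{skip}(E_1 \cup E_2) &= \mathrm{skip}(E_1) \text{ or } \mathrm{skip}(E_2) \\
\mathrm{skip}(E_1 \cdot E_2) &= \mathrm{skip}(E_1) \text{ and } \mathrm{skip}(E_2) \\
\mathrm{skip}(E^*) &= 1 \\
\mathrm{skip}(E^+) &= \mathrm{skip}(E) \\
\mathrm{skip}(E?) &= 1
\end{aligned}
\]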
Second, we define the out function that computes the outputting (or last)
positions of each node. The computation of the out function is inductively
given by the following rules.
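\[
\begin{aligned}
\mathrm{out}(a_i) &= \{i\} \\
\mathrm{out}(E_1 \cup E_2) &= \mathrm{out}(E_1) \cup \mathrm{out}(E_2) \\
\mathrm{out}(E_1 \cdot E_2) &=
  \begin{cases}
    \mathrm{out}(E_1) \cup \mathrm{out}(E_2) & \text{if } \mathrm{skip}(E_2) \\
    \mathrm{out}(E_2) & \text{otherwise}
  \end{cases} \\
\mathrm{out}(E^*) &= \mathrm{out}(E) \\
\mathrm{out}(E^+) &= \mathrm{out}(E) \\
\mathrm{out}(E?) &= \mathrm{out}(E)
\end{aligned}
\]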
Continuing our running example, we compute the skip and out functions and
illustrate them on the syntax tree in Figure 2. We annotate each node of the
syntax tree with the corresponding values computed by the skip and out
functions.
Sub-expression               skip   out
a1                           0      {1}
b2                           0      {2}
b3                           0      {3}
b4                           0      {4}
a5                           0      {5}
a1 · b2                      0      {2}
a1 · b2 ∪ b3                 0      {2, 3}
(a1 · b2 ∪ b3)∗              1      {2, 3}
(a1 · b2 ∪ b3)∗ · b4         0      {4}
(a1 · b2 ∪ b3)∗ · b4 · a5    0      {5}
Figure 2: Computing skip and out over the syntax tree of (a1 · b2 ∪ b3)∗ · b4 · a5.

Third, recall that we have reserved the position 0 as the starting position.
We use the position 0 as the trigger for the expression and it allows us
to control where to start a match on the input sequence. We will explain
this point later with an example. The function trig(E, {0}) yields a set of
triples (i, a, H) for a given regular expression E where i is a position, a
is the corresponding letter, and H is a set of positions that trigger the
position i together with a. In other words, one should be in a position
j ∈ H and read the letter a to reach the position i in the next step.

Position   Letter   Trigger Set
1          a        {0, 2, 3}
2          b        {1}
3          b        {0, 2, 3}
4          b        {0, 2, 3}
5          a        {4}
Figure 3: Trigger sets computed by trig(E, {0}) where E = (a1 · b2 ∪ b3)∗ · b4 · a5.

Then, we inductively define the function trig as follows.
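\[
\begin{aligned}
\mathrm{trig}(a_i, H) &= \{(i, a, H)\} \\
\mathrm{trig}(E_1 \cup E_2, H) &= \mathrm{trig}(E_1, H) \cup \mathrm{trig}(E_2, H) \\
\mathrm{trig}(E_1 \cdot E_2, H) &=
  \begin{cases}
    \mathrm{trig}(E_1, H) \cup \mathrm{trig}(E_2, \mathrm{out}(E_1) \cup H) & \text{if } \mathrm{skip}(E_1) \\
    \mathrm{trig}(E_1, H) \cup \mathrm{trig}(E_2, \mathrm{out}(E_1)) & \text{otherwise}
  \end{cases} \\
\mathrm{trig}(E^*, H) &= \mathrm{trig}(E, \mathrm{out}(E) \cup H) \\
\mathrm{trig}(E^+, H) &= \mathrm{trig}(E, \mathrm{out}(E) \cup H) \\
\mathrm{trig}(E?, H) &= \mathrm{trig}(E, H)
\end{aligned}
\]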
An example computation of trigger sets is given for our running example
in Figure 3. By computing trig(E, {0}), we have almost finished our sequential
circuit construction but we must also define the initialization and the
output function. The output function is determined by the out(E) function at
the top level. For the starting behavior, we have two typical options:
(1) Start matching only from the beginning of the input sequence. This is
also known as acceptance, full match, etc. (2) Start anywhere on the input
sequence. This is also known as suffix acceptance, partial match, etc.,
which is indeed equivalent to the acceptance problem of Σ∗ · E. These
starting behaviors are determined by the next-state function F0 of the
initial position. Finally, we present the full construction in Algorithm 1
and the circuit generated for the expression ((a · b) ∪ b)∗ · b · a in
Figure 4.
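The inductive rules for skip, out, and trig translate almost literally into recursive functions over a syntax tree. The sketch below is one possible rendering and not the visitor-based prototype described later; the `Node` layout and the helpers `letter`, `node`, `unite`, and `running_example` are hypothetical names introduced for illustration.

```cpp
#include <map>
#include <memory>
#include <set>

enum class Op { Letter, Union, Concat, Star, Plus, Opt };
struct Node;
using Expr = std::shared_ptr<Node>;
// letter and pos are meaningful for Letter nodes; rhs is unused by unary nodes.
struct Node { Op op; char letter; int pos; Expr lhs, rhs; };

Expr letter(char a, int pos) {
    return std::make_shared<Node>(Node{Op::Letter, a, pos, nullptr, nullptr});
}
Expr node(Op op, Expr l, Expr r = nullptr) {
    return std::make_shared<Node>(Node{op, 0, 0, l, r});
}

std::set<int> unite(std::set<int> a, const std::set<int>& b) {
    a.insert(b.begin(), b.end());
    return a;
}

bool skip(const Expr& e) {
    switch (e->op) {
        case Op::Letter: return false;
        case Op::Union:  return skip(e->lhs) || skip(e->rhs);
        case Op::Concat: return skip(e->lhs) && skip(e->rhs);
        case Op::Plus:   return skip(e->lhs);
        default:         return true;   // Star and Opt
    }
}

std::set<int> out(const Expr& e) {
    switch (e->op) {
        case Op::Letter: return {e->pos};
        case Op::Union:  return unite(out(e->lhs), out(e->rhs));
        case Op::Concat: return skip(e->rhs) ? unite(out(e->lhs), out(e->rhs))
                                             : out(e->rhs);
        default:         return out(e->lhs);   // Star, Plus, and Opt
    }
}

// A triple (i, a, H) is stored as triggers[i] = {a, H}.
struct Trigger { char letter; std::set<int> from; };
using Triggers = std::map<int, Trigger>;

void trig(const Expr& e, const std::set<int>& h, Triggers& t) {
    switch (e->op) {
        case Op::Letter: t[e->pos] = Trigger{e->letter, h}; break;
        case Op::Union:  trig(e->lhs, h, t); trig(e->rhs, h, t); break;
        case Op::Concat:
            trig(e->lhs, h, t);
            trig(e->rhs, skip(e->lhs) ? unite(out(e->lhs), h) : out(e->lhs), t);
            break;
        case Op::Star:
        case Op::Plus:   trig(e->lhs, unite(out(e->lhs), h), t); break;
        case Op::Opt:    trig(e->lhs, h, t); break;
    }
}

// The marked running example (a1 . b2 | b3)* . b4 . a5.
Expr running_example() {
    Expr body = node(Op::Union,
                     node(Op::Concat, letter('a', 1), letter('b', 2)),
                     letter('b', 3));
    return node(Op::Concat,
                node(Op::Concat, node(Op::Star, body), letter('b', 4)),
                letter('a', 5));
}
```

Calling `trig(running_example(), {0}, t)` on an empty `Triggers t` fills it with the five rows of Figure 3. This sketch recomputes skip and out on every call for brevity; caching them as node annotations, as in Figure 2, would avoid the repeated traversals.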
Algorithm 1 Trigger Set Algorithm
For a given regular expression E, a sequential circuit (V , F, Y) is obtained as
follows.
1. Compute #(E), skip(E), out(E), and trig(E, {0}) functions.
2. Allocate an m+1 dimensional state vector V and initialize V such that
V0(0) := 1, V0(i) := 0 for i = 1, . . . ,m where m = #(E).
3. For the initial position, define either
(a) F0 := 0 to start matching only from the first letter, or
(b) F0 := 1 to start matching anywhere on the input sequence.
4. For each triple (i, a, H) ∈ trig(E, {0}), define the rest of the
next-state functions
$F_i := (X = a) \wedge \bigvee_{j \in H} V(j)$
where X is the current input value.
5. Define the output function $Y := \bigvee_{i \in \mathrm{out}(E)} F_i$.
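Steps 2 to 5 of Algorithm 1 can also be executed by a small table-driven interpreter rather than by generated code. The sketch below is such an interpreter under stated assumptions: the trigger triples and the output set are given as data, positions are numbered 1 to m, and the names `Trigger`, `Matcher`, and `match` are hypothetical.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct Trigger { int pos; char letter; std::set<int> from; };   // a triple (i, a, H)

struct Matcher {
    std::vector<Trigger> triggers;   // trig(E, {0}), one entry per position 1..m
    std::set<int> outs;              // out(E)
    bool anywhere;                   // F_0: false = full match, true = start anywhere
};

// Runs the circuit over `word` and returns the last output value Y_n.
// The empty word yields no output and is reported as not matched here.
bool match(const Matcher& m, const std::string& word) {
    std::size_t n = m.triggers.size() + 1;          // m + 1 state bits
    std::vector<bool> v(n, false), next(n, false);
    v[0] = true;                                    // V_0 = (1, 0, ..., 0)
    bool y = false;
    for (char x : word) {
        next[0] = m.anywhere;                       // step 3: F_0
        for (const Trigger& t : m.triggers) {       // step 4: F_i
            bool fired = false;
            if (x == t.letter)
                for (int j : t.from)
                    if (v[j]) { fired = true; break; }
            next[t.pos] = fired;
        }
        y = false;
        for (int i : m.outs) y = y || next[i];      // step 5: Y = OR of F_i
        v = next;
    }
    return y;
}
```

With the five triples of Figure 3 and the output set {5}, the full-match variant accepts `ba` and `abba` and rejects `ab`, while the start-anywhere variant additionally accepts words such as `xxba` that merely end with a match.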
V0 := (1, 0, 0, 0, 0, 0)
F0 := 0
F1 := (X = a) ∧ (V(0)∨ V(2)∨ V(3))
F2 := (X = b) ∧ V(1)
F3 := (X = b) ∧ (V(0)∨ V(2)∨ V(3))
F4 := (X = b) ∧ (V(0)∨ V(2)∨ V(3))
F5 := (X = a) ∧ V(4)
Y := (X = a) ∧ V(4)
Figure 4: A sequential circuit that recognizes the expression
((a · b) ∪ b)∗ · b · a, constructed by the trigger set algorithm.
4 implementation
The prototype implementation⁶ of the trigger set algorithm performs two
main tasks. The first task is computing the #, skip, out, and trig functions
for the given regular expression. The corresponding visitor implementations
annotate the syntax tree according to the definitions in the previous
section and return the size of the expression, the output set, and the
trigger sets for each position. Then, using this information, we construct a
sequential circuit from the expression, that is to say, we generate C++ code
to be compiled into a program that matches the expression over the input
word. The generated code is very simple and currently requires the name of a
file that contains the input word as its only argument. For example, in
Listing 1, we show the code generated for the sequential circuit given in
Figure 4 (with the start-anywhere option F0 := 1).
6 https://github.com/doganulus/reelay
Listing 1: Code generated for (((a;b)|b)*);b;a
#include <cstring>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main(int argc, char **argv) {
    if (argc < 2) return 1;  // expects the input file name
    // state[i] is 1 when position i is active; position 0 is the trigger.
    int state[6] = {1, 0, 0, 0, 0, 0};
    int next_state[6] = {0, 0, 0, 0, 0, 0};

    // Read the whole input word from the file given as the only argument.
    std::ifstream ifs(argv[1]);
    std::string word((std::istreambuf_iterator<char>(ifs)),
                     (std::istreambuf_iterator<char>()));

    for (char letter : word) {
        next_state[0] = 1;  // Start anywhere
        next_state[1] = (letter == 'a') and (state[0] or state[2] or state[3]);
        next_state[2] = (letter == 'b') and (state[1]);
        next_state[3] = (letter == 'b') and (state[0] or state[2] or state[3]);
        next_state[4] = (letter == 'b') and (state[0] or state[2] or state[3]);
        next_state[5] = (letter == 'a') and (state[4]);
        std::memcpy(state, next_state, sizeof(state));
    }
    // Position 5 is the only output position, so this is the last value of Y.
    std::cout << next_state[5] << std::endl;
}
We compare our sequential circuit based implementation against the dfa and
nfa implementations of re2, an automata-based regular expression matcher.
For testing purposes, we generate similar code that uses re2. To this end,
we simply call re2’s corresponding matching function (RE2::PartialMatch)
instead of our sequential circuit implementation. Finally, note that re2
constructs a dfa by default. In order to force re2 to construct an nfa, we
reduce the maximum memory allowed for the dfa simulation so that it falls
back to the nfa. The test programs that use re2 can be found in the appendix.
We compile the generated code using the standard g++ compiler with level
two optimizations (-O2). In our tests, we perform suffix matching for a test
pattern over very long (67M characters = 67 MB) random sequences of textual
characters over an alphabet Σ. All tests run on a 3.3 GHz machine. Since
the execution time is linear in the input size, we only report the throughput
of matching (in MB per second), obtained by dividing the input file size by
the minimum of 10 actual execution times. We present our experimental
results in Tables 1-5 and discuss them in the following section.
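The throughput figure described above can be computed as in the following sketch. It is an illustration only: the note does not state how the execution times were taken, so timing a callable in-process with `std::chrono::steady_clock` and counting one MB as 10^6 bytes are both assumptions, and `throughput_mb_per_s` is a hypothetical helper.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>

// Throughput in MB/s: input size divided by the minimum of `runs` timings.
// Assumptions: `match` performs one complete matching run over the input,
// and 1 MB = 10^6 bytes.
double throughput_mb_per_s(const std::function<void()>& match,
                           std::size_t input_bytes, int runs = 10) {
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < runs; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        match();
        auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return (static_cast<double>(input_bytes) / 1e6) / best;
}
```

Taking the minimum rather than the mean discards runs disturbed by other activity on the machine, which is a common choice for reporting the best achievable rate.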
5 discussion
In this note, we presented an algorithm to construct a sequential circuit
from a regular expression. We implemented our algorithm straightforwardly
and tested the sequential circuit approach against a well-engineered regular
expression engine that implements automata. In our tests, sequential
circuits clearly outperform standard nfas, which have similar memory
requirements. There is one case (shown in Table 2) where the nfa seems
faster, but this is probably because re2’s engine falls back to a one-pass
nfa rather than the standard nfa since the pattern is unambiguous. It is no
surprise that deterministic automata are faster than sequential circuits
when the pattern is more deterministic and not too large. However, there are
patterns for which the number of deterministic states grows exponentially in
the size of the expression. The pattern given in Table 5 is not randomly
selected; it is similar to the example used in [11], which proves that the
exponential bound on nfa-to-dfa conversion is indeed tight. In short, this
is very bad news for dfas. The dfa implementation in re2 performs a lazy
determinization procedure, which computes and stores new deterministic
states only when they are required. The number of deterministic states kept
in the cache is capped by a fixed value (the default must be around 10K
states) and the implementation falls back to the nfa if this limit is
exceeded often. This perhaps explains the dramatic performance decrease
observed for the dfa between n = 14 and n = 15 in Table 5. On the contrary,
sequential circuits perform better for such highly non-deterministic
patterns. Therefore, we believe that sequential circuits offer an
alternative solution for the pattern matching problem and can replace
non-deterministic automata to this end.

Table 1: ((a · b) ∪ b)∗ · b · a over Σ = {a, . . . , z}
dfa         nfa         Sequential
152 MB/s    13 MB/s     268 MB/s

Table 2: a · b · c · d · e · f · g · h · i · j · k · l · m · n · o · p · q · r · s · t · u · v · w · x · y · z
dfa         One-pass nfa?   Sequential
304 MB/s    161 MB/s        48 MB/s

Table 3: (x ∪ y ∪ z) · a · b · c · d · e · f · g · h · i · j · k · l · m · n · o · p · q · r · s · t · u · v · w · x · y · z
dfa         nfa         Sequential
135 MB/s    30 MB/s     44 MB/s

Table 4: $(a?)^n \cdot a^n$ over Σ = {a, . . . , z}
n     dfa         nfa         Sequential
10    335 MB/s    38 MB/s     257 MB/s
20    335 MB/s    22 MB/s     170 MB/s
30    335 MB/s    15 MB/s     115 MB/s

Table 5: $((a \cup b)^*) \cdot a \cdot (a \cup b)^n$ over Σ = {a, b}
n     dfa         nfa         Sequential
10    93 MB/s     5.4 MB/s    35 MB/s
14    37 MB/s     4.3 MB/s    19 MB/s
15    4.3 MB/s    4.3 MB/s    17 MB/s
20    3.3 MB/s    3.3 MB/s    12 MB/s
30    2.7 MB/s    2.7 MB/s    7.2 MB/s
references
[1] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman.
Compilers: Principles, Techniques, and Tools. 2nd ed. Addison-Wesley,
2007.
[2] Gérard Berry and Ravi Sethi. “From Regular Expressions to Determin-
istic Automata”. In: Theoretical Computer Science 48.3 (1986), pp. 117–
126.
[3] Marc Boule and Zeljko Zilic. Generating hardware assertion checkers.
Springer, 2008.
[4] Janusz A. Brzozowski. “Derivatives of Regular Expressions”. In: Journal
of the ACM 11.4 (1964), pp. 481–494.
[5] Irving M Copi, Calvin C Elgot, and Jesse B Wright. “Realization of
events by logical nets”. In: Journal of the ACM (JACM) 5.2 (1958), pp. 181–
196.
[6] Victor M Glushkov. “The abstract theory of automata”. In: Russian
Mathematical Surveys 16.5 (1961), pp. 1–53.
[7] Stephen Cole Kleene. “Representations of events in nerve nets and
finite automata”. In: Automata Studies: Annals of Mathematics Studies 34
(1956), pp. 3–42.
[8] R. McNaughton and H. Yamada. “Regular Expressions and State
Graphs for Automata”. In: Electronic Computers, IRE Transactions on
EC-9.1 (1960), pp. 39–47.
[9] George H Mealy. “A method for synthesizing sequential circuits”. In:
Bell Labs Technical Journal 34.5 (1955), pp. 1045–1079.
[10] Edward F. Moore. “Gedanken-experiments on sequential machines”.
In: Automata studies 34 (1956), pp. 129–153.
[11] Frank R Moore. “On the bounds for state-set size in the proofs of equiv-
alence between deterministic, nondeterministic, and two-way finite
automata”. In: IEEE Transactions on computers 100.10 (1971), pp. 1211–
1214.
[12] Michael O. Rabin and Dana Scott. “Finite automata and their decision
problems”. In: IBM journal of research and development 3.2 (1959), pp. 114–
125.
[13] Reetinder Sidhu and Viktor K Prasanna. “Fast regular expression
matching using FPGAs”. In: Field-Programmable Custom Computing
Machines, 2001. FCCM’01. The 9th Annual IEEE Symposium on. IEEE.
2001, pp. 227–238.
[14] Ken Thompson. “Regular Expression Search Algorithm”. In: Communi-
cations of the ACM (CACM) 11.6 (1968), pp. 419–422.
[15] Dogan Ulus, Thomas Ferrère, Eugene Asarin, and Oded Maler. “Online
Timed Pattern Matching using Derivatives”. In: Tools and Algorithms for
the Construction and Analysis of Systems (TACAS). 2016, pp. 736–751.
[16] Dogan Ulus, Thomas Ferrère, Eugene Asarin, and Oded Maler. “Timed
Pattern Matching”. In: Formal Modeling and Analysis of Timed Systems
(FORMATS). 2014, pp. 222–236.
a generated codes
Listing 2: Example test code for RE2 (DFA)
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <re2/re2.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;  // expects the input file name
    std::ifstream ifs(argv[1]);
    std::string word((std::istreambuf_iterator<char>(ifs)),
                     (std::istreambuf_iterator<char>()));
    // The trailing $ ties the match to the end of the input (suffix matching).
    std::cout << RE2::PartialMatch(word, "(?:(?:(?:ab)|b)*)ba$") << std::endl;
}
Listing 3: Example test code for RE2 (Forced NFA)
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <re2/re2.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;  // expects the input file name
    std::ifstream ifs(argv[1]);
    std::string word((std::istreambuf_iterator<char>(ifs)),
                     (std::istreambuf_iterator<char>()));
    RE2::Options opt;
    opt.set_max_mem(2048);  // small memory budget so that the dfa falls back to the nfa
    RE2 re("(?:(?:(?:ab)|b)*)ba$", opt);
    std::cout << RE2::PartialMatch(word, re) << std::endl;
}
