Avoiding Latch Formation in Regular Expression Recognizers by Foster, Michael J.
Avoiding Latch Formation 
in Regular Expression Recognizers 
M. J. Foster 
Computer Science Department
Columbia University 
New York City 10027 
6 October 1986 
CUCS-233-86
Index terms
Custom VLSI, pattern matchers, regular expressions, source-to-source transformations, specialized
silicon compilers
Abstract 
Specialized silicon compilers, or module generators, are promising tools for automating the design of
custom VLSI chips. In particular, generators for regular language recognizers seem to have many
applications. This paper identifies a problem called latch formation that causes regular expression
recognizers to be more complex than they would first appear. If recognizers are constructed in the most
straightforward way from certain regular expressions, they may contain extraneous latches that cause
incorrect operation. After identifying the problem, the paper presents a "source-to-source"
transformation that converts regular expressions that cause latch formation into expressions that do not.
This transformation allows regular expression recognizers to be simpler and smaller, thus adding to the
advantages of specialized silicon compilers.
1. Introduction
One of the most important tools available to today's designers for speeding up the process of designing
custom VLSI is the specialized silicon compiler. Sometimes called module generators, specialized
silicon compilers are tools that combine knowledge of a specific application area with a set of primitive
cells and rules for combining the cells to produce a layout. Specialized silicon compilers will only
produce layouts for their specialized areas of application, but will do a very good job within those areas.
The chips that they produce will be small and fast. Specialized silicon compilers have been built for
digital signal processors [6, 14], combinational logic [16], data paths [3, 11, 13], general-purpose
processors [17], synchronizers [2, 5], and the topic of this paper, pattern matchers 凹, 12]. Anything that
improves the efficiency of the chips produced by specialized silicon compilers is a worthwhile
contribution to the field of VLSI design automation.
This paper discusses specialized silicon compilers that produce recognizers for regular expressions. A
regular expression recognizer is a pattern matching circuit in which the pattern is specified by a regular
expression. A specialized silicon compiler for these recognizers will accept a regular expression as input,
and produce the layout of a recognizer for that expression. Regular expressions have seen wide
application in computer science; among other tasks, they have been used to specify lexical analyzers for
programming languages, controllers for sequential machines, filters for on-the-fly database search,
patterns in image processing, and communication protocols. Regular expression recognizers therefore
have wide application, and are an especially promising application area for specialized silicon compilers.
The contribution of this paper is a "source-to-source" transformation of regular expressions that
allows more efficient recognizers to be constructed for them. The most straightforward circuits for
constructing regular expression recognizers exhibit a problem: recognizers for some expressions contain
superfluous latches due to interactions between the cells. Formation of these latches leads to recognizers
that function incorrectly. Previous solutions to the problem of latch formation have used more
complicated circuits to eliminate or reset the latches. These more complicated circuits increase the size of
recognizers, and may make them run more slowly. The transformation introduced in this paper eliminates
the expressions that cause latch formation, and so allows the simple, straightforward circuits to be used
for all expressions. This allows specialized silicon compilers to produce efficient recognizers for regular
languages.
2. Regular Expression Recognizers 
This section gives a notation for regular expressions and describes how to compile them into circuits.
Straightforward compilation of some regular expressions leads to a problem called latch formation, in
which extraneous latches are formed within the compiled circuit. These latches lead to incorrect
operation. This section indicates how the problem arises and mentions some techniques that have been
used to combat it.
A regular expression describes a regular language over some alphabet Σ. A regular expression may
represent the empty set (φ), the empty string (ε), or any set of strings that can be built up by
concatenation, union and repetition from ε and the single characters of Σ. A regular expression over Σ
may include some characters that are not in Σ, such as operators and parentheses. Assuming that the
characters in the set {φ ( ) * +} are not in Σ, the syntactically correct regular expressions over Σ can be
defined inductively as follows.
• φ is a regular expression over Σ.
• If a ∈ Σ then a is a regular expression over Σ.
• If α and β are regular expressions over Σ, then so are α;β, (α + β), and (α)*.
The meaning of a regular expression can be defined inductively based on the form of the expression.
The set of strings L(p) represented by a regular expression p is:
• The empty set if p is φ.
• {a} if p is a.
• L(α) ∪ L(β) if p is (α + β).
• {σ1σ2, where σ1 ∈ L(α) and σ2 ∈ L(β)} if p is α;β.
• {ε} ∪ {σ1σ2...σn, where n is any positive integer and σi ∈ L(α)} if p is (α)*.
In this article, λ will be used as an abbreviation for φ*. Thus, L(λ) = {ε}.
This article uses a regular expression to denote the set of strings it represents; for example,
abc ∈ (a;b + c)*.
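The inductive definition of L(p) can be turned directly into a small membership test. The following Python sketch is illustrative only: the tuple encoding and the names EMPTY, sym, cat, alt, star, LAMBDA, and matches are assumptions made for this example and do not come from the paper. It reads a;b + c as (a;b) + c, which is the reading under which abc belongs to the starred expression.

```python
# Hypothetical encoding of regular expressions as nested tuples.
EMPTY = ("empty",)                 # the empty set, written φ in the text

def sym(a):    return ("sym", a)   # a single character of the alphabet
def cat(p, q): return ("cat", p, q)  # concatenation p;q
def alt(p, q): return ("alt", p, q)  # union (p + q)
def star(p):   return ("star", p)    # repetition (p)*

LAMBDA = star(EMPTY)               # λ abbreviates φ*, so L(λ) = {ε}

def matches(p, s):
    """Return True if the string s is in L(p), following the inductive definition."""
    kind = p[0]
    if kind == "empty":
        return False
    if kind == "sym":
        return s == p[1]
    if kind == "alt":
        return matches(p[1], s) or matches(p[2], s)
    if kind == "cat":
        # s = s1 s2 with s1 in L(alpha) and s2 in L(beta), for some split point
        return any(matches(p[1], s[:i]) and matches(p[2], s[i:])
                   for i in range(len(s) + 1))
    if kind == "star":
        if s == "":
            return True
        # Peel off a nonempty first piece; empty pieces never change the result.
        return any(matches(p[1], s[:i]) and matches(p, s[i:])
                   for i in range(1, len(s) + 1))
    raise ValueError("unknown node kind: %r" % (kind,))

example = star(alt(cat(sym("a"), sym("b")), sym("c")))   # (a;b + c)*
print(matches(example, "abc"))   # True
print(matches(example, "aba"))   # False
```

The search over split points is exponential in the worst case; it is meant to mirror the definition, not to be an efficient recognizer.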
Regular expressions describe patterns; very often one wants to build pattern recognizers for these
patterns. A recognizer circuit for a regular expression accepts a stream of characters from the alphabet of
the expression as input, and produces a stream of bits. The operation of a recognizer circuit occurs on
discrete clock ticks, or beats. On each beat, the recognizer reads a character from the input and produces
an output bit. The first bit precedes the string of characters, then the recognizer produces one bit after
reading each character. Each bit indicates whether the stream of characters immediately preceding it is in
the language of the regular expression. For example, if a recognizer for the expression (a;b + c)* were
given the stream ababc it would produce the bit stream 001011.
Several researchers have described tree-based techniques for compiling regular expressions into
recognizer circuits [1, 7, 9, 10, 15]. All of these techniques produce tree-structured circuits in which
the nodes of the tree correspond to characters of the regular expression. Edges of the tree carry text and
state information between nodes. The nodes of the tree are usually chosen from a library of standard
cells, with the edges being the wiring between the cells.
The circuits of Mukhopadhyay are typical examples of tree-based recognizer circuits. These circuits
use five types of cells: one for each of the three operators (union, concatenation, and Kleene *), one for
the symbol φ, and one to recognize individual characters. Figure 2-1 shows the comparator for individual
characters. On each beat, a character is input at the same time as the ENB signal. The RES signal is set to
true for the following beat if and only if ENB is true and the text character matches the pattern character.
Figure 2-1: Comparator for expression-tree based recognizer
The three operator cells for the expression tree recognizer are shown in Figures 2-2, 2-3, and 2-4.
These combine ENB and RES signals from their operands to produce signals for a larger expression. A
recognizer built from these cells is a tree with the same structure as the expression tree of the regular
expression, with operator cells in place of operators and comparators in place of letters of the alphabet.
For example, to build a recognizer for A;B, a comparator for A is connected to the left port of the
concatenation cell, and a comparator for B is connected to the right port. A recognizer constructed using
these cells outputs RES on beat i if and only if some string in the language of the recognizer is input on
beats i-n through i-1 and ENB is true on beat i-n. (The cell for φ is not shown, but is simply a connection to
logical 0: it always outputs 0 on RES.)
There is a problem with the expression tree cells in Figures 2-1 through 2-4, which was first pointed
out by Backhouse [4]. If a recognizer is constructed for an expression of the form E*, where the empty
string is contained in E, the output result will be latched to true. This can be seen most easily in the
recognizer for (a*)*, shown in Figure 2-5. The OR gates in the two Kleene star cells are cross coupled so
that they form a latch. If they are ever set to true, they can never be reset to false. Thus if ENB is true on
beat 1, and the character b is input on that beat, the recognizer will erroneously output 1 on RES during
beat 2. Using these cells, then, correct recognizers can be constructed only for expressions in which the
star operator is never applied to a subexpression containing the empty string. This problem is common to
all of the tree-based recognizers that have been discussed in the literature.
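The condition stated here, a star applied to a subexpression that contains the empty string, can be checked mechanically on the expression tree. The sketch below is an illustration under assumptions: the tuple encoding and the names contains_empty and latch_sites are invented for this example. It relies only on three facts: a starred expression always contains ε, a union contains ε if either operand does, and a concatenation contains ε only if both operands do.

```python
# Hypothetical tuple encoding: ("empty",), ("sym", a), ("cat", p, q),
# ("alt", p, q), ("star", p).

def contains_empty(p):
    """Return True if the empty string is in the language of p."""
    kind = p[0]
    if kind in ("empty", "sym"):
        return False
    if kind == "star":
        return True
    if kind == "alt":
        return contains_empty(p[1]) or contains_empty(p[2])
    if kind == "cat":
        return contains_empty(p[1]) and contains_empty(p[2])
    raise ValueError("unknown node kind: %r" % (kind,))

def latch_sites(p):
    """List every star subexpression whose operand contains the empty string."""
    kind = p[0]
    if kind in ("empty", "sym"):
        return []
    if kind == "star":
        here = [p] if contains_empty(p[1]) else []
        return here + latch_sites(p[1])
    return latch_sites(p[1]) + latch_sites(p[2])

a, b, c = ("sym", "a"), ("sym", "b"), ("sym", "c")
faulty = ("star", ("star", a))                      # (a*)*
safe = ("star", ("alt", ("cat", a, b), c))          # (a;b + c)*
print(len(latch_sites(faulty)))   # 1: the outer star is applied to a*
print(len(latch_sites(safe)))     # 0
```

An expression for which latch_sites returns an empty list is one that the simple cells can handle, according to the condition in the text.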
Several solutions to the latching problem have been proposed. Foster [9] uses a clocked OR gate in the
Kleene closure cell in place of the OR gate. The clocked OR gate behaves like a normal OR gate, except
that it briefly sets its output to 0 between beats. This clears any latches formed by a loop of OR gates, and
it has been proven that recognizers built using this modified cell operate correctly for all regular
expressions [8]. More recently, Anantharaman [1] has designed a set of cells that can be interconnected
to produce delay-insensitive recognizers. Among their other advantages, Anantharaman's cells avoid the
latching problem by avoiding direct feedback between OR gates. The RES signals in these cells have more
than two values; they take on a special value for matches of the empty string. The cells are designed so
that the special value cannot act as an enable signal. In particular, a match of the empty string by a
Kleene closure cell cannot re-enable the same cell.
While these techniques correct the problem of latched recognizers, they complicate the cells. The
clocked OR gate is somewhat larger than an ordinary OR gate, and it requires connection to a clock that




Figure 2-2: Concatenation cell
Figure 2-3: Union cell




Figure 2-5: Faulty 陀cognizer for (a*)* 
more complicated than their synchronous counterparts. In many cases in which a small, simple
recognizer is desired, the anti-latching techniques add too much overhead to the primitive cells.
3. A Conversion Theorem for Regular Expressions 
This paper provides a new method for avoiding latch formation in tree-structured recognizers. The idea
is to transform the regular expression for which a recognizer is to be built into a new expression that will
not lead to latch formation. Latch formation occurs when the star operator is applied to an expression that
contains the empty string; that is, any recognizer for an expression E*, where ε ∈ E, will have extraneous
latches. If we could transform E into a new expression N(E) such that N(E)* = E* and ε ∉ N(E), we could
avoid latch formation by building the recognizer for N(E)*. This recognizer would recognize the
language E*, and would operate correctly using the simple cells of Figures 2-1 to 2-4.
The obvious way to transform E is to form the set difference E - λ. Although there is no way to build a
cell for the set difference operator [町, E - λ is a regular language, and so can be represented by some
regular expression D(E). Clearly D(E)* = E* and ε ∉ D(E). Unfortunately, the expression D(E) can be
much longer than E. The set difference operator is not used in regular expressions, and converting an
expression containing set difference to one that does not can greatly increase the length of the expression.
For example, there is a sequence {E_n} of regular expressions of increasing length such that the shortest
expression for D(E_n) is about n/2 times as long as E_n. (The first two expressions in the sequence are
a*b* and (a*b* + c*d*);(e*f* + g*h*). In general, E_{i+1} = (E_i + E'_i);(E''_i + E'''_i), where the four
copies of E_i have disjoint alphabets.) Thus there is no constant bound on the ratio of the length of D(E)
to the length of E. Since the size of a tree-based recognizer is proportional to the length of its regular
expression, transforming E to D(E) may lead to recognizers that are too large. We therefore need a
different transformation of E.
The following theorem provides a transformation that is appropriate for constructing recognizers. It
shows that any expression of the form E* can be transformed into a new expression N(E)* in which
latches are not formed, and which is no longer than E*. This solves the problem of latch formation in
tree-based recognizers for regular expressions. We first state without proof some easily verified facts
about regular expressions.
Lemma: If P and Q are regular expressions the following statements hold.
a. P** = P*
b. (P* + Q*)* = (P + Q)*
c. If ε ∈ P and ε ∈ Q then (P;Q)* = (P + Q)*.
d. ε ∈ (P + Q) if and only if either ε ∈ P or ε ∈ Q.
e. ε ∈ (P;Q) if and only if ε ∈ P and ε ∈ Q.
These facts are used in the proof of the following theorem.
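Facts a through c can be spot-checked by brute force on small examples. The Python sketch below is an illustration, not a proof: it enumerates every string of length at most K in each language and compares the resulting finite sets. The tuple encoding, the function lang, the bound K, and the sample expressions P and Q are all assumptions made for this example.

```python
# Hypothetical tuple encoding of regular expressions.
EMPTY = ("empty",)
def sym(a):    return ("sym", a)
def cat(p, q): return ("cat", p, q)
def alt(p, q): return ("alt", p, q)
def star(p):   return ("star", p)
LAMBDA = star(EMPTY)

def lang(p, k):
    """Return the set of strings of length <= k in the language of p."""
    kind = p[0]
    if kind == "empty":
        return set()
    if kind == "sym":
        return {p[1]} if k >= 1 else set()
    if kind == "alt":
        return lang(p[1], k) | lang(p[2], k)
    if kind == "cat":
        return {x + y for x in lang(p[1], k) for y in lang(p[2], k)
                if len(x) + len(y) <= k}
    if kind == "star":
        base = lang(p[1], k)
        result, frontier = {""}, {""}
        while frontier:   # keep appending pieces until nothing new appears
            new = {x + y for x in frontier for y in base
                   if len(x) + len(y) <= k} - result
            result |= new
            frontier = new
        return result
    raise ValueError("unknown node kind: %r" % (kind,))

K = 6
P = alt(sym("a"), LAMBDA)              # contains ε
Q = star(cat(sym("b"), sym("c")))      # contains ε

def same(x, y):
    return lang(x, K) == lang(y, K)

fact_a = same(star(star(P)), star(P))
fact_b = same(star(alt(star(P), star(Q))), star(alt(P, Q)))
fact_c = same(star(cat(P, Q)), star(alt(P, Q)))
# Fact c needs ε in both operands: (a;b)* and (a + b)* differ.
without_empty = same(star(cat(sym("a"), sym("b"))), star(alt(sym("a"), sym("b"))))
print(fact_a, fact_b, fact_c, without_empty)   # True True True False
```

Agreement up to length K on one pair of expressions does not establish the identities in general; it only shows that these instances behave as the lemma says.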
Theorem: For any regular expression R there is a regular expression N(R) such that
• N(R) does not contain the empty string.
• R* = (N(R))*
• N(R) is no longer than R.
Proof: We compute N(R) recursively, based on the expression tree of R. If ε ∉ R then N(R) =
R. (This grounds the recursion, since N(a) = a for any member of the alphabet.) Otherwise,
there are three cases.
1. R = P*
2. R = P + Q, where either P or Q contains ε
3. R = P;Q, where both P and Q contain ε
In case 1, N(R) = N(P). In cases 2 and 3, N(R) = N(P) + N(Q). Clearly N(R) is no longer than
R. We must now show that N(R) does not contain ε and that (N(R))* = R*.
We can show both of these facts by induction. First, we show that N(R) does not contain ε.
If R has length 1, then it must be either a letter of the alphabet, or one of the expressions φ or λ.
N(R) is then either φ or a letter of the alphabet, neither of which contains ε. Thus the base case
of the induction holds. To prove the inductive step, suppose that for every expression P of
length less than i, N(P) does not contain ε. Suppose R has length i. If R does not contain ε,
neither does N(R). If R does contain ε, then N(R) is either N(P) or N(P) + N(Q), where P and Q
have length less than i. In neither case does N(R) contain the empty string, since by the
inductive hypothesis neither N(P) nor N(Q) can contain it.
Now we show that R* = (N(R))*. We need only consider cases in which R contains the
empty string, since N(R) and R are identical otherwise. If R has length 1 and contains the
empty string, it must be λ. To prove the base case, note that N(R) is then φ, and (N(R))* = R* =
λ. For the inductive step, suppose that R* = (N(R))* as long as the length of R is less than i. If
R has length i and contains the empty string, then R must be one of the three forms noted above.
1. If R = P*, then N(R) = N(P) and (N(R))* = (N(P))*. By the inductive hypothesis,
(N(P))* = P*. So (N(R))* = P* = P** = R*.
2. If R = P + Q, then N(R) = N(P) + N(Q). So (N(R))* = (N(P) + N(Q))* = ((N(P))* +
(N(Q))*)*. By the inductive hypothesis, (N(P))* = P* and (N(Q))* = Q*, so (N(R))* =
(P* + Q*)*. But this is equal to (P + Q)*, which is R*.
3. If R = P;Q, then N(R) = N(P) + N(Q). By the same reasoning as in case 2, (N(R))* =
(P + Q)*. Now notice that, since we are assuming that R contains the empty string,
both P and Q must also contain the empty string. Thus R* = (P;Q)* = (P + Q)*.
Therefore, R* = (N(R))*.
Some examples may clarify this theorem. Given a regular expression R, we can apply the theorem
recursively to produce N(R). The reader may verify the following identities:
• N(a*) = a
• N(a*;b*;c*;d*) = a + b + c + d
• N((a* + b*);(c* + d*)) = a + b + c + d
In these examples, a recognizer for N(R)* is smaller than a recognizer for R*. Although this contraction
in size may not always occur, the theorem states that the recognizer for N(R)* will never be larger than the
one for R*.
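The recursive construction of N(R) from the proof is short enough to write out and run against these identities. The following Python sketch is an illustration under assumptions: the tuple encoding and the names contains_empty, n_transform, size, union_terms, and lang are invented for this example, length is taken to be the number of nodes in the expression tree, and the check that R* = (N(R))* is a brute-force comparison of strings up to a small length bound rather than a proof.

```python
# Hypothetical tuple encoding of regular expressions.
def sym(a):    return ("sym", a)
def cat(p, q): return ("cat", p, q)
def alt(p, q): return ("alt", p, q)
def star(p):   return ("star", p)

def contains_empty(p):
    kind = p[0]
    if kind in ("empty", "sym"):
        return False
    if kind == "star":
        return True
    if kind == "alt":
        return contains_empty(p[1]) or contains_empty(p[2])
    return contains_empty(p[1]) and contains_empty(p[2])   # cat

def n_transform(r):
    """Compute N(R) as in the proof of the theorem."""
    if not contains_empty(r):
        return r                                  # grounds the recursion
    if r[0] == "star":
        return n_transform(r[1])                  # case 1: R = P*
    return alt(n_transform(r[1]), n_transform(r[2]))   # cases 2 and 3

def size(p):
    """Number of nodes in the expression tree (assumed measure of length)."""
    return 1 + sum(size(c) for c in p[1:] if isinstance(c, tuple))

def union_terms(p):
    """Flatten nested unions into their operands, left to right."""
    if p[0] == "alt":
        return union_terms(p[1]) + union_terms(p[2])
    return [p]

def lang(p, k):
    """Strings of length <= k in the language of p (brute force)."""
    kind = p[0]
    if kind == "empty":
        return set()
    if kind == "sym":
        return {p[1]} if k >= 1 else set()
    if kind == "alt":
        return lang(p[1], k) | lang(p[2], k)
    if kind == "cat":
        return {x + y for x in lang(p[1], k) for y in lang(p[2], k)
                if len(x) + len(y) <= k}
    base, result, frontier = lang(p[1], k), {""}, {""}      # star
    while frontier:
        new = {x + y for x in frontier for y in base
               if len(x) + len(y) <= k} - result
        result |= new
        frontier = new
    return result

a, b, c, d = (sym(ch) for ch in "abcd")
r1 = star(a)                                             # a*
r2 = cat(cat(cat(star(a), star(b)), star(c)), star(d))   # a*;b*;c*;d*
r3 = cat(alt(star(a), star(b)), alt(star(c), star(d)))   # (a* + b*);(c* + d*)

for r in (r1, r2, r3):
    n = n_transform(r)
    print([t[1] for t in union_terms(n)], size(r), size(n))
```

Because union is associative, comparing the flattened operand lists is enough to match the identities as written; the tree that n_transform returns is nested according to the shape of the input expression.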
A side benefit of the transformation from R into N(R) is that the cycle time of the recognizer for N(R)*
may be smaller than that for R*. The cycle time of a recognizer is proportional to the number of
combinatorial logic gates that a signal goes through between latches. A general regular expression of
length n may have O(n) gates between latches: consider the expression a***...* with n stars. If all
subexpressions of an expression of the form R* are replaced by N(R)*, however, the maximum number of
gates between two latches is attained in the tree of + cells needed for an expression of the form (P1 + P2 +
... + Pk). Since such a tree can be built with logarithmic depth, the transformation reduces the worst case
cycle time of a recognizer of size n from O(n) to O(log n).
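The logarithmic-depth claim for a union of k operands can be illustrated by building the tree of + cells two ways and counting levels. The Python sketch below assumes, purely for illustration, that each + cell contributes one level of combinational logic; the names chain_union, balanced_union, and union_depth are hypothetical and the sketch does not model the actual cell circuits.

```python
def alt(p, q):
    return ("alt", p, q)

def chain_union(terms):
    """Left-nested union ((P1 + P2) + P3) + ... + Pk."""
    tree = terms[0]
    for t in terms[1:]:
        tree = alt(tree, t)
    return tree

def balanced_union(terms):
    """Union of the same operands, split in half at every level."""
    if len(terms) == 1:
        return terms[0]
    mid = (len(terms) + 1) // 2
    return alt(balanced_union(terms[:mid]), balanced_union(terms[mid:]))

def union_depth(p):
    """Number of + cells on the longest path from the root to an operand."""
    if p[0] != "alt":
        return 0
    return 1 + max(union_depth(p[1]), union_depth(p[2]))

operands = [("sym", "p%d" % i) for i in range(1, 9)]   # P1 ... P8
print(union_depth(chain_union(operands)))      # 7
print(union_depth(balanced_union(operands)))   # 3
```

Both trees denote the same language, since union is associative; only the nesting, and therefore the longest path through + cells, differs.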
4. Conclusion 
This paper has described the problem of latch formation in regular expression recognizers, and has
suggested a new solution. By transforming the regular expression R* into N(R)*, not only may latches be
avoided, but the corresponding recognizer may decrease in size. In addition, this transformation
decreases the worst-case cycle time of recognizers from O(n) to O(log n). This source-to-source




[1] Anantharaman, T. S.
A Delay Insensitive Regular Expression Recognizer.
1986.
Private Communication.
[2] Anantharaman, T. S., E. M. Clarke, M. J. Foster, and B. Mishra.
Compiling Path Expressions into VLSI Circuits.
In Proceedings of the 12th Symposium on Principles of Programming Languages. ACM,
January, 1985.
To appear in Distributed Computing.
[3] Anceau, F.
CAPRI: A Design Methodology and Silicon Compiler for VLSI Circuits.
In R. Bryant (editor), Proceedings of the Third Caltech Conference on Very Large Scale
Integration, pages 15-32. Caltech, Pasadena, Ca., March, 1983.
[4] Backhouse, R. C.
Specification and Proof of a Regular Language Recognizer in Synchronous CCS.
Technical Report CSM-53, University of Essex, January, 1983.
[5] Balraj, T. S. and M. J. Foster.
MissManners: A Specialized Silicon Compiler for Synchronizers.
In Proceedings of the Fourth MIT Conference on Advanced Research in VLSI. MIT, April, 1986.
[6] Bergmann, N.
A Case Study of the F.I.R.S.T. Silicon Compiler.
In R. Bryant (editor), Proceedings of the Third Caltech Conference on Very Large Scale
Integration, pages 413-430. Caltech, Pasadena, Ca., March, 1983.
[7] Floyd, R. W. and J. D. Ullman.
The Compilation of Regular Expressions into Integrated Circuits.
JACM 29(3):603-622, July, 1982.
[8] Foster, M. J.
Specialized Silicon Compilers for Language Recognition.
PhD thesis, Carnegie-Mellon University, 1984.
[9] Foster, M. J.
A Specialized Silicon Compiler and Programmable Chip for Language Recognition.
In N. G. Einspruch (editor), VLSI Electronics: Microstructure Science. Volume 14: VLSI Design,
chapter 5, pages 139-196. Academic Press, Orlando, Florida, 1986.
[10] Foster, M. J. and H. T. Kung.
Recognize Regular Languages With Programmable Building Blocks.
Journal of Digital Systems 6(4):323-332, 1982.
A preliminary version of this paper appears in the VLSI-81 Proceedings, edited by John P. Gray.
[11] Hitchcock, C. Y. and D. E. Thomas.
A Method of Automatic Data Path Synthesis.
In Proceedings of the 20th Design Automation Conference. IEEE, June, 1983.
[12] Karlin, A. R., H. W. Trickey, and J. D. Ullman.
Experience with a Regular Expression Compiler.
In Proceedings of the International Conference on Computer Design, pages 656-665. IEEE,
October, 1983.
[13] Kowalski, T. J. and D. E. Thomas.
The VLSI Design Automation Assistant: Prototype System.
In Proceedings of the 20th Design Automation Conference. IEEE, June, 1983.
[14] Lyon, R. F.
A Bit-Serial VLSI Architectural Methodology for Signal Processing.
In J. P. Gray (editor), VLSI 81, pages 131-140. University of Edinburgh, University of Edinburgh,
August, 1981.
[15] Mukhopadhyay, A.
Hardware Algorithms for Nonnumeric Computation.
IEEE Transactions on Computers C-28(6):384-394, June, 1979.
[16] Savage, J. E.
Three VLSI Compilation Techniques: PLA's; Weinberger Arrays; and SLAP, a New Silicon Layout
Program.
Technical Report CS-82-24, Brown University, October, 1982.
[17] Southard, J. R.
MacPitts: An Approach to Silicon Compilation.
Computer 16(12):74-82, December, 1983.
