Let WRAM [PRAM] be a parallel computer with p processors (RAM's) which share a common memory and are allowed simultaneous reads and writes [only simultaneous reads].
The only type of simultaneous writes allowed is a simultaneous AND: several processors may write 0 simultaneously into the same memory cell. Let t be the time bound of the computer.
We design below families of parallel algorithms that solve the string matching problem with inputs of size n (n is the sum of lengths of the pattern and the text) and have the following performance in terms of p, t and n:
1. For WRAM: pt = O(n) for p ≤ n/log n.
2. For PRAM: pt = O(n) for p ≤ n/log^2 n.
3. For WRAM: t = constant for p = n^(1+ε) and any ε > 0.
4. For WRAM: t = O(log n/log log n) for p = n.
Similar families are also obtained for the problem of finding all initial palindromes of a given string.
1. Introduction.
*Research supported by National Science Foundation Grant MCS-8303139. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1984 ACM 0-89791-133-4/84/004/0240 $00.75
We design parallel algorithms in the following model: p synchronized processors (RAM's) share a common memory. Any subset of the processors can simultaneously read from the same memory location.
We sometimes allow simultaneous writing in the weakest sense: any subset of processors can write the value 0 into the same memory location (i.e., turn off a switch). We denote by WRAM [PRAM] the model that allows [does not allow] simultaneous writing.
We also consider (but only briefly) other models of parallel computation.
We actually design a family of algorithms because we have a parameter p.
The performance of the family is measured in terms of three parameters: p--the number of processors, t--the time, and n--the size of the problem instance.
It is well known that every parallel algorithm with p processors and time t can be easily converted to a sequential algorithm of time pt.
Hence the analog of linear-time algorithm in sequential computation is a family of parallel algorithms with pt = O(n).
We therefore call such algorithms optimal.
Surprisingly, while there are many problems for which linear-time algorithms are known, there are very few problems for which optimal parallel algorithms are known for a wide range of p.
So few that we list them here. Every associative function of n variables can be computed by a PRAM in pt = O(n) for p ≤ n/log n.
(Use a binary tree, each leaf "treats" n/p inputs.)
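The binary-tree scheme in the parenthetical remark can be sketched as a small sequential simulation. This is only an illustration under stated assumptions: the function name `parallel_reduce` and the way time is counted (one unit per application of the operation along the longest path) are not from the paper.

```python
from functools import reduce
from math import ceil

def parallel_reduce(values, op, p):
    """Simulate p processors combining `values` with an associative `op`.

    Returns (result, parallel_time). parallel_time counts the sequential
    work inside one leaf plus one unit per level of the binary tree.
    Assumes `values` is non-empty and p >= 1.
    """
    n = len(values)
    chunk = ceil(n / p)
    # Phase 1: each simulated processor folds its own chunk of about n/p inputs.
    partial = [reduce(op, values[i:i + chunk]) for i in range(0, n, chunk)]
    time = chunk - 1
    # Phase 2: combine the partial results pairwise, one tree level per round.
    while len(partial) > 1:
        partial = [reduce(op, partial[i:i + 2]) for i in range(0, len(partial), 2)]
        time += 1
    return partial[0], time

total, steps = parallel_reduce(list(range(1, 17)), lambda a, b: a + b, 4)
```

With n = 16 and p = 4 the simulated time is 3 steps inside each leaf plus 2 tree levels, so the product of p and the time stays within a constant factor of n.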
For a certain subset of these functions, including the n-variable OR (AND), Ω(log n) time is needed on the PRAM [CD], so pt = O(n) is unattainable for p >> n/log n. Consequently, the only question left is with how few processors we can compute these functions in constant time on a WRAM. The answer depends on the specific function. The n-variable OR (or AND) function can be computed by a WRAM in pt = n for p ≤ n (i.e., in time 1 with n processors). The n-variable MAXIMUM function can be computed in pt = O(n) for p ≤ n/log log n and in constant time with n^(1+ε) processors (for every ε > 0) [V], [SV].
Optimal parallel algorithms are known for merging two sorted arrays (for p ≤ n/log n on a PRAM); merging can be done in constant time even by a PRAM with n^(1+ε) processors [SV] and in time log log n with n processors [V], [BM].
Recently, optimal parallel algorithms were designed for the problem of converting an expression to its parse tree [BV] and for Selection [Vi] .
What is common to all these problems except Selection is that for each one of them there is a trivial (sequential) linear-time algorithm.
In this paper we design optimal parallel algorithms for string matching.
The linear-time algorithm for string matching is by now very well understood, but at one time, it was quite a major discovery.
Unlike the case of computing n variable functions (where it is trivial) and merging (where it is quite simple) designing optimal parallel algorithms for string matching was not immediate.
As for the problems mentioned above, we also design parallel algorithms that perform string matching on a WRAM in constant time with only n^(1+ε) processors.
As in the cases above, the time is proportional to 1/ε.
If only n processors are available the time needed is O(log n/log log n).
The families of algorithms we design have several appealing features:
1. They are not derived from any of the variants of the linear-time sequential algorithms ([KMP], [BM]). The latter do not seem to be parallelizable, because they construct sequentially tables which are used sequentially. So, even giving the tables for free does not seem to help much. Two known algorithms are parallelizable but do not yield optimal parallel algorithms: the O(n log n) algorithm in [KMR] yields tp = O(n log^2 n) and the probabilistic linear-time algorithm in [KR] yields a probabilistic family with tp = O(n log n).
2. The algorithms we design are all derived from one algorithm: an algorithm for WRAM with p = n and t = log n for the case that the text is twice as long as the pattern.
3. The algorithms make use of properties of periodicities in strings derived from the Periodicity Lemma, which states that two different periodicities cannot coexist long enough (if they do, then there is a common refinement). Similar properties were used in a different way to design a linear-time algorithm for string matching which uses only a constant number (five) of registers [GS]. Therefore, we have here an example of a relationship between sequential space and parallel time at the lowest level.
4. As in the algorithm in [GS], it is possible to write a very short program (for each processor), but a longer explanation is needed, mainly because the algorithm implicitly uses properties of periodicities several times.
5. The algorithms use what seems to be a novel method of communication among the various processors, as will be indicated below.
String matching is the following problem.
The input consists of two strings, x (the pattern) and y (the text), over a given alphabet of a fixed size. The output is a Boolean array indicating all the occurrences of x in y. In Section 2 we prove several simple facts on periodicities of strings used by the algorithm.
In Section 3 we sketch the main algorithm, which is nonoptimal (p = 3n, t = log n) and deals only with a special case (|y| = 2|x| = 2n).
In Section 4 we complete the details of the algorithm.
In Section 5 we show how the four families of parallel algorithms mentioned above are derived from the main algorithm.
In Section 6 we briefly discuss other models of parallel computation and the problem of finding all initial palindromes of a given string.
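To make the input/output convention of the problem concrete, here is a brute-force reference that produces the Boolean array of occurrences. It is a sketch for checking other code against, not one of the algorithms of this paper; the name `occurrences` and the 0-based indexing are assumptions made for the example.

```python
def occurrences(x, y):
    """Return a list b where b[j] is True iff x occurs in y starting at index j (0-based).

    Positions too close to the end of y give a shorter slice, which can never
    equal a non-empty pattern, so they are reported as False.
    """
    m = len(x)
    return [y[j:j + m] == x for j in range(len(y))]

result = occurrences("aba", "ababa")
```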
2. Periodicity in Strings.
A string u is a period of a string w if w is a prefix of u^k for some k, or equivalently if w is a prefix of uw. We call the shortest period of a string w the period of w. Thus a is the period of w = aaaaaaa, while aa, aaa, etc. are also periods of w. We say that w has period size P if the length of the period of w is P. If w is at least twice as long as its period we say that w is periodic. We will consider prefixes of the pattern x of increasing length. Assume we consider a prefix u and then a prefix v.
In the case that u is periodic we will say that the periodicity continues in v if the period of v is the same as the period of u (e.g. u = abcabcab, v = abcabcabcabcabca) and that the periodicity terminates otherwise (e.g. the same u, v = abcabcabcd...).
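The definitions above translate directly into a few lines of code. The sketch below uses the characterization "w is a prefix of uw"; the function names are hypothetical and the quadratic running time is chosen for clarity, not efficiency.

```python
def period_size(w):
    """Length of the shortest period of w: the least P >= 1 with w a prefix of w[:P] + w."""
    for P in range(1, len(w) + 1):
        if (w[:P] + w).startswith(w):
            return P
    return 0  # only reached for the empty string

def is_periodic(w):
    """w is periodic if it is at least twice as long as its period."""
    return len(w) > 0 and len(w) >= 2 * period_size(w)

def periodicity_continues(u, v):
    """For a periodic prefix u and a longer prefix v: is the period of v that of u?"""
    return v[:period_size(v)] == u[:period_size(u)]

u = "abcabcab"
continues = periodicity_continues(u, "abcabcabcabcabca")
terminates = not periodicity_continues(u, "abcabcabcd")
```

The two calls at the end replay the examples given in the text: the periodicity of u continues in the first v and terminates in the second.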
We will need some simple facts about periodicities.
Fact 1 (The Periodicity Lemma) [LS]: If w has two periods of size P and Q and |w| ≥ P + Q, then w has a period of size gcd(P, Q).
For a one line proof see [GS] . In the rest of this section an occurrence at j will mean an occurrence at position j in a given fixed string z.
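Fact 1 is easy to test exhaustively on short strings, which can help build intuition before relying on it. The sketch below is a brute-force check and not a proof; `has_period` and `lemma_holds` are names chosen for this example, and a period of size P is taken to mean w[i] = w[i-P] for all i ≥ P.

```python
from itertools import product
from math import gcd

def has_period(w, P):
    """True iff w[i] == w[i - P] for every i >= P, i.e. w has a period of size P."""
    return all(w[i] == w[i - P] for i in range(P, len(w)))

def lemma_holds(w):
    """Check Fact 1 on w for every pair of period sizes P, Q with |w| >= P + Q."""
    sizes = [P for P in range(1, len(w) + 1) if has_period(w, P)]
    return all(has_period(w, gcd(P, Q))
               for P in sizes for Q in sizes if len(w) >= P + Q)

# Exhaustive check over all binary strings of length at most 10.
assert all(lemma_holds("".join(t))
           for n in range(1, 11) for t in product("ab", repeat=n))
```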
Fact 2: If v occurs at j and j + P, P ≤ |v|/2, then (1) v is periodic with a period of length P, and (2)
Proof: Otherwise Δ = mP + r, 0 < r < P, and m < k. Let w = u^(k-m) u'. w is a suffix of v, so it occurs at j + mP. It is also a prefix of v, so it occurs at j + Δ = j + mP + r.
By Fact 2, w has a period of size r in addition to a period of size P. By Fact 1, it has a period of size gcd(P, r) < P which divides P.
Hence P cannot be the period size of v. □
We call an occurrence of v at j important if v does not occur at j + P.
The input is a string z = x $ y of length 3n + 1.
x is the pattern, |x| = n, and y is the text, |y| = 2n. Both are over a given alphabet of fixed size which does not contain $.
The output is a Boolean array of length 3n + 1 called SWITCH. x^(i) occurs at j. We now describe stage i + 1, which takes a constant number of steps (at most six).
The task of the stage is to test whether each occurrence of x^(i) is followed by an occurrence of y^(i).
In case the answer is negative the corresponding 1 in SWITCH is turned off.
We divide the array SWITCH into blocks of size 2^(i-1). We say that property i holds if each block has at most one 1.
We distinguish between two cases: the regular case, and the periodic case.
The regular case is the one in which the first block of SWITCH has only one 1 (at position 1).
By induction, the other blocks may have at most two 1's.
In a block with two 1's, the 1 at the smaller position is turned off.
(This occurrence of x^(i) is not a beginning of an occurrence of x^(i+1).) As a result, property i holds.
There are 2^(i-1) processors responsible for the block.
Hence, in two steps they can test for y^(i) at the appropriate position, provided they know which comparisons they ought to perform. We will explain below how this is done.
We call it a regular step.
In the periodic case that follows a regular case the first block has two 1's, the second of which is at position P + 1. It follows from Fact 2 that x^(i) is periodic with period size P.
In the periodic case we test whether the periodicity of x^(i) continues in x^(i+1). We do it in two steps using x^(i) as a yardstick. If x^(i+1) has the same period we similarly find all its occurrences.
Then we start stage i + 2 in the periodic case. If x^(i+1) does not have the same period we turn off (justifiably) many 1's in SWITCH.
As a result, property i holds and we complete the stage with a regular step.
Each part in the discussion above makes some use of properties of periodicities.
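The net effect of the stages on SWITCH can be replayed sequentially. The sketch below rests on assumptions that the text here does not state in full: x^(i) is taken to be the prefix of x of length 2^i, |x| is a power of two, indices are 0-based, and overhanging occurrences are ignored. It reproduces only the contents of SWITCH after each stage; blocks, processor allocation and bulletin boards are not modelled, and `run_stages` is a hypothetical name.

```python
def run_stages(x, y):
    """Sequentially replay the effect of the stages on SWITCH.

    Assumes len(x) is a power of two >= 2 and that x^(i) is the prefix of x
    of length 2**i. Returns the list of SWITCH snapshots, one per stage.
    """
    z = x + "$" + y
    length = 2
    # Stage 1: SWITCH[j] is on iff z[j:j+2] equals the first two symbols of x.
    switch = [z[j:j + 2] == x[:2] for j in range(len(z))]
    history = [switch[:]]
    while length < len(x):
        second_half = x[length:2 * length]
        # Stage i+1: keep a 1 only if the occurrence of x^(i) is followed by
        # the second half of x^(i+1); otherwise the 1 is turned off.
        switch = [on and z[j + length:j + 2 * length] == second_half
                  for j, on in enumerate(switch)]
        length *= 2
        history.append(switch[:])
    return history

history = run_stages("abab", "abababab")
final_ones = [j for j, on in enumerate(history[-1]) if on]
```

The number of snapshots equals log |x|, matching the stage count of the main algorithm; each stage only ever turns 1's off.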
During the algorithm the processors need to communicate.
For global communication we have a bulletin board, BB, where some announcements are posted; e.g., whether the case is periodic and the size of the period. Also, the processors responsible for a block need to communicate in order to find which comparisons they ought to make in a regular step.
For this purpose we have local bulletin boards, lbb's.
We can use an additional array to store the lbb's. Alternatively, each lbb can be stored at the last element of its block.
At the end of each stage one of every two consecutive lbb's dies and may transfer some information to the surviving one before it passes away.
(See Figure 1.)
4. The Details.
The flow chart of the algorithm is given in Figure 2 .
In this section we give the details of each one of the seven boxes in the flow chart.
The first and last stage are slightly different and are discussed at the end of the section.
We enter box 1 after a regular step in stage i. Consider blocks number 2j-1 and 2j at the end of stage i. Each contains at most one 1. The lbb of the first block dies at the end of the stage. The processor responsible for the second lbb (number 2j·2^(i-2)) looks at the dying lbb and, if it is not empty, tries to transfer its contents to its own lbb. Two 1's per block are discovered when its lbb is already nonempty.
Box 1 deals with the case j = 1. If two 1's are discovered in the (new) first block we are in the periodic case, which is explained below.
Boxes 2, 3 deal with the case j > 1.
If two 1's are discovered the first is turned off by the processor responsible for the surviving lbb.
(It is the processor that discovers the two 1's.)
To understand box 4, the regular step, consider Figure 1.
If the occurrence starts at z_(ℓ+1), then the lbb contains ℓ.
Processor j in the group that corresponds to the block makes two comparisons: z_k = z_(k+ℓ)? for k ∈ {j + 2^i, j + 2^i + 2^(i-1)}. If one of the answers is negative it turns off the 1 at SWITCH(ℓ+1). This is the only place where the concurrent write is used.
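The regular step for a single block can be simulated as follows. This is a sketch under assumptions: the two comparison indices are read as k = j + 2^i and k = j + 2^i + 2^(i-1) for processors j = 1, ..., 2^(i-1), the bulletin-board content is passed in as the parameter `ell`, positions are 1-based as in the text, and the loops stand in for processors that would act simultaneously.

```python
def regular_step(z, switch, i, ell):
    """Simulate one block's regular step of stage i+1 (1-based positions).

    z is the input string, switch a dict position -> 0/1, and ell the content
    of the block's bulletin board, so the candidate occurrence starts at ell + 1.
    Each of the 2**(i-1) simulated processors makes two comparisons; any
    mismatch writes 0 into SWITCH(ell + 1), which models the concurrent AND.
    Assumes every compared position lies inside z.
    """
    half = 2 ** (i - 1)
    for j in range(1, half + 1):                    # processors of the block
        for k in (j + 2 ** i, j + 2 ** i + half):   # the two comparisons
            if z[k - 1] != z[k + ell - 1]:          # z_k = z_(k+ell) ?
                switch[ell + 1] = 0                 # only zeros are ever written
    return switch

z = "abababab" + "$" + "ababababababacab"
sw = {10: 1, 18: 1}          # two occurrences of the prefix abab inside the text
regular_step(z, sw, 2, 9)    # candidate at position 10 extends to abababab
regular_step(z, sw, 2, 17)   # candidate at position 18 hits the symbol c
```

Because every write stores the same value 0, the order in which the simulated processors run does not affect the result, which is why the weak simultaneous AND suffices here.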
The test in box 1 is actually handled differently. Since SWITCH(1) = 1, the processor responsible for the second lbb of stage i (the first of stage i+1) looks at its lbb.
If it is nonempty it contains P; i.e., x^(i) is periodic with period size P. The processor posts P on BB.
During the periodic loop (boxes 5, 6) the lbb's are not updated and are not used. Since SWITCH(1) = 1 and one of SWITCH(P+1), SWITCH(L+1) is zero, at least one of the occurrences of x^(i) at positions j ≤ L + 1 - P is important. By Facts 5, 6, either the occurrence at 1 is important or there is exactly one important occurrence at some j, 1 < j ≤ L + 1 - P.
When we test if the periodicity continues, first, p_1 checks SWITCH(P+1). If it is zero, then the occurrence at 1 is important and p_1 posts 0 on BB.
Otherwise it tests SWITCH(P+L).
If it is 1, the periodicity continues.
Otherwise, each processor p_j tests (using SWITCH) whether there is an important occurrence at j. The unique p_j that succeeds posts j-1 on BB. Before executing the regular step (box 4) the lbb's are restored.
Each processor p_r with SWITCH(r) = 1 writes r-1 in its lbb.
By Claim 2, no conflict occurs. To be able to do it, each processor knows in each stage where its lbb is.
This information can be easily precomputed or updated dynamically.
The first stage is very simple. Processor p_j tests whether z_j z_(j+1) = x_1 x_2. If the test succeeds, p_j turns on SWITCH(j) and makes the j-th lbb for the second stage point to the 1.
Recall that the size of the blocks in the second stage is 1. We now discuss the changes needed for the last stage, but first we need to elaborate on the other stages.
Consider stage i+1, and an occurrence of x^(i) at j ≤ n.
Assume j + 2^(i+1) > n + 1, so x^(i+1) cannot occur at j simply because it is too long, and the $ does not match any symbol of x.
In case the first mismatch from the left is the $ the algorithm will not turn off the 1 at SWITCH(j).
(It is as though the $ and the following symbols always match the symbols compared to them.)
As a result, a 1 in SWITCH may stand for an overhanging occurrence.
In the last stage, if property i holds, or if the periodicity terminates (and as a result of including overhanging occurrences it means that it terminates before the $), we execute a regular step without any change. The only change is in the case that the periodicity continues.
While in the other stages it means that the periodicity continues to x^(i+1), in the last stage it continues only to the $.
We find ourselves in this case when L + 2^i ≥ n (|x^(i+1)| ≥ |x|).
We call an occurrence of x^(i) at j special if j + 2^i ≤ n and j + P + 2^i > n + 1 (i.e., if the next occurrence of x^(i), at j + P, is the first overhanging occurrence).
As with important occurrences, the unique p_j that finds a special occurrence at j posts j-1 on BB.
(Note that j = mP + 1 for some m, x = u^m u^k u'u'', u'u'' a prefix of u^2.) Then each p_r that sees a 1 at SWITCH(r) checks whether SWITCH(r+j-1) = 1 and if not it turns off the 1. If the test succeeds it checks whether SWITCH(r+j-1+P) = 1.
If the test succeeds we know that x occurs at r (since the tests imply that u^(m+k+1) u' occurs at j).
If the test fails we still do not know the answer.
Note that in this case the occurrence of x^(i) at r+j-1 is important and by Fact 6 if we restrict attention to occurrences at r's such that the occurrence at r+j-1 is important, then property i holds. So we activate the lbb's and use a regular step to test whether such occurrences of x^(i) extend to occurrences of x.
5. The Four Families.
5.1 Using only n/log n processors.
The main algorithm can be implemented with only n/log n processors using the Four Russians trick [AHU] to pack log n symbols into one number.
Each processor is responsible for s consecutive symbols in z and in SWITCH, where s = c log n and c depends on the alphabet size: processor p_r will be responsible for z_j, SWITCH(j), j ∈ A_r = {(r-1)s+1, ..., rs}.
First, each p_r packs each substring of z of length s that starts with z_j, j ∈ A_r, into a new symbol z'_j. Then it compares each z'_j, j ∈ A_r, with z'_1 and if they are equal it sets SWITCH(j) = 1.
This has the effect of the first t = log s stages and takes O(s) = O(log n) time.
Assume the next ((t+1)-st) stage is in the regular case.
The other stages are as in the main algorithm.
The only difference is that in each regular step the packed symbols z'_j are used.
If the (t+1)-st stage is periodic, then the period size P < s/2, and we need also to pack the bits in SWITCH.
Each p_r packs the s consecutive segments of SWITCH starting with each SWITCH(j), j ∈ A_r. When the periodicity continues and we test for occurrences of x^(i+1) we can handle all the 1's in a packed symbol of SWITCH simultaneously using some simple bit vector operations on the packed symbols.
Even if we disallow bit vector operations, the n/log n processors can prepare (in time O(log n)) a table to implement these operations.
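The packing idea of this subsection can be illustrated with a short sketch: every window of s symbols is encoded as one integer, so that one integer comparison replaces s symbol comparisons. The function name `pack_windows`, the base-|alphabet| encoding and the padding rule at the end of the string are assumptions of the example, not details taken from the paper.

```python
def pack_windows(z, s, alphabet):
    """Encode every length-s window of z as one integer in base len(alphabet).

    Windows that run past the end of z are padded with the first alphabet
    symbol (digit 0). Listing the separator $ first keeps padded windows from
    matching a window of pattern symbols, since the pattern contains no $.
    """
    code = {a: d for d, a in enumerate(alphabet)}
    base = len(alphabet)
    packed = []
    for j in range(len(z)):
        value = 0
        for a in z[j:j + s].ljust(s, alphabet[0]):
            value = value * base + code[a]
        packed.append(value)
    return packed

z = "abab$abba"
packed = pack_windows(z, 2, "$ab")
# Comparing each packed symbol with the first one marks the windows equal to z[0:2].
switch = [v == packed[0] for v in packed]
```

Two full windows are equal exactly when their packed values are equal, which is what lets the first log s stages collapse into one pass of comparisons against the first packed symbol.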
5.2 The general case.
We now have an algorithm with t p_0 = O(n) for p_0 = n/log n.
This immediately yields a family with tp = O(n) for p ≤ n/log n because of the well-known downward translation. In general, if t p_0 = f(n), then we have a family with tp = f(n) for p ≤ p_0, because having only p processors, each one will simulate p_0/p processors and the time will be slowed down by a factor of p_0/p.
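The arithmetic of the downward translation is small enough to write out. The helper below is a hypothetical illustration; it rounds p_0/p up, an assumption made to cover the case where p does not divide p_0.

```python
from math import ceil

def simulated_time(t, p0, p):
    """Time for p <= p0 processors to simulate a t-step algorithm written for p0 processors."""
    assert 1 <= p <= p0
    # Each real processor plays the role of ceil(p0 / p) virtual ones in every step.
    return t * ceil(p0 / p)

# With t = 10 and p0 = 1000 the product t * p0 is 10000; using 100 processors
# stretches the time to 100, so p times the time is again 10000.
slow = simulated_time(10, 1000, 100)
```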
We still have to deal with the case in which |x| and |y| are unrelated. Let n = |x| + |y| (the length of the input) and m = |x|.
If p ≤ 2n/m we divide y into p/2 equal parts.
Let the i-th piece be the concatenation of the i-th and (i+1)-st parts.
There are p pieces and we assign one processor per piece.
The size of a piece S = 2|y|/(p/2) satisfies 4n/p ≥ S ≥ 2n/p ≥ m.
Each processor looks for all occurrences of x in its piece in time O(S) = O(n/p).
Hence in this case, when we have a small number of processors, we have an optimal algorithm simply because we still solve the problem sequentially.
If p > 2n/m (p ≤ n/log m) we break y into overlapping pieces of size 2m. The number s of such pieces satisfies n/m ≤ s ≤ 2n/m < p. We assign p/s (≤ m/log m) processors per piece.
By the first paragraph above, all the occurrences in a piece can be found in time t such that t·(p/s) = O(m), or tp = O(ms) = O(n).
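The decomposition into overlapping pieces of size 2m can be sketched as below. The assumptions are labeled: pieces start at multiples of m (one natural way to obtain the overlap), indices are 0-based, and each piece is searched by brute force as a stand-in for the main algorithm that a group of processors would run on it.

```python
def match_by_pieces(x, y):
    """Find all occurrences of x in y by searching overlapping pieces of length 2m.

    Piece number c starts at index c*m, so consecutive pieces overlap in m
    symbols and every occurrence of x lies entirely inside at least one piece.
    Assumes x is non-empty.
    """
    m = len(x)
    found = [False] * len(y)
    for start in range(0, len(y), m):          # one group of processors per piece
        piece = y[start:start + 2 * m]
        for j in range(len(piece) - m + 1):    # empty when the piece is shorter than m
            if piece[j:j + m] == x:
                found[start + j] = True
    return found

found = match_by_pieces("aba", "ababaxxaba")
```

Since the pieces are independent, they can be processed simultaneously, and the work per piece is that of the special case in which the text is twice as long as the pattern.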
5.3 On the PRAM.
Consider the main algorithm. The only case of concurrent write is the regular step: the 2^(i-1) processors of a block compute an AND.
If we do not allow concurrent write, we can no longer execute one stage in constant time.
The algorithm on the PRAM takes time O(log^2 n), because each stage takes O(log n) time. Fortunately, we can implement this algorithm with only n/log^2 n processors. Each processor is responsible for log^2 n symbols or for log n packed symbols. In a regular step, the processors in a block make log n comparisons of packed symbols (in time log n).
They record only whether all the comparisons succeed. Then using the implicit tree structure, they 'and' their results in time O(log n).
The discussion above yields an algorithm on a PRAM with p = n/log^2 n and t = O(log^2 n). The rest is as in subsection 5.2. The algorithm can be implemented without simultaneous reads.
5.4 Having many processors.
Assume |y| = 2|x| = 2n. As was noted above, with n^2 processors we can solve string matching in constant (t = 2) time on the WRAM.
We show below that if p = n^(1+1/k) we can solve string matching in time O(k).
This immediately gives the third and fourth families:
for the third, take ε = 1/k and the constant is k.
For the fourth, take k = log n/log log n.
In this case p = n log n, but by packing symbols we reduce p to n. In this subsection we use a stronger version of WRAM.
In case of a write conflict the processor with the minimum number is the one that writes.
At the moment, it is not known whether such a WRAM can be simulated by our weaker type without time loss.
However, in our case, such simulation is possible.
Assume one subset of p processors tries to write simultaneously into a register and the processor with the minimal number succeeds.
It was observed in [FRW] that our weaker model of WRAM can do the same in four steps: the processors are partitioned into √p groups of size √p. In the first step each group computes whether one of its members wants to write. The result is a Boolean array of size √p. In the second step the 1's in that array that are not first are turned off. This is possible because there are √p processors for each 1. Now, the processors in the corresponding group find in a similar way the minimal in the group.
Such a simulation will easily be extended to our case.
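The four-step simulation attributed to [FRW] can be sketched sequentially. The assumptions here are that the number of processors is a perfect square, that the groups have equal size, and that nested loops stand in for processors acting at once; `first_one` and `min_writer` are hypothetical names. Every simulated write stores a 0, as the weak model requires.

```python
from math import isqrt

def first_one(bits):
    """Index of the first 1 in bits, found with AND-style writes only.

    One simulated processor per pair (a, b) with a < b: if bits[a] is 1 it
    writes 0 into keep[b]. Every write is a 0, so concurrent writes agree.
    """
    keep = list(bits)
    for a in range(len(bits)):
        for b in range(a + 1, len(bits)):
            if bits[a]:
                keep[b] = 0
    return keep.index(1) if 1 in keep else None

def min_writer(wants):
    """Smallest-numbered processor that wants to write, or None.

    Assumes len(wants) is a perfect square, so the processors split into
    sqrt(p) groups of sqrt(p) processors each.
    """
    g = isqrt(len(wants))
    assert g * g == len(wants)
    # Step 1: group_flag[c] is 1 iff some member of group c wants to write
    # (an OR, obtained by turning off an "idle" flag with zero-writes).
    idle = [1] * g
    for r, w in enumerate(wants):
        if w:
            idle[r // g] = 0
    group_flag = [1 - b for b in idle]
    # Step 2: keep only the first non-empty group.
    c = first_one(group_flag)
    if c is None:
        return None
    # Steps 3 and 4: the same elimination inside that group.
    return c * g + first_one(wants[c * g:(c + 1) * g])

wants = [0] * 16
wants[6] = wants[7] = wants[11] = 1
winner = min_writer(wants)
```

Each elimination uses about (√p)^2 = p processor pairs, which is why p processors are enough for both rounds.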
When we have n or more processors we can use them to have x^(i+1) more than twice as long as x^(i) and, as a result, to have fewer than log n stages. Specifically, let p = 3n^(1+1/k). The processors are divided into 3n groups of n^(1/k) processors. Each group contains one principal processor, and is responsible for one symbol of z and SWITCH. The length of x^(i) is n^(i/k). In the first stage (finding all occurrences of x^(1)) the i-th group looks for an occurrence at i. The size of the blocks for stage i + 1 is |x^(i)|/2 = n^(i/k)/2. A regular step is simple, since we have enough processors: the number of processors in the groups corresponding to a block is n^((i+1)/k)/2 = |x^(i+1)|/2.
The parts concerning periodicity are slightly different, because the size of blocks much more than doubles from one stage to the next.
To test for periodicity, each principal processor in the first block that sees 1 writes its group number minus 1 on the same place of BB.
The one with the minimal group number succeeds, and posts the period size P. 
so x(i+l) < I~ (i+l)] < 3x (i+l) .)
If the test succeeds, a similar test is used to test which occurrence of x^(i) is extended to an occurrence of x^(i+1). If the test fails, using the stronger form of concurrent writing the first group finds the first j with SWITCH(1 + jL_i) = 0. The value of j is posted on BB, and next SWITCH(r) = 1 is not turned off only if the r-th group finds that SWITCH(r + jL_i) = 0 and, for all k < j, SWITCH(r + kL_i) = 1.
The stronger type of concurrent write is used only within groups, and the memory locations are different for different groups.
The simulation mentioned above (for one group) can be obviously extended to our case.
We left out the details of allocating processors. For fixed k this task is immediate because we can assume that n = 2^(kr) for some r.
In the general case (|x| and |y| unrelated) the number of processors needed is only n m^(1/k) and with p = n the time bound is O(log m/log log m).
6. Conclusion.
We can implement the main algorithm in other models for parallel computation:
1. Boolean circuits of size O(n log^2 n) and depth O(log^2 n).
2. Fixed connection networks (the k-dimensional cube) and even networks with fixed degree (CCC's [PV]) in pt = O(n log n).
The details of these implementations are straightforward. Both use shifting networks as building blocks.
There are some questions unresolved:
1. Can we solve string matching on WRAM with n processors in constant (or O(log log n)) time?
2. Can we solve string matching deterministically on PRAM with n/log n (or even n) processors in O(log n) time? (The parallel version of [KR] has p = n, t = O(log n) but is probabilistic.)
3. Can we find optimal parallel algorithms for string matching on fixed connection networks?
Finally, families of parallel algorithms corresponding to all the families mentioned above can be derived for finding all initial palindromes of a given string w. The reduction of the latter problem to string matching [FP] does not help, because it makes use of the table of the KMP algorithm.
It is not clear how to compute this table efficiently in parallel. Instead we look for w in w^rev, recording in SWITCH also overhanging occurrences. The main algorithm discovers the initial palindromes of length ℓ, 2^(i-1) < ℓ ≤ 2^i, in stage i.
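The connection between initial palindromes and overhanging occurrences of w in its reverse can be checked directly. The sketch below uses brute-force comparison in place of the main algorithm; the name `initial_palindromes` and the 0-based indexing are assumptions of the example.

```python
def initial_palindromes(w):
    """Lengths l >= 1 such that the prefix w[:l] is a palindrome.

    Looks for w in its reverse, also accepting overhanging occurrences: an
    occurrence that starts at index n - l of the reverse and runs off the end
    matches on exactly l symbols, which happens iff w[:l] is a palindrome.
    """
    n = len(w)
    rev = w[::-1]
    return [l for l in range(1, n + 1) if rev[n - l:] == w[:l]]

lengths = initial_palindromes("abacaba")
```

The suffix of the reverse of length l is the reversed prefix of w of length l, so the comparison in the list comprehension is exactly the palindrome test for that prefix.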
