A Linear Feedback Shift Register, or LFSR, can implement an event counter by shifting whenever an event occurs. A single two-input exclusive-OR gate is often the only additional hardware necessary to allow a shift register to generate, by successive shifts, all of its possible nonzero values. The counting application requires that the number of shifts be recoverable from the LFSR contents so that further processing and analysis may be done. Recovering this number from the shift register value corresponds to a problem from number theory and cryptography known as the discrete logarithm. For some sizes of shift register, the maximal-length LFSR implementation requires more than a single gate, and for some the discrete logarithm calculation is hard. This paper proposes for such sizes the use of certain one-gate LFSRs whose sequence lengths are nearly maximal, and which support easy discrete logarithms. These LFSRs have a concise mathematical characterization, and are quite common. The paper concludes by describing an application of these ideas in a computer hardware monitor, and by presenting a table that describes efficient LFSRs of size up to 64 bits.
Introduction
Using a linear feedback shift register, or LFSR, is an extremely attractive way to generate a sequence of binary words: a single two-input exclusive-OR gate is often the only extra logic needed to make a shift register generate, by successive shifts, all of its possible nonzero values. Applications of LFSRs include error-correcting codes [1, 21, 22], pseudorandom sequence generation for ranging and synchronization [11], test-pattern generation and signature analysis in VLSI circuits [16], and program counters in simple computers [7, 19].
In this paper we study the use of LFSRs as event counters, in which the increment function is implemented by a shift of the register. Compared with the usual logic for a binary increment, an LFSR is wonderfully small and fast. The price for this is the problem of figuring out, after the fact, just how many times the register has been shifted. Recovering the actual number of shifts from the shift register contents corresponds to a problem in number theory and cryptography known as the discrete logarithm problem [17, 18]. Cryptographic applications of the discrete logarithm favor structures in which the calculation, because it is part of decrypting, is difficult [6]. We, on the other hand, are interested in systems in which the discrete logarithm is easy.
The key insight behind this application is that in a counting instrument, only the increment function needs to be fast. There is no need for any other on-line operations, such as comparison or general addition. Manipulation of the counts is done off-line, and hence can use calculations more expensive than the fast increment.
In the next section of this paper we first briefly review the basic ideas behind LFSRs and then go on to consider the problem of awkward sizes: those sizes of counter in which the maximum-period LFSR implementation requires more than a single gate, or the discrete logarithm problem is hard. (The otherwise appealing size of 32 bits unfortunately has both problems!) We propose for such sizes the use of LFSRs whose periods are very nearly maximal, and we explore the properties of such registers. Section 3 proves a theorem that precisely characterizes shift registers whose period is greater than one-half the maximum for their size.
Then in Section 4 we consider the discrete logarithm problem, and present a version of the Pohlig-Hellman-Silver algorithm [23]. We show how the use of near-maximal-period LFSRs can lead to faster discrete logarithm calculations and more efficient use of storage for the necessary tables. In Section 5 we discuss applications of this method, including our use of it in a hardware monitor for a VAX computer, and also present a table specifying useful long-period shift registers for sizes up to 64 bits. Section 6 contains concluding remarks and discusses possible extensions of the results.
LFSRs as Counters
Figure 1 shows a 5-bit example of the simplest type of autonomous (input-free) LFSR: one having a single 2-input exclusive-OR gate between two of its stages, with the end-to-end feedback connected to one input of the gate. For purposes of algebraic manipulation, the contents of an LFSR are commonly treated as the binary coefficients of a polynomial in $x$. Thus in Figure 1, the value 01010 would correspond to the polynomial $x^3 + x$. The shift register is wired to shift in the direction of increasing exponent, so a single shift is like multiplying by $x$, neglecting the feedback for the moment. One shift would therefore change the value 01010 into 10100, or the polynomial $x^4 + x^2$.
The next shift will involve the feedback circuitry and require arithmetic on the coefficients. The polynomial interpretation of the function of this circuitry is as follows. The exclusive-OR gate performs coefficient arithmetic modulo 2 (that is, in the Galois field GF(2)), so addition is the same as subtraction. The feedback connections in Figure 1 correspond to the characteristic polynomial $x^5 + x + 1$: whenever a shift produces an $x^5$ term, the circuit replaces it with $x + 1$, which amounts to reduction modulo $x^5 + x + 1$.
Suppose we initialize the LFSR to the (polynomial) 1, and then shift it $L$ times, thereby recording $L$ occurrences of some experimental event. This amounts to $L$ multiplications by $x$, subject to the feedback circuitry, and the polynomial result is therefore the remainder of $x^L / (x^5 + x + 1)$. In general, an $n$-bit LFSR of this kind has a characteristic polynomial of the form
$$p(x) = x^n + c_{n-1} x^{n-1} + \cdots + c_1 x + c_0 .$$
The $c_i$ are either 1 or 0, corresponding to the presence or absence of an exclusive-OR gate; and because $c_0$ is always 1, $p(x)$ is never divisible by $x$.
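The shift-as-multiplication view can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the register is modeled as an integer whose bit $i$ holds the coefficient of $x^i$, the name `lfsr_shift` is hypothetical, and the polynomial $x^5 + x + 1$ is the one the text later attributes to Figure 1.

```python
def lfsr_shift(state: int, poly: int, n: int) -> int:
    """One shift of an n-bit LFSR: multiply by x, then reduce modulo poly over GF(2)."""
    state <<= 1                      # multiplication by x
    if (state >> n) & 1:             # an x^n term appeared, so the feedback fires
        state ^= poly                # subtracting poly is an XOR in GF(2)
    return state

POLY_FIG1 = 0b100011                 # x^5 + x + 1 (assumed bit layout: bit i <-> x^i)
N = 5

s0 = 0b01010                         # x^3 + x
s1 = lfsr_shift(s0, POLY_FIG1, N)    # x^4 + x^2, no feedback involved
s2 = lfsr_shift(s1, POLY_FIG1, N)    # x^5 + x^3 reduces to x^3 + x + 1
print(format(s1, "05b"), format(s2, "05b"))  # prints 10100 01011
```

Counting $L$ events then corresponds to calling `lfsr_shift` $L$ times starting from the value 1.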
Of great interest are the periods of such circuits: when initialized to 1, after how many shifts will 1 reappear? From the polynomial point of view, so to speak, this is the same as asking: what is the smallest positive $m$ such that $x^m \equiv 1 \pmod{p(x)}$ or, equivalently, what is the smallest positive $m$ such that $p(x)$ evenly divides $x^m - 1$ over GF(2)? Thus we may speak of the period of a characteristic polynomial just as we do the period of an LFSR. In the counting application, the period represents the value at which the counter overflows, so the period must be greater than the maximum expected count. The period of the shift register of Figure 1 is 21. The maximum period for a 5-bit LFSR is clearly $2^5 - 1 = 31$ (the all-zero value never appears), and is achievable with a different characteristic polynomial: $x^5 + x^2 + 1$. The maximum possible period for an $n$-bit LFSR is $2^n - 1$, and it is well known that characteristic polynomials with maximum period, called primitive polynomials, exist for all $n$. Maximum-period LFSRs, primitive polynomials, and their corresponding algebras GF($2^n$), are the focus of most work with shift registers. Many applications need the maximum period or the Galois field, but the counting application does not. The idea of using an LFSR to count in a Galois field appears in Peterson [21].
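For small registers the period can be found by brute force, simply shifting until the value 1 reappears. The following sketch assumes the same integer encoding of polynomials (bit $i$ is the coefficient of $x^i$); the function name `period` is illustrative.

```python
def period(poly: int) -> int:
    """Smallest m > 0 with x^m = 1 (mod poly) over GF(2).

    Assumes poly has constant term 1, so the shift is a permutation of the
    nonzero states and the value 1 is guaranteed to recur.
    """
    n = poly.bit_length() - 1        # degree of the characteristic polynomial
    state, m = 1, 0
    while True:
        state <<= 1                  # multiply by x
        if (state >> n) & 1:
            state ^= poly            # reduce modulo poly
        m += 1
        if state == 1:
            return m

print(period(0b100011))  # x^5 + x + 1   -> 21
print(period(0b100101))  # x^5 + x^2 + 1 -> 31
```

This exhaustive search is only practical for narrow registers; for wide ones the period is better derived from the factorization of the polynomial.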
All primitive polynomials are irreducible (not divisible by any other polynomial over GF(2)), but some irreducible polynomials are not primitive, that is, some have periods less than the maximum. In fact, it is known that the period of an irreducible but nonprimitive characteristic polynomial of degree $n$ is a proper divisor of $2^n - 1$, and hence is at most one-third of the maximum. Such a polynomial partitions all of a shift register's $2^n - 1$ nonzero states into a number of equal-sized cycles, one of which contains the polynomial value 1.
Thus it would seem that in our application we would always choose an LFSR whose characteristic polynomial was primitive. Certainly when a characteristic polynomial of the desired degree is a trinomial, implying a one-gate LFSR implementation, it would have great appeal. Unfortunately, however, for many sizes of shift register, no primitive trinomial exists. There are 30 sizes between 1 and 64 that have this problem [26]; in particular, there are no primitive trinomials (indeed, no irreducible trinomials) for any size that is a multiple of 8 [24], sizes that might otherwise be favored by current computer architectures.
The alternative of using, for some desired shift register size, an irreducible but nonprimitive trinomial is unattractive because its period can be at most one-third of the maximum for that size. This would seem an inefficient use of the hardware, since an LFSR of fewer bits could potentially do as well. Primitive polynomials with more than three terms are another possibility, but unfortunately four-term polynomials are excluded because they are all divisible by $x + 1$ and hence not primitive. Primitive polynomials of five terms may be unattractive due to their extra hardware cost (although Wang and McCluskey [25] show that an alternative LFSR wiring pattern for some of these polynomials can use two, not three, exclusive-OR gates, at the cost of increased combinational delay). And for some values of $n$, as we shall see in Section 4, the discrete logarithm calculation for the maximum period is hard, and this consideration may rule out using a primitive polynomial of any number of terms.
There remains the option of using a reducible trinomial: one that is the product (over GF(2)) of two or more irreducible factors. Reducible polynomials have been studied as a way to produce shift registers with prescribed periods [9]. The periods of such polynomials can be extremely close to the maximum and the discrete logarithm calculation can be easier than it is for the maximum period. Consider, for example, the useful size of 32 bits, which has no primitive trinomial. If we make an LFSR using the reducible trinomial examined in Section 5.1, we get a period that is more than 99.95 percent of the maximum $2^{32} - 1$, and we also get a discrete logarithm procedure that is faster and much more space-efficient than the maximum-period one (details forthcoming in Section 5.1). A smaller example is the LFSR of Figure 1, whose period of 21 is 68 percent of the maximum period 31. As we will see in Section 5, reducible polynomials with long periods are quite common.
Reducible polynomials can also have short periods, however: $x^{32} + x + 1$, for example, is reducible and has period only 1023. It is interesting, therefore, to ask what properties of these polynomials might give them long periods. We turn next to this question.
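The short period of $x^{32} + x + 1$ can be checked without stepping through every state, by computing powers of $x$ modulo the polynomial with square-and-multiply. The sketch below is an illustration only; `mulmod` and `x_power` are hypothetical helper names, and polynomials are again encoded as integers with bit $i$ holding the coefficient of $x^i$.

```python
def mulmod(a: int, b: int, poly: int) -> int:
    """Product of polynomials a and b over GF(2), reduced modulo poly.

    Assumes deg(a) < deg(poly), so a single conditional XOR per step suffices.
    """
    n = poly.bit_length() - 1
    result = 0
    while b:
        if b & 1:
            result ^= a              # add the current multiple of a
        b >>= 1
        a <<= 1                      # a := a * x
        if (a >> n) & 1:
            a ^= poly                # keep a reduced modulo poly
    return result

def x_power(e: int, poly: int) -> int:
    """x^e mod poly by square-and-multiply (poly of degree at least 2)."""
    result, base = 1, 0b10
    while e:
        if e & 1:
            result = mulmod(result, base, poly)
        base = mulmod(base, base, poly)
        e >>= 1
    return result

P = (1 << 32) | 0b11                 # x^32 + x + 1

# 1023 = 3 * 11 * 31: x^1023 is 1, and no maximal proper divisor of 1023 works.
print(x_power(1023, P) == 1)                                    # prints True
print(any(x_power(1023 // q, P) == 1 for q in (3, 11, 31)))     # prints False
```

Testing only the exponents $1023/q$ for each prime $q$ dividing 1023 is enough, because the period must divide any exponent that yields 1.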
Guaranteeing Long Periods
In this section we will characterize in a precise way those polynomials whose periods are greater than half the maximum for their degrees. A characteristic polynomial of smaller period would be inefficient in the sense that a different polynomial of lesser degree could have a longer period.
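The period of a reducible polynomial is a function of the periods of its polynomial factors. Let the polynomial $p(x)$ factor (over GF(2), as usual) this way:
$$p(x) = p_1(x)^{b_1} \, p_2(x)^{b_2} \cdots p_k(x)^{b_k},$$
where the $p_i(x)$ are distinct irreducible polynomials of degrees $n_i$. If $s_i$ denotes the period of $p_i(x)$ and $\tilde{m}_i$ the period of $p_i(x)^{b_i}$, then the period $m$ of $p(x)$ is given by
$$m = \operatorname{lcm}(\tilde{m}_1, \tilde{m}_2, \ldots, \tilde{m}_k), \qquad (1)$$
$$\tilde{m}_i = 2^{\lceil \log_2 b_i \rceil} \, s_i . \qquad (2)$$
Equation (1) suggests that for period maximization, the polynomial factors' periods ought not to have any integer factors in common. Equation (2) suggests that repeated polynomial factors might be bad. We will now formalize these observations.
Theorem (Long Periods): Let $p(x)$ be an arbitrary polynomial over GF(2) of degree $n$, not divisible by $x$. Let $m$ be the smallest integer such that $p(x)$ evenly divides $x^m - 1$ over GF(2). Then $m$, the period of $p(x)$,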
We pull $2^n$ out of the product and write
This we can show by first observing [13] that
Because the $n_i$ are distinct (condition D) and are all greater than 1 (condition A), [...]
We will now prove that if $m > 2^{n-1}$ then all four conditions hold. We will argue by contradiction, demonstrating that the violation of any condition would imply that $m \le 2^{n-1}$. For notational simplicity, let $q(x)^b$ denote an arbitrary factor of $p(x)$, where $q(x)$ is irreducible and has degree $d$. Let $s$ be the period of $q(x)$ and let $\tilde{m}$ be the period of $q(x)^b$. Then from Equation (2) we know that
$$\tilde{m} = 2^{\lceil \log_2 b \rceil} \, s .$$
Since $q(x)^b$ has degree $bd$, the period of the product of all the other factors of $p(x)$ can be at most $2^{n-bd} - 1$. Of course there may not be any other factors, but if there are, then the least common multiple formula for $m$ (Equation (1)) implies that $m \le \tilde{m} \, (2^{n-bd} - 1)$. If we can show that
$$\tilde{m} \le 2^{bd-1}, \qquad (5)$$
then we will immediately have $m \le 2^{n-1}$. In the case that there are no polynomial factors of $p(x)$ other than $q(x)^b$ we have $m = \tilde{m}$ and $n = bd$, so (5) is the same as $m \le 2^{n-1}$. Thus, showing that (5) follows from a violation of some condition will establish, by contradiction, that condition's validity. We will look at the first three conditions in this way, and finish with a separate argument for condition D.
First, suppose that $q(x)^b$ violates condition A. Because $x$ cannot be a factor of $p(x)$, the only possibility of degree $d = 1$ is $q(x) = x + 1$, which has period 1. (We note in passing that if $p(x)$ is itself $x + 1$, then the theorem holds vacuously.) Then (4) [...] 272]. The periods' greatest common divisor must be at least 3, so the least-common-multiple formula for $m$ guarantees that $m < 2^{n-1}$. This completes the proof. $\Box$
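The two period rules used throughout this section, the least-common-multiple rule for a product of distinct factors and the $2^{\lceil \log_2 b \rceil} s$ rule for a factor repeated $b$ times, are easy to exercise numerically. The sketch below is illustrative only; the function names are hypothetical, and the caller is assumed to supply the period $s$ of each irreducible factor.

```python
from math import lcm  # available in Python 3.9 and later

def period_of_power(s: int, b: int) -> int:
    """Period of q(x)^b for irreducible q(x) of period s: 2^ceil(log2 b) * s."""
    return s << (b - 1).bit_length()     # (b - 1).bit_length() == ceil(log2 b) for b >= 1

def period_from_factors(factors) -> int:
    """factors: (s, b) pairs, one per distinct irreducible factor of p(x)."""
    return lcm(*(period_of_power(s, b) for s, b in factors))

# x^5 + x + 1 = (x^2 + x + 1)(x^3 + x^2 + 1): factor periods 3 and 7.
print(period_from_factors([(3, 1), (7, 1)]))   # prints 21

# (x^2 + x + 1)^2 = x^4 + x^2 + 1: the repeated factor only doubles the period.
print(period_from_factors([(3, 2)]))           # prints 6
```

The second example has degree 4 but period 6, well under half of the maximum 15, which is the kind of loss the theorem attributes to repeated factors.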
Count Recovery with Discrete Logarithms
We turn now to the problem of recovering the integer number of shifts from the polynomial contents of an LFSR. If the register is not too wide, this conversion could be done in several obvious ways, including simply creating a lookup table of all possible values. For wide registers we need a more sophisticated technique. In our application, we are willing to invest in a large amount of up-front precomputation in the count recovery algorithm because we will want to do many recoveries for the same shift register. Thus we are prepared to construct tables and calculate constants that will be re-used in every count recovery.
Suppose we have an $n$-bit LFSR with characteristic polynomial $p(x)$, of period $m$, and suppose it is initialized to 1. The set of values generated by such an LFSR is one representation of the cyclic group of order $m$. The group operation is polynomial multiplication modulo $p(x)$ over GF(2); the identity element is the polynomial 1; and the generator of the group is the polynomial $x$. The elements of the group are just $x^j \bmod p(x)$, $0 \le j < m$.
For notational convenience and because much of the following discussion applies to the abstract group independent of representation, we will denote the group by $G$ and the generator by $g$, so the elements of $G$ will be $g^j$, $0 \le j < m$. The count recovery problem is then: given an element $y = g^L$ of $G$, find $L$.
We will use the Pohlig-Hellman-Silver algorithm [23], originally developed for the multiplicative group of a Galois field (corresponding to a maximum-period, or primitive, characteristic polynomial), and generalized to any cyclic group by Massey [15]. The key to this algorithm is the Chinese remainder theorem: instead of calculating $L$ directly, we will find the residues of $L$ modulo certain factors of $m$ and then use the theorem to compute $L$ itself by combining those residues. This method is attractive when all these factors are small compared to the period. (This leads cryptographers to shun such groups and favor instead groups whose orders have large prime factors.)
Let the order $m$ of the cyclic group $G$ be expressed as the product $m = m_1 m_2 \cdots m_k$ of $k$ factors that are pairwise relatively prime. We want the residues
$$r_i = L \bmod m_i, \qquad 1 \le i \le k . \qquad (7)$$
The Chinese remainder theorem says that there is exactly one value of $L$ in $[0 \ldots m-1]$ that satisfies the simultaneous equations (7):
$$L = \left( \sum_{i=1}^{k} r_i \, \frac{m}{m_i} \, v_i \right) \bmod m ,$$
where each $v_i$ is chosen to satisfy
$$\frac{m}{m_i} \, v_i \equiv 1 \pmod{m_i} .$$
To find the residues, let $g_i = g^{m/m_i}$ and $y_i = y^{m/m_i}$. Because $g^m = 1$, we have $y_i = g^{L m/m_i} = g_i^{r_i}$. In other words, $r_i$ is itself the discrete logarithm of $y_i$ in the cyclic subgroup of $G$ generated by $g_i$, a formulation due to Massey [15]. This subgroup has order only $m_i$, so if all the factors of $m$ are small, it is feasible to do the subgroup discrete logarithms by table lookup. (When $m$ has a large factor that is a power of a prime, Pohlig and Hellman give a more complicated method that saves table space [23].) The main work in the calculation would then be the $k$ exponentiations of $y$ in $G$. In our setting these operations would be polynomial exponentiations over GF(2), modulo some characteristic polynomial $p(x)$ of period $m$. Each exponentiation takes at most $2 \lceil \log_2 (m/m_i) \rceil$ polynomial multiplications [14].
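The Chinese remainder step on its own is short enough to sketch. The code below is an illustration, not the authors' implementation: the names `crt_weights` and `combine` are hypothetical, and the modular inverse uses Python's three-argument `pow` with exponent -1 (Python 3.8 and later).

```python
def crt_weights(moduli):
    """For pairwise coprime m_i with product m, return m and the constants
    (m / m_i) * v_i, where v_i is the inverse of m / m_i modulo m_i."""
    m = 1
    for mi in moduli:
        m *= mi
    weights = [(m // mi) * pow(m // mi, -1, mi) for mi in moduli]
    return m, weights

def combine(residues, moduli) -> int:
    """Recover L in [0, m - 1] from its residues r_i = L mod m_i."""
    m, weights = crt_weights(moduli)
    return sum(r * w for r, w in zip(residues, weights)) % m

# Period 21 = 3 * 7, as for a register with characteristic polynomial x^5 + x + 1.
print(crt_weights((3, 7)))          # prints (21, [7, 15])
print(combine((13 % 3, 13 % 7), (3, 7)))   # prints 13
```

Because the weights depend only on the period, they can be computed once and reused for every count recovery, in keeping with the precomputation strategy described earlier in the section.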
Consider an example using the counter of Figure 1, whose (reducible) characteristic polynomial is $x^5 + x + 1$, and whose period $m = 21$. Table 1 [...] We now look up the $y_i$ in Table 1 [...] We can check that this is right by cheating: in [...]
It is not hard to see that this faster and more compact approach does in fact find the same $r_i$. We know that $y^{m/m_i} \equiv x^{r_i m/m_i} \pmod{p(x)}$ and therefore that $y^{m/m_i} \equiv x^{r_i m/m_i} \pmod{p_j(x)}$, since $p_j(x)$ is a factor of $p(x)$. Now we divide each exponent by the integer $m/\tilde{m}$, which is relatively prime to $\tilde{m}$, the period of $p_j(x)$. This property guarantees that the resulting subgroup elements exist, and therefore $y^{\tilde{m}/m_i} \equiv x^{r_i \tilde{m}/m_i} \pmod{p_j(x)}$, which shows that the more efficient procedure does in fact find the same residue $r_i$ as the original procedure. For each factor $m_i$, therefore, we need a table of $m_i$ elements whose width in bits is equal to the degree of that polynomial factor whose own period has $m_i$ as a factor.
Compared with a primitive polynomial, then, a reducible trinomial offers four potential advantages in the counting application: first, fewer LFSR logic gates for degrees that have no primitive trinomial; second, smaller exponents on the discrete logarithm algorithm's input polynomial; third, shorter discrete log tables when the period has smaller factors; and finally, narrower log tables, according to the polynomial factors of the trinomial. The table space savings can be enormous, as we will shortly see.
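A complete count recovery for the 5-bit register with characteristic polynomial $x^5 + x + 1$ and period 21 can be sketched as follows. This is a minimal illustration of the basic procedure (subgroup tables built modulo the full polynomial, then Chinese remaindering), not the more compact variant that works modulo the individual polynomial factors; all function and variable names are assumptions made for the example.

```python
POLY, N, M = 0b100011, 5, 21         # x^5 + x + 1, 5 bits, period 21
FACTORS = (3, 7)                     # pairwise coprime factors of the period

def mulmod(a: int, b: int) -> int:
    """Polynomial product over GF(2) modulo POLY (inputs of degree < N)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> N) & 1:
            a ^= POLY
    return result

def power(y: int, e: int) -> int:
    """y^e modulo POLY by square-and-multiply."""
    result = 1
    while e:
        if e & 1:
            result = mulmod(result, y)
        y = mulmod(y, y)
        e >>= 1
    return result

# Precomputation: for each factor m_i, tabulate the subgroup generated by
# g_i = x^(M / m_i), mapping each element g_i^r back to r.
TABLES = {}
for mi in FACTORS:
    gi = power(0b10, M // mi)
    TABLES[mi] = {power(gi, r): r for r in range(mi)}

def recover_count(y: int) -> int:
    """Number of shifts L (0 <= L < M) that turn the value 1 into y."""
    total = 0
    for mi in FACTORS:
        ri = TABLES[mi][power(y, M // mi)]     # r_i = L mod m_i by table lookup
        vi = pow(M // mi, -1, mi)              # CRT constant v_i
        total += ri * (M // mi) * vi
    return total % M

state = 1
for _ in range(13):                  # record 13 events
    state <<= 1
    if (state >> N) & 1:
        state ^= POLY
print(recover_count(state))          # prints 13
```

The tables here hold 3 and 7 entries rather than the 21 a direct lookup would need; the saving is negligible at this size but grows quickly with register width.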
Applications
In this section we look more closely at a particular 32-bit LFSR, report on our use of a 36-bit LFSR in a hardware monitor, and give a table of useful trinomials, their periods, and the sizes of the required discrete log tables.
Details of 32-bit LFSR
Like all multiples of 8, degree 32 has no irreducible, and hence no primitive, trinomial. [...] (The large number of terms in these factors is of little concern since the factors themselves have no effect on the LFSR hardware.) The period of this trinomial is greater than 99.95 percent of the maximum, implying, according to the long-period theorem, that both polynomial factors are primitive. It is also smoother (has a smaller largest factor) than the maximum period: the period factors as $7^2 \cdot 23 \cdot 89 \cdot 127 \cdot 337$, whereas $2^{32} - 1 = 3 \cdot 5 \cdot 17 \cdot 257 \cdot 65537$. The minimum space requirements for the discrete logarithm are therefore: three tables of lengths 49, 127, and 337, each 21 bits wide; and two tables of 23 and 89 entries, each 11 bits wide. The grand total is 12,005 bits, or less than 0.57 percent of the requirement for a maximum-period LFSR. Table 2 gives the constants $v_i$ for this polynomial. There is one other trinomial of degree 32 whose period is greater than $2^{31}$, but its period is less than the one discussed above, and its log tables are bigger.
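The storage figures quoted for the 32-bit case can be reproduced with a few lines of arithmetic. The sketch assumes, as the table widths above indicate, primitive factors of degrees 11 and 21, and it follows the sizing rule of one table per prime-power factor of the period, with entries as wide as the degree of the corresponding polynomial factor. The function names are hypothetical.

```python
def prime_power_factors(m: int):
    """Prime-power factors of m (for example 49 for 7^2), by trial division."""
    out, p = [], 2
    while p * p <= m:
        if m % p == 0:
            q = 1
            while m % p == 0:
                m //= p
                q *= p
            out.append(q)
        p += 1
    if m > 1:
        out.append(m)
    return out

def table_bits(degrees) -> int:
    """Minimum table storage in bits when every polynomial factor is primitive."""
    return sum(d * sum(prime_power_factors(2**d - 1)) for d in degrees)

reducible_period = (2**11 - 1) * (2**21 - 1)
print(prime_power_factors(2**21 - 1))        # prints [49, 127, 337]
print(prime_power_factors(2**11 - 1))        # prints [23, 89]
print(reducible_period / (2**32 - 1))        # fraction of the maximum period
print(table_bits([11, 21]))                  # prints 12005
print(table_bits([32]))                      # prints 2106208
```

The last two lines give the ratio of roughly 0.57 percent between the reducible trinomial's tables and those a maximum-period 32-bit register would need.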
VAX hardware monitor
With several colleagues at Digital we designed and built a VAX hardware monitor that uses an LFSR for counting [5]. This application was the original motivation for this paper. The monitor implements Emer's micro-Program Counter histogram technique [8]: it maintains a count for every control store address and increments the count each time the corresponding microinstruction is executed. The counts are kept in a static RAM addressed by the micro-PC. In every machine cycle a RAM location must be read, its contents incremented, and the updated count written back. Interpretation of the counts and any necessary arithmetic using them is done off-line, after a measurement experiment is complete. Thus while the increment must be extremely fast, a slow off-line conversion is quite tolerable, so an LFSR counter is ideal. Its tiny size is a boon too. (A pipelined implementation is feasible, of course. It would allow a standard binary increment, but would cost extra hardware: the incrementer itself, pipeline latches, multiplexors, and bypassing logic for the case in which the same location is incremented in successive cycles.)
For the VAX application, we wanted to measure about an hour's worth of 45-nanosecond cycles, so the count width needed to be in the vicinity of 36 bits. Happily, there is a primitive trinomial of degree 36, $x^{36} + x^{11} + 1$, whose period is smooth: its biggest factor is 109. Thus the discrete logarithm method of Section 4 was well suited to our needs. (At the time we designed this monitor, in fact, we believed that a primitive polynomial was required. For 36 bits, in any case, there is no reducible trinomial whose log tables are smaller than those of the primitive one.)
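Both claims about the 36-bit counter, its capacity at a 45-nanosecond cycle and the smoothness of its period, can be checked with simple arithmetic. The sketch below is illustrative; `prime_factors` is a hypothetical helper using plain trial division, which is fast here because the period has only small prime factors.

```python
def prime_factors(m: int):
    """Prime factors of m with multiplicity, by trial division."""
    out, p = [], 2
    while p * p <= m:
        while m % p == 0:
            out.append(p)
            m //= p
        p += 1
    if m > 1:
        out.append(m)
    return out

CYCLE_SECONDS = 45e-9                # machine cycle time from the text
max_count = 2**36 - 1                # period of a primitive degree-36 polynomial

print(prime_factors(max_count))      # prints [3, 3, 3, 5, 7, 13, 19, 37, 73, 109]
print(max_count * CYCLE_SECONDS / 60)  # minutes until overflow, a little under an hour
```

With the largest prime factor only 109, every subgroup table in the discrete logarithm calculation stays very small.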
A measurement experiment starts with the initialization of all of the histogram counts to 1. Then programs of interest are run on the computer, without interference from the monitor. At the end of the measurement (and before any count has overflowed), further counting is disabled, the RAM contents read out, and the discrete logarithms calculated. The resulting integer counts are matched against the microcode listing file. Information from the listing file, such as labels and comments, guides the subsequent manipulation of these counts to yield performance statistics of interest to the experimenter. For example, one could calculate the total time spent executing a particular VAX opcode by summing the histogram counts for all of that opcode's microinstructions. Many other useful statistics can be imagined.
This monitor has been used for a host of measurements at Digital, some of which are reported in [2, 5], some of which have been used in the development of subsequent VAX processors, and some of which have been used to evaluate and tune software of various kinds.
Table of practical trinomials
Table 3 describes trinomials useful for counters up to 64 bits in width. We think 64 bits is a big enough counter for the foreseeable future: such a counter could count 10-picosecond events for five years before overflowing. We constructed the table by consulting Golomb [12] for trinomials up to degree 36, Zierler and Brillhart [26] for primitive trinomials up to degree 64, and using the Maple system [4] to investigate reducible trinomials between degrees 36 and 64. The table includes only trinomials of period greater than half the maximum for their degrees. (For each entry $x^n + x^a + 1$ there is a corresponding trinomial $x^n + x^{n-a} + 1$ whose period is the same, and whose polynomial factors have the same degrees as the listed one.) For each degree, the table gives the trinomial of longest period (often a primitive one) and the logarithm of the size of the required decoding tables. It then lists, in order of decreasing period, any reducible trinomials whose decoding tables are smaller than any already on the list. Periods are reported as fractions of the maximum. Table 3 therefore illustrates the tradeoff between period and decoding table size: for a particular degree, a sacrifice in period is acceptable only if the resulting tables are smaller. For some degrees, a primitive trinomial has the smallest tables or is the only trinomial of long period, and for these degrees no reducible trinomial is shown (degree 36 is an example).
For reducible trinomials, Table 3 gives the degrees of the polynomial factors. From these the exact period of the trinomial can be calculated by $m = \prod_i (2^{n_i} - 1)$, where the $n_i$ are the degrees of the factors. The size of the discrete logarithm tables is calculated according to the method of Section 4. There is a separate decoding table for each integer factor of the trinomial's period, including powers of primes. The width of each table in bits is the degree of the polynomial factor whose own period includes that integer factor. Of course any efficient lookup procedure would need extra bits for various reasons; we count only the minimum required to
Conclusion
In this paper we have combined some old results from the theory of linear feedback shift registers with a computational method from cryptography to show how to make very big event counters that are very cheap and very fast. While earlier work has focused on maximal-period shift registers and primitive polynomials, we have shown that reducible polynomials are often the best choice for this application, with respect to both hardware economy and computational efficiency. We proved that such polynomials can be precisely characterized, and showed that useful trinomials are quite common. While we have emphasized the instrumentation application, in which efficient hardware and easy discrete logarithms are both important, other applications and extensions of our results are apparent:
Because the long-period theorem and the discrete logarithm algorithm apply to all polynomials, not just to trinomials, applications not requiring an LFSR of absolutely minimal logic (one gate) could use a long-period polynomial with more terms. Such polynomials could have longer periods and/or smaller logarithm tables than the best trinomial alternative of the same degree.
Because the discrete logarithm algorithm will work for any polynomial not divisible by $x$, applications whose primary constraint is the size of the log tables rather than the size of the hardware could use a polynomial of at least the desired period but of higher than necessary degree, purely for the size of its log tables.
Some applications have no use for the discrete logarithm at all, needing only an efficient generator with a long period. When a primitive trinomial does not exist for some degree, a reducible trinomial of nearly maximal period almost always does.
The hardware monitor and the associated software were developed by Pete Bannon, Walter Beach, Dave Laurello, Dave Vaughan, and Bei-Pong Wang. Some of our polynomial manipulations were done with the Maple system [4], which came to us from the Symbolic Computation Group, Department of Computer Science, University of Waterloo.
