This paper compares the performance of conventional radix-2 program counters with program counters based on Feedback Shift Registers (FSRs), a class of cyclic sequence generator. FSR counters have constant-time scaling with bit-width N, whereas FPGA-based radix-2 counters typically have O(N) time complexity due to the carry chain. Program counter performance is measured by synthesis of standalone counter circuits, as well as synthesis of three FPGA-based processor designs modified to incorporate FSR program counters. Hybrid counters, combining both an FSR and a radix-2 counter, are presented as a solution to the potential cache-coherency issues of FSR program counters. Results show that high-speed processor designs benefit more from FSR program counters, allowing both greater operating frequency and the use of fewer logic resources.
Introduction
A Program Counter (PC) circuit [10] generates the address of the next instruction to be fetched for execution. The greatest contributor to total PC circuit latency in an FPGA-based processor can be the counter that is used to increment the current PC value. This is because conventional radix-2 counters implemented within FPGAs can have long carry chains. Pipelining can be used to obtain a higher operating frequency, but it also increases logic usage and is likely to increase the branch penalty.
Non-radix-2 cyclic sequence generators, for example a maximum-cycle Feedback Shift Register (FSR) [5, 13, 6], can be used to generate the next instruction address and can lead to a reduction in total PC latency. This is because FSR counters can be designed so that the maximum depth of combinatorial logic required is only one gate [7]; the PC latency is therefore constant with bit-width N. We explore how this reduction in PC circuit complexity can affect maximum processor operating frequency for three FPGA processor designs.
The sequence of instruction addresses generated by a PC circuit using a maximum-cycle FSR is pseudo-random. For a processor fetching instructions from a small embedded memory this presents no problem. For processors that feature an instruction cache, an FSR PC will have poor cache-coherency behaviour. As a solution we present a hybrid PC architecture. The hybrid PC is the concatenation of two smaller counters: a small radix-2 counter that steps through the instructions within a cache line, and an FSR counter that cycles between cache lines. When implemented within FPGA processors, this hybrid approach has low latency and avoids cache-coherency problems.
Program Counters
Stan et al. [14] list the properties of generic up/down counters, but not all of these properties are required for PC circuits. A program counter must be RESETable, must increment once every clock cycle as long as an ENABLE line is asserted, and sometimes has its value changed by branch instructions (so it needs to be LOADable, and have IN lines). The value also needs to be readable every clock cycle (using the OUT lines) in order to access the memory address to fetch the next instruction. There is no need to support other common counter features [14], such as being reversible, or any terminal count operations. A black box diagram for a generic PC circuit is shown in Figure 1.
Formal framework for counters
In order to investigate structures that can be used for program counters, it is worth setting up some formalism.
A conventional radix-2 counter can take on a range of values from 0 to $2^n - 1$, where n is the number of bits. The next value of the counter is obtained by incrementing the current value using radix-2 arithmetic. We will show there is an isomorphism between this counter architecture and a family of other counter architectures. We consider only finite counters, as all counters implemented in digital logic will have a finite number of states.
We will start by defining a counter as finite and closed. Closure is not necessary for a program counter (all we need is a sufficient sequence of states to put the program in), but it gives us some nice properties.
We will then set up an 'ordinary' counter $C_n$, which is a cycle of length n, and use isomorphism to this counter. Lemma 3.1 is a useful grab bag of properties. Note that from Lemma 3.1 (4, 5), any state can be used as a generator.
Lemma 3.4 states that all n-cyclic counters are isomorphic to $C_n$ (no surprises). Corollary 3.6 states that all counters have an n-cyclic subcounter for some n: just find the limit cycle. Note that the limit cycle may have only one element in it.

Possible Program Counters
Radix-2 Counters
Maximum Cycle FSRs
The central theorem is Theorem 3.7, that all counters have a sub-counter that is isomorphic to $C_n$ for some n. This ties these new counters to counters with known behaviour.
Theorem 3.8 lets us join counters together to form hybrid counters.
Definition Consider a finite set of states S and an increment operator σ : S → S. For the purposes of this paper we define a counter as a pair (S, σ) with the following property:
$$\forall s \in S,\ \sigma(s) \in S \qquad (1)$$
This (closure) is one of Peano's axioms for the natural numbers [15]; we do not need the other four axioms for our purposes.
A useful special case is a modulo arithmetic counter, with the ordinary increment operator.
Definition Let $C_n$ be the counter $(\mathbb{Z}_n, f_n)$ where $f_n(x) = x + 1 \bmod n$.
The conventional radix-2 counter with m bits is now simply $C_{2^m}$.
Definition We also define $\sigma^n(s)$ as n successive applications of σ, so
Definition Given a counter (S, σ), (T, σ) is a subcounter of (S, σ) if (T, σ) is a counter and T ⊆ S.
Definition A counter $(S, \sigma)$ is n-cyclic if $|S| = n$, $\exists s_0 \in S$, $S = \{\sigma^i(s_0) \mid i \ge 0\}$, and $\exists m > 0$, $\sigma^m(s_0) = s_0$. The element $s_0$ is a generator for the counter.
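The definitions above can be made concrete with a small sketch. This is only an illustration, not part of the formal development: the helper names `make_cn`, `is_closed` and `is_n_cyclic` are hypothetical, and a counter is modelled simply as a Python set of states together with an increment function.

```python
def make_cn(n):
    """Return C_n as a pair (states, sigma): Z_n with x -> (x + 1) mod n."""
    return set(range(n)), (lambda x: (x + 1) % n)


def is_closed(states, sigma):
    """Closure: sigma maps every state back into the state set."""
    return all(sigma(s) in states for s in states)


def is_n_cyclic(states, sigma):
    """True if one state generates all of `states` and then returns to itself."""
    if not states or not is_closed(states, sigma):
        return False
    s0 = next(iter(states))  # any state serves as generator if the counter is cyclic
    seen, s = set(), s0
    for _ in range(len(states)):
        seen.add(s)
        s = sigma(s)
    return seen == set(states) and s == s0


# A 4-bit radix-2 counter is C_16 in this notation.
radix2_states, radix2_inc = make_cn(2 ** 4)
print(is_n_cyclic(radix2_states, radix2_inc))

# A saturating counter is closed but never returns to its start state.
print(is_n_cyclic({0, 1, 2}, lambda x: min(x + 1, 2)))
```

The second example is closed but not cyclic, which is the situation the later results on sub-counters address.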
Lemma 3.1. Given an n-cyclic counter $(S, \sigma)$, then for any element $s \in S$
For (2), assume the contrary. If $\sigma^i(s) = s$ for some $0 < i < n$, then consider the set of states
Hence T is closed under σ. However $|S| = n$ and $|T| = i < n$, so $\exists u \in S \setminus T$. However, T is closed under σ, so $\nexists p \ge 0$, $\sigma^p(s) = u$, contradicting (1). Hence $\sigma^i(s) \ne s$ for $0 < i < n$.
For (4), σ i (s) where 0 ≤ i < n are elements of S and from (3) these values are distinct, so |{σ (2) is contradicted, and i ≥ 0, so n − i = n and
Figure 3: Homomorphism between $C_n$ and $(S, \sigma)$ as used in the proof for Lemma 3.4
$|T| = |S| = n$ (from the bijection of g). Given
If $(T, \tau)$ is n-cyclic, the proof that $(S, \sigma)$ is n-cyclic follows from using the same argument with $g^{-1}$.
Proof Firstly we show that if a counter is n-cyclic, then it is isomorphic to $C_n$. We construct a mapping g and show it is an isomorphism. Given an n-cyclic counter $(S, \sigma)$, pick any element $s_0 \in S$. Define the mapping g :
). Hence g is a homomorphism as shown in Figure 3. From Lemma 3.1(3) g is an injection, and from Lemma 3.1(4) g is a surjection, hence g is an isomorphism, and $(S, \sigma) \cong C_n$.
Going the other way, if a counter is isomorphic to C n , it is n-cyclic from lemmata 3.2 and 3.3.
Theorem 3.5 (Generator). Given a counter $(S, \sigma)$, $s \in S$ and an integer n such that $\sigma^m(s) \ne s$ for $0 < m < n$ and $\sigma^n(s) = s$, then $(\{\sigma^i(s) \mid 0 \le i < n\}, \sigma)$ is an n-cyclic subcounter of $(S, \sigma)$.
We need to show that $|T| = n$. Clearly $|T| \le n$ by the construction. If $|T| < n$, then there would be at least one repeated state, so
Putting this together, $|T| = n$, $\exists s \in T$, $T = \{\sigma^i(s) \mid i \ge 0\}$, and $\sigma^n(s) = s$, so $(T, \sigma)$ is an n-cyclic subcounter of $(S, \sigma)$.
Remark It is worth noting (although we do not use or prove the result) that any element s of an n-cyclic counter $(S, \sigma)$ can be used to generate the counter in this way.
Corollary 3.6. All counters have an n-cyclic subcounter for some n.
Proof This can be done by construction. Consider any counter $(S, \sigma)$. S is defined to be finite, so let m be the cardinality of S. Now pick any element $s_0 \in S$ and define a sequence
$$s_0,\ \sigma(s_0),\ \sigma^2(s_0),\ \ldots,\ \sigma^m(s_0).$$
Because of equation 1, all members of this sequence are elements of S, and the sequence has m + 1 members. Therefore there must be at least one value that occurs more than once in this sequence, so we can find two integers i and j such that $\sigma^i(s_0) = \sigma^j(s_0)$ where w.l.o.g. $i < j$ and for all k such that $i < k < j$, $\sigma^j(s_0) \ne \sigma^k(s_0)$ (if there were such a k we could choose k instead of j for the second value).
Let
is an n-cyclic subcounter of $(S, \sigma)$ where $n = j - i$.
Theorem 3.7. All counters have a sub-counter that is isomorphic to $C_n$ for some n.
Proof This follows directly from lemma 3.4 and corollary 3.6.
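The construction used for Corollary 3.6 (iterate from any state until a value repeats, then keep the repeating part) can be sketched as follows. The function name `cyclic_subcounter` and its arguments are illustrative assumptions, not notation from this paper.

```python
def cyclic_subcounter(sigma, s0, num_states):
    """Return, in order, the states of the limit cycle reached from s0.

    num_states is |S|. By the pigeonhole argument, the m + 1 values
    s0, sigma(s0), ..., sigma^m(s0) must contain a repeat.
    """
    first_seen = {}
    s = s0
    for step in range(num_states + 1):
        if s in first_seen:
            cycle_len = step - first_seen[s]  # this is n = j - i
            cycle = []
            for _ in range(cycle_len):
                cycle.append(s)
                s = sigma(s)
            return cycle
        first_seen[s] = step
        s = sigma(s)
    raise ValueError("no repeat found: sigma is not closed over num_states states")


# x -> 2x mod 12 is closed on Z_12 but is not a single cycle.
print(cyclic_subcounter(lambda x: (2 * x) % 12, 1, 12))
```

As noted earlier, the limit cycle may contain a single state, for example with a saturating increment.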
Definition Given two counters $(T, \tau)$, $(S, \sigma)$ and an element $s_0 \in S$, we define $(T, \tau) \oplus_{s_0} (S, \sigma)$ as the structure $(T \otimes S, \tau \oplus_{s_0} \sigma)$ where
is clearly a counter, as equation 1 is satisfied trivially.
Theorem 3.8. If $(T, \tau)$ is an n-cyclic counter and $(S, \sigma)$ is an m-cyclic counter, then for any $s_0 \in S$, $(T, \tau) \oplus_{s_0} (S, \sigma)$ is an mn-cyclic counter.
Proof We construct a mapping $g : C_m \to S$ as $g(i) = \sigma^i(s_0)$ and $h : C_n \to T$ as $h(i) = \tau^i(t_0)$ for some $t_0 \in T$. These are isomorphisms from Lemma 3.4. Now consider $C_n \oplus_{m-1} C_m$, and a map p :
It is also clearly an injection and a surjection, hence an isomorphism. ...
There is one operation other than increment that may be performed on a Program Counter: the LOAD operation (RESET is simply a special case of this). There are two forms of LOAD that are currently used:
A program might perform an absolute jump (or LOAD). This simply loads the counter with a new value that has been calculated when the program is created, and is of no difficulty for any sort of counter.
The other form, which is worth examining in some detail, is where the new address is calculated as some offset b from the current one by 'adding' it to the current value of the program counter. Our system doesn't have addition defined, so we define one:
Definition Given a counter $(S, \sigma)$, $s \in S$ and an integer
In the case of
where 'plus' is the usual definition of modulo addition.
Synchronous radix-2 counters are conventionally used for Program Counters. However, there is a trade-off between speed and size because of carry propagation from low-order to higher-order bits [14].
The simplest radix-2 counter is a ripple-carry counter based on an adder [14]. This is slow (O(N) combinatorial delay with increasing bit-width N) but cheap in logic (O(N)). Xilinx FPGAs use carry-chains to provide the logic for this. This can be improved to O(log N) combinatorial delay using a carry-lookahead design [9] at the expense of extra logic.
An approach that allows increment in constant time is a redundant format [8], which allows what is called 'carry-free' addition, but that has two issues. A redundant representation needs twice as many latches and therefore consumes more FPGA resources. Another problem is that a redundant output is unsuitable for providing an address to access instructions. The output first needs to be converted to a non-redundant format during each clock cycle, which requires O(log N) propagation delay and extra logic for an adder. Some work has been done using hybrid redundant number systems [11] to reduce the number of extra registers, but there is still a substantial propagation delay in converting the result to a usable format.
Another approach to designing an O(1) delay counter is to use a cascade [9] that begins with a short and fast counter, and continues with longer counters that only need to be incremented occasionally and do not need to be as fast. However, this requires the increments of the slower counters to be precomputed [14], which makes the LOAD operation much more difficult.
Cyclic Sequence Generators
A cyclic sequence generator is a synchronous circuit that iterates through a cyclic sequence of states. For a program counter, it is desirable to have a simple circuit and a cyclic sequence that includes most of the possible states. A good candidate for this is a maximum cycle feedback shift register.
A Feedback Shift Register (FSR) is a shift register that satisfies the condition that the current state is generated by a linear function of its previous state. There are many types of these, but the ones considered here are linear (use XOR gates) and have a constant maximum combinatorial path, and therefore constant time performance (O(1)) with increasing bit-width N (see Figure 4 for some examples of these). This compares favourably with radix-2 counters as described in Section 4. A well-known example of an FSR is the Linear Feedback Shift Register (LFSR) [5, 6], which is widely used in cryptography [6], communications systems [12] and built-in self-test systems [1]. A few of the types of FSR are:
Fibonacci LFSR takes the output of several of the registers and XORs them together to feed to the input of the first register as shown in Figure 4a .
Galois LFSR takes the output of the last register and XORs it with several of the register inputs as shown in Figure 4b .
Ring Generators [7] rearrange the shift register into a ring, and arranges the feedback connections so that they only involve a small amount of routing, as shown in Figure 4c .
MFSR [19] (Multiple Feedback Shift Register) are a generalisation of ring generators, allowing any output to be XORed with any input, but limiting the fan-in and fan-out to 2, as shown in Figure 4d.
Also worth considering are cellular automata [3]. These are not Feedback Shift Registers but have similar properties. A cellular automaton has each bit set from the XOR of the previous value of that bit and neighbouring bits, which requires more XOR gates than FSRs but keeps the routing very local.
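As a rough software illustration of the first two structures in the list above, the following sketch steps a 4-bit Fibonacci LFSR and a 4-bit Galois LFSR. The bit ordering, the shift direction and the tap choice (feedback polynomial x^4 + x^3 + 1) are assumptions made for this example and are not taken from Figure 4; the function names are hypothetical.

```python
def fibonacci_step(state, tap_positions, width):
    """Fibonacci LFSR: XOR the tapped register outputs and shift the result in."""
    feedback = 0
    for t in tap_positions:
        feedback ^= (state >> t) & 1
    return ((state << 1) | feedback) & ((1 << width) - 1)


def galois_step(state, taps_mask):
    """Galois LFSR: shift, then XOR the bit shifted out into the tapped stages."""
    lsb = state & 1
    state >>= 1
    return state ^ taps_mask if lsb else state


# Assumed 4-bit example: taps on bits 3 and 2 (Fibonacci), mask 0b1100 (Galois).
fib, gal = 1, 1
for _ in range(5):
    print(f"fibonacci={fib:04b}  galois={gal:04b}")
    fib = fibonacci_step(fib, (3, 2), 4)
    gal = galois_step(gal, 0b1100)
```

In both cases each new bit depends on at most one XOR of a few register outputs, which is the property that keeps the combinatorial depth independent of N.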
An N-stage linear FSR is maximum-cycle when all $2^N - 1$ non-zero states occur as the FSR is iterated. Note that a cycle of $2^N$ cannot be achieved, as the all-zero state will always map to the all-zero state. Any linear FSR can be represented as an N × N matrix M over the field GF(2), and this will be maximum cycle if and only if the characteristic polynomial $p(x) = |M - xI|$ is primitive. Lists of primitive characteristic polynomials and counters with maximum cycles can easily be found [6, 2, 18] or generated.
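For small N, the maximum-cycle property can also be checked by brute force rather than by testing the characteristic polynomial for primitivity. The sketch below does this for a Galois-style LFSR; it is an illustrative check under assumed bit ordering, and the names `cycle_length` and `is_maximum_cycle` are hypothetical.

```python
def galois_step(state, taps_mask):
    """One step of a right-shifting Galois LFSR (assumed bit ordering)."""
    lsb = state & 1
    state >>= 1
    return state ^ taps_mask if lsb else state


def cycle_length(taps_mask, width, start=1):
    """Steps until `start` recurs, or None if it does not within 2**width steps."""
    state = start
    for steps in range(1, (1 << width) + 1):
        state = galois_step(state, taps_mask)
        if state == start:
            return steps
    return None


def is_maximum_cycle(taps_mask, width):
    """True if all 2**width - 1 non-zero states lie on a single cycle."""
    return cycle_length(taps_mask, width) == (1 << width) - 1


print(is_maximum_cycle(0b1100, 4))   # x^4 + x^3 + 1
print(is_maximum_cycle(0b1111, 4))   # x^4 + x^3 + x^2 + x + 1
print(galois_step(0, 0b1100))        # the all-zero state maps to itself
```

The exhaustive check costs O(2^N) steps, so it is only practical for narrow counters; for wide counters the published tables of primitive polynomials are the sensible route.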
All these structures are very fast, as described above, have a relatively small amount of logic (O(N), which will mostly be the registers to store the bits), and generate a pseudo-random sequence. They will have similar performance, due to all having a maximum combinatorial path of just one XOR gate (in the Fibonacci case this may be 4 gates, but this is still only one LUT on an FPGA). Slight variations in performance may depend on required LUTs and routing in any particular FPGA technology choice and application. The worst-case number of XOR gates, fan-in and fan-out is shown in Table 1. As the sequence is pseudo-random, the bits may also be reordered to search for small routing improvements.
FSR counters meet the counter requirements for PC circuits described in Section 2. They are easily loadable, support an enable, and the register can be read directly. One disadvantage of using FSRs is that the maximal cycle size is $2^n - 1$, instead of the $2^n$ cycle of radix-2 counters. We refer to the address not generated in the $2^n - 1$ cycle as the "zero address". While the zero address does not represent a significant fraction of the address space, this can be addressed with extra logic as used by Wang and McCluskey [17], but this brings the propagation delay back to O(log N).
For the rest of this work we use MFSR counters. They have good fan-in, fan-out, and a low gate count. Other FSRs could be used and would have very similar results (see Figure 4 ).
Hybrid PCs
The instruction fetch order of FSR PCs may lead to poor run-time performance for processor designs that contain an instruction cache. It is desirable to have a PC that increments through all of the instructions within a cache line before fetching a new cache line. For a cache line size of 32 bytes, and a fixed instruction width of 32 bits, there will be eight instructions within a cache line. Ideally, and in the absence of branching instructions, each of these eight instructions should be fetched from the cache before fetching another cache line from system memory.
The solution presented here is to combine two counters into one PC: a radix-2 counter for the three least-significant bits and an MFSR for the most-significant bits. Since the Spartan-3 contains four-input Look-Up Tables (LUTs), and the Virtex-5 has six-input LUTs, a 3-bit radix-2 counter can be built with just one layer of logic. When the upper count value of the radix-2 counter is reached, the MFSR portion of the PC is then incremented.
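A behavioural sketch of this hybrid increment is given below. It assumes a 7-bit PC made of a 3-bit radix-2 field and a 4-bit upper field, and it uses a simple Galois-style FSR as a stand-in for the MFSR; the widths, the tap mask and the function names are illustrative assumptions only.

```python
LINE_BITS = 3                         # eight instructions per cache line
LINE_MASK = (1 << LINE_BITS) - 1


def fsr_step(state, taps_mask=0b1100):
    """Stand-in 4-bit FSR for the upper PC bits (Galois form, assumed taps)."""
    lsb = state & 1
    state >>= 1
    return state ^ taps_mask if lsb else state


def hybrid_step(pc):
    """Increment the low radix-2 field; step the FSR field when it wraps."""
    low = pc & LINE_MASK
    high = pc >> LINE_BITS
    if low == LINE_MASK:
        return fsr_step(high) << LINE_BITS   # next cache line, offset 0
    return (high << LINE_BITS) | (low + 1)


pc = 1 << LINE_BITS                   # start at FSR state 1, offset 0
for _ in range(10):
    print(f"line={pc >> LINE_BITS:2d} offset={pc & LINE_MASK}")
    pc = hybrid_step(pc)
```

Within a line the addresses are sequential, so all eight instructions are fetched before the FSR field moves to another line; only the order of the lines is pseudo-random.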
FPGA Synthesis of Simple Counter Circuits
The performance of the synthesised radix-2 and FSR counters is shown in Figure 5. As the bit-width N increases, radix-2 counters show a linear increase in latency, i.e. O(N) time complexity. The results show that for radix-2 counters of more than 6 bits, latency can be estimated as 2.9 + 0.064 × N ns. The FSR counters have O(1) time complexity and a smaller constant, of only 1.8 ns, than the radix-2 counters.
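To give a feel for these figures, the short sketch below evaluates the fitted radix-2 estimate at a few bit-widths and compares it with the FSR constant. Reading the 1.8 ns constant as the whole FSR counter latency is an assumption made here for illustration, and the function name is hypothetical.

```python
FSR_LATENCY_NS = 1.8  # assumed to be the total FSR counter latency


def radix2_latency_ns(n_bits):
    """Fitted estimate quoted above, stated for counters of more than 6 bits."""
    return 2.9 + 0.064 * n_bits


for n_bits in (8, 16, 32):
    radix2 = radix2_latency_ns(n_bits)
    print(f"N={n_bits:2d}: radix-2 ~{radix2:.3f} ns, "
          f"FSR ~{FSR_LATENCY_NS:.1f} ns, ratio ~{radix2 / FSR_LATENCY_NS:.2f}")
```

Under this reading, the gap between the two counter types widens steadily as N grows, in line with the O(N) versus O(1) scaling.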
FPGA Synthesis of Complete PC Circuits
A simple PC circuit was used when investigating the effect of counter-type on PC performance. Figure 6 is a block diagram of the circuit used for testing and the counter used was one of FSR, radix-2, or a hybrid where the lowest 3 bits are radix-2. Bit widths ranging from 8 to 32 bits were used. 
PC Circuit Results
The results of the PC circuit synthesis are shown in Figure 7 . As with the previous counter tests, the radix-2 circuits scale linearly with increasing bit-width. Again though, the latencies of FSR-based circuits are lower than radix-2 and both the FSR and hybrid PCs have O(1) scaling with increasing bit-width. Due to the Xilinx synthesiser using the carry-chain logic for the radix-2 counters, and FSRs having a gate depth of just one, this is as expected.
To demonstrate that the behaviour observed when synthesising for a Spartan-3 FPGA is not unique, Figure 8 also shows synthesis results for a Xilinx Virtex-5 FPGA. The latency scaling with increasing bit-width is similar but the Virtex-5 is clearly a faster architecture.
Processor Design Examples
Three FPGA processors were synthesised and evaluated to show the effects of different PC circuits on maximum clock frequency. The three processor logic cores used are aeMB, TTA16, and RISC16. aeMB was designed to use a conventional radix-2 PC whereas RISC16 and TTA16 were designed to use a PC circuit based on either a FSR or radix-2 counter. This is to examine if substituting a FSR-based PC into an existing processor design leads to any performance gains of the processor as a whole, and if there is a detrimental effect of adding a radix-2-based PC to a processor designed for a FSR-based PC. A Xilinx Spartan-3 FPGA was again the synthesis target and Xilinx ISE 9.2 was the synthesis tool.
aeMB: A MicroBlaze Compatible RISC Processor
The aeMB processor logic core is a 32-bit, Harvard architecture, RISC processor with a three-stage pipeline. It is an open source project that was designed to be instruction compatible with the MicroBlaze core [20]. When the aeMB processor core was synthesised for a Spartan-3 FPGA it used about 2600 logic elements.
The critical path of aeMB, as determined from the Xilinx place-and-route timing report, is the instruction-cache look-up path. Total routing resources used were 515643 paths with the original radix-2 PC. The source code was modified, substituting a FSR PC for the radix-2 PC, and resynthesised. This was a straight PC-logic substitution and did not address any potential problems with relative branching and cache coherency.
Maximum operating frequency increased slightly, as is shown in Table 2, and logic resource utilisation was similar, but the FSR design used only 503608 paths. This is a large change in pathing resources used for a small change to the total design. This large difference in pathing resources, caused only by the change to the PC circuit, was not observed with the other processor cores.
TTA16
TTA16 is a 16-bit, Transport Triggered Architecture [4] (TTA) processor optimised for Xilinx Spartan-3 FPGAs.
It is an open-source, Harvard architecture processor and was designed for the high-throughput, data-processing tasks of the Open Video Graphics Adapter (OpenVGA) project [16] . TTA processors have very simple instruction word formats and require only very simple instruction decoders resulting in smaller processor cores. The TTA16 PC circuit is similar to that shown in Figure 6 and contains source code for both types of PC, FSR or radix-2, and can be synthesised with either one.
TTA16 was the processor that showed the greatest frequency improvement with an FSR PC (see Table 2): with the FSR counter it has a 22% greater clock frequency than with the radix-2 counter. TTA16 configured to use the FSR PC is substantially faster than with the radix-2 PC because TTA16 was designed to use a low-latency PC circuit. When TTA16 is synthesised with a radix-2 counter, the PC circuit becomes the critical path limiting maximum frequency. We speculate that adding an additional pipeline stage to the radix-2 PC circuit may improve maximum processor frequency, but this would also increase branch latency and the FPGA resources required.
RISC16: A Small 16-bit RISC Processor
RISC16 was also designed for OpenVGA [16], for comparison with TTA16, and shares many design elements. The RISC16 core has five pipeline stages arranged so that each has similar latency. Decreasing the latency of one small component, the PC circuit, would not be expected to have a big effect on overall performance. Table 2 shows that there was no substantial difference between the FSR and radix-2 PC implementations of the RISC16 processor core.
Discussion
Radix-2 counters are the conventional counters used for a processor's PC. These have been studied and used extensively. Memories, including caches, often support linear, burst transfers assuming a radix-2 count order. Current software tools, like assemblers and compilers, assume a radix-2 count order as well. FPGAs, like the Xilinx Spartan-3 and Virtex-5 families, contain carry-chains to support radix-2 counters as well.
Traditional assemblers will generate object code suitable only for the radix-2 PC increment sequence. For processors with FSR-based PCs, an assembler is needed that will use a description of the FSR to correctly re-order the processor instructions. An assembler was developed for this purpose; it reads an XML-encoded complete processor description prior to generating the assembly output. The assembler supports FSR and radix-2 PCs, as well as hybrid PCs consisting of a concatenation of an arbitrary number of FSR and radix-2 counters. For each FSR counter used, the tap sequence describing the FSR is also encoded within the processor description file.
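The re-ordering step can be pictured with the following sketch, which places each instruction at the address the PC will hold when that instruction is due to be fetched. It is not the assembler described above: the XML processor description is replaced by a hard-coded 4-bit Galois-style FSR, and the names `place_program` and `fsr_step` are hypothetical.

```python
def fsr_step(state, taps_mask=0b1100):
    """Assumed 4-bit FSR standing in for the PC described by the processor file."""
    lsb = state & 1
    state >>= 1
    return state ^ taps_mask if lsb else state


def place_program(instructions, step, reset_state, memory_size):
    """Store the i-th instruction at the i-th address of the PC count sequence."""
    memory = [None] * memory_size
    address = reset_state
    for instruction in instructions:
        if memory[address] is not None:
            raise ValueError("program is longer than the PC count cycle")
        memory[address] = instruction
        address = step(address)
    return memory


program = ["load", "add", "store", "jump"]   # placeholder mnemonics
image = place_program(program, fsr_step, reset_state=1, memory_size=16)
for address, instruction in enumerate(image):
    if instruction is not None:
        print(f"{address:2d}: {instruction}")
```

Absolute branch targets would then be resolved to the placed address of the destination instruction, in the same way as label resolution in a conventional assembler.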
FSR Program Counters
The FSR counters are very fast (see Figure 5), leading to a substantially lower-latency PC circuit (see Figure 7). This in turn leads to notable gains in maximum processor operating frequency for the processor with the highest operating frequency, TTA16 (see Table 2). Because the count cycle of a maximal-cycle FSR is different from that of a radix-2 counter, this has effects on tools, instruction encoding, and cache coherency.
Relative Instruction Addressing
The instruction sets of many processors contain branch instructions that store an offset that is added to the current value of the PC. These instructions are called PC-relative branch instructions. When the offset is encoded with a bit-width narrower than the PC, this branching is difficult to do with FSRs.
PC-relative branching with small offsets is widely used in contemporary processors [10] as it allows a subset of addresses, those that are close to the current instruction, to be encoded within an instruction. The simplest approach with FSR-based PCs is to not support PC relative branching. This problem is more easily resolved using Hybrid PCs (see Section 8.2).
Cache Coherency
For processor designs featuring an instruction cache, an FSR-based PC will lead to poor performance. A single FSR-based PC circuit will traverse program memory in a pseudo-random count cycle. Caches are typically designed to fetch blocks of multiple words from sequential addresses, called cache lines. An FSR PC circuit may only execute one instruction from a cache line, and then the cache may need to fetch another. A solution to this problem is introduced in Section 4.2 and then issues are discussed in Section 8.2.
Hybrid Program Counters
The synthesis results, shown in Figures 7 and 8 , for the hybrid PCs described in Section 4.2 show that performance is greater than radix-2 PCs at all tested bit-widths. They are therefore a good solution to the cache coherency problems of FSRs. Figures 7 and 8 show that hybrid PCs have only slightly higher latency than pure FSR PCs and with the same constant-latency behaviour with increasing N . Hybrid PCs can also be used to solve problems with relative branching and position independent code, though this is future work.
With a hybrid-PC, the FSR portion will not increment when its value is zero, so there is one entire cache line which will not be accessed by the count cycle. This need not be a disadvantage because the first cache line could be used to store other information, for example the interrupt vector table.
Conclusions
We have designed PC circuits that can allow some FPGA-based processors to operate at higher frequencies. This is because FSR-based counters have very low latency, a depth of just one logic gate, and PC circuits utilising FSRs can have substantially lower latency when implemented in FPGAs. Due to the constant-time behaviour, with increasing bit-width N , FSR counters have an even greater advantage, relative to radix-2 counters, when N is large.
For small embedded FPGA processors executing instructions stored in local SRAMs, the pseudo-random count cycle of FSRs is no significant problem either, as long as the user has the necessary tools to generate code. We have also presented hybrid PCs to solve the FSR cache-coherency issues for processors that use an instruction cache to reduce average latency for instruction fetching.
Future Work
There can be many possible maximal-cycle FSRs for a particular bit-width. Some FSRs, due to each having a slightly different circuit, may be faster for a specific implementation than other FSRs. Testing was not performed to find the fastest available FSR circuit for a particular implementation. A future project might be searching amongst the many possible FSRs to find the lowest latency circuit.
Modern compilers can generate position independent code making use of relative branching that is not practicable with a pure FSR-based PC. Further work exploring hybrid program counters, consisting of three or more smaller counters, could probably be used to solve the FSR relative addressing problems.
The emphasis of this work is on FPGA-based processors. Due to their very low gate depth and reduced logic complexity, FSR and FSR-radix-2 hybrid PCs may also prove useful with some very low gate-count, or very high clock frequency, processors realised in silicon.
