Reconfigurable data path processor by Donohoe, Gregory
(12) United States Patent 
Donohoe 
(10) Patent No.: US 6,883,084 B1
(45) Date of Patent: Apr. 19, 2005
(54) RECONFIGURABLE DATA PATH 
PROCESSOR 
(75) Inventor: Gregory Donohoe, Albuquerque, NM 
(US) 
(73) Assignee: University of New Mexico, 
Albuquerque, NM (US) 
( * ) Notice: Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U.S.C. 154(b) by 370 days.
(21) Appl. No.: 10/206,517 
(22) Filed: Jul. 25, 2002 
Related U.S. Application Data 
(60) Provisional application No. 60/307,739, filed on Jul. 25,
2001.
(51) Int. Cl.7 ................................................ G06F 13/00
(52) U.S. Cl. ............................... 712/1; 712/11; 712/15;
712/16
(58) Field of Search ............................ 712/1, 10, 11, 15,
712/16
(56) References Cited 
U.S. PATENT DOCUMENTS 
4,839,851 A 6/1989 Maki .......................... 364/900
OTHER PUBLICATIONS 
Donohoe et al., “Reconfigurable Data Path Processor for 
Space”, Proceedings of the Military and Aerospace Appli- 
cations of Programmable Logic Devices 2000, Laurel, MD, 
Sep. 24-28, 2000.* 
Donohoe et al., “Adaptive Computing for Space”, 42nd
Midwest Symposium on Circuits and Systems, Aug. 8-11,
1999, vol. 1, pp. 126-129, IEEE.*
Seth Copen Goldstein et al., “PipeRench: A Reconfigurable
Architecture and Compiler,” Apr. 2000, pp. 70-77.
Masakatsu Maruyama et al., “An Image Signal Multipro- 
cessor on a Single Chip,” IEEE Journal of Solid-state 
Circuits, vol. 25, No. 6, Dec. 1990, pp. 1476-1483. 
* cited by examiner 
Primary Examiner-William M. Treat 
(57) ABSTRACT 
A reconfigurable data path processor comprises a plurality of
independent processing elements. Each of the processing
elements advantageously comprises an identical architecture.
Each processing element comprises a plurality of data
processing means for generating a potential output. Each
processing element is also capable of passing an input through as a
potential output with little or no processing. Each processing
element comprises a conditional multiplexer having a first 
conditional multiplexer input, a second conditional multi- 
plexer input and a conditional multiplexer output. A first 
potential output value is transmitted to the first conditional 
multiplexer input, and a second potential output value is 
transmitted to the second conditional multiplexer input.
The conditional multiplexer couples either the first condi- 
tional multiplexer input or the second conditional multi- 
plexer input to the conditional multiplexer output, according 
to an output control command. The output control command 
is generated by processing a set of arithmetic status-bits 
through a logical mask. The conditional multiplexer output 
is coupled to a first processing element output. A first set of
arithmetic status bits is generated according to the processing of
the first processable value. A second set of arithmetic status bits
may be generated from a second processing operation. An
arithmetic-status bit multiplexer selects the desired set of
arithmetic status bits from among the first and second sets of
arithmetic status bits. The conditional multiplexer evaluates
the selected arithmetic status bits according to a logical mask
defining an algorithm for evaluating the arithmetic status
bits.
42 Claims, 13 Drawing Sheets 
[Drawing sheets 1 to 13 (FIGS. 1-17) are not reproduced here; only the following label text survives.]
[FIG. 2 (flow chart): "Receive Data 'D'" (110); "Load Process n into Processor" (114); "Process Data 'D' According to" (label truncated); "n=n+1" (122).]
[FIG. 5: Input Data Buffer; Input Select Logic; Output Select Logic; Output Data Buffer.]
[FIG. 8 (flow chart): "Send 24-Bits Bitstream A from MUX to Lower Register of PAD1"; "PAD Zeroes into Upper Register of PAD1".]
[FIG. 10 (flow chart): "Transfer the 24 Least Significant Bits 'A' of Incoming Bits Stream In1 to the Multi-Plexer MUX2"; "The Most Significant 24-Bits A' of the Bits Stream at Input In1 are Transferred to the Multi-Plexer MUX1"; "Transfer Content of MUX1 to the Lower Register 'B' of the Padding Module PAD1".]
[FIG. 11: CMUX.]
[FIG. 12: Header; OP_CODE; PE_ADDR; field widths of 3 bits and 4 bits.]
[FIG. 17: Output Select y(m,n).]
RECONFIGURABLE DATA PATH 
PROCESSOR 
RELATED APPLICATIONS 
The present application claims priority of the Provisional
U.S. Patent Application No. 60/307,739 filed on Jul. 25,
2001 and entitled “RECONFIGURABLE DATA PATH
PROCESSOR.” The Provisional U.S. Patent Application
No. 60/307,739 filed on Jul. 25, 2001 and entitled
“RECONFIGURABLE DATA PATH PROCESSOR” is herein
incorporated by reference.
GOVERNMENT LICENSE RIGHTS 
The U.S. Government has a paid-up license in this inven- 
tion and the right in limited circumstances to require the 
patent owner to license to others on reasonable terms as 
provided by terms of the Federal Grant No. NAG5-9469 
awarded by NASA for the project entitled “RECONFIG- 
URABLE DATA PATH PROCESSOR,” and the Federal 
Grant No. NAG5-9704 awarded by NASA for the project 
entitled “SOFTWARE FOR RECONFIGURABLE PRO- 
CESSOR.” 
FIELD OF THE INVENTION 
The present invention relates to a reconfigurable data 
processing pipeline which is adapted to parallel processing 
in ultra low power CMOS circuitry. More specifically, the 
present invention relates to a reconfigurable data processing 
pipeline which is adapted to parallel processing in ultra low 
power CMOS circuitry through the data path switching of a 
conditional multiplexer controlled by an evaluation of arith- 
metic status bits produced during data processing. 
BACKGROUND OF THE INVENTION 
Owing largely to the history of microprocessor 
development, the von Neumann processor, with a single 
arithmetic-logic unit through which all data must pass, is a 
common reference against which other processing models 
are compared. The computational model incorporating the 
von Neumann processor typically envisions a sequential 
processor, a randomly-addressable memory, a single 
arithmetic-logic unit (ALU), and a control unit. The memory 
stores information and instructions, and the ALU transforms 
bit patterns. The control unit reads data and instructions 
from memory and routes data through the ALU, and back 
into memory. This computational model is deeply embedded
in programming languages such as C and MATLAB. For
example, when the computer executes a function, such as 
sin(x), the main flow of execution stops; the sin(x) function 
is executed, typically to termination, and the main program 
flow resumes where it left off. Sequential processors execute 
alternative computations by switching the program flow 
through conditional branching. Program agility is thus 
achieved by changing the flow of execution. Time efficiency 
is not intrinsic in a sequential model of operation. According 
to different control inputs, different programs or sub- 
programs are granted run-priority. In one case, the processor 
executes one sequence of instructions. In another case, the 
processor executes a different sequence of instructions. 
FIG. 1 illustrates one embodiment of a von Neumann type 
processor. An input 101 is received by the microprocessor 
105. A memory module 103 coupled to the microprocessor 
105 contains various processes and algorithms for process- 
ing incoming data, and is capable of downloading these 
programs into the processor 105. An output module 107 is 
coupled with the processor output, and is capable of receiv- 
ing the processor output. 
FIG. 2 is an exemplary looping process commonly occur- 
ring in conjunction with the architecture of the von Neumann
processor illustrated in FIG. 1. According to the step
110, the processor 105 receives input data “D” from the 
input module 101. According to the step 112, a counter value 
n is set to one. According to the step 114, the process n 
within the memory area 103 is loaded into the micropro- 
cessor 105. According to the step 116, the data D is pro- 
cessed with algorithm n. According to the step 118, the 
output module 107 evaluates whether the processed data 
falls within a pre-determined range. According to the step 
120, if the processed data falls within the pre-determined 
range, the processed data is sent to the output 120. If in the 
step 118, the processed data falls outside the predetermined 
range, the value n is incremented by one in the step 122, and 
the process returns to the step 114, loading the process n into 
the microprocessor. According to the process illustrated in 
FIG. 2, the “looping” is recurrent until a desired data 
outcome is derived. The number of loops may be determined 
by control signals which are themselves generated by output 
data. Alternatively, the number of loops may depend upon 
the execution of a predetermined sequence of operations. 
The process illustrated in FIG. 2 is exemplary of one form 
of a “looping” program, wherein successive outputs are 
discarded if they are not within a specified range. Alternative 
looping programs are possible, such as accumulating suc- 
cessive outputs of processed data which have been pro- 
cessed by various algorithms successively loaded into the 
processor. The essential point illustrated by FIG. 2, however, 
is that looping programs which require multiple iterations
become time consuming, each iteration adding further
processing time. The same phenomenon occurs with
branching programs wherein a branch “dead ends” and must 
be recalculated according to a different algorithm. Thus, 
such architectures are not time optimized. 
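The looping process of FIG. 2 can be sketched in software as follows. This is a minimal illustration only: the function name `looping_process`, the three example algorithms and the numeric range are assumptions made for the sketch, and the loop is bounded by the number of available algorithms, a termination condition that FIG. 2 itself does not show.

```python
from typing import Callable, Optional, Sequence, Tuple


def looping_process(
    d: float,
    algorithms: Sequence[Callable[[float], float]],
    lo: float,
    hi: float,
) -> Tuple[Optional[float], int]:
    """Try algorithm 1, 2, ... in sequence until an output falls in [lo, hi].

    Returns (accepted output or None, number of algorithms loaded and run).
    """
    n = 1                                # step 112: counter n set to one
    while n <= len(algorithms):          # bound added for the sketch (assumption)
        process = algorithms[n - 1]      # step 114: load process n
        result = process(d)              # step 116: process data D with algorithm n
        if lo <= result <= hi:           # step 118: is the result within range?
            return result, n             # step 120: send processed data to output
        n += 1                           # step 122: increment n and loop again
    return None, len(algorithms)         # no algorithm produced an in-range value


# Hypothetical algorithms, chosen only to exercise the loop.
algorithms = [lambda x: x * 10.0, lambda x: x + 100.0, lambda x: x / 2.0]
result, loads = looping_process(8.0, algorithms, lo=0.0, hi=5.0)
print(result, loads)
```

In this run the first two algorithms produce out-of-range values, so the third must be loaded and run before any output is accepted; the elapsed time grows with the number of rejected attempts.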
A second limitation of serial processing techniques gen- 
erally associated with RISC (Reduced Instruction Set 
Computer), DSP (Digital Signal Processor) and von Neumann
type serial processors arises from the inability of
serial processing techniques to take full advantage of ultra
low power (“ULP”) technology. In spacecraft applications,
the need to conserve power is critical. This makes ultra low 
power (ULP) technology particularly attractive in spacecraft 
applications. The limitation of serial processing techniques 
in ULP technology can be illustrated by understanding the 
sources of power consumption in a CMOS circuit. Dynamic 
power consumption occurs when a transistor switches state, 
and is proportional to the square of the voltage. From this,
it is easily understood that, when supply voltage levels are
reduced from approximately five volts to approximately
one-half volt, dynamic power consumption may be reduced
by roughly two orders of magnitude. Static,
or parasitic, power consumption, on the other hand, is
generally proportional to the source or drain area, and
therefore increases with the number of transistors in the 
circuit. Static power dissipation generally occurs due to 
leakage in parasitic source and drain diodes. In conventional 
CMOS circuits in the 5 volt range, the dynamic power 
consumption is typically the dominant source of energy 
consumption. For this reason, there is little parallelism in 
most serial type processing models. However, if the same 
fundamental schematic used in a traditional 5-volt CMOS 
circuit were used for a ULP circuit, the ratio of power lost 
through static or parasitic power consumption would 
increase. Static power consumption occurs regardless of 
processing. 
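The voltage scaling argument above can be checked with a short calculation. The sketch assumes only the stated proportionality of dynamic power to the square of the supply voltage, with switching activity and capacitance held constant; the function name `dynamic_power_ratio` is hypothetical.

```python
import math


def dynamic_power_ratio(v_high: float, v_low: float) -> float:
    """Ratio of dynamic power at v_high to dynamic power at v_low.

    Assumes dynamic power is proportional to V**2 and nothing else changes.
    """
    return (v_high / v_low) ** 2


ratio = dynamic_power_ratio(5.0, 0.5)   # five volts versus one-half volt
orders = math.log10(ratio)              # reduction expressed in orders of magnitude
print(ratio, orders)
```

A tenfold reduction in voltage gives a hundredfold reduction in dynamic power, which is the two orders of magnitude cited in the text. Static power is unaffected by this term, which is why its share of total consumption grows in a ULP circuit.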
Additionally, resistance to radiation is particularly vital in 
spacecraft applications. Without the earth’s atmosphere, a 
circuit in outer space is bombarded with a higher level of 
background radiation than earthbound circuits. However, 
traditional CMOS processors are not easily radiation hard- 
ened without a significant performance degradation. Without 
radiation hardening, single event upsets, single event 
latchup, total ionizing dose and other radiation effects due to 
cosmic bombardment dramatically increase the likelihood of 
onboard failure in spacecraft applications. 
The single processing path concept inherent in the von
Neumann processor is often referred to as exhibiting “minimum
granularity.” As illustrated in FIG. 3, the von Neumann
processor is at one end of the granularity spectrum. RISC 
and DSP processors are more granular than von Neumann 
processors. At the other end of the spectrum are Field 
Programmable Gate Arrays (FPGAs). FPGAs have maxi- 
mum granularity, and are programmable down to the gate 
level. Fine-grained reconfigurable granularity offers great 
flexibility, and enables the architecture of the processor to be 
modified to closely match the architecture of the computa- 
tion problem, offering the possibility of very high perfor- 
mance. However, fine-grained reconfigurability exacts a 
high price in area. It is estimated that only 1% of the area of 
a typical FPGA is available for useable logic; the rest is 
consumed in interconnect and configuration memory. Within 
the spectrum illustrated in FIG. 3, complex programmable 
logic devices (CPLDs) are slightly less granular than 
FPGAs, while digital signal processors (DSPs) and super- 
scalar CPUs are more granular than simple von Neumann- 
type microprocessors. Additionally, FPGAs are not typically
radiation-hardened, making them particularly failure-prone in
spacecraft applications where cosmic rays are unfiltered by
the earth’s atmosphere. Manufacture of radiation-tolerant
FPGAs exacts a large price in that the currently-available
radiation-tolerant FPGAs have two orders of magnitude
fewer equivalent gates than non-hardened FPGAs.
Moreover, complex models synthesized from existing gates 
in FPGAs cannot take advantage of the circuit-level and 
layout-level optimizations which are attainable when these 
models are designed by hand. 
What is needed, therefore, is a processor design configu- 
ration method that can be used advantageously in ULP 
applications. Additionally, the need exists for a processor 
that can be easily manufactured to exhibit a high degree of 
radiation tolerance. The need also exists for a processor 
which can reduce the amount of wasted CMOS circuitry 
associated with field programmable gate array devices.
There is further a need for a processing device that is 
user-configurable to maximize efficiency. There is a further 
need for a processing device that reduces or eliminates 
conditional branching, looping, retracing and re-calculating 
of data, as well as other programming procedures that slow 
processing throughput. 
SUMMARY OF THE INVENTION 
The present invention eliminates branching, simplifies 
looping, and reduces retracing and re-calculating by using 
parallel processing with conditional multiplexers which con- 
ditionally switch data paths according to control inputs 
derived from the data being processed, including the con- 
ditional selection of data for processing in parallel data 
paths. The present invention further provides a processor 
that can take advantage of the reduction in dynamic power 
consumption in a ULP circuit through greater parallelism. 
The present invention further provides a processor which 
can be easily manufactured to exhibit a high degree of 
radiation tolerance. 
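The idea of replacing conditional branching with parallel data paths and data-driven selection can be sketched as follows. This is an illustrative software model only: Python evaluates the paths one after another, whereas the hardware described here would run them concurrently, and the names `parallel_select`, the example paths and the range test are assumptions made for the sketch.

```python
from typing import Callable, Optional, Sequence


def parallel_select(
    d: float,
    paths: Sequence[Callable[[float], float]],
    lo: float,
    hi: float,
) -> Optional[float]:
    """Evaluate every path on the same input, then select one result."""
    # Each path processes the input; in hardware these would run at the same time.
    candidates = [process(d) for process in paths]
    # Control inputs are derived from the processed data itself
    # (here, a hypothetical in-range flag per path).
    flags = [lo <= value <= hi for value in candidates]
    # Multiplexer-style selection: the first path whose flag is set drives the output.
    for flag, value in zip(flags, candidates):
        if flag:
            return value
    return None  # no path produced a usable value


# Hypothetical alternative computations.
paths = [lambda x: x * 10.0, lambda x: x + 100.0, lambda x: x / 2.0]
selected = parallel_select(8.0, paths, lo=0.0, hi=5.0)
print(selected)
```

Unusable results are simply not selected. Because they were produced alongside the usable one rather than before it, rejecting them costs no additional passes through the processor.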
A reconfigurable data path processor comprises a plurality
of independent processing elements. Each of the processing
elements advantageously comprises an identical architecture.
Each processing element comprises a multiplier and an
arithmetic logic unit, each capable of simultaneously processing
data. According to the preferred system
configuration, the multiplier and the arithmetic logic unit
can process the same data from a processing element input,
or process separate data received from separate processing
element inputs. Additionally, the processing element can be
configured such that the output of the multiplier can form an
input of the arithmetic logic unit, and the output of the
arithmetic logic unit can form an input to the multiplier.
Each processing element further comprises a conditional
multiplexer having a first conditional multiplexer input, a
second conditional multiplexer input and a conditional
multiplexer output. The conditional multiplexer output is
coupled to a first processing element output. At least two
processable values can be received at the inputs of the
processing element.
The processing element processes the first processable
value according to a first algorithm, and a second processable
value according to a second algorithm, generating first
and second processed values. The first processed value is
transmitted to the first conditional multiplexer input, and the
second processed value is transmitted to the second conditional
multiplexer input. A set of arithmetic status bits is generated
according to the processing of the first processable
value. The conditional multiplexer evaluates the arithmetic
status bits according to a logical mask defining an algorithm
for evaluating the arithmetic status bits. The bit pattern of the
logical mask is advantageously downloaded into the processor
during the configuration process. According to the
evaluation of the arithmetic status bits, the conditional
multiplexer selects either a first data path which couples the first
conditional multiplexer input to the conditional multiplexer
output or a second data path which couples the second
conditional multiplexer input to the conditional multiplexer
output. According to one embodiment, various additional
data paths within the processing element are selected and
configured during a configuration stage prior to the processing
of data.
According to one embodiment, an arithmetic-status bit
multiplexer selects a set of arithmetic status bits from at least
two sets of arithmetic status bits generated by at least two
different processing operations. The configuration of the
arithmetic-status bit multiplexer is advantageously performed
during the configuration stage.
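A single processing element with a mask-controlled conditional multiplexer can be modelled roughly as below. The sketch makes several assumptions that the text above does not fix: the particular status bits (zero, negative, carry) and their bit positions, the rule that the second input is selected when any masked status bit is set, and all function names. It is meant only to show how a mask loaded at configuration time can turn status bits into a data path selection.

```python
from typing import Tuple

WIDTH = 24                      # data path width used for this sketch
WORD_MASK = (1 << WIDTH) - 1

# Hypothetical status-bit positions (not specified in the text above).
ZERO, NEGATIVE, CARRY = 0b001, 0b010, 0b100


def alu_add(a: int, b: int) -> Tuple[int, int]:
    """24-bit addition returning (result, arithmetic status bits)."""
    total = (a & WORD_MASK) + (b & WORD_MASK)
    result = total & WORD_MASK
    status = 0
    if result == 0:
        status |= ZERO
    if result >> (WIDTH - 1):   # most significant bit set
        status |= NEGATIVE
    if total >> WIDTH:          # carry out of bit 23
        status |= CARRY
    return result, status


def cmux(first: int, second: int, status: int, mask: int) -> int:
    """Conditional multiplexer: pick `second` if any masked status bit is set (assumed rule)."""
    return second if (status & mask) else first


def processing_element(x: int, y: int, mask: int) -> int:
    processed, status = alu_add(x, y)   # first potential output, with its status bits
    passthrough = x & WORD_MASK         # second potential output: input passed through unprocessed
    return cmux(processed, passthrough, status, mask)


# With the mask set to CARRY, an addition that overflows 24 bits is rejected
# in favour of the unprocessed input; otherwise the sum is output.
print(processing_element(0x000001, 0x000002, CARRY))
print(hex(processing_element(0xFFFFFF, 0x000001, CARRY)))
```

Changing only the mask changes the selection behaviour without altering the arithmetic, which mirrors the role of the downloaded bit pattern described above.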
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 depicts a von Neumann type serial processor.
FIG. 2 illustrates a looping-type program run in a serial
processor.
FIG. 3 illustrates a spectrum of processor granularity.
FIG. 4 illustrates a parallel pipeline architecture.
FIG. 5 illustrates an architecture for configuring inter-
processing element data paths in a pipeline processor in the
present invention.
FIG. 6 illustrates a schematic of components and data
pathways within a processing element according to the
present invention.
FIG. 7 illustrates the bit flow of a “pad” operation of a
padding module.
FIG. 8 is a flow chart describing the “pad” operation of a
padding module.
FIG. 9 illustrates the bit flow of a “sign extension”
operation by a padding module.
FIG. 10 is a flow chart describing the “sign extension”
operation of a padding module.
FIG. 11 is an illustration of a mask operation on a control
signal.
FIG. 12 is an illustration of a processing element
configuration message.
FIG. 13 is a simplified illustration of a processing element
focusing on the firing of the latched output.
FIG. 14 illustrates a Parallel Pipeline Processor
configuration.
FIG. 15 illustrates a firing sequence of the parallel pipeline
configuration of FIG. 14.
FIG. 16 illustrates a hierarchical configurable data path
for interconnecting programming elements within a pipeline
processor.
FIG. 17 illustrates an application of processing pixels
through the pipeline processor of the present invention.
DETAILED DESCRIPTION OF THE PRESENT
INVENTION
Reference will now be made in detail to the preferred
embodiments of the invention, examples of which are illustrated
in the accompanying drawings. While the invention
will be described in conjunction with the preferred
embodiments, it will be understood that they are not
intended to limit the invention to these embodiments. On the
contrary, the invention is intended to cover alternatives,
modifications and equivalents, which may be included
within the spirit and scope of the invention as defined by the
appended claims. For example, although the invention
described herein is especially useful in space craft
applications, the exemplary use of this application herein is
not intended to limit the applications of the present invention
to space craft applications. Accordingly, many examples and
numerous specific details are set forth within the detailed
description of this invention in order to provide a thorough
understanding of the present invention, and the best mode of
its use. However, it will be readily understood to one of
ordinary skill in the art that the present invention may be
practiced without these specific details. In other instances,
well-known methods and procedures, components and circuits
have not been described in detail so as not to
unnecessarily obscure aspects of the present invention.
The reconfigurable data path processor (RDPP) of the
present invention is based on a synchronous pipeline model.
In this model, multiple processing elements (“PEs”) are
coupled in a network, and data and control information flow
between them. As discussed above, the von Neumann model
relied on sequential operations to be selected and performed
as a result of conditional branching, to the neglect of
alternative operational paths. In the RDPP, execution agility
is achieved, not through conditional branching of execution
as illustrated in FIGS. 1 and 2, but through conditional
switching of data paths. Two or more alternative computations
can be carried out in different processing elements
within separate branches of a network. The alternative
computations can be performed simultaneously through the
use of multiple programming elements (“PEs”) which run
concurrently. Each operating programming element produces
at least one output data set. Through conditional
multiplexing controlled by the processing of input data, and
crossbar switching defining pre-determined data paths at
configuration time, data produced through parallel operations
may be segregated into usable data and unusable data.
The usable data is incorporated for further processing, for
ultimate output, storage or any combination thereof. The
un-usable data generated in parallel operations is discarded.
Because unusable data is generated in parallel processing
operations, it does not increase the throughput time of data
processing.
Power conservation is particularly desired in spacecraft
applications, and ultra low power (“ULP”) technology is
therefore a useful tool in optimizing microprocessor technology
in spacecraft applications. The present invention is
particularly suited for implementation in ultra-low-power,
radiation tolerant CMOS using an AMI 0.35μ process. To
maximize the power savings in ULP applications, the
present invention includes architectural parallelism within
the data path which is capable of approaching the optimal
ratio between static and dynamic power consumption, while
simultaneously increasing processing speed.
FIG. 4 illustrates an overview of some of the basic
properties of a synchronous pipelined processor according to
the present invention, which will be discussed in greater
detail in subsequent drawings. The processor is regarded as
synchronous because a common clock pulse drives coordinated
processing elements 202-207 comprising the RDPP
pipeline. After receiving input data from an input cache 201,
the processing element PE1 202 processes the input data and
sends the processed input data to multiplexer 210. The data
is then coupled through parallel programming paths to
programming element PE2 203 and PE3 204. If both PE2
203 and PE3 204 are configured to receive data
simultaneously, the multiplexer 210 is programmed to
couple an output to PE2 203 and PE3 204, simultaneously.
Alternative embodiments are envisioned, however, wherein
the coupling of inputs to parallel programming elements
PE2 203, PE3 204 from a common source such as the
multiplexer 210 can be delayed until all of the parallel units
are ready to receive an input.
When the processing element PE2 203 has finished
processing, PE2 203 couples an output of processed data to
the input of processing element PE4 205 for further processing.
When the processing element PE3 204 has finished
processing its data, PE3 204 couples an output of processed
data to PE5 206. Similarly, when PE5 has finished
processing, it couples its output of processed data to PE6
207. Each processing element thus processes the data
received at its input, and couples it to the next processing
element in the pipeline. The outputs of processing
elements PE4 205 and PE6 207 converge into a single
path in a switch 211.
Within each processing element 203-207 is a conditional
multiplexer with at least two conditional multiplexer inputs
and a single conditional multiplexer output. Each conditional
multiplexer is controllably switched to select the
conditional multiplexer output from among one of the two
conditional multiplexer inputs. The conditional multiplexer
is dynamically switched according to the value of various
arithmetic status bits derived from a mathematical operation
being performed within each of the respective processing
elements. As discussed in conjunction with FIG. 6, the
dynamic switching of the “conditional” multiplexer stands
in contrast to the switching of the plurality of “selective”
multiplexers, for which the switching state and selected data
path is determined prior to system configuration.
In addition to the conditional multiplexer, each processing
element 203-207 has an additional output regulated by a
crossbar switch. As the two parallel paths converge, at least
one of the two outputs of processing element PE4 205 and
at least one of the outputs of processing element PE6 207 are
coupled to separate inputs of circuit element 212.
By conditionally switching after data is processed, rather
than conditional branching prior to processing, the reconfigurable
data path processor of the present invention has the
advantage of being able to evaluate the usefulness of processed
data prior to determining which data is to be ignored and which data
is to be utilized. Additionally, after processing data, each
processing element has the ability to select unprocessed data
through the conditional multiplexer CMUX. Unprocessed
data can also be transmitted to a second output through a
crossbar switch. However, unlike the CMUX, the input/
output selection of unprocessed data by the crossbar switch
is not dynamic. It must be made during the configuration
process. By processing data in parallel and controlling the
flow of data through arithmetic status bits, the present
invention does not need to re-calculate data if a first calculation
is determined to fall out-of-range or is otherwise
unsuitable. Accordingly, the present invention speeds up
processing time by reducing the looping and retracing of
branches which commonly attend programming features in
the prior art. Thus, the network according to the present
invention makes decisions through conditional switching of
processed data rather than conditional branching to process
data according to an algorithm of a particular path at the
rejection of an alternative processing algorithm.
FIG. 5 illustrates an overview of an array of configurable
processing elements within an RDPP 220 according to the
preferred embodiment of the present invention. The processing
elements 225-231 are linked to each other,
and to input and output buffers 221, 237 through a flexible
switching network. The input data 219 is coupled to the
input data buffer 221, and is directed by the input select logic
223 into an input of a selected processing element or
multiple select processing elements 225-231, depending
upon operational requirements. After processing, the output
of the selected operating processing element or multiple
processing elements 225-231 is directed variously to the
output select logic 233 and the input select logic 223. If
processing is completed, the processing element output is
routed to the output data buffer 237. If the output of a
processing element 225-231 is to be further processed, the
output data from the processing element 225-231 is routed
back to the input select logic 223, from which it is directed
to the input of another one or multiple processing elements
225-231. As will be further understood in conjunction with
FIG. 6, each arrow coupling the arithmetic logic-unit of FIG.
5 to a processing element actually represents three separate
24-bit input data paths.
Prior to receiving a first data input 219, the pipeline
architecture is established such that select outputs of the
predetermined processing elements 225-231 are coupled to
select inputs of the various processing elements 225-231.
Pipeline architecture is established through the use of a
pipeline configuration message or collection of configuration
messages. In addition to configuring the data paths
between processing elements of the RDPP, configuration
messages are used to configure the data paths within the
individual processing elements. After the configuration of
the pipeline is completed, including inter-PE data paths and
intra-PE data paths, the pipeline is ready to process data. The [...]
[...]ing to a newly established architecture implemented to
perform a different data processing algorithm.
Accordingly, there are two aspects of the configuration
process. The first aspect of configuration involves the relationship
of various programming elements to each other
through the input select logic and output select logic. The
second aspect of configuration involves the configuration of [...]
of various pre-determined constant values in select constant
registers within the respective processing elements, and the
configuration of select portions of the data path within each
processing element by pre-loading various selective multiplexers
with control values for controlling the multiplexer
switching, as further illustrated in Table 18 herein.
Architecture of the Processing Element
FIG. 6 illustrates the components and architecture comprising
a single processing element according to the preferred
embodiment. Within FIG. 6, control signals, varying
from one-bit to 9-bits, and a 10-bit control-signal mask, are
represented by the thin arrowed lines. Those control and
mask signals which are fixed with pre-determined values
during the configuration process are distinguished by a
single cross hashing. The fire control signals, fire-1PE and
fire-2PE, are not pre-determined during configuration, but
are preferably coupled to the clock driving the processing
element. They are identified by a double hash-mark. The
control signals dynamically generated by arithmetic status
bits during processing, ALU-SW, MUL-SW and CMUX-SW,
are distinguished by a triple hash mark. The thicker
arrowed lines represent 24-bit data paths. Parallel 24-bit data
paths entering or exiting a single component represent a
48-bit data path connecting various components. Because of
the space limitations of FIG. 6, some data and control lines
are not drawn contiguously between two components. Dotted
lines are used to identify these discontiguous lines,
which are shown coupled to both the source and destination
components, with arrows showing the direction of information
flow.
FIG. 6 illustrates three inputs In1, In2 and In3 coming into
the processing element 203, and the two outputs exiting
processing element 203 (FIG. 4). The discussion of FIG. 6
should be taken in conjunction with FIG. 4. According to the
preferred embodiment, these inter-PE paths are preferably
all 24-bit paths, whereas the data paths between various
components within the processing element include 24-bit
and 48-bit paths.
As illustrated in FIG. 6, each processing element 203
advantageously includes three 24-bit inputs, In1, In2 and
In3, and two main processing components, a multiplier 250
and an arithmetic logic unit (“ALU”). The first input In1 is
coupled to the first input of the selective multiplexer MUX1,
which is configured to controllably couple or decouple the
first input In1 from the first input of the arithmetic logic unit
ALU through a sequence of intermediate elements,
specifically, the padding module PAD1 and the shift module
ALSHIFT.
The second input In2 is coupled to the first input of the
selective multiplexer MUX3, which is configured to controllably
couple or de-couple the second input In2 from the
first input of the multiplier 250.
The third input, In3, is configured to be controllably
coupled to, or de-coupled from the second input of the
multiplier 250 by the agency of the selective multiplexer
pipeline will continue to process data according to the MU=, and is further configured to couple to or de-couple 
established architecture until a new set of configuration 65 from the second input of the ALU through the agency of the 
messages is received. Upon receipt of a new configuration selective multiplexer MUX4, as further described herein. 
message, the pipeline is reconfigured to process data accord- Accordingly, each processing element may be configured to 
cessed data according to various arithmetic status bits prior individual programming elements, including the pre-loading 
couple up to two distinct inputs, In2 and In3, to the multiplier
250, and up to two distinct inputs, In1 and In3, to the
arithmetic logic unit ALU. Additionally, each processing element
203 may be configured to couple the output of the multiplier 250
to the second input of the arithmetic logic unit ALU, and the
output of the ALU to the second input of the multiplier 250,
thereby creating substantial processing flexibility as also
described in greater detail herein. This processing flexibility
in each processing element contributes to the power, flexibility,
efficiency and speed of the RDPP pipeline.
Examining FIG. 6 in greater detail, the first input, In1, is
configured to receive a first 24-bit data stream from the input
select logic 223 of FIG. 5, and is coupled to the first input of
the selective multiplexer MUX1. Data register DR1 is a 24-bit
register which contains a constant numerical value which is
pre-loaded during the configuration phase of the pipeline. Data
register DR1 is coupled to the second input of the multiplexer
MUX1. MUX1 is a selective multiplexer which is controllably
configured to select and output data from either its first input,
In1, or its second input, DR1. The switching state of the
selective multiplexer MUX1 is controlled by the one-bit control
signal sel_mux1. As further discussed herein and illustrated in
conjunction with Table 18, the switching state of the control
sel_mux1, as well as the state of the other selective
multiplexers MUX2 through MUX7, is predetermined at the time of
pipeline configuration by pre-loading predetermined control
values for the respective control signals during the
configuration process. In contrast, the switching status of the
conditional multiplexer CMUX is not pre-determined during the
configuration of the processing element, but, as discussed
further herein, is conditioned upon a four-bit control signal
CMUX_SW derived from arithmetic status bits generated during the
data processing within the processing element.
The output of selective multiplexer MUX1 is coupled to the input
of the padding module PAD1. The function of the padding module
PAD1 can best be illustrated by understanding that the data paths
between the processing elements 202-207 are preferably 24-bit
paths, whereas data paths within a processing element include
both 24-bit and 48-bit data paths. Accordingly, the padding
module expands the data path from 24-bits to 48-bits. The 24-bit
data paths are designated by the acronym PE_INT, and the 48-bit
data paths are designated by the acronym PE_LONG. The maximum and
minimum values in a signed 24-bit field are represented by
PE_POS_MAX and PE_NEG_MAX respectively. The largest positive
number in a signed twenty-four bit field, PE_POS_MAX, is
8,388,607, commonly represented by the hexadecimal value
0x7FFFFF. Those skilled in the art will recall that, in two’s
complement binary, “zero” is the first “positive” number, whereas
negative-one is the first negative number. Therefore, the
magnitude of PE_NEG_MAX is one integer greater than the magnitude
of PE_POS_MAX, giving negative 8,388,608. Those skilled in the
art will further recognize that, when a number is accompanied by
a negative sign bit in two’s complement binary, the magnitude
increases by adding the zeroes within the field, not the ones.
Accordingly, PE_NEG_MAX is typically represented by the
hexadecimal value 0x800000. The padding modules convert all data
to a 48-bit format for internal processing through one of two
padding operations.
In the first padding operation, “pad,” the incoming value is a
twenty-four bit PE_INT value transmitted from the output of the
multiplexer MUX1 to the input of the padding module PAD1. All
incoming values are processed as signed values. To store the
incoming value within the register of the PAD1 module, the
one-bit control signal set_pad1 is set to a binary zero, thereby
defining the operation as a “pad” function. According to the
“pad” function, the twenty-four bit PE_INT value received from
MUX1 is stored in the least significant 24-bits of the 48-bit PAD
register. The sign bit is therefore stored in bit 23 of the PAD
register, and the most-significant bits within the PAD register,
bits twenty-four through forty-seven, are “padded” with zeroes.
Accordingly, the sign bit is not located in the most significant
bit of the 48-bit PAD register, but is in a middle bit storage
location of the register.
FIGS. 7 and 8 illustrate the transfer of a 24-bit value from the
input In1 to a 48-bit register in the padding module PAD1 via the
multiplexer MUX1. According to the step 1 of FIG. 8, a 24-bit
bitstream “A” is sent from the input In1 to the multiplexer MUX1.
In the step 2, the 24-bit field “A” is sent from the multiplexer
MUX1 to the lower 24-bits A’ of the 48-bit register in the
padding module PAD1. In the step 3, the padding module PAD1 pads
the upper 24-bit register B’ of the padding module with zeroes.
Each step in FIG. 8 is performed on a clock pulse, such that the
entire process requires a minimum of three clock pulses.
Alternative embodiments are envisioned, however, wherein the data
register within the PAD module is “zeroed-out” between
operations. According to this embodiment, step 3 of FIG. 8 is
unnecessary. Although the “clearing” or “zeroing out” of such
registers would require a clock pulse, the clearing may be done
while the processing element PE 203 is not actively processing
data, thereby eliminating the step 3 of FIG. 8 and reducing
actual processing time.
When the padding module PAD1 operates according to the second, or
“sign extended” mode (abbreviated “sign_ext”), the value to be
stored in the padding module register has more than twenty-four
actual bits of data. Accordingly, both the lower register (the
least significant 24-bits) and the upper register of the padding
module PAD1 will store actual data. Because the incoming data
path of In1 is only 24-bits, the receipt and storage of a PE_LONG
value over a 24-bit data path must occur over several clock
pulses. To control storage of a PE_LONG value in the 48-bit PAD
register, the control signal set_pad1 is set to a binary one,
which defines the “sign extended” function.
FIGS. 9 and 10 disclose a process for storing a 48-bit value from
the input In1 into the 48-bit register of the padding module
PAD1. According to the step 1 of FIG. 10, the twenty-four least
significant bits “A” (FIG. 9) of the incoming bit stream from the
input In1 are transferred into the multiplexer MUX1. In the
second step, the contents of the multiplexer MUX1 are coupled
into the lower register B of the padding module PAD1. In the step
3, the most significant 24-bits A’ of the bitstream entering
through In1 are transferred to the multiplexer MUX1. In the step
4, the contents of the multiplexer MUX1 are stored in the upper
register B’ of the register within padding module PAD1. In this
manner, 48-bits of data may be transferred to the padding module
PAD1 in two separate transmissions. When transferring a value
exceeding 24 bits, the sign bit is stored in bit forty-seven of
the PAD1 module. Each step in FIG. 10 requires one clock pulse,
such that the entire process disclosed in FIG. 10 requires four
clock pulses.
In both the “pad” mode and the “sign_ext” mode, a 24-bit data
path PE_INT is converted to a PE_LONG data field. In the case
wherein the total incoming value is contained in the first
24-bits, the sign bit remains in bit twenty-three (the middle of
the 48-bit field), and the left hand bits are padded with zeroes.
In the case wherein a 48-bit value (including any value exceeding
24-bits) is transferred, two separate transfers between the
multiplexer and the padding module
must take place, and the sign-bit is stored in bit forty-seven, 
the most significant bit of the 48-bit PAD register. The value 
remains in the PE_LONG format through subsequent processing
until the ALU_CLIP module reduces the 48-bit data
stream back down to 24-bits, as discussed in greater detail 
herein. 
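The two padding modes described above can be sketched as follows. This is a minimal illustration only, assuming the register is modelled as a Python integer; the function names `pad` and `sign_ext` are hypothetical and merely borrow the mnemonics of the text, and clock-level timing is not modelled.

```python
MASK24 = (1 << 24) - 1  # width of a PE_INT field
MASK48 = (1 << 48) - 1  # width of a PE_LONG field


def pad(value24: int) -> int:
    """'pad' mode: the 24-bit value occupies bits 0-23 and
    bits 24-47 of the 48-bit register are filled with zeroes."""
    return value24 & MASK24


def sign_ext(low24: int, high24: int) -> int:
    """'sign_ext' mode as described for FIG. 10: two 24-bit transfers,
    the first into the lower half and the second into the upper half."""
    return ((high24 & MASK24) << 24) | (low24 & MASK24)


if __name__ == "__main__":
    padded = pad(0x800000)              # sign bit ends up in bit 23
    print(f"{padded:012X}")
    wide = sign_ext(0x000001, 0x800000)  # sign bit ends up in bit 47
    print(f"{wide:012X}")
```

In the first case the sign bit sits in the middle of the 48-bit field, and in the second it sits in bit forty-seven, matching the two cases distinguished in the text.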
Table 1 illustrates the binary control codes and identical
operation and functionality of the padding modules PAD1
(discussed above), and PAD2 (discussed subsequently).
TABLE 1

Action in PAD1                                set_pad1    binary
                                              mnemonic    control code
PAD1 input receives PE_INT signal from        pad         0
  MUX1 and pads upper portion of register
  with zeroes.
PAD1 input receives PE_LONG signal from       sign_ext    1
  MUX1 and extends sign bit to 47th bit
  in PAD register.

Action in PAD2                                set_pad2    binary
                                              mnemonic    control code
PAD2 receives PE_INT signal from In3 and      pad         0
  pads upper portion of register with
  zeroes.
PAD2 receives PE_LONG signal from In3         sign_ext    1
  and extends sign bit to 47th bit in
  PAD register.
Each of the control codes, set_pad1 and set_pad2, is
defined by a single bit. According to the syntax of Table 1,
the command to place the first padding module PAD1 into
the second mode, “sign_ext,” is “pad1=sign_ext.” The control
values for set_pad1 and set_pad2 are downloaded during
the configuration phase.
After the padding operation is completed, the 48-bit
output signal “pad_1” of padding module PAD1 forms the
input signal into the arithmetic-logic shifter module,
AL_SHIFT. The AL_SHIFT module performs various bit-shift
operations to prepare the data for processing in the
arithmetic logic unit ALU. In a logical shift, all bits are
shifted a fixed number of positions to the left or right
according to the control signal, and zeroes are fed in to fill
the vacated bits. For example, in a logical shift left of five
bits, all binary values are shifted five bits to the left. Values
stored in bit addresses 43-47 are shifted out, and bit addresses
0-4 are filled with binary zeroes.
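The five-bit example above can be checked with a short sketch. It assumes, purely for illustration, that the 48-bit register is modelled as a Python integer masked to 48 bits; the function names are hypothetical.

```python
WIDTH = 48
MASK48 = (1 << WIDTH) - 1


def logical_shift_left(value: int, count: int) -> int:
    """Shift left; bits leaving bit 47 are dropped, zeroes enter at bit 0."""
    return (value << count) & MASK48


def logical_shift_right(value: int, count: int) -> int:
    """Shift right; bits leaving bit 0 are dropped, zeroes enter at bit 47."""
    return (value & MASK48) >> count


if __name__ == "__main__":
    # Bits 43-47 and bit 0 are set before the shift.
    before = 0xF80000000001
    after = logical_shift_left(before, 5)
    print(f"{before:048b}")
    print(f"{after:048b}")
```

After the shift, the five bits formerly in addresses 43-47 are gone, bit 0 has moved to bit 5, and addresses 0-4 hold zeroes.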
The three shift functions or operational modes of the
AL_SHIFT module are controlled by the 8-bit control signal,
set_alshift. As illustrated in Tables 2 and 3, the eight-bit
control signal “set_alshift” can be divided into three
sub-fields for controlling the shift of data. Bit 0 determines
the operation (logical or arithmetic shift); bit 1 is the
direction (right or left shift); and bits 2 through 7 are the
shift count.
TABLE 2

Action             set_alshift   Comment
Logical shift      xxxxxxx0      Shift left or right, with zeroes shifted in.
Arithmetic shift   xxxxxxx1      Right shift only: sign extend.
Shift right        xxxxxx0x
Shift left         xxxxxx1x
Shift count        ccccccxx      cccccc = 000000 to 111111 (0 to 63)
According to Table 2 above, when the least significant bit
(bit zero) is a binary zero, the command is for a logical shift.
When bit zero is a binary one, an arithmetic shift is
established. The value of bit one determines if the shift command
is for a shift right, indicated by a binary zero, or for a left
shift, indicated by a binary one. Bits 2-7 determine the shift
length. Since the PE_LONG data path as defined herein is
preferably a forty-eight bit data path, a shift may be as little
as one bit or as many as forty-seven bits. Since zeroes are
inserted into the “source” end of a logical shift, a logical
shift greater than forty-seven bits would effectively “zero-out”
the AL_SHIFT register. Those skilled in the art will
recognize that a six bit field is required to identify binary
values from zero to forty-seven. Accordingly, the “shift
count” field “cccccc” in Table 2 above is seen to be a six bit
field. However, a six bit field may include values from zero
to sixty-three. Since shift counts greater than forty-seven bits
are meaningless in a forty-eight bit data path or register, if
a number greater than forty-seven is entered in the six
bit shift count, the command is preferably flagged by the
compiler. When all shift sequences are completed, the data
is transferred from the output of the AL_SHIFT module to
the arithmetic-logic-unit (“ALU”).
Table 3 illustrates exemplary mnemonic operators for
describing the control functions of Table 2.
TABLE 3

Function                                 set_alshift mnemonic   Control code
Logical right shift “cccccc” bits        ilshr                  cccccc00
Arithmetic right shift “cccccc” bits     iashr                  cccccc01
Left shift “cccccc” bits                 ishl                   cccccc10
In implementing the mnemonic table and control commands
of Tables 2 and 3 above, a bit shift left command with
a shift length of three bits is syntactically represented as
“set_alshift=ishl 3.” This command produces the code
00001110. The first and second bits (from the right) are
written “10,” which form the “logical shift left” command in
the Table 3 above. The third and fourth bits in this example
are “11,” which form the least significant bits in the “cccccc”
portion of the field. The “cccccc” shift count field begins at
the third bit (bit 2) of the control code. Although the third bit
would have a binary weight of four if weights were counted from
bit zero of the control code, the additive binary progression,
1, 2, 4, 8, etc., for the shift count begins at the first bit of
the count field. Accordingly, the “11” found in the third and
fourth bits represents a shift count of three bits, not twelve
bits. The values for all eight control bits for the AL_SHIFT
module are downloaded during the configuration phase.
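The field layout of Tables 2 and 3 can be illustrated with a small encoder and decoder. This is a hedged sketch rather than the patent's configuration tooling: the function names and the dictionary are hypothetical, and raising an error for counts above forty-seven merely stands in for the compiler flag mentioned in the text.

```python
# Low two bits per Table 3: bit 0 = logical(0)/arithmetic(1),
# bit 1 = right(0)/left(1). Bits 2-7 carry the shift count.
AL_SHIFT_OPS = {"ilshr": 0b00, "iashr": 0b01, "ishl": 0b10}


def encode_set_alshift(mnemonic: str, count: int) -> int:
    """Build the 8-bit set_alshift control word."""
    if not 0 <= count <= 47:
        raise ValueError("shift count must be 0..47 on a 48-bit path")
    return (count << 2) | AL_SHIFT_OPS[mnemonic]


def decode_set_alshift(code: int) -> tuple[bool, bool, int]:
    """Return (is_arithmetic, is_left, count) from a control word."""
    return bool(code & 0b01), bool(code & 0b10), (code >> 2) & 0x3F


if __name__ == "__main__":
    word = encode_set_alshift("ishl", 3)
    print(format(word, "08b"))
    print(decode_set_alshift(word))
```

Because the count field starts at bit 2, a count of three contributes the bits “11” immediately above the two operation bits, which is the point the paragraph above makes about the count not being read as twelve.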
After the AL_SHIFT module has completed all shift operations,
the output signal alu_X of the AL_SHIFT module forms
the first input signal into the arithmetic logic unit ALU.
Because the ALU includes inputs from the multiplexer
MUX4 as well as the AL_SHIFT module, the data paths
leading to MUX4 are advantageously discussed at this time.
Input In3 of the processing element 203 is coupled to the
padding module PAD2, where the signal is converted from
a 24-bit signal PE_INT to a 48-bit signal PE_LONG suitable
for processing by the arithmetic logic unit ALU. Because the
padding module PAD2 (FIG. 6) receives data from only one
source, input In3, no intermediate multiplexer is necessary
to switch between optional inputs as was necessary to select
the input to padding module PAD1. The operational mode of
the padding module PAD2, “pad” or “sign_ext,” is controlled
by a 1-bit control signal set_pad2, which is determined
prior to operation and downloaded during the configuration
phase. The 48-bit PE_LONG output of the
padding module PAD2 forms a first input to the multiplexer
MUX4.
The second input to the multiplexer MUX4 is the 48-bit PE_LONG
output signal MUL_OUT produced by the mathematical operation
conducted by the multiplier 250, as further discussed below. The
multiplexer MUX4 selects between the 48-bit “pad2” signal from
the padding module PAD2 and the 48-bit MUL_OUT signal from the
multiplier 250. The switching of the multiplexer MUX4 is
controlled by the 1-bit control signal sel_mux4, which is
determined prior to operation and downloaded during the
configuration phase. The output signal alu_Y of the multiplexer
MUX4 forms the second input into the arithmetic logic unit ALU.
As illustrated in FIG. 6, the inputs alu_X and alu_Y are both
48-bit signals conducted on a PE_LONG data path.
The ALU processes the 48-bit inputs alu_Y and alu_X according to
a variety of arithmetic manipulations well known to those skilled
in the art. Table 4 below includes examples of the various
functions carried out within the ALU.

TABLE 4

Action                             alu_op mnemonic   alu_op control code
alu_out = X + Y (arithmetic)       op_add            0000
alu_out = Y - X (arithmetic)       op_subX           0001
alu_out = X - Y (arithmetic)       op_subY           0010
alu_out = X and Y (logical)        op_and            0011
alu_out = X nand Y (logical)       op_nand           0100
alu_out = X or Y (logical)         op_or             0101
alu_out = X nor Y (logical)        op_nor            0110
alu_out = X xor Y (logical)        op_xor            0111
alu_out = X xnor Y (logical)       op_xnor           1000
alu_out = X (arithmetic)           op_X              1001
alu_out = not X (logical)          op_invX           1010
alu_out = -X (arithmetic)          op_negX           1011
alu_out = Y (arithmetic)           op_Y              1100
alu_out = not Y (logical)          op_invY           1101
alu_out = -Y (arithmetic)          op_negY           1110

According to Table 4 above, the command to produce the logical
XOR of the two inputs (alu_out=X xor Y) would be syntactically
represented as “alu_op=op_xor.” This sets the alu_op control code
to 0111. When the ALU has completed its operation, data is
transferred from the output of the ALU to the input of the
arithmetic-logic-circular-shifter (“ALU_SHIFT”).
Since the output signal ALU_OUT of the ALU and the output signal
MUL_OUT of the multiplier 250 are processed in an identical
sequence of sub-components, a shifter, a rounding module and a
clipping module, the inputs to the multiplier 250 are
advantageously discussed at this time.
The second input In2 to the processing element 203 is a 24-bit
data path coupled to the first input of the multiplexer MUX3. The
second input of the multiplexer MUX3 is coupled to the 24-bit
constant data register DR2, which is pre-loaded with a constant
numerical value during the configuration of processing element
203. The multiplexer MUX3 is controlled by the control signal
sel_mux3, which determines which of the two inputs, In2 or DR2,
forms the 24-bit output signal Mul_Y coupled to the first input
of the multiplier 250. The control signal sel_mux3 is a one-bit
control signal, the state of which is selected prior to operation
and downloaded during the configuration phase.
The third input In3, which was earlier noted to form an input
into the padding module PAD2, is also coupled to the first input
of the multiplexer MUX2. The second input of the multiplexer MUX2
is coupled to the output signal ALU_CLIP_OUT, which, as discussed
in greater detail below, is essentially the output signal of the
arithmetic-logic-unit ALU after it has been reduced from a 48-bit
data path to a 24-bit data path. The output Mul_X of MUX2 forms
the second input to the multiplier 250. Accordingly, the second
input Mul_X of the multiplier 250 will either be a fresh input
signal, or a processed signal fed back from the arithmetic logic
unit ALU. The multiplexer MUX2 is controlled by the control
signal sel_mux2, which determines which of the two inputs, In3 or
ALU_CLIP_OUT, forms the 24-bit output signal Mul_X coupled to the
second input of the multiplier 250. The binary state of the
control signal sel_mux2 is determined prior to the processing of
input data, and is downloaded into the processing element 203
during the configuration phase.
The 48-bit PE_LONG output MUL_OUT of the multiplier is coupled
both to the shifter module MUL_SHIFT and to one of the two inputs
of the multiplexer MUX4. The output MUL_OUT of the multiplier 250
forms a potential input to the arithmetic logic unit ALU if
selected by the multiplexer MUX4, and the output ALU_OUT of the
ALU, after further processing by the ALU_SHIFT, ALU_RND and
ALU_CLIP modules, forms a potential input to the multiplier if
selected by the multiplexer MUX2.
The multiplier 250 output MUL_OUT and the ALU output ALU_OUT are
respectively coupled to identical “output shifters,” MUL_SHIFT
and ALU_SHIFT. In addition to having arithmetic shift and logical
shift capabilities also found in the AL_SHIFT module discussed
above, the output shifters also incorporate a circular shift
function not found in the AL_SHIFT module. A circular shift is
distinguished from a logical shift in that, in a logical shift
left, a binary value in the most significant bit is shifted out.
In a circular shift left, a bit exiting the most significant bit
re-appears in the least significant bit. Similarly, in a logical
shift right, binary values exiting bit zero (the least
significant bit) are simply lost, but in a circular shift right,
binary values exiting the least significant bit re-appear in the
most significant bit.
In preparing a 48-bit value for narrowing down to 24-bits by a
clipping module, one function of an output shifter is to
re-position the most significant bits (numerically) or the most
important bits (logically) into select bit addresses which are
not subject to clipping or truncation. Accordingly, the output
shifters help ensure that the data eventually made available to
the 24-bit output out2 is the most accurate representation
possible of the relevant data delivered by the multiplier 250 and
the ALU to the 48-bit data paths MUL_OUT and ALU_OUT. The output
shifter has a nine bit control word. Bits zero and one control
the type of shift: no shift, logical shift, circular shift, and
arithmetic shift. Bit two controls the direction of the shift.
Bits three through eight are the shift count, as illustrated in
Table 5 below.

TABLE 5

                    set_alu_shift
Action              set_mul_shift   Comment
No shift            xxxxxxx00
Logical shift       (illegible)
Arithmetic shift    (illegible)
Circular shift      xxxxxxx11
Shift right         xxxxxx0xx
Shift left          xxxxxx1xx
(shift count)       ccccccxxx       cccccc = 000000 to 111111 (0 to 63)

The “shift count” in the final row of Table 5 above is not
technically an action, but is included in the table to illustrate
the function of the left hand most bits as representing the
shift count for the other six actions listed in Table 5 above. 
The values of the binary control signals set_mul_shift and
set_alu_shift are determined and downloaded during the
configuration phase prior to operation. Table 6 below lists
the ALU_SHIFT and MUL_SHIFT functions, their control
mnemonics, and the binary codes of the control signals. The
control codes are identical for both control signals,
set_alu_shift and set_mul_shift.
TABLE 6

                                          mnemonics for
                                          set_alu_shift and
Output Shifter Function                   set_mul_shift       Control Code
No shift                                  noshift             000000000
Left shift “cccccc” bits                  shl                 cccccc101
Arithmetic right shift “cccccc” bits      ashr                cccccc001
Logical right shift “cccccc” bits         lshr                cccccc001
Circular right shift “cccccc” bits        cshr                cccccc011
Circular left shift “cccccc” bits         cshl                cccccc111
As with the AL_SHIFT module above, there is no arithmetic
shift left. The only left shifts within the ALU_SHIFT
and MUL_SHIFT modules are logical and circular. An
example of the code for performing an “arithmetic right
shift” of a total of five bits would be “set_alu_shift=ashr 5,”
and the bit pattern produced in conjunction with the above
tables for that command would be “000101001.”
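The difference between the circular shifts and the arithmetic right shift performed by the output shifters can be sketched on a 48-bit value as follows. This is an illustrative model only, assuming the register is held as an unsigned Python integer; the function names are hypothetical, and the control-word decoding is deliberately left out.

```python
WIDTH = 48
MASK48 = (1 << WIDTH) - 1


def circular_shift_left(value: int, count: int) -> int:
    """Bits leaving bit 47 re-appear at bit 0."""
    count %= WIDTH
    value &= MASK48
    return ((value << count) | (value >> (WIDTH - count))) & MASK48


def circular_shift_right(value: int, count: int) -> int:
    """Bits leaving bit 0 re-appear at bit 47."""
    count %= WIDTH
    value &= MASK48
    return ((value >> count) | (value << (WIDTH - count))) & MASK48


def arithmetic_shift_right(value: int, count: int) -> int:
    """Right shift that copies the sign bit (bit 47) into vacated bits."""
    value &= MASK48
    shifted = value >> count
    if value >> (WIDTH - 1):
        shifted |= MASK48 & ~(MASK48 >> count)
    return shifted


if __name__ == "__main__":
    print(f"{circular_shift_left(0x800000000001, 1):012X}")
    print(f"{circular_shift_right(0x000000000003, 1):012X}")
    print(f"{arithmetic_shift_right(0x800000000000, 4):012X}")
```

A circular shift never loses information, whereas the arithmetic right shift discards low-order bits while preserving the sign, which is why only right shifts are followed by rounding.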
Upon completion of the ALU_SHIFT operation, the data in
the ALU_SHIFT output is transferred to the input of the
rounding module ALU_RND.
The outputs alu_shift_out and mul_shift_out of the
ALU_SHIFT and MUL_SHIFT modules are respectively
coupled to the inputs of the ALU_RND and MUL_RND
modules. Rounding occurs only after a right shift has taken
place in the ALU_SHIFT or the MUL_SHIFT, and is based
on the last bit shifted out of the least significant bit (“LSB”)
during the shift operation. For a left shift or a
circular shift, no rounding occurs. Positive numbers are
rounded toward infinity if the LSB is one, and negative
numbers are rounded toward negative infinity if the LSB is
zero. Table 7 below illustrates the actions and control codes
for the rounding functions. Since the rounding functions are
identical for the ALU_RND module and the MUL_RND
module, both are illustrated in Table 7 below.
TABLE 7

Action                                       Command mnemonic   Code
No rounding of incoming signal               noround
  (alu_rnd = alu_shift) and
  (mul_rnd = mul_shift)
Round incoming signal                        round
  (alu_rnd = rounding of alu_shift) and
  (mul_rnd = rounding of mul_shift)
According to the above table, the command syntax for
disabling rounding in the ALU_RND module is
“set_alu_round=noround.”
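One way to read the rounding rule is sketched below. It rests on assumptions that go beyond the text: the value is modelled as a signed Python integer, the “LSB” is taken to be the last bit shifted out by an arithmetic right shift, and rounding is modelled as adding that bit to the shifted result. The text does not state what happens to a negative number whose last shifted-out bit is one, so that case simply follows from the same rule here. The function name is hypothetical.

```python
def shift_right_and_round(value: int, count: int, rounding: bool = True) -> int:
    """Arithmetic right shift of a signed integer, optionally rounded
    using the last bit shifted out (assumed interpretation)."""
    if count == 0:
        return value
    shifted = value >> count                 # Python's >> floors toward -infinity
    if not rounding:                         # 'noround' mnemonic
        return shifted
    last_bit_out = (value >> (count - 1)) & 1
    return shifted + last_bit_out            # 'round' mnemonic


if __name__ == "__main__":
    for sample in (5, 4, -6):
        print(sample, shift_right_and_round(sample, 1))
```

Under this reading a positive value whose shifted-out bit is one moves up to the next integer, while a negative value whose shifted-out bit is zero stays at the floored result.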
The outputs alu_round_out and mul_round_out of the
ALU_RND and MUL_RND modules are coupled to the
respective inputs of the clipping modules ALU_CLIP and
MUL_CLIP. Because the transmission of data between
processing elements is conducted over 24-bit data paths, the
clipping module reduces the data field from PE_LONG
(48-bits) down to PE_INT (24-bits) by truncating the upper
24-bits. Each clipping module has three modes of operation.
In the first mode, the upper 24-bits are devoid of meaningful
data. An example of this is a positive number requiring less
than 24 bits which has been shifted to the least significant
twenty-four bits of the input to the clipping module. The left
hand (most significant) 24 bits, being non-data, are simply
truncated. Typically, non-data are all zeroes, though this is
not always true. Because no data is lost, the first mode of
operation is known as a “no-clipping” process. For example,
if a data range for MUL_OUT was anticipated, at the time
of the pipeline configuration, to extend into bits 25, 26 and
27 of a 48-bit field, a shift operation could be pre-configured
to shift all bits three bits to the right. The least significant bits
would be lost in the process. The values remaining in the
twenty-four lower bits of the field would be the most
significant values, and the value represented therein would
have 24-bit accuracy. The upper 24-bits would then be
zeroes, and no data would be lost in their truncation.
The second and third modes actually perform data clipping
in which some actual data is lost. However, the shifting
and rounding processes discussed above are performed to
ensure that the data discarded through the clipping process
is the least significant data, thereby retaining the maximum
possible accuracy in a 24-bit field. In the second operational
mode, the value remaining after clipping is an unsigned
24-bit value. Being unsigned, the twenty-fourth bit may be used
to add magnitude to the stored number. In a 24-bit field, the
numerical range 0 to PE_MAX_POS of an unsigned value
is zero to 0xFFFFFF (zero to 16,777,215). An unsigned
24-bit field is said to be “saturated”.
In the third process, the signal is clipped to a signed 24-bit
value in the range PE_MAX_NEG to PE_MAX_POS, which,
as discussed above, ranges from negative 8,388,608, or
0x800000, to positive 8,388,607, or 0x7FFFFF.
The three different clipping processes are controlled by
the two-bit control signals “set_mul_clip” and “set_alu_clip”
within their respective modules MUL_CLIP and
ALU_CLIP. The control signals “set_mul_clip” and “set_alu_clip”
therefore respectively determine the character of the output
signals MUL_CLIP_OUT and ALU_CLIP_OUT. Table 8 below summarizes
the three different clipping actions performed by a clipping
module, along with their respective mnemonic and binary
codes.
TABLE 8

                                                   control                       Binary
Action                                             signal         mnemonic       Code
alu_clip_out = alu_round_out, PE_INT (bits 0-23)   set_alu_clip   noclip         00
mul_clip_out = mul_round_out, PE_INT (bits 0-23)   set_mul_clip
alu_clip_out = clip_pos(alu_round_out)             set_alu_clip   clip_pos       01
mul_clip_out = clip_pos(mul_round_out)             set_mul_clip
alu_clip_out = clip_pos_neg(alu_round_out)         set_alu_clip   clip_pos_neg   11
mul_clip_out = clip_pos_neg(mul_round_out)         set_mul_clip
According to the above table, the command syntax for
setting the multiplier clipper between PE_MAX_NEG and
PE_MAX_POS would be written: “set_mul_clip_out=clip_pos_neg.”
The binary values for the control codes
set_alu_clip and set_mul_clip are selected and downloaded
during the configuration phase, and do not change
during data processing.
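The three clipping modes of Table 8 can be sketched as follows. This is an illustrative model under stated assumptions: the incoming 48-bit value is represented as a signed Python integer, the second and third modes are modelled as saturation to the stated ranges, and the constant and function names are hypothetical.

```python
# Two-bit control codes from Table 8.
CLIP_CODES = {"noclip": 0b00, "clip_pos": 0b01, "clip_pos_neg": 0b11}

UNSIGNED_24_MAX = 0xFFFFFF      # 16,777,215
SIGNED_24_MAX = 0x7FFFFF        # 8,388,607
SIGNED_24_MIN = -0x800000       # -8,388,608


def clip(value: int, mode: str) -> int:
    """Reduce a wide value to a 24-bit result according to the mode."""
    if mode == "noclip":
        return value & 0xFFFFFF                     # keep bits 0-23 only
    if mode == "clip_pos":
        return max(0, min(value, UNSIGNED_24_MAX))  # unsigned 24-bit range
    if mode == "clip_pos_neg":
        return max(SIGNED_24_MIN, min(value, SIGNED_24_MAX))
    raise ValueError(f"unknown clipping mode: {mode}")


if __name__ == "__main__":
    print(hex(clip(0x1234567, "noclip")))
    print(clip(20_000_000, "clip_pos"))
    print(clip(-10_000_000, "clip_pos_neg"))
```

In the first mode the upper bits are discarded without regard to their content, which is harmless only when a preceding shift has already moved the significant bits into the lower twenty-four.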
As will be further discussed in conjunction with the
conditional multiplexer CMUX, the clipping modules
ALU_CLIP and MUL_CLIP respectively generate four
arithmetic status bits. The four arithmetic status bits of
ALU_CLIP form the control signal ALU_SW, and the four
arithmetic status bits of MUL_CLIP form the control signal
MUL_SW. One of the two control signals, ALU_SW and
MUL_SW, is selected to determine the data path through
the conditional multiplexer CMUX. The four arithmetic
status bits comprising CMUX_SW, however, do not
directly determine the switching of the conditional
multiplexer CMUX. Rather, the arithmetic status signal CMUX_SW
is processed by a ten-bit “mask signal,” sel_cmux.
The output 24-bit data path MUL_CLIP_OUT is
coupled to the first input of the multiplexer MUX5. A
second input of multiplexer MUX5 is coupled to the output
XB1 of the crossbar switch XBAR5x3. The output 24-bit
data path ALU_CLIP_OUT is coupled to the first input of
the multiplexer MUX6. A second input of multiplexer
MUX6 is coupled to the output XB2 of the crossbar switch
XBAR5x3. As discussed in greater detail in conjunction with
Table 15, each crossbar switch XBAR5x3 configurably
routes any of five different crossbar inputs, In1, In2, In3,
DR1 or DR2, to any of three crossbar switch outputs, XB1,
XB2 and XB3, essentially providing direct throughput
switching of the various input signals to the crossbar
outputs. The controlled routing is achieved through three
separate 3-bit control signals, sel_xb1, sel_xb2 and sel_xb3,
which are preset during the configuration mode.
The multiplexer MUX5 is controlled by the 1-bit control
signal sel_mux5, which is pre-set during the configuration
process to select one of the two inputs of MUX5, XB1 or
MUL_CLIP_OUT, as the output mux5_out of MUX5. The
multiplexer MUX6 is controlled by the 1-bit control signal
sel_mux6, which is pre-set during the configuration process
to select one of the two inputs to MUX6, XB2 or
ALU_CLIP_OUT, as the output mux6_out of MUX6. The
outputs mux5_out and mux6_out form the inputs of the
conditional multiplexer CMUX, discussed in greater detail
herein.
As discussed above, in addition to the output signals
ALU_CLIP_OUT and MUL_CLIP_OUT respectively
generated by the ALU_CLIP and MUL_CLIP modules, the
ALU_CLIP and MUL_CLIP modules are each capable of
generating a 4-bit status signal. The 4-bit status signals,
ALU_SW and MUL_SW, reflect the respective state of
select arithmetic status bits generated by the multiplier 250
and the ALU. One of these 4-bit status signals, ALU_SW or
MUL_SW, will be selected to form the control inputs of the
conditional multiplexer CMUX. The 4-bit signal MUL_SW
is coupled to the first input of the multiplexer MUX7, and
the 4-bit signal ALU_SW is coupled to the second input of
the multiplexer MUX7. The 1-bit control signal sel_mux7,
which is pre-set to a pre-selected value during configuration,
controls the multiplexer MUX7 to select one of the four-bit
status words ALU_SW or MUL_SW as the output signal
CMUX_SW of the multiplexer MUX7. This output signal
CMUX_SW serves as the control signal of the conditional
multiplexer CMUX. However, the four arithmetic status bits
comprising CMUX_SW do not directly determine the
switching of the conditional multiplexer CMUX. Rather,
arithmetic status signal CMUX_SW is processed against a
ten-bit mask signal, sel_cmux. Although the bit values
comprising the 4-bit status signal CMUX_SW are generated
through the processing of data, the mask signal sel_cmux
is determined prior to operation and downloaded
during the configuration process. The logical product of the
four-bit arithmetic status signal CMUX_SW and the ten-bit
mask signal sel_cmux determines if the output of the
conditional multiplexer CMUX is the output signal from
MUX5 or the output signal from MUX6. Various features of
the control signal CMUX_SW, the mask signal sel_cmux, and
the logical interaction between them are illustrated in Tables
9-13 below.
The control signal CMUX_SW, as it interacts with the
mask signal sel_cmux, dynamically controls the
data path through the conditional multiplexer CMUX,
which is an important feature of the RDPP computational
model. The conditional multiplexer has both conditional and
unconditional modes, as determined by the most significant
bit in the ten-bit control code sel_cmux. In the “unconditional
mode,” the conditional multiplexer ignores the arithmetic
status bits CMUX_SW and is simply switched according to
the state of the ninth bit (bit 8). Table 9 below illustrates the
control code for conditional and unconditional switching.
TABLE 9

Conditional switching command                       “sel_cmux” signal

Conditional switching of CMUX (determined by the    1xcccccccc
arithmetic status bits cmux_sw, and by bits 0-8,
cccccccc of the mask signal “sel_cmux”)

Unconditional switching of CMUX (dependent only     0cxxxxxxxx
on bit 9 (c) of sel_cmux, and unaffected by
arithmetic status bits CMUX_SW or mask bits 0-8
(xxxxxxxx) of “sel_cmux”)

In referencing the bits of the control signal in Table 9, the
“first bit” refers to the least significant bit, bit zero, and the
“last bit” refers to the most significant bit, bit nine. As
illustrated in Table 9 above, when the last bit of the control
signal sel_cmux is a binary “one”, the CMUX will
conditionally switch data according to the logical product of the
arithmetic status bits of CMUX_SW and bits 0-7 of the
mask signal sel_cmux. If the last bit of sel_cmux is a zero,
the switching will not be conditioned upon the interaction of
status signal CMUX_SW and mask signal sel_cmux, but
will be unconditionally determined by the status of bit eight
(the ninth bit) of sel_cmux as further illustrated in Table 10
below.

TABLE 10

Action                         sel_cmux signal
Unconditionally select MUX5    00xxxxxxxx
Unconditionally select MUX6    01xxxxxxxx

As discussed above, the outputs of MUX5 and MUX6 form
the inputs of the conditional multiplexer. For exemplary
purposes only, the default output of the conditional
multiplexer is herein designated as the output of MUX5. As
illustrated in Table 10 above, in the unconditional switching
mode (binary zero in bit 9, the most significant bit of
sel_cmux), a binary zero in sel_cmux(bit 8) will select the
default output, mux5_out, as the output cmux_out of the
conditional multiplexer CMUX. Alternatively, a binary one
in sel_cmux(bit 8) will select the output mux6_out as the
output cmux_out of the conditional multiplexer CMUX.
As discussed previously, the “CMUX_SW” signal is a
four-bit control signal reflecting the value of various
arithmetic status bits produced in the data processing of either
clipping module ALU_CLIP or MUL_CLIP. The control
signal, cmux_sw, is selected from the 4-bit arithmetic status
signals ALU_SW and MUL_SW according to the switching
of the multiplexer MUX7. These four status bits,
described in Table 11 below, act as “control bits” in that they
control the switching of the conditional multiplexer CMUX.

TABLE 11

Arithmetic Status
(Bit)         Name  Meaning
CMUX_SW(3)    Z     Bit is set to “1” if the data generated by the
                    clipping module is zero.
CMUX_SW(2)    N     Bit is set to “1” if data (the clipping output)
                    is negative.
CMUX_SW(1)    V     Bit is set to “1” if overflow has occurred in
                    processing of data.
CMUX_SW(0)    U     Bit is set to “1” if underflow (negative
                    underflow) has occurred in the processing
                    of data.

Because ALU_SW and MUL_SW are output
from the clipping modules, the term “data” as used in Tables
11-13 refers to the data processing and data output of the
clipping modules ALU_CLIP and MUL_CLIP.
Flexibility is built into the program in that the mask signal
sel_cmux may be configured to induce switching when a
specific status bit of Table 12 is set to “1,” or, alternatively,
to induce switching of the CMUX when the state of the
specific status bit is “0.” This flexibility can be clearly
understood by examining the relationship of the status bits
0-3 of cmux_sw to the mask bits 0-7 of “sel_cmux” in
Tables 13 and 14.
As discussed above, when the last bit of “sel_cmux” is a
binary “one”, the switching of the conditional multiplexer
CMUX is conditioned on the relationship of select
arithmetic status bits CMUX_SW to the pre-configured mask
“sel_cmux.” Table 12 below illustrates the logical function
of each bit within the pre-configured mask sel_cmux.

TABLE 12

Mask bit      Arithmetic
sel_cmux      Status bit                  Significance if mask
(bit)         MUL_SW (bit)  Mnemonic      bit is true:
sel_cmux(7)   Z             if_zero       Switch CMUX if Z = 1
                                          (when data = zero)
sel_cmux(6)   N             if_neg        Switch CMUX if N = 1
                                          (when data is negative)
sel_cmux(5)   V             if_oflow      Switch CMUX if V = 1
                                          (when overflow has occurred)
sel_cmux(4)   U             if_uflow      Switch CMUX if U = 1
                                          (negative overflow has occurred)
sel_cmux(3)   nZ            if_not_zero   Switch CMUX if Z = 0
                                          (when data is not zero)
sel_cmux(2)   nN            if_not_neg    Switch CMUX if N = 0
                                          (when data is not negative)
sel_cmux(1)   nV            if_no_oflow   Switch CMUX if V = 0
                                          (when no overflow has occurred)
sel_cmux(0)   nU            if_no_uflow   Switch CMUX if U = 0
                                          (no negative overflow has occurred)

The command “switch CMUX” in Table 12 refers to
switching from the default input, which for exemplary
purposes has been designated as MUX5, to the non-default
input, MUX6. According to Table 12 above, if bit
seven of the mask, sel_cmux(7), were set to a binary “one,”
it would allow switching of the conditional multiplexer
CMUX when the status bit cmux_sw(3) were a binary
“one,” indicating that the 24-bit output data, either
ALU_CLIP_OUT or MUL_CLIP_OUT, equaled zero. Similarly,
if the mask bit sel_cmux(0) were configured as a binary
“one” during the configuration process, the conditional
multiplexer would switch from the default output of MUX5
to MUX6 whenever status bit cmux_sw(0), the “underflow”
bit, was set to a binary “zero,” indicating that no underflow
had occurred. The architecture of each processing element
PE 203 allows multiple conditional switches to activate
switching of the conditional multiplexer CMUX provided
they are not mathematically contradictory. A second positive
switching command will not toggle back to the default
multiplexer MUX5. Multiple affirmative switching commands
will have the same effect as a single switching
command, switching the conditional multiplexer CMUX
from the default source to the alternative source. However,
if multiple switching commands are mathematically
contradictory, according to the preferred embodiment, the
conditional multiplexer will revert to the default output. An
example of mathematically contradictory switching signals
would be if sel_cmux(7) and sel_cmux(3) were both set to
a binary “one.”
Table 13 below further illustrates the meaning and
significance of individual bits in the mask sel_cmux, including
the binary code for a specific mask operation. The
conditional switching nomenclature in the first row, “cmux=mux6
if Zero else mux5,” signifies that the CMUX will default to
receiving input data from the output of MUX5, but will
switch to receive input data from MUX6 if the arithmetic
status bit “Z” (Table 11 above) becomes a binary “one”. As
discussed above, it is possible to effect switching from more
than one status bit. Accordingly, the term “else” is not meant
to exclude other status bits from affecting a switch from
MUX5 to MUX6, but is simply incorporated to utilize
common software code terminology.

TABLE 13

Condition triggering switching             sel_cmux      Binary code
cmux = mux6 if Zero else mux5              if_zero       1x1xxxxxxx
cmux = mux6 if Negative else mux5          if_neg        1xx1xxxxxx
cmux = mux6 if Overflow else mux5          if_oflow      1xxx1xxxxx
cmux = mux6 if Underflow else mux5         if_uflow      1xxxx1xxxx
cmux = mux6 if non-Zero else mux5          if_not_zero   1xxxxx1xxx
cmux = mux6 if non-Negative else mux5      if_not_neg    1xxxxxx1xx
cmux = mux6 if No Overflow else mux5       if_no_oflow   1xxxxxxx1x
cmux = mux6 if No Underflow else mux5      if_no_uflow   1xxxxxxxx1

According to the syntax of Tables 9-13, an exemplary line of
code fixing the conditional multiplexer to a specific path
would appear as “sel_cmux=mux5 always.” An exemplary
line of code switching to the non-default input when the “Z”
bit of MUL_SW becomes true would appear as
“sel_cmux=if_zero.”
Because these two lines of code are mutually exclusive,
however, they are not offered as examples of code which can
be used in conjunction with each other, but are offered as
independent examples. In the first line of code, “sel_cmux=
mux5 always,” the conditional multiplexer CMUX is set
unconditionally to receive data from MUX5. It will be
recalled that the last bit, bit-9, must be a binary zero to set
CMUX to unconditionally receive input from a specific
source. To select the default source, MUX5, bit-8 must be
zero. Accordingly, the binary code representing the
instruction “sel_cmux=mux5” is 0000000000. In the second line
of code, “sel_cmux=if_zero,” the CMUX will switch from
the default input, MUX5, to the alternate source, MUX6, if
the arithmetic status bit “Z,” MUL_SW(3), is true, indicating
that the 24-bit output ALU_CLIP_OUT or
MUL_CLIP_OUT equals zero. The binary code representing this
instruction in the sel_cmux control mask is 101xxxxxxx.
Bits zero through six, shown as “x,” may be in either state
provided they do not contradict the logical implications of
the binary “1” in bit seven. According to this mask
command, if status bit Z becomes a binary “one,” the
conditional multiplexer CMUX will switch to MUX6 as its
output. If the Z bit is not set, the CMUX will revert to the
default source, MUX5. As discussed above, it is understood
that either MUX5 or MUX6 could be selected as the default
source according to a pre-determined architecture.
As previously noted, compound conditions may be
specified, as illustrated by the following lines of code:
“sel_cmux=if_zero” and “sel_cmux=if_not_negative.”
Assuming MUX5 is the default multiplexer in FIG. 5,
according to the instructions expressed in code lines 4 and
5 above, CMUX will switch to the input from MUX6 if the
result of a computation is non-negative (positive) or if the
value is zero. As noted earlier, contradictory settings are
flagged by the compiler. According to the preferred
embodiment, the program will be rejected in the event of
contradictory instructions. However, embodiments are
envisioned wherein contradictory instructions act similarly to an
“unconditional” selection, defaulting to one of the two
inputs in the face of contradictory instructions under any
status-bit conditions.
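
For illustration only, the interaction of the mask signal sel_cmux with the status signal CMUX_SW described in Tables 9-13 may be sketched in Python as follows. The names make_mask, is_contradictory and cmux_select are assumptions introduced for this sketch and do not appear in the specification. The sketch further assumes that MUX5 is the default source and that a contradictory mask reverts to that default; it models behavior only and is not the mask logic of any particular embodiment.

```python
# Behavioral sketch of the CMUX selection rule (Tables 9-13).
# All names below are illustrative assumptions, not patent identifiers.

COND_BIT = 9    # sel_cmux bit 9: 1 = conditional mode, 0 = unconditional mode
UNCOND_BIT = 8  # sel_cmux bit 8: unconditional choice (0 = MUX5, 1 = MUX6)

# Mnemonics of Table 12 mapped to their sel_cmux mask bit positions.
MASK_BITS = {
    "if_zero": 7, "if_neg": 6, "if_oflow": 5, "if_uflow": 4,
    "if_not_zero": 3, "if_not_neg": 2, "if_no_oflow": 1, "if_no_uflow": 0,
}


def make_mask(*mnemonics):
    """Build a conditional-mode sel_cmux value from Table 12 mnemonics."""
    mask = 1 << COND_BIT
    for name in mnemonics:
        mask |= 1 << MASK_BITS[name]
    return mask


def is_contradictory(sel_cmux):
    """True if one status bit is required to be both 1 and 0 (e.g. bits 7 and 3)."""
    want_set = (sel_cmux >> 4) & 0xF   # bits 7..4: switch if Z, N, V, U = 1
    want_clear = sel_cmux & 0xF        # bits 3..0: switch if Z, N, V, U = 0
    return (want_set & want_clear) != 0


def cmux_select(sel_cmux, cmux_sw):
    """Return "mux5" (assumed default) or "mux6" for a 10-bit mask and 4-bit status."""
    if not (sel_cmux >> COND_BIT) & 1:
        # Unconditional mode: status bits are ignored, bit 8 decides.
        return "mux6" if (sel_cmux >> UNCOND_BIT) & 1 else "mux5"
    if is_contradictory(sel_cmux):
        return "mux5"  # assumption: contradictory masks revert to the default
    want_set = (sel_cmux >> 4) & 0xF
    want_clear = sel_cmux & 0xF
    status = cmux_sw & 0xF             # bit 3 = Z, bit 2 = N, bit 1 = V, bit 0 = U
    switch = (want_set & status) or (want_clear & ~status & 0xF)
    return "mux6" if switch else "mux5"
```

Under these assumptions, cmux_select(make_mask("if_zero"), 0b1000) selects MUX6 because the Z bit is set, whereas any mask with bit 9 cleared ignores the status bits entirely.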
FIG. 11 is an illustration of a mask 260 used to process the
arithmetic status signal CMUX_SW. The binary status of
mask signal sel_cmux is stored in a mask register 261,
forming a binary mask 262. The control signal CMUX_SW
representing the arithmetic status bits produced by either the
ALU_CLIP module or the MUL_CLIP module is processed
against the binary mask 262 according to preset mask
logic 263. According to the preferred embodiment, a CMUX
control output signal 265 determines the switching state of
the conditional multiplexer CMUX. Although the mask 260
is illustrated in FIG. 11 as being separate from the
conditional multiplexer CMUX, according to the preferred
embodiment, the mask 260 is integral to the conditional
multiplexer CMUX. Accordingly, the 4-bit signal CMUX_SW
is generally depicted herein as the control signal which
enters the conditional multiplexer CMUX and effects the
switching of data paths therein.
The output cmux_out of the conditional multiplexer is
coupled to the input of the first output register R_OUT1,
thereby depositing the final form of the processed data into
the output register R_OUT1. The 1-bit fire control signal
fire1_PE controls the firing of the contents of the first output
register R_OUT1 to the first output Out1 of the processing
element 203.
As noted throughout the preceding discussion, the selective
multiplexers MUX1-MUX7 each have two inputs, and
are controlled by a one-bit control signal. The control signal
determines which of the two input signals will form the
multiplexer output. The value of the control signal is
pre-selected and downloaded during the configuration process,
thereby defining the data path at that time. Table 14 below
describes the various inputs which may be selected by the
selective multiplexers MUX1 through MUX7 according to
the control signal they receive. For example, an examination
of FIG. 6 discloses that MUX1 may receive inputs from
DR1 or In1. Therefore, the action “mux1=DR1” indicates
that the control signal sel_mux1 controlling MUX1 is set to
configure MUX1 to receive its input data from DR1 rather
than In1. As noted in FIG. 6, multiplexer MUX4 is unique
in that its inputs are 48-bit PE_LONG data paths, whereas
the inputs of all other multiplexers in the processing element
203 are 24-bit PE_INT data paths.
TABLE 14

Action                   Code

sel_mux1
mux1 = In1               sel_in1             0
mux1 = DR1               sel_dr1             1
sel_mux2
mux2 = In3               sel_in3             0
mux2 = ALU_CLIP_OUT      sel_alu-            1
sel_mux3
mux3 = In2               sel_in2             0
mux3 = DR2               sel_dr2             1
sel_mux4
mux4 = pad2_out          sel_pad2_out        0
mux4 = MUL_OUT           sel_mul_out         1
sel_mux5
mux5 = MUL_CLIP_OUT      sel_mul_clip_out    0
mux5 = XB1               sel_xb1             1
sel_mux6
mux6 = ALU_CLIP_OUT      sel_alu_clip_out    0
mux6 = XB2               sel_xb2             1
sel_mux7
mux7 = ALU_SW            sel_mul_sw          0
mux7 = MUL_SW            sel_alu_sw          1

In using the above table, if the binary control signal for
MUX6 were a binary one, the control signal would indicate
“sel_xb2,” and the multiplexer MUX6 would select the
signal XB2 for its output.
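
By way of a non-limiting sketch, the “Action” column of Table 14 may be read as a lookup from a one-bit control value to the selected input. The dictionary MUX_INPUTS and the functions selected_input and data_path below are hypothetical names introduced for this illustration; the sketch assumes an unspecified control bit takes the default value zero, as in the default bitstream.

```python
# Table 14 "Action" column as a lookup: control bit -> selected input name.
# MUX_INPUTS, selected_input and data_path are illustrative assumptions.

MUX_INPUTS = {
    "mux1": ("In1", "DR1"),
    "mux2": ("In3", "ALU_CLIP_OUT"),
    "mux3": ("In2", "DR2"),
    "mux4": ("pad2_out", "MUL_OUT"),
    "mux5": ("MUL_CLIP_OUT", "XB1"),
    "mux6": ("ALU_CLIP_OUT", "XB2"),
    "mux7": ("ALU_SW", "MUL_SW"),
}


def selected_input(mux, control_bit):
    """Return the input a two-input multiplexer passes for a 1-bit control value."""
    if control_bit not in (0, 1):
        raise ValueError("control signal must be a single bit")
    return MUX_INPUTS[mux][control_bit]


def data_path(config):
    """Resolve every multiplexer's source; missing control bits default to 0."""
    return {mux: selected_input(mux, config.get(mux, 0)) for mux in MUX_INPUTS}
```

For example, data_path({"mux5": 1}) routes XB1 through MUX5 while leaving every other multiplexer on its zero-coded input.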
The crossbar switch XBAR5x3 is a five-input, three-output
crossbar (crosspoint) switch. The crossbar switch
XBAR5x3 has five inputs, In1, In2, In3, and the data
registers DR1 and DR2, and three outputs, XB1, XB2 and
XB3. As discussed above, the data registers DR1 and DR2
are loaded with predetermined constant values during the
configuration process. As illustrated in FIG. 6, any of the
five crossbar switch inputs, In1, In2, In3, DR1 and DR2, can
be selectively routed to any of the three crossbar switch
outputs XB1, XB2 and XB3 through the respective control
of the pre-configured control signals sel_xb1, sel_xb2 and
sel_xb3. The output XB3 forms the input to the second
output register R_OUT2. The outputs XB1 and XB2 are
respectively coupled to inputs of MUX5 and MUX6.
Table 15 illustrates the binary codes and control codes by
which the respective control signals sel_xb1, sel_xb2 and
sel_xb3 route the various inputs In1, In2, In3, DR1 and
DR2 to the respective output ports XB1, XB2, XB3. Each
control signal is 3 bits, and is pre-determined and
downloaded during the configuration process.

TABLE 15

Action         Code

sel_xb1
xb1 = DR1      sel_DR1     000
xb1 = DR2      sel_DR2     001
xb1 = In1      sel_In1     010
xb1 = In2      sel_In2     011
xb1 = In3      sel_In3     100
sel_xb2
xb2 = DR1      sel_DR1     000
xb2 = DR2      sel_DR2     001
xb2 = In1      sel_In1     010
xb2 = In2      sel_In2     011
xb2 = In3      sel_In3     100
sel_xb3
xb3 = DR1      sel_DR1     000
xb3 = DR2      sel_DR2     001
xb3 = In1      sel_In1     010
xb3 = In2      sel_In2     011
xb3 = In3      sel_In3     100

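
As a minimal sketch of the routing summarized in Table 15, the crossbar may be modeled as three independent lookups, one per output port. The names XBAR_SOURCES and route_crossbar are assumptions made for this illustration. The specification does not state how the unlisted codes 101, 110 and 111 behave, so the sketch simply rejects them.

```python
# Behavioral sketch of the XBAR5x3 routing of Table 15.
# XBAR_SOURCES and route_crossbar are illustrative assumptions.

XBAR_SOURCES = {
    0b000: "DR1",
    0b001: "DR2",
    0b010: "In1",
    0b011: "In2",
    0b100: "In3",
}


def route_crossbar(inputs, sel_xb1, sel_xb2, sel_xb3):
    """Map the five named inputs onto XB1, XB2 and XB3 per the 3-bit selects."""
    outputs = {}
    for port, code in (("XB1", sel_xb1), ("XB2", sel_xb2), ("XB3", sel_xb3)):
        if code not in XBAR_SOURCES:
            # Codes 101, 110 and 111 are not listed in Table 15.
            raise ValueError(f"{port}: code {code:03b} is not listed in Table 15")
        outputs[port] = inputs[XBAR_SOURCES[code]]
    return outputs
```

Because each output has its own select signal, the same input may appear on more than one output, for example In1 on both XB1 and XB2.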
Returning to FIG. 6, the output of the second output
register R_OUT2 is coupled to the second output Out2 of
the processing element 203. A fire control signal,
“fire2_PE,” triggers the firing of the second output register
R_OUT2 to the second output port, Out2, of the processing
element 203. The fire control signal, fire2_PE, however,
does not exercise independent control of the output register
R_OUT2, but works in conjunction with the output enable
signal ROut2_en. The output enable signal ROut2_en is
predetermined and downloaded during the configuration
phase, and must be enabled (in a binary one state) in order
for the second output register R_OUT2 to fire to the output
Out2. If the output register R_OUT2 is not enabled through
the output enable signal ROut2_en, the output Out2 will
hold its previous value regardless of the state or transition in
the firing signal fire2_PE. If the output register R_OUT2 is
enabled through output enable signal ROut2_en, the value
stored in output register R_OUT2 will be sent from the second
output register R_OUT2 to the output Out2 upon a
binary one in the fire control signal fire2_PE.
TABLE 16

ROut2_en          fire2_PE          Action
0 (binary zero)   0 (binary zero)   Out2 holds previous values
                                    on 24-bit output bus.
0 (binary zero)   1 (binary one)    Out2 holds previous values
                                    on 24-bit output bus.
1 (binary one)    0 (binary zero)   Out2 holds previous values
                                    on 24-bit output bus.
1 (binary one)    1 (binary one)    Send contents of output
                                    register R_Out2 to output
                                    Out2 on leading edge of
                                    fire2_PE signal.

As noted in FIG. 6, there is no enable bit affecting the
output of R_OUT1. Accordingly, the primary output register,
R_OUT1, is enabled by a run-time program, and cannot be
disabled. The secondary output register, R_OUT2, is not
used in all applications. It must therefore be explicitly
enabled by the enable output ROut2_en. According to the
preferred embodiment, fire control signals fire1_PE and
fire2_PE are coupled to the same signal source, and will
therefore go high and low simultaneously.
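
The gating of an output register by its fire and enable signals may be sketched as follows. The class OutputRegister and its methods are hypothetical names introduced for this illustration. The sketch assumes firing occurs on the leading (zero-to-one) edge of the fire signal, and that a register lacking an enable bit, such as the primary output register, fires on every leading edge.

```python
# Behavioral sketch of a fire/enable gated output register.
# OutputRegister and its method names are illustrative assumptions.

class OutputRegister:
    """Latched 24-bit output that updates on the leading edge of a fire signal."""

    def __init__(self, has_enable):
        self.has_enable = has_enable  # True for R_OUT2, False for R_OUT1
        self.stored = 0               # value held inside the register
        self.out = 0                  # value currently driven on the output bus
        self._prev_fire = 0           # previous fire level, for edge detection

    def load(self, value):
        """Deposit processed data into the register (24-bit data path)."""
        self.stored = value & 0xFFFFFF

    def clock(self, fire, enable=1):
        """Advance one cycle and return the value on the output bus."""
        rising = fire == 1 and self._prev_fire == 0
        self._prev_fire = fire
        if rising and (enable == 1 or not self.has_enable):
            self.out = self.stored    # fire: send register contents to output
        return self.out               # otherwise the output holds its value
```

In this model a disabled secondary register holds its previous output regardless of the fire signal, while the primary register ignores the enable argument altogether.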
Configuration of the Processing Element 
As discussed above, constant data registers DR1 and DR2
(FIG. 6, top) are twenty-four bit registers used for storing
fixed numerical values. The values are downloaded during
the configuration process. Because the most significant bit
(bit 23 in a 24-bit register) is the sign bit, the values
which may be stored in either of the 24-bit constant
data registers DR1 and DR2 range from negative 8,388,608
to positive 8,388,607, the scalar value being stored in the
twenty-three least significant bits. In Table 17, the value “n”
represents the constant value stored in each of the respective
constant registers DR1 and DR2. In a 24-bit register, this
means that (-8,388,608)<=n<=(+8,388,607). However,
because the present invention envisions applications
comprising constant data registers greater than 24-bits and less
than 24-bits, Table 17 simply expresses these values as
MAX_NEG and MAX_POS.
TABLE 17

Action          Comments
Load DR1 = n    MAX_NEG <= n <= MAX_POS
Load DR2 = n    MAX_NEG <= n <= MAX_POS

Although Table 17 illustrates the preferred embodiment, the 
present invention envisions alternative embodiments such as 
loading a 24-bit unsigned value in the constant data registers 
DR1 and DR2. 
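
The bounds MAX_NEG and MAX_POS for an arbitrary register width may be computed as in the following sketch. The function names are assumptions introduced here. The sketch also assumes a two's-complement encoding, which is consistent with the asymmetric range of negative 8,388,608 to positive 8,388,607 stated above but is not expressly named in the specification.

```python
# Sketch of the signed range of a constant data register.
# Function names and the two's-complement encoding are assumptions.

def signed_range(width):
    """Return (MAX_NEG, MAX_POS) for a signed register of the given bit width."""
    return -(1 << (width - 1)), (1 << (width - 1)) - 1


def encode_constant(n, width=24):
    """Encode n as a width-bit two's-complement field, checking Table 17's bounds."""
    max_neg, max_pos = signed_range(width)
    if not max_neg <= n <= max_pos:
        raise ValueError(f"{n} is outside [{max_neg}, {max_pos}]")
    return n & ((1 << width) - 1)


def decode_constant(bits, width=24):
    """Recover the signed value from a width-bit two's-complement field."""
    sign = 1 << (width - 1)
    return (bits ^ sign) - sign
```

For a 24-bit register, signed_range(24) yields the limits quoted in the text, and the same functions apply unchanged to wider or narrower constant registers.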
The configuration of each processing element (“PE”)
involves loading a total of one-hundred thirteen configuration
bits into the processing element. The table below
identifies these bits and their function. The first sixty-five
bits, 0-64, addressed from right to left (least significant bit
to most significant bit) are control bits, followed by the
transmission of two separate twenty-four bit constant values
which are to be loaded into the constant data registers DR1
and DR2. Table 18 below illustrates the syntax of the bit
stream.
TABLE 18

Bit Number in
Configuration
Bitstream (beginning   Configuration          Number    Default
at zero)               signal                 of bits   value
0-3                    alu_op (0-3)           4         0000
4                      set_pad1               1         0
5                      set_pad2               1         0
6-13                   set_alshift (0-7)      8         00000000
14-22                  set_alu_shift (0-8)    9         000000000
23-31                  set_mul_shift (0-8)    9         000000000
32                     set_alu_round          1         0
33                     set_mul_round          1         0
34-35                  set_alu_clip (0-1)     2         00
36-37                  set_mul_clip (0-1)     2         00
38                     set_mux1               1         0
39                     sel_mux2               1         0
40                     sel_mux3               1         0
41                     sel_mux4               1         0
42                     sel_mux5               1         0
43                     sel_mux6               1         0
44                     sel_mux7               1         0
45-54                  sel_cmux (0-9)         10        0000000000
55-57                  sel_xb1 (0-2)          3         000
58-60                  sel_xb2 (0-2)          3         000
61-63                  sel_xb3 (0-2)          3         000
64                     ROut2_en               1         0
65-88                  DR1                    24        0...0
89-112                 DR2                    24        0...0

As noted in Table 18, the default (initialization) bitstream
is all zeroes. The default processing element behavior is to
add In1 and In3 with no shifting, rounding or clipping, and
to place the sum in R_Out1. These control bits are loaded
by means of a series of configuration messages as discussed
in conjunction with FIG. 12.
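
The layout of Table 18 may be illustrated by the following sketch, which packs named field values into a single 113-bit integer with bit zero as the least significant bit. CONFIG_FIELDS, field_offsets and pack_config are hypothetical names introduced here; field names follow Table 18 as printed. Field values are assumed to be supplied as unsigned bit patterns, and unspecified fields take the all-zero default.

```python
# Sketch of packing the Table 18 configuration fields into one bitstream.
# CONFIG_FIELDS, field_offsets and pack_config are illustrative assumptions.

CONFIG_FIELDS = [
    ("alu_op", 4), ("set_pad1", 1), ("set_pad2", 1), ("set_alshift", 8),
    ("set_alu_shift", 9), ("set_mul_shift", 9), ("set_alu_round", 1),
    ("set_mul_round", 1), ("set_alu_clip", 2), ("set_mul_clip", 2),
    ("set_mux1", 1), ("sel_mux2", 1), ("sel_mux3", 1), ("sel_mux4", 1),
    ("sel_mux5", 1), ("sel_mux6", 1), ("sel_mux7", 1), ("sel_cmux", 10),
    ("sel_xb1", 3), ("sel_xb2", 3), ("sel_xb3", 3), ("ROut2_en", 1),
    ("DR1", 24), ("DR2", 24),
]


def field_offsets():
    """Return {name: (first_bit, width)} with bit 0 as the least significant bit."""
    offsets, position = {}, 0
    for name, width in CONFIG_FIELDS:
        offsets[name] = (position, width)
        position += width
    return offsets


def pack_config(values):
    """Pack a dict of unsigned field values; missing fields default to zero."""
    bitstream = 0
    for name, (first_bit, width) in field_offsets().items():
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        bitstream |= value << first_bit
    return bitstream
```

With an empty dictionary the function returns zero, matching the all-zero default bitstream noted above.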
The Configuration Message 
FIG. 12 illustrates a configuration message 400 for loading
configuration data into a processing element PE within
an RDPP pipeline. Each configuration message comprises a
header 402, a body 404 and a trailer 406. The body 404 of
each configuration message is capable of storing up to
24 bits of information, thereby taking full advantage of the
24-bit data paths to the various processing elements. It is
understood that, for embodiments utilizing data paths of
more or less than 24 bits extending to the various processing
elements, the body 404 of a configuration message 400 may
be advantageously re-sized to take full advantage of the
pipeline architecture. The data stored within the body 404 of
the configuration message 400 is then downloaded into
pre-determined configuration registers as determined by the
header 402.
The header 402 of each configuration message 400 is a
7-bit field including a 3-bit operation code field
(“OP_CODE”) 408 and a 4-bit PE address field (“PE_ADDR”)
410. According to the binary capacity of a four-bit
PE_ADDR field 410, a configuration message may be
directed to up to sixteen independently addressable processing
elements, numbered zero through fifteen. The present
invention envisions a pipeline comprising sixteen processing
elements, thereby making optimal use of the storage
capacity within PE_ADDR field 410 in the message header.
Embodiments are envisioned, however, for RDPP pipeline
processors comprising more than sixteen processing
elements, and the capacity of the address field PE_ADDR
410 may be changed accordingly.
As noted in Table 18 above, there are at least one-hundred
thirteen configuration bits, including the two 24-bit values
stored in DR1 and DR2, which must be downloaded to
configure a single processing element. Since no more than
24 bits may be downloaded into the processing element in
any one configuration message, the downloading of 113 bits
will require at least five separate configuration messages
400. To distinguish these five configuration messages, an OP
code 408 within the header 402 designates which set of
registers are to be configured by a particular configuration
message. In order to identify at least five distinct
configuration messages for each processing element, the operational
code field OP_CODE 408 located in the header comprises a
minimum of three bits. The binary pattern
stored in the OP_CODE field 408 identifies the configuration
data being downloaded, and ensures that it is switched
and routed to the proper configuration registers. Table 19
below illustrates the binary values for the operational codes
and the configuration data corresponding to each particular
OP_CODE.
TABLE 19

Op Code of      Bit address       Configuration Bit   Control signal, mask
configuration   within Body of    Number (0-112)      or register
message         Configuration     being downloaded    being configured
(3-bits)        Message (0-23)
000             0-3               0-3                 alu_op (0-3)
                4                 4                   set_pad1
                5                 5                   set_pad2
                6-13              6-13                set_alshift (0-7)
                14-22             14-22               set_alu_shift (0-8)
001             0-8               23-31               set_mul_shift (0-8)
                9                 32                  set_alu_round
                10                33                  set_mul_round
                11-12             34-35               set_alu_clip (0-1)
                13-14             36-37               set_mul_clip (0-1)
                15                38                  set_mux1
                16                39                  sel_mux2
                17                40                  sel_mux3
                18                41                  sel_mux4
                19                42                  sel_mux5
                20                43                  sel_mux6
                21                44                  sel_mux7
010             0-9               45-54               sel_cmux (0-9)
                10-12             55-57               sel_xb1 (0-2)
                13-15             58-60               sel_xb2 (0-2)
                16-18             61-63               sel_xb3 (0-2)
                19                64                  ROut2_en
011                               65-88               DR1
100                               89-112              DR2
101             0-23                                  Reserved
110             0-23                                  Reserved
111             0-23                                  Reserved

As illustrated in Table 19 above, a 3-bit OP_CODE field
within the header of the configuration message allows for
the configuration message to be routed to a specific
configuration register or group of configuration registers within
a processing element. According to the exemplary values
used in Table 19, a configuration message defined by the
operational code “000” will contain 23 useful bits of
configuration data in the 24-bit body. The configuration message
will be routed to the processing element defined in the
header address, the data stored in the body of the message
will be downloaded into the processing element and used to
configure the 4-bit control signal “alu_op,” the 1-bit control
signal “set_pad1,” the 1-bit control signal “set_pad2,” the
8-bit control signal “set_alshift,” and the 9-bit control
signal “set_alu_shift.” Operational codes of 011 and 100
designate the storage of the two 24-bit constant values
respectively stored in constant storage registers DR1 and
DR2. It is understood, however, that the terms “value” and
“constants” are not intended to limit the operations
associated with these digital values to mathematical operations.
The values may be used for any logical digital operation,
ANDs, NANDs, bit shifts, etc., whether or not the operation
is directed to a known mathematical operation.
Operational codes 101, 110 and 111 are reserved for future
use. Among the various configuration features to which the
reserved control codes may be directed, it is envisioned that
the reserved operational codes may be used to store preset
values for counters. As discussed in further detail below, the
pipeline architecture of the present invention is particularly
useful in outer space applications such as processing
photographic and other scientific and sensory data. Such
applications lend themselves to a repeatable sequence of firing
codes. A counter with a predetermined preset value could be
used to repeat a sequence of firing patterns a fixed number
of times. As noted, the body of a configuration message is
24 bits in length, which can be downloaded into a counter
preset register. Those skilled in the art will recognize that a
counter with a 24-bit preset value is capable of exceeding a
count of eight million. Because some processes for digital
imaging require sequential operations of upwards of a
million iterations, the architecture of the present invention is
particularly amenable to such applications.
The message trailer 406 is preferably an eight-bit field
used to contain an error checking code such as a cyclical
redundancy check or other error checking sum.
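
The division of the 113-bit configuration into five messages according to Table 19 may be sketched as follows. The names OP_CODE_RANGES, checksum8 and build_messages are assumptions introduced for this illustration. The ordering of the OP_CODE and PE_ADDR fields within the header is likewise assumed, and the 8-bit sum used for the trailer is a placeholder only, since the specification does not fix a particular error checking code.

```python
# Sketch of splitting a 113-bit configuration into Table 19 messages.
# Names, header field order and the placeholder checksum are assumptions.

# Op code -> (first, last) configuration bit carried in the message body.
OP_CODE_RANGES = {
    0b000: (0, 22),
    0b001: (23, 44),
    0b010: (45, 64),
    0b011: (65, 88),
    0b100: (89, 112),
}


def checksum8(header, body):
    """Placeholder 8-bit sum over the header and the three body bytes."""
    total = header + (body & 0xFF) + ((body >> 8) & 0xFF) + ((body >> 16) & 0xFF)
    return total & 0xFF


def build_messages(pe_addr, bitstream):
    """Return one message per op code for the addressed processing element."""
    if not 0 <= pe_addr <= 15:
        raise ValueError("PE_ADDR is a 4-bit field (0-15)")
    messages = []
    for op_code, (first, last) in sorted(OP_CODE_RANGES.items()):
        width = last - first + 1
        body = (bitstream >> first) & ((1 << width) - 1)
        header = (op_code << 4) | pe_addr   # assumed 7-bit layout: op code, address
        messages.append({
            "op_code": op_code,
            "pe_addr": pe_addr,
            "body": body,                   # at most 24 bits
            "trailer": checksum8(header, body),
        })
    return messages
```

Under these assumptions, configuring one processing element always produces five messages, carrying 23, 22, 20, 24 and 24 useful body bits respectively.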
Operation of an RDPP Pipeline 
Because dynamic power consumption is proportionally
smaller in ULP technology, ULP circuits exhibit superior
power consumption characteristics when running multiple
operations or algorithms in parallel. Parallel run operations,
accordingly, reduce the total throughput time required to
calculate the final output value. Accordingly, use of a
processing element comprising a conditional multiplexer CMUX
disclosed in FIG. 6 is particularly amenable to parallel
processing operations in ULP circuits. By controlling the
switching of the conditional multiplexer CMUX on the
arithmetic status bits produced during data processing, each
processing element can be conditioned to output data meeting
certain pre-determined specifications, thereby preventing
unfit data from passing through the selective multiplexer
to the output Out1. Accordingly, the present invention
utilizes a parallel pipeline configuration of multiple processing
elements in conjunction with a selective multiplexer,
thereby more fully exploiting the advantages made available
through ULP technology.
By shutting down unused PEs, additional power reductions
can be achieved. The “shutting down” may be achieved
in at least two ways. The first way is to place a
software-programmable switch between the processing element and
its power supply and ground lines. A second way is to
dynamically adjust the back bias on a PE to raise the
threshold. This has the effect of both throttling leakage
power, and making the PE unresponsive to signals on the
inputs, so that its internal gates do not change state and
consume dynamic power. The first approach increases gate
delays. The latter requires careful circuit design to manipulate
the thresholds, as well as a CMOS process that supports
these circuits.
FIG. 13 is a simplified illustration of a single processing
element that was illustrated in detail in FIG. 6. Contrasting
FIG. 14 to the detailed processing element schematic in FIG.
6, the output latch 517 represents both output registers
R_OUT1 and R_OUT2 of FIG. 6. As noted, according to
the preferred embodiment, the fire control signals fire1_PE
and fire2_PE of FIG. 6 are a single signal. Accordingly,
FIG. 13 is a simplified illustration of a processing element
PE focusing on the firing of the latched output through the
single fire control signal, fire_PE 523. The inputs 521

the preferred embodiment, during any program cycle, a
given processing element PE0 502-PE4 510 will be in one
of four states. In the “wait-state” W, a processing element is
waiting for input data from the preceding processing
element. In the “process-state” P, a processing element
processes data by executing a set of logic instructions. A
succession of consecutive processing states is possible for
a given processing element. After a processing element PE0
502 . . . PE4 510 completes its processing, which may be
accomplished in a single cycle or a succession of consecutive
cycles, it enters the “firing state” F, wherein it couples
its output to the next processing element. The firing,
however, does not automatically occur on the cycle
immediately following the last processing cycle. If a successor
processing element (“PE”) is not ready to receive the output
from a preceding processing element (“PE”), the preceding
PE will transition from the process-state “P” to a blocking
state “B”, wherein the control signal “fire_PE” (FIGS. 6 and
13) is blocked or suppressed until the successor PE is ready
to receive data. The blocking state is repeatable for as many
cycles as necessary. If a parallel branch is beginning
comprising multiple successor processing elements, the blocking
state will continue until all successor processing elements
are ready to receive an input. In the blocking state, a
processing element cannot process data, receive data, or fire
the data which it has processed. Accordingly, a preceding
processor will enter the blocking state if it has finished
processing and the successor processor is in either a
processing state or also in a blocking state. According to the
preferred embodiment, a preceding processor may fire to a
successor processor simultaneous with the firing of the
successor processor.
Within FIG. 15, the arrows illustrate the firing of output
data from one processing element to another processing
element. If an arrow is shown to extend over several cycles,
the actual firing occurs at the earlier cycle, but the extension
over several cycles illustrates that the data received in the
represent all inputs of FIG. 6, Inl ,  112,1113, DR1 and DR2. firing is not processed until a the remaining necessary inputs 
The outputs 519 represent all outputs of FIG. 6, Outl and are received. In cycle 1,  PEO 502 has received data and is 
Out2. FIG. 13 is simplified in that all other circuitry illus- 40 processing it. Since this is part of the initialization process, 
trated in FIG. 6 is simplified by the combinational logic 523 PE1 504-PE4 510 are seen to be in the wait-state “W,” 
of FIG. 13. The output latch 517 is controlled by the fire awaiting in input for processing. 
control signal f i r e P E  523. In cycle 2, PEO 502 has fired, enabling PE1504 and PE3 
FIG. 14  illustrates a processing pipeline 500 for perform- 508 to begin processing. PE2 506 and PE4 510 remain in a 
ing both sequential and parallel operations in processing 4s wait state, awaiting valid data for processing. 
data. Each block PEO 502, PE1504, PE2 506, PE3 508, PE4 In cycle 3, PEO 502 has received new data according to 
510 represents a separate processing element (“PE”), the exemplary program input depicted in the firing cycle of 
according to the simplified representation illustrated in FIG. FIG. 15. PEO 502 begins processing data immediately upon 
13. The various executable processes executed by PEO receipt. 
502-PE4 510 are driven by a clock pulse (not shown). SO In cycle 4, PEO 502 is blocked from firing. It has finished 
Advantageously, the same clock pulse drives all separate processing, but cannot fire until PE1504 and PE3 508 have 
programming elements PEO 502-PE4 510, thereby achiev- completed processing and fired their outputs. Although a 
ing synchronicity. Each successive clock pulse therefore processing element may receive data and process before it’s 
transitions the next cycle of a program. Within each pro- successors have fired, it may not fire until it’s successors 
cessing element PEO 502-PE4 510, the value “z” represents ss have fired. Accordingly PEO 502 has completed processing 
the number of cycles necessary to execute and trigger the in step 3, but is blocked from firing in steps 4 and 5,  as seen 
process executed by that particular processing element. in the states P-B-B for cycles 3-5. 
PE,, is used herein to designate the processing element In cycle 5, PE1504 couples and enables PE2 506 to start 
within a pipeline requiring the greatest number of cycles processing. 
z,, to execute its assigned process, when compared against 60 In cycle 6, PEO 502 transitions from the blocking mode to 
the other executable processes within the pipeline 500. the firing mode. This can be understood by noting that PE3 
According to the pipeline 500 of FIG. 14, PE,, is PE1504, 508 fired in cycle 4, and PE1 504 fired in cycle 5 .  Because 
and accordingly, zmUx of the pipeline 500 is 4 cycles. both of these processing elements are again free to receive 
FIG. 14  illustrates a timing chart defining the sequence of data in cycle 6, PEO 502 is free to fire to them again, thereby 
states being executed within various processing elements 65 enabling PE1 502 and PE3 508 to start processing again. 
PEO 502-PE4 510 during successive cycles of an operating PE2 506 also couples enabling PE4 510 to begin processing. 
sequence within the processing pipeline 500. According to It is noted that, according to the pipeline architecture of FIG. 
14, processing element PE4 510 is the final processing element in the pipeline. As a general rule, the entrance of the last processing element within a pipeline to the process state “P” marks the transition from the initiation cycle to the first repeatable sequence. The repeatable sequence is equal to $z_{max}$ cycles. Accordingly, beginning in cycle 6, the timing chart of FIG. 15 will enter a repeatable sequence that is 4 cycles in length. This can be observed by examining the states of any single processing element in cycles 6, 7, 8 and 9, and comparing them to the succeeding four cycles. For example, cycles 6-9 of PE0 502 are seen to be F-P-B-B. The same cycle is repeated in the cycles 10-13. The repetition of processing cycles 6-9 can similarly be observed for all of the processing elements in FIG. 15.
Accordingly, clock cycles 1-5 of FIG. 15 represent the initialization sequence for the pipeline 500 of FIG. 15, and the states represented in cycles 6-9 represent the first cycle of a repeatable sequence. As noted, because $z_{max}$ within pipeline 500 is 4 cycles, the repeating sequence beginning in cycle 6 has a cycle length z of four cycles.
In cycle 7, PE0 is assumed to receive data and begin processing again. As discussed above, if no input were forthcoming, PE0 would remain in the wait mode indefinitely through consecutive cycles until it received an input. PE1 will continue processing through cycles 7 and 8, and fire in cycle 9. Because PE1 will be unavailable to receive data until cycle 10, PE0 will remain in the wait mode through cycles 8 and 9. PE4 has finished processing the data it received, and enters the firing mode, outputting the data to the next segment, and enabling PE4 to receive input as soon as the next input is ready.
In cycle 8, PE0 enters the blocking mode, blocking any output until both successor processing elements, PE1 and PE3, have finished processing their current contents and fired. PE1 continues processing, and PE3 couples to the input of PE4. Because PE4 will lack the input from PE2 until cycle 10, the input from PE3 alone will not enable PE4 to commence processing, and PE4 remains in the W “wait state” through cycle 8. Although ULP (“ultra low power”) networks exhibit less power loss per clock cycle than an equivalent CMOS circuit operating in the region of five volts, there remains nevertheless some power loss through each clock cycle. Accordingly, embodiments are envisioned wherein most of the transistors comprising a specific processing element are isolated during wait states and blocking states for the executable process associated with that specific program, thereby reducing the dynamic power consumption.
In cycle 9, PE0 is again blocked since PE1 has not completed its processing and firing sequence. PE1 is firing to PE2, and PE2 begins processing the data input from PE1. PE3, having already fired and now waiting for a new input from PE0, enters the wait mode. Similarly, because PE4 has only received input from PE3, and continues to wait for input from PE2, PE4 also remains in the wait mode.
As discussed above, cycles 10-13, 14-17, etc. will simply repeat the state-sequence of cycles 6-9. Because $PE_{max}$ is PE1 504 in FIG. 14, the speed of the pipeline operation is limited by the 4-cycle $z_{max}$ of PE1 504. Additionally, it is noted that after the repeatable cycle begins in cycle 6, $PE_{max}$ will never enter the wait state. Being the slowest operation in the pipeline 500, the other processing elements are forced to wait for it.
Table 20 below is an illustration of a firing-sequence-table for controlling the output firing of the pipeline 500 of FIG. 14. Within Table 20, a binary “zero” is “do not fire” and a binary “one” is a command to fire an output latch corresponding to the fire control signal, as illustrated in FIGS. 12 and 13. The firing states in Table 20 conform to the “F” firing-states illustrated in FIG. 15. The Table is divided into two portions. The first portion, cycles 1-5, represents the initialization firing sequence. The second portion, cycles A, B, C, and D, represents the repeatable sequence which commences following cycle 5. Because they are repeatable, they are defined by letters rather than numbers.
TABLE 20
Clock Cycle   fire-PE0   fire-PE1   fire-PE2   fire-PE3   fire-PE4
1             0          0          0          0          0
2             1          0          0          0          0
3             0          0          0          0          0
4             0          0          0          1          0
5             0          1          0          0          0
A             1          0          1          0          0
B             0          0          0          0          1
C             0          0          0          1          0
D             0          1          0          0          0
According to the above table, during the first clock cycle 1, the fire control signals in all processing elements are zero, or “do not fire.” In the clock cycle 2, the control signal applied to the control input fire-PE0 is a binary “one” activating the firing of the output registers of processing element PE0 502. Subsequent cycles are interpreted in the same manner. According to the preferred embodiment, the repeatable portion of the cycle, A, B, C and D, is governed by a counter which counts the number of times the repeatable sequence is repeated, running the preset sequence a pre-determined number of times. As discussed above, a 24-bit counter preset allows for a sequence to be repeated over eight million times. Those skilled in the art will recognize that counter values exceeding the capacity of a counter preset register may be achieved by cascading counters together. Alternative embodiments are envisioned, however, wherein the number of times the repeatable sequence is run is not pre-determined by a counter, but determined dynamically through an evaluation of data.
Pipeline Architecture
FIG. 16 further shows a block diagram of architectural features of the preferred embodiment of an RDPP pipeline according to the present invention. FIG. 5 briefly discussed the interconnectability of component processing elements in a pipeline 220 by means of input select logic 223 and output select logic 233. As further discussed in conjunction with FIG. 6, many of the internal paths within a processing element are advantageously 48-bit data paths, whereas the inputs In1, In2, In3, DR1 and DR2, and outputs Out1 and Out2 shown in FIG. 6 are advantageously 24-bit paths. Within the individual processing elements of FIG. 6, multiplexers provided the configurable data path.
Between processing elements, a more versatile method is required. Fully connected programmable interconnects consume significant chip area. As noted, the preferred embodiment utilizes sixteen processing elements per RDPP pipeline, although embodiments containing greater or fewer than sixteen processing elements are envisioned. Because a shared bus requires bus arbitration logic, and limits the activity on the bus to one module at a time, the RDPP architecture advantageously achieves interconnectability between the various processing elements by employing a hierarchical scheme as illustrated in FIG. 16. The hierarchical scheme specifically employs crossbar or crosspoint switches 530 which enable multiple “talkers” to connect to multiple “listeners” over dedicated connections in much the same way that pairs of
telephone users talk over dedicated lines. According to commonly known switching theory, if a crossbar switch is serving N inputs and N outputs which may be configured and interconnected in any combination, producing N×N possible combinations, using k×k switches, the total number of stages required for implementation is $\log_k N$. Accordingly, an 8×8 crossbar can be implemented using 2×2 switches in three levels. This allows implementation of an effective address-encoding scheme, in which each bit of a destination address controls one level of switching. Because the number of permissible connection paths at each node differs according to the algorithm and computational model, and must further be evaluated against hardware costs, there is no preferred embodiment for the crossbar switching of the pipeline.
The reconfigurable nature of the RDPP pipeline permits fault tolerance through in-system reconfiguration to repair hardware failures. When a PE has failed, this can be detected through an onboard test procedure. An unused PE can then be identified, configured, and connected into the network of PEs, thereby taking over the function of the failed PE. Those skilled in the art are familiar with the various methods for detecting system faults and re-configuring a system to utilize alternative resources.
Application Illustration of an RDPP
Although the present invention is not limited to any one application, some typical data-intensive spacecraft applications are digital filters, pixel readout correction, hyperspectral image data conversion, and object detection and tracking. According to these examples, the processor will be required to operate on at least four kinds of data: (1) sensor signal data; (2) address data; (3) data state information, such as pixel labels; and (4) status information such as “done” signals.
FIG. 17 is an illustration of the present invention used in conjunction with an “infinite response filter,” which lends itself very well to the RDPP pipelined processors of the present invention. The output $y_k$ at sample time k is given by
$$y_k = \sum_{i=0}^{3} a_i x_{k-i},$$
where $x_k$ is the input at sample time k. The filter coefficients $a_i$ are stored in registers in the processing elements, such as constant data registers DR1 and DR2 of FIG. 6. The input samples are delayed in the input data buffer. Accordingly, an output $y_k$ is derived from the respective product of the input $x_{k-i}$ and each of the four filter coefficients, $a_0$, $a_1$, $a_2$, and $a_3$. Because the output $y_k$ at sample time k is determined by inputs from the previous sample time k-1, the above four-tap example requires a memory buffer for storing and correlating data being received and processed over a time delay.
Sensor nonuniformity correction illustrates the use of conditional switching. Imaging focal plane arrays typically exhibit pixel-by-pixel variation due to manufacturing tolerances; in particular, each pixel has a brightness offset due to leakage or dark current, and a gain variation. To obtain accurate data, the sensor must be calibrated. In the calibration phase, an estimated offset and gain factor for each pixel is stored in memory. The actual image is restored accurately by correcting the pixel-by-pixel variation created by manufacturing tolerances. In operation, the information from each pixel is corrected by multiplying by its corrective gain factor, and adding its corrective offset.
According to FIG. 17, an array of pixels 600 representing incoming data (x) are defined by horizontal and vertical coordinates x(m,n). The gain and offset parameters are respectively defined as a(m,n) and b(m,n), so that the output (y) of the respective pixels is defined according to the linear equation:
$$y(m,n) = a(m,n)\,x(m,n) + b(m,n).$$
However, as the result of radiation, abuse, manufacturing defect, or some other failure mechanism, some pixels may be “dead,” or non-responsive. In this case, no gain or offset can meaningfully restore the actual input values sensed by that pixel. In such cases, the pixel value is commonly replaced by a spatial average of its neighbors. Using RDPP according to the present invention, however, the output values for the “good pixel” and “bad pixel” are calculated simultaneously. Two processing elements read the gain and offset and correct each incoming pixel. The . . . for a bad pixel are . . . by . . . of three or more processing . . . determining the average of three neighboring pixels by . . . In the above . . . a bad pixel is therefore replaced by the . . .
$$y(m,n) = \tfrac{1}{3}\bigl(x(m-1,n-1) + x(m,n-1) + x(m-1,n)\bigr).$$
The determination on whether or not a pixel is good is determined by calibration data, wherein a particular code indicates which case applies to the pixel. If the pixel is reliable, the actual scaled value is selected. If the pixel is defective, the value derived from the neighboring pixels is selected. The output signal is generated accordingly. By this process, when the conditional switch determines if the pixel is good or bad, all the data is ready and available for further processing. The conditional switching of the conditional multiplexer CMUX performs the selection of alternative output data, thereby substituting values for defective pixels without the delay imposed by repetitive serial processing after a pixel is discovered to be bad based on its unlikely output. If the above process were performed on a von Neumann processor, a first calculation would be made regarding a good pixel. The determination would then be made as to whether or not the pixel were sound or defective. If defective, a new calculation would have to be performed to determine the average values of the surrounding pixels. It can therefore readily be seen that data path selection through conditional multiplexing in conjunction with parallel pipeline processing is an improvement over the prior art.
What is claimed is:
1. A method of processing data through reconfigurable data path processor comprising a plurality of independent processing elements, including first processing element comprising a first PE output, and a conditional multiplexer with a first multiplexer input, a second multiplexer input and a first multiplexer output, the method comprising the steps:
a. processing a first data set according to a first algorithm within the first processing element, wherein the first data set comprises a first processable value and a second processable value;
b. generating a first processed output according to the processing of the first data set;
c. generating a first set of arithmetic status bits according to the processing of the first data set through the first algorithm;
d. sending the first set of arithmetic status bits to a first arithmetic status bit output;
d. evaluating the first set of arithmetic status bits; and
e. establishing a first data path through the conditional multiplexer according to the evaluation of the first set
of arithmetic status bits, wherein the first data path is selected from among a data path connecting the first multiplexer input to the first multiplexer output and a data path coupling the second multiplexer input to the first multiplexer output.
2. The method according to claim 1 wherein the step of processing a first data set is preceded by the step of configuring the first processing element.
3. The method according to claim 2 wherein the step of configuring the first processing element comprises the step of configuring a first plurality of data paths within the first processing element.
4. The method according to claim 2 further comprising a logical mask with a mask register and mask logic, wherein the step of configuring the first processing element comprises the step of downloading a binary mask pattern into the mask register.
5. The method according to claim 4 wherein the step of evaluating the first set of arithmetic status bits comprises the step of comparing the first set of arithmetic status bits to the binary mask pattern according to the mask logic.
6. The method according to claim 2 wherein the step of configuring the first processing element comprises the step of transmitting a first PE configuration message.
7. The method according to claim 2 further comprising the step of configuring a second plurality of data paths interconnecting the plurality of processing elements within the reconfigurable data path processor.
8. The method according to claim 7 wherein the step of configuring the second plurality of data paths comprises the step of transmitting a first pipeline configuration message.
9. The method according to claim 7 wherein the second plurality of data paths are configured within a hierarchical network of configurable data paths.
10. The method according to claim 2 further comprising the steps of
a. processing a second data set according to a second algorithm within the first processing element, wherein the second data set comprises a third processable value and a fourth processable value;
b. generating a second processed output according to the processing of the second data set;
c. generating a second set of arithmetic status bits according to the processing of the second data set through the second algorithm; and
d. sending the second set of arithmetic status bits to a second arithmetic status output.
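As a reading aid for claims 1, 4 and 5, the following is a minimal Python sketch of one processing step followed by status-driven path selection. It is not taken from the patent: the bit positions of the status flags, the 24-bit word width used for the overflow check, the choice of addition as the "first algorithm", and the mask logic (the second multiplexer input is selected when any masked status bit is set) are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed status-bit layout (hypothetical; the claims name flags, not positions).
ZERO, NEGATIVE, OVERFLOW = 0b001, 0b010, 0b100

WORD_BITS = 24  # assumed word width for the overflow check
MAX_VAL = (1 << (WORD_BITS - 1)) - 1
MIN_VAL = -(1 << (WORD_BITS - 1))


@dataclass
class Result:
    value: int   # processed output (step b)
    status: int  # arithmetic status bits (step c)


def first_algorithm(a: int, b: int) -> Result:
    """Steps a-c: process two values (here: add) and derive status bits."""
    total = a + b
    status = 0
    if total > MAX_VAL or total < MIN_VAL:
        status |= OVERFLOW
        total = max(MIN_VAL, min(MAX_VAL, total))  # clip to the word range
    if total == 0:
        status |= ZERO
    if total < 0:
        status |= NEGATIVE
    return Result(total, status)


def mask_logic(status: int, mask: int) -> bool:
    """Compare status bits with the mask register (assumed: any masked bit set)."""
    return (status & mask) != 0


def conditional_mux(in1: int, in2: int, status: int, mask: int) -> int:
    """Step e: route in2 to the output when the masked status is true, else in1."""
    return in2 if mask_logic(status, mask) else in1


# Usage: pass the sum through unless it overflowed, in which case substitute 0.
mask = OVERFLOW
ok = first_algorithm(100, 23)
bad = first_algorithm(MAX_VAL, 1)
out_ok = conditional_mux(ok.value, 0, ok.status, mask)
out_bad = conditional_mux(bad.value, 0, bad.status, mask)
```

Loading a different value into `mask` changes which status condition steers the multiplexer without touching the processing step, which is the point of keeping the mask in a configurable register.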
11. The method according to claim 10 wherein the first 
processing element comprises a first selective multiplexer 
with a third multiplexer input coupled to the first arithmetic 
status output, a fourth multiplexer input coupled to the 
second arithmetic status output, and a second multiplexer 
output, wherein the step of configuring a first plurality of 
data paths further comprises the step of configuring a path
from the third multiplexer input to the second multiplexer 
output. 
12. The method according to claim 10 wherein the first algorithm includes an arithmetic logic unit with a first ALU input, a second ALU input, and an ALU output, and the second algorithm includes a multiplier with a first MUL input, a second MUL input and a MUL output, the method further comprising the steps:
a. inputting the first processable value into the first ALU input;
b. inputting the second processable value into the second ALU input;
c. inputting the third processable value into the first MUL input; and
d. inputting the fourth processable value into the second MUL input.
13. The method according to claim 10 wherein the first algorithm includes a multiplier and the second algorithm includes an arithmetic logic unit.
14. The method according to claim 3 wherein the processing element comprises a crossbar-switch with a plurality of crossbar-switch inputs and a plurality of crossbar-switch outputs including a first crossbar switch output and a second crossbar switch output, and wherein the step of configuring a plurality of data paths within the processing element comprises the step of controllably coupling a first input from among the plurality of crossbar-switch inputs to a first output from among the plurality of crossbar-switch outputs.
15. The method according to claim 2 wherein the step of configuring the first processing element comprises the step of downloading a first pre-determined value into a first constant-data register.
16. The method according to claim 12 wherein the first algorithm further comprises an first output shifter, a first rounding module and a first clipping module, and the second algorithm further comprises a second shifter, a second rounding module and a second clipping module.
17. The method according to claim 12 wherein the first processable value is derived from an output of the multiplier.
18. The method according to claim 12 wherein the third processable value is derived from an output of the arithmetic logic unit.
19. The method according to claim 13 further comprising a first constant data register configured to store a first fixed binary value, a second constant data register configured to store a second fixed binary value, a first PE input configured to receive a first binary input value, a second PE input configured to receive a second binary input value, and a third PE input configured to receive a third binary input value, the method further comprising the steps of
a. selecting the first processable value from among the first fixed binary value and the first binary input value;
b. selecting the second processable value from among an output value derived from the arithmetic logic unit and the second binary input value;
c. selecting the third processable value from among an output value derived from the multiplier and the second binary input value; and
d. selecting the fourth processable value from among the second fixed binary value and the third binary input value.
20. The method according to claim 2 further comprising the steps:
a. downloading a binary output value from the first multiplexer output to an output register;
b. triggering the output register with a fire-PE control signal; and
c. transmitting the binary output value from the output register to the first PE output.
21. The method according to claim 20 further comprising a second processing element with a fourth PE input and a second PE output, and a third processing element with a fifth PE input and a third PE output, wherein an output of the first PE is coupled to the fourth PE input, and an output of the first PE is coupled to the 5th PE input, whereby the first processing element forms a source of divergence for a parallel processing configuration.
22. The method according to claim 21 wherein the reconfigurable data path processor is a ULP CMOS circuit.
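Claims 20 and 21 trigger the output register with a fire-PE control signal. The Python sketch below replays a firing-sequence table of the kind the description gives as Table 20: an initialization portion run once, followed by a repeatable portion run a preset number of times. The row values are a reading of Table 20 cross-checked against the cycle-by-cycle narrative; the function names `fire_schedule` and `firing_pes`, the tuple encoding, and the plain loop standing in for the hardware repeat counter are assumptions for illustration only.

```python
from typing import Iterator, List, Tuple

FireRow = Tuple[int, int, int, int, int]  # fire bits for PE0..PE4

# Initialization portion (clock cycles 1-5), run once.
INIT_ROWS: List[FireRow] = [
    (0, 0, 0, 0, 0),  # cycle 1: nothing fires
    (1, 0, 0, 0, 0),  # cycle 2: PE0 fires
    (0, 0, 0, 0, 0),  # cycle 3: nothing fires
    (0, 0, 0, 1, 0),  # cycle 4: PE3 fires
    (0, 1, 0, 0, 0),  # cycle 5: PE1 fires
]

# Repeatable portion (cycles A-D).
REPEAT_ROWS: List[FireRow] = [
    (1, 0, 1, 0, 0),  # A: PE0 and PE2 fire
    (0, 0, 0, 0, 1),  # B: PE4 fires
    (0, 0, 0, 1, 0),  # C: PE3 fires
    (0, 1, 0, 0, 0),  # D: PE1 fires
]

COUNTER_BITS = 24  # width of the repeat-counter preset mentioned in the description


def fire_schedule(repeats: int) -> Iterator[Tuple[int, FireRow]]:
    """Yield (clock_cycle, fire_bits): init rows once, then repeat rows `repeats` times."""
    if not 0 <= repeats < (1 << COUNTER_BITS):
        raise ValueError("repeat count does not fit the assumed counter preset")
    cycle = 1
    for row in INIT_ROWS:
        yield cycle, row
        cycle += 1
    for _ in range(repeats):
        for row in REPEAT_ROWS:
            yield cycle, row
            cycle += 1


def firing_pes(row: FireRow) -> List[str]:
    """Names of the processing elements whose output latch this row triggers."""
    return [f"PE{i}" for i, bit in enumerate(row) if bit]


schedule = list(fire_schedule(repeats=2))
```

With `repeats=2` the schedule covers clock cycles 1 to 13, so cycles 6-9 and 10-13 carry identical fire bits, matching the four-cycle repetition described for the timing chart.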
23. The method according to claim 22 further comprising 
a. integrating the reconfigurable data path processor into 
b. shooting the spacecraft into outer space. 
24. The method according to claim 20 further comprising 
second processing element with a second PE output 
coupled to an input of the first processing element 
selected from among the first PE input, the second PE 
input and the third PE input, and a third processing 
element with a third PE output coupled to an input of 
the first processing element selected from among the 
first PE input, the second PE input and the third PE 
input, whereby the first processing element forms a 1s 
convergence of a parallel processing configuration. 
25. An ultra low power reconfigurable data path processor 
for processing data, comprising a plurality of processing 
elements, a first processing element comprising: 
30. The ultra low power reconfigurable data path proces- 
a. a second major processing component having a third 
and fourth major input and a second major output; 
b. a second processing component comprising 
i. a second partially processed data input; 
ii. a second processed data output; and 
iii. a second arithmetic status output; and 
c. a first selective multiplexer comprising: 
i. a third multiplexer input; 
ii. a fourth multiplexer input; and 
iii. a second multiplexer output, wherein the first selec- 
tive multiplexer is configurable to selectively estab- 
lish a data path selected from among a third data path 
coupling third multiplexer input to the second mul- 
tiplexer output and a fourth data path coupling the 
fourth multiplexer input to the second multiplexer 
output, and wherein the second major output is 
coupled to the second partially processed data input, 
and wherein the first arithmetic status output is 
coupled to the third multiplexer input and the second 
arithmetic status output is coupled with the fourth 
multiplexer input. 
31. The ultra low Power reconfigurable data Path Proces- 
sor according to claim 30 wherein the first major processing 
component is selected from among a group consisting of 
arithmetic logic units and multipliers, and the second major 
Processing component is selected from among a group 
32. The ultra low power reconfigurable data path proces- 
sor according to claim 31 further comprising a crossbar 
switch comprising a plurality of crossbar inputs including a 
arithmetic status output is configured to transmit a 3s plurality of crossbar outputs including a first crossbar output 
binary status of at least one select arithmetic status and a second crossbar output, wherein the crossbar switch is 
bit generated during data processing of the first configurable to selectively route any crossbar input to any 
processing component, the first arithmetic status crossbar Output. 
input of the first multiplexer. 
the steps: sor according to claim 29 further comprising: 
a spacecraft; and 
a. a conditional multiplexer comprising: 20 
i. a first multiplexer input; 
ii. a second multiplexer input; 
iii. a first multiplexer output; and 
vi. a first multiplexer control configured to select a data 
path according to a binary state of an arithmetic 2s 
status input, the data path selected from among a first 
data path coupling the first multiplexer input with the 
first multiplexer output and a second data path cou- 
pling the second multiplexer input with the first 
multiplexer output; and
b. a first processing component comprising:
i. a first partially processed data input;
ii. a first processed-data output; and
iii. a first arithmetic status output, wherein the first
output being couplable with the arithmetic status input.
26. The ultra low power reconfigurable data path processor
according to claim 25 wherein the first multiplexer control
comprises a data mask disposed between the arithmetic status
input and a first multiplexer control input, the data mask
comprising a mask input, a mask register for storing a
pre-determined binary mask, mask logic, and a mask output,
wherein the arithmetic status input is coupled to the mask
and the mask output is coupled to the first multiplexer
control input, such that the mask logic is configured to
control a value of the mask output according to a comparison
of a binary state of the at least one select arithmetic
status bit with select bits within the pre-determined binary
mask.
27. The reconfigurable data path processor of claim 26
wherein the data mask is integral to the conditional
multiplexer.
28. The ultra low power reconfigurable data path processor
according to claim 26 wherein the at least one select
arithmetic status bit comprises a plurality of bits,
including a zero status bit, a negative status bit, an
overflow status bit and an underflow status bit.
29. The ultra low power reconfigurable data path processor
according to claim 26 further comprising a first major
processing component with a first and second major input
and a first major output, wherein the first major output
is coupled to the first partially processed data input of
the first processing component.
[...] consisting of arithmetic logic units and multipliers.
[...] first crossbar input and a second crossbar input, and a [...]
33. The ultra low power data path processor according to
claim 32 further comprising:
a. a second selective multiplexer comprising
i. a fifth multiplexer input;
ii. a sixth multiplexer input; and
iii. a third multiplexer output, wherein the second
selective multiplexer is configurable to selectively
establish a data path selected from among a fifth data
path coupling the fifth multiplexer input to the third
multiplexer output and a sixth data path coupling the
sixth multiplexer input to the third multiplexer
output, and
b. a third selective multiplexer comprising
i. a seventh multiplexer input;
ii. an eighth multiplexer input; and
iii. a fourth multiplexer output, wherein the third
selective multiplexer is configurable to selectively
establish a data path selected from among a seventh data
path coupling the seventh multiplexer input to the
fourth multiplexer output and an eighth data path
coupling the eighth multiplexer input to the fourth
multiplexer output, and wherein a first
crossbar output is coupled to the fifth multiplexer
input, the second processed data output is coupled to
the sixth multiplexer input, the second crossbar
output is coupled to the seventh multiplexer input,
the first processed data output is coupled to the
eighth multiplexer input, the third multiplexer output
is coupled to the first multiplexer input, and the
fourth multiplexer output is coupled to the second
multiplexer input.
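As an informal illustration only, the data mask of claim 26 and the conditional multiplexer it controls might be modelled as below. The status bit names come from claim 28; the bit positions, the "any selected bit is set" comparison rule, the choice of which input is routed when the mask output is asserted, and the helper `subtract_with_status` are all assumptions made for this sketch and are not specified by the claims.

```python
from enum import IntFlag


class Status(IntFlag):
    # Status bits named in claim 28; the bit positions are assumed.
    ZERO = 0b0001
    NEGATIVE = 0b0010
    OVERFLOW = 0b0100
    UNDERFLOW = 0b1000


def mask_output(status: int, mask: int) -> bool:
    """Assumed mask logic: asserted when any status bit selected by the mask is set."""
    return (status & mask) != 0


def conditional_mux(first_input: int, second_input: int, status: int, mask: int) -> int:
    """Assumed polarity: route the second input when the mask output is asserted."""
    return second_input if mask_output(status, mask) else first_input


def subtract_with_status(a: int, b: int):
    """Hypothetical processing component: returns a result and its arithmetic status."""
    result = a - b
    status = Status(0)
    if result == 0:
        status |= Status.ZERO
    if result < 0:
        status |= Status.NEGATIVE
    return result, status


def max_via_mux(a: int, b: int) -> int:
    """Select the larger operand by masking only the NEGATIVE bit of (a - b)."""
    _, status = subtract_with_status(a, b)
    return conditional_mux(a, b, status, Status.NEGATIVE)
```

With the mask register holding only the negative bit, the status of `a - b` steers the multiplexer to `b` whenever `a` is smaller, so the data path computes a maximum without a branch instruction.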
34. The reconfigurable data path processor of claim 33
wherein the first crossbar input is selected from a group
consisting of a first processing element input, a second
processing element input, a third processing element input,
a first constant data register and a second constant data
register, and wherein the second crossbar input is selected
from a group consisting of a first processing element input,
a second processing element input, a third processing
element input, a first constant data register and a second
constant data register.
35. The reconfigurable data path processor of claim 34
wherein the first major input is coupled to a terminal selected
from among the second processed data output and the first
processing element input, and wherein the second major
input is coupled to a terminal selected from among the first
constant data register and the second processing element
input, and wherein the third major input is coupled to a
terminal selected from among the second constant data
register and the third processing element input, and the
fourth major input is coupled to a terminal selected from
among the first processing element input and the first
processed data output.
36. The reconfigurable data path processor according to
claim 35 further comprising a third crossbar output, wherein
the first multiplexer output is controllably coupled to a first
processing element output and the third crossbar output is
controllably coupled to a second processing element output.
37. The reconfigurable data path processor according to
claim 36 further comprising
a. a second processing element with a fourth processing
element input coupled to an output of the first processing
element, thereby forming a ninth data path; and
b. a third processing element with a fifth processing
element input coupled to an output of the first processing
element, thereby forming a tenth data path, thereby
forming a parallel path divergence.
38. The reconfigurable data path processor according to
claim 36 further comprising
a. a second processing element with a third processing
element output coupled to an input of the first processing
element, thereby forming a ninth data path; and
b. a third processing element with a fourth processing
element output coupled to an input of the first processing
element, thereby forming a tenth data path, thereby
forming a parallel path convergence.
39. The reconfigurable data path processor according to
claim 37 wherein the ninth and tenth data paths are formed
through a pipeline configuration command.
40. The reconfigurable data path processor according to
claim 37 wherein the second processing element and the
third processing element are configured to process data
simultaneously.
41. The reconfigurable data path processor according to
claim 33 wherein the second selective multiplexer is
configurable to selectively establish a data path selected from
among a fifth data path coupling the fifth multiplexer input
to the third multiplexer output and a sixth data path coupling
the sixth multiplexer input to the third multiplexer output
according to a processing element configuration message.
42. The reconfigurable data path processor according to
claim 37 wherein the first major processing component and
the second major processing component comprise a radiation
tolerant ultra low power CMOS circuit configured for
use in outer space.
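As an informal illustration only, the parallel path divergence of claim 37 and the parallel path convergence of claim 38 can be pictured as two different link tables over the same set of processing elements. Everything named here is hypothetical: the element names, the operations they perform, the `links` table standing in for a pipeline configuration command, and the `run` helper are assumptions for this sketch and do not come from the patent.

```python
from typing import Callable, Dict, List

# Hypothetical processing-element operations; the claims do not fix any of these.
ELEMENTS: Dict[str, Callable[..., int]] = {
    "pe1": lambda x: x + 1,
    "pe2": lambda x: x * 2,
    "pe3": lambda x: x - 3,
    "sum": lambda a, b: a + b,
}


def run(links: Dict[str, List[str]], sources: Dict[str, int], target: str) -> int:
    """Evaluate `target`, taking each input from a source value or an upstream element."""
    if target in sources:
        return sources[target]
    args = [run(links, sources, upstream) for upstream in links[target]]
    return ELEMENTS[target](*args)


# Divergence: the output of pe1 feeds both pe2 and pe3.
DIVERGE = {"pe1": ["in"], "pe2": ["pe1"], "pe3": ["pe1"]}

# Convergence: the outputs of pe2 and pe3 both feed a single element.
CONVERGE = {"pe2": ["in_a"], "pe3": ["in_b"], "sum": ["pe2", "pe3"]}
```

For example, `run(DIVERGE, {"in": 4}, "pe2")` and `run(DIVERGE, {"in": 4}, "pe3")` both consume the same `pe1` result, while `run(CONVERGE, {"in_a": 4, "in_b": 10}, "sum")` combines two independently computed results. Changing only the link table changes the data path, which is the sense in which the paths are formed by configuration rather than by fixed wiring.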
* * * * *  
