Ultra-Efﬁcient (Embedded) SOC Architectures based on Probabilistic CMOS
(PCMOS) Technology
Lakshmi N. Chakrapani Bilge E. S. Akgul Suresh Cheemalavagu Pinar Korkmaz
Krishna V. Palem Balasubramanian Seshasayee
Center for Research on Embedded Systems and Technology
Georgia Institute of Technology
Atlanta, Georgia, USA 30332.
{nsimhan,palem}@ece.gatech.edu
Abstract
Major impediments to technology scaling in the nanometer regime include power (or energy) dissipation and “erroneous” behavior induced by process variations and noise susceptibility. In this paper, we demonstrate that CMOS devices whose behavior is rendered probabilistic by noise (yielding probabilistic CMOS or PCMOS) can be harnessed for ultra low energy and high performance computation. PCMOS devices are inherently probabilistic in that they are guaranteed to compute correctly with a probability $1/2 < p < 1$ and thus, by design, they are expected to compute incorrectly with a probability $(1 - p)$. In this paper, we show that PCMOS technology yields significant improvements, both in the energy consumed as well as in the performance, for probabilistic applications with broad utility. These benefits are derived using an application-architecture-technology (A2T) co-design methodology introduced here, yielding an entirely novel family of probabilistic system-on-a-chip (PSOC) architectures. All of our application and architectural savings are quantified using the product of the energy and the performance denoted (energy × performance): the PCMOS based gains are as high as a substantial multiplicative factor of over 560 when compared to a competing energy-efficient CMOS based realization.
1. Introduction
As CMOS technology scales down into the nanometer re-
gion, hurdles introduced by noise and other device perturba-
tions (see Sano [14, 22], Kish [12] and Shepard [23]) pose sev-
eral challenges. The surprising premise that noise can be har-
nessed as a resource, rather than viewed as a hurdle was vali-
dated for the ﬁrst time using foundational principles and the-
oretical models (see Palem [15, 16, 17]). Building on these
foundations, we have designed and studied CMOS devices [5]
that are “unstable” or “noisy”. In earlier work, we demonstrated for the first time that computation based on such noisy CMOS devices can yield orders of magnitude improvements simultaneously to the energy consumed and to the running time of an application, collectively characterized as its energy-performance product (EPP). The particular form of CMOS that is affected by ambient (thermal) noise, which we refer to as probabilistic CMOS or PCMOS, was the singular innovation through which these improvements were accomplished. The two significant contributions of this paper are (i) the demonstration of PCMOS based ultra efficient (embedded) computing architectures, where efficiency is quantified through EPP and (ii) a demonstration of the value of this novel technology in the context of a range of applications.
This work is supported in part by DARPA under seedling contract #F30602-02-2-0124, by the DARPA ACIP program under contract #FA8650-04-C-7126 through a subcontract from USC-ISI and by an award from Intel Corporation.
To demonstrate the utility and the efﬁcacy of PCMOS,
we ﬁrst develop a methodology (akin to hardware software
co-design and described in Section 3) that we refer to as
application-architecture-technology (or A2T) co-design. Our
methodology is aimed at realizing extremely efﬁcient prob-
abilistic system-on-a-chip (PSOC) architectures using PCMOS
devices. As shown in Figure 2, a canonical PSOC architecture
consists of a (conventional) deterministic host processor used
to compute most of the control-intensive components of an
application, whereas the co-processor realized using PCMOS
devices will be used as an energy-performance (EPP) “ac-
celerator”. In this novel co-design methodology, the “prob-
abilistic content” (formalized later as ﬂux) of the algorithm
or application becomes a novel resource to be managed and
treated, much as space requirements, ﬂexibility and IP-reuse
are treated in the traditional co-design context. Furthermore,
as we will see in the sequel, considerations of architectural de-
sign efﬁciency differ signiﬁcantly in the context of PCMOS,
by contrast with those arising in the context of conventional
CMOS based architectures.
Applications based on probabilistic algorithms with sig-
niﬁcant ﬂux beneﬁt the most from PSOC architectures. Prob-
abilistic algorithms have found wide use in a range of em-
bedded applications drawn from speech and pattern recogni-
tion, security and other domains. To evaluate the beneﬁts of
PCMOS based architectures, we considered a set of applica-
tions (Section 3) and four competing architectural realizations
in silicon (Section 2); the associated gains are presented in
Section 4. In Section 5, we study another crucial and novel
aspect of computing architectures that implement probabilis-
tic algorithms. Speciﬁcally, in application domains employing
probabilistic algorithms, independent “probabilistic bits” are
needed in copious quantities. Nevertheless, techniques for pro-
ducing independent random bits are difﬁcult and are an exten-
sive area of study dominated by pseudo random number gen-
erators (PRNG) [19] with several complex approaches yield-
ing unsatisfactory results [8]. Here, we show that in addition to
yielding signiﬁcant gains to the EPP, PCMOS technology also
yields random bits of a high quality. We establish this by per-
forming the tests provided by the National Institute of Stan-
dards and Technology (NIST) [21]. Concluding remarks and
directions for future research are the subject of Section 6.
 
3-9810801-0-6/DATE06 © 2006 EDAA 
2. Probabilistic system on a chip (PSOC) architectures
As mentioned in the introduction, the surprising premise that CMOS devices rendered probabilistic due to noise are useful and yield energy and performance benefits at the application level will now be demonstrated using probabilistic system on a chip architectures (PSOCs). For completeness, we first present a brief overview of probabilistic CMOS (PCMOS) technology (for a detailed description, please see [5]) on which such architectures are based.
2.1. PCMOS technology
By studying PCMOS gates whose outputs are computed with a probability p < 1, we have shown [5] that the switching energy (E) grows exponentially with the probability of correctness (p). In addition, the noise magnitude quantified as its RMS and the switching energy E were shown to be quadratically related. These two relationships characterize the behavior of PCMOS devices. The behaviors were derived from analytical models of PCMOS gates and switches, and have been extensively studied and verified using HSpice simulations. In this paper, we use these PCMOS gates and their derived switches as building blocks, to demonstrate their benefits to applications through PSOC architectures.
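To make the trade-off between energy and probability of correctness more concrete, the following Python sketch models a single probabilistic inverter under simplifying assumptions that are not taken from [5]: additive Gaussian noise of RMS value `noise_rms` on the output node, a decision threshold at half the supply voltage, and a switching energy of (1/2)CV². All function names and parameter values are hypothetical, and the sketch does not reproduce the analytical model of [5].

```python
import math
import random


def correctness_probability(vdd, noise_rms):
    """Probability that the noisy output lands on the correct side of Vdd/2.

    Assumes zero-mean Gaussian noise added to an ideal output of 0 or Vdd.
    """
    return 1.0 - 0.5 * math.erfc(vdd / (2.0 * math.sqrt(2.0) * noise_rms))


def switching_energy(capacitance, vdd):
    """Energy of one switching event under the assumed E = 1/2 * C * Vdd^2."""
    return 0.5 * capacitance * vdd ** 2


def noisy_inverter(bit, vdd, noise_rms, rng):
    """One evaluation of an inverter whose output node is perturbed by noise."""
    ideal = vdd if bit == 0 else 0.0
    observed = ideal + rng.gauss(0.0, noise_rms)
    return 1 if observed > vdd / 2.0 else 0
```

Under these assumptions, keeping p fixed while doubling the noise RMS requires doubling the supply voltage, which quadruples the switching energy. This is one way to read the quadratic relation between noise RMS and switching energy described above.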
2.2. PSOC architectures
To compare PSOC architectures with computing platforms based on conventional CMOS technology, algorithm and architecture realizations in four different scenarios (Figure 1) were considered: (a) when appropriate, the best known deterministic algorithm, implemented completely in software and executing on a low-energy host processor (in our case a StrongARM SA-1100), (b) a probabilistic algorithm for realizing the same application, with pseudo random bits generated by a software implementation of a well known algorithm [19] (both executing completely on the host processor), (c) the same probabilistic algorithm executing on the host processor, with a conventional CMOS co-processor (collectively referred to as the “conventional CMOS based SOC” or SOC for succinctness) and (d) the same probabilistic algorithm executing on the host processor, with a functionally identical PCMOS based co-processor (a PSOC).
[Figure 1 diagram: (a) SA-1100 host running the deterministic algorithm; (b) SA-1100 host running the deterministic part of the probabilistic algorithm together with software based pseudorandom number generation; (c), (d) host (SA-1100 or ASIC) running the deterministic part of the probabilistic algorithm, connected through memory mapped IO to co-processor(s) based on CMOS or PCMOS that run the probabilistic and accelerated parts of the probabilistic algorithm.]
Figure 1. The host and co-processor realizations that are compared
These four cases encompass all reasonable alternate imple-
mentations of the application. Throughout this study, the co-
processors are speciﬁc realizations using CMOS (case (c)) and
PCMOS (case (d)), respectively of the probabilistic application
in question; thus the co-processors are application-speciﬁc.
2.3. Performance and energy modeling of PSOC architectures
To estimate the performance of PSOC and SOC architec-
tures, the simulator of the Trimaran infrastructure [11] (also
see [4]) has been conﬁgured to determine the number of cy-
cles taken by an application executing on a StrongARM SA-
1100 host. This simulator also records a trace of the activ-
ity on the CMOS and PCMOS components of the PSOC. This
information combined with the performance models of the
co-processors obtained through HSpice simulations of PCMOS
switches yields the PSOC and SOC performance in terms of ex-
ecution time.
The energy consumption of an application executing on a
PSOC or a SOC architecture is the sum of the energy consumed
by the host, the energy consumed by the PCMOS (CMOS)
co-processor(s), and the energy cost for communicating be-
tween the host and the co-processor(s). In this study, the co-
processors are memory mapped and therefore, communica-
tion is realized through load-store instructions executed on
the host. In all cases and to quantify the energy consumed
by the SA-1100 host, the model described in [24] is used.
This model is reported by its authors to be within 3% of
the energy measured on an actual SA-1100 host. The energy
modeling techniques applied to various components of the
PSOC (SOC) architecture are illustrated in Figure 2.
[Figure 2 diagram: the SA-1100 host (or full custom design), with an energy model based on JouleTrack, is connected through memory mapped I/O to co-processor(s) based on PCMOS or CMOS, with energy models based on HSpice simulations; communication is modeled through load/store instructions.]
Figure 2. The host and co-processor architecture of a SOC and its energy-performance modeling.
Since the co-processors are application-specific, the energy consumed by a particular co-processor varies with the application. The CMOS based co-processors were designed and synthesized using the TSMC 0.25 μm process, and the associated energy cost was determined from HSpice simulations. In the context of co-processors realized with PCMOS technology, the energy cost of the co-processor is derived similarly (and through physical measurements not reported here).
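The energy accounting described in this subsection is a sum of three terms: host, co-processor(s) and host to co-processor communication. A minimal Python sketch of that bookkeeping follows; the class name, the function name and the per-instruction energy figures are hypothetical and serve only to illustrate the structure of the model.

```python
from dataclasses import dataclass


@dataclass
class EnergyBreakdown:
    """Energy of one application run, split as in the model described above."""
    host_joules: float            # e.g. from a host energy model such as [24]
    coprocessor_joules: float     # e.g. from circuit level simulation
    communication_joules: float   # load/store traffic to the co-processor

    def total(self) -> float:
        return (self.host_joules
                + self.coprocessor_joules
                + self.communication_joules)


def communication_energy(num_loads, num_stores, load_joules, store_joules):
    """Memory mapped I/O cost: every transfer is a load or a store on the host."""
    return num_loads * load_joules + num_stores * store_joules
```

For example, `EnergyBreakdown(2.0, 0.5, communication_energy(10, 5, 0.01, 0.02)).total()` adds 0.2 J of communication energy to the 2.5 J spent in the host and the co-processor.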
3. The A2T co-design framework
In this study, we consider applications based on probabilistic algorithms that include Bayesian inference [13], Random Neural Networks [10], Probabilistic Cellular Automata [9] and Hyper-Encryption [6]. Any PSOC implementation of a probabilistic application involves partitioning the application between the host and the (application specific) PCMOS based co-processor. Even though the manner in which these applications are partitioned varies across individual applications, they follow a common theme, thus allowing us to suggest a methodology. The notion of a core probabilistic step with its associated probability parameter p is one such theme common to all of these applications, and across probabilistic algorithms in general. In our work, this core probabilistic step is identified by hand and implemented in PCMOS. The deterministic parts of the application are implemented as software executing on the host processor or as a customized application specific circuit (ASIC) when appropriate. This co-design methodology is unique in the sense that, as opposed to traditional SOC designs, several algorithm and technology characteristics explicitly motivated and grounded in PCMOS and its probabilistic behavior need to be considered to obtain highly efficient designs.
3.1. Algorithm and technology characteristics influencing co-design
PCMOS is particularly efficient in computing with ultra-low energy. For example, the energy consumed for generating one random bit using PCMOS is 0.4 picojoules [18]. By contrast, the Park-Miller algorithm [19] implemented in custom hardware in ASIC consumes about 2025 times this amount of energy. Given this dramatic difference and hence benefit, it is to be expected that having higher amounts of “probabilistic content” in the algorithm will yield greater opportunities for deriving benefits from PCMOS technology. Thus, the amount of “probabilistic content”, which we refer to as the application’s flux, and denoted by F, will be a figure of merit. Flux is defined as the ratio of probabilistic operations to the total number of operations of the algorithm.
Though PCMOS is extremely energy efficient, the operating frequency of our current design is low [18], and has been determined to be about 1 MHz. By contrast, CMOS based pseudo-random bit generators produce pseudo-random bits at a rate as high as 4 million bits per second or more. Given this potential limitation, the peak rate at which an application consumes random bits, or the (peak) application demand bandwidth, is a characteristic of interest. If the peak application demand bandwidth exceeds the bandwidth of the PCMOS based design (a design being an element or a building block that is PCMOS based), the PCMOS devices need to be replicated. Thus the need for extra bandwidth will be met through parallelism, and the amount is quantified as the replication factor R. Based on these technology and algorithm characteristics, the applications of interest are partitioned, optimized and implemented as PSOC designs.
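The two characteristics introduced in this subsection, the flux F and the replication factor R, reduce to simple arithmetic. The Python sketch below is a minimal illustration; the function names are hypothetical, and rounding R up to the next whole number of PCMOS elements is an assumption of this sketch rather than a rule stated in the text.

```python
import math


def flux(probabilistic_ops, total_ops):
    """Flux F: ratio of probabilistic operations to all operations."""
    if total_ops <= 0 or not 0 <= probabilistic_ops <= total_ops:
        raise ValueError("need 0 <= probabilistic_ops <= total_ops, total_ops > 0")
    return probabilistic_ops / total_ops


def replication_factor(peak_demand_bits_per_s, element_bits_per_s):
    """Number of parallel PCMOS elements needed to meet the peak demand.

    Assumption: bandwidth adds linearly when elements are replicated.
    """
    if element_bits_per_s <= 0:
        raise ValueError("element bandwidth must be positive")
    return max(1, math.ceil(peak_demand_bits_per_s / element_bits_per_s))
```

As a usage example, if one PCMOS element delivered one bit per cycle at 1 MHz (an assumption), an application with a peak demand of 4 million bits per second would give `replication_factor(4_000_000, 1_000_000) == 4`.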
3.2. The suite of applications
In this section, we (due to space constraints) summarize the
suite of applications, their partitioning and optimization, lead-
ing to the design of efﬁcient PSOC architectures.
Bayesian Networks (BN) Bayesian inference is a statistical inference technique mimicking the human decision making process. Hypotheses and their corresponding probability weights are notions central to this technique. The probability weights are interpreted to be the degrees of belief associated with the corresponding hypotheses. Based on evidence, the degree of belief in a hypothesis is incremented (or decremented) until it approaches 1 (or 0), in which case the hypothesis is very likely (unlikely). A Bayesian network is used to perform a task referred to widely as Bayesian inference, and is modeled as a directed acyclic graph $G$ of nodes $V$ representing variables and edges $E$ representing dependence relations between the variables. Each variable $u$ uniquely represented by a node $v \in V$ can be assigned a value from a finite set of values $\Sigma_u$. Each value $\sigma \in \Sigma_u$ has a conditional probability $p(\sigma / \sigma' \in \Sigma')$ associated with it, where $\sigma' \in (\Sigma_1 \times \Sigma_2 \times \Sigma_3 \cdots \Sigma_l)$ is the string of values of the variables represented by all of the $l$ parents of $u$. Variables whose values are known a priori are called evidences and, based on such evidence, other variables are inferred. The particular Bayesian networks considered in this study are part of the following applications: a hospital patient management system and printer troubleshooting in a Windows operating system environment.
[Figure 3 diagram. Labels recovered from the drawing: module; a row in a module; buffer; decoder; bcol; brow; metal wires (wire width L = 0.6u, pitch P = 1.6u); seven switches; priority encoder; read enable; 7-bit buffer.]
Figure 3. The co-processor architecture of a PSOC which implements Bayesian inference.
Partitioning and Optimization We choose the likelihood weighting algorithm [20] for Bayesian inference. The random experiment (used for inference) in this probabilistic algorithm is implemented in the PCMOS co-processor (consisting of several modules), with the remainder implemented as software executing on the host. In a Bayesian network $G$, the conditional probabilities associated with each value of the variables of a node are known a priori, and are used to design a module of PCMOS switches (inverters), one module per node $v$ in the graph. As an example, consider a node $u$ with $\Sigma_u = \{0, 1, 2\}$. As before, let $\sigma'$ be an instance of the string of values associated with the parents of $u$. Let $0 \le p(0/\sigma'), p(1/\sigma'), p(2/\sigma') \le 1$ be the conditional probabilities associated with $0, 1, 2 \in \Sigma_u$ respectively, given that the parents of the node $v$ have outputs $\sigma' \in \Sigma'$. In our PSOC architecture, Bayesian inference will be performed by three PCMOS switches $A$, $B$ and $C$ corresponding to 0, 1, 2 respectively. The inputs for these switches are fixed at 0 and the probability of correctness associated with $A$, $B$, $C$ is, by design, $p(0/\sigma')$, $\frac{p(1/\sigma')}{1 - p(0/\sigma')}$ and $\frac{p(2/\sigma')}{1 - p(0/\sigma') - p(1/\sigma')}$ respectively. Thus, when the switches are inspected in the order $\langle A, B, C \rangle$, the value which corresponds to the first switch whose output is the value 1 is the value inferred by node $u$. In the PSOC design, the set of switches $\{A, B, C\}$ will be referred to as a row and each distinct switch in this set will be referred to as an element. Since a row is associated with each element of the set $\Sigma'$, many rows are required to implement the strings associated with the space of all possible outputs corresponding to the parents of the node $u$ from $\Sigma'$. This set of rows will be referred to as a table.
As shown in Figure 3, the PCMOS module corresponding to a node $u$ implements a table, whose row is indexed by a particular string $\sigma'$ of values associated with the parents of $u$ computed earlier. The number of columns in the table is $|\Sigma_u|$, where each column corresponds to a value from the set $\Sigma_u$; in our example, $|\Sigma_u| = 3$. An element in the table, identified by <row, column>, is a specialized PCMOS switch whose probability of correctness is computed as indicated above. Finally, a conventional priority encoder is connected to the outputs of a row to determine the final result of the random experiment; it performs the function of inspecting the values of a row and choosing the final output associated with $u$.
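The row and priority encoder described above amount to sampling one value from a conditional distribution with a chain of independent biased coin flips. The Python sketch below emulates this in software, assuming the conditional probabilities of a row sum to 1. The function names are hypothetical, and a pseudo-random generator stands in for the PCMOS switches.

```python
import random


def switch_probabilities(conditional):
    """Map p(value | parents), listed in inspection order, to the probability
    with which each switch in the row outputs 1."""
    probs = []
    remaining = 1.0  # probability mass not yet claimed by earlier switches
    for p in conditional:
        # Guard against division by ~0 caused by rounding in the running sum.
        probs.append(min(1.0, p / remaining) if remaining > 1e-12 else 1.0)
        remaining -= p
    return probs


def infer_value(values, probs, rng):
    """Emulate one row plus its priority encoder: inspect the switches in
    order and return the value of the first one whose output is 1."""
    for value, p in zip(values, probs):
        if rng.random() < p:  # switch with input 0 outputs 1 with probability p
            return value
    # Reached only through floating-point rounding of the last probability.
    return values[-1]
```

For conditional probabilities 0.2, 0.3 and 0.5, the three switches are designed with probabilities 0.2, 0.375 and 1, and repeated calls to `infer_value` return 0, 1 and 2 with relative frequencies close to the original distribution.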
Random Neural Network (RNN) Following Gelenbe [10],
a random neural network consists of neurons and connections
between the neurons. Information is exchanged between the
neurons in the form of bipolar signal trains. Neurons have
potentials associated with them, which are deﬁned to be the
sums of incoming signals. This potential in turn, inﬂuences
the rate of firing. The particular neural network considered in this study, due to Gelenbe and Batty [10], is used to heuristically determine the vertex cover of a graph.
Partitioning and Optimization The Poisson process
which models the “ﬁring” of a neuron is implemented in the
PCMOS co-processor, with the rest of the computation imple-
mented to execute on the host processor. To realize the Pois-
son process characterizing a neuron ﬁring, the Bernoulli
approximation of a Poisson process [7] is used. As an exam-
ple of a methodological step in our A2T co-design approach,
since the rate at which random bits are required by the host ex-
ceeds the rate at which PCMOS based switches can compute,
the “neurons” in the co-processor of the PSOC are repli-
cated to match the required rate. In the interests of efﬁciency,
and as another example of our A2T methodology, the ap-
plication is restructured to reduce the replication factor R,
by interleaving the demand for random bits and the pro-
cessing of these bits on the host—distributing the ﬁrings
more evenly over the course of the entire application’s exe-
cution. This has the effect of reducing the peak application
demand bandwidth.
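The Bernoulli approximation mentioned above can be illustrated as follows: time is divided into short slots, and in each slot the neuron fires with probability equal to the firing rate multiplied by the slot length. The Python sketch below is a software stand-in for the PCMOS switches; the function name, the rate and the slot length are hypothetical.

```python
import random


def bernoulli_firings(rate_hz, slot_s, num_slots, rng):
    """Approximate a Poisson firing process by one biased coin flip per slot.

    The approximation is reasonable only when rate_hz * slot_s is small.
    """
    p = rate_hz * slot_s
    if not 0.0 <= p <= 1.0:
        raise ValueError("rate_hz * slot_s must lie in [0, 1]")
    return [1 if rng.random() < p else 0 for _ in range(num_slots)]
```

With a rate of 50 firings per second and slots of 1 ms, each slot fires with probability 0.05, so about 5000 firings are expected over 100000 slots.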
Probabilistic Cellular Automata (PCA) are a class of cellu-
lar automata used to model stochastic processes [9]. Cellular
automata consist of cells with local (typically nearest neigh-
bor) communication. Each cell is associated with a state and a
simple transition rule which speciﬁes its next state, based on
its current state and typically, the states of its neighbors. In the
probabilistic string classiﬁcation application due to Fuks [9],
the state of each cell assumes a value of 0 or 1, giving rise
to 8 possible transition rules (each rule has two possible outcomes, 0 or 1). In addition, each transition rule is probabilistic: for a transition rule $i$ ($0 \le i \le 7$), the probability that the output state of the rule is 0 is denoted by $p_{i,0}$ and the probability that the output state is 1 is denoted by $p_{i,1}$.
Partitioning and Optimization Each transition rule is im-
plemented in the co-processor by a PCMOS switch whose in-
put is a 0. The probability of correctness associated with the
$i$th switch is $p_{i,1}$. Again, the control-intensive part of choosing a transition rule (based on the state of a cell and the states
of its neighbors) and updating the states upon evaluating the
rules are all implemented on the host processor. Since the rate
at which the transition rules are evaluated exceeds that sup-
ported by PCMOS devices, this structure is again replicated
many times with concomitant optimizations.
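One update step of such an automaton can be sketched in Python as below. Two details are assumptions of this sketch and are not specified in the text: the rule index is formed from the (left, centre, right) states as a 3-bit number, and the cell array wraps around at its ends. The function name is hypothetical, and a pseudo-random generator stands in for the PCMOS switches.

```python
import random


def pca_step(cells, p_one, rng):
    """Apply one probabilistic update to every cell.

    cells: list of 0/1 states.
    p_one: list of 8 probabilities; p_one[i] is the probability that
           rule i produces the output state 1.
    """
    n = len(cells)
    next_cells = []
    for k in range(n):
        left, centre, right = cells[k - 1], cells[k], cells[(k + 1) % n]
        rule = 4 * left + 2 * centre + right  # host side: choose the rule
        # Co-processor side: one biased coin flip per rule evaluation.
        next_cells.append(1 if rng.random() < p_one[rule] else 0)
    return next_cells
```

When every entry of `p_one` is 0 or 1 the automaton becomes deterministic; for instance `p_one = [0, 0, 0, 1, 0, 1, 1, 1]` yields a majority vote over each three-cell neighborhood.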
Hyper-Encryption (HE) is a provably secure encryption technique proposed by Ding and Rabin [6] in the bounded storage model. This scheme consists of generating an encryption pad based on a publicly available random string $\alpha$ and a shared secret key between the sender and the receiver. The secret key $S$ is a sequence of whole numbers $S = s_1, s_2, s_3 \ldots s_k$ such that each number $0 \le s_i < |\alpha|$. If $\alpha[j]$ is the $j$th bit of $\alpha$, the encryption pad is generated by $\alpha[s_1] \oplus \alpha[s_2] \oplus \cdots \oplus \alpha[s_k]$, where $\oplus$ denotes the pairwise exclusive OR (XOR) function. Message encryption is performed by a bit-wise XOR operation of the encryption pad with the message.
Partitioning and Optimization In the PSOC, the random string $\alpha$ is generated using PCMOS while the generation of the encryption pad as well as the encryption are performed by the host. Both in the context of PCA and Hyper-Encryption (as shown in Figure 4), the SA-1100 host is also replaced by custom hardware.
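The pad generation and encryption steps can be sketched in Python as follows. The text defines a single pad bit as the XOR of the key-selected bits of the random string; using one independent random string per message bit, as done in `make_pad`, is an assumption of this sketch and not a detail given here. All function names are hypothetical.

```python
from functools import reduce
from operator import xor


def pad_bit(random_string, key):
    """XOR of the bits of the random string selected by the secret key."""
    for s in key:
        if not 0 <= s < len(random_string):
            raise ValueError("key index outside the random string")
    return reduce(xor, (random_string[s] for s in key), 0)


def make_pad(random_strings, key):
    """One pad bit per random string (assumed layout, see the note above)."""
    return [pad_bit(r, key) for r in random_strings]


def encrypt(message_bits, pad_bits):
    """Bit-wise XOR of message and pad; applying it twice decrypts."""
    if len(message_bits) != len(pad_bits):
        raise ValueError("pad and message must have the same length")
    return [m ^ p for m, p in zip(message_bits, pad_bits)]
```

Because XOR is its own inverse, the receiver recovers the message by calling `encrypt` again on the ciphertext with the same pad.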
4. Metrics, results and analysis
In order to characterize and quantify beneﬁts derived
through PSOC architectures, we now deﬁne a variety of met-
rics. In the interests of staying within the stipulated page
limits, our development will be brief.
Application   Gain over SA-1100   Gain over CMOS
BN            9.99 × 10^7         2.71 × 10^6
RNN           1.25 × 10^6         2.32 × 10^4
PCA           4.17 × 10^4         7.7 × 10^2
HE            1.56 × 10^5         2.03 × 10^3
Table 1. The EPP gain of PCMOS over SA-1100 and over CMOS for the core probabilistic step
4.1. Metrics for quantifying the application level benefits
Energy performance product: EPP described earlier, is
deﬁned as the product of the energy consumed by the appli-
cation and its execution time. This metric will be used as the
primary ﬁgure of merit to evaluate alternate implementations,
including SOC and PSOC variants. Given the EPP of two alter-
nate realizations, they can be compared as follows.
Energy performance product gain: $\Gamma_I$ is the ratio of the EPP of the baseline, denoted by the symbol $\beta$, to the EPP of a particular implementation I (e.g., a PSOC or an SOC). This ratio is calculated as follows:
$$\Gamma_I = \frac{\text{Energy}_\beta \times \text{Time}_\beta}{\text{Energy}_I \times \text{Time}_I} \qquad (1)$$
For determining $\Gamma_I$, and unless otherwise stated, the baseline (and hence, the numerator of $\Gamma_I$) always corresponds to the case when the entire computation is performed on the host processor. The StrongARM SA-1100 serves as the baseline processor, and therefore, there is no co-processor. For example, in the context of the RNN application solving the vertex cover, the baseline is the StrongARM SA-1100 computing the deterministic as well as the probabilistic content, whereas I is the combination of the StrongARM SA-1100 as the host computing the deterministic component and the co-processor computing the probabilistic components of the application.
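Equation (1) translates directly into code. The Python sketch below is only a restatement of the definition; the function names and the numbers in the usage note are hypothetical.

```python
def epp(energy_joules, time_seconds):
    """Energy-performance product of one implementation."""
    if energy_joules <= 0 or time_seconds <= 0:
        raise ValueError("energy and time must be positive")
    return energy_joules * time_seconds


def epp_gain(baseline_energy, baseline_time, impl_energy, impl_time):
    """EPP of the baseline divided by the EPP of implementation I."""
    return epp(baseline_energy, baseline_time) / epp(impl_energy, impl_time)
```

For instance, a baseline that needs 10 J and 2 s compared with an implementation that needs 1 J and 0.5 s gives `epp_gain(10.0, 2.0, 1.0, 0.5) == 40.0`.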
Quality of probabilistic implementation: This attribute is
characterized empirically based on the statistical tests from the
NIST suite [21] and will be the subject of Section 5.
4.2. Gains of core probabilistic steps through PCMOS
The application level gains in energy and performance (when compared to the baseline case where there is no co-processor) are attributed to the efficiency of the co-processor while executing the core probabilistic operations. We summarize these gains of PCMOS over StrongARM SA-1100, and over custom CMOS implementation, for the core probabilistic step for each of the applications in Table 1. Each row of this table corresponds to one of the four distinct applications of interest to us and the gains achieved per core probabilistic step are shown there. As can be readily seen from Table 1, these gains are substantial in both contexts, ranging from a factor of $7.7 \times 10^2$ to a factor of $9.99 \times 10^7$. These per-operation gains would of course be valuable at the level of an entire application only if the application embodies significant opportunity characterized by its flux F.
4.3. Application level gains of PCMOS
As summarized in Table 2, gains at the scope of an entire application range from a factor of about 80 for the PCA application, to a factor of about 300 in the context of the RNN application. As mentioned earlier, the baseline implementation for HE, PCA and RNN applications is the StrongARM SA-1100 computing the deterministic as well as the probabilistic content and I is a PSOC executing an identical probabilistic algorithm. For the BN case, the baseline is the StrongARM SA-1100 computing the deterministic junction tree algorithm and I is a PSOC executing the likelihood weighting algorithm. A range of EPP gains is observed whenever multiple data points are available. For example, in the context of Bayesian inference, where different data points correspond to different networks, the flux varies from 0.25% to 0.75%. The corresponding gain increases from a factor of 12.5 to an impressive factor of 291, largely due to the increase in flux. Similar increases are observed for the other applications as well, caused by greater flux values in the application as shown in the table.
Algorithm   Flux F (as percentage of total operations)   Γ_I (min)   Γ_I (max)
BN          0.25%-0.75%                                   12.5        291
RNN         16.4%-19.7%                                   226.5       300
PCA         4.19%-5.29%                                   61          82
HE          12.5%                                         1.12        1.12
Table 2. Application level flux, maximum and minimum EPP gains of PCMOS over the baseline implementation where the implementation I has a StrongARM SA-1100 host and a PCMOS based co-processor
4.4. Impact of host efﬁciency
We will now consider and extend the gains to the entire application suite. We will delineate the (less obvious) impact of the efficiency of the host processor on the gain of an implementation $\Gamma_I$. Referring back to Table 2, the striking aspect of the gain is evident in the context of the HE application. From Table 1 we note that the EPP of each core probabilistic step in the context of the HE application is higher by a factor of over 150000 while using the SA-1100 host, when compared to a PCMOS based design. Furthermore, as seen in Table 2, the HE application has high flux, much higher than the corresponding values of the BN and PCA applications. Yet, the HE application demonstrates hardly any gain, since $\Gamma_I = 1.12$. We will devote the rest of this section to try and understand this potential “anomaly”.
Once again, in the interests of staying within the mandated space limits, we will restrict detailed discussion to the HE application. The reason for this “anomaly” is that the StrongARM host is extremely inefficient. Thus the relative savings of PCMOS, while significant, are rendered insignificant as an overall proportion of the entire application, when the StrongARM host is included. Thus gains through PCMOS (the limits being substantial as shown in Table 1) can be truly achieved only if the amount of effort spent in the co-processor is comparable in terms of EPP units to that spent in the host. To verify this hypothesis, a baseline SOC architecture in which the host processor and the co-processor are both custom ASIC architectures (Figure 4) is considered. With this notion, moving away from a StrongARM host processor to one realized from custom ASIC logic, the amount of energy and running time spent in the host is considerably lower. Thus, and perhaps counterintuitively, increasing the efficiency of the competing approach enhances the value of PCMOS gains at the application level. In the context of the HE application, and with this change to the baseline, the gain $\Gamma_I$ increases to 9.38, almost an order of magnitude. Similarly, when a baseline with a custom ASIC host is used, the $\Gamma_I$ value in the context of the probabilistic cellular automata application increases to 561. We view this fact as being extremely favorable for PSOC based designs. Thus, as host processors become more efficient with future technology generations, the gains of PSOC architectures over conventional SOC architectures increase.
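The argument in this section resembles Amdahl's law applied to the energy-performance product. The Python sketch below uses a deliberately simple model that is an assumption of this example, not the evaluation method of the study: energy and time are each split into a host part and a core probabilistic part, and the co-processor divides only the core part by fixed factors. The function name and all numbers are hypothetical.

```python
def application_gain(host_energy, host_time, core_energy, core_time,
                     core_energy_gain, core_time_gain):
    """EPP gain of the whole application when only the core probabilistic
    part is accelerated by the given energy and time factors."""
    baseline_epp = (host_energy + core_energy) * (host_time + core_time)
    accelerated_epp = ((host_energy + core_energy / core_energy_gain)
                       * (host_time + core_time / core_time_gain))
    return baseline_epp / accelerated_epp


# Same core step and same co-processor, two hosts of different efficiency.
inefficient_host = application_gain(100.0, 100.0, 10.0, 10.0, 1000.0, 1000.0)
efficient_host = application_gain(1.0, 1.0, 10.0, 10.0, 1000.0, 1000.0)
```

In this toy setting the gain is about 1.2 when the host dominates the cost and about 119 when the host is cheap, which mirrors the qualitative trend reported above for the custom ASIC host.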
[Figure 4 diagram. Labels recovered from the drawing: eight switches; random string generated by specialized PCMOS switches; secret key; multiplexer; other multiplexers; XOR; encryption pad.]
Figure 4. The Custom ASIC host and its PCMOS co-processor constituting a PSOC implementation, for Hyper-Encryption
5. The value of PCMOS to quality of randomness
While the EPP gains for applications have been our significant concern to demonstrate the utility of PCMOS, the quality of the implementation of a probabilistic algorithm is a characteristic of interest as well. Random bits of low quality affect application behavior, from the correctness of Monte Carlo simulations [8] to the strength of encryption achieved by schemes such as Hyper-Encryption [6]. We employed statistical tests from the NIST suite [21] to assess the quality of randomness in a preliminary way. The random sequences in the case of PCMOS have been produced from physical measurements of a probabilistic inverter fabricated using the 0.25 μm TSMC process, whereas the pseudo-random bits derived using the Park-Miller [19] algorithm were evaluated using the output of a custom design simulated using HSpice. In both cases, $p = 0.5$.
The results of these comparisons are shown in Figure 5. Among these tests, and to highlight a few, the runs test is used to determine a contiguous sequence of bits with a value 1 in a block. The rank test is used to check the linear dependence, while the FFT and approximate entropy tests detect periodicity and frequency of overlapping patterns. In evaluating the test results, we employed the testing strategy and criteria as recommended by NIST. Specifically, the test results shown in parentheses in Figure 5 are compared against a threshold (the recommended value being 0.93) used to determine whether the sequence passes (or fails) a test. The tests are performed on random bit sequences of length 20,000,000. The result indicates the proportion of subsequences (tested through iterations) that pass, from the random sequence being tested. As seen from the figure, the quality of random sequences generated by PCMOS is higher than that of those generated by CMOS, since more tests in the former case yield a pass result compared to the latter: eleven tests with a pass score in the context of PCMOS versus seven in the CMOS context, out of a total of fourteen tests.
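As an illustration of the proportion-of-passing-subsequences criterion, the Python sketch below implements only the frequency (monobit) test as defined in the NIST suite and computes the fraction of fixed-length blocks that pass. The block length, the significance level of 0.01 and the function names are assumptions of this sketch; the study itself ran the full suite [21].

```python
import math
import random


def monobit_p_value(bits):
    """NIST frequency (monobit) test: p-value for the balance of 0s and 1s."""
    n = len(bits)
    s = sum(1 if b else -1 for b in bits)  # +1 for each 1, -1 for each 0
    return math.erfc(abs(s) / math.sqrt(2.0 * n))


def pass_proportion(sequence, block_len, significance=0.01):
    """Fraction of non-overlapping blocks whose p-value reaches the
    significance level; this is the number compared against 0.93."""
    blocks = [sequence[i:i + block_len]
              for i in range(0, len(sequence) - block_len + 1, block_len)]
    passed = sum(1 for b in blocks if monobit_p_value(b) >= significance)
    return passed / len(blocks)
```

A heavily biased source, such as a sequence of all ones, yields a proportion of 0 and fails, whereas a well-balanced source is expected to score close to 0.99 at this significance level.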
Test                        CMOS            PCMOS
Frequency                   FAIL (0.84)     PASS (0.98)
Serial                      PASS (1.00)     PASS (1.00)
Universal Statistical       FAIL (0.8889)   FAIL (0.725)
Linear complexity           PASS (1.00)     PASS (1.00)
Lempel-Ziv                  FAIL (0.0625)   FAIL (0.8125)
Overlapping template        FAIL (0.00)     FAIL (0.8889)
Non-overlapping template    PASS (0.9375)   PASS (0.93)
Rank                        FAIL (0.00)     PASS (1.00)
Long-run                    PASS (1.00)     PASS (1.00)
Approximate entropy         FAIL (0.92)     PASS (0.98)
FFT                         PASS (1.00)     PASS (1.00)
Runs                        PASS (0.96)     PASS (0.98)
Cumulative sum              FAIL (0.86)     PASS (0.98)
Block-frequency             PASS (0.98)     PASS (1.00)
(result > 0.93): PASS; (result < 0.93): FAIL
Figure 5. Comparison of quality of randomization for PRNG and PCMOS.
6. Conclusion and remarks
We have demonstrated the value of the novel PCMOS technology within the context of realizing ultra efficient PSOC architectures, over a range of applications ubiquitous to embedded computing and beyond. The improvements that we were
able to demonstrate were orders of magnitude over
application-specific CMOS designs. Next, we wish to explore a
larger suite of applications and associated PSOC architectures,
particularly from the signal processing (DSP) domain, wherein
the probability of correctness p at the device level manifests
itself naturally as the signal-to-noise ratio at the level of a
computational kernel such as a filter. Another interesting and
valuable direction to pursue in the future involves a thorough
exploration of the quality of randomness. We note in passing
that PCMOS is able to produce random bits, as opposed to the
pseudo-random bits that conventional random number generators
produce. Measuring the quality of randomness remains a
challenge if approaches other than empirical ones (such as
those advocated by NIST) are sought, since the complexity of
provably correct tests for randomness can be undecidable [3].
This question of generating pseudo-random bits has deep roots
in computer science, with connections to important questions
about the inherent difficulty of computations (see Blum and
Micali [1]). In a sense, the A2T co-design methodology based on
PCMOS, and thus on a source of high-quality random bits, can be
viewed as a technological response to the significant challenge
embodied in Yao’s comment from 1982: “If, in an application, it
is possible to isolate some simple randomness properties that
can guarantee success, then a statistical test based on the
desired randomness properties can be used to screen and select
an appropriate generator. This, however, is seldom the case.
Furthermore, the performance of a pseudo-random number
generator under a particular statistical test is usually hard
to determine analytically, and often has to rely on empirical
evidence.” [25]. Finally, an independent and equally
interesting direction involves investigating the applicability
of the ideas, methods and constructs presented here to the
overarching question of realizing reliable computing from
unreliable elements; such “probabilistic designs” are
considered central to sustaining Moore’s law in the nanometer
regime of CMOS based architectures [2].
References
[1] M. Blum and S. Micali. How to generate cryptographically
strong sequences of pseudo-random bits. SIAM J. Comput.,
13(4):850–864, 1984.
[2] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi,
and V. De. Parameter variations and impact on circuits and
microarchitecture. 40th Design Automation Conference, pages
338–342, 2003.
[3] G. Chaitin. Algorithmic information theory. IBM Journal of
Research and Development, pages 350–359, 1977.
[4] L. N. Chakrapani, J. Gyllenhaal, W.-m. W. Hwu, S. A. Mahlke,
K. V. Palem, and R. M. Rabbah. Trimaran: An Infrastructure
for Research in Instruction-Level Parallelism, volume 3602.
Springer-Verlag Berlin Heidelberg, Aug. 2005.
[5] S. Cheemalavagu, P. Korkmaz, K. V. Palem, B. E. S. Akgul, and
L. N. Chakrapani. A probabilistic CMOS switch and its realiza-
tion by exploiting noise. Proceedings of the IFIP international
conference on very large scale integration, 2005.
[6] Y. Z. Ding and M. O. Rabin. Hyper-Encryption and everlast-
ing security. Lecture Notes In Computer Science; Proceedings
of the 19th Annual Symposium on Theoretical Aspects of Com-
puter Science, 2285:1–26, 2002.
[7] W. Feller. An Introduction to Probability Theory and its Appli-
cations. Wiley Eastern Limited, 1984.
[8] A. M. Ferrenberg, D. P. Landau, and Y. J. Wong. Monte Carlo
simulations: Hidden errors from “good” random number
generators. Phys. Rev. Lett., 69:3382–3384, 1992.
[9] H. Fuks. Non-deterministic density classification with diffusive
probabilistic cellular automata. Physical Review E, Statistical,
Nonlinear, and Soft Matter Physics, 66, 2002.
[10] E. Gelenbe and F. Batty. Minimum graph covering with the ran-
dom neural network model. In Neural Networks: Advances and
Applications, volume 2, 1992.
[11] http://www.trimaran.org. Trimaran: An infrastructure for re-
search in instruction-level parallelism.
[12] L. B. Kish. End of Moore’s law: thermal (noise) death of in-
tegration in micro and nano electronics. Physics Letters A,
305:144–149, 2002.
[13] D. MacKay. Bayesian interpolation. Neural Computation, 4(3),
1992.
[14] K. Natori and N. Sano. Scaling limit of digital circuits due
to thermal noise. Journal of Applied Physics, 83:5019–5024,
1998.
[15] K. V. Palem. Energy aware algorithm design via probabilistic
computing: from algorithms and models to Moore’s law and
novel (semiconductor) devices. In Proc. Intl. Conf. on
Compilers, Architecture and Synthesis for Embedded Systems, pages
113–117, San Jose, California, 2003.
[16] K. V. Palem. Proof as experiment: Probabilistic algorithms from
a thermodynamic perspective. In Proc. Intl. Symposium on
Verification (Theory and Practice), Taormina, Sicily, June 2003.
[17] K. V. Palem. Energy aware computing through probabilistic
switching: A study of limits. IEEE Transactions on Comput-
ers, 54(9):1123–1137, 2005.
[18] K. V. Palem, L. N. Chakrapani, B. E. S. Akgul, and P. Korkmaz.
Realizing ultra-low energy application specific SoC
architectures through novel probabilistic CMOS (PCMOS)
technology. In Proceedings of The International Conference on Solid
State Devices and Materials (SSDM), Sept. 2005.
[19] S. Park and K. W. Miller. Random number generators: good
ones are hard to find. Communications of the ACM, 31, 1988.
[20] A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD
thesis, Stanford University, 2000.
[21] Random Number Generation and Testing.
http://csrc.nist.gov/rng/.
[22] N. Sano. Increasing importance of electronic thermal noise in
sub-0.1 µm Si-MOSFETs. The IEICE Transactions on
Electronics, E83-C:1203–1211, 2000.
[23] K. L. Shepard and V. Narayanan. Conquering noise in deep-
submicron digital ICs. IEEE Design and Test of Computers,
15:51–62, 1998.
[24] A. Sinha and A. P. Chandrakasan. JouleTrack: a web based tool
for software energy profiling. Proceedings of the 38th
conference on Design automation, pages 220–225, 2001.
[25] A. Yao. Theory and application of trapdoor functions. Proceed-
ings of the 23rd symposium on the foundations of computer sci-
ence, pages 80–91, 1982.