Synthesis of application specific processor architectures for ultra-low energy consumption by Kazmierski, T J & Leech, Charles
Proceedings of the 5th Small Systems Simulation Symposium 2014, Niš, Serbia, 12-14 February 2014
Synthesis of application specific processor
architectures for ultra-low energy consumption
Tom J. Kazmierski and Charles Leech
Electronics and Computer Science, Faculty of Physical Sciences and Engineering
University of Southampton, SO17 1BJ, UK
{tjk,cl19g10}@ecs.soton.ac.uk
Abstract—In this paper we suggest that further en-
ergy savings can be achieved by a new approach to syn-
thesis of embedded processor cores, where the architec-
ture is tailored to the algorithms that the core executes.
In the context of embedded processor synthesis, both
single-core and many-core, the types of algorithms and
demands on the execution eﬃciency are usually known
at the chip design time. This knowledge can be utilised
at the design stage to synthesise architectures opti-
mised for energy consumption. Firstly, we present an
overview of both traditional energy saving techniques
and new developments in architectural approaches to
energy-eﬃcient processing. Secondly, we propose a pi-
coMIPS architecture that serves as an architectural
template for energy-eﬃcient synthesis. As a case study,
we show how the picoMIPS architecture can be tailored
to an energy eﬃcient execution of the DCT algorithm.
I. Introduction
Much research has been recently devoted to the de-
velopment of energy eﬃcient technologies in single-core
and many-core processor systems leading to further sav-
ings in power consumption. Both traditional power sav-
ing techniques as well as novel architectures, including
heterogeneous many-core architectures and reconﬁgurable
architectures have been developed. The new research has
been stimulated largely by the fact that the introduction
of multi-core structures to processor architectures caused
a signiﬁcant increase in the power consumption of these
systems. In addition, the gap between the average power
and peak power has widened as the level of core integration
increases [1].
Many energy eﬃciency and power saving technologies
are already integrated into processor architectures in order
to reduce power dissipation and extend battery life, espe-
cially in mobile devices. A combination of technologies is
most commonly implemented to achieve the best energy
eﬃciency whilst still allowing the system to meet perfor-
mance targets [2]. Techniques to increase energy eﬃciency
can be applied at many development levels from architec-
ture co-design and code compilation to task scheduling,
run-time management and application design [3].
Traditional techniques include Dynamic Voltage and
Frequency Scaling (DVFS), clock gating, clock distribution
and power domains. DVFS controls the power consumption
of a processor through fine adjustment of the clock
frequency and supply voltage levels [1][2][3][4]. High
levels are used when meeting performance targets is a
priority, and low levels (known as CPU throttling) are
used when energy efficiency is most important or high
performance is not required. When the supply voltage is
lowered and the frequency reduced, the processor executes
instructions more slowly but more energy efficiently,
since the pipeline stages are allowed longer delays.
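As a rough illustration of why a lower operating point
saves energy, the sketch below uses the common first-order
CMOS approximation for dynamic power, P = C V^2 f, so that
the energy of a fixed workload scales with V^2. The
capacitance, operating points and cycle count are
illustrative assumptions rather than values from any of
the cited works, and static (leakage) power is ignored.

```python
def dynamic_power(c_eff, v_dd, freq_hz):
    """First-order dynamic power estimate: P = C * V^2 * f (watts)."""
    return c_eff * v_dd ** 2 * freq_hz


def task_energy(c_eff, v_dd, freq_hz, cycles):
    """Energy (joules) to run `cycles` clock cycles at one operating point."""
    exec_time = cycles / freq_hz
    return dynamic_power(c_eff, v_dd, freq_hz) * exec_time


# Hypothetical operating points: (supply voltage in V, clock frequency in Hz).
HIGH = (1.2, 1.0e9)
LOW = (0.9, 0.5e9)
C_EFF = 1.0e-9      # assumed effective switched capacitance, farads
CYCLES = 2_000_000  # assumed workload length in clock cycles

e_high = task_energy(C_EFF, *HIGH, CYCLES)
e_low = task_energy(C_EFF, *LOW, CYCLES)
t_high = CYCLES / HIGH[1]
t_low = CYCLES / LOW[1]

print(f"high: {e_high * 1e3:.3f} mJ in {t_high * 1e3:.1f} ms")
print(f"low:  {e_low * 1e3:.3f} mJ in {t_low * 1e3:.1f} ms")
print(f"energy ratio low/high: {e_low / e_high:.4f}")
```

Under this model the low operating point takes twice as
long but needs only (0.9/1.2)^2 of the energy, which is
the trade-off that DVFS exploits.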
Further savings are achieved by the use of power do-
mains, where regions of a system or a processor that are
controlled from a single supply can be completely powered
down in order to minimise power consumption without
entirely removing the power supply to the system. Power
domains can be used dynamically and in conjunction with
clock gating. The ARM Cortex-A15 MPCore processor
supports multiple power domains both for the core and for
the surrounding logic [6]. Figure 1 shows these domains,
labelled Processor and Non-Processor, that allow large
parts of the processor to be deactivated. Smaller internal
domains, such as CK_GCLKCR, are implemented to
allow smaller sections to be deactivated for ﬁner perfor-
mance and power variations.
Modelling and simulation of many-core processors is also
an important area, as it allows a better understanding of
the complex interactions that occur inside a system and
cause power and energy consumption [9], [10], [11], [12], [13].
For example, the model created by Basmadjian et al. [10]
is tailored for many-core architectures in that it accounts
for resource sharing and power saving mechanisms.
In this paper we suggest that further energy savings
can be achieved by a new approach to synthesis of em-
bedded processor cores, where the architecture is tailored
to the algorithms that the core executes. In the context
of embedded processor synthesis, both single-core and
many-core, the types of algorithms and demands on the
execution eﬃciency are usually known at the chip design
time. This knowledge can be utilised at the design stage.
As a case study, we propose in section III a picoMIPS
architecture that can be tailored to an energy eﬃcient
execution of the DCT algorithm.
Fig. 1: The ARM Cortex-A15 features multiple power domains for the core and surrounding logic, reprinted from [6].
II. Recent developments in energy efficient
architectures
A. Pipeline Balancing
Pipeline balancing (PLB) is now an established tech-
nique used to dynamically adjust the resources of the
pipeline of a processor such that it retains performance
while reducing power consumption [14]. Power balanced
pipelines is a concept in which the power disparity of
pipeline stages is reduced by assigning diﬀerent delays
to each microarchitectural pipestage while guaranteeing a
certain level of performance/throughput ratio [15]. Static
power balancing is performed during design time to iden-
tify power heavy circuitry in pipestages for which con-
sumption remains fairly constant for diﬀerent programs
and reallocate cycle time accordingly. Dynamic power
balancing is implemented on top of this to manage power
ﬂuctuations within each workload and further reduce the
total power cost. Power savings are also greater at lower
frequencies. The delay constraints on microarchitectural
pipeline stages can be modiﬁed in order to make them
more power eﬃcient, in a similar way to DVFS, when
the performance demand of the application is relaxed [15].
PLB can also operate in response to instruction per cycle
(IPC) variations within a program [14]. Here the PLB
mechanism dynamically reduces the issue width of the
pipeline to save power or increases it to boost throughput.
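The IPC-driven behaviour described above can be sketched
as a simple controller that chooses the issue width for
the next sampling window. This is only an illustration:
the thresholds, the two widths and the IPC trace are
hypothetical and are not taken from [14].

```python
def next_issue_width(width, ipc, low_ipc=1.5, high_ipc=3.0, narrow=4, wide=8):
    """Pick the issue width for the next sampling window from the measured IPC."""
    if width == wide and ipc < low_ipc:
        return narrow   # little parallelism available: disable part of the issue logic
    if width == narrow and ipc > high_ipc:
        return wide     # the narrow pipeline is close to saturation: restore full width
    return width        # otherwise keep the current mode (simple hysteresis)


# Hypothetical IPC measured in consecutive sampling windows.
ipc_trace = [3.4, 2.8, 1.2, 0.9, 1.8, 3.2, 3.6]

width = 8
schedule = []
for ipc in ipc_trace:
    width = next_issue_width(width, ipc)
    schedule.append(width)

print(schedule)
```

The two separate thresholds prevent the pipeline from
oscillating between modes when the IPC hovers around a
single switching point.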
B. Caches and Interconnects
It is not only the design of the processor’s internal
circuitry that is important in maintaining energy eﬃ-
ciency. Careful co-design of the interconnect, caches and
the processor cores is required to achieve high performance
and energy efficiency [16]. The high level of integration
that is inherent in multiple-processor systems can be
utilised to reduce the interconnect power consumption by
improving cache coherence protocols [17]. In that work, an
average of 16.3% of L2 cache accesses could be optimised
and, as every access consumes time and power, an average
power reduction of 9.3% was recorded alongside a 1.4%
increase in system performance [17]. Recently, a new
methodology has been proposed [10] for estimating the
power consumption of multi-core processors. It takes into
account resource sharing and power saving mechanisms on
top of the power consumption of each core.
C. Energy Eﬃciency techniques in Heterogeneous Multi-
core Architectures
A heterogeneous or asymmetric multi-core architecture
is composed of cores of varying size and complexity which
are designed to complement each other in terms of per-
formance and energy eﬃciency [8]. A typical system will
implement a small core to process simple tasks, in an en-
ergy eﬃcient way, while a larger core provides higher per-
formance processing for when computationally demanding
tasks are presented. The cores represent diﬀerent points
in the power/performance design space and signiﬁcant
energy eﬃciency beneﬁts can be achieved by dynamically
allocating application execution to the most appropriate
core [18]. A task matching or switching system is also
implemented to intelligently assign tasks to cores; bal-
ancing a performance demand against maintaining system
energy eﬃciency. These systems are particularly good at
saving power whilst handling a diverse workload where
ﬂuctuations of high and low computational demand are
common [19].
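The task matching idea can be sketched as a selection of
the lowest-energy core that still meets a deadline. The
core parameters, the task sizes and the fallback policy
below are hypothetical and serve only to make the
trade-off concrete; they do not describe any specific
scheduler from the cited works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    name: str
    ops_per_second: float   # assumed sustained throughput
    joules_per_op: float    # assumed average energy per operation


def pick_core(cores, task_ops, deadline_s):
    """Return the lowest-energy core that finishes the task before its deadline."""
    feasible = [c for c in cores if task_ops / c.ops_per_second <= deadline_s]
    if not feasible:
        # No core meets the deadline: fall back to the fastest one.
        return max(cores, key=lambda c: c.ops_per_second)
    return min(feasible, key=lambda c: c.joules_per_op * task_ops)


cores = [
    Core("little", ops_per_second=1.0e9, joules_per_op=1.0e-10),
    Core("big", ops_per_second=3.0e9, joules_per_op=4.0e-10),
]

light = pick_core(cores, task_ops=5.0e8, deadline_s=1.0)
heavy = pick_core(cores, task_ops=2.0e9, deadline_s=1.0)
print(light.name, heavy.name)
```

With these numbers the light task runs on the small core,
while the heavy task can only meet its deadline on the
large core.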
A heterogeneous architecture can be created in many
different ways and many alternatives have been developed
due to the heavy research interest in this area.
Modifications to general purpose processors, such as
asymmetric core sizes [13], custom accelerators [20],
varied cache sizes [21] and heterogeneity within each core
[22][7], have all been demonstrated to introduce
heterogeneous features into a system.
One of the most prominent and successful heterogeneous
architectures to date is the ARM big.LITTLE system.
This is a production example of a heterogeneous mul-
tiprocessor system consisting of a compact and energy
eﬃcient “LITTLE” Cortex-A7 processor coupled with a
higher performance “big” Cortex-A15 processor [19]. The
system is designed with the dynamic usage patterns of
modern smart phones in mind where there are typically
periods of high intensity processing followed by longer
periods of low intensity processing [23]. Low intensity
tasks, such as texting and audio, can be handled by the
A7 processor enabling a mobile device to save battery life.
When a period of high intensity occurs, the A15 processor
can be activated to increase the system’s throughput and
meet tighter performance deadlines. A power saving of up
to 70% is advertised for a light workload, where the A7
processor can handle all of the tasks, and a 50% saving for
medium workloads where some tasks will require allocation
to the A15 processor.
Kumar et al present an alternative implementation
where two architectures from the Alpha family, the EV5
and EV6, are combined to be more energy and area eﬃ-
cient than a homogeneous equivalent [8][18]. They demon-
strate that a much higher throughput can be achieved due
to the ability of a heterogeneous multi-core architecture to
better exploit changes in thread-level parallelism as well as
inter- and intra-thread diversity [8]. In [18], they
evaluate the power efficiency of the system, reporting a
39% average energy reduction for only a 3% performance
drop.
Composite Cores is a microarchitectural design that
reduces the migration overhead of task switching by
bringing heterogeneity inside each individual core [22].
The design contains two separate backend modules, called
Engines, one of which features a deeper and more complex
out-of-order pipeline, tailored for higher performance,
while the other features a smaller, compact in-order
pipeline designed with energy efficiency in mind
(Figure 2). Due to the high level of hardware resource
sharing and the small Engine state, the migration overhead
is brought down from the order of 20,000 instructions to
2,000 instructions. This greatly reduces the energy
expenditure associated with migration and also allows more
of the task to be run in an efficient mode. The authors'
results show that the system can achieve an energy saving
of 18% using dynamic task migration whilst only suffering
a 5% performance loss.
Fig. 2: The microarchitecture for Composite Cores,
featuring two Engines, reprinted from [22].
Using both a heterogeneous architecture and hardware
reconﬁguration, a technique called Dynamic Core Mor-
phing (DCM) is developed by Rodrigues et al to allow
the shared hardware of a few tightly coupled cores to
be morphed at run-time [7]. The cores all feature a
baseline conﬁguration but reconﬁguration can trigger the
re-assignment of high performance functional units to
diﬀerent cores to speed up execution. The eﬃciency of
the system can lead to performance per watt gains of
up to 43% and an average saving of 16% compared to
a homogeneous static architecture.
The energy eﬃciency beneﬁts of heterogeneity can only
be exploited with the correct assignment of tasks or
applications to each core [9] [24][25][26][12]. Tasks must
be assigned in order to maximise energy eﬃciency whilst
ensuring performance deadlines are met. Awan et al per-
form scheduling in two phases to improve energy eﬃciency;
task allocation to minimise active energy consumption and
exchange of higher energy states to lower, more energy ef-
ﬁcient sleep states [9]. Alternatively, Calcado et al propose
division of tasks into m-threads to introduce ﬁne-grain
parallelism below thread level [27]. Moreover, Saha et al
include power and temperature models in an adaptive task
partitioning mechanism in order to allocate tasks
according to their actual utilisations rather than on the
basis of a worst case execution time [12]. Simulation
results confirm that the mechanism reduces energy
consumption by 55% and task migrations by 60% compared
with alternative task partitioning schemes.
Task assignment can also be performed in response to
program phases, which naturally occur during execution
when the resource demands of the application change.
Phase detection is used by Jooya and Analoui to
dynamically re-assign programs for each phase to improve
the performance and power dissipation of heterogeneous
multi-core processors [25]. Programs are profiled in
dynamic time intervals in order to detect phase changes.
Sawalha et al also propose an online scheduling technique
that dynamically adjusts the program-to-core assignment
as application behaviour changes between phases, with the
aim of maximising energy efficiency [26]. Simulated
evaluation of the scheduler shows that energy savings of
16% on average and reductions in energy-delay product of
up to 29% can be achieved compared to static assignments.
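A generic way to picture interval-based phase detection is
to compare a profiled metric between consecutive
intervals and flag a phase change when it moves by more
than a threshold. The sketch below does this for IPC. It
is not the detector of [25] or [26]; the threshold and
the trace are assumptions chosen for illustration.

```python
def detect_phase_changes(ipc_per_interval, threshold=0.25):
    """Return the indices of intervals whose IPC differs from the previous
    interval by more than `threshold`, measured as a relative change."""
    changes = []
    for i in range(1, len(ipc_per_interval)):
        prev, cur = ipc_per_interval[i - 1], ipc_per_interval[i]
        if prev > 0 and abs(cur - prev) / prev > threshold:
            changes.append(i)
    return changes


# Hypothetical IPC profile of one program, one value per profiling interval.
trace = [2.0, 2.1, 1.9, 0.8, 0.9, 0.85, 2.2, 2.0]
changes = detect_phase_changes(trace)
print(changes)
```

A scheduler could re-evaluate the program-to-core
assignment only at the flagged intervals, which keeps the
number of migrations low.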
D. Energy Eﬃciency techniques in Reconﬁgurable Multi-
core Architectures
Reconﬁgurability is another property that has the po-
tential to increase the energy and area eﬃciency of pro-
cessors and systems on chip by introducing adaptability
and hardware ﬂexibility into the architecture. Building
on the innovations that heterogeneous architectures bring,
reconﬁgurable architectures aim to achieve both energy
eﬃciency and high performance but within the same
processor and therefore meet the requirements of many
embedded systems. The ﬂexible heterogeneous Multi-Core
processor (FMC) is an example of the fusion of these two
architectures that can deliver both a high throughput for
uniform parallel applications and high performance for
ﬂuctuating general purpose workloads [28]. Reconﬁgurable
architectures are dynamic, adjusting their complexity,
speed and performance level in response to the currently
executing application. With this property in mind, we dis-
regard systems that are statically reconﬁgurable but ﬁxed
while operating, such as traditional FPGAs, considering
only architectures that are run-time reconﬁgurable.
E. Dynamic Partial Reconﬁguration
FPGA manufacturers such as Xilinx and Altera now offer a
mechanism called Dynamic Partial Reconfiguration (DPR)
[29] or Dynamic Partial Self-Reconfiguration (DPSR) [30]
to enable reconfiguration during run-time of the circuits
within an FPGA, allowing a region of the design to change
dynamically while other areas remain active [31]. The
FPGA’s architecture is partitioned into a static region
consisting of ﬁxed logic, control circuits and an embedded
processor that control and monitor the system. The rest of
the design space is allocated to a dynamic/reconﬁgurable
region containing a reconﬁgurable logic fabric that can be
formed into any circuit whenever hardware acceleration is
required.
DPR/DPSR presents energy efficiency opportunities over
fixed architectures. DPR enables the system to react
dynamically to changes in the structure or the performance
and power constraints of the application, allowing it to
address inefficiencies in the allocation of resources and
more accurately implement changing software routines as
dynamic hardware accelerators [29]. These circuits can
then be easily removed or gated when they are no longer
required to reduce power consumption [32]. DPR can also
increase the performance of an FPGA based system because
it permits the continued operation of portions of the
dynamic region unaffected by reconfiguration tasks.
Therefore, it allows multiple applications to be run in
parallel on a single FPGA [30]. This property also
improves the hardware efficiency of the system as, where
separate devices were previously required, different tasks
can now be implemented on a single FPGA, reducing power
consumption and board dimensions. In addition, DPR reduces
reconfiguration times because only small modifications
are made to the bitstream over time and the entire design
does not need to be reloaded for each change.
A study into the power consumption patterns of DPSR
programming was conducted by Bonamy et al [11] to
investigate to what degree the sharing of silicon area
between multiple accelerators will help to reduce power
consumption. However, many parameters must be considered
to assess whether the performance improvement outweighs
factors such as reconfiguration overhead, accelerator
area and idle power consumption, and as such any gain can
be difficult to evaluate. Their results show complex
variations in power usage at different stages during
reconfiguration that depend on factors such as the
previous configuration and the contents of the configured
circuit. In response to these experiments, three power
models are proposed to help analyse the trade-off between
implementing tasks as dynamically reconfigurable, in
static configuration or in full software execution.
Despite clear benefits, several disadvantages become
apparent with this form of reconfigurable technology. As
was shown above, the power consumption overhead associated
with programming new circuits can effectively impose a
minimum size or usage time on circuits for their
implementation to be justified. In addition, a baseline
power and area cost is always incurred due to the large
static region, which continuously consumes power and can
contain unnecessary hardware. Finally, the FPGA
interconnect reduces the speed and increases the power
consumption of the circuit compared to an ASIC
implementation because of the increased gate count
required to give the system flexibility.
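The minimum usage implied by the reconfiguration overhead
can be illustrated with a simple break-even calculation:
loading an accelerator pays off only once the energy it
saves per invocation has repaid the one-off
reconfiguration energy. This is a deliberately simplified
sketch, not one of the power models of [11]; idle power
is ignored and all energy figures are hypothetical.

```python
import math


def break_even_invocations(e_reconfig, e_sw, e_hw):
    """Smallest number of invocations n for which
    e_reconfig + n * e_hw < n * e_sw, or None if hardware never wins."""
    saving = e_sw - e_hw
    if saving <= 0:
        return None
    return math.floor(e_reconfig / saving) + 1


# Hypothetical energies in millijoules.
E_RECONFIG = 5.0   # one-off cost of loading the partial bitstream
E_SW = 0.20        # energy per invocation when run in software
E_HW = 0.05        # energy per invocation on the accelerator

n = break_even_invocations(E_RECONFIG, E_SW, E_HW)
print(n)
```

If the task is expected to run fewer times than this
before the region is reconfigured again, software
execution is the cheaper choice under this model.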
F. Composable and Partitionable Architectures
Partitioning and composition are techniques employed
by some dynamically reconﬁgurable systems to provide
adaptive parallel granularity [33]. Composition involves
synthesising a larger logical processor from smaller pro-
cessing elements when higher performance computation
or greater instruction or thread level parallelism (ILP or
TLP) is required. Partitioning on the other hand will
divide up a large design in the most appropriate way and
assign shared hardware resources to individual cores to
meet the needs of an application.
Composable Lightweight Processors (CLP) is an exam-
ple of a ﬂexible architectural approach to designing a
Chip Multiprocessor (CMP) where low-power processor
cores can be aggregated together dynamically to form
larger single-threaded processors [33]. The system has an
advantage over other reconfigurable techniques in that
there are no monolithic structures spanning the cores,
which instead communicate using a microarchitectural
protocol.
In tests against a ﬁxed-granularity processor, the CLP has
been shown to provide a 42% performance improvement
whilst being on average 3.4 times as area eﬃcient and 2
times as power eﬃcient.
Core Fusion is a similar technique to CLP in that it
allows multiple processors to be dynamically allocated to a
single instruction window and operated as if there were one
larger processor [34]. The main diﬀerence from CLP is that
Core Fusion operates on conventional RISC or CISC ISAs
giving it an advantage over CLP in terms of compatibility.
However, this also requires that the standard structures in
these ISAs are present and so can limit the scalability of
the architecture.
G. Coarse Grained Reconﬁgurable Array Architectures
Coarse-Grained Reconfigurable Array (CGRA) architectures
represent an important class of programmable systems that
act as an intermediate point between fixed general purpose
processors and fine-grain reconfigurable FPGAs. They are
designed to be reconfigurable at a module or block level
rather than at the gate level in order to trade off
flexibility for reduced reconfiguration time [35].
One example of a CGRA designed with energy eﬃciency
as the priority is the Ultra Low Power Samsung Reconﬁg-
urable Processor (ULP-SRP) presented by Changmoo et
al [36]. Intended for biomedical applications as a mobile
healthcare solution, the ULP-SRP is a variation of the
ADRES processor [37] and uses 3 run-time switch-able
power modes and automatic power gating to optimise the
energy consumption of the device. Experimental results
when running a low power monitoring application show a
46.1% energy consumption reduction compared to previ-
ous works.
III. Case Study - picoMIPS
The picoMIPS architecture proposed here is a RISC
microprocessor with a minimised instruction set
architecture (ISA). Each implementation contains only the
necessary datapath elements, with area efficiency as the
priority. For example, the instruction decoder will only
recognise instructions that the user specifies and the
ALU will only perform the required logic or arithmetic
functions. Due to the correlation between logic gate
count and power consumption, energy efficiency is also
maximised in the processor. The system is therefore
designed to perform a specific task in the most efficient
processor-based form.
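The tailoring step can be pictured with a small software
model: starting from a library of candidate ALU
operations, only those that the target program uses are
kept, and the resulting core rejects everything else. The
mnemonics, the instruction format and the function names
below are hypothetical and do not describe the actual
picoMIPS ISA or its synthesis flow.

```python
# Hypothetical operation library; the mnemonics are illustrative only.
ALU_LIBRARY = {
    "ADD": lambda a, b: a + b,
    "SUB": lambda a, b: a - b,
    "MUL": lambda a, b: a * b,
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}


def synthesise_core(program):
    """Keep only the ALU operations that the target program actually uses."""
    used = {op for op, *_ in program}
    unknown = used - set(ALU_LIBRARY)
    if unknown:
        raise ValueError(f"no implementation for: {sorted(unknown)}")
    return {op: ALU_LIBRARY[op] for op in sorted(used)}


def run(core, program, registers):
    """Execute (op, dest, src1, src2) instructions on the tailored core."""
    regs = list(registers)
    for op, rd, rs, rt in program:
        if op not in core:
            raise ValueError(f"instruction {op} was not synthesised into this core")
        regs[rd] = core[op](regs[rs], regs[rt])
    return regs


# A multiply-accumulate kernel of the kind a DCT inner loop needs:
# r0 = r1 * r2 + r3 * r4, using r5 as a temporary.
mac_program = [
    ("MUL", 5, 1, 2),
    ("ADD", 0, 0, 5),
    ("MUL", 5, 3, 4),
    ("ADD", 0, 0, 5),
]

core = synthesise_core(mac_program)
result = run(core, mac_program, [0, 2, 3, 4, 5, 0])
print(sorted(core), result[0])
```

Only two of the six library operations survive for this
kernel, which mirrors the way unused decoder and ALU
logic is left out of a tailored implementation.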
By synthesising the picoMIPS as a microprocessor, a
baseline conﬁguration is established upon which function-
ality can be added or removed, in the form of instruc-
tions or functions, while incurring only minimal changes
to the area consumption of the design. If the task was
implemented as a speciﬁc dedicated hardware circuit, any
changes to the functionality could have a large inﬂuence
on the area consumption of the design. Figure 3 shows an
example conﬁguration for the picoMIPS which can accom-
modate the majority of the simple RISC instructions. It is
a Harvard architecture, with separate program and data
memories, although the designer may choose to exclude a
data memory entirely. The user can also specify the width
of each data bus so that unnecessary opcode bits do not
waste logic gates.
The picoMIPS has also been implemented to perform
the DCT and inverse DCT (IDCT) in a multi-core context
[38]. A homogeneous architecture was deployed with the
same single core structure, as in ﬁgure 3, being replicated
3 times. The cores are connected via a data bus to a
distribution module as shown in ﬁgure 4 where block
data is transferred to each core in turn. This structure
theoretically triples the throughput of the system as it
can process multiple data blocks in parallel.
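As a software reference for this arrangement, the sketch
below computes an unnormalised one-dimensional DCT-II and
hands data blocks to three cores in turn. It is a
behavioural illustration only: the function names, the
block contents and the sequential simulation of the cores
are assumptions, and the fixed-point arithmetic and 2-D
processing of the implementation in [38] are not modelled.

```python
import math


def dct_1d(samples):
    """Unnormalised DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    n_points = len(samples)
    return [
        sum(x * math.cos(math.pi / n_points * (n + 0.5) * k)
            for n, x in enumerate(samples))
        for k in range(n_points)
    ]


def distribute(blocks, n_cores=3):
    """Hand blocks to the cores in turn (round-robin)."""
    queues = [[] for _ in range(n_cores)]
    for i, block in enumerate(blocks):
        queues[i % n_cores].append((i, block))
    return queues


# Six hypothetical 8-sample blocks, each holding a constant value.
blocks = [[float(b + 1)] * 8 for b in range(6)]
queues = distribute(blocks)

# Each core transforms its own queue; the cores are simulated sequentially here.
results = {i: dct_1d(block) for queue in queues for i, block in queue}

for core_id, queue in enumerate(queues):
    print(core_id, [i for i, _ in queue])
```

The threefold throughput gain is an upper bound, since it
assumes that the distribution module can keep all three
cores supplied with blocks.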
As a microprocessor architecture, the picoMIPS can
implement many of the technologies discussed above to
improve energy efficiency. Clock gating, power domains and
DVFS would all benefit the system; however, the area
overhead of implementing them must first be considered.
Pipeline balancing and caching can be integrated into more
complex picoMIPS architectures; however, these are
performance-focused improvements and so are not priorities
in the picoMIPS concept. The expansion of the system to
multi-core is another option that can be employed to
improve performance.
Moreover, a heterogeneous architecture could be imple-
mented to allow the picoMIPS to process multiple dif-
ferent applications simultaneously using several tailored
ISAs. Reconﬁgurability can also be applied to picoMIPS
to create an architecture which can be speciﬁc to each
application that is executed, eﬀectively creating a general
purpose yet application speciﬁc processor. This property
would require run-time synthesis algorithms to detect and
develop the instructions and functional units that are
required, before executing the application.
IV. Conclusion
The principles of the picoMIPS processor have been
implemented in a few undergraduate projects to demonstrate
the concept of minimal architecture synthesis and how it
can be used to produce an application specific, energy
efficient processor. A number of examples were used to
demonstrate the validity of this approach in both
single-core and many-core designs. In addition to the
discrete cosine transform (DCT) algorithm presented above,
a stage in JPEG compression, as well as various image
manipulation algorithms, were synthesised for FPGA
implementation into a processor architecture based on the
picoMIPS concept. Evaluation of results from this work
still continues but it is evident that the resulting
processors are more area efficient than corresponding FPGA
soft-cores or a GPP due to the removal of unnecessary
circuitry. Such synthesised processors can also be
compared to a dedicated ASIC hardware implementation. ASIC
implementations are likely to have a much higher
performance and throughput of data; however, this is at
the cost of area and energy efficiency. The picoMIPS
therefore represents a balance between the two,
sacrificing some performance for area and energy
efficiency benefits.
Fig. 3: An example implementation of the picoMIPS architecture.
Fig. 4: A Multi-core implementation of the picoMIPS architecture.
References
[1] C. Isci, A. Buyuktosunoglu, C.-Y. Chen, P. Bose, and
M. Martonosi, “An Analysis of Eﬃcient Multi-Core Global
Power Management Policies: Maximizing Performance for a
Given Power Budget,” in Microarchitecture, 2006. MICRO-39.
39th Annual IEEE/ACM International Symposium on, 2006,
pp. 347–358.
[2] V. Hanumaiah and S. Vrudhula, “Energy-eﬃcient Operation
of Multi-core Processors by DVFS, Task Migration and Active
Cooling,” Computers, IEEE Transactions on, vol. PP, no. 99,
pp. 1–1, 2012.
[3] B. de Abreu Silva and V. Bonato, “Power/performance opti-
mization in FPGA-based asymmetric multi-core systems,” in
Field Programmable Logic and Applications (FPL), 2012 22nd
International Conference on, 2012, pp. 473–474.
[4] K. Wonyoung, M. Gupta, G.-Y. Wei, and D. Brooks, “System
level analysis of fast, per-core DVFS using on-chip switching
regulators,” in High Performance Computer Architecture, 2008.
HPCA 2008. IEEE 14th International Symposium on, 2008, pp.
123–134.
[5] P. Bassett and M. Saint-Laurent, “Energy eﬃcient design tech-
niques for a digital signal processor,” in IC Design Technology
(ICICDT), 2012 IEEE International Conference on, 2012, pp.
1–4.
[6] ARM, ARM Cortex-A15 MPCore Processor Technical
Reference Manual, ARM, June 2013, pages 53 - 63. [Online].
Available: http://infocenter.arm.com/help/topic/com.arm.doc.
ddi0438i/DDI0438I_cortex_a15_r4p0_trm.pdf
[7] R. Rodrigues, A. Annamalai, I. Koren, S. Kundu, and O. Khan,
“Performance Per Watt Beneﬁts of Dynamic Core Morphing in
Asymmetric Multicores,” in Parallel Architectures and Compi-
lation Techniques (PACT), 2011 International Conference on,
2011, pp. 121–130.
[8] R. Kumar, D. Tullsen, P. Ranganathan, N. Jouppi, and
K. Farkas, “Single-ISA heterogeneous multi-core architectures
for multithreaded workload performance,” in Computer Archi-
tecture, 2004. Proceedings. 31st Annual International Sympo-
sium on, 2004, pp. 64–75.
[9] M. Awan and S. Petters, “Energy-aware partitioning of tasks
onto a heterogeneous multi-core platform,” in Real-Time and
Embedded Technology and Applications Symposium (RTAS),
2013 IEEE 19th, 2013, pp. 205–214.
[10] R. Basmadjian and H. De Meer, “Evaluating and modeling
power consumption of multi-core processors,” in Future Energy
Systems: Where Energy, Computing and Communication Meet
(e-Energy), 2012 Third International Conference on, 2012,
pp. 1–10. [Online]. Available: http://ieeexplore.ieee.org/xpl/
articleDetails.jsp?arnumber=6221107
[11] R. Bonamy, D. Chillet, S. Bilavarn, and O. Sentieys, “Power
consumption model for partial and dynamic reconﬁguration,”
in Reconﬁgurable Computing and FPGAs (ReConFig), 2012
International Conference on, 2012, pp. 1–8.
[12] S. Saha, J. Deogun, and Y. Lu, “Adaptive energy-eﬃcient task
partitioning for heterogeneous multi-core multiprocessor real-
time systems,” in High Performance Computing and Simulation
(HPCS), 2012 International Conference on, 2012, pp. 147–153.
[13] D. Woo and H.-H. Lee, “Extending Amdahl’s Law for
Energy-Efficient Computing in the Many-Core Era,” Computer,
vol. 41, no. 12, pp. 24–31, 2008.
[14] R. Bahar and S. Manne, “Power and energy reduction via
pipeline balancing,” in Computer Architecture, 2001. Proceed-
ings. 28th Annual International Symposium on, 2001, pp.
218–229.
[15] J. Sartori, B. Ahrens, and R. Kumar, “Power balanced
pipelines,” in High Performance Computer Architecture
(HPCA), 2012 IEEE 18th International Symposium on, 2012,
pp. 1–12.
[16] R. Kumar, V. Zyuban, and D. Tullsen, “Interconnections in
multi-core architectures: understanding mechanisms, overheads
and scaling,” in Computer Architecture, 2005. ISCA ’05. Pro-
ceedings. 32nd International Symposium on, 2005, pp. 408–419.
[17] H. Zeng, J. Wang, G. Zhang, and W. Hu, “An interconnect-
aware power eﬃcient cache coherence protocol for CMPs,” in
Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE
International Symposium on, 2008, pp. 1–11.
[18] R. Kumar, K. Farkas, N. Jouppi, P. Ranganathan, and
D. Tullsen, “Single-ISA heterogeneous multi-core architectures:
the potential for processor power reduction,” in Microarchitec-
ture, 2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM
International Symposium on, 2003, pp. 81–92.
[19] P. Greenhalgh, “big.LITTLE Processing with ARM Cortex-A15
& Cortex-A7,” ARM, Tech. Rep., September 2011.
[20] H. M. Waidyasooriya, Y. Takei, M. Hariyama, and
M. Kameyama, “FPGA implementation of heterogeneous
multicore platform with SIMD/MIMD custom accelerators,”
in Circuits and Systems (ISCAS), 2012 IEEE International
Symposium on, 2012, pp. 1339–1342.
[21] B. de Abreu Silva, L. Cuminato, and V. Bonato, “Reducing the
overall cache miss rate using diﬀerent cache sizes for Heteroge-
neous Multi-core Processors,” in Reconﬁgurable Computing and
FPGAs (ReConFig), 2012 International Conference on, 2012,
pp. 1–6.
[22] A. Lukefahr, S. Padmanabha, R. Das, F. Sleiman, R. Dreslinski,
T. Wenisch, and S. Mahlke, “Composite Cores: Pushing Het-
erogeneity Into a Core,” in Microarchitecture (MICRO), 2012
45th Annual IEEE/ACM International Symposium on, 2012,
pp. 317–328.
[23] B. Jeﬀ, “Advances in big.LITTLE Technology for Power and
Energy Savings,” ARM, Tech. Rep., September 2012.
[24] S. Zhang and K. Chatha, “Automated techniques for energy
eﬃcient scheduling on homogeneous and heterogeneous chip
multi-processor architectures,” in Design Automation Confer-
ence, 2008. ASPDAC 2008. Asia and South Paciﬁc, 2008, pp.
61–66.
[25] A. Z. Jooya and M. Analoui, “Program phase detection in
heterogeneous multi-core processors,” in Computer Conference,
2009. CSICC 2009. 14th International CSI, 2009, pp. 219–224.
[26] L. Sawalha and R. Barnes, “Energy-Eﬃcient Phase-Aware
Scheduling for Heterogeneous Multicore Processors,” in Green
Technologies Conference, 2012 IEEE, 2012, pp. 1–6.
[27] F. Calcado, S. Louise, V. David, and A. Merigot, “Eﬃcient Use
of Processing Cores on Heterogeneous Multicore Architecture,”
in Complex, Intelligent and Software Intensive Systems, 2009.
CISIS ’09. International Conference on, 2009, pp. 669–674.
[28] M. Pericas, A. Cristal, F. Cazorla, R. Gonzalez, D. Jimenez,
and M. Valero, “A Flexible Heterogeneous Multi-Core Archi-
tecture,” in Parallel Architecture and Compilation Techniques,
2007. PACT 2007. 16th International Conference on, 2007, pp.
13–24.
[29] M. Santambrogio, “From Reconﬁgurable Architectures to Self-
Adaptive Autonomic Systems,” in Computational Science and
Engineering, 2009. CSE ’09. International Conference on,
vol. 2, 2009, pp. 926–931.
[30] J. Zalke and S. Pandey, “Dynamic Partial Reconﬁgurable Em-
bedded System to Achieve Hardware Flexibility Using 8051
Based RTOS on Xilinx FPGA,” in Advances in Computing,
Control, Telecommunication Technologies, 2009. ACT ’09. In-
ternational Conference on, 2009, pp. 684–686.
[31] S. Bhandari, S. Subbaraman, S. Pujari, F. Cancare, F. Bruschi,
M. Santambrogio, and P. Grassi, “High Speed Dynamic Partial
Reconﬁguration for Real Time Multimedia Signal Processing,”
in Digital System Design (DSD), 2012 15th Euromicro Confer-
ence on, 2012, pp. 319–326.
[32] S. Liu, R. Pittman, A. Forin, and J.-L. Gaudiot, “On energy
eﬃciency of reconﬁgurable systems with run-time partial recon-
ﬁguration,” in Application-speciﬁc Systems Architectures and
Processors (ASAP), 2010 21st IEEE International Conference
on, 2010, pp. 265–272.
[33] K. Changkyu, S. Sethumadhavan, M. S. Govindan, N. Ran-
ganathan, D. Gulati, D. Burger, and S. Keckler, “Composable
Lightweight Processors,” in Microarchitecture, 2007. MICRO
2007. 40th Annual IEEE/ACM International Symposium on,
2007, pp. 381–394.
[34] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez,
“Core fusion: accommodating software diversity in chip
multiprocessors,” in Proceedings of the 34th annual
international symposium on Computer architecture, ser. ISCA
’07. New York, NY, USA: ACM, 2007, pp. 186–197. [Online].
Available: http://doi.acm.org/10.1145/1250662.1250686
[35] Z. Rakossy, T. Naphade, and A. Chattopadhyay, “Design and
analysis of layered coarse-grained reconﬁgurable architecture,”
in Reconﬁgurable Computing and FPGAs (ReConFig), 2012
International Conference on, 2012, pp. 1–6.
[36] K. Changmoo, C. Mookyoung, C. Yeongon, M. Konijnenburg,
R. Soojung, and K. Jeongwook, “ULP-SRP: Ultra low power
Samsung Reconﬁgurable Processor for biomedical applications,”
in Field-Programmable Technology (FPT), 2012 International
Conference on, 2012, pp. 329–334.
[37] F. J. Veredas, M. Scheppler, W. Moﬀat, and B. Mei, “Custom
implementation of the coarse-grained reconﬁgurable ADRES
architecture for multimedia purposes,” in Field Programmable
Logic and Applications, 2005. International Conference on,
2005, pp. 106–111.
[38] G. Liu, “FPGA implementation of 2D-DCT/IDCT algorithm using
multi-core picoMIPS,” Master’s thesis, University of Southampton,
School of Electronics and Computer Science, September
2013.