Marrying Many-core Accelerators and InfiniBand for a New Commodity
  Processor by Solnushkin, Konstantin S. & Tsujita, Yuichi
Konstantin S. Solnushkin
konstantin@solnushkin.org
Yuichi Tsujita∗
Kinki University, Japan
yuichi_tsujita@fw.ipsj.or.jp
ABSTRACT
During the last 15 years, the supercomputing industry has
been using mass-produced, off-the-shelf components to build
cluster computers. Such components are not perfect for
HPC purposes, but are cheap due to economies of scale in
their production. The coming exa-scale era changes the
landscape: exa-scale computers will contain components in
quantities large enough to justify their custom development
and production.
We propose a new heterogeneous processor, equipped with
a network controller and designed specifically for HPC. We
then show how it can be used in the enterprise computing
market, guaranteeing its widespread adoption and therefore
low production costs.
Categories and Subject Descriptors
C.1 [Computer Systems Organization]: Processor Ar-
chitectures
Keywords
High-performance computing, Economics
1. INTRODUCTION
In the 1990s, commodity off-the-shelf components made it
possible to build inexpensive but powerful cluster computers,
disrupting the supercomputing market. Those components were
not perfect for HPC, but were readily available and cheap.
Current attempts to use commodity components are still
focused on taking technologies initially created for enterprise
computing and painfully fitting them into the Procrustean
bed of HPC.
However, future exa-scale computers will have so many
identical building blocks such as CPUs – on the order of
millions [1] – that it becomes feasible to amortise their
custom design and manufacturing costs over large production
batches. We therefore suggest capitalising on this trend by
designing a new commodity processor, with HPC being its
primary workload.
At the same time, the enterprise computing market is
significantly bigger than the HPC market. Thus, for widest
adoption, we should design the processor to make it suitable
for data centre workloads as well, resulting in a unified
architecture.
∗Present address: RIKEN, AICS, Japan
We believe that the architecture of the new CPU should be
based on the many-core paradigm, while getting rid of
inefficiencies found in some of the current implementations. For
example, the Intel Xeon Phi product received positive
feedback from early adopters; however, the accelerator board
is not “standalone”, as it needs to be plugged into a “host”
computer. This requires the use of a separate host CPU,
which leads to decreased density and loss of flexibility. Even
more importantly, the accelerator communicates with the
outside world through the host’s network adaptor via a PCIe
connection, which adds to latency.
The current Xeon Phi implementation is not very useful for
generic data centre applications, either. Configurations with
up to 8 accelerator boards in a server were demonstrated,
but in this case each accelerator receives only a share of
the network bandwidth.
We propose remedying the situation by integrating a many-
core unit with a general-purpose multi-core CPU and a net-
work adaptor, thus turning the board into a standalone and
bootable server and cluster compute node. We then argue
why this product can become useful not just to HPC but
to a significantly broader user base, including data centre
environments and desktop workstations. As predicted by
HiPEAC [2], heterogeneous chips, similar to the one pro-
posed in this paper, will be prevalent in the future.
2. RELATED WORK
The DAL Project (“Defying Amdahl’s Law”) [3] explores future
many-core architectures where several complex cores are
accompanied by hundreds of simple cores on the same chip.
The project’s proposal is to intermittently clock the complex
cores at a high frequency, subject to the thermal dissipation
limit of the chip, to speed up execution of the sequential parts
of applications, while the parallel parts continue running on
the array of simple cores. The goal of the project is to improve
the microarchitecture so that a running thread can quickly
migrate from a simple core to a complex one, and vice
versa. As the heterogeneous CPU that we propose in this
paper also features a mix of complex and simple cores, the
microarchitectural advances provided by the DAL project are
applicable in our case.
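The benefit of boosting the sequential part can be illustrated with a
small Amdahl-style estimate. This is only a sketch under simplifying
assumptions that are ours, not the DAL project's: the baseline is a
single simple core, the parallel part scales perfectly, and the
function name and the `boost` parameter are hypothetical.

```python
def amdahl_speedup(parallel_fraction, simple_cores, boost=1.0):
    """Estimated speedup over one simple core running the whole program.

    parallel_fraction: share of the work that parallelises perfectly (0..1)
    simple_cores:      number of simple cores executing the parallel part
    boost:             assumed speed of the complex core on the sequential
                       part, relative to a simple core (1.0 = no boost)
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if simple_cores < 1 or boost <= 0:
        raise ValueError("simple_cores must be >= 1 and boost must be > 0")
    # Time spent in the sequential part, shortened by the boosted core.
    sequential_time = (1.0 - parallel_fraction) / boost
    # Time spent in the parallel part, spread over the simple cores.
    parallel_time = parallel_fraction / simple_cores
    return 1.0 / (sequential_time + parallel_time)
```

With 90% parallel work on 100 simple cores, doubling the speed of the
sequential part raises the estimate noticeably, because the sequential
part dominates the remaining run time.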
DEEP Project (“Dynamical Exascale Entry Platform”) [4]
proposes to build a “cluster of accelerators” – that is, an ar-
ray of independent, bootable accelerator boards connected
with a high-speed network. This array is then connected
to a conventional HPC cluster. Compute jobs are then di-
verted to the part of this complex which is most suitable
for their execution. The structure of this “cluster of accel-
erators” is not intended for workloads other than HPC. Our
approach is different: while we also propose to have indepen-
dent and bootable accelerator boards, our CPU will contain
cores complex enough to run nearly any workload, ranging
from HPC to data mining and from CAE to data centre
tasks.
“Project Denver”, proposed by NVIDIA [5], intends to build
a heterogeneous processor that couples general-purpose cores
based on the ARM architecture with CUDA-programmed GPU
cores. However, the GPU part is unable to run data centre
server workloads. Another NVIDIA project is “Echelon”
[6], which its authors claim will become a general-purpose
system suitable for data-intensive and HPC workloads,
based on a long instruction word (LIW) architecture.
The “Runnemede” chip, proposed by Intel [7], employs a large
number of simple cores called execution engines, controlled
by a smaller number of general-purpose cores called control
engines. Its execution model borrows ideas from dataflow
architectures.
The “Mont-Blanc Project” [8] investigates the use of commodity,
readily available ARM processors, such as those used in cell
phones, for HPC purposes.
3. STRUCTURE AND FUNCTIONS
3.1 Integration Scenarios
We describe three possible integration scenarios. In the sim-
plest case, two mass-produced, off-the-shelf chips – one for a
multi-core CPU, and one for a many-core accelerator – are
placed on a single board, accompanied by a network chip
and memory chips. The board is bootable and serves as a
standalone cluster compute node. This is a step forward
from the current Intel Xeon Phi implementation that needs
a host computer to be plugged into, for booting and com-
municating with the outside world.
However, this scenario lacks efficient communication between
the multi-core and many-core parts. Hence the second, more
advanced integration scenario is to put the multi-core and
many-core parts into a single IC package, or possibly on the
same die, with the network adaptor still implemented as a
separate chip. Fast communication between the two parts of the
tandem via an on-chip network will allow the multi-core part to
perform I/O and MPI delegation functions for the many-core part.
Finally, the third scenario integrates the network adaptor
on the same chip. Deep integration, resulting in a
System-on-a-Chip (SoC), will incur custom design and fabrication
costs which are best offset by mass production. However,
we believe that the applicability of this SoC to a very broad
market will help amortise costs and turn it into a new sort
of inexpensive commodity hardware.
To make the SoC suitable for desktop workstations as well,
we propose to add a simple GPU and some commonly used
accelerators, such as hardware-assisted encryption units; this
will not take much of the die space. These units may
remain unused for some workloads (being so-called “dark
silicon”, similar to how the floating-point units of current CPUs
are mostly unused in data-mining and server workloads [9]).
Adding an FPGA unit on the chip, similar to the “Xilinx
Zynq” product, will provide yet more flexibility through
field programming.
3.2 Use Cases
The many-core part will consist of identical multithreaded,
simple, low-power cores tailored to floating-point
computations, as in existing many-core implementations such as
Intel Xeon Phi and GPUs.
Three use cases for utilising the proposed chip in different
environments are shown in Figure 1. In each of these
scenarios, the general-purpose multi-core unit doesn’t have
to be very powerful in terms of floating-point performance,
as it is assumed that for intensive computing tasks the
software will be able to utilise the many-core part (via, say,
OpenCL).
For data centre workloads, cores in the many-core part can
be visible to the operating system as individual CPUs for
easy task scheduling. This usage, involving multithreaded
simple cores, is similar to Sun’s “UltraSPARC T2” CPU and
Hewlett-Packard’s recent “Project Moonshot” [10].
To reduce requirements on memory bandwidth, cache mem-
ory will be required for legacy applications; we propose to
implement it using eDRAM technology. The effectiveness
of this approach was proven by IBM’s POWER7 proces-
sor [11]. Alternatively, for newer applications that prefer to
manage memory accesses on their own, and don’t require
hardware caching logic, this on-chip memory can be config-
ured as software-controlled scratch pads, leading to energy
savings [6, 7].
The crucial question is how many cores the many-core part
should have. The chip’s I/O bandwidth is limited by its pin
count, and we should aim for ample bandwidth per core, as
the chip’s primary purpose is HPC. Therefore we shouldn’t
be tempted to place as many cores as possible on the chip.
Other limits on the number of cores are related to heat
dissipation and the yield of the semiconductor fabrication
process.
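The trade-off between core count and bandwidth per core can be made
concrete with a back-of-the-envelope calculation. The sketch below is
purely illustrative: the function names are hypothetical, and any
numbers passed in are placeholders rather than figures from the
proposal.

```python
def bandwidth_per_core(io_bandwidth, cores):
    """I/O bandwidth available to each core if it is shared evenly."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return io_bandwidth / cores


def max_cores_for_target(io_bandwidth, target_per_core):
    """Largest core count that still leaves target_per_core for each core.

    Both arguments use the same (arbitrary) bandwidth unit.
    """
    if target_per_core <= 0:
        raise ValueError("target_per_core must be positive")
    return int(io_bandwidth // target_per_core)
```

For a fixed pin-limited bandwidth, halving the per-core target doubles
the admissible core count, which is why the per-core target has to be
set from the HPC workload first and the core count derived from it.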
Our choice of network is InfiniBand, which can currently be
regarded as a commodity technology. It has the following
benefits: reasonable technical characteristics, a clear
technology roadmap, hardware produced on a large scale, and
support for a variety of network topologies. For HPC and
data centre environments, every 18 boards with the proposed
CPU can be connected, via a backplane, to an InfiniBand
switch chip, which would further connect them to the rest
of the fabric (in fat-tree, torus or other topologies).
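Connecting 18 boards to one switch chip suggests a switch with half of
its ports facing the boards and half facing the fabric. Assuming a
36-port switch chip (our assumption; the port count is not stated
here), the size of a non-blocking two-level fat-tree can be estimated
with the sketch below. The function name and the returned dictionary
keys are hypothetical.

```python
import math


def two_level_fat_tree(nodes, switch_ports=36):
    """Rough switch count for a non-blocking two-level fat-tree.

    Each leaf switch uses half of its ports for boards and half for
    uplinks; spine switches use all ports for links to leaf switches.
    """
    down_ports = switch_ports // 2              # boards per leaf switch
    max_nodes = switch_ports * down_ports       # limit of two levels
    if nodes < 1 or nodes > max_nodes:
        raise ValueError("node count does not fit a two-level fat-tree")
    leaf_switches = math.ceil(nodes / down_ports)
    if leaf_switches == 1:
        spine_switches = 0                      # a single switch suffices
    else:
        uplinks = leaf_switches * (switch_ports - down_ports)
        spine_switches = math.ceil(uplinks / switch_ports)
    return {
        "leaf_switches": leaf_switches,
        "spine_switches": spine_switches,
        "max_nodes": max_nodes,
    }
```

Under these assumptions a two-level fabric tops out at 648 boards;
larger installations would need a third switch level or a different
topology such as a torus.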
Besides the aforementioned components, the motherboard
will contain DRAM memory modules as well as an optional
flash-based module for scratch storage, creating an
additional level of memory hierarchy, as featured in the
“Gordon” supercomputer [12].
When running at full speed, the proposed chip is best cooled
with water. The feasibility and reliability of this approach
at large scale were verified with the “SuperMUC” machine
[13]. To facilitate waste heat reuse in water-cooled
environments, mid-scale installations (∼100 kW and higher) are
preferred [14]. However, in desktop computing water cooling
is usually not available, and aggressive automatic power
throttling (via under-clocking) of the many-core part will
be required during its operation to prevent accidental chip
overheating (the “dim silicon” approach, according to Taylor’s
classification [9]).
Figure 1: Three use cases for the proposed system with corresponding workloads: (a) HPC environments,
(b) generic data centre environments, (c) desktop computing environments
4. SOFTWARE ECOSYSTEM
Instead of the ubiquitous but outdated and proprietary x86
architecture, it is tempting to utilise an open-source
architecture in the proposed system. This will promote research
and collaboration, including parties from the private sector,
and can stimulate competition.
A good candidate is the “OpenSPARC” architecture and the
“OpenSPARC T2” microarchitecture implementation. (For
example, the “SPARC64 VIIIfx” CPU, as found in the Japanese
“K Computer”, is also based on the SPARC architecture [15].)
To make the leap to many-core, the same trick could be
employed that Intel used for Xeon Phi: stripping less-needed
features such as out-of-order execution while simultaneously
widening floating-point units.
The OpenSPARC architecture is supported by the GNU/Linux
operating system and by GCC (the GNU Compiler Collection).
In the HPC environment, the amount of effort required to
build the ecosystem for the new CPU (upgrading the Linux
kernel, porting MPI implementations, several important
numerical libraries and advanced compilers) is lower than
in the server segment, where the usual data centre
applications will additionally need to be ported for market
acceptance. Proliferation in the desktop segment is difficult
to achieve unless Microsoft Windows and device drivers are
ported to OpenSPARC. Instead, a GNU/Linux distribution
such as Ubuntu running on OpenSPARC could be refined to
a level suitable for desktop use.
Using OpenSPARC as the base architecture may not lead
to designs achieving as high performance per watt as the Intel
Xeon Phi or IBM BlueGene/Q designs, although heat reuse
methods will alleviate this problem. What is more important,
however, is that the proposed CPU will find use in
many niches; therefore its mass production will keep its price
low. Additionally, methods and technologies from the HPC
niche will be directly applicable in the data centre field due
to the unified architecture.
Figure 2: Optimisation scheme in collective communications,
in a hybrid system
(Diagram labels: leader process, forwarder, collected data,
gathered data, copy or share, NIC; a multi-core side and a
many-core side on each of three nodes. A leader process on each
node manages inter-node communications; the allgather
communication takes place on the multi-core side.)
There is a risk that a product that tries to fit all three niches
– HPC, server and desktop – might fit in none. However,
the market is diluted by assorted products that appear every
year. Perhaps it’s time for reunification. Adherence to open-
source policies at all stages – in both accepting and giving
– will help offset the costs and reach the goal faster.
5. PERSPECTIVE VIEW OF COLLECTIVE
OPERATION IMPLEMENTATIONS
Flat MPI is no longer scalable on recent accelerator-based
hybrid systems where many-core processors are provided in
the form of a PCIe card [16]. Thus we propose a hybrid
system, with many-core and multi-core processors within
the same compute node. However, MPI is still useful for
most applications.
Therefore we describe here a perspective view of the MPI
library software stack, especially for collective operations.
Existing software tools for designing computer clusters [17]
can be enhanced to accompany their designs with hints on
optimal MPI process placement among cluster nodes, taking
network proximity into account. This will facilitate efficient
operation of multi-layer MPI communicators and collective
operations.
Figure 3: Optimisation scheme in collective communications,
many-core processors only
(Diagram labels: leader process, collected data, gathered data,
copy or share, NIC, on each of three nodes. A leader process on
each node manages inter-node communications; the allgather
communication takes place among leader processes.)
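A placement hint that takes network proximity into account could be as
simple as filling the nodes under one switch before moving to the
next, so that neighbouring MPI ranks share a switch whenever possible.
The following is a minimal sketch of that idea only; the function name
and the input format (a mapping from switch name to node names) are
hypothetical and do not describe the tools in [17].

```python
def place_ranks(nodes_by_switch, ranks_per_node):
    """Return a list whose index is the MPI rank and whose value is the
    (switch, node) pair that rank is placed on.

    Ranks are assigned switch by switch and node by node, so consecutive
    ranks stay close to each other in the network.
    """
    if ranks_per_node < 1:
        raise ValueError("ranks_per_node must be at least 1")
    placement = []
    for switch in sorted(nodes_by_switch):
        for node in nodes_by_switch[switch]:
            # All ranks of one node are contiguous in the rank order.
            placement.extend([(switch, node)] * ranks_per_node)
    return placement
```

For example, `place_ranks({"sw0": ["n0", "n1"], "sw1": ["n2"]}, 2)`
puts ranks 0 to 3 under `sw0` and ranks 4 and 5 under `sw1`.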
Providing an MPI library for such a hybrid system may
attract the attention of existing application users, allowing
them to easily exploit the parallelism of hybrid systems. Here
we show the following schemes, with the MPI_Allgather operation
as an example.
1. A delegation mechanism as shown in Figure 2
• Considering the recent hybrid architecture with
many-core and multi-core processors within the
same node
• Aggregating collective communication and I/O re-
quests from the many-core side, followed by real
operations by a forwarder process on the multi-
core side
• Gathered data are then copied to non-leader pro-
cesses, or accessed by them in a shared manner.
2. Multi-layered system available on a many-core-only
system (multi-core unit not available), shown in Figure 3.
Here we have the following assumptions about this
system:
• Complicated hierarchy in memory architecture
• Shortage of available memory per core
Therefore multi-layered MPI communicator
management is beneficial. Based on this scheme, we
may form groups within compute nodes:
• Every group has internal collective communica-
tions.
• Every leader process manages communications on
behalf of the associated group.
• Gathered data are then copied to non-leader pro-
cesses, or accessed by them in a shared manner.
In both cases, some kind of shared memory management
mechanism is required inside a compute node.
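The data flow common to both schemes can be sketched as a small
simulation: local contributions are collected by a leader, the leaders
exchange data among themselves, and the gathered data are then copied
to, or shared with, the other local processes. This is plain Python
rather than MPI code, and the function name and the `share` flag are
hypothetical.

```python
def hierarchical_allgather(node_contributions, share=False):
    """Simulate a two-layer allgather.

    node_contributions[n][p] is the value contributed by process p on
    node n. The result has the same shape; each entry is the gathered
    list that the corresponding process ends up with.
    """
    # Step 1: the leader on each node collects the local contributions.
    collected = [list(local) for local in node_contributions]
    # Step 2: allgather among leaders; every leader obtains the
    # concatenation of all nodes' data, in node order.
    leader_buffers = [
        [value for local in collected for value in local]
        for _ in collected
    ]
    # Step 3: non-leader processes receive a copy of the gathered data,
    # or access the leader's buffer in a shared manner.
    results = []
    for buffer, local in zip(leader_buffers, node_contributions):
        if share:
            results.append([buffer for _ in local])
        else:
            results.append([list(buffer) for _ in local])
    return results
```

With `share=True` all processes on a node reference the same buffer,
which mirrors the shared-memory variant; with `share=False` each
process holds its own copy, at the cost of extra memory per core.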
Figure 4: Communication optimisation based on
hardware affinity
(Diagram labels: cached information, persistent information,
top layer routing table, leader process.)
Another optimisation will come from making use of
hardware affinity in process placement and communication
topology. “hwloc” [18] and “likwid” [19] are candidates for
understanding the memory hierarchy. These or similar tools
can be used to facilitate effective process mapping or
distributed MPI communicator information management, as
shown in Figure 4.
The APIs of such hardware affinity tools can be used in the
process management scheme of an extended MPI library. In
order to minimise local communicator information on each
node, only information about processes in the local group
is kept inside the same node. Information about external
processes is initially queried through the communicator
management scheme, and a caching mechanism will be
implemented to reuse connection information.
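The split between locally kept information and cached information
about external processes can be sketched as follows. This is an
illustration of the idea only, not part of any existing MPI library:
the class name, its methods and the `query_remote` callback (standing
in for the communicator management scheme) are all hypothetical.

```python
class CommunicatorInfo:
    """Per-node store of connection information for MPI processes."""

    def __init__(self, local_group, query_remote):
        # Information about processes in the local group is always kept.
        self._persistent = dict(local_group)
        # Information about external processes is filled in on demand.
        self._cache = {}
        # Hypothetical callback: rank -> connection information.
        self._query_remote = query_remote
        self.remote_queries = 0

    def lookup(self, rank):
        """Return connection information for the given rank."""
        if rank in self._persistent:
            return self._persistent[rank]
        if rank not in self._cache:
            # First contact: ask the management scheme, then remember.
            self.remote_queries += 1
            self._cache[rank] = self._query_remote(rank)
        return self._cache[rank]
```

Repeated lookups of the same external rank trigger only one remote
query, which is the reuse of connection information described above.
A real implementation would also need a policy for evicting or
invalidating cached entries, which this sketch leaves out.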
Combining interconnection fabrics inside the same chip may
lead to smaller communication latency, because we can elim-
inate PCIe access overhead. However, this approach has
technical challenges associated with custom fabrication. Hav-
ing many functions in the same chip leads to a higher electric
power consumption of the chip; this is yet another challenge.
6. CONCLUSIONS
In this paper we propose a new commodity processor,
designed with HPC in mind but suitable for a much broader set
of workloads. We review a perspective approach to the MPI
library software stack, where a multi-core unit performs
collective operations on behalf of the many-core unit. We also
note that the appeal of the resulting hardware solution to
the data centre market will justify its large-scale production,
keeping costs low.
7. ACKNOWLEDGEMENTS
The work was presented at the International Conference on
Computational Science (ICCS 2013) in Barcelona, Spain.
Travel support for Konstantin S. Solnushkin was provided
by the National Research University ITMO, St. Petersburg,
Russia under agreement №11.G34.31.0019.
8. REFERENCES
[1] J. Dongarra, P. Beckman, T. Moore, P. Aerts,
G. Aloisio, J. Andre, D. Barkai, J. Berthou, T. Boku,
B. Braunschweig, et al., The international exascale
software project roadmap, International Journal of
High Performance Computing Applications 25 (1)
(2011) 3.
[2] M. Duranton, S. Yehia, B. De Sutter,
K. De Bosschere, A. Cohen, B. Falsafi, G. Gaydadjiev,
M. Katevenis, J. Maebe, H. Munk, N. Navarro,
A. Ramirez, O. Temam, M. Valero, The HiPEAC
vision, http://www.hipeac.net/roadmap (2010).
[3] DAL Project, Defying Amdahl’s law,
http://www.irisa.fr/alf/dal/.
[4] DEEP Project, Dynamical exascale entry platform,
http://www.deep-project.eu/.
[5] B. Dally, “Project Denver” processor to usher in new
era of computing,
http://blogs.nvidia.com/2011/01/project-denver-processor-to-usher-in-new-era-of-computing/
(January 2011).
[6] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland,
D. Glasco, GPUs and the future of parallel computing,
Micro, IEEE 31 (5) (2011) 7–17.
[7] N. P. Carter, A. Agrawal, S. Borkar, R. Cledat,
H. David, D. Dunning, J. Fryman, I. Ganev, R. A.
Golliver, R. Knauerhase, R. Lethin, B. Meister, A. K.
Mishra, W. R. Pinfold, J. Teller, J. Torrellas,
N. Vasilache, G. Venkatesh, J. Xu, Runnemede: An
architecture for ubiquitous high-performance
computing, in: High Performance Computer
Architecture (HPCA), 2013 IEEE 19th International
Symposium on, 2013.
[8] N. Rajovic, L. Vilanova, C. Villavieja, N. Puzovic,
A. Ramirez, The low power architecture approach
towards exascale computing, Journal of
Computational Science, doi:10.1016/j.jocs.2013.01.002.
[9] M. B. Taylor, Is dark silicon useful? Harnessing the
four horsemen of the coming dark silicon apocalypse,
in: Proceedings of the 49th Annual Design
Automation Conference, ACM, 2012, pp. 1131–1136.
[10] T. P. Morgan, HP Project Moonshot hurls ARM
servers into the heavens,
http://www.theregister.co.uk/2011/11/01/hp_redstone_calxeda_servers/
(November 2011).
[11] R. Kalla, B. Sinharoy, W. J. Starke, M. Floyd,
POWER7: IBM’s next-generation server processor,
IEEE Micro 30 (2) (2010) 7–15.
[12] S. Strande, P. Cicotti, R. Sinkovits, W. Young,
R. Wagner, M. Tatineni, E. Hocks, A. Snavely,
M. Norman, Gordon: design, performance, and
experiences deploying and supporting a data intensive
supercomputer, in: Proceedings of the 1st Conference
of the Extreme Science and Engineering Discovery
Environment: Bridging from the eXtreme to the
campus and beyond, ACM, 2012, p. 3.
[13] M. Brehm, A. Auweter, H. Huber, T. Wilde, Energy
efficient HPC systems: Concepts, procurement &
installation, in: Proceedings of International
Supercomputing Conference, ISC’12, 2012.
[14] K. S. Solnushkin, Fruits of computing: Redefining
’Green’ in HPC energy usage,
http://clusterdesign.org/2012/08/fruits-of-computing-redefining-green-in-hpc-energy-usage/
(August 2012).
[15] T. Yoshida, M. Hondo, R. Kan, G. Sugizaki,
SPARC64 VIIIfx: CPU for the K computer, Fujitsu
Sci. Tech. J 48 (3) (2012) 274–279.
[16] K. Yoshinaga, Y. Tsujita, A. Hori, M. Sato,
M. Namiki, Y. Ishikawa, Delegation-based MPI
communications for a hybrid parallel computer with
many-core architecture, in: Recent Advances in the
Message Passing Interface, LNCS 7490, Springer,
2012, pp. 47–56.
[17] K. S. Solnushkin, Computer cluster design automation
using web services, in: Proceedings of International
Supercomputing Conference, ISC’12, 2012.
URL http://konstantin.solnushkin.org
[18] F. Broquedis, J. Clet-Ortega, S. Moreaud,
N. Furmento, B. Goglin, G. Mercier, S. Thibault,
R. Namyst, hwloc: A generic framework for managing
hardware affinities in HPC applications, in: Parallel,
Distributed and Network-Based Processing (PDP),
2010 18th Euromicro International Conference on,
IEEE, 2010, pp. 180–186.
[19] J. Treibig, G. Hager, G. Wellein, Likwid: A
lightweight performance-oriented tool suite for x86
multicore environments, in: Parallel Processing
Workshops (ICPPW), 2010 39th International
Conference on, IEEE, 2010, pp. 207–216.
