158 research outputs found
Programming the Adapteva Epiphany 64-core Network-on-chip Coprocessor
In the construction of exascale computing systems energy efficiency and power
consumption are two of the major challenges. Low-power high performance
embedded systems are of increasing interest as building blocks for large scale
high- performance systems. However, extracting maximum performance out of such
systems presents many challenges. Various aspects from the hardware
architecture to the programming models used need to be explored. The Epiphany
architecture integrates low-power RISC cores on a 2D mesh network and promises
up to 70 GFLOPS/Watt of processing efficiency. However, with just 32 KB of
memory per eCore for storing both data and code, and only low level inter-core
communication support, programming the Epiphany system presents several
challenges. In this paper we evaluate the performance of the Epiphany system
for a variety of basic compute and communication operations. Guided by this
data we explore strategies for implementing scientific applications on memory
constrained low-powered devices such as the Epiphany. With future systems
expected to house thousands of cores in a single chip, the merits of such
architectures as a path to exascale is compared to other competing systems.Comment: 14 pages, submitted to IJHPCA Journal special editio
HARDWARE DESIGN OF MESSAGE PASSING ARCHITECTURE ON HETEROGENEOUS SYSTEM
Heterogeneous multi/many-core chips are commonly used in todayâs top tier supercomputers. Similar heterogeneous processing elements â or, computation ac- celerators â are commonly found in FPGA systems. Within both multi/many-core chips and FPGA systems, the on-chip network plays a critical role by connecting these processing elements together. However, The common use of the on-chip network is for point-to-point communication between on-chip components and the memory in- terface. As the system scales up with more nodes, traditional programming methods, such as MPI, cannot effectively use the on-chip network and the off-chip network, therefore could make communication the performance bottleneck.
This research proposes a MPI-like Message Passing Engine (MPE) as part of the on-chip network, providing point-to-point and collective communication primitives in hardware. On one hand, the MPE improves the communication performance by offloading the communication workload from the general processing elements. On the other hand, the MPE provides direct interface to the heterogeneous processing ele- ments which can eliminate the data path going around the OS and libraries. Detailed experimental results have shown that the MPE can significantly reduce the com- munication time and improve the overall performance, especially for heterogeneous computing systems because of the tight coupling with the network. Additionally, a hybrid âMPI+Xâ computing system is tested and it shows MPE can effectively of- fload the communications and let the processing elements play their strengths on the computation
A review of High Performance Computing foundations for scientists
The increase of existing computational capabilities has made simulation
emerge as a third discipline of Science, lying midway between experimental and
purely theoretical branches [1, 2]. Simulation enables the evaluation of
quantities which otherwise would not be accessible, helps to improve
experiments and provides new insights on systems which are analysed [3-6].
Knowing the fundamentals of computation can be very useful for scientists, for
it can help them to improve the performance of their theoretical models and
simulations. This review includes some technical essentials that can be useful
to this end, and it is devised as a complement for researchers whose education
is focused on scientific issues and not on technological respects. In this
document we attempt to discuss the fundamentals of High Performance Computing
(HPC) [7] in a way which is easy to understand without much previous
background. We sketch the way standard computers and supercomputers work, as
well as discuss distributed computing and discuss essential aspects to take
into account when running scientific calculations in computers.Comment: 33 page
On the acceleration of wavefront applications using distributed many-core architectures
In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applicationsâa ubiquitous class of parallel algorithms used for the solution of a number of scientific and engineering applications. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P). Benchmark results are presented for problem classes A to C and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions will far exceed those of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithm on GPU-based architectures
Optimizing Xeon Phi for Interactive Data Analysis
The Intel Xeon Phi manycore processor is designed to provide high performance
matrix computations of the type often performed in data analysis. Common data
analysis environments include Matlab, GNU Octave, Julia, Python, and R.
Achieving optimal performance of matrix operations within data analysis
environments requires tuning the Xeon Phi OpenMP settings, process pinning, and
memory modes. This paper describes matrix multiplication performance results
for Matlab and GNU Octave over a variety of combinations of process counts and
OpenMP threads and Xeon Phi memory modes. These results indicate that using
KMP_AFFINITY=granlarity=fine, taskset pinning, and all2all cache memory mode
allows both Matlab and GNU Octave to achieve 66% of the practical peak
performance for process counts ranging from 1 to 64 and OpenMP threads ranging
from 1 to 64. These settings have resulted in generally improved performance
across a range of applications and has enabled our Xeon Phi system to deliver
significant results in a number of real-world applications.Comment: 6 pages, 5 figures, accepted in IEEE High Performance Extreme
Computing (HPEC) conference 201
- âŠ