1,136 research outputs found
Mixing multi-core CPUs and GPUs for scientific simulation software
Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphics processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some application paradigms. Software languages and systems such as NVIDIA's CUDA and the Khronos consortium's Open Computing Language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and many-core GPUs for data parallelism is necessary. We describe our use of hybrid applications using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the application-level programmer, and offer some suggested areas for language development and integration between coarse-grained and fine-grained multi-thread systems. We discuss results from three common simulation algorithmic areas: partial differential equations; graph cluster metric calculations; and random number generation. We report on programming experiences and selected performance for these algorithms on single and multiple GPUs, multi-core CPUs, a CellBE, and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scientific applications developers.
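As a concrete picture of the coarse-grained layer this abstract describes, the following Python sketch (our illustration, not the paper's code; the device count and the squaring work are placeholder assumptions) runs one host thread per GPU, with each thread owning a contiguous chunk of the domain:

```python
# Minimal sketch: one coarse-grained CPU thread per GPU device,
# with placeholder work standing in for kernel launches.
import threading

NUM_DEVICES = 2  # hypothetical GPU count

def run_on_device(device_id, chunk, results):
    # In a real CUDA program this thread would first bind to its GPU
    # (e.g., cudaSetDevice(device_id)) and then launch data-parallel
    # kernels on the chunk it owns. Here we just square the values.
    results[device_id] = [x * x for x in chunk]

def hybrid_map(data):
    """Split `data` across devices, one coarse-grained CPU thread each."""
    step = (len(data) + NUM_DEVICES - 1) // NUM_DEVICES
    results = [None] * NUM_DEVICES
    threads = [
        threading.Thread(target=run_on_device,
                         args=(d, data[d * step:(d + 1) * step], results))
        for d in range(NUM_DEVICES)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [y for part in results for y in part]

if __name__ == "__main__":
    print(hybrid_map(list(range(8))))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The pattern separates concerns the way the abstract suggests: the host threads provide coarse-grained parallelism across devices, while the fine-grained data parallelism lives inside whatever each device executes.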
The GPU vs Phi Debate: Risk Analytics Using Many-Core Computing
The risk of reinsurance portfolios covering globally occurring natural
catastrophes, such as earthquakes and hurricanes, is quantified by employing
simulations. These simulations are computationally intensive and require large
amounts of data to be processed. The use of many-core hardware accelerators, such as the Intel Xeon Phi and the NVIDIA Graphics Processing Unit (GPU), is desirable for achieving high-performance risk analytics. In this paper, we set
out to investigate how accelerators can be employed in risk analytics, focusing
on developing parallel algorithms for Aggregate Risk Analysis, a simulation
which computes the Probable Maximum Loss of a portfolio taking both primary and
secondary uncertainties into account. The key result is that both hardware accelerators are useful in different contexts: without taking data transfer times into account, the Phi had the lowest execution times when used independently, while the GPU along with a host in a hybrid platform yielded the best performance.
Comment: A modified version of this article is accepted to the Computers and Electrical Engineering Journal under the title "The Hardware Accelerator Debate: A Financial Risk Case Study Using Many-Core Computing"; Blesson Varghese, "The Hardware Accelerator Debate: A Financial Risk Case Study Using Many-Core Computing," Computers and Electrical Engineering, 201
2HOT: An Improved Parallel Hashed Oct-Tree N-Body Algorithm for Cosmological Simulation
We report on improvements made over the past two decades to our adaptive
treecode N-body method (HOT). A mathematical and computational approach to the
cosmological N-body problem is described, with performance and scalability
measured up to 256k (2^18) processors. We present error analysis and scientific application results from a series of more than ten 69 billion (2^36) particle cosmological simulations, accounting for floating point operations. These results include the first simulations using
the new constraints on the standard model of cosmology from the Planck
satellite. Our simulations set a new standard for accuracy and scientific
throughput, while meeting or exceeding the computational efficiency of the
latest generation of hybrid TreePM N-body methods.
Comment: 12 pages, 8 figures, 77 references; to appear in Proceedings of SC '13
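In the Warren–Salmon hashed oct-tree scheme that HOT builds on, a cell is addressed by a key formed by interleaving the bits of its spatial coordinates, with a leading 1 bit so that keys at different tree levels stay distinct; the tree then lives in a hash table keyed on these values. A minimal sketch of that key construction (our illustration; the coordinate bit depth is arbitrary):

```python
# Sketch of the key idea behind a hashed oct-tree: map each body's
# coordinates to an integer key by interleaving coordinate bits
# (a Morton key), so tree cells can live in a hash table instead of
# a pointer-linked structure.
def morton_key(x, y, z, bits=10):
    """Interleave `bits` bits of x, y, z (each in [0, 2**bits))."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    # A leading 1 bit marks the key length, so keys of cells at
    # different tree levels never collide in the hash table.
    return key | (1 << (3 * bits))

# Truncating a key by 3 bits yields the parent cell's key, which is
# what turns tree traversal into a sequence of hash lookups.
assert morton_key(2, 3, 5) >> 3 == morton_key(1, 1, 2, bits=9)
```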
Reconfigurable computing for large-scale graph traversal algorithms
This thesis proposes a reconfigurable computing approach for supporting parallel processing in large-scale graph traversal algorithms. Our approach is based on a reconfigurable hardware architecture which exploits the capabilities of both FPGAs (Field-Programmable Gate Arrays) and a multi-bank parallel memory subsystem.
The proposed methodology to accelerate graph traversal algorithms has been applied to three case studies, revealing that application-specific hardware customisations can benefit performance. A summary of our four contributions is as follows.
First, a reconfigurable computing approach to accelerate large-scale graph traversal algorithms. We propose a reconfigurable hardware architecture which decouples computation and communication while keeping multiple memory requests in flight at any given time, taking advantage of the high bandwidth of multi-bank memory subsystems.
Second, a demonstration of the effectiveness of our approach through two case studies: the breadth-first search algorithm, and a graphlet counting algorithm from bioinformatics. Both case studies involve graph traversal, but each of them adopts a different graph data representation (a breadth-first search sketch follows after this summary).
Third, a method for using on-chip memory resources in FPGAs to reduce off-chip memory accesses for accelerating graph traversal algorithms, through a case study of the All-Pairs Shortest-Paths algorithm. This case study has been applied to process human brain network data.
Fourth, an evaluation of an approach based on instruction-set extension for FPGA design against many-core GPUs (Graphics Processing Units), based on a set of benchmarks with different memory access characteristics. It is shown that while GPUs excel at streaming applications, the proposed approach can outperform GPUs in applications with poor locality characteristics, such as graph traversal problems.
Open Access
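As a software reference point for the second contribution above, here is the level-synchronous breadth-first search such hardware accelerates (a plain sequential sketch of the standard algorithm, not the thesis's FPGA design; the example graph is invented):

```python
# Level-synchronous breadth-first search over an adjacency list.
# On the FPGA design, the frontier expansion below is what gets
# overlapped with many in-flight memory requests; here it is plain
# sequential Python.
from collections import deque

def bfs_levels(adj, source):
    """Return the BFS level (hop distance) of every reachable vertex."""
    level = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:          # each neighbour access is one memory request
            if v not in level:    # first visit fixes the vertex's level
                level[v] = level[u] + 1
                queue.append(v)
    return level

# Tiny example graph: edges 0-1, 0-2, 1-3, 2-3, 3-4
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(bfs_levels(adj, 0))  # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```

The irregular, data-dependent accesses to `adj` are exactly the poor-locality pattern the fourth contribution argues favours the FPGA approach over GPUs.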
Hardware Implementation of Deep Network Accelerators Towards Healthcare and Biomedical Applications
With the advent of dedicated Deep Learning (DL) accelerators and neuromorphic
processors, new opportunities are emerging for applying deep and Spiking Neural
Network (SNN) algorithms to healthcare and biomedical applications at the edge.
This can facilitate the advancement of the medical Internet of Things (IoT)
systems and Point of Care (PoC) devices. In this paper, we provide a tutorial
describing how various technologies, ranging from emerging memristive devices to established Field Programmable Gate Arrays (FPGAs) and mature Complementary Metal Oxide Semiconductor (CMOS) technology, can be used to develop efficient DL
accelerators to solve a wide variety of diagnostic, pattern recognition, and
signal processing problems in healthcare. Furthermore, we explore how spiking
neuromorphic processors can complement their DL counterparts for processing
biomedical signals. After providing the required background, we unify the
sparsely distributed research on neural network and neuromorphic hardware
implementations as applied to the healthcare domain. In addition, we benchmark
various hardware platforms by performing a biomedical electromyography (EMG)
signal processing task and drawing comparisons among them in terms of inference
delay and energy. Finally, we provide our analysis of the field and share a
perspective on the advantages, disadvantages, challenges, and opportunities
that different accelerators and neuromorphic processors introduce to healthcare
and biomedical domains. This paper can serve a large audience, ranging from nanoelectronics researchers to biomedical and healthcare practitioners, in grasping the fundamental interplay between hardware, algorithms, and clinical adoption of these tools, as we shed light on the future of deep networks and spiking neuromorphic processing systems as proponents for driving biomedical circuits and systems forward.
Comment: Submitted to IEEE Transactions on Biomedical Circuits and Systems (21 pages, 10 figures, 5 tables)
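For a flavour of the EMG benchmark task mentioned above, here is a toy pipeline (entirely our sketch, not the paper's benchmark; the window length, the two features, and the untrained weights are invented) that reduces each signal window to features and runs one small classifier inference per window:

```python
# Toy EMG pipeline: window the signal, extract simple features,
# classify each window with a tiny fixed-weight linear model.
import numpy as np

def windows(signal, size=200, step=100):
    """Slice a 1-D signal into overlapping windows."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]

def features(w):
    # Two classic surface-EMG features: root-mean-square amplitude
    # and zero-crossing count.
    rms = np.sqrt(np.mean(w ** 2))
    zc = np.sum(np.abs(np.diff(np.sign(w))) > 0)
    return np.array([rms, zc], dtype=float)

rng = np.random.default_rng(0)
signal = rng.normal(size=1000)   # stand-in for a recorded EMG channel
W = rng.normal(size=(2, 3))      # untrained 2-feature, 3-class weights
for w in windows(signal):
    scores = features(w) @ W     # one inference per window
    print(int(np.argmax(scores)))  # predicted gesture class
```

The per-window inference is the unit of work whose delay and energy the paper compares across hardware platforms.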
GHOST: A Graph Neural Network Accelerator using Silicon Photonics
Graph neural networks (GNNs) have emerged as a powerful approach for
modelling and learning from graph-structured data. Multiple fields have since
benefitted enormously from the capabilities of GNNs, such as recommendation
systems, social network analysis, drug discovery, and robotics. However,
accelerating and efficiently processing GNNs require a unique approach that
goes beyond conventional artificial neural network accelerators, due to the
substantial computational and memory requirements of GNNs. The slowdown of
scaling in CMOS platforms also motivates a search for alternative
implementation substrates. In this paper, we present GHOST, the first
silicon-photonic hardware accelerator for GNNs. GHOST efficiently alleviates
the costs associated with both vertex-centric and edge-centric operations. It separately implements, in the optical domain, the three main stages involved in running GNNs, allowing it to be used for the inference of various widely used GNN models and architectures, such as graph convolution networks and graph attention networks. Our simulation studies indicate that GHOST exhibits at least 10.2x better throughput and 3.8x better energy efficiency when compared to GPU, TPU, CPU, and multiple state-of-the-art GNN hardware accelerators.
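The vertex-centric/edge-centric split the abstract mentions maps onto the two halves of a graph convolution layer: an edge-centric neighbour aggregation followed by a vertex-centric dense transform. A minimal software sketch (ours, not GHOST's pipeline; the graph and weights are made up):

```python
# One simplified graph-convolution layer, split into the edge-centric
# aggregation and the vertex-centric combination that a GNN
# accelerator must handle differently.
import numpy as np

def gcn_layer(edges, X, W):
    """edges: list of (src, dst); X: node features; W: weight matrix."""
    H = X.copy()                   # include each node's own features
    for src, dst in edges:         # edge-centric: gather from neighbours
        H[dst] += X[src]
    deg = np.ones(len(X))
    for _, dst in edges:
        deg[dst] += 1.0
    H /= deg[:, None]              # mean aggregation
    return np.maximum(H @ W, 0.0)  # vertex-centric: dense matmul + ReLU

edges = [(0, 1), (1, 0), (1, 2), (2, 1)]  # 3-node path, both directions
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.array([[1.0, -1.0], [0.5, 1.0]])
print(gcn_layer(edges, X, W))
```

The aggregation loop is sparse and irregular while the combination is a dense matrix product, which is why the two demand different treatment in hardware.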