309 research outputs found
Power, Performance, and Energy Management of Heterogeneous Architectures
abstract: Modern many-core multiprocessor systems-on-chip offer tremendous power and performance
optimization opportunities by tuning thousands of potential voltage, frequency
and core configurations. Applications running on these architectures are becoming increasingly
complex. As the basic building blocks, which make up the application, change during
runtime, different configurations may become optimal with respect to power, performance
or other metrics. Identifying the optimal configuration at runtime is a daunting task due
to a large number of workloads and configurations. Therefore, there is a strong need to
evaluate the metrics of interest as a function of the supported configurations.
This thesis focuses on two different types of modern multiprocessor systems-on-chip
(SoC): mobile heterogeneous systems and the tile-based Intel Xeon Phi architecture.
For mobile heterogeneous systems, this thesis presents a novel methodology that can
accurately instrument different types of applications with specific performance monitoring
calls. These calls provide a rich set of performance statistics at a basic block level while the
application runs on the target platform. The target architecture used for this work (Odroid
XU3) is capable of running at 4940 different frequency and core combinations. With the
help of the instrumented applications, a vast amount of characterization data is collected that provides
details about performance, power and CPU state at every instrumented basic block
across 19 different types of applications. The vast amount of data collected has enabled
two runtime schemes. The first work provides a methodology to find optimal configurations
in heterogeneous architectures using classifiers and demonstrates an average increase
of 93%, 81% and 6% in performance per watt compared to the interactive, ondemand and
powersave governors, respectively. The second work, using the same data, presents a novel imitation
learning framework for dynamically controlling the type, number, and the frequencies
of active cores to achieve an average of 109% PPW improvement compared to the default
governors.
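The performance-per-watt (PPW) comparisons above can be illustrated with a minimal sketch. This is not the thesis's classifier or imitation-learning scheme: it simply computes PPW for a handful of core and frequency configurations and picks the best one by exhaustive search. The `Config` fields, the helper names, and all measurement values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """A hypothetical big.LITTLE configuration (names are assumptions)."""
    big_cores: int
    little_cores: int
    big_freq_mhz: int
    little_freq_mhz: int


def perf_per_watt(instructions: float, seconds: float, watts: float) -> float:
    """PPW, taking performance as instructions per second."""
    return (instructions / seconds) / watts


def best_config(measurements: dict) -> Config:
    """Return the configuration with the highest PPW.

    measurements maps Config -> (instructions, seconds, average watts).
    """
    return max(measurements, key=lambda c: perf_per_watt(*measurements[c]))


def improvement_percent(ppw_new: float, ppw_base: float) -> float:
    """Relative PPW improvement over a baseline, in percent."""
    return 100.0 * (ppw_new - ppw_base) / ppw_base


# Made-up measurements for one workload snippet (not data from the thesis).
measurements = {
    Config(4, 4, 2000, 1400): (1e9, 1.0, 5.0),
    Config(0, 4, 0, 1400): (1e9, 2.5, 1.0),
    Config(2, 4, 1200, 1000): (1e9, 1.6, 2.0),
}

chosen = best_config(measurements)
baseline = perf_per_watt(*measurements[Config(4, 4, 2000, 1400)])
gain = improvement_percent(perf_per_watt(*measurements[chosen]), baseline)
print(chosen, f"{gain:.0f}% PPW improvement over the baseline")
```

Exhaustive search is only practical offline for a few configurations; with thousands of combinations, a learned policy such as the ones the abstract describes replaces the `max` over measured data.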
This work also presents how to accurately profile the tile-based Intel Xeon Phi architecture
while training different types of neural networks using an open image dataset on a deep learning
framework. The data collected allows deep exploratory analysis. It also showcases how
different hardware parameters affect the performance of Xeon Phi.
Dissertation/Thesis, Masters Thesis, Engineering, 201
Acceleration of stereo-matching on multi-core CPU and GPU
This paper presents an accelerated version of a
dense stereo-correspondence algorithm for two different parallelism-enabled
architectures, multi-core CPU and GPU. The
algorithm is part of the vision system developed for a binocular
robot-head in the context of the CloPeMa research project.
This research project focuses on the conception of a new clothes
folding robot with real-time and high resolution requirements
for the vision system. The performance analysis shows that
the parallelised stereo-matching algorithm has been significantly
accelerated, maintaining 12x and 176x speed-up respectively
for multi-core CPU and GPU, compared with non-SIMD single-thread
CPU. To analyse the origin of the speed-up and gain
deeper understanding about the choice of the optimal hardware,
the algorithm was broken into key sub-tasks and the performance
was tested for four different hardware architectures.
Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications
Energy efficiency is becoming increasingly important for computing systems,
in particular for large scale HPC facilities. In this work we evaluate, from a
user perspective, the use of Dynamic Voltage and Frequency Scaling (DVFS)
techniques, assisted by the power and energy monitoring capabilities of modern
processors in order to tune applications for energy efficiency. We run selected
kernels and a full HPC application on two high-end processors widely used in
the HPC context, namely an NVIDIA K80 GPU and an Intel Haswell CPU. We evaluate
the available trade-offs between energy-to-solution and time-to-solution,
attempting a function-by-function frequency tuning. We finally estimate the
benefits obtainable running the full code on an HPC multi-GPU node, with respect
to default clock frequency governors. We instrument our code to accurately
monitor power consumption and execution time without the need for any additional
hardware, and we enable it to change CPU and GPU clock frequencies while
running. We analyze our results on the different architectures using a simple
energy-performance model, and derive a number of energy saving strategies which
can be easily adopted on recent high-end HPC systems for generic applications.
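The trade-off between energy-to-solution and time-to-solution described above can be sketched as follows. This is a minimal illustration, not the paper's energy-performance model: it assumes energy is average power multiplied by run time, and it selects the clock frequency with the lowest energy, optionally subject to a bound on slowdown relative to the fastest run. The function names and all sample values are hypothetical.

```python
def energy_to_solution(avg_power_w: float, time_s: float) -> float:
    """Energy in joules, assuming constant average power over the run."""
    return avg_power_w * time_s


def pick_frequency(samples: dict, max_slowdown: float = None) -> int:
    """Pick the frequency (MHz) that minimizes energy-to-solution.

    samples maps frequency in MHz -> (time in seconds, average power in watts).
    If max_slowdown is given, only frequencies whose run time is at most
    max_slowdown times the fastest run time are considered.
    """
    fastest = min(t for t, _ in samples.values())
    candidates = {
        f: (t, p)
        for f, (t, p) in samples.items()
        if max_slowdown is None or t <= fastest * max_slowdown
    }
    return min(
        candidates,
        key=lambda f: energy_to_solution(candidates[f][1], candidates[f][0]),
    )


# Made-up per-frequency measurements for one function (not data from the paper).
samples = {
    600: (20.0, 60.0),   # 1200 J
    800: (16.5, 70.0),   # 1155 J
    1000: (14.0, 90.0),  # 1260 J
}

print(pick_frequency(samples))                    # lowest energy overall
print(pick_frequency(samples, max_slowdown=1.1))  # at most 10% slower than fastest
```

Applying such a selection per function, rather than once for the whole run, corresponds to the function-by-function tuning the abstract mentions.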
BriskStream: Scaling Data Stream Processing on Shared-Memory Multicore Architectures
We introduce BriskStream, an in-memory data stream processing system (DSPS)
specifically designed for modern shared-memory multicore architectures.
BriskStream's key contribution is an execution plan optimization paradigm,
namely RLAS, which takes relative-location (i.e., NUMA distance) of each pair
of producer-consumer operators into consideration. We propose a
branch-and-bound-based approach with three heuristics to resolve the resulting nontrivial
optimization problem. The experimental evaluations demonstrate that BriskStream
yields much higher throughput and better scalability than existing DSPSs on
multi-core architectures when processing different types of workloads.
Comment: To appear in SIGMOD'1