Machine learning based mapping of data and streaming parallelism to multi-cores
Multi-core processors are now ubiquitous and are widely seen as the most viable means
of delivering performance with increasing transistor densities. However, this potential
can only be realised if the application programs are suitably parallel. Applications
can either be written in parallel from scratch or converted from existing sequential
programs. Regardless of how applications are parallelised, the code must be efficiently
mapped onto the underlying platform to fully exploit the hardware’s potential.
This thesis addresses the problem of finding the best mappings of data and streaming
parallelism—two types of parallelism that exist in broad and important domains
such as scientific, signal processing and media applications. Despite significant
progress having been made over the past few decades, state-of-the-art mapping approaches
still largely rely upon hand-crafted, architecture-specific heuristics. Developing
a heuristic by hand, however, often requires months of development time. As multi-core
designs become increasingly diverse and complex, manually tuning a heuristic
for a wide range of architectures is no longer feasible. What is needed are innovative
techniques that can automatically scale with advances in multi-core technologies.
In this thesis two distinct areas of computer science, namely parallel compiler design
and machine learning, are brought together to develop new compiler-based mapping
techniques. Using machine learning, it is possible to automatically build high-quality
mapping schemes that adapt to evolving architectures with little human
involvement.
First, two techniques are proposed to find the best mapping of data parallelism.
The first technique predicts whether parallel execution of a data parallel candidate is
profitable on the underlying architecture. On a typical multi-core platform, it achieves
almost the same (and sometimes a better) level of performance when compared to the
manually parallelised code developed by independent experts. For a profitable candidate,
the second technique predicts how many threads should be used to execute
the candidate across different program inputs. The second technique achieves, on average,
over 96% of the maximum available performance on two different multi-core
platforms.
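The two-stage idea above can be sketched with a toy predictor. This is a minimal illustration, not the thesis's actual model: the loop features, training data, and the k-nearest-neighbour learner are all invented stand-ins for the trained mapping heuristic.

```python
# Toy two-stage mapper: a k-NN "classifier" predicts (profitable?, threads)
# for a data-parallel loop candidate. Features, training data, and the
# learner are invented stand-ins, not the thesis's actual model.

def extract_features(loop):
    """Reduce a candidate loop to a small numeric feature vector."""
    return (loop["iterations"], loop["work_per_iter"], loop["mem_ratio"])

def knn_predict(train, query, k=3):
    """Majority vote over the k nearest labelled training loops."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    ranked = sorted(train, key=lambda ex: dist(extract_features(ex["loop"]), query))
    votes = [ex["label"] for ex in ranked[:k]]
    return max(set(votes), key=votes.count)

# Labels are (profitable?, best thread count) pairs.
training = [
    {"loop": {"iterations": 10,    "work_per_iter": 5,   "mem_ratio": 0.9}, "label": (False, 1)},
    {"loop": {"iterations": 50,    "work_per_iter": 3,   "mem_ratio": 0.8}, "label": (False, 1)},
    {"loop": {"iterations": 8000,  "work_per_iter": 150, "mem_ratio": 0.3}, "label": (True, 8)},
    {"loop": {"iterations": 10000, "work_per_iter": 200, "mem_ratio": 0.2}, "label": (True, 8)},
]

candidate = {"iterations": 9000, "work_per_iter": 180, "mem_ratio": 0.25}
profitable, threads = knn_predict(training, extract_features(candidate))
print(profitable, threads)  # -> True 8
```

In the real setting the labels would come from profiling runs across architectures, which is what lets the scheme retrain for a new platform instead of being re-derived by hand.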
Next, a new approach is developed for partitioning stream applications. This approach
predicts the ideal partitioning structure for a given stream application. Based
on the prediction, a compiler can rapidly search the program space (without executing
any code) to generate a good partition. It achieves, on average, a 1.90x speedup over
the already tuned partitioning scheme of a state-of-the-art streaming compiler.
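The prediction-guided search can be sketched as follows. A model is assumed to have already predicted the ideal partition count; the compiler then enumerates contiguous partitions of a filter pipeline and picks the most balanced one, without executing any code. The per-filter work estimates and the pipeline are invented for illustration.

```python
# Prediction-guided partitioning sketch for a stream program. A (stand-in)
# model is assumed to have predicted the ideal partition count; the compiler
# then searches candidate contiguous partitions of a filter pipeline without
# executing any code. Per-filter work estimates are invented.
from itertools import combinations

filters = [3, 1, 4, 1, 5, 9, 2]  # assumed static work estimate per filter

def partitions(work, k):
    """Yield every way to cut the pipeline into k contiguous partitions."""
    n = len(work)
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        yield [sum(work[a:b]) for a, b in zip(bounds, bounds[1:])]

def best_partition(work, predicted_k):
    """Pick the candidate whose heaviest partition (the bottleneck) is lightest."""
    return min(partitions(work, predicted_k), key=max)

print(best_partition(filters, 3))  # -> [3, 11, 11]
```

Because only a static cost estimate is consulted, the whole candidate space can be scanned in milliseconds, which is what makes the "rapid search without executing any code" claim plausible.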
GraphR: Accelerating Graph Processing Using ReRAM
This paper presents GRAPHR, the first ReRAM-based graph processing
accelerator. GRAPHR follows the principle of near-data processing and explores
the opportunity of performing massively parallel analog operations with low
hardware and energy cost. Analog computation is suitable for graph
processing because: 1) the algorithms are iterative and can inherently
tolerate imprecision; and 2) both probability calculations (e.g., PageRank and
Collaborative Filtering) and typical graph algorithms involving integers (e.g.,
BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a
vertex program of a graph algorithm can be expressed in sparse matrix vector
multiplication (SpMV), it can be performed efficiently by a ReRAM crossbar. We
show that this assumption is generally true for a large set of graph
algorithms. GRAPHR is a novel accelerator architecture consisting of two
components: memory ReRAM and graph engine (GE). The core graph computations are
performed in sparse matrix format in GEs (ReRAM crossbars). The
vector/matrix-based graph computation is not new, but ReRAM offers the unique
opportunity to realize the massive parallelism with unprecedented energy
efficiency and low hardware cost. With small subgraphs processed by GEs, the
gain of performing parallel operations overshadows the wastes due to sparsity.
The experimental results show that GRAPHR achieves a 16.01x speedup (up to
132.67x) and a 33.82x energy saving on geometric mean compared to a CPU
baseline system. Compared to a GPU, GRAPHR achieves a 1.69x to 2.19x speedup
and consumes 4.77x to 8.91x less energy. GRAPHR gains a speedup of 1.16x to
4.12x, and is 3.67x to 10.96x more energy efficient, compared to a PIM-based
architecture.
Comment: Accepted to HPCA 201
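The SpMV formulation behind GRAPHR's insight can be illustrated with a plain-Python PageRank step, where a dense matrix-vector product stands in for the analog ReRAM crossbar. The 3-vertex graph and damping factor are invented for illustration.

```python
# One PageRank iteration expressed as a matrix-vector product, the form
# GRAPHR maps onto ReRAM crossbars. This dense Python model stands in for
# the analog hardware; the 3-vertex graph and damping factor are invented.

def pagerank_step(adj, ranks, d=0.85):
    """ranks' = (1-d)/n + d * M @ ranks, with M column-normalised by out-degree."""
    n = len(ranks)
    out_deg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [(1 - d) / n +
            d * sum(adj[i][j] * ranks[j] / out_deg[j]
                    for j in range(n) if out_deg[j])
            for i in range(n)]

# Directed 3-cycle 0 -> 1 -> 2 -> 0; adj[i][j] = 1 iff there is an edge j -> i.
adj = [[0, 0, 1],
       [1, 0, 0],
       [0, 1, 0]]
ranks = [1 / 3] * 3
for _ in range(20):
    ranks = pagerank_step(adj, ranks)
print([round(r, 3) for r in ranks])  # symmetry keeps all ranks at 1/3
```

The inner sum over `j` is exactly the per-row dot product a crossbar evaluates in one analog step; the error tolerance the paper cites is what lets imprecise analog dot products stand in for this exact arithmetic.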
Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures
As many-core accelerators keep integrating more processing units, it becomes increasingly difficult for a parallel application to make effective use of all available resources. An effective way to improve hardware utilization is to exploit spatial and temporal sharing of the heterogeneous processing units by multiplexing computation and communication tasks - a strategy known as heterogeneous streaming. Achieving effective heterogeneous streaming requires carefully partitioning hardware among tasks and matching the granularity of task parallelism to the resource partition. However, finding the right resource partitioning and task granularity is extremely challenging, because there is a large number of possible solutions and the optimal solution varies across programs and datasets. This article presents an automatic approach to quickly derive a good solution for hardware resource partitioning and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration. The model is used as a utility to quickly search for a good configuration at runtime. Instead of hand-crafting an analytical model that requires expert insights into low-level hardware details, we employ machine learning techniques to learn it automatically. We achieve this by first learning a predictive model offline using training programs. The learnt model can then be used to predict the performance of any unseen program at runtime. We apply our approach to 39 representative parallel applications and evaluate it on two representative heterogeneous many-core platforms: a CPU-XeonPhi platform and a CPU-GPU platform. Compared to the single-stream version, our approach achieves, on average, a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively.
These results translate to over 93% of the performance delivered by a theoretically perfect predictor.
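The model-guided runtime search can be sketched as follows. The quadratic cost function here is an invented stand-in for the learnt predictor, and the configuration space (accelerator share, task count) is illustrative; the point is that scoring every configuration with a cheap model replaces running any of them.

```python
# Sketch of the article's runtime search: a learnt performance model scores
# each (resource partition, task granularity) configuration, and the best
# prediction is chosen without executing any code. The quadratic cost model
# below is an invented stand-in for the trained predictor.

def predicted_runtime(partition, granularity):
    """Stand-in learnt model: penalise imbalance and extreme granularity."""
    balance_penalty = (partition - 0.6) ** 2       # assumed sweet spot: 60% to accelerator
    gran_penalty = (granularity - 32) ** 2 / 1024  # assumed sweet spot: 32 tasks
    return 1.0 + balance_penalty + gran_penalty

def search_config(partitions, granularities):
    """Exhaustively score the small configuration space with the model."""
    return min(((p, g) for p in partitions for g in granularities),
               key=lambda cfg: predicted_runtime(*cfg))

best = search_config([i / 10 for i in range(1, 10)], [8, 16, 32, 64, 128])
print(best)  # -> (0.6, 32)
```

Training the predictor offline on other programs, then reusing it for unseen programs at runtime, is what keeps the per-application search cost near zero.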
Mapping parallel programs to heterogeneous CPU/GPU architectures using a Monte Carlo Tree Search
The single-core processor, which dominated for over 30 years, is now obsolete, with recent trends moving towards parallel systems and demanding a huge shift in programming techniques and practices. Moreover, we are rapidly moving towards an age where almost all programming will target parallel systems. Parallel hardware is rapidly evolving, with large heterogeneous systems, typically comprising a mixture of CPUs and GPUs, becoming the mainstream. Additionally, with this increasing heterogeneity comes increasing complexity: not only does the programmer have to worry about where and how to express the parallelism, they must also derive an efficient mapping of the application onto the available resources. This generally requires in-depth expert knowledge that most application programmers do not have. In this paper we describe a new technique that automatically derives optimal mappings for an application onto a heterogeneous architecture, using a Monte Carlo Tree Search algorithm. Our technique exploits high-level design patterns, targeting a set of well-specified parallel skeletons. We demonstrate that, on a convolution example, our MCTS obtained speedups within 5% of those achieved by a hand-tuned version of the same application.
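The MCTS-based mapping can be sketched with a heavily simplified toy: assigning each of three pipeline stages to a CPU or GPU via the usual selection, expansion, simulation, and backpropagation steps. The per-stage cost table, reward, and iteration budget are all assumptions for illustration; the paper's skeleton-based framework is far richer.

```python
import math, random

# Heavily simplified MCTS for mapping three pipeline stages to CPU or GPU.
# The cost table, reward, and iteration budget are invented for illustration.

COSTS = {"cpu": [4.0, 1.0, 5.0],   # assumed runtime of stage i on each device
         "gpu": [1.0, 3.0, 2.0]}
N = 3

def total_cost(mapping):
    return sum(COSTS[dev][i] for i, dev in enumerate(mapping))

class Node:
    def __init__(self, mapping=()):
        self.mapping = mapping
        self.children = {}
        self.visits = 0
        self.value = 0.0

def rollout(mapping):
    """Complete a partial mapping at random and score it (higher = cheaper)."""
    full = list(mapping) + [random.choice(("cpu", "gpu"))
                            for _ in range(N - len(mapping))]
    return -total_cost(full)

def mcts(root, iterations=500):
    for _ in range(iterations):
        node, path = root, [root]
        while len(node.mapping) < N:               # selection + expansion
            for dev in ("cpu", "gpu"):
                node.children.setdefault(dev, Node(node.mapping + (dev,)))
            node = max(node.children.values(),     # UCB1 pick
                       key=lambda c: float("inf") if c.visits == 0 else
                       c.value / c.visits
                       + math.sqrt(2 * math.log(node.visits + 1) / c.visits))
            path.append(node)
        reward = rollout(node.mapping)             # simulation
        for n in path:                             # backpropagation
            n.visits += 1
            n.value += reward
    best, node = [], root                          # follow most-visited path
    while node.children:
        node = max(node.children.values(), key=lambda c: c.visits)
        best = node.mapping
    return list(best)

print(mcts(Node()))
```

With a search space this small the answer could be found by enumeration; MCTS earns its keep when the mapping space (skeleton choices, device assignments, tuning knobs) is far too large to enumerate.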