An Efficient Thread Mapping Strategy for Multiprogramming on Manycore Processors
The emergence of multicore and manycore processors is set to change the
parallel computing world. Applications are shifting towards increased
parallelism in order to utilise these architectures efficiently. This leads to a situation where every application creates its desired number of threads, based on its degree of parallelism and the system resources available. Task
scheduling in such a multithreaded multiprogramming environment is a
significant challenge. In task scheduling, not only the order of execution but also the mapping of threads to the execution resources is of great importance. In this paper we state and discuss some fundamental rules based on
results obtained from selected applications of the BOTS benchmarks on the
64-core TILEPro64 processor. We demonstrate how previously efficient mapping
policies such as those of the SMP Linux scheduler become inefficient when the
number of threads and cores grows. We propose a novel, low-overhead technique,
a heuristic based on the amount of time spent by each CPU doing some useful
work, to fairly distribute the workloads amongst the cores in a
multiprogramming environment. Our novel approach could be implemented as a
pragma similar to those in the new task-based OpenMP versions, or can be
incorporated as a distributed thread mapping mechanism in future manycore
programming frameworks. We show that our thread mapping scheme can outperform
the native GNU/Linux thread scheduler in both single-programming and
multiprogramming environments.
Comment: ParCo Conference, Munich, Germany, 201
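The core idea, as the abstract describes it, is a low-overhead heuristic that balances threads across cores using the useful-work time each CPU has accumulated. A minimal sketch of one such least-loaded mapping policy (the function name `map_threads` and the greedy min-heap formulation are illustrative assumptions, not the paper's exact algorithm):

```python
import heapq

def map_threads(thread_costs, num_cores):
    """Greedy least-loaded mapping: assign each thread to the core with
    the smallest accumulated useful-work time, tracked in a min-heap of
    (busy_time, core_id) pairs. Illustrative sketch, not the paper's
    exact heuristic."""
    heap = [(0.0, core) for core in range(num_cores)]
    heapq.heapify(heap)
    mapping = {}
    for tid, cost in enumerate(thread_costs):
        busy, core = heapq.heappop(heap)      # least-loaded core so far
        mapping[tid] = core
        heapq.heappush(heap, (busy + cost, core))
    return mapping

# Four threads with estimated work 4, 3, 2, 1 on two cores:
# both cores end up with a total load of 5.
print(map_threads([4, 3, 2, 1], 2))
```

A real implementation would update the per-core busy times online from scheduler accounting rather than from a priori cost estimates.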
Design Principles for Sparse Matrix Multiplication on the GPU
We implement two novel algorithms for sparse-matrix dense-matrix
multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the
popular compressed-sparse-row (CSR) format and thus do not require expensive
format conversion. While previous SpMM work concentrates on thread-level
parallelism, we additionally focus on latency hiding with instruction-level
parallelism and load-balancing. We show, both theoretically and experimentally,
that the proposed SpMM is a better fit for the GPU than previous approaches. We
identify a key memory access pattern that allows efficient access to both input and output matrices and is crucial to achieving excellent performance on SpMM. By combining these two ingredients---(i) merge-based load-balancing and
(ii) row-major coalesced memory access---we demonstrate a 4.1x peak speedup and
a 31.7% geomean speedup over state-of-the-art SpMM implementations on
real-world datasets.
Comment: 16 pages, 7 figures, International European Conference on Parallel and Distributed Computing (Euro-Par) 201
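The paper's kernels are GPU code; as a point of reference for the CSR traversal that SpMM performs, here is a scalar Python sketch of sparse-matrix dense-matrix multiplication over the `indptr`/`indices`/`data` arrays of the CSR format (this illustrates only the data layout, not the merge-based load balancing or coalesced access that the paper contributes):

```python
import numpy as np

def spmm_csr(indptr, indices, data, B):
    """Reference SpMM over a CSR matrix: for each row, accumulate the
    scaled rows of the dense matrix B that correspond to the nonzero
    columns. Scalar sketch of the CSR traversal only."""
    m = len(indptr) - 1
    C = np.zeros((m, B.shape[1]))
    for row in range(m):
        # Nonzeros of this row live in data[indptr[row]:indptr[row+1]].
        for k in range(indptr[row], indptr[row + 1]):
            C[row] += data[k] * B[indices[k]]
    return C

# diag(1, 2) times a dense 2x2 matrix.
B = np.array([[1.0, 2.0], [3.0, 4.0]])
print(spmm_csr([0, 1, 2], [0, 1], [1.0, 2.0], B))
```

On a GPU, the challenge the paper addresses is that rows vary wildly in nonzero count, so a naive row-per-thread version of this loop load-imbalances badly; merge-based scheduling splits the flat nonzero range evenly across threads instead.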
Minimizing Energy Use of Mixed-Fleet Public Transit for Fixed-Route Service
Public transit can have significantly lower environmental impact than
personal vehicles; however, it still uses a substantial amount of energy,
causing air pollution and greenhouse gas emissions. While electric vehicles
(EVs) can reduce energy use, most public transit agencies have to employ them
in combination with conventional, internal-combustion engine vehicles due to
the high upfront costs of EVs. To make the best use of such a mixed fleet of
vehicles, transit agencies need to optimize route assignments and charging
schedules, which presents a challenging problem for large public transit
networks. We introduce a novel problem formulation to minimize fuel and
electricity use by assigning vehicles to transit trips and scheduling them for
charging while serving an existing fixed-route transit schedule. We present an
integer program for optimal discrete-time scheduling, and we propose
polynomial-time heuristic algorithms and a genetic algorithm for finding
solutions for larger networks. We evaluate our algorithms on the transit
service of a mid-size U.S. city using operational data collected from public
transit vehicles. Our results show that the proposed algorithms are scalable
and achieve near-minimum energy use.
A general framework of multi-population methods with clustering in undetectable dynamic environments
Copyright © 2011 IEEE. To solve dynamic optimization problems, multi-population methods are used to enhance population diversity, with the aim of maintaining multiple populations in different sub-areas of the fitness landscape. Many experimental studies have shown that locating and tracking multiple relatively good optima, rather than a single global optimum, is an effective approach in dynamic environments. However, several challenges need to be addressed when multi-population methods are applied, e.g., how to create multiple populations, how to maintain them in different sub-areas, and how to deal with situations where changes cannot be detected or predicted. To address these issues, this paper investigates a hierarchical clustering method to locate and track multiple optima for dynamic optimization problems. To deal with undetectable dynamic environments, this paper applies the random immigrants method without change detection, based on a mechanism that can automatically reduce redundant individuals in the search space throughout the run. These methods are implemented in several evolutionary computation approaches, including particle swarm optimization, genetic algorithms, and differential evolution. An experimental study is conducted on the moving peaks benchmark to compare the performance with several other algorithms from the literature. The experimental results show the efficiency of the clustering method for locating and tracking multiple optima, in comparison with other multi-population algorithms on the moving peaks benchmark.
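The clustering step partitions the population into sub-populations that can each track their own optimum. A minimal one-dimensional sketch of single-linkage agglomerative clustering (the distance threshold `max_dist` and the function name are illustrative assumptions; the paper uses a hierarchical clustering scheme over the full search space):

```python
def cluster_population(points, max_dist):
    """Single-linkage agglomerative clustering: repeatedly merge two
    clusters whose closest members lie within max_dist of each other.
    Each resulting cluster would seed one sub-population. 1-D sketch."""
    clusters = [[p] for p in points]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                close = any(abs(a - b) <= max_dist
                            for a in clusters[i] for b in clusters[j])
                if close:
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

# Individuals near 0, near 5, and one outlier at 9.9 form 3 clusters.
print(cluster_population([0.0, 0.1, 5.0, 5.2, 9.9], 0.5))
```

Each cluster then evolves independently (by PSO, a GA, or differential evolution, per the paper), which keeps the sub-populations anchored to different sub-areas of the landscape.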
Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform
We present Rhino, a system for accelerating tensor programs with automatic parallelization on an AI platform for real production environments. It transforms a
tensor program written for a single device into an equivalent distributed
program that is capable of scaling up to thousands of devices with no user
configuration. Rhino first operates on a semantically independent intermediate
representation of tensor programs, which facilitates its generalization to
unprecedented applications. Additionally, it implements a task-oriented
controller and a distributed runtime for optimal performance. Rhino explores a complete and systematic parallelization strategy space that comprises all the
paradigms commonly employed in deep learning (DL), in addition to strided
partitioning and pipeline parallelism on non-linear models. Aiming to
efficiently search for a near-optimal parallel execution plan, our analysis of
production clusters reveals general heuristics to speed up the strategy search.
On top of it, two optimization levels are designed to offer users flexible
trade-offs between the search time and strategy quality. Our experiments
demonstrate that Rhino can not only re-discover the expert-crafted strategies
of classic, research and production DL models, but also identify novel
parallelization strategies that surpass existing systems on novel models.
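The strategy search the abstract describes can be caricatured as enumerating parallelism degrees under a cost model. The sketch below is a deliberately tiny stand-in: it factors the device count into (data, tensor, pipeline) parallel degrees and picks the plan a hypothetical cost model `comm_cost` scores best. Rhino's actual search works over a full IR and production-derived heuristics; nothing here is its real API.

```python
from itertools import product

def search_plan(num_devices, comp_time, comm_cost):
    """Enumerate (data, tensor, pipeline) parallel degrees whose product
    equals num_devices and return the plan with the lowest modeled step
    time: compute divided across devices plus modeled communication.
    comm_cost is a caller-supplied (hypothetical) cost model."""
    best = None
    for dp, tp, pp in product(range(1, num_devices + 1), repeat=3):
        if dp * tp * pp != num_devices:
            continue
        est = comp_time / (dp * tp * pp) + comm_cost(dp, tp, pp)
        if best is None or est < best[0]:
            best = (est, (dp, tp, pp))
    return best

# Toy cost model: tensor parallelism communicates most, pipeline next.
print(search_plan(4, 100.0, lambda dp, tp, pp: 0.5 * dp + 2 * tp + pp))
```

The two optimization levels the abstract mentions would correspond to pruning this space more or less aggressively, trading search time against strategy quality.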
Automatic methods for distribution of data-parallel programs on multi-device heterogeneous platforms
This thesis deals with the problem of finding effective methods for programming and distributing data-parallel applications on heterogeneous multiprocessor systems. These systems are ubiquitous today, ranging from embedded devices with low power consumption to high-performance distributed systems. Demand for them is growing steadily, driven by the increasing number of data-intensive applications and the general growth of digital applications. Systems with multiple devices offer higher performance but unfortunately add complexity to software development. Programming heterogeneous multiprocessor systems presents several unique challenges compared to single-device systems.
The first challenge is the programmability of such systems. Despite constant innovations in programming languages and frameworks, they are still limited: they are either platform-specific, like CUDA, which supports only NVIDIA GPUs, or applied at a low level of abstraction, such as OpenCL. Application developers who design OpenCL programs must manually distribute data to the different devices and synchronize the distributed computations. These limitations have an impact on developer productivity. To reduce the programming complexity and the development time, this thesis introduces two approaches that automatically distribute and synchronize data-parallel workloads. Another challenge is multi-device hardware utilization. In contrast to single-device platforms, the application optimization process for a multi-device system is even more complicated.
Application designers need to apply not only optimization strategies specific to a single-device architecture but also focus on careful workload balancing across all the platform's processors. For the balancing problem, this thesis proposes a method based on a platform model created with machine learning techniques. Using machine learning, the thesis automatically builds a reliable platform model, which is portable and adaptable to different platform setups, with minimal manual involvement from programmers.
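Once a platform model predicts each device's throughput, the balancing itself reduces to a proportional split of the workload. A minimal sketch of that final step (the function name and the use of raw measured rates in place of a learned model are illustrative assumptions):

```python
def balance_workload(total_items, device_rates):
    """Split a data-parallel workload across devices in proportion to
    each device's predicted throughput (items/sec). Here raw measured
    rates stand in for the thesis's learned platform model."""
    total_rate = sum(device_rates)
    shares = [round(total_items * r / total_rate) for r in device_rates]
    shares[-1] += total_items - sum(shares)  # absorb rounding drift
    return shares

# A GPU 3x faster than the CPU gets 75 of 100 work items.
print(balance_workload(100, [3.0, 1.0]))
```

The thesis's contribution is in building the model that supplies these rates automatically and portably; the split itself is the easy part, which is why model quality dominates the achieved balance.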