Dynamic load balancing in parallel KD-tree k-means
One of the most influential and popular data mining methods is the k-Means algorithm for cluster analysis.
Techniques for improving the efficiency of k-Means have been
largely explored in two main directions. The amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing
issue. Three solutions have been developed and tested. Two
approaches are based on a static partitioning of the data set and a third solution incorporates a dynamic load balancing policy
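The contrast between static partitioning and dynamic load balancing can be illustrated with a small simulation. This is only a generic sketch, not the policy developed in the paper, whose details the abstract does not give. The task costs are hypothetical stand-ins for the irregular per-subtree work of a KD-Tree based iteration, and the function names are invented for illustration.

```python
import heapq

def static_makespan(costs, workers):
    """Static block partitioning: worker w gets one fixed contiguous slice."""
    n = len(costs)
    bounds = [n * w // workers for w in range(workers + 1)]
    return max(sum(costs[bounds[w]:bounds[w + 1]]) for w in range(workers))

def dynamic_makespan(costs, workers):
    """Dynamic policy: the worker that becomes idle first takes the next task."""
    finish = [0] * workers          # time at which each worker becomes idle
    heapq.heapify(finish)
    for cost in costs:
        idle_at = heapq.heappop(finish)
        heapq.heappush(finish, idle_at + cost)
    return max(finish)

# Hypothetical, deliberately irregular task costs (arbitrary time units).
costs = [8, 7, 6, 5, 1, 1, 1, 1]
print(static_makespan(costs, 2))    # 26: one worker holds all expensive tasks
print(dynamic_makespan(costs, 2))   # 15: work is spread evenly
```

With these costs the static split leaves one worker with 26 units of work and the other with 4, while the dynamic assignment reaches the ideal 15 units per worker.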
Adaptive Parallel Iterative Deepening Search
Many of the artificial intelligence techniques developed to date rely on
heuristic search through large spaces. Unfortunately, the size of these spaces
and the corresponding computational effort reduce the applicability of
otherwise novel and effective algorithms. A number of parallel and distributed
approaches to search have considerably improved the performance of the search
process. Our goal is to develop an architecture that automatically selects
parallel search strategies for optimal performance on a variety of search
problems. In this paper we describe one such architecture realized in the
Eureka system, which combines the benefits of many different approaches to
parallel heuristic search. Through empirical and theoretical analyses we
observe that features of the problem space directly affect the choice of
optimal parallel search strategy. We then employ machine learning techniques to
select the optimal parallel search strategy for a given problem space. When a
new search task is input to the system, Eureka uses features describing the
search space and the chosen architecture to automatically select the
appropriate search strategy. Eureka has been tested on a MIMD parallel
processor, a distributed network of workstations, and a single workstation
using multithreading. Results generated from fifteen puzzle problems, robot arm
motion problems, artificial search spaces, and planning problems indicate that
Eureka outperforms any of the tested strategies used exclusively for all
problem instances and is able to greatly reduce the search time for these
applications
A Massively Parallel Algorithm for the Approximate Calculation of Inverse p-th Roots of Large Sparse Matrices
We present the submatrix method, a highly parallelizable method for the
approximate calculation of inverse p-th roots of large sparse symmetric
matrices which are required in different scientific applications. We follow the
idea of Approximate Computing, allowing imprecision in the final result in
order to be able to utilize the sparsity of the input matrix and to allow
massively parallel execution. For an n x n matrix, the proposed algorithm
allows the calculations to be distributed over n nodes with little
communication overhead. The approximate result matrix exhibits the same
sparsity pattern as the input matrix, allowing for efficient reuse of allocated
data structures.
We evaluate the algorithm with respect to the error that it introduces into
calculated results, as well as its performance and scalability. We demonstrate
that the error is relatively limited for well-conditioned matrices and that
results are still valuable for error-resilient applications like
preconditioning even for ill-conditioned matrices. We discuss the execution
time and scaling of the algorithm on a theoretical level and present a
distributed implementation of the algorithm using MPI and OpenMP. We
demonstrate the scalability of this implementation by running it on a
high-performance compute cluster comprised of 1024 CPU cores, showing a speedup
of 665x compared to single-threaded execution
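A minimal sketch of the idea follows, restricted to p = 1 (the plain inverse) so that it needs only the standard library. The abstract does not spell out how the submatrices are built. The construction used here is an assumption: for column i, take the principal submatrix over the rows where column i is nonzero, solve it densely, and copy back only the column belonging to i. The helper names `solve` and `submatrix_inverse` are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def submatrix_inverse(a):
    """Approximate inverse of a sparse symmetric matrix, column by column.

    Assumes a nonzero diagonal. Each loop iteration is independent of the
    others, which is what allows one column per node.
    """
    n = len(a)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        idx = [r for r in range(n) if a[r][i] != 0]    # nonzero rows of column i
        sub = [[a[r][c] for c in idx] for r in idx]    # dense principal submatrix
        k = idx.index(i)
        unit = [1.0 if j == k else 0.0 for j in range(len(idx))]
        column = solve(sub, unit)                      # column k of inverse(sub)
        for j, r in enumerate(idx):
            result[r][i] = column[j]                   # keeps the input sparsity pattern
    return result

# Block-diagonal example, for which the approximation happens to be exact.
a = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 4.0]]
approx = submatrix_inverse(a)
```

For a general p, the dense solve would be replaced by an inverse p-th root of each submatrix, for example via an eigendecomposition.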
Evaluating kernels on Xeon Phi to accelerate Gysela application
This work describes the challenges presented by porting parts of the Gysela
code to the Intel Xeon Phi coprocessor, as well as techniques used for
optimization, vectorization and tuning that can be applied to other
applications. We evaluate the performance of some generic micro-benchmarks on Phi
versus Intel Sandy Bridge. Several interpolation kernels useful for the Gysela
application are analyzed and their performance is shown. Some memory-bound and
compute-bound kernels are accelerated by a factor of 2 on the Phi device compared
to the Sandy Bridge architecture. Nevertheless, it is hard, if not impossible, to reach a
large fraction of the peak performance on the Phi device, especially for
real-life applications such as Gysela. A collateral benefit of this optimization and
tuning work is that the execution time of Gysela (using 4D advections) has
decreased on a standard architecture such as Intel Sandy Bridge. Comment: submitted to ESAIM proceedings for CEMRACS 2014 summer school version
reviewe
Automated CNN pipeline generation for heterogeneous architectures
Heterogeneity is a vital feature in emerging processor chip design. Asymmetric multicore clusters, such as a high-performance cluster paired with a power-efficient cluster, are common in modern edge devices. One example is Intel's Alder Lake, featuring Golden Cove high-performance cores and Gracemont power-efficient cores. Chiplet-based technology allows multiple cores to be organized as multi-chip modules, thus housing a large number of cores in a processor. Interposer-based packaging has enabled High Bandwidth Memory (HBM) to be embedded on chip and has reduced the transmission latency and energy consumption of the chiplet-to-chiplet interconnect. For instance, Intel's XeHPC Ponte Vecchio package integrates a multi-chip GPU organization along with HBM modules. Since new devices feature heterogeneity at the level of cores, memory and on-chip interconnect, it has become important to steer optimization at the application level in order to leverage the new heterogeneous, high-performing and power-efficient features of the underlying computing platforms. An important high-performance application paradigm is Convolutional Neural Networks (CNN). CNNs are widely used in many practical applications. The pipelined parallel implementation of CNNs is favored for inference on edge devices. In this Licentiate thesis we present a novel scheme for automatic scheduling of CNN pipelines on heterogeneous devices. A pipeline schedule is a configuration that provides information on the depth of the pipeline, the grouping of CNN layers into pipeline stages and the mapping of pipeline stages onto computing units. We utilize simple compile-time hints, which consist of workload information for individual CNN layers and performance hints for computing units. The proposed approach provides a near-optimal solution for a throughput-maximizing pipeline. We model the problem as a design space exploration problem.
We developed a time-efficient design space navigation through heuristics extracted from the knowledge of the CNN structure and the underlying computing platform. The proposed search scheme converges faster and utilizes real-time performance measurements as fitness values. The results demonstrate that the proposed scheme converges faster and can scale when used with larger networks and computing platforms. Since the scheme utilizes online performance measurements, one of the challenges is to avoid expensive configurations during online tuning. The results demonstrate that on average, ~80% of the tested configurations are sub-optimal solutions. Another challenge is to reduce convergence time. The experiments show that the proposed approach is 35x faster than stochastic optimization algorithms. Since the design space is large and complex, we show that the proposed scheme explores only ~0.1% of the total design space in the case of large CNNs (having 50+ layers) and results in a near-optimal solution
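The notion of a pipeline schedule (depth, grouping of layers into stages, mapping of stages onto units) can be made concrete with a toy exhaustive search. This is a sketch under stated assumptions, not the heuristic scheme of the thesis: it uses a hypothetical analytical cost model (stage time = summed layer workload divided by unit speed) in place of online measurements, and it enumerates the whole design space, which is only feasible for tiny inputs. All names and numbers are illustrative.

```python
from itertools import combinations, permutations

def stage_times(workloads, cuts, units, speeds):
    """Time of each stage: summed workload of its layers over its unit's speed."""
    bounds = [0, *cuts, len(workloads)]
    return [sum(workloads[bounds[s]:bounds[s + 1]]) / speeds[units[s]]
            for s in range(len(units))]

def best_schedule(workloads, speeds):
    """Enumerate every (depth, grouping, mapping) and keep the smallest bottleneck.

    Throughput of a pipeline is limited by its slowest stage, so minimizing
    the bottleneck stage time maximizes throughput under this cost model.
    """
    n = len(workloads)
    best = None
    for depth in range(1, min(n, len(speeds)) + 1):
        for cuts in combinations(range(1, n), depth - 1):          # grouping
            for units in permutations(range(len(speeds)), depth):  # mapping
                bottleneck = max(stage_times(workloads, cuts, units, speeds))
                if best is None or bottleneck < best[0]:
                    best = (bottleneck, cuts, units)
    return best

# Hypothetical hints: per-layer workloads and relative unit speeds
# (unit 0 is twice as fast as unit 1).
workloads = [4, 2, 6, 3, 3]
speeds = [2.0, 1.0]
bottleneck, cuts, units = best_schedule(workloads, speeds)
print(bottleneck, cuts, units)
```

Even this five-layer, two-unit example has ten candidate configurations, and the count grows combinatorially with layers and units, which is why exploring only a small fraction of the space matters for large CNNs.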