A High-Throughput Solver for Marginalized Graph Kernels on GPU
We present the design and optimization of a linear solver on general-purpose GPUs for the efficient, high-throughput evaluation of the marginalized graph kernel between pairs of labeled graphs. The solver implements a preconditioned conjugate gradient (PCG) method to compute the solution to a generalized Laplacian equation associated with the tensor product of two graphs. To cope with the gap between the instruction throughput and the memory bandwidth of current-generation GPUs, our solver forms the tensor product linear system on the fly, without storing it in memory, when performing the matrix-vector products in PCG. This on-the-fly computation is accomplished by using the threads in a warp to cooperatively stream the adjacency and edge label matrices of the individual graphs in small square matrix blocks called tiles, which are then staged in registers and shared memory for later reuse. Warps across a thread block can further share tiles via shared memory to increase data reuse. We exploit the sparsity of the graphs hierarchically by storing only non-empty tiles using a coordinate format and the nonzero elements within each tile using bitmaps. In addition, we propose a new partition-based reordering algorithm that aggregates the nonzero elements of the graphs into fewer but denser tiles to improve the efficiency of the sparse format. We carry out extensive theoretical analyses of the graph tensor product primitives for tiles of various densities and evaluate their performance on synthetic and real-world datasets. Our solver delivers three to four orders of magnitude speedup over existing CPU-based solvers such as GraKeL and GraphKernels. The capability of the solver enables kernel-based learning tasks at unprecedented scales.
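The idea of applying a tensor product system without ever storing it can be illustrated with a minimal CPU sketch. This is not the authors' GPU solver: it assumes the operator is the plain Kronecker product of two small dense adjacency matrices, ignores edge labels, tiling and the generalized Laplacian terms, and uses hypothetical function names (`kron_matvec`, `kron_dense`) chosen for this example only.

```python
def kron_matvec(A, B, x):
    """Return (A ⊗ B) x without materializing the Kronecker product.

    A is n×n, B is m×m, and x has length n*m, indexed as x[j*m + l].
    Entry ((i,k),(j,l)) of A ⊗ B is A[i][j] * B[k][l], so each product
    is formed on demand from the two small factor matrices.
    """
    n, m = len(A), len(B)
    y = [0.0] * (n * m)
    for i in range(n):
        for j in range(n):
            a = A[i][j]
            if a == 0:
                continue  # skip zero entries of A (simple sparsity shortcut)
            for k in range(m):
                acc = 0.0
                for l in range(m):
                    acc += B[k][l] * x[j * m + l]
                y[i * m + k] += a * acc
    return y


def kron_dense(A, B):
    """Build A ⊗ B explicitly; used here only to check the sketch."""
    n, m = len(A), len(B)
    return [[A[i][j] * B[k][l] for j in range(n) for l in range(m)]
            for i in range(n) for k in range(m)]


# Two tiny graphs: a single edge (2 nodes) and a weighted 3-node path.
A = [[0, 1], [1, 0]]
B = [[0, 2, 0], [2, 0, 1], [0, 1, 0]]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(kron_matvec(A, B, x))  # [10.0, 14.0, 5.0, 4.0, 5.0, 2.0]
```

The on-the-fly version needs storage for only the n×n and m×m factors instead of the (nm)×(nm) product, which is the memory saving the abstract relies on inside each PCG iteration.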
Efficient CFD code implementation for the ARM-based Mont-Blanc architecture
Since 2011, the European project Mont-Blanc has been focused on enabling ARM-based technology for HPC, developing both hardware platforms and system software. The latest Mont-Blanc prototypes use system-on-chip (SoC) devices that combine a CPU and a GPU sharing a common main memory. Specific developments of parallel computing software and well-suited implementation approaches are of crucial importance for such a heterogeneous architecture in order to efficiently exploit its potential.
This paper is devoted to the optimizations carried out in the TermoFluids CFD code to run it efficiently on the Mont-Blanc system. The underlying numerical method is based on an unstructured finite-volume discretization of the Navier–Stokes equations for the numerical simulation of incompressible turbulent flows. It is implemented using a portable and modular operational approach based on a minimal set of linear algebra operations. An architecture-specific heterogeneous multilevel MPI+OpenMP+OpenCL implementation of such kernels is proposed. It includes optimizations of the storage formats, dynamic load balancing between the CPU and GPU devices, and hiding of communication overheads by overlapping computations and data transfers. A detailed performance study shows time reductions of up to on the kernels’ execution with the new heterogeneous implementation, its scalability on up to 128 Mont-Blanc nodes and the energy savings (around ) achieved with the Mont-Blanc system versus the high-end hybrid supercomputer MinoTauro.
The research leading to these results has received funding from the European Community’s Seventh Framework Programme [FP7/2007–2013] and Horizon 2020 under the Mont-Blanc Project (www.montblanc-project.eu), grant agreements no. 288777, 610402 and 671697. The work has been financially supported by the Ministerio de Ciencia e Innovación, Spain (ENE-2014-60577-R), the Russian Science Foundation, project 15-11-30039, CONICYT Becas Chile Doctorado 2012, the Juan de la Cierva postdoctoral grant (IJCI-2014-21034), and the Initial Training Network SEDITRANS (GA number: 607394), implemented within the 7th Framework Programme of the European Commission under call FP7-PEOPLE-2013-ITN. Our calculations have been performed on the resources of the Barcelona Supercomputing Center. The authors thankfully acknowledge these institutions.
Peer Reviewed. Postprint (published version).
Recent Advances in Graph Partitioning
We survey recent trends in practical algorithms for balanced graph partitioning, together with applications and future research directions.
The Bandwidth minimization problem
Master's in Quantitative Methods for Economic and Business Decision-Making (Mestrado em Métodos Quantitativos para a Decisão Económica e Empresarial)
This dissertation compares the performance of two heuristics with the solution of an exact integer linear programming model in the search for feasible solutions to the bandwidth minimization problem for sparse symmetric matrices. The heuristics considered were the Cuthill and McKee algorithm and the Node Centroid with Hill Climbing algorithm.
Both heuristics were implemented in VBA and were evaluated on their execution time in seconds and on how close the value of the feasible solutions obtained was to the value of the optimal solution or lower bound. Optimal solutions and lower bounds for the instances considered were obtained by executing the code for several instances and by solving the integer linear programming problem using the Excel add-in OpenSolver and the optimization software CPLEX. The inputs for the heuristics were matrices of dimension between 4×4 and 5580×5580, with different dispersions of nonzero elements and different starting points.
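For readers unfamiliar with the Cuthill and McKee heuristic named above, the following is a minimal Python sketch, not the dissertation's VBA implementation. It assumes the sparse symmetric matrix is given as an undirected adjacency list, starts each connected component from a minimum-degree vertex (a common choice, assumed here), and breaks ties by vertex label; the function names are illustrative only.

```python
from collections import deque


def bandwidth(adj, order):
    """Bandwidth of the matrix when rows/columns follow `order`:
    the largest |pos(u) - pos(v)| over all edges (u, v)."""
    pos = {v: i for i, v in enumerate(order)}
    return max((abs(pos[u] - pos[v]) for u in adj for v in adj[u]),
               default=0)


def cuthill_mckee(adj):
    """Breadth-first ordering that visits neighbours by increasing degree."""
    def by_degree(v):
        return (len(adj[v]), v)  # degree first, label as tie-breaker

    visited, order = set(), []
    for start in sorted(adj, key=by_degree):
        if start in visited:
            continue  # already reached through an earlier component
        visited.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in sorted(adj[u], key=by_degree):
                if w not in visited:
                    visited.add(w)
                    queue.append(w)
    return order


# A path graph 0-3-1-4-2 whose labels scatter its nonzeros off the diagonal.
adj = {0: [3], 1: [3, 4], 2: [4], 3: [0, 1], 4: [1, 2]}
print(bandwidth(adj, sorted(adj)))          # 3 with the original labelling
order = cuthill_mckee(adj)
print(order, bandwidth(adj, order))         # [0, 3, 1, 4, 2] 1
```

On this toy instance the heuristic recovers the optimal bandwidth of 1; in general it gives only a feasible solution, which is why the dissertation compares it against exact solutions and lower bounds.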
Data Mining Using the Crossing Minimization Paradigm
Our ability and capacity to generate, record and store multi-dimensional, apparently unstructured data is increasing rapidly, while the cost of data storage is going down. The recorded data is not perfect, as noise is introduced into it from different sources. Some of the basic forms of noise are incorrectly recorded values and missing values. The formal study of discovering useful hidden information in the data is called Data Mining. Because of the size and complexity of the problem, practical data mining problems are best attempted using automatic means.
Data Mining can be categorized into two types: supervised learning, or classification, and unsupervised learning, or clustering. Clustering only the records in a database (or data matrix) gives a global view of the data and is called one-way clustering. A detailed analysis, or local view, requires biclustering (also called co-clustering or two-way clustering), which involves the simultaneous clustering of the records and the attributes.
In this dissertation, a novel, fast and white-noise-tolerant data mining solution is proposed based on the Crossing Minimization (CM) paradigm; the solution works for one-way as well as two-way clustering for discovering overlapping biclusters. For decades the CM paradigm has been used for graph drawing and VLSI (Very Large Scale Integration) circuit design to reduce wire length and congestion. The utility of the proposed technique is demonstrated by comparing it with other biclustering techniques using simulated noisy data, as well as real data from Agriculture, Biology and other domains.
Two other interesting and hard problems also addressed in this dissertation are (i) the Minimum Attribute Subset Selection (MASS) problem and (ii) the Bandwidth Minimization (BWM) problem of sparse matrices. The proposed CM technique is demonstrated to provide very convincing results while attempting to solve the said problems using real public domain data.
Pakistan is the fourth largest supplier of cotton in the world. An apparent anomaly has been observed during 1989-97 between cotton yield and pesticide consumption in Pakistan, showing unexpected periods of negative correlation. By applying the indigenous CM technique for one-way clustering to real Agro-Met data (2001-2002), a possible explanation of the anomaly is presented in this thesis.