    Efficient Computation of K-Nearest Neighbor Graphs for Large High-Dimensional Data Sets on GPU Clusters

    The k-Nearest Neighbor Graph (k-NNG) and the related k-Nearest Neighbor (k-NN) methods have a wide variety of applications in areas such as bioinformatics, machine learning, data mining, clustering analysis, and pattern recognition. Our application of interest is manifold embedding. Due to the large dimensionality of the input data (on the order of 15k), techniques based on spatial subdivision, such as OBBs, k-d trees, and BSP, are not viable. The only alternative is brute-force search, which has two distinct parts. The first computes the distances between individual vectors in the corpus under a pre-defined metric. Given the distance matrix, the second selects the k nearest neighbors for each member of the query data set. This thesis presents the development and implementation of a distributed exact k-NNG construction method. The proposed method uses Graphics Processing Units (GPUs) and exploits multiple levels of parallelism in distributed computational systems. It scales across cluster sizes, with each compute node in the cluster containing multiple GPUs. The distance computation is formulated as a basic matrix multiplication and reduction operation, and the optimized CUBLAS matrix multiplication library is used for this purpose. Various distance metrics such as Euclidean, cosine, and Pearson are supported. For k-NNG construction, two different methods are presented. The first is based on an approach called batch index sorting, which builds the k-NNG with three sorting operations and uses the optimized radix sort implementation in the Thrust library for GPUs. The second is an efficient implementation of a variant of the quick select algorithm that uses the latest GPU functionalities. Overall, the batch index sorting based k-NNG method is approximately 13x faster than a distributed MATLAB implementation. The quick select algorithm itself has a 5x speedup over state-of-the-art GPU methods. This has enabled k-NNG construction on a data set containing 20 million image vectors, each of dimension 15,000, as part of a manifold embedding technique for analyzing the conformations of biomolecules.
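
The two parts of the brute-force search can be illustrated on the CPU with a minimal sketch. It is not the thesis implementation: it uses plain Python rather than CUBLAS or Thrust, and the function names `squared_euclidean_matrix` and `knn_graph` are hypothetical. The distance step relies on the identity ||q - c||^2 = ||q||^2 + ||c||^2 - 2 q.c, whose dot-product term is one entry of a matrix product; the selection step uses the standard library's `heapq.nsmallest` as a stand-in for batch index sorting or quick select.

```python
import heapq

def squared_euclidean_matrix(queries, corpus):
    """Squared Euclidean distances via ||q||^2 + ||c||^2 - 2 q.c."""
    q_norms = [sum(x * x for x in q) for q in queries]
    c_norms = [sum(x * x for x in c) for c in corpus]
    dist = []
    for i, q in enumerate(queries):
        row = []
        for j, c in enumerate(corpus):
            dot = sum(a * b for a, b in zip(q, c))  # one entry of Q @ C^T
            # Clamp at zero to absorb small negative rounding errors.
            row.append(max(q_norms[i] + c_norms[j] - 2.0 * dot, 0.0))
        dist.append(row)
    return dist

def knn_graph(points, k):
    """For each point, list the indices of its k nearest other points."""
    dist = squared_euclidean_matrix(points, points)
    graph = []
    for i, row in enumerate(dist):
        # Exclude the point itself, then keep the k smallest distances.
        candidates = ((d, j) for j, d in enumerate(row) if j != i)
        graph.append([j for _, j in heapq.nsmallest(k, candidates)])
    return graph

points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
graph = knn_graph(points, 2)
print(graph)
```

On a GPU the inner dot products would be computed as a single matrix multiplication, which is what makes the formulation attractive; the selection step only needs the k smallest entries per row, so a full sort is not strictly required.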

    Simulation of Individual-Oriented Models (Simulación de modelos orientados al individuo)

    A fish school is an organized social group with no leader. This organization is attributed to two behavior patterns: biosocial attraction and parallel orientation. Such a system can be modeled with the individual-oriented model approach, in which the behavior of each individual on its own determines the collective behavior of the group. The aim of this work is to improve the performance of the simulator through hybrid programming that exploits the options for parallel computation offered by recent multicore architectures in high-performance systems.
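
As a rough illustration of how the two behavior patterns can drive an individual-oriented simulation, the toy update below moves every fish using only per-individual rules. It is a sketch under strong assumptions, not the simulator described in the abstract: each fish reacts to the whole group rather than to nearby neighbors, and the function `step` and the parameters `attraction` and `speed` are hypothetical.

```python
import math

def step(fish, attraction=0.05, speed=1.0):
    """Advance every fish one time step; fish is a list of (x, y, heading)."""
    n = len(fish)
    cx = sum(x for x, _, _ in fish) / n
    cy = sum(y for _, y, _ in fish) / n
    # Average the headings as unit vectors to avoid angle wrap-around.
    mean_heading = math.atan2(
        sum(math.sin(h) for _, _, h in fish),
        sum(math.cos(h) for _, _, h in fish),
    )
    moved = []
    for x, y, _ in fish:
        # Parallel orientation: swim along the group's mean heading.
        # Attraction: drift a small fraction of the way toward the centroid.
        new_x = x + speed * math.cos(mean_heading) + attraction * (cx - x)
        new_y = y + speed * math.sin(mean_heading) + attraction * (cy - y)
        moved.append((new_x, new_y, mean_heading))
    return moved

school = [(0.0, 0.0, 0.0), (10.0, 0.0, math.pi / 2)]
after = step(school)
print(after)
```

Because each fish is updated independently from the same read-only state, the loop over individuals is the natural place to apply the thread-level or process-level parallelism that the abstract refers to.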

    Performance Factors in Hybrid Applications (Factores de rendimiento en aplicaciones híbridas)

    Many branches of science now rely on high-performance computing to obtain results in a relatively short time. This is mainly due to the high volume of information that must be processed and to the computational cost of the calculations involved. Processing in a distributed and parallel fashion shortens the wait for results and thus allows decisions to be made sooner. Two programming models are widely used to support this: message passing through libraries based on the MPI standard, and shared memory using OpenMP. Hybrid applications combine both models in order to exploit the specific strengths of each form of parallelism. Unfortunately, experience has shown that combining the models does not necessarily improve application behavior. An analysis of the factors that influence the performance of hybrid applications would therefore help when implementing them, and it would also be a first step toward predicting their behavior. In addition, it would offer a way to determine which application parameters to modify in order to improve performance. In this work, we propose a methodology for identifying performance factors in hybrid applications and, accordingly, identify some of the factors that influence their performance.

    Modeling Energy Consumption of High-Performance Applications on Heterogeneous Computing Platforms

    Achieving Exascale computing is one of the current leading challenges in High Performance Computing (HPC). Obtaining this next level of performance will allow more complex simulations to be run on larger datasets and offer researchers better tools for data processing and analysis. In the dawn of Big Data, the need for supercomputers will only increase. However, these systems are costly to maintain because power is expensive. Thus, a better understanding of power and energy consumption is required such that future hardware can benefit. Available power models accurately capture the relationship of power to the number of cores and the clock rate; however, the relationship between workload and power is less well understood. Thus, investigation and analysis of power measurements has been a focal point in this work, with the aim of improving the general understanding of energy consumption in the context of HPC. This dissertation investigates power and energy consumption of many different parallel applications on several hardware platforms while varying a number of execution characteristics. Multicore and manycore hardware devices are investigated in homogeneous and heterogeneous computing environments. Further, common techniques for reducing power and energy consumption are applied to each of these devices. Well-known power and performance models have been combined to form the Execution-Phase model, which may be used to quantify energy contributions based on execution phase and has been used to predict energy consumption to within 10%. However, due to limitations in the measurement procedure, a less intrusive approach is required. The Empirical Mode Decomposition (EMD) and Hilbert-Huang Transform analysis technique has been applied in innovative ways to model, analyze, and visualize power and energy measurements. EMD is widely used in other research areas, including earthquake, brain-wave, speech recognition, and sea-level rise analysis, and this is the first time it has been applied to power traces to analyze the complex interactions occurring within HPC systems. Probability distributions may be used to represent power and energy traces, thereby providing an alternative means of predicting energy consumption while retaining the fact that power is not constant over time. Further, these distributions may be used to define the cost of a workload for a given computing platform.
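
Since power is not constant over time, the energy of a run is the time integral of its power trace rather than a single power figure multiplied by the runtime. The sketch below shows that relationship using the trapezoidal rule on sampled measurements. It is an illustration only, not one of the dissertation's models; the function name `energy_joules` and the sample values are made up.

```python
def energy_joules(times_s, power_w):
    """Integrate a sampled power trace (watts) over time (seconds)."""
    if len(times_s) != len(power_w):
        raise ValueError("times and power samples must have equal length")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        # Trapezoidal rule: average power over the interval times its length.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total

# Hypothetical, unevenly sampled trace.
times = [0.0, 1.0, 2.0, 4.0]
power = [100.0, 120.0, 120.0, 80.0]
energy = energy_joules(times, power)
mean_power = energy / (times[-1] - times[0])
print(energy, mean_power)
```

Handling uneven sample spacing explicitly matters in practice, because power meters and on-chip counters rarely report at perfectly regular intervals.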

    Exploring Scheduling for On-demand File Systems and Data Management within HPC Environments

    ADEPT Runtime/Scalability Predictor in support of Adaptive Scheduling

    A job scheduler determines the order and duration of the allocation of resources, e.g. CPU, to the tasks waiting to run on a computer. Round-Robin and First-Come-First-Serve are examples of algorithms for making such resource allocation decisions. Parallel job schedulers make resource allocation decisions for applications that need multiple CPU cores, on computers consisting of many CPU cores connected by different interconnects. An adaptive parallel scheduler is a parallel scheduler that is capable of adjusting its resource allocation decisions based on the current resource usage and demand. Adaptive parallel schedulers that decide the number of CPU cores to allocate to a parallel job provide more flexibility and can improve performance significantly for both local and grid job scheduling compared to non-adaptive schedulers. A major reason why adaptive schedulers are not yet used in practice is the lack of knowledge of the scalability curves of the applications, together with the high cost of existing white-box approaches to scalability prediction. We show that a runtime and scalability prediction tool can be developed that meets three requirements: accuracy comparable to white-box methods, applicability, and robustness. Applicability means depending only on knowledge that is feasible to gain in a production environment. Robustness addresses anomalous behaviour and unreliable predictions. We present ADEPT, a speedup and runtime prediction tool that satisfies all three criteria, both for a single problem size and across different problem sizes of a parallel application. ADEPT is also capable of handling anomalies and judging the reliability of its predictions. We demonstrate these capabilities using experiments with MPI and OpenMP implementations of NAS benchmarks and seven real applications.
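
A scalability curve of the kind an adaptive scheduler needs can be derived from measured runtimes using the standard definitions speedup S(p) = T(1) / T(p) and efficiency E(p) = S(p) / p. The sketch below only computes these quantities from observations; it does not reproduce ADEPT's prediction method, and the function name `scalability` and the runtimes are hypothetical.

```python
def scalability(runtimes):
    """Map core count -> (speedup, efficiency) from measured runtimes.

    runtimes maps core count -> runtime in seconds and must include 1 core.
    """
    base = runtimes[1]
    return {
        cores: (base / t, base / (t * cores))
        for cores, t in sorted(runtimes.items())
    }

# Hypothetical measurements for one problem size.
measured = {1: 100.0, 2: 55.0, 4: 30.0, 8: 20.0}
curve = scalability(measured)
for cores, (speedup, efficiency) in curve.items():
    print(cores, round(speedup, 2), round(efficiency, 2))
```

A scheduler could use such a curve to stop adding cores once efficiency drops below a chosen threshold, which is one simple way to trade runtime against resource usage.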

    The Inter-cloud meta-scheduling

    Inter-cloud is a recently emerging approach that expands cloud elasticity. By facilitating an adaptable setting, it aims to realize scalable resource provisioning that allows a diversity of cloud user requirements to be handled efficiently. This study's contribution is the optimization of inter-cloud job execution performance using meta-scheduling concepts. This includes the development of the inter-cloud meta-scheduling (ICMS) framework, the ICMS optimal schemes and the SimIC toolkit. The ICMS model is an architectural strategy for managing and scheduling user services in virtualized, dynamically inter-linked clouds. This is achieved by the development of a model that includes a set of algorithms, namely the Service-Request, Service-Distribution, Service-Availability and Service-Allocation algorithms. Together with the optimal resource management schemes, these provide the novel functionalities of the ICMS: message exchange implements the job distribution method, VM deployment provides the VM management features, and the local resource management system handles the management of the local cloud schedulers. The resulting system offers great flexibility by facilitating a lightweight resource management methodology while handling the heterogeneity of different clouds through advanced service level agreement coordination. Experimental results show that the proposed ICMS model improves the performance of service distribution across a variety of criteria, such as service execution times, makespan, turnaround times, utilization levels and energy consumption rates, for various inter-cloud entities, e.g. users, hosts and VMs. For example, ICMS optimizes the performance of a non-meta-brokering inter-cloud by 3%, while ICMS with full optimal schemes achieves 9% optimization for the same configurations. The whole experimental platform is implemented in the inter-cloud simulation toolkit (SimIC) developed by the author, which is a discrete event simulation framework.
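
Two of the criteria named above, makespan and turnaround time, can be made concrete with a small dispatching sketch. It is not one of the ICMS algorithms and does not use SimIC: it assumes all jobs are submitted at time zero, that each cloud runs one job at a time, and that each job goes to whichever cloud becomes free first. The function name `dispatch` and the job lengths are hypothetical.

```python
import heapq

def dispatch(job_lengths, n_clouds):
    """Send each job to the earliest-free cloud; return (makespan, mean turnaround)."""
    # Min-heap of (time at which the cloud becomes free, cloud id).
    free_at = [(0.0, cloud) for cloud in range(n_clouds)]
    heapq.heapify(free_at)
    finish_times = []
    for length in job_lengths:
        start, cloud = heapq.heappop(free_at)
        finish = start + length
        finish_times.append(finish)
        heapq.heappush(free_at, (finish, cloud))
    makespan = max(finish_times)  # completion time of the last job
    # All jobs arrive at time zero, so turnaround equals finish time.
    mean_turnaround = sum(finish_times) / len(finish_times)
    return makespan, mean_turnaround

result = dispatch([4.0, 3.0, 2.0, 1.0], n_clouds=2)
print(result)
```

Comparing these two numbers across dispatching policies is the kind of measurement a discrete event simulator makes repeatable, since the same job list can be replayed under each policy.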

    Energy-aware performance engineering in high performance computing

    Advances in processor design have delivered performance improvements for decades. As physical limits are reached, however, refinements to the same basic technologies are beginning to yield diminishing returns. Unsustainable increases in energy consumption are forcing hardware manufacturers to prioritise energy efficiency in their designs. Research suggests that software modifications will be needed to exploit the resulting improvements in current and future hardware. New tools are required to capitalise on this new class of optimisation. This thesis investigates the field of energy-aware performance engineering. It begins by examining the current state of the art, which is characterised by ad-hoc techniques and a lack of standardised metrics. Work in this thesis addresses these deficiencies and lays stable foundations for others to build on. The first contribution is a set of criteria which define the properties that energy-aware optimisation metrics should exhibit. These criteria show that current metrics cannot meaningfully assess the utility of code or correctly guide its optimisation. New metrics are proposed to address these issues, and theoretical and empirical proofs of their advantages are given. This thesis then presents the Power Optimised Software Envelope (POSE) model, which allows developers to assess whether power optimisation is worth pursuing for their applications. POSE is used to study the optimisation characteristics of codes from the Mantevo mini-application suite running on a Haswell-based cluster. The results obtained show that, of these codes, TeaLeaf has the most scope for power optimisation while PathFinder has the least. Finally, POSE modelling techniques are extended to evaluate the system-wide scope for energy-aware performance optimisation. System Summary POSE allows developers to assess the scope a system has for energy-aware software optimisation independent of the code being run.