
    Scalable learning for geostatistics and speaker recognition

    With improved data acquisition methods, the amount of data being collected has increased severalfold. One of the objectives in data collection is to learn useful underlying patterns. In order to work with data at this scale, methods not only need to be effective on the underlying data, but also have to be scalable to handle larger data collections. This thesis focuses on developing scalable and effective methods targeted at different domains, geostatistics and speaker recognition in particular. Initially we focus on kernel-based learning methods and develop a GPU-based parallel framework for this class of problems. An improved numerical algorithm that utilizes the GPU parallelization to further enhance the computational performance of kernel regression is proposed. These methods are then demonstrated on problems arising in geostatistics and speaker recognition. In geostatistics, data is often collected at scattered locations, and factors such as instrument malfunction lead to missing observations. Applications often require the ability to interpolate this scattered spatiotemporal data onto a regular grid continuously over time. This problem can be formulated as a regression problem, and one of the most popular geostatistical interpolation techniques, kriging, is analogous to a standard kernel method: Gaussian process regression. Kriging is computationally expensive and needs major modifications and accelerations in order to be used practically. The GPU framework developed for kernel methods is extended to kriging, and the GPU's texture memory is further exploited for enhanced computational performance. Speaker recognition deals with the task of verifying a person's identity based on samples of his/her speech - "utterances". This thesis focuses on the text-independent setting, and three new recognition frameworks were developed for this problem. We propose a kernelized Rényi-distance-based similarity score for speaker recognition. While its performance is promising, it does not generalize well with limited training data and therefore does not compare well to state-of-the-art recognition systems. These systems compensate for the variability in the speech data due to the message, channel variability, noise and reverberation. State-of-the-art systems model each speaker as a mixture of Gaussians (GMM) and compensate for the variability (termed "nuisance"). We propose a novel discriminative framework using a latent variable technique, partial least squares (PLS), for improved recognition. The kernelized version of this algorithm is used to achieve a state-of-the-art speaker ID system that shows results competitive with the best systems reported in NIST's 2010 Speaker Recognition Evaluation.
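    The abstract's analogy between kriging and Gaussian process regression reduces, for a zero-mean field, to solving one kernel system for the prediction weights. Below is a minimal sketch of that computation, assuming a squared-exponential kernel and the Eigen linear-algebra library; the kernel choice, length scale and nugget value are illustrative assumptions, not the configuration used in the thesis, and the GPU/texture-memory acceleration is not shown.

        // Minimal Gaussian process regression (simple kriging) sketch.
        // Assumes the Eigen library; kernel and hyperparameters are illustrative.
        #include <Eigen/Dense>
        #include <cmath>
        #include <iostream>
        #include <vector>

        // Squared-exponential covariance between two 2-D sample locations.
        double kernel(const Eigen::Vector2d& a, const Eigen::Vector2d& b,
                      double lengthScale = 1.0) {
          return std::exp(-(a - b).squaredNorm() / (2.0 * lengthScale * lengthScale));
        }

        // Predict the field value at `query` from scattered observations (xs, ys).
        double predict(const std::vector<Eigen::Vector2d>& xs,
                       const Eigen::VectorXd& ys,
                       const Eigen::Vector2d& query,
                       double nugget = 1e-6) {
          const int n = static_cast<int>(xs.size());
          Eigen::MatrixXd K(n, n);   // covariance among observations
          Eigen::VectorXd k(n);      // covariance between query and observations
          for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) K(i, j) = kernel(xs[i], xs[j]);
            K(i, i) += nugget;       // nugget / numerical jitter on the diagonal
            k(i) = kernel(query, xs[i]);
          }
          // Kriging weights w = K^{-1} y via Cholesky; the prediction is k^T w.
          Eigen::VectorXd w = K.llt().solve(ys);
          return k.dot(w);
        }

        int main() {
          std::vector<Eigen::Vector2d> xs = {{0.0, 0.0}, {1.0, 0.0}, {0.0, 1.0}};
          Eigen::VectorXd ys(3);
          ys << 1.0, 2.0, 0.5;
          std::cout << predict(xs, ys, {0.5, 0.5}) << "\n";
          return 0;
        }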

    Large scale geostatistics with locally varying anisotropy

    Classical geostatistical methods are based on the hypothesis of stationarity, which makes it possible to apply repeated sampling at different locations of the spatial domain in order to obtain enough information to infer cumulative distributions. In the case of non-stationarity, anisotropy is observed in the underlying physical phenomena. This feature manifests itself as preferential directions of continuity in the phenomena, i.e. properties are more continuous in one orientation than in another. In the case of local anisotropy, each location of the domain under study presents different preferential directions of continuity. The locally varying anisotropy (LVA) approach in geostatistics allows a field of local anisotropy parameters, defined for each point of the domain, to be incorporated. With this additional input, more realistic spatial simulations can be generated, incorporating geological features such as folds, veins and faults into the computational model. Since the seminal article published by Boisvert and Deutsch (2011), to the best of the author's knowledge, no further analysis or public code improvements have been developed. This is in part because acceleration and parallelization techniques must be applied to the inner kernels of the baseline LVA codes. Long execution times are needed to generate small-scale domain simulations, making large-scale domain simulations a prohibitive task. The contributions of this thesis are the acceleration and parallelization of classical and LVA-based geostatistical simulation methods, particularly sequential simulation, which is one of the most common and computationally intensive methods in the field. This was recently remarked upon by some of the main authors in the field, Gómez-Hernández and Srivastava (2021), which underlines the relevance of this work today. Two main parallel algorithms and an optimized version of a kd-tree search implementation are presented, all of them applied to both classical and LVA-based sequential simulation implementations. The first parallel algorithm concerns the parallel simulation of different domain points, after rearranging the order of simulation while preserving the exact results of a single-thread execution (a simplified sketch of this scheme follows the abstract). The second parallel algorithm concerns the parallel search of neighbouring points in the domain, which is used to build the data dependencies for the parallel simulation of points. The optimized kd-tree search was used in each test case to reduce the computational complexity of the neighbour-search tasks. Its modified implementation reduces the number of branching instructions and introduces specialized code sections to accelerate the execution. The main focus is on multi-core architectures using OpenMP and optimization techniques applied to Fortran and C++ codes. Additionally, acceleration and parallelization techniques were also applied to auxiliary applications, such as shortest-path and variogram calculation on hybrid CPU/GPU architectures using Fortran, C++ and CUDA codes. In the latter application, an analytical and heuristic model was developed to estimate the optimal workload distribution between CPU and GPU in the hybrid context. The overall result of this work is a set of applications that will allow researchers and practitioners to dramatically accelerate the execution of their experiments and simulations, with sgsim, sisim, sgs-lva and sisim-lva being the accelerated codes presented.
Final speedup results of 11x and 50x are obtained for non-LVA codes using 16 threads, and 56x and 1822x are obtained for LVA codes using 20 threads. These tools can be combined with other geostatistical tools, in order to improve the existing landscape of open source codes that can be used in practical scenarios.
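    To illustrate the first parallel algorithm summarized above, the sketch below processes the simulation path level by level: all conditioning neighbours of a point belong to earlier levels, so each level can be simulated in parallel while reproducing the single-thread results exactly. This is a simplified OpenMP sketch; the Point layout, level construction and simulateOne kernel are placeholder assumptions, not the actual sgsim/sgs-lva code.

        // Level-scheduled sequential simulation sketch (OpenMP).
        // Points within one level have no data dependencies on each other,
        // so they can be simulated in parallel and still match the serial result.
        #include <cstdio>
        #include <vector>

        struct Point { double x, y, value; bool simulated; };

        // Placeholder for the kriging system solved at one point from its
        // already-simulated neighbours (indices into `pts`).
        double simulateOne(const std::vector<Point>& pts,
                           const std::vector<int>& neighbours) {
          double sum = 0.0;
          for (int n : neighbours) sum += pts[n].value;
          return neighbours.empty() ? 0.0 : sum / neighbours.size();
        }

        void simulateByLevels(std::vector<Point>& pts,
                              const std::vector<std::vector<int>>& levels,
                              const std::vector<std::vector<int>>& neighboursOf) {
          for (const auto& level : levels) {       // levels run strictly in order
            #pragma omp parallel for schedule(dynamic)
            for (long i = 0; i < static_cast<long>(level.size()); ++i) {
              const int p = level[i];
              // All neighbours of p were simulated in earlier levels, so the
              // value is identical to the one a single-thread run would produce.
              pts[p].value = simulateOne(pts, neighboursOf[p]);
              pts[p].simulated = true;
            }
          }
        }

        int main() {
          std::vector<Point> pts = {{0, 0, 1.0, true}, {1, 0, 2.0, true}, {0.5, 0.5, 0.0, false}};
          std::vector<std::vector<int>> levels = {{2}};            // only point 2 remains
          std::vector<std::vector<int>> neighboursOf = {{}, {}, {0, 1}};
          simulateByLevels(pts, levels, neighboursOf);
          std::printf("simulated value at point 2: %f\n", pts[2].value);
          return 0;
        }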

    Automatic Performance Optimization on Heterogeneous Computer Systems using Manycore Coprocessors

    Emerging computer architectures and advanced computing technologies, such as Intel’s Many Integrated Core (MIC) architecture and graphics processing units (GPUs), provide a promising way to exploit parallelism for high performance, scalability and low power consumption. As a result, accelerators have become a crucial part of developing supercomputers. Accelerators are usually equipped with different types of cores and memory, which confronts application developers with challenging performance goals. The added complexity has led to the development of task-based runtime systems, which allow complex computations to be expressed as task graphs and rely on scheduling algorithms to perform load balancing across all resources of the platform. Developing good scheduling algorithms, even on a single node, and analyzing them can thus have a very high impact on the performance of current HPC systems. Load-balancing strategies at different levels will be critical to use the heterogeneous hardware effectively and to reduce the impact of communication on energy and performance. Implementing efficient load-balancing algorithms able to manage heterogeneous hardware can be a challenging task, especially when combined with a parallel programming model for distributed-memory architectures. In this paper, we present several novel runtime approaches to determine the optimal data and task partition on heterogeneous platforms, targeting Intel Xeon Phi accelerated heterogeneous systems.
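    In its simplest static form, the data- and task-partitioning problem described above amounts to splitting the iteration space in proportion to the measured throughput of each device. The sketch below shows such a ratio-based split between a host and one coprocessor; it is a generic illustration under that assumption, not the runtime approach proposed in the paper.

        // Static work split between a host CPU and a coprocessor based on
        // measured relative throughput (items per second). Illustrative only.
        #include <cstdio>
        #include <utility>

        // Returns the iteration counts assigned to the host and to the device.
        std::pair<long, long> splitWork(long totalIters,
                                        double hostThroughput,
                                        double deviceThroughput) {
          const double total = hostThroughput + deviceThroughput;
          const long hostIters =
              static_cast<long>(totalIters * (hostThroughput / total));
          return {hostIters, totalIters - hostIters};
        }

        int main() {
          // e.g. the coprocessor was measured to be about 3x faster on this kernel
          auto [host, device] = splitWork(1000000, 1.0, 3.0);
          std::printf("host: %ld iterations, device: %ld iterations\n", host, device);
          return 0;
        }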

    Elastic Cloud Computing Architecture and System for Heterogeneous Spatiotemporal Computing


    Beyond 16GB: Out-of-Core Stencil Computations

    Stencil computations are a key class of applications, widely used in the scientific computing community, and a class that has particularly benefited from performance improvements on architectures with high memory bandwidth. Unfortunately, such architectures come with a limited amount of fast memory, which limits the size of the problems that can be efficiently solved. In this paper, we address this challenge by applying the well-known cache-blocking tiling technique to large-scale stencil codes implemented using the OPS domain-specific language, such as CloverLeaf 2D, CloverLeaf 3D, and OpenSBLI. We introduce a number of techniques and optimisations to help manage data resident in fast memory and minimise data movement. Evaluating our work on Intel's Knights Landing platform as well as NVIDIA P100 GPUs, we demonstrate that it is possible to solve problems 3 times larger than the on-chip memory size with at most a 15% loss in efficiency.
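    The cache-blocking tiling mentioned above restructures a stencil sweep so that each block of the grid is reused from fast memory before the sweep moves on. Below is a minimal 2-D Jacobi-style sketch with spatial tiling only; the tile sizes and 5-point stencil are illustrative choices, and real OPS-generated code additionally tiles across time steps and manages halo data.

        // Spatially tiled (cache-blocked) 5-point Jacobi sweep. Illustrative only:
        // real OPS-generated code also tiles across time steps and exchanges halos.
        #include <algorithm>
        #include <vector>

        void stencilTiled(const std::vector<double>& in, std::vector<double>& out,
                          int nx, int ny, int tileX = 64, int tileY = 64) {
          #pragma omp parallel for collapse(2) schedule(static)
          for (int jj = 1; jj < ny - 1; jj += tileY) {
            for (int ii = 1; ii < nx - 1; ii += tileX) {
              const int jEnd = std::min(jj + tileY, ny - 1);
              const int iEnd = std::min(ii + tileX, nx - 1);
              // Update one tile at a time so its data stays resident in fast memory.
              for (int j = jj; j < jEnd; ++j)
                for (int i = ii; i < iEnd; ++i)
                  out[j * nx + i] = 0.25 * (in[j * nx + i - 1] + in[j * nx + i + 1] +
                                            in[(j - 1) * nx + i] + in[(j + 1) * nx + i]);
            }
          }
        }

        int main() {
          const int nx = 256, ny = 256;
          std::vector<double> a(nx * ny, 1.0), b(nx * ny, 0.0);
          stencilTiled(a, b, nx, ny);
          return 0;
        }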

    Task-based Runtime Optimizations Towards High Performance Computing Applications

    The last decades have witnessed a rapid improvement of computational capabilities in high-performance computing (HPC) platforms thanks to hardware technology scaling. HPC architectures benefit from mainstream advances in hardware with many-core systems, deep hierarchical memory subsystems, non-uniform memory access, and an ever-increasing gap between computational power and memory bandwidth. This has necessitated continuous adaptations across the software stack to maintain high hardware utilization. In this HPC landscape of potentially million-way parallelism, task-based programming models associated with dynamic runtime systems are becoming more popular, as they foster developers’ productivity at extreme scale by abstracting the underlying hardware complexity. In this context, this dissertation highlights how a software bundle powered by a task-based programming model can address the heterogeneous workloads engendered by HPC applications, here data redistribution, geostatistical modeling and 3D unstructured mesh deformation. Data redistribution aims to reshuffle data to optimize some objective for an algorithm; the objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal of increasing efficiency and therefore reducing the time-to-solution of the algorithm. Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Meshing the deformable contour of moving 3D bodies is an expensive operation that can cause huge computational challenges in fluid-structure interaction (FSI) applications. Therefore, in this dissertation, Redistribute-PaRSEC, ExaGeoStat-PaRSEC and HiCMA-PaRSEC are proposed to efficiently tackle these HPC applications at extreme scale, and they are evaluated on multiple HPC clusters, including AMD-based, Intel-based and Arm-based CPU systems and an IBM-based multi-GPU system. This multidisciplinary work emphasizes the need for runtime systems to go beyond their primary responsibility of task scheduling on massively parallel hardware systems in order to serve next-generation scientific applications.
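    As a rough analogue of the task-graph execution model described above (not the PaRSEC API itself), the sketch below expresses a small dependent computation as OpenMP tasks with data dependences and lets the runtime schedule them; the task granularity and toy kernels are illustrative assumptions.

        // Tiny task-graph example using OpenMP task dependences: two independent
        // producer tasks feed a consumer and the runtime schedules them. This only
        // illustrates the task-based model; PaRSEC and similar runtimes expose
        // richer, distributed-memory task graphs.
        #include <cstdio>

        int main() {
          double a = 0.0, b = 0.0, c = 0.0;
          #pragma omp parallel
          #pragma omp single
          {
            #pragma omp task depend(out: a)
            a = 2.0;                        // producer task 1
            #pragma omp task depend(out: b)
            b = 3.0;                        // producer task 2 (may run concurrently)
            #pragma omp task depend(in: a, b) depend(out: c)
            c = a * b;                      // consumer runs after both producers
            #pragma omp taskwait
          }
          std::printf("c = %f\n", c);
          return 0;
        }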