64 research outputs found

    Massively Parallel Construction of Radix Tree Forests for the Efficient Sampling of Discrete Probability Distributions

    Full text link
    We compare different methods for sampling from discrete probability distributions and introduce a new algorithm which is especially efficient on massively parallel processors, such as GPUs. The scheme preserves the distribution properties of the input sequence, exposes constant time complexity on the average, and significantly lowers the average number of operations for certain distributions when sampling is performed in a parallel algorithm that requires synchronization afterwards. Avoiding load balancing issues of na\"ive approaches, a very efficient massively parallel construction algorithm for the required auxiliary data structure is complemented

    Weighted Random Sampling - Alias Tables on the GPU

    Get PDF

    High-throughput machine learning algorithms

    Get PDF
    The field of machine learning has become strongly compute driven, such that emerging research and applications require larger amounts of specialised hardware or smarter algorithms to advance beyond the state-of-the-art. This thesis develops specialised techniques and algorithms for a subset of computationally difficult machine learning problems. The applications under investigation are quantile approximation in the limited-memory data streaming setting, interpretability of decision tree ensembles, efficient sampling methods in the space of permutations, and the generation of large numbers of pseudorandom permutations. These specific applications are investigated as they represent significant bottlenecks in real-world machine learning pipelines, where improvements to throughput have significant impact on the outcomes of machine learning projects in both industry and research. To address these bottlenecks, we discuss both theoretical improvements, such as improved convergence rates, and hardware/software related improvements, such as optimised algorithm design for high throughput hardware accelerators. Some contributions include: the evaluation of bin-packing methods for efficiently scheduling small batches of dependent computations to GPU hardware execution units, numerically stable reduction operators for higher-order statistical moments, and memory bandwidth optimisation for GPU shuffling. Additionally, we apply theory of the symmetric group of permutations in reproducing kernel Hilbert spaces, resulting in improved analysis of Monte Carlo methods for Shapley value estimation and new, computationally more efficient algorithms based on kernel herding and Bayesian quadrature. We also utilise reproducing kernels over permutations to develop a novel statistical test for the hypothesis that a sample of permutations is drawn from a uniform distribution. The techniques discussed lie at the intersection of machine learning, high-performance computing, and applied mathematics. Much of the above work resulted in open source software used in real applications, including the GPUTreeShap library [38], shuffling primitives for the Thrust parallel computing library [2], extensions to the Shap package [31], and extensions to the XGBoost library [6]

    Efficient configuration space construction and optimization

    Get PDF
    The configuration space is a fundamental concept that is widely used in algorithmic robotics. Many applications in robotics, computer-aided design, and related areas can be reduced to computational problems in terms of configuration spaces. In this dissertation, we address three main computational challenges related to configuration spaces: 1) how to efficiently compute an approximate representation of high-dimensional configuration spaces; 2) how to efficiently perform geometric, proximity, and motion planning queries in high dimensional configuration spaces; and 3) how to model uncertainty in configuration spaces represented by noisy sensor data. We present new configuration space construction algorithms based on machine learning and geometric approximation techniques. These algorithms perform collision queries on many configuration samples. The collision query results are used to compute an approximate representation for the configuration space, which quickly converges to the exact configuration space. We highlight the efficiency of our algorithms for penetration depth computation and instance-based motion planning. We also present parallel GPU-based algorithms to accelerate the performance of optimization and search computations in configuration spaces. In particular, we design efficient GPU-based parallel k-nearest neighbor and parallel collision detection algorithms and use these algorithms to accelerate motion planning. In order to extend configuration space algorithms to handle noisy sensor data arising from real-world robotics applications, we model the uncertainty in the configuration space by formulating the collision probabilities for noisy data. We use these algorithms to perform reliable motion planning for the PR2 robot.Doctor of Philosoph

    LIPIcs, Volume 274, ESA 2023, Complete Volume

    Get PDF
    LIPIcs, Volume 274, ESA 2023, Complete Volum

    27th Annual European Symposium on Algorithms: ESA 2019, September 9-11, 2019, Munich/Garching, Germany

    Get PDF

    Acta Cybernetica : Volume 25. Number 2.

    Get PDF

    Model-Based High-Dimensional Pose Estimation with Application to Hand Tracking

    Get PDF
    This thesis presents novel techniques for computer vision based full-DOF human hand motion estimation. Our main contributions are: A robust skin color estimation approach; A novel resolution-independent and memory efficient representation of hand pose silhouettes, which allows us to compute area-based similarity measures in near-constant time; A set of new segmentation-based similarity measures; A new class of similarity measures that work for nearly arbitrary input modalities; A novel edge-based similarity measure that avoids any problematic thresholding or discretizations and can be computed very efficiently in Fourier space; A template hierarchy to minimize the number of similarity computations needed for finding the most likely hand pose observed; And finally, a novel image space search method, which we naturally combine with our hierarchy. Consequently, matching can efficiently be formulated as a simultaneous template tree traversal and function maximization

    LIPIcs, Volume 244, ESA 2022, Complete Volume

    Get PDF
    LIPIcs, Volume 244, ESA 2022, Complete Volum

    Towards Closing the Programmability-Efficiency Gap using Software-Defined Hardware

    Full text link
    The past decade has seen the breakdown of two important trends in the computing industry: Moore’s law, an observation that the number of transistors in a chip roughly doubles every eighteen months, and Dennard scaling, that enabled the use of these transistors within a constant power budget. This has caused a surge in domain-specific accelerators, i.e. specialized hardware that deliver significantly better energy efficiency than general-purpose processors, such as CPUs. While the performance and efficiency of such accelerators are highly desirable, the fast pace of algorithmic innovation and non-recurring engineering costs have deterred their widespread use, since they are only programmable across a narrow set of applications. This has engendered a programmability-efficiency gap across contemporary platforms. A practical solution that can close this gap is thus lucrative and is likely to engender broad impact in both academic research and the industry. This dissertation proposes such a solution with a reconfigurable Software-Defined Hardware (SDH) system that morphs parts of the hardware on-the-fly to tailor to the requirements of each application phase. This system is designed to deliver near-accelerator-level efficiency across a broad set of applications, while retaining CPU-like programmability. The dissertation first presents a fixed-function solution to accelerate sparse matrix multiplication, which forms the basis of many applications in graph analytics and scientific computing. The solution consists of a tiled hardware architecture, co-designed with the outer product algorithm for Sparse Matrix-Matrix multiplication (SpMM), that uses on-chip memory reconfiguration to accelerate each phase of the algorithm. A proof-of-concept is then presented in the form of a prototyped 40 nm Complimentary Metal-Oxide Semiconductor (CMOS) chip that demonstrates energy efficiency and performance per die area improvements of 12.6x and 17.1x over a high-end CPU, and serves as a stepping stone towards a full SDH system. The next piece of the dissertation enhances the proposed hardware with reconfigurability of the dataflow and resource sharing modes, in order to extend acceleration support to a set of common parallelizable workloads. This reconfigurability lends the system the ability to cater to discrete data access and compute patterns, such as workloads with extensive data sharing and reuse, workloads with limited reuse and streaming access patterns, among others. Moreover, this system incorporates commercial cores and a prototyped software stack for CPU-level programmability. The proposed system is evaluated on a diverse set of compute-bound and memory-bound kernels that compose applications in the domains of graph analytics, machine learning, image and language processing. The evaluation shows average performance and energy-efficiency gains of 5.0x and 18.4x over the CPU. The final part of the dissertation proposes a runtime control framework that uses low-cost monitoring of hardware performance counters to predict the next best configuration and reconfigure the hardware, upon detecting a change in phase or nature of data within the application. In comparison to prior work, this contribution targets multicore CGRAs, uses low-overhead decision tree based predictive models, and incorporates reconfiguration cost-awareness into its policies. Compared to the best-average static (non-reconfiguring) configuration, the dynamically reconfigurable system achieves a 1.6x improvement in performance-per-Watt in the Energy-Efficient mode of operation, or the same performance with 23% lower energy in the Power-Performance mode, for SpMM across a suite of real-world inputs. The proposed reconfiguration mechanism itself outperforms the state-of-the-art approach for dynamic runtime control by up to 2.9x in terms of energy-efficiency.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/169859/1/subh_1.pd
    corecore