Search CORE

53 research outputs found

Optimising Memory Management for Belief Propagation in Junction Trees using GPGPUs

Author: Bistaffa Filippo
Bombieri Nicola
Farinelli Alessandro
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2014
Field of study

Belief Propagation (BP) in Junction Trees (JT) is one of the most popular approaches to compute posteriors in Bayesian Networks (BN). Such approach has significant computational requirements that can be addressed by using highly parallel architectures (i.e., General Purpose Graphic Processing Units) to parallelise the message update phases of BP. In this paper, we propose a novel approach to parallelise BP with GPGPUs, which focuses on optimising the memory layout of the BN tables so to achieve better performance in terms of increased speedup, reduced data transfers between the host and the GPGPU, and scalability. Our empirical comparison with the state of the art approach on standard datasets confirms significant improvements in speedups (up to +594%), and scalability (as our method can operate on networks whose potential tables exceed the global memory of the GPGPU)

Catalogo dei prodotti della ricerca

Recommended from our members

Brief survey on computational solutions for Bayesian inference

Author: Alves JD
Dias J
Ferreira JF
Lobo J
Publication venue
Publication date: 01/01/2015
Field of study

In this paper, we present a brief review of research work attempting to tackle the issue of tractability in Bayesian inference, including an analysis of the applicability and trade-offs of each proposed solution. In recent years, the Bayesian approach has become increasingly popular, endowing autonomous systems with the ability to deal with uncertainty and incompleteness. However, these systems are also expected to be efficient, while Bayesian inference in general is known to be an NP-hard problem, making it paramount to develop approaches dealing with this complexity in order to allow the implementation of usable Bayesian solutions. Novel computational paradigms and also major developments in massively parallel computation technologies, such as multi-core processors, GPUs and FPGAs, provide us with an inkling of the roadmap in Bayesian computation for upcoming years

Nottingham Trent Institutional Repository (IRep)

Parallel Model Counting with CUDA: Algorithm Engineering for Efficient Hardware Utilization

Author: Fichte Johannes K.
Hecher Markus
Roland Valentin
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 27th International Conference on Principles and Practice of Constraint Programming (CP 2021)
Publication date: 01/01/2021
Field of study

Propositional model counting (MC) and its extensions as well as applications in the area of probabilistic reasoning have received renewed attention in recent years. As a result, also the need for quickly solving counting-based problems with automated solvers is critical for certain areas. In this paper, we present experiments evaluating various techniques in order to improve the performance of parallel model counting on general purpose graphics processing units (GPGPUs). Thereby, we mainly consider engineering efficient algorithms for model counting on GPGPUs that utilize the treewidth of a propositional formula by means of dynamic programming. The combination of our techniques results in the solver GPUSAT3, which is based on the programming framework Cuda that -compared to other frameworks- shows superior extensibility and driver support. When combining all findings of this work, we show that GPUSAT3 not only solves more instances of the recent Model Counting Competition 2020 (MCC 2020) than existing GPGPU-based systems, but also solves those significantly faster. A portfolio with one of the best solvers of MCC 2020 and GPUSAT3 solves 19% more instances than the former alone in less than half of the runtime

Dagstuhl Research Online Publication Server

Dynamic Scheduling for Energy Minimization in Delay-Sensitive Stream Mining

Author: Andreopoulos Y
Deligiannis N
Islam MA
Ren S
van der Schaar M
Publication venue: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Publication date: 13/08/2014
Field of study

Numerous stream mining applications, such as visual detection, online patient monitoring, and video search and retrieval, are emerging on both mobile and high-performance computing systems. These applications are subject to responsiveness (i.e., delay) constraints for user interactivity and, at the same time, must be optimized for energy efficiency. The increasingly heterogeneous power-versus-performance profile of modern hardware presents new opportunities for energy saving as well as challenges. For example, employing low-performance processing nodes can save energy but may violate delay requirements, whereas employing high-performance processing nodes can deliver a fast response but may unnecessarily waste energy. Existing scheduling algorithms balance energy versus delay assuming constant processing and power requirements throughout the execution of a stream mining task and without exploiting hardware heterogeneity. In this paper, we propose a novel framework for dynamic scheduling for energy minimization (DSE) that leverages this emerging hardware heterogeneity. By optimally determining the processing speeds for hardware executing classifiers, DSE minimizes the average energy consumption while satisfying an average delay constraint. To assess the performance of DSE, we build a face detection application based on the Viola-Jones classifier chain and conduct experimental studies via heterogeneous processor system emulation. The results show that, under the same delay requirement, DSE reduces the average energy consumption by up to 50% in comparison to conventional scheduling that does not exploit hardware heterogeneity. We also demonstrate that DSE is robust against processing node switching overhead and model inaccuracy

Crossref

UCL Discovery

Exceeding Conservative Limits: A Consolidated Analysis on Modern Hardware Margins

Author: Chatzidimitriou Athanasios
Gizopoulos Dimitris
Kestelman Adrian Cristal
Leng Jingwen
Papadimitriou George
Reddi Vijay Janapa
Salami Behzad
Unsal Osman S.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/06/2020
Field of study

Modern large-scale computing systems (data centers, supercomputers, cloud and edge setups and high-end cyber-physical systems) employ heterogeneous architectures that consist of multicore CPUs, general-purpose many-core GPUs, and programmable FPGAs. The effective utilization of these architectures poses several challenges, among which a primary one is power consumption. Voltage reduction is one of the most efficient methods to reduce power consumption of a chip. With the galloping adoption of hardware accelerators (i.e., GPUs and FPGAs) in large datacenters and other large-scale computing infrastructures, a comprehensive evaluation of the safe voltage reduction levels for each different chip can be employed for efficient reduction of the total power. We present a survey of recent studies in voltage margins reduction at the system level for modern CPUs, GPUs and FPGAs. The pessimistic voltage guardbands inserted by the silicon vendors can be exploited in all devices for significant power savings. On average, voltage reduction can reach 12% in multicore CPUs, 20% in manycore GPUs and 39% in FPGAs.Comment: Accepted for publication in IEEE Transactions on Device and Materials Reliabilit

arXiv.org e-Print Archive

UPCommons. Portal del coneixement obert de la UPC

PERFORMANCE OPTIMISATIONS FOR HETEROGENEOUS MANAGED RUNTIME SYSTEMS

Author: Papadimitriou Michail
Publication venue
Publication date: 31/12/2021
Field of study

The University of Manchester - Institutional Repository

Hybrid algorithms for efficient Cholesky decomposition and matrix inverse using multicore CPUs with GPU accelerators

Author: Macindoe GI
Publication venue: UCL (University College London)
Publication date: 28/07/2013
Field of study

The use of linear algebra routines is fundamental to many areas of computational science, yet their implementation in software still forms the main computational bottleneck in many widely used algorithms. In machine learning and computational statistics, for example, the use of Gaussian distributions is ubiquitous, and routines for calculating the Cholesky decomposition, matrix inverse and matrix determinant must often be called many thousands of times for common algorithms, such as Markov chain Monte Carlo. These linear algebra routines consume most of the total computational time of a wide range of statistical methods, and any improvements in this area will therefore greatly increase the overall efficiency of algorithms used in many scientific application areas. The importance of linear algebra algorithms is clear from the substantial effort that has been invested over the last 25 years in producing low-level software libraries such as LAPACK, which generally optimise these linear algebra routines by breaking up a large problem into smaller problems that may be computed independently. The performance of such libraries is however strongly dependent on the specific hardware available. LAPACK was originally developed for single core processors with a memory hierarchy, whereas modern day computers often consist of mixed architectures, with large numbers of parallel cores and graphics processing units (GPU) being used alongside traditional CPUs. The challenge lies in making optimal use of these different types of computing units, which generally have very different processor speeds and types of memory. In this thesis we develop novel low-level algorithms that may be generally employed in blocked linear algebra routines, which automatically optimise themselves to take full advantage of the variety of heterogeneous architectures that may be available. We present a comparison of our methods with MAGMA, the state of the art open source implementation of LAPACK designed specifically for hybrid architectures, and demonstrate up to 400% increase in speed that may be obtained using our novel algorithms, specifically when running commonly used Cholesky matrix decomposition, matrix inverse and matrix determinant routines

UCL Discovery

Simulation methodologies for mobile GPUs

Author: Kaszyk Kuba
Publication venue: The University of Edinburgh
Publication date: 17/03/2022
Field of study

GPUs critically rely on a complex system software stack comprising kernel- and user-space drivers and JIT compilers. Yet, existing GPU simulators typically abstract away details of the software stack and GPU instruction set. Partly, this is because GPU vendors rarely release sufficient information about their latest GPU products. However, this is also due to the lack of an integrated CPU-GPU simulation framework, which is complete and powerful enough to drive the complex GPU software environment. This has led to a situation where research on GPU architectures and compilers is largely based on outdated or greatly simplified architectures and software stacks, undermining the validity of the generated results. Making the situation even more dire, existing GPU simulation efforts are concentrated around desktop GPUs, making infrastructure for modelling mobile GPUs virtually non-existent, despite their surging importance in the GPU market. Still, mobile GPU designers are faced with the challenge of evaluating design alternatives involving hundreds of architectural configuration options and micro-architectural improvements under tight time-to-market constraints, to which currently employed design flows involving detailed, but slow simulations are not well suited. In this thesis we develop a full-system simulation environment for a mobile platform, which enables users to run a complete and unmodified software stack for a state-of-the-art mobile Arm CPU and Mali Bifrost GPU powered device, achieving 100\% architectural accuracy across all available toolchains. We demonstrate the capability of our GPU simulation framework through a number of case studies exploring modern, mobile GPU applications, and optimize them using functional simulation statistics, unavailable with other approaches or hardware. Furthermore, we develop a trace-based performance model, allowing architects to rapidly model GPU configurations in early design space exploration

Edinburgh Research Archive