20 research outputs found
Exploiting hybrid parallelism in the kinematic analysis of multibody systems based on group equations
Computational kinematics is a fundamental tool for the design, simulation, control, optimization and dynamic analysis of multibody systems. The analysis of complex multibody systems and the need for real-time solutions require kinematic and dynamic formulations that reduce computational cost, the selection and efficient use of the most appropriate solvers, and the exploitation of all available computer resources through parallel computing techniques. The topological approach based on group equations and natural coordinates reduces computation time compared with well-known global formulations, and it enables parallelism at different levels: simultaneous solution of equations, use of multithreaded routines, or a combination of both. This paper studies and compares these topological formulations and parallel techniques to determine which combination performs best in two applications. The first uses dedicated systems for the real-time control of small multibody systems, defined by a small number of equations and small linear systems, so shared-memory parallelism combined with linear algebra routines is analyzed on a small multicore system and on a Raspberry Pi; the control of a Stewart platform serves as the case study. The second application addresses large multibody systems in which the kinematic analysis must be performed many times during design. A simulator that allows control of the formulation, the solver, the parallel techniques and the problem size has been developed and tested on more powerful computational systems with larger multicores and a GPU. This work was supported by the Spanish MINECO, as well as European Commission FEDER funds, under grant TIN2015-66972-C5-3-
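As a structural sketch of the topological idea (ours, not the paper's formulation), each kinematic group contributes a small linear system, and independent groups at the same level of the topology can be solved concurrently. The function names and the two-group example are our own; a real implementation would use compiled linear algebra routines (BLAS/LAPACK) and OpenMP threads rather than Python threads:

```python
from concurrent.futures import ThreadPoolExecutor

def solve_small(A, b):
    # Gaussian elimination with partial pivoting for one small per-group system.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        x[k] = (M[k][n] - sum(M[k][j] * x[j] for j in range(k + 1, n))) / M[k][k]
    return x

def solve_groups_parallel(groups):
    # Groups with no mutual dependency can be dispatched to separate workers.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda g: solve_small(*g), groups))
```

Note that CPython threads illustrate only the structure of the group-level parallelism; the speedups reported in the paper come from native multithreaded solvers.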
Reliable Linear, Sesquilinear and Bijective Operations On Integer Data Streams Via Numerical Entanglement
A new technique is proposed for fault-tolerant linear, sesquilinear and bijective (LSB) operations on integer data streams, such as: scaling, additions/subtractions, inner or outer vector products, permutations and convolutions. In the proposed method, the input integer data streams are linearly superimposed to form numerically entangled integer data streams that are stored in place of the original inputs. A series of LSB operations can then be performed directly on these entangled data streams. The results are extracted from the entangled output streams by additions and arithmetic shifts. Any soft error affecting any single disentangled output stream is guaranteed to be detectable via a specific post-computation reliability check. In addition, when utilizing a separate processor core for each of the streams, the proposed approach can recover all outputs after any single fail-stop failure. Importantly, unlike algorithm-based fault tolerance (ABFT) methods, the number of operations required for the entanglement, extraction and validation of the results is linearly related to the number of inputs and does not depend on the complexity of the performed LSB operations. We have validated our proposal on an Intel processor (Haswell architecture with AVX2 support) via fast Fourier transforms, circular convolutions, and matrix multiplication operations. Our analysis and experiments reveal that the proposed approach incurs between 0.03% and 7% reduction in processing throughput for a wide variety of LSB operations. This overhead is 5 to 1000 times smaller than that of the equivalent ABFT method that uses a checksum stream. Thus, our proposal can be used in fault-generating processor hardware or safety-critical applications, where high reliability is required without the cost of ABFT or modular redundancy. Comment: to appear in IEEE Trans. on Signal Processing, 201
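As a rough illustration of the linear-superposition idea only (not the authors' exact entanglement scheme, which pairs complementary entangled streams to obtain the detection and recovery guarantees), two bounded non-negative integer streams can be packed into one stream, processed by a linear operation, and separated again by arithmetic shifts. The names `entangle`/`disentangle` and the bound assumption are ours:

```python
def entangle(a, b, k):
    # Superimpose two integer streams: e[i] = a[i] + (b[i] << k).
    return [x + (y << k) for x, y in zip(a, b)]

def disentangle(e, k):
    # Recover both streams by shifts, assuming every value stays in [0, 2**k).
    b = [v >> k for v in e]
    a = [v - (bv << k) for v, bv in zip(e, b)]
    return a, b
```

Because the packing is linear, any linear operation (here, scaling by 3) applied to the entangled stream acts on both hidden streams at once, which is the property the real method exploits.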
A Hybrid Multi-GPU Implementation of Simplex Algorithm with CPU Collaboration
The simplex algorithm has been used successfully for many years to solve linear programming (LP) problems. Due to the intensive computations required (especially for large LP problems), parallel approaches have also been studied extensively. The computational power of modern GPUs, together with the rapid development of multicore CPU systems, has made the OpenMP and CUDA programming models top choices in recent years. However, efficient collaboration between CPU and GPU through the combined use of these programming models is still considered a hard research problem. In this context, we demonstrate a highly efficient implementation of the standard simplex method, aiming at the best possible exploitation of all the computing resources used concurrently on a multicore platform with multiple CUDA-enabled GPUs. More concretely, we present a novel hybrid collaboration scheme based on the concurrent execution of suitably distributed CPU-assigned (via multithreading) and GPU-offloaded computations. The experimental results, obtained through the cooperative use of OpenMP and CUDA on a notably powerful modern hybrid platform (32 cores and two high-spec GPUs, a Titan RTX and an RTX 2080 Ti), show that the performance of the hybrid GPU/CPU collaboration scheme presented here is clearly superior to the GPU-only implementation under almost all conditions. The corresponding measurements validate the value of using all resources concurrently, even on a multi-GPU platform. Furthermore, the given implementations are fully comparable (and slightly superior in most cases) to other related attempts in the literature, and clearly superior to the native CPU implementation with 32 cores. Comment: 12 page
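For reference, the standard dense tableau-form simplex method that the hybrid scheme parallelizes can be sketched sequentially. This toy Python version is our own illustration of the per-iteration structure (pricing, ratio test, pivot row update), not the paper's CPU/GPU implementation; it assumes maximization with constraints Ax <= b, b >= 0:

```python
def simplex(c, A, b):
    # Maximize c.x subject to A x <= b, x >= 0 (b >= 0 assumed).
    m, n = len(A), len(c)
    # Tableau: constraint rows with slack columns, then the objective row.
    T = [A[i][:] + [1 if j == i else 0 for j in range(m)] + [b[i]] for i in range(m)]
    T.append([-cj for cj in c] + [0] * m + [0])
    while True:
        # Pricing: entering variable with the most negative reduced cost.
        col = min(range(n + m), key=lambda j: T[-1][j])
        if T[-1][col] >= 0:
            return T[-1][-1]  # optimal objective value
        # Ratio test: leaving variable.
        ratios = [(T[i][-1] / T[i][col], i) for i in range(m) if T[i][col] > 1e-9]
        if not ratios:
            raise ValueError("problem is unbounded")
        _, row = min(ratios)
        # Pivot: normalize the pivot row, then eliminate the column elsewhere.
        p = T[row][col]
        T[row] = [v / p for v in T[row]]
        for i in range(m + 1):
            if i != row and abs(T[i][col]) > 1e-12:
                f = T[i][col]
                T[i] = [v - f * pv for v, pv in zip(T[i], T[row])]
```

The pivot step, which updates every tableau entry each iteration, is the dominant cost and the natural candidate for splitting rows between CPU threads and GPU kernels.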
Parallel Model Counting with CUDA: Algorithm Engineering for Efficient Hardware Utilization
Propositional model counting (MC), its extensions, and its applications in the area of probabilistic reasoning have received renewed attention in recent years. As a result, the need to solve counting-based problems quickly with automated solvers has become critical in certain areas. In this paper, we present experiments evaluating various techniques to improve the performance of parallel model counting on general-purpose graphics processing units (GPGPUs). We mainly consider engineering efficient algorithms for model counting on GPGPUs that exploit the treewidth of a propositional formula by means of dynamic programming. The combination of our techniques results in the solver GPUSAT3, which is based on the CUDA programming framework, which, compared to other frameworks, shows superior extensibility and driver support. Combining all findings of this work, we show that GPUSAT3 not only solves more instances of the recent Model Counting Competition 2020 (MCC 2020) than existing GPGPU-based systems, but also solves them significantly faster. A portfolio combining one of the best solvers of MCC 2020 with GPUSAT3 solves 19% more instances than the former alone in less than half of the runtime
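The problem being solved can be stated concretely with a brute-force counter over DIMACS-style clauses (positive/negative integers for literals). This is only the naive exponential baseline, not the treewidth-based dynamic programming that GPUSAT3 engineers, and the function name is ours:

```python
from itertools import product

def count_models(clauses, n_vars):
    # Brute-force #SAT: count assignments that satisfy every clause.
    # A literal l is satisfied when variable |l| has truth value (l > 0).
    count = 0
    for assign in product([False, True], repeat=n_vars):
        if all(any(assign[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            count += 1
    return count
```

The treewidth-based solvers avoid this 2^n enumeration by doing dynamic programming over a tree decomposition, whose tables are what GPUSAT3 computes in parallel on the GPU.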
Performance evaluation of clusters with rCUDA using the HPL application
Master's Thesis in Intelligent Systems. Code: SIU043. Academic year 2013-2014. This document describes the project carried out for the course SIU043 (Master's Final Project). The work was conducted in the High Performance Computing and Architectures research group of the Departamento de Ingeniería y Ciencia de los Computadores at Universitat Jaume I, under the supervision of Rafael Mayo Gual. The project focused on evaluating the performance of the rCUDA software using the Linpack Benchmark application. This software allows a CUDA application to run on a node without any GPU installed, by using, over the interconnection network, a GPU installed in another node as if it were local. The goal of this work is to provide rCUDA with the functionality needed to run this test and then analyze the performance obtained. This performance is to be compared with running the same test on a node using CUDA
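The Linpack Benchmark (HPL) solves a dense n-by-n linear system and reports performance using the conventional operation count 2/3·n³ + 2·n² floating-point operations, so a measured run time converts to GFLOPS as in this small helper (the function name is ours):

```python
def hpl_gflops(n, seconds):
    # Conventional HPL flop count: LU factorization (2/3 n^3) plus the
    # triangular solves (2 n^2), divided by elapsed time, in GFLOPS.
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / seconds / 1e9
```

This is the figure that would be compared between the rCUDA (remote GPU) and native CUDA (local GPU) runs.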