20 research outputs found

    Exploiting hybrid parallelism in the kinematic analysis of multibody systems based on group equations

    Computational kinematics is a fundamental tool for the design, simulation, control, optimization and dynamic analysis of multibody systems. The analysis of complex multibody systems and the need for real-time solutions require kinematic and dynamic formulations that reduce computational cost, the selection and efficient use of the most appropriate solvers, and the exploitation of all available computer resources through parallel computing techniques. The topological approach based on group equations and natural coordinates reduces the computation time in comparison with well-known global formulations and enables parallelism at different levels: simultaneous solution of equations, use of multithreaded routines, or a combination of both. This paper studies and compares these topological formulations and parallel techniques to ascertain which combination performs better in two applications. The first application uses dedicated systems for the real-time control of small multibody systems, defined by a small number of equations and small linear systems, so shared-memory parallelism in combination with linear algebra routines is analyzed on a small multicore and on a Raspberry Pi. The control of a Stewart platform is used as a case study. The second application studies large multibody systems in which the kinematic analysis must be performed several times during design. A simulator that allows control of the formulation, the solver, the parallel techniques and the size of the problem has been developed and tested on more powerful computational systems with larger multicores and GPUs. This work was supported by the Spanish MINECO, as well as European Commission FEDER funds, under grant TIN2015-66972-C5-3-

    Reliable Linear, Sesquilinear and Bijective Operations On Integer Data Streams Via Numerical Entanglement

    A new technique is proposed for fault-tolerant linear, sesquilinear and bijective (LSB) operations on M integer data streams (M ≥ 3), such as: scaling, additions/subtractions, inner or outer vector products, permutations and convolutions. In the proposed method, the M input integer data streams are linearly superimposed to form M numerically-entangled integer data streams that are stored in place of the original inputs. A series of LSB operations can then be performed directly using these entangled data streams. The results are extracted from the M entangled output streams by additions and arithmetic shifts. Any soft errors affecting any single disentangled output stream are guaranteed to be detectable via a specific post-computation reliability check. In addition, when utilizing a separate processor core for each of the M streams, the proposed approach can recover all outputs after any single fail-stop failure. Importantly, unlike algorithm-based fault tolerance (ABFT) methods, the number of operations required for the entanglement, extraction and validation of the results is linearly related to the number of the inputs and does not depend on the complexity of the performed LSB operations. We have validated our proposal on an Intel processor (Haswell architecture with AVX2 support) via fast Fourier transforms, circular convolutions, and matrix multiplication operations. Our analysis and experiments reveal that the proposed approach incurs between 0.03% and 7% reduction in processing throughput for a wide variety of LSB operations. This overhead is 5 to 1000 times smaller than that of the equivalent ABFT method that uses a checksum stream. Thus, our proposal can be used in fault-generating processor hardware or safety-critical applications, where high reliability is required without the cost of ABFT or modular redundancy. Comment: to appear in IEEE Trans. on Signal Processing, 201
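For orientation, the checksum-stream ABFT baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of the general checksum idea for a linear operation (here an integer circular convolution), not the numerical entanglement method itself and not the authors' code. The function names and the `fault` parameter are hypothetical and exist only for this example.

```python
def circular_convolve(x, h):
    """Integer circular convolution of two equal-length sequences."""
    n = len(x)
    return [sum(x[j] * h[(i - j) % n] for j in range(n)) for i in range(n)]


def checksum_protected_convolve(streams, kernel, fault=None):
    """Convolve every stream plus one extra checksum stream, then verify.

    `fault` is a hypothetical (stream_index, sample_index, delta) tuple,
    used only to inject a soft error into one output for demonstration.
    """
    # Checksum stream: element-wise sum of all input streams.
    checksum_in = [sum(samples) for samples in zip(*streams)]

    outputs = [circular_convolve(s, kernel) for s in streams]
    checksum_out = circular_convolve(checksum_in, kernel)

    if fault is not None:
        k, i, delta = fault
        outputs[k][i] += delta  # simulated soft error

    # Linearity: the sum of the outputs must equal the checksum output.
    ok = [sum(samples) for samples in zip(*outputs)] == checksum_out
    return outputs, ok


streams = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
kernel = [1, 0, -1, 2]
_, clean_ok = checksum_protected_convolve(streams, kernel)
_, faulty_ok = checksum_protected_convolve(streams, kernel, fault=(1, 2, 5))
```

In this sketch the check costs one additional full convolution on the checksum stream, so the overhead of the baseline grows with the cost of the protected operation. That is the dependence the abstract says the entanglement approach avoids.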

    A Combined MPI-CUDA Parallel Solution of Linear and Nonlinear Poisson-Boltzmann Equation

    A Hybrid Multi-GPU Implementation of Simplex Algorithm with CPU Collaboration

    The simplex algorithm has been used successfully for many years to solve linear programming (LP) problems. Because of the intensive computations required (especially for large LP problems), parallel approaches have also been studied extensively. The computational power of modern GPUs and the rapid development of multicore CPU systems have made the OpenMP and CUDA programming models the preferred choices in recent years. However, efficient collaboration between CPU and GPU through the combined use of these programming models is still considered a hard research problem. In this context, we demonstrate a highly efficient implementation of the standard simplex algorithm, aiming at the best possible exploitation of all computing resources concurrently, on a multicore platform with multiple CUDA-enabled GPUs. More concretely, we present a novel hybrid collaboration scheme based on the concurrent execution of suitably distributed CPU-assigned (via multithreading) and GPU-offloaded computations. The experimental results, obtained through the cooperative use of OpenMP and CUDA on a powerful modern hybrid platform (consisting of 32 cores and two high-spec GPUs, Titan RTX and RTX 2080 Ti), show that the performance of the hybrid GPU/CPU collaboration scheme presented here is clearly superior to the GPU-only implementation under almost all conditions. The corresponding measurements validate the value of using all resources concurrently, even on a multi-GPU platform. Furthermore, the given implementations are comparable (and slightly superior in most cases) to other related attempts in the literature, and clearly superior to the native CPU implementation with 32 cores. Comment: 12 page

    Reliable Linear, Sesquilinear, and Bijective Operations on Integer Data Streams Via Numerical Entanglement

    A new technique is proposed for fault-tolerant linear, sesquilinear and bijective (LSB) operations on M integer data streams (M ≥ 3), such as: scaling, additions/subtractions, inner or outer vector products, permutations and convolutions. In the proposed method, M input integer data streams are linearly superimposed to form M numerically-entangled integer data streams that are stored in place of the original inputs. LSB operations can then be performed directly using these entangled data streams. The results are extracted from the M entangled output streams by additions and arithmetic shifts. Any soft errors affecting one disentangled output stream are guaranteed to be detectable via a post-computation reliability check. Additionally, when utilizing a separate processor core for each stream, our approach can recover all outputs after any single fail-stop failure. Importantly, unlike algorithm-based fault tolerance (ABFT) methods, the number of operations required for the entire process is linearly related to the number of inputs and does not depend on the complexity of the performed LSB operations. We have validated our proposal on an Intel processor via several types of operations: fast Fourier transforms, convolutions, and matrix multiplication operations. Our analysis and experiments reveal that the proposed approach incurs between 0.03% and 7% reduction in processing throughput for numerous LSB operations. This overhead is 5 to 1000 times smaller than that of the equivalent ABFT method that uses a checksum stream. Thus, our proposal can be used in fault-generating processor hardware or safety-critical applications, where high reliability is required without the cost of ABFT or modular redundancy.

    Parallel Model Counting with CUDA: Algorithm Engineering for Efficient Hardware Utilization

    Propositional model counting (MC), its extensions, and its applications in probabilistic reasoning have received renewed attention in recent years. As a result, quickly solving counting-based problems with automated solvers has become critical in certain areas. In this paper, we present experiments evaluating various techniques for improving the performance of parallel model counting on general-purpose graphics processing units (GPGPUs). We mainly consider engineering efficient algorithms for model counting on GPGPUs that exploit the treewidth of a propositional formula by means of dynamic programming. The combination of our techniques results in the solver GPUSAT3, which is based on the programming framework CUDA, which, compared to other frameworks, shows superior extensibility and driver support. Combining all findings of this work, we show that GPUSAT3 not only solves more instances of the recent Model Counting Competition 2020 (MCC 2020) than existing GPGPU-based systems, but also solves them significantly faster. A portfolio with one of the best solvers of MCC 2020 and GPUSAT3 solves 19% more instances than the former alone in less than half of the runtime.
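As a point of reference for what "model counting" computes, the following minimal sketch counts the satisfying assignments of a CNF formula by exhaustive enumeration. It is an illustrative assumption, not the treewidth-based dynamic programming used by GPUSAT3, and it takes time exponential in the number of variables. The function name and the DIMACS-style literal encoding (positive integer for a variable, negative for its negation) are choices made for this example.

```python
from itertools import product


def count_models(num_vars, clauses):
    """Count satisfying assignments of a CNF formula by brute force.

    `clauses` is a list of clauses; each clause is a list of non-zero
    integers, where literal k means variable k is true and -k means
    variable k is false (variables are numbered from 1).
    """
    count = 0
    for assignment in product([False, True], repeat=num_vars):
        # A clause is satisfied if at least one of its literals holds.
        if all(
            any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        ):
            count += 1
    return count


# (x1 OR x2) AND (NOT x1 OR x3) over three variables
example_count = count_models(3, [[1, 2], [-1, 3]])
```

For the example formula, fixing x1 true forces x3 true and leaves x2 free, while fixing x1 false forces x2 true and leaves x3 free, giving four models in total.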

    Performance evaluation of clusters using rCUDA with the HPL application (Evaluación de prestaciones mediante la aplicación HPL de clusters utilizando rCUDA)

    Master's thesis in Intelligent Systems. Code: SIU043. Academic year 2013-2014. This document describes the project carried out for the course SIU043 - Master's Thesis. The work was carried out in the High Performance Computing and Architectures research group of the Departamento de Ingeniería y Ciencia de los Computadores at Universitat Jaume I, under the supervision of Rafael Mayo Gual. The project focused on evaluating the performance of the rCUDA software using the Linpack Benchmark application. rCUDA allows a CUDA application to run on a node that has no GPU installed by using, over the interconnection network, a GPU installed in another node as if it were local. The goal of this work is to give rCUDA the functionality needed to run this test and then to analyze the performance obtained. This performance is to be compared with that of the same test run on a node using CUDA.