    On Designing Multicore-aware Simulators for Biological Systems

    The stochastic simulation of biological systems is an increasingly popular technique in bioinformatics. It is often an enlightening technique, which may however turn out to be computationally expensive. We discuss the main opportunities to speed it up on multi-core platforms, which pose new challenges for parallelisation techniques. These opportunities are developed in two general families of solutions involving both the single simulation and a bulk of independent simulations (either replicas or derived from a parameter sweep). The proposed solutions are tested on the parallelisation of the CWC simulator (Calculus of Wrapped Compartments), carried out by way of the FastFlow programming framework, which makes possible fast development and efficient execution on multi-cores. Comment: 19 pages + cover pag
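
    The "bulk of independent simulations" case can be illustrated with a minimal Python sketch. It is not the CWC simulator and does not use FastFlow: the decay model, the function names `simulate` and `run_replicas`, and all parameter values are assumptions chosen only to show replicas with distinct seeds being farmed out to worker processes.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def simulate(seed, n0=100, k=1.0, t_end=1.0):
    """One stochastic replica of a pure decay process A -> (nothing).

    Hypothetical toy model: each of the n molecules decays at rate k,
    so the waiting time to the next event is exponential with rate k * n.
    Returns the number of molecules left at time t_end.
    """
    rng = random.Random(seed)  # private generator, so replicas are independent
    n, t = n0, 0.0
    while n > 0:
        t += rng.expovariate(k * n)
        if t > t_end:
            break
        n -= 1
    return n


def run_replicas(seeds, workers=2):
    """Run one replica per seed on a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate, seeds))
```

    A call such as `run_replicas(range(8), workers=4)`, placed under an `if __name__ == "__main__":` guard, returns one result per seed. Since the replicas share no state, the same pattern would apply to a parameter sweep by mapping over parameter values instead of seeds.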

    Toward optimised skeletons for heterogeneous parallel architecture with performance cost model

    High performance architectures are increasingly heterogeneous with shared and distributed memory components, and accelerators like GPUs. Programming such architectures is complicated and performance portability is a major issue as the architectures evolve. This thesis explores the potential for algorithmic skeletons integrating a dynamically parametrised static cost model, to deliver portable performance for mostly regular data parallel programs on heterogeneous architectures. The first contribution of this thesis is to address the challenges of programming heterogeneous architectures by providing two skeleton-based programming libraries: i.e. HWSkel for heterogeneous multicore clusters and GPU-HWSkel that enables GPUs to be exploited as general purpose multi-processor devices. Both libraries provide heterogeneous data parallel algorithmic skeletons including hMap, hMapAll, hReduce, hMapReduce, and hMapReduceAll. The second contribution is the development of cost models for workload distribution. First, we construct an architectural cost model (CM1) to optimise overall processing time for HWSkel heterogeneous skeletons on a heterogeneous system composed of networks of arbitrary numbers of nodes, each with an arbitrary number of cores sharing arbitrary amounts of memory. The cost model characterises the components of the architecture by the number of cores, clock speed, and crucially the size of the L2 cache. Second, we extend the HWSkel cost model (CM1) to account for GPU performance. The extended cost model (CM2) is used in the GPU-HWSkel library to automatically find a good distribution for both a single heterogeneous multicore/GPU node, and clusters of heterogeneous multicore/GPU nodes. Experiments are carried out on three heterogeneous multicore clusters, four heterogeneous multicore/GPU clusters, and three single heterogeneous multicore/GPU nodes. The results of experimental evaluations for four data parallel benchmarks, i.e. sumEuler, Image matching, Fibonacci, and Matrix Multiplication, show that our combined heterogeneous skeletons and cost models can make good use of resources in heterogeneous systems. Moreover, using cores together with a GPU in the same host can deliver good performance either on a single node or on multiple node architectures
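
    The idea of cost-model-driven workload distribution can be sketched in a few lines of Python. This is not the thesis's CM1 or CM2 model, whose formulas are not given in the abstract: the weight `cores * clock_ghz` is an assumed stand-in that ignores the L2 cache term, and the function name `distribute` is hypothetical.

```python
def distribute(n_items, nodes):
    """Split n_items across nodes in proportion to an assumed node weight.

    nodes is a list of (cores, clock_ghz) pairs. The weight cores * clock_ghz
    is an illustrative assumption, not the cost model from the thesis.
    Returns a list of chunk sizes that sums to n_items.
    """
    weights = [cores * clock_ghz for cores, clock_ghz in nodes]
    total = sum(weights)
    exact = [n_items * w / total for w in weights]
    chunks = [int(x) for x in exact]  # round every share down first
    # Give the leftover items to the nodes with the largest fractional parts.
    leftover = n_items - sum(chunks)
    order = sorted(range(len(nodes)),
                   key=lambda i: exact[i] - chunks[i],
                   reverse=True)
    for i in order[:leftover]:
        chunks[i] += 1
    return chunks
```

    For instance, `distribute(100, [(4, 2.0), (2, 2.0), (2, 1.0)])` gives the fastest node a little over half of the items, and every item is assigned to exactly one node.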

    Towards a high performance cellular automata programming skeleton

    Cellular automata provide an abstract model of parallel computation that can be effectively used for modeling and simulation of complex phenomena and systems. In this paper, we start from a skeleton designed to facilitate faster D-dimensional cellular automata application development. The key to the use of the skeleton is to achieve an efficient implementation, irrespective of the application-specific details. In the parallel implementation on a cluster, it was important to consider issues such as task and data decomposition. With multicore clusters, new problems have emerged. The increasing number of cores per node, together with caches and shared memory inside the nodes, has led to the formation of a new hierarchy of access to processors. In this paper, we describe some optimizations that restructure the prototype code and expose an abstracted view of the multicore cluster to the high performance CA application developer. The implementation of lattice division functions establishes a partnership relation among parallel processes. We propose that this relation can be mapped efficiently onto the communication topology of the multicore cluster. We introduce a new mapping strategy that can improve performance by adapting its communication pattern to the hardware affinities among processes allocated to different cores. We apply our approach to a two-dimensional application, achieving an appreciable reduction in execution time. Presented at the X Workshop Procesamiento Distribuido y Paralelo (WPDP). Red de Universidades con Carreras en Informática (RedUNCI
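
    The separation a cellular automata skeleton provides, a generic lattice update on one side and an application-specific local rule on the other, can be sketched sequentially in Python. This is not the paper's skeleton: the names `ca_step` and `life_rule`, the periodic boundary, and the choice of Conway's Game of Life as the rule are all assumptions, and the lattice division and process mapping discussed in the abstract are left out.

```python
def ca_step(grid, rule):
    """Apply one synchronous update of a 2D cellular automaton.

    grid is a list of equal-length rows; rule(cell, neighbours) is the
    user-supplied local rule. Periodic (wrap-around) boundaries are assumed.
    """
    rows, cols = len(grid), len(grid[0])

    def neighbours(r, c):
        # The eight surrounding cells (Moore neighbourhood).
        return [grid[(r + dr) % rows][(c + dc) % cols]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)]

    return [[rule(grid[r][c], neighbours(r, c)) for c in range(cols)]
            for r in range(rows)]


def life_rule(cell, neighbours):
    """Conway's Game of Life, used here only as an example rule."""
    alive = sum(neighbours)
    return 1 if alive == 3 or (cell == 1 and alive == 2) else 0
```

    Only `life_rule` is application code; `ca_step` never inspects what the cell values mean. A parallel version would split `grid` into blocks and exchange border cells between neighbouring processes, which is where the mapping strategy described above would come into play.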

    JaSkel: a java skeleton-based framework for structured cluster and grid computing

    This paper presents JaSkel, a skeleton-based framework to develop parallel and grid applications. The framework provides a set of Java abstract classes as a skeleton catalogue, which implements recurring parallel interaction paradigms. This approach aims to improve code efficiency and portability. It also helps to structure scalable applications through the refinement and composition of skeletons. Evaluation results show that using the provided skeletons improves both application development time and execution performance. Fundação para a Ciência e a Tecnologia (FCT) - PPC-VM Project (POSI/CHS/47158/2002); Project SeARCH (contract REEQ/443/2001)

    Contract-Based General-Purpose GPU Programming

    Using GPUs as general-purpose processors has revolutionized parallel computing by offering, for a large and growing set of algorithms, massive data-parallelization on desktop machines. An obstacle to widespread adoption, however, is the difficulty of programming them and the low-level control of the hardware required to achieve good performance. This paper suggests a programming library, SafeGPU, that aims at striking a balance between programmer productivity and performance, by making GPU data-parallel operations accessible from within a classical object-oriented programming language. The solution is integrated with the design-by-contract approach, which increases confidence in functional program correctness by embedding executable program specifications into the program text. We show that our library leads to modular and maintainable code that is accessible to GPGPU non-experts, while providing performance that is comparable with hand-written CUDA code. Furthermore, runtime contract checking turns out to be feasible, as the contracts can be executed on the GPU

    Parallel Programming with Global Asynchronous Memory: Models, C++ APIs and Implementations

    In the realm of High Performance Computing (HPC), message passing has been the programming paradigm of choice for over twenty years. The durable MPI (Message Passing Interface) standard, with send/receive communication, broadcast, gather/scatter, and reduction collectives, is still used to construct parallel programs where each communication is orchestrated by the developer, based on precise knowledge of data distribution and overheads; collective communications simplify the orchestration but might induce excessive synchronization. Early attempts to bring the shared-memory programming model, with its programming advantages, to distributed computing, referred to as the Distributed Shared Memory (DSM) model, faded away; one of the main issues was to combine performance and programmability with the memory consistency model. The recently proposed Partitioned Global Address Space (PGAS) model is a modern revamp of DSM that exposes data placement to enable optimizations based on locality, but it still addresses (simple) data-parallelism only and it relies on expensive sharing protocols. We advocate an alternative programming model for distributed computing based on a Global Asynchronous Memory (GAM), aiming to avoid coherency and consistency problems rather than solving them. We materialize GAM by designing and implementing a distributed smart pointers library, inspired by C++ smart pointers. In this model, public and private pointers (resembling C++ shared and unique pointers, respectively) are moved around instead of messages (i.e., data), thus relieving the user of the burden of minimizing transfers. On top of smart pointers, we propose a high-level C++ template library for writing applications in terms of dataflow-like networks, namely GAM nets, consisting of stateful processors exchanging pointers in fully asynchronous fashion. We demonstrate the validity of the proposed approach, from the expressiveness perspective, by showing how GAM nets can be exploited to implement both standalone applications and higher-level parallel programming models, such as data and task parallelism. As for the performance perspective, preliminary experiments show both close-to-ideal scalability and negligible overhead with respect to state-of-the-art benchmark implementations. For instance, the GAM implementation of a high-quality video restoration filter sustains a 100 fps throughput over 70%-noisy high-quality video streams on a 4-node cluster of Graphics Processing Units (GPUs), with minimal programming effort

    Parallel Performance of MPI Sorting Algorithms on Dual-Core Processor Windows-Based Systems

    Message Passing Interface (MPI) is widely used to implement parallel programs. Although Windows-based architectures provide facilities for parallel execution and multi-threading, little attention has been paid to using MPI on these platforms. In this paper we use a dual-core Windows-based platform to study the effect of the number of parallel processes, and also the number of cores, on the performance of three MPI parallel implementations of some sorting algorithms