
    Hybrid message-passing and shared-memory programming in a molecular dynamics application on multicore clusters

    Hybrid programming, whereby shared-memory and message-passing programming techniques are combined within a single parallel application, has often been discussed as a method for increasing code performance on clusters of symmetric multiprocessors (SMPs). This paper examines whether the hybrid model brings any performance benefits for clusters based on multicore processors. A molecular dynamics application has been parallelized using both MPI and hybrid MPI/OpenMP programming models. The performance of this application has been examined on two high-end multicore clusters using both InfiniBand and Gigabit Ethernet interconnects. The hybrid model has been found to perform well on the higher-latency Gigabit Ethernet interconnect, but offers no performance benefit on the low-latency InfiniBand interconnect. The changes in performance are attributed to the differing communication profiles of the hybrid and MPI codes.
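    For readers unfamiliar with the model, a minimal sketch of the hybrid structure the paper studies (OpenMP threads inside MPI ranks) follows; the loop body is a placeholder, not the paper's molecular dynamics kernel.

```c
/* Minimal hybrid MPI + OpenMP sketch: message passing between ranks,
 * shared-memory threads within a rank. Illustrative only; not the
 * paper's molecular dynamics code. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Request threaded MPI; MPI_THREAD_FUNNELED suffices when only
     * the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0;

    /* Shared-memory parallelism inside the rank. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (i + 1.0);   /* stand-in for force computation */

    /* Message passing between ranks. */
    double global_sum;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f (from %d ranks)\n", global_sum, nranks);

    MPI_Finalize();
    return 0;
}
```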

    Enabling and scaling biomolecular simulations of 100 million atoms on petascale machines with a multicore-optimized message-driven runtime

    A 100-million-atom biomolecular simulation with NAMD is one of the three benchmarks for the NSF-funded sustainable petascale machine. Simulating this large molecular system on a petascale machine presents great challenges, including handling I/O, large memory footprint and getting good strong-scaling results. In this paper, we present parallel I/O techniques to enable the simulation. A new SMP model is designed to efficiently utilize ubiquitous wide multicore clusters by extending the CHARM++ asynchronous message-driven runtime. We exploit node-aware techniques to optimize both the application and the underlying SMP runtime. Hierarchical load balancing is further exploited to scale NAMD to the full Jaguar PF Cray XT5 (224,076 cores) at Oak Ridge National Laboratory, both with and without PME full electrostatics, achieving 93% parallel efficiency (vs 6720 cores) at 9 ms per step for a simple cutoff calculation. Excellent scaling is also obtained on 65,536 cores of the Intrepid Blue Gene/P at Argonne National Laboratory.
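    The paper's parallel I/O machinery is built on Charm++ and is not reproduced here; purely as a hedged illustration of the collective parallel-I/O idea, the MPI-IO sketch below has each rank write its block of coordinates to a shared file. The file name, block size, and data are invented for the example.

```c
/* Hedged sketch of collective parallel I/O with MPI-IO: each rank
 * writes a contiguous block of atom coordinates at a computed offset.
 * Illustrative only; NAMD's actual I/O path differs. */
#include <mpi.h>
#include <stdlib.h>

#define ATOMS_PER_RANK 1024            /* assumed block size */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each atom has x, y, z coordinates. */
    double *coords = malloc(3 * ATOMS_PER_RANK * sizeof(double));
    for (int i = 0; i < 3 * ATOMS_PER_RANK; i++)
        coords[i] = (double)rank;      /* placeholder data */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "coords.bin",   /* invented file name */
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* Offset of this rank's block in the shared file. */
    MPI_Offset offset =
        (MPI_Offset)rank * 3 * ATOMS_PER_RANK * sizeof(double);

    /* Collective write lets the MPI library aggregate requests. */
    MPI_File_write_at_all(fh, offset, coords, 3 * ATOMS_PER_RANK,
                          MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(coords);
    MPI_Finalize();
    return 0;
}
```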

    Improving the scalability of parallel N-body applications with an event driven constraint based execution model

    The scalability and efficiency of graph applications are significantly constrained by conventional systems and their supporting programming models. Technology trends like multicore, manycore, and heterogeneous system architectures are introducing further challenges and possibilities for emerging application domains such as graph applications. This paper explores the space of effective parallel execution of ephemeral graphs that are dynamically generated, using the Barnes-Hut algorithm to exemplify dynamic workloads. The workloads are expressed using the semantics of an Exascale computing execution model called ParalleX. For comparison, results using conventional execution model semantics are also presented. We find that the advanced semantics for Exascale computing improve runtime load balancing and automatic parallelism discovery, yielding better efficiency.
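    The Barnes-Hut algorithm approximates the force from a distant group of bodies by that of its centre of mass; the sequential sketch below shows only that opening criterion. The octree layout, opening angle, and softening constant are assumptions for illustration; the paper's ParalleX version is event-driven and parallel.

```c
/* Sequential sketch of the Barnes-Hut opening criterion. The node
 * layout, THETA, and EPS2 are assumed; the paper's ParalleX-based
 * implementation is event-driven and parallel. */
#include <math.h>
#include <stddef.h>

typedef struct Node {
    double mass;            /* total mass of bodies in this cell   */
    double cx, cy, cz;      /* centre of mass of the cell          */
    double size;            /* side length of the cell             */
    int    leaf;            /* nonzero if the cell holds one body  */
    struct Node *child[8];  /* octree children; NULL where empty   */
} Node;

static const double THETA = 0.5;   /* assumed opening angle     */
static const double EPS2  = 1e-9;  /* softening, avoids r = 0   */

/* Accumulate the acceleration subtree n exerts on a body at (x,y,z). */
void accel(const Node *n, double x, double y, double z,
           double *ax, double *ay, double *az)
{
    if (n == NULL || n->mass == 0.0)
        return;

    double dx = n->cx - x, dy = n->cy - y, dz = n->cz - z;
    double r2 = dx*dx + dy*dy + dz*dz + EPS2;
    double r  = sqrt(r2);

    if (n->leaf || n->size / r < THETA) {
        /* Far enough away (or a single body): treat the whole cell
         * as a point mass. G omitted, i.e. units with G = 1. */
        double f = n->mass / (r2 * r);
        *ax += f * dx;  *ay += f * dy;  *az += f * dz;
    } else {
        /* Too close: open the cell and recurse into its children. */
        for (int i = 0; i < 8; i++)
            accel(n->child[i], x, y, z, ax, ay, az);
    }
}
```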

    A review of High Performance Computing foundations for scientists

    The increase of existing computational capabilities has made simulation emerge as a third discipline of Science, lying midway between the experimental and purely theoretical branches [1, 2]. Simulation enables the evaluation of quantities which would otherwise be inaccessible, helps to improve experiments, and provides new insights into the systems being analysed [3-6]. Knowing the fundamentals of computation can be very useful for scientists, for it can help them to improve the performance of their theoretical models and simulations. This review includes some technical essentials that can be useful to this end, and it is devised as a complement for researchers whose education is focused on scientific issues rather than on technological aspects. In this document we attempt to discuss the fundamentals of High Performance Computing (HPC) [7] in a way which is easy to understand without much previous background. We sketch the way standard computers and supercomputers work, and discuss distributed computing as well as essential aspects to take into account when running scientific calculations on computers.
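    One of the standard fundamentals such a review covers is Amdahl's law, a textbook result (not a claim specific to this paper) that bounds the speedup obtainable from parallelising a calculation whose serial fraction is s on p processors:

```latex
% Amdahl's law: upper bound on speedup from parallel execution.
% s = serial (non-parallelisable) fraction of the work, p = processors.
S(p) = \frac{1}{s + \frac{1 - s}{p}},
\qquad \lim_{p \to \infty} S(p) = \frac{1}{s}.
% Worked example: with s = 0.05, no number of processors can give
% more than a 20x speedup, since 1/0.05 = 20.
```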

    Parallel Discrete Event Simulation with Erlang

    Discrete Event Simulation (DES) is a widely used technique in which the state of the simulator is updated by events happening at discrete points in time (hence the name). DES is used to model and analyze many kinds of systems, including computer architectures, communication networks, street traffic, and others. Parallel and Distributed Simulation (PADS) aims at improving the efficiency of DES by partitioning the simulation model across multiple processing elements, in order to enable larger and/or more detailed studies to be carried out. Interest in PADS has grown with the widespread availability of multicore processors and affordable high performance computing clusters. However, designing parallel simulation models requires considerable expertise, the result being that PADS techniques are not as widespread as they could be. In this paper we describe ErlangTW, a parallel simulation middleware based on the Time Warp synchronization protocol. ErlangTW is entirely written in Erlang, a concurrent, functional programming language specifically targeted at building distributed systems. We argue that writing parallel simulation models in Erlang is considerably easier than using conventional programming languages. Moreover, ErlangTW allows simulation models to be executed on single-core, multicore, and distributed computing architectures. We describe the design and prototype implementation of ErlangTW, and report some preliminary performance results on multicore and distributed architectures using the well-known PHOLD benchmark.
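    ErlangTW itself is written in Erlang; purely as a hedged, language-neutral illustration of the sequential event loop that Time Warp parallelizes (and rolls back when events arrive out of timestamp order), a minimal DES core over a time-ordered heap might look like this. The event fields and scheduling rule are invented.

```c
/* Minimal sequential DES core: pop the earliest event, advance the
 * clock, let the handler schedule new events. Sketch only; Time Warp
 * (as in ErlangTW) distributes this loop and rolls back stragglers. */
#include <stdio.h>
#include <stdlib.h>

typedef struct Event {
    double time;                 /* simulated timestamp          */
    int    target;               /* invented: destination LP id  */
} Event;

typedef struct { Event *heap; int n, cap; } EventQueue;

static void push(EventQueue *q, Event e)
{
    if (q->n == q->cap)
        q->heap = realloc(q->heap,
                          (q->cap = q->cap ? 2 * q->cap : 16) * sizeof(Event));
    int i = q->n++;
    while (i > 0 && q->heap[(i - 1) / 2].time > e.time) {  /* sift up */
        q->heap[i] = q->heap[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    q->heap[i] = e;
}

static Event pop(EventQueue *q)
{
    Event top = q->heap[0], last = q->heap[--q->n];
    int i = 0;
    for (;;) {                                             /* sift down */
        int c = 2 * i + 1;
        if (c >= q->n) break;
        if (c + 1 < q->n && q->heap[c + 1].time < q->heap[c].time) c++;
        if (last.time <= q->heap[c].time) break;
        q->heap[i] = q->heap[c];
        i = c;
    }
    q->heap[i] = last;
    return top;
}

int main(void)
{
    EventQueue q = {0};
    push(&q, (Event){0.0, 0});                /* seed event */

    double now = 0.0;
    while (q.n > 0 && now < 10.0) {           /* run until t = 10 */
        Event e = pop(&q);
        now = e.time;                         /* clock only moves forward */
        printf("t=%.2f event at LP %d\n", now, e.target);
        /* Handler: schedule a follow-up at a random future time
         * (in the spirit of PHOLD). */
        push(&q, (Event){now + 1.0 + rand() / (double)RAND_MAX, e.target});
    }
    free(q.heap);
    return 0;
}
```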

    Analysis of Various Decentralized Load Balancing Techniques with Node Duplication

    Experience in parallel computing is an increasingly necessary skill for today’s upcoming computer scientists, as processors are hitting a serial execution performance barrier and turning to parallel execution for continued gains. Uniprocessor systems have reached their maximum speed, and there is little scope left to improve such systems. To overcome this limit, multiprocessor systems, which contain more than one processor, are used. Multiprocessor systems improve speed but face problems of their own, such as data dependency, control dependency, resource dependency, and improper load balancing. This paper therefore presents a detailed analysis of various decentralized load balancing techniques with node duplication aimed at reducing execution time.
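    Setting node duplication aside, the core of a decentralized scheme is that each node balances only against its neighbours, with no global coordinator. The sketch below shows one diffusion-style exchange step on a ring; the topology, threshold, and initial loads are all invented for illustration.

```c
/* Hedged sketch of one step of decentralized load balancing: each
 * node compares its task count with a neighbour and hands off half
 * the excess when the imbalance crosses a threshold. */
#include <stdio.h>

#define NODES     4
#define THRESHOLD 2     /* tolerated imbalance, in tasks (assumed) */

void balance_step(int load[NODES])
{
    /* Ring topology: node i talks only to neighbour (i + 1) % NODES. */
    for (int i = 0; i < NODES; i++) {
        int j = (i + 1) % NODES;
        int diff = load[i] - load[j];
        if (diff > THRESHOLD) {          /* i is overloaded: push */
            load[i] -= diff / 2;
            load[j] += diff / 2;
        } else if (-diff > THRESHOLD) {  /* i is underloaded: pull */
            load[i] += -diff / 2;
            load[j] -= -diff / 2;
        }
    }
}

int main(void)
{
    int load[NODES] = {10, 1, 7, 2};     /* invented task counts */
    for (int step = 0; step < 3; step++) {
        balance_step(load);
        printf("step %d:", step);
        for (int i = 0; i < NODES; i++) printf(" %d", load[i]);
        printf("\n");
    }
    return 0;
}
```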

    Parallel feature selection for distributed-memory clusters

    Feature selection is nowadays an extremely important data mining stage in the field of machine learning due to the appearance of problems of high dimensionality. In the literature there are numerous feature selection methods, mRMR (minimum-Redundancy-Maximum-Relevance) being one of the most widely used. However, although it achieves good results in selecting relevant features, it is impractical for datasets with thousands of features. A possible solution to this limitation is the use of the fast-mRMR method, a greedy optimization of the mRMR algorithm that improves both scalability and efficiency. In this work we present fast-mRMR-MPI, a novel hybrid parallel implementation that uses MPI and OpenMP to accelerate feature selection on distributed-memory clusters. Our performance evaluation on two different systems using five representative input datasets shows that fast-mRMR-MPI is significantly faster than fast-mRMR while providing the same results. As an example, our tool needs less than one minute to select 200 features of a dataset with more than four million features and 16,000 samples on a cluster with 32 nodes (768 cores in total), while the sequential fast-mRMR required more than eight hours. Moreover, fast-mRMR-MPI distributes data so that it is able to exploit the memory available on different nodes of a cluster and then complete analyses that fail on a single node due to memory constraints. Our tool is publicly available at https://github.com/borjaf696/Fast-mRMR.
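    The actual implementation lives at the repository above; purely as a hedged sketch of the general pattern (features partitioned across MPI ranks, scored by OpenMP threads, global winner chosen with MPI_MAXLOC), one greedy selection step might look like this. The score() function is a placeholder, not fast-mRMR's real relevance-redundancy computation.

```c
/* Hedged sketch of one greedy selection step in the style of a
 * distributed feature selector: features are partitioned across MPI
 * ranks and scored by OpenMP threads; the global best is found with
 * an MPI_MAXLOC reduction. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define LOCAL_FEATURES 1000            /* assumed features per rank */

typedef struct { double val; int idx; } ScoredFeature;  /* matches MPI_DOUBLE_INT */

static double score(int feature_id)
{
    return (feature_id % 97) / 97.0;   /* placeholder score */
}

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    ScoredFeature local = { -1.0, -1 }, best;

    /* Score this rank's slice of the feature set in parallel. */
    #pragma omp parallel
    {
        ScoredFeature mine = { -1.0, -1 };
        #pragma omp for nowait
        for (int f = 0; f < LOCAL_FEATURES; f++) {
            int gid = rank * LOCAL_FEATURES + f;   /* global feature id */
            double s = score(gid);
            if (s > mine.val) { mine.val = s; mine.idx = gid; }
        }
        #pragma omp critical
        if (mine.val > local.val) local = mine;
    }

    /* Reduce to the globally best-scoring feature across all ranks. */
    MPI_Allreduce(&local, &best, 1, MPI_DOUBLE_INT, MPI_MAXLOC,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("selected feature %d (score %.3f)\n", best.idx, best.val);

    MPI_Finalize();
    return 0;
}
```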

    Performance engineering of hybrid message passing + shared memory programming on multi-core clusters

    The hybrid message passing + shared memory programming model combines two parallel programming styles within the same application in an effort to improve the performance and efficiency of parallel codes on modern multi-core clusters. This thesis presents a performance study of this model as it applies to two Molecular Dynamics (MD) applications. Both a large scale production MD code and a smaller scale example MD code have been adapted from existing message passing versions by adding shared memory parallelism to create hybrid message passing + shared memory applications. The performance of these hybrid applications has been investigated on different multi-core clusters and compared with the original pure message passing codes. This performance analysis reveals that the hybrid message passing + shared memory model provides performance improvements under some conditions, while the pure message passing model provides better performance in others. Typically, when running on small numbers of cores the pure message passing model provides better performance than the hybrid message passing + shared memory model, as hybrid performance suffers due to increased overheads from the use of shared memory constructs. However, when running on large numbers of cores the hybrid model performs better as these shared memory overheads are minimised while the pure message passing code suffers from increased communication overhead. These results depend on the interconnect used. Hybrid message passing + shared memory molecular dynamics codes are shown to exhibit different communication profiles from their pure message passing versions and this is revealed to be a large factor in the performance difference between pure message passing and hybrid message passing + shared memory codes. An extension of this result shows that the choice of interconnection fabric used in a multi-core cluster has a large impact on the performance difference between the pure message passing and the hybrid code. The factors affecting the performance of the applications have been analytically examined in an effort to describe, generalise and predict the performance of both the pure message passing and hybrid message passing + shared memory codes.
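    The thesis's analytic treatment is not reproduced in this abstract; as a hedged sketch of the general shape such a model can take (all symbols below are assumptions, not the thesis's notation), one might write per-step times for the pure and hybrid codes as:

```latex
% Assumed sketch of a hybrid-vs-pure per-step time model (not the
% thesis's actual model). W = work per step, p = cores, t = threads
% per MPI rank, m(r) = messages per step for r ranks, \alpha = message
% latency, \beta = per-byte cost, V = bytes per message,
% T_{\mathrm{omp}} = shared-memory (OpenMP) overhead.
T_{\mathrm{pure}}(p)   = \frac{W}{p} + m(p)\,(\alpha + \beta V)
T_{\mathrm{hybrid}}(p) = \frac{W}{p} + m(p/t)\,(\alpha + \beta V') + T_{\mathrm{omp}}
% With p/t ranks instead of p, the hybrid code sends fewer, larger
% messages; it wins when the latency term dominates (high-latency
% interconnects) and loses when T_omp outweighs the saved latency,
% matching the behaviour the abstract describes.
```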