
    InterCell: a Software Suite for Rapid Prototyping and Parallel Execution of Fine Grained Applications

    InterCell is an open and operational software suite for the implementation, code generation, and interactive simulation of fine-grained parallel computational models. This article describes the software architecture, presents use cases from physics and cortical networks, and reports first performance measurements.

    Preliminary Experiments with XKaapi on Intel Xeon Phi Coprocessor

    This paper presents preliminary performance comparisons of parallel applications developed natively for the Intel Xeon Phi accelerator using three different parallel programming environments and their associated runtime systems. We compare Intel OpenMP, Intel CilkPlus and XKaapi on the same benchmark suite and provide comparisons between an Intel Xeon Phi coprocessor and a Sandy Bridge Xeon-based machine. Our benchmark suite is composed of three computing kernels: a Fibonacci computation that allows studying the overhead and scalability of the runtime system, an NQueens application generating irregular and dynamic tasks, and a Cholesky factorization algorithm. We also compare the Cholesky factorization with the parallel algorithm provided by the Intel MKL library for Intel Xeon Phi. Performance evaluation shows that our XKaapi data-flow parallel programming environment exposes the lowest overhead of all and is highly competitive with the native OpenMP and CilkPlus environments on Xeon Phi. Moreover, the efficient handling of data-flow dependencies between tasks makes our XKaapi environment exhibit more parallelism for some applications, such as the Cholesky factorization. In that case, we observe substantial gains with up to 180 hardware threads over the state-of-the-art MKL, with a 47% performance increase for 60 hardware threads.
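    The Cholesky result above hinges on the data-flow dependencies between tiles. The sketch below is a sequential NumPy illustration of my own, not the paper's XKaapi code: it spells out a tiled Cholesky factorization and annotates each tile operation with the tiles it reads and writes; these read/write sets are exactly the dependencies a data-flow runtime can use to schedule independent tiles in parallel.

```python
import numpy as np

def tiled_cholesky(A, nb):
    """In-place lower Cholesky of an SPD matrix A using nb x nb tiles.
    Assumes A.shape[0] is a multiple of nb; the factor ends up in the lower triangle."""
    n = A.shape[0]
    for k in range(0, n, nb):
        ke = k + nb
        # POTRF: factors tile (k,k); writes (k,k)
        A[k:ke, k:ke] = np.linalg.cholesky(A[k:ke, k:ke])
        for i in range(ke, n, nb):
            ie = i + nb
            # TRSM: reads (k,k), writes (i,k)
            A[i:ie, k:ke] = np.linalg.solve(A[k:ke, k:ke], A[i:ie, k:ke].T).T
        for j in range(ke, n, nb):
            je = j + nb
            for i in range(j, n, nb):
                ie = i + nb
                # SYRK/GEMM: reads (i,k) and (j,k), updates (i,j)
                A[i:ie, j:je] -= A[i:ie, k:ke] @ A[j:je, k:ke].T
    return np.tril(A)

# Small usage check: L @ L.T should reproduce the original matrix.
M = np.random.rand(8, 8)
A = M @ M.T + 8 * np.eye(8)
L = tiled_cholesky(A.copy(), nb=4)
assert np.allclose(L @ L.T, A)
```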

    Dagstuhl News January - December 2007

    "Dagstuhl News" is a publication edited especially for the members of the Foundation "Informatikzentrum Schloss Dagstuhl" to thank them for their support. The News give a summary of the scientific work being done in Dagstuhl. Each Dagstuhl Seminar is presented by a small abstract describing the contents and scientific highlights of the seminar as well as the perspectives or challenges of the research topic

    Applying big data paradigms to a large scale scientific workflow: lessons learned and future directions

    The increasing amount of data related to the execution of scientific workflows has raised awareness of their shift towards parallel data-intensive problems. In this paper, we present our experience combining traditional high-performance computing and grid-based approaches with Big Data analytics paradigms, in the context of scientific ensemble workflows. Our goal was to assess and discuss the suitability of such data-oriented mechanisms for production-ready workflows, especially in terms of scalability. We focused on two key elements of the Big Data ecosystem: the data-centric programming model, and the underlying infrastructure that integrates storage and computation in each node. We experimented with a representative MPI-based iterative workflow from the hydrology domain, EnKF-HGS, which we re-implemented using the Spark data analysis framework. We conducted experiments on a local cluster, a private cloud running OpenNebula, and the Amazon Elastic Compute Cloud (Amazon EC2). The results were analysed to synthesize the lessons learned from this experience, and promising directions for further research are discussed. This work was supported by the Spanish Ministry of Economics and Competitiveness grant TIN-2013-41350-P, the IC1305 COST Action “Network for Sustainable Ultrascale Computing Platforms” (NESUS), and the FPU Training Program for Academic and Teaching Staff Grant FPU15/00422 by the Spanish Ministry of Education.
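    As a hedged illustration of what such a re-implementation can look like (not the EnKF-HGS code itself), the sketch below expresses an ensemble cycle in Spark's data-centric model: each ensemble member becomes a record, the forward model runs inside a map, and the analysis step aggregates member states. `run_forward_model` and the recentring step are placeholders of my own.

```python
import numpy as np
from pyspark.sql import SparkSession

def run_forward_model(state):
    """Placeholder for one forward-simulation step of the hydrological model (an assumption)."""
    return state + np.random.normal(scale=0.1, size=state.shape)

spark = SparkSession.builder.appName("enkf-sketch").getOrCreate()
sc = spark.sparkContext

n_members, state_dim = 64, 1000
ensemble = sc.parallelize(
    [np.random.rand(state_dim) for _ in range(n_members)], numSlices=n_members
)

for cycle in range(3):                                   # a few assimilation cycles
    forecast = ensemble.map(run_forward_model).cache()   # members advance in parallel
    mean = forecast.reduce(lambda a, b: a + b) / n_members
    # A real EnKF computes covariances and a Kalman update; as a sketch we
    # simply shrink each member towards the forecast mean.
    ensemble = forecast.map(lambda x, m=mean: 0.5 * (x + m))

print(ensemble.first()[:5])
spark.stop()
```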

    A Scalable Approach to Modeling on Accelerated Neuromorphic Hardware.

    Neuromorphic systems open up opportunities to enlarge the explorative space for computational research. However, it is often challenging to unite efficiency and usability. This work presents the software aspects of this endeavor for the BrainScaleS-2 system, a hybrid accelerated neuromorphic hardware architecture based on physical modeling. We introduce key aspects of the BrainScaleS-2 Operating System: experiment workflow, API layering, software design, and platform operation. We present use cases to discuss and derive requirements for the software and showcase the implementation. The focus lies on novel system and software features such as multi-compartmental neurons, fast re-configuration for hardware-in-the-loop training, applications for the embedded processors, the non-spiking operation mode, interactive platform access, and sustainable hardware/software co-development. Finally, we discuss further developments in terms of hardware scale-up, system usability, and efficiency.
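    The hardware-in-the-loop training mentioned above follows a generic pattern that the hypothetical sketch below illustrates; the `FakeChip` interface is an assumption of mine, not the BrainScaleS-2 API. Each iteration re-configures the hardware with updated parameters, runs it, and computes the parameter update on the host from the observed response.

```python
import numpy as np

class FakeChip:
    """Stand-in for a neuromorphic backend (an assumption, not the BrainScaleS-2 API)."""
    def configure(self, weights):
        self.weights = weights            # re-configure synaptic parameters
    def run(self, inputs):
        return inputs @ self.weights      # placeholder for an on-chip emulation run

def hardware_in_the_loop(chip, inputs, targets, lr=0.05, steps=200):
    """Repeatedly re-configure the chip, run it, and update parameters on the host."""
    rng = np.random.default_rng(0)
    weights = rng.normal(scale=0.1, size=(inputs.shape[1], targets.shape[1]))
    for _ in range(steps):
        chip.configure(weights)                          # fast re-configuration
        outputs = chip.run(inputs)                       # observe hardware response
        grad = inputs.T @ (outputs - targets) / len(inputs)
        weights -= lr * grad                             # host-side parameter update
    return weights

chip = FakeChip()
x = np.random.randn(32, 8)
true_w = np.random.randn(8, 4)
w = hardware_in_the_loop(chip, x, x @ true_w)
print(np.abs(w - true_w).max())           # should shrink as the loop converges
```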

    Extreme Acceleration of Graph Neural Network-based Prediction Models for Quantum Chemistry

    Molecular property calculations are the bedrock of chemical physics. High-fidelity \textit{ab initio} modeling techniques for computing the molecular properties can be prohibitively expensive, and motivate the development of machine-learning models that make the same predictions more efficiently. Training graph neural networks over large molecular databases introduces unique computational challenges such as the need to process millions of small graphs with variable size and support communication patterns that are distinct from learning over large graphs such as social networks. This paper demonstrates a novel hardware-software co-design approach to scale up the training of graph neural networks for molecular property prediction. We introduce an algorithm to coalesce the batches of molecular graphs into fixed size packs to eliminate redundant computation and memory associated with alternative padding techniques and improve throughput via minimizing communication. We demonstrate the effectiveness of our co-design approach by providing an implementation of a well-established molecular property prediction model on the Graphcore Intelligence Processing Units (IPU). We evaluate the training performance on multiple molecular graph databases with varying degrees of graph counts, sizes and sparsity. We demonstrate that such a co-design approach can reduce the training time of such molecular property prediction models from days to less than two hours, opening new possibilities for AI-driven scientific discovery
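    The exact packing algorithm is not given here, so the sketch below is a hedged illustration of the general technique rather than the paper's method: a first-fit-decreasing heuristic coalesces variable-size molecular graphs into packs of at most `capacity` nodes, so that only the leftover space in each pack needs padding. The graph sizes and the capacity in the example are invented.

```python
from typing import List

def pack_graphs(graph_sizes: List[int], capacity: int) -> List[List[int]]:
    """Return packs of graph indices whose node counts each sum to <= capacity."""
    order = sorted(range(len(graph_sizes)), key=lambda i: -graph_sizes[i])
    packs, free = [], []            # free[p] = remaining capacity of packs[p]
    for idx in order:
        size = graph_sizes[idx]
        for p, room in enumerate(free):
            if size <= room:        # first existing pack with enough room
                packs[p].append(idx)
                free[p] -= size
                break
        else:                       # no pack fits: open a new one
            packs.append([idx])
            free.append(capacity - size)
    return packs

# Example: graphs with 9..30 atoms packed into 64-node packs.
sizes = [12, 30, 9, 27, 18, 22, 11, 16]
print(pack_graphs(sizes, capacity=64))
```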

    Design and resource management of reconfigurable multiprocessors for data-parallel applications

    FPGA (Field-Programmable Gate Array)-based custom reconfigurable computing machines have established themselves as low-cost and low-risk alternatives to ASIC (Application-Specific Integrated Circuit) implementations and general-purpose microprocessors in accelerating a wide range of computation-intensive applications. Most often they are Application-Specific Programmable Circuits (ASPCs), which are developer-programmable instead of user-programmable. The major disadvantages of ASPCs are minimal programmability, and significant time and energy overheads caused by the hardware reconfiguration required when the problem size exceeds the available reconfigurable resources; these problems are expected to become more serious as FPGA chip sizes increase. On the other hand, dominant high-performance computing systems, such as PC clusters and SMPs (Symmetric Multiprocessors), suffer from high communication latencies and/or scalability problems. This research introduces low-cost, user-programmable and reconfigurable MultiProcessor-on-a-Programmable-Chip (MPoPC) systems for high-performance, low-cost computing. It also proposes a relevant resource management framework that deals with performance, power consumption and energy issues. These semi-customized systems significantly reduce runtime device reconfiguration by employing user-programmable processing elements that are reusable for different tasks in large, complex applications. For the sake of illustration, two different types of MPoPCs with hardware FPUs (floating-point units) are designed and implemented for credible performance evaluation and modeling: the coarse-grain MIMD (Multiple-Instruction, Multiple-Data) CG-MPoPC machine based on a processor IP (Intellectual Property) core, and the mixed-mode (MIMD, SIMD or M-SIMD) variant-grain HERA (HEterogeneous Reconfigurable Architecture) machine. In addition to alleviating the above difficulties, MPoPCs can offer several performance and energy advantages to our data-parallel applications when compared to ASPCs; they are simpler and more scalable, and have lower verification time and cost. Various common computation-intensive benchmark algorithms, such as matrix-matrix multiplication (MMM) and LU factorization, are studied and their parallel solutions are shown for the two MPoPCs. The performance is evaluated with large sparse real-world matrices, primarily from power engineering. We expect even further performance gains on MPoPCs in the near future by employing ever-improving FPGAs. The innovative nature of this work has the potential to guide research in this emerging field of high-performance, low-cost reconfigurable computing. The largest advantage of reconfigurable logic lies in its large degree of hardware customization and reconfiguration, which allows reusing the resources to match the computation and communication needs of applications. Therefore, a major effort in the presented design methodology for mixed-mode MPoPCs, like HERA, is devoted to effective resource management. A two-phase approach is applied. A mixed-mode weighted Task Flow Graph (w-TFG) is first constructed for any given application, where tasks are classified according to their most appropriate computing mode (e.g., SIMD or MIMD). At compile time, an architecture is customized and synthesized for the w-TFG using an Integer Linear Programming (ILP) formulation and a parameterized hardware component library. Various run-time scheduling schemes with different performance-energy objectives are proposed. A system-level energy model for HERA, which is based on low-level implementation data and run-time statistics, is proposed to guide performance-energy trade-off decisions. A parallel power flow analysis technique based on Newton's method is proposed and employed to verify the methodology.
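    To make the two-phase idea concrete, the sketch below is my own illustration (not the dissertation's framework) of a minimal weighted Task Flow Graph representation in which each task carries its preferred computing mode and an estimated weight; the per-mode workload totals are the kind of input a compile-time resource-allocation step would consume. The task names and weights are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    name: str
    mode: str                         # "SIMD" or "MIMD"
    weight: float                     # estimated cycles
    deps: List[str] = field(default_factory=list)

def mode_workloads(tasks):
    """Total weight per computing mode, a first input to resource allocation."""
    totals = {}
    for t in tasks:
        totals[t.mode] = totals.get(t.mode, 0.0) + t.weight
    return totals

# Hypothetical w-TFG fragment for a matrix-oriented application.
tfg = [
    Task("load_matrix", "MIMD", 1e4),
    Task("mmm_tiles", "SIMD", 8e5, deps=["load_matrix"]),
    Task("lu_pivot", "MIMD", 2e5, deps=["mmm_tiles"]),
]
print(mode_workloads(tfg))
```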

    Canary: Congestion-Aware In-Network Allreduce Using Dynamic Trees

    Full text link
    The allreduce operation is an essential building block for many distributed applications, ranging from the training of deep learning models to scientific computing. In an allreduce operation, data from multiple hosts is aggregated and then broadcast to each host participating in the operation. Allreduce performance can be improved by a factor of two by aggregating the data directly in the network: switches aggregate data coming from multiple ports before forwarding the partially aggregated result to the next hop. In all existing solutions, each switch needs to know the ports from which it will receive the data to aggregate. However, this forces packets to traverse a predefined set of switches, making these solutions prone to congestion. For this reason, we design Canary, the first congestion-aware in-network allreduce algorithm. Canary uses load-balancing algorithms to forward packets on the least congested paths. Because switches do not know from which ports they will receive the data to aggregate, they use timeouts to aggregate the data in a best-effort way. We develop a P4 Canary prototype and evaluate it on a Tofino switch. We then validate Canary through simulations on large networks, showing performance improvements of up to 40% compared to the state of the art.
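    The timeout-based, best-effort aggregation can be pictured with the toy sketch below; it is my own Python illustration, not Canary's P4 prototype. A switch-side aggregator accumulates whatever contributions for a chunk arrive before a deadline and then flushes the partial sum towards the next hop; for simplicity the expiry check runs on packet arrival rather than on a timer.

```python
import time

class BestEffortAggregator:
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.slots = {}        # chunk_id -> [partial_sum, n_contributions, first_seen]

    def on_packet(self, chunk_id: int, value: float):
        """Fold one contribution in; return a flushed (partial_sum, count) on expiry."""
        now = time.monotonic()
        s = self.slots.setdefault(chunk_id, [0.0, 0, now])
        s[0] += value
        s[1] += 1
        return self.flush(chunk_id) if self.expired(chunk_id, now) else None

    def expired(self, chunk_id, now):
        return now - self.slots[chunk_id][2] >= self.timeout_s

    def flush(self, chunk_id):
        partial, count, _ = self.slots.pop(chunk_id)
        return partial, count           # forwarded to the next hop / parent switch

agg = BestEffortAggregator(timeout_s=0.001)
agg.on_packet(0, 1.5)
time.sleep(0.002)
print(agg.on_packet(0, 2.5))            # -> (4.0, 2): partial sum flushed on expiry
```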

    A Data-parallel Approach for Efficient Resource Utilization in Distributed Serverless Deep Learning

    Serverless computing is an integral part of the recent success of cloud computing, offering cost and performance efficiency for small and large scale distributed systems. Owing to the increasing interest of developers in integrating distributed computing techniques into deep learning frameworks for better performance, serverless infrastructures have been the choice of many to host their applications. However, this computing architecture bears resource limitations which challenge the successful completion of many deep learning jobs. In our research, an approach is presented to address timeout and memory resource limitations, two key issues for deep learning on serverless infrastructures. Focusing on Apache OpenWhisk as the serverless platform and TensorFlow as the deep learning framework, our solution was designed after an in-depth assessment of the former and after failed attempts at tackling resource constraints through system-level modifications. The proposed approach employs data parallelism and ensures the concurrent execution of separate cloud functions. A weighted averaging of intermediate models is then applied to build an ensemble model ready for evaluation. Through a fine-grained system design, our solution executed and completed deep learning workflows on OpenWhisk with a 0% failure rate. Moreover, the comparison with a traditional deployment on OpenWhisk shows that our approach uses 45% less memory and reduces the execution time by 58%.
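    As a hedged sketch of the combination step described above (not the paper's implementation), the snippet below builds the ensemble model by a weighted average of the parameter tensors returned by separate cloud functions; weighting each partial model by the size of the data shard it was trained on is my assumption.

```python
import numpy as np

def weighted_average(models, shard_sizes):
    """models: one list of weight tensors per cloud function; weighted by shard size."""
    total = float(sum(shard_sizes))
    merged = []
    for layer_tensors in zip(*models):                 # align tensors layer by layer
        merged.append(sum((n / total) * t for n, t in zip(shard_sizes, layer_tensors)))
    return merged

# Example with two "functions", each returning two layers of parameters.
m1 = [np.ones((3, 3)), np.zeros(3)]
m2 = [3 * np.ones((3, 3)), np.ones(3)]
ensemble = weighted_average([m1, m2], shard_sizes=[100, 300])
print(ensemble[0][0, 0])   # 0.25 * 1 + 0.75 * 3 = 2.5
```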

    Adaptive heterogeneous parallelism for semi-empirical lattice dynamics in computational materials science.

    With the variability in performance of the multitude of parallel environments available today, the conceptual overhead created by the need to anticipate runtime information to make design-time decisions has become overwhelming. Performance-critical applications and libraries carry implicit assumptions based on incidental metrics that are not portable to emerging computational platforms or even to alternative contemporary architectures. Furthermore, the significance of runtime concerns such as makespan, energy efficiency and fault tolerance depends on the situational context. This thesis presents a case study in the application of both Mattson's prescriptive pattern-oriented approach and the more principled structured parallelism formalism to the computational simulation of inelastic neutron scattering spectra on hybrid CPU/GPU platforms. The original ad hoc implementation as well as new pattern-based and structured implementations are evaluated for relative performance and scalability. Two new structural abstractions are introduced to facilitate adaptation by lazy optimisation and runtime feedback. A deferred-choice abstraction represents a unified space of alternative structural program variants, allowing static adaptation through model-specific exhaustive calibration with regard to the extra-functional concerns of runtime, average instantaneous power and total energy usage. Instrumented queues serve as the mechanism for structural composition and provide a representation of extra-functional state that allows realisation of a market-based decentralised coordination heuristic for competitive resource allocation and the Lyapunov drift algorithm for cooperative scheduling.
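    The deferred-choice abstraction can be pictured with the hedged sketch below (my own, not the thesis implementation): several structurally different variants of the same computation are kept side by side, and a calibration run made on first use picks one with respect to a chosen extra-functional metric, here wall-clock time.

```python
import time
import numpy as np

class DeferredChoice:
    """Keep several implementation variants and defer the choice to a calibration run."""
    def __init__(self, **variants):
        self.variants = variants          # name -> callable
        self.chosen = None

    def calibrate(self, *sample_args):
        timings = {}
        for name, fn in self.variants.items():
            start = time.perf_counter()
            fn(*sample_args)
            timings[name] = time.perf_counter() - start
        self.chosen = min(timings, key=timings.get)   # e.g. optimise for makespan
        return timings

    def __call__(self, *args):
        if self.chosen is None:           # lazy optimisation: calibrate on first use
            self.calibrate(*args)
        return self.variants[self.chosen](*args)

# Two structurally different variants of the same computation.
dot = DeferredChoice(
    loop=lambda a, b: sum(x * y for x, y in zip(a, b)),
    vectorised=lambda a, b: float(np.dot(a, b)),
)
v = np.random.rand(100_000)
print(dot(v, v))
```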