2,259 research outputs found

    Overview of Swallow --- A Scalable 480-core System for Investigating the Performance and Energy Efficiency of Many-core Applications and Operating Systems

    We present Swallow, a scalable many-core architecture, with a current configuration of 480 x 32-bit processors. Swallow is an open-source architecture, designed from the ground up to deliver scalable increases in usable computational power to allow experimentation with many-core applications and the operating systems that support them. Scalability is enabled by the creation of a tile-able system with a low-latency interconnect, featuring an attractive communication-to-computation ratio and the use of a distributed memory configuration. We analyse the energy, computational, and communication performance of Swallow. The system provides 240 GIPS, with each core consuming 71 to 193 mW depending on workload. Power consumption per instruction is lower than almost all systems of comparable scale. We also show how the use of a distributed operating system (nOS) allows the easy creation of scalable software to exploit Swallow's potential. Finally, we show two use case studies: modelling neurons and the overlay of shared memory on a distributed memory system. Comment: An open source release of the Swallow system design and code will follow and references to these will be added at a later date.
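The headline figures in this abstract can be cross-checked with a little arithmetic. The sketch below derives a per-core throughput and an energy-per-instruction range from the quoted 480 cores, 240 GIPS, and 71 to 193 mW per core. It assumes, which the abstract does not state, that throughput is spread evenly across cores and that the 240 GIPS figure applies across the whole power range; all function and constant names are illustrative.

```python
# Back-of-the-envelope check of the figures quoted in the Swallow abstract.
# Assumption (not stated in the abstract): throughput is spread evenly over
# all cores and the 240 GIPS figure applies at both ends of the power range.

CORES = 480                    # processors in the current configuration
TOTAL_GIPS = 240.0             # aggregate throughput quoted in the abstract
CORE_POWER_MW = (71.0, 193.0)  # per-core power range, in milliwatts


def per_core_mips(total_gips: float = TOTAL_GIPS, cores: int = CORES) -> float:
    """Average throughput of one core, in millions of instructions per second."""
    return total_gips * 1000.0 / cores


def energy_per_instruction_pj(power_mw: float, mips: float) -> float:
    """Energy per instruction in picojoules.

    mW / MIPS = (1e-3 J/s) / (1e6 instr/s) = 1e-9 J per instruction (nJ),
    so multiplying by 1000 converts the result to pJ.
    """
    return power_mw / mips * 1000.0


def total_core_power_w(power_mw: float, cores: int = CORES) -> float:
    """Power drawn by all cores together, in watts."""
    return power_mw * cores / 1000.0


if __name__ == "__main__":
    mips = per_core_mips()
    print(f"per-core throughput: {mips:.0f} MIPS")
    for p in CORE_POWER_MW:
        print(f"{p:.0f} mW per core: "
              f"{energy_per_instruction_pj(p, mips):.0f} pJ/instruction, "
              f"{total_core_power_w(p):.2f} W for all cores")
```

Under these assumptions each core averages 500 MIPS, which puts the cores alone at roughly 142 to 386 pJ per instruction. This counts core power only; interconnect and other system overheads are not included.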

    Assessing the Performance of MPI Applications Through Time-Independent Trace Replay

    Simulation is a popular approach to obtaining objective performance indicators for platforms that are not at one's disposal. It may help in dimensioning compute clusters in large computing centers. In this work we present a framework for the off-line simulation of MPI applications. Its main originality with regard to the literature is that it relies on time-independent execution traces. This allows us to completely decouple the acquisition process from the actual replay of the traces in a simulation context. We are then able to acquire traces for large application instances without being limited to an execution on a single compute cluster. Finally, our framework is built on top of a scalable, fast, and validated simulation kernel. In this paper, we present the time-independent trace format we use, investigate several acquisition strategies, detail the trace replay tool we developed, and assess the quality of our simulation framework in terms of accuracy, acquisition time, simulation time, and trace size.

    A new multi-particle collision algorithm for optimization in a high performance environment


    Parallel Java: A Unified API for Shared Memory and Cluster Parallel Programming in 100% Java

    Parallel Java is a parallel programming API whose goals are (1) to support both shared memory (thread-based) parallel programming and cluster (message-based) parallel programming in a single unified API, allowing one to write parallel programs combining both paradigms; (2) to provide the same capabilities as OpenMP and MPI in an object-oriented, 100% Java API; and (3) to be easily deployed and run in a heterogeneous computing environment of single-core CPUs, multi-core CPUs, and clusters thereof. This paper describes Parallel Java’s features and architecture; compares and contrasts Parallel Java with other Java-based parallel middleware libraries; and reports performance measurements of Parallel Java programs.

    Hybrid Satellite-Terrestrial Communication Networks for the Maritime Internet of Things: Key Technologies, Opportunities, and Challenges

    With the rapid development of marine activities, there has been an increasing number of maritime mobile terminals, as well as a growing demand for high-speed and ultra-reliable maritime communications to keep them connected. Traditionally, the maritime Internet of Things (IoT) is enabled by maritime satellites. However, satellites are seriously restricted by their high latency and relatively low data rate. As an alternative, shore- and island-based base stations (BSs) can be built to extend the coverage of terrestrial networks using fourth-generation (4G), fifth-generation (5G), and beyond-5G services. Unmanned aerial vehicles can also be exploited to serve as aerial maritime BSs. Despite all these approaches, there are still open issues for an efficient maritime communication network (MCN). For example, due to the complicated electromagnetic propagation environment, the limited geometrically available BS sites, and rigorous service demands from mission-critical applications, conventional communication and networking theories and methods should be tailored for maritime scenarios. Towards this end, we provide a survey on the demand for maritime communications, the state-of-the-art MCNs, and key technologies for enhancing transmission efficiency, extending network coverage, and provisioning maritime-specific services. Future challenges in developing an environment-aware, service-driven, and integrated satellite-air-ground MCN that is smart enough to utilize external auxiliary information, e.g., sea state and atmosphere conditions, are also discussed.

    Shape-based cost analysis of skeletal parallel programs

    Institute for Computing Systems Architecture
    This work presents an automatic cost-analysis system for an implicitly parallel skeletal programming language. Although deducing interesting dynamic characteristics of parallel programs (and in particular, run time) is well known to be an intractable problem in the general case, it can be alleviated by placing restrictions upon the programs which can be expressed. By combining two research threads, the “skeletal” and “shapely” paradigms which take this route, we produce a completely automated, computation- and communication-sensitive cost analysis system. This builds on earlier work in the area by quantifying communication as well as computation costs, with the former being derived for the Bulk Synchronous Parallel (BSP) model. We present details of our shapely skeletal language and its BSP implementation strategy together with an account of the analysis mechanism by which program behaviour information (such as shape and cost) is statically deduced. This information can be used at compile-time to optimise a BSP implementation and to analyse computation and communication costs. The analysis has been implemented in Haskell. We consider different algorithms expressed in our language for some example problems and illustrate each BSP implementation, contrasting the analysis of their efficiency by traditional, intuitive methods with that achieved by our cost calculator. The accuracy of cost predictions by our cost calculator against the run time of real parallel programs is tested experimentally. Previous shape-based cost analysis required all elements of a vector (our nestable bulk data structure) to have the same shape. We partially relax this strict requirement on data structure regularity by introducing new shape expressions in our analysis framework. We demonstrate that this allows us to achieve the first automated analysis of a complete derivation, the well-known maximum segment sum algorithm of Skillicorn and Cai.
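For readers unfamiliar with BSP costing, the sketch below shows the standard textbook BSP cost model that analyses of this kind target. A superstep costs the largest local computation, plus g times the largest number of words any processor sends or receives (the h-relation), plus the barrier latency l. This is not the thesis's cost calculator, which derives such quantities statically from shapes; the function names and the example numbers are hypothetical.

```python
from typing import Iterable, Sequence, Tuple

# Standard BSP cost model (textbook form), NOT the thesis's shape-based analysis.
# g: cost per word communicated; l: barrier synchronisation cost.
# Both are machine parameters, expressed in the same units as local work.


def superstep_cost(work: Sequence[float], h_relation: Sequence[float],
                   g: float, l: float) -> float:
    """Cost of one superstep: max local work + g * max words moved + l."""
    if not work or len(work) != len(h_relation):
        raise ValueError("need one work value and one h value per processor")
    return max(work) + g * max(h_relation) + l


def program_cost(supersteps: Iterable[Tuple[Sequence[float], Sequence[float]]],
                 g: float, l: float) -> float:
    """Total cost of a BSP program: the sum of its superstep costs."""
    return sum(superstep_cost(w, h, g, l) for w, h in supersteps)


if __name__ == "__main__":
    # Hypothetical 4-processor program with two supersteps.
    steps = [
        ([100, 120, 90, 110], [10, 8, 12, 9]),  # compute, then exchange data
        ([30, 30, 30, 30], [0, 0, 0, 0]),       # purely local superstep
    ]
    print(program_cost(steps, g=4, l=50))
```

With these made-up numbers the first superstep costs 120 + 4 x 12 + 50 = 218 and the second 30 + 0 + 50 = 80, for a total of 298. A static analysis would have to predict the per-processor work and h values from the shapes of the data rather than measure them.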

    Paradigms for Structure in an Amorphous Computer

    Recent developments in microfabrication and nanotechnology will enable the inexpensive manufacturing of massive numbers of tiny computing elements with sensors and actuators. New programming paradigms are required for obtaining organized and coherent behavior from the cooperation of large numbers of unreliable processing elements that are interconnected in unknown, irregular, and possibly time-varying ways. Amorphous computing is the study of developing and programming such ultrascale computing environments. This paper presents an approach to programming an amorphous computer by spontaneously organizing an unstructured collection of processing elements into cooperative groups and hierarchies. This paper introduces a structure called an AC Hierarchy, which logically organizes processors into groups at different levels of granularity. The AC Hierarchy simplifies programming of an amorphous computer through new language abstractions, facilitates the design of efficient and robust algorithms, and simplifies the analysis of their performance. Several example applications are presented that greatly benefit from the AC Hierarchy. This paper introduces three algorithms for constructing multiple levels of the hierarchy from an unstructured collection of processors.

    Broadcasting in grid graphs

    This work consists of two separate parts. The first part deals with the problem of multiple message broadcasting, and the topic of the second part is line broadcasting. Broadcasting is a process in which one vertex in a graph knows one or more messages. The goal is to inform all remaining vertices as fast as possible. In this work we consider a special kind of graph: grids. In 1980 A. M. Farley showed that the minimum time required to broadcast a set of M messages in any connected graph with diameter d is d + 2(M - 1). This work presents an approach to broadcasting multiple messages from the corner vertex of a 2-dimensional grid. This approach gives us a broadcasting scheme that differs only by 2 (and in the case of an even x even grid by only 1) from the above lower bound. Line broadcasting describes a different variant of the broadcasting process. A. M. Farley showed that line broadcasting can always be completed in ⌈log₂ n⌉ time units in any connected graph on n vertices. He defined three different cost measures for line broadcasting. This work presents strategies for minimizing those costs for various grid sizes.
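The two bounds quoted in this abstract are easy to evaluate for a concrete grid. The sketch below computes the diameter of a rows x cols grid, the d + 2(M - 1) bound for M messages, and the line-broadcast time on n vertices, reading the bracketed log as a base-2 ceiling. The function names are illustrative, and the code evaluates the bounds only; it does not construct a broadcast scheme.

```python
def grid_diameter(rows: int, cols: int) -> int:
    """Diameter of a rows x cols grid graph: the corner-to-opposite-corner distance."""
    if rows < 1 or cols < 1:
        raise ValueError("grid dimensions must be positive")
    return (rows - 1) + (cols - 1)


def multi_message_bound(diameter: int, messages: int) -> int:
    """The d + 2(M - 1) bound quoted in the abstract for broadcasting M messages."""
    if messages < 1:
        raise ValueError("need at least one message")
    return diameter + 2 * (messages - 1)


def line_broadcast_time(n: int) -> int:
    """ceil(log2(n)) time units, computed exactly with integer arithmetic."""
    if n < 1:
        raise ValueError("need at least one vertex")
    # For n >= 1, (n - 1).bit_length() equals ceil(log2(n)) with no float rounding.
    return (n - 1).bit_length()


if __name__ == "__main__":
    rows, cols, messages = 4, 5, 3  # hypothetical example grid and message count
    d = grid_diameter(rows, cols)
    print("diameter:", d)
    print("multi-message bound:", multi_message_bound(d, messages))
    print("line broadcast time:", line_broadcast_time(rows * cols))
```

For a 4 x 5 grid and three messages this gives a diameter of 7, a bound of 11 time units, and a line-broadcast time of 5 time units for the 20 vertices. Per the abstract, the corner-source scheme in the thesis comes within 2 of the multi-message bound, or within 1 on an even x even grid.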