Overview of Swallow --- A Scalable 480-core System for Investigating the Performance and Energy Efficiency of Many-core Applications and Operating Systems
We present Swallow, a scalable many-core architecture, with a current
configuration of 480 x 32-bit processors.
Swallow is an open-source architecture, designed from the ground up to
deliver scalable increases in usable computational power to allow
experimentation with many-core applications and the operating systems that
support them.
Scalability is enabled by the creation of a tile-able system with a
low-latency interconnect, featuring an attractive communication-to-computation
ratio and the use of a distributed memory configuration.
We analyse the energy, computation, and communication performance of
Swallow. The system provides 240 GIPS, with each core consuming 71--193 mW
depending on workload. Power consumption per instruction is lower than in
almost all systems of comparable scale.
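As a back-of-envelope check on the quoted figures (an illustrative calculation, not taken from the paper, and assuming throughput is spread evenly across cores), 240 GIPS over 480 cores implies 500 MIPS per core, and the 71--193 mW per-core range then bounds the energy per instruction:

```python
# Illustrative arithmetic from the quoted Swallow figures.
total_gips = 240           # system throughput, giga-instructions/s
cores = 480
power_mw = (71, 193)       # per-core power range, milliwatts

mips_per_core = total_gips * 1e3 / cores   # MIPS per core
# energy per instruction (picojoules) = power / instruction rate
epi_pj = [p * 1e-3 / (mips_per_core * 1e6) * 1e12 for p in power_mw]
print(f"{mips_per_core:.0f} MIPS/core, "
      f"{epi_pj[0]:.0f}-{epi_pj[1]:.0f} pJ/instruction")
```

This works out to roughly 142--386 pJ per instruction, the kind of figure the paper's cross-system comparison rests on.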
We also show how the use of a distributed operating system (nOS) allows the
easy creation of scalable software to exploit Swallow's potential. Finally, we
show two use case studies: modelling neurons and the overlay of shared memory
on a distributed memory system.
Comment: An open source release of the Swallow system design and code will
follow, and references to these will be added at a later date
Assessing the Performance of MPI Applications Through Time-Independent Trace Replay
Simulation is a popular approach to obtain objective performance indicators for platforms that are not at one's disposal. It may help with the dimensioning of compute clusters in large computing centers. In this work we present a framework for the off-line simulation of MPI applications. Its main originality with regard to the literature is to rely on time-independent execution traces. This allows us to completely decouple the acquisition process from the actual replay of the traces in a simulation context. We are then able to acquire traces for large application instances without being limited to an execution on a single compute cluster. Finally, our framework is built on top of a scalable, fast, and validated simulation kernel. In this paper, we present the time-independent trace format used, investigate several acquisition strategies, detail the developed trace replay tool, and assess the quality of our simulation framework in terms of accuracy, acquisition time, simulation time, and trace size
Parallel Java: A Unified API for Shared Memory and Cluster Parallel Programming in 100% Java
Parallel Java is a parallel programming API whose goals are (1) to support both shared memory (thread-based) parallel programming and cluster (message-based) parallel programming in a single unified API, allowing one to write parallel programs combining both paradigms; (2) to provide the same capabilities as OpenMP and MPI in an object-oriented, 100% Java API; and (3) to be easily deployed and run in a heterogeneous computing environment of single-core CPUs, multi-core CPUs, and clusters thereof. This paper describes Parallel Java's features and architecture; compares and contrasts Parallel Java to other Java-based parallel middleware libraries; and reports performance measurements of Parallel Java programs
Hybrid Satellite-Terrestrial Communication Networks for the Maritime Internet of Things: Key Technologies, Opportunities, and Challenges
With the rapid development of marine activities, there has been an increasing
number of maritime mobile terminals, as well as a growing demand for high-speed
and ultra-reliable maritime communications to keep them connected.
Traditionally, the maritime Internet of Things (IoT) is enabled by maritime
satellites. However, satellites are seriously restricted by their high latency
and relatively low data rate. As an alternative, shore & island-based base
stations (BSs) can be built to extend the coverage of terrestrial networks
using fourth-generation (4G), fifth-generation (5G), and beyond 5G services.
Unmanned aerial vehicles can also be exploited to serve as aerial maritime BSs.
Despite all these approaches, there are still open issues for an efficient
maritime communication network (MCN). For example, due to the complicated
electromagnetic propagation environment, the limited geometrically available BS
sites, and rigorous service demands from mission-critical applications,
conventional communication and networking theories and methods should be
tailored for maritime scenarios. Towards this end, we provide a survey on the
demand for maritime communications, the state-of-the-art MCNs, and key
technologies for enhancing transmission efficiency, extending network coverage,
and provisioning maritime-specific services. Future challenges in developing an
environment-aware, service-driven, and integrated satellite-air-ground MCN to
be smart enough to utilize external auxiliary information, e.g., sea state and
atmospheric conditions, are also discussed
Shape-based cost analysis of skeletal parallel programs
This work presents an automatic cost-analysis system for an implicitly parallel skeletal
programming language.
Although deducing interesting dynamic characteristics of parallel programs (and in
particular, run time) is well known to be an intractable problem in the general case, it
can be alleviated by placing restrictions upon the programs which can be expressed.
By combining two research threads, the "skeletal" and "shapely" paradigms which
take this route, we produce a completely automated, computation and communication
sensitive cost analysis system. This builds on earlier work in the area by quantifying
communication as well as computation costs, with the former being derived for the
Bulk Synchronous Parallel (BSP) model.
We present details of our shapely skeletal language and its BSP implementation strategy
together with an account of the analysis mechanism by which program behaviour
information (such as shape and cost) is statically deduced. This information can be
used at compile-time to optimise a BSP implementation and to analyse computation
and communication costs. The analysis has been implemented in Haskell. We consider
different algorithms expressed in our language for some example problems and
illustrate each BSP implementation, contrasting the analysis of their efficiency by traditional,
intuitive methods with that achieved by our cost calculator. The accuracy of
cost predictions by our cost calculator against the run time of real parallel programs is
tested experimentally.
Previous shape-based cost analysis required all elements of a vector (our nestable bulk
data structure) to have the same shape. We partially relax this strict requirement on data
structure regularity by introducing new shape expressions in our analysis framework.
We demonstrate that this allows us to achieve the first automated analysis of a complete
derivation: the well-known maximum segment sum algorithm of Skillicorn and Cai
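For reference, the maximum segment sum problem mentioned above asks for the largest sum over any contiguous run of elements. The sketch below shows a standard sequential solution (Kadane's algorithm), given only to fix the problem statement; it is not the skeletal BSP derivation analysed in the thesis:

```python
def max_segment_sum(xs):
    """Largest sum over all contiguous segments (the empty segment counts as 0)."""
    best = current = 0
    for x in xs:
        current = max(0, current + x)  # extend the current segment or restart
        best = max(best, current)
    return best

print(max_segment_sum([3, -4, 5, -1, 2, -6, 4]))  # -> 6, from the segment [5, -1, 2]
```

The skeletal derivation reaches the same linear-time result by transforming a naive cubic-time specification, which is what makes its automated cost analysis interesting.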
Paradigms for Structure in an Amorphous Computer
Recent developments in microfabrication and nanotechnology will enable the inexpensive manufacturing of massive numbers of tiny computing elements with sensors and actuators. New programming paradigms are required for obtaining organized and coherent behavior from the cooperation of large numbers of unreliable processing elements that are interconnected in unknown, irregular, and possibly time-varying ways. Amorphous computing is the study of developing and programming such ultrascale computing environments. This paper presents an approach to programming an amorphous computer by spontaneously organizing an unstructured collection of processing elements into cooperative groups and hierarchies. This paper introduces a structure called an AC Hierarchy, which logically organizes processors into groups at different levels of granularity. The AC hierarchy simplifies programming of an amorphous computer through new language abstractions, facilitates the design of efficient and robust algorithms, and simplifies the analysis of their performance. Several example applications are presented that greatly benefit from the AC hierarchy. This paper introduces three algorithms for constructing multiple levels of the hierarchy from an unstructured collection of processors
Broadcasting in grid graphs
This work consists of two separate parts. The first part deals with the problem of multiple message broadcasting, and the topic of the second part is line broadcasting. Broadcasting is a process in which one vertex in a graph knows one or more messages; the goal is to inform all remaining vertices as fast as possible. In this work we consider a special kind of graph: grids.

In 1980, A. M. Farley showed that the minimum time required to broadcast a set of M messages in any connected graph with diameter d is d + 2(M - 1). This work presents an approach to broadcasting multiple messages from the corner vertex of a 2-dimensional grid. This approach gives a broadcasting scheme that differs by only 2 (and, in the case of an even × even grid, by only 1) from the above lower bound.

Line broadcasting describes a different variant of the broadcasting process. A. M. Farley showed that line broadcasting can always be completed in ⌈log n⌉ time units in any connected graph on n vertices. He defined three different cost measures for line broadcasting. This work presents strategies for minimizing those costs for various grid sizes
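Farley's lower bound of d + 2(M - 1) cited in this abstract is easy to evaluate. The sketch below (an illustration only, not code from the thesis) evaluates it for a corner broadcast in a p × q grid, where the relevant diameter is (p - 1) + (q - 1):

```python
def farley_lower_bound(d, m):
    """Minimum time to broadcast m messages in a connected graph of diameter d."""
    return d + 2 * (m - 1)

# Corner broadcast of m messages in a p x q grid (hypothetical example sizes):
p, q, m = 8, 8, 4
d = (p - 1) + (q - 1)          # distance from a corner to the opposite corner
print(farley_lower_bound(d, m))  # 14 + 2*3 = 20 time units
```

Per the abstract, the scheme presented in the thesis finishes within 2 time units of this bound, and within 1 for even × even grids such as this one.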
- …