
    Pipelining the Fast Multipole Method over a Runtime System

    Fast Multipole Methods (FMM) are a fundamental operation for the simulation of many physical problems. The high-performance design of such methods usually requires careful tuning of the algorithm for both the targeted physics and the hardware. In this paper, we propose a new approach that achieves high performance across architectures. Our method consists of expressing the FMM algorithm as a task flow and employing a state-of-the-art runtime system, StarPU, to process the tasks on the different processing units. We carefully design the task flow, the mathematical operators, their Central Processing Unit (CPU) and Graphics Processing Unit (GPU) implementations, as well as the scheduling schemes. We compute the potentials and forces of 200 million particles in 48.7 seconds on a homogeneous 160-core SGI Altix UV 100, and of 38 million particles in 13.34 seconds on a heterogeneous 12-core Intel Nehalem processor enhanced with 3 Nvidia M2090 Fermi GPUs. Comment: No. RR-7981 (2012)
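
    As a rough illustration of the task-flow idea (not the authors' StarPU code), the Python sketch below submits a toy subset of FMM operators as dependent tasks to a generic worker pool; the operator names P2M, M2L and L2P are standard FMM terminology, and the thread pool merely stands in for a runtime system such as StarPU scheduling tasks over CPUs and GPUs.

        # Hypothetical sketch: a toy FMM phase expressed as a flow of dependent
        # tasks handed to a generic runtime (here, a thread pool standing in for
        # StarPU). Operator names follow standard FMM terminology.
        from concurrent.futures import ThreadPoolExecutor

        def p2m(leaf):             # particles -> multipole expansion of a leaf cell
            return f"M({leaf})"

        def m2l(multipole, cell):  # multipole -> local expansion of a target cell
            return f"L({cell})<-{multipole}"

        def l2p(local_exp, cell):  # local expansion -> potentials at the particles
            return f"phi({cell}) from {local_exp}"

        with ThreadPoolExecutor() as pool:
            cells = ["A", "B"]
            # Upward-pass tasks are independent across leaves.
            mult = {c: pool.submit(p2m, c) for c in cells}
            # Far-field tasks consume the multipole produced for the other cell.
            local = {c: pool.submit(m2l, mult[o].result(), c)
                     for c, o in [("A", "B"), ("B", "A")]}
            # Downward-pass tasks fire once their local expansion is available.
            for c in cells:
                print(pool.submit(l2p, local[c].result(), c).result())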

    Parameterized Complexity of a Parallel Machine Scheduling Problem

    In this paper we consider the parameterized complexity of two versions of a parallel machine scheduling problem with precedence delays, unit processing times and time windows. In the first version - with exact delays - we assume that the delay between two jobs must be respected exactly, whereas in the second version - with minimum delays - the delay between two jobs is a lower bound on the time between them. Two parameters are considered for this analysis: the pathwidth of the interval graph induced by the time windows and the maximum precedence delay value. We prove that our problems are para-NP-complete with respect to either of the two parameters taken alone, and fixed-parameter tractable when parameterized by the pair of parameters.
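
    To make the two versions concrete, the constraints can be written as below, assuming the delay l(i,j) is counted from the completion of the predecessor and writing S_j for the start time of job j and [r_j, d_j] for its time window (notation chosen here for illustration, not necessarily the paper's):

        \begin{align*}
          S_j &= S_i + 1 + l(i,j)   && \text{(exact delay from $i$ to $j$, unit processing times)}\\
          S_j &\ge S_i + 1 + l(i,j) && \text{(minimum delay from $i$ to $j$)}\\
          r_j &\le S_j \le d_j - 1  && \text{(time window of job $j$)}
        \end{align*}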

    Integration of tools for the Design and Assessment of High-Performance, Highly Reliable Computing Systems (DAHPHRS), phase 1

    Systems for Strategic Defense Initiative (SDI) space applications typically require both high performance and very high reliability. These requirements present the systems engineer evaluating such systems with the extremely difficult problem of conducting performance and reliability trade-offs over large design spaces. A controlled development process supported by appropriate automated tools must be used to assure that the system will meet design objectives. This report describes an investigation of the methods, tools, and techniques necessary to support performance and reliability modeling for SDI systems development. Models of the JPL Hypercubes, the Encore Multimax, and the C.S. Draper Lab Fault-Tolerant Parallel Processor (FTPP) parallel-computing architectures were built and analyzed, using candidate SDI weapons-to-target assignment algorithms as workloads, as a means of identifying the necessary system models, how the models interact, and what experiments and analyses should be performed. As a result of this effort, weaknesses in the existing methods and tools were revealed, and capabilities that will be required for both individual tools and an integrated toolset were identified.

    Decentralized Delay Optimal Control for Interference Networks with Limited Renewable Energy Storage

    In this paper, we consider delay minimization for interference networks with a renewable energy source, where the transmission power of a node comes from both the conventional utility power (AC power) and the renewable energy source. We assume the transmission power of each node is a function of the local channel state, local data queue state and local energy queue state only. We consider two delay optimization formulations, namely the decentralized partially observable Markov decision process (DEC-POMDP) and the non-cooperative partially observable stochastic game (POSG). In the DEC-POMDP formulation, we derive a decentralized online learning algorithm to determine the control actions and Lagrangian multipliers (LMs) simultaneously, based on the policy gradient approach. Under some mild technical conditions, the proposed decentralized policy gradient algorithm converges almost surely to a local optimal solution. In the non-cooperative POSG formulation, on the other hand, the transmitter nodes are non-cooperative. We extend the decentralized policy gradient solution and establish the technical proof of almost-sure convergence of the learning algorithms. In both cases, the solutions are very robust to model variations. Finally, the delay performance of the proposed solutions is compared with conventional baseline schemes for interference networks, and it is illustrated that substantial delay performance gains and energy savings can be achieved.
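
    The flavor of such a learning rule can be sketched as follows; this is a hypothetical, simplified Python illustration of a stochastic policy-gradient update coupled with dual ascent on a Lagrange multiplier for an average-power constraint. The linear-softmax parameterization, the toy delay cost, and the step sizes are assumptions for illustration, not the paper's algorithm.

        # Hypothetical sketch of one decentralized actor updating its own policy
        # parameters and a Lagrange multiplier (LM) from purely local observations.
        import numpy as np

        rng = np.random.default_rng(0)
        power_levels = np.array([0.0, 0.5, 1.0])   # candidate transmit powers
        theta = np.zeros((8, len(power_levels)))   # policy params per local state
        lm = 0.0                                   # Lagrange multiplier
        p_budget = 0.4                             # average-power constraint
        step_theta, step_lm = 0.05, 0.01           # two-timescale step sizes

        def local_state():
            """Stand-in for the (channel, data queue, energy queue) observation."""
            return rng.integers(8)

        for t in range(5000):
            s = local_state()
            probs = np.exp(theta[s]) / np.exp(theta[s]).sum()
            a = rng.choice(len(power_levels), p=probs)
            delay_cost = rng.exponential(1.0 / (0.2 + power_levels[a]))  # toy cost
            # Lagrangian cost: delay plus LM-weighted power usage.
            cost = delay_cost + lm * power_levels[a]
            # REINFORCE-style gradient step on the local policy parameters.
            grad_logpi = -probs
            grad_logpi[a] += 1.0
            theta[s] -= step_theta * cost * grad_logpi
            # Slower dual ascent on the LM enforces the average-power constraint.
            lm = max(0.0, lm + step_lm * (power_levels[a] - p_budget))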

    Parameterized Complexity of Scheduling Chains of Jobs with Delays

    In this paper, we consider the parameterized complexity of the following scheduling problem. We must schedule a number of jobs on m machines, where each job has unit length, and the graph of precedence constraints consists of a set of chains. Each precedence constraint is labelled with an integer that denotes the exact (or minimum) delay between the jobs. We study several cases: delays can be given in unary or in binary, and the case of a single machine is discussed separately. We consider the complexity of this problem parameterized by the number of chains, and by the thickness of the instance, which is the maximum number of chains whose intervals between release date and deadline overlap. We show that this scheduling problem with exact delays in unary is W[t]-hard for all t when parameterized by the thickness, even when we have a single machine (m = 1). When parameterized by the number of chains, this problem is W[1]-complete for a single or a constant number of machines, and W[2]-complete when the number of machines is a variable. The problem with minimum delays, given in unary, parameterized by the number of chains (and, as a simple corollary, also when parameterized by the thickness) is W[1]-hard for a single or a constant number of machines, and W[2]-hard when the number of machines is variable. With a dynamic programming algorithm, one can show membership in XP for exact and minimum delays in unary, for any number of machines, when parameterized by thickness or number of chains. For a single machine, with exact delays in binary, parameterized by the number of chains, membership in XP can be shown by branching and solving a system of difference constraints. For all other cases with delays in binary, membership in XP is open.
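
    For concreteness, the thickness parameter defined above can be computed with a simple sweep over the chains' release-date/deadline intervals; the following small Python sketch assumes each chain is represented just by a (release, deadline) pair, which is an illustrative simplification.

        # Sketch: compute the "thickness" of an instance, i.e. the maximum number
        # of chains whose release-date/deadline intervals overlap at a common time.
        def thickness(windows):
            events = []
            for release, deadline in windows:
                events.append((release, +1))   # a chain's window opens
                events.append((deadline, -1))  # ... and closes
            # Process closings before openings at the same time point so that
            # back-to-back windows are not counted as overlapping.
            events.sort(key=lambda e: (e[0], e[1]))
            best = cur = 0
            for _, delta in events:
                cur += delta
                best = max(best, cur)
            return best

        print(thickness([(0, 5), (3, 9), (4, 7), (8, 12)]))  # -> 3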

    Macroservers: An Execution Model for DRAM Processor-In-Memory Arrays

    The emergence of semiconductor fabrication technology allowing a tight coupling between high-density DRAM and CMOS logic on the same chip has led to the important new class of Processor-In-Memory (PIM) architectures. Newer developments provide powerful parallel processing capabilities on the chip, exploiting the facility to load wide words in single memory accesses and supporting complex address manipulations in the memory. Furthermore, large arrays of PIMs can be arranged into a massively parallel architecture. In this report, we describe an object-based programming model based on the notion of a macroserver. Macroservers encapsulate a set of variables and methods; threads, spawned by the activation of methods, operate asynchronously on the variables' state space. Data distributions provide a mechanism for mapping large data structures across the memory region of a macroserver, while work distributions allow explicit control of bindings between threads and data. Both data and work distributions are first-class objects of the model, supporting the dynamic management of data and threads in memory. This offers the flexibility required for fully exploiting the processing power and memory bandwidth of a PIM array, in particular for irregular and adaptive applications. Thread synchronization is based on atomic methods, condition variables, and futures. A special type of lightweight macroserver allows the formulation of flexible scheduling strategies for the access to resources, using a monitor-like mechanism.
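
    The following Python sketch conveys only the general flavor of such a model: an object that owns a partitioned state space, spawns asynchronous threads through method activation, and returns futures. The class and method names are invented for illustration and are not the macroserver API.

        # Hypothetical sketch loosely inspired by the macroserver model: data is
        # partitioned across blocks ("data distribution"), method activations spawn
        # threads bound to blocks ("work distribution") and return futures, and a
        # lock crudely stands in for atomic methods.
        from concurrent.futures import ThreadPoolExecutor
        import threading

        class ToyMacroserver:
            def __init__(self, data, partitions):
                size = len(data) // partitions
                self.blocks = [data[i*size:(i+1)*size] for i in range(partitions)]
                self.pool = ThreadPoolExecutor(max_workers=partitions)
                self.lock = threading.Lock()
                self.total = 0

            def activate(self, method, block_id):
                # Activating a method spawns an asynchronous thread on one block
                # and returns a future for its result.
                return self.pool.submit(method, self, block_id)

        def accumulate(server, block_id):
            partial = sum(server.blocks[block_id])
            with server.lock:                  # atomic update of shared state
                server.total += partial
            return partial

        ms = ToyMacroserver(list(range(100)), partitions=4)
        futures = [ms.activate(accumulate, b) for b in range(4)]
        print([f.result() for f in futures], ms.total)  # per-block sums and total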