19,913 research outputs found

    Scalable data abstractions for distributed parallel computations

    The ability to express a program as a hierarchical composition of parts is an essential tool in managing the complexity of software, and a key abstraction it provides is the separation of the representation of data from the computation. Many current parallel programming models use a shared-memory model to provide data abstraction, but this does not scale well to large numbers of cores because of non-determinism and access latency. This paper proposes a simple programming model that allows scalable parallel programs to be expressed with distributed representations of data, giving the programmer the flexibility to employ shared or distributed styles of data-parallelism where appropriate. The model admits an efficient implementation and, given a small set of primitive capabilities in the hardware, can be compiled to operate directly on the hardware, in the same way that stack-based allocation operates for subroutines on sequential machines.
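    A minimal sketch of the style of data abstraction the abstract argues for, using Python's multiprocessing module to stand in for distributed memory: each partition of the data lives with a separate worker, and computation is expressed against the abstraction rather than against shared memory. The DistributedArray class, its map/reduce interface, and the partitioning scheme are illustrative assumptions, not the paper's API.

    ```python
    # Hypothetical sketch (not the paper's API): a "distributed" array whose
    # partitions are processed in separate worker processes, so data-parallel
    # computation is expressed without a shared-memory model.
    from multiprocessing import Pool

    def _apply(args):
        fn, partition = args
        return [fn(x) for x in partition]

    def square(x):
        return x * x

    class DistributedArray:
        def __init__(self, data, num_partitions=4):
            # Split the data into contiguous partitions, one per worker.
            step = max(1, (len(data) + num_partitions - 1) // num_partitions)
            self.partitions = [data[i:i + step] for i in range(0, len(data), step)]

        def map(self, fn):
            # Apply fn independently to every partition (distributed data-parallelism).
            with Pool(len(self.partitions)) as pool:
                mapped = pool.map(_apply, [(fn, p) for p in self.partitions])
            result = DistributedArray([], num_partitions=1)
            result.partitions = mapped
            return result

        def reduce(self, fn, initial):
            # Combine partition contents on the coordinating process.
            acc = initial
            for part in self.partitions:
                for x in part:
                    acc = fn(acc, x)
            return acc

    if __name__ == "__main__":
        arr = DistributedArray(list(range(1000)), num_partitions=4)
        total = arr.map(square).reduce(lambda a, b: a + b, 0)
        print(total)  # sum of squares of 0..999
    ```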

    Decentralized Delay Optimal Control for Interference Networks with Limited Renewable Energy Storage

    In this paper, we consider delay minimization for interference networks with renewable energy sources, where the transmission power of a node comes from both the conventional utility power (AC power) and the renewable energy source. We assume the transmission power of each node is a function of the local channel state, local data queue state and local energy queue state only. We consider two delay optimization formulations, namely the decentralized partially observable Markov decision process (DEC-POMDP) and the non-cooperative partially observable stochastic game (POSG). In the DEC-POMDP formulation, we derive a decentralized online learning algorithm to determine the control actions and Lagrangian multipliers (LMs) simultaneously, based on the policy gradient approach. Under some mild technical conditions, the proposed decentralized policy gradient algorithm converges almost surely to a locally optimal solution. In the non-cooperative POSG formulation, the transmitter nodes are non-cooperative. We extend the decentralized policy gradient solution and establish the technical proof of almost-sure convergence of the learning algorithms. In both cases, the solutions are very robust to model variations. Finally, the delay performance of the proposed solutions is compared with conventional baseline schemes for interference networks, and it is illustrated that substantial delay performance gains and energy savings can be achieved.
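    The following is a minimal, hypothetical sketch of the kind of per-node online update such a DEC-POMDP policy-gradient formulation suggests: a softmax policy over discrete transmit-power levels, driven only by the quantized local state (channel, data queue, energy queue), is adjusted by a policy-gradient step on a delay cost, while a Lagrange multiplier pricing average power is updated on a slower timescale. The state quantization, reward shape, power levels, and step sizes are assumptions for illustration, not the paper's algorithm.

    ```python
    # Hypothetical per-node learning step (illustrative only): softmax policy
    # over discrete power levels, REINFORCE-style update on a Lagrangian cost,
    # plus a projected Lagrange-multiplier update for an average-power budget.
    import numpy as np

    POWER_LEVELS = np.array([0.0, 0.5, 1.0, 2.0])   # assumed discrete actions (W)

    class NodeAgent:
        def __init__(self, num_states, lr_theta=0.01, lr_lambda=0.001, power_budget=0.8):
            # One row of action preferences per quantized local state.
            self.theta = np.zeros((num_states, len(POWER_LEVELS)))
            self.lam = 0.0                    # Lagrange multiplier for the power constraint
            self.lr_theta = lr_theta
            self.lr_lambda = lr_lambda
            self.power_budget = power_budget

        def policy(self, state):
            prefs = self.theta[state]
            exp = np.exp(prefs - prefs.max())
            return exp / exp.sum()

        def act(self, state, rng):
            return rng.choice(len(POWER_LEVELS), p=self.policy(state))

        def update(self, state, action, queue_delay_cost):
            # Lagrangian cost = delay cost + lam * transmit power.
            power = POWER_LEVELS[action]
            cost = queue_delay_cost + self.lam * power

            # Policy-gradient step on the softmax preferences (descending the cost).
            probs = self.policy(state)
            grad = -probs
            grad[action] += 1.0               # grad of log pi(action|state) w.r.t. prefs
            self.theta[state] -= self.lr_theta * cost * grad

            # Slower-timescale multiplier update, projected onto lam >= 0.
            self.lam = max(0.0, self.lam + self.lr_lambda * (power - self.power_budget))

    # Example: one learning step from a locally observed state and cost.
    rng = np.random.default_rng(0)
    agent = NodeAgent(num_states=16)
    s = 3                                     # quantized (CSI, QSI, energy-queue) index
    a = agent.act(s, rng)
    agent.update(s, a, queue_delay_cost=2.0)
    ```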

    Scalability of broadcast performance in wireless network-on-chip

    Networks-on-Chip (NoCs) are currently the paradigm of choice to interconnect the cores of a chip multiprocessor. However, conventional NoCs may not suffice to fulfill the on-chip communication requirements of processors with hundreds or thousands of cores. The main reason is that the performance of such networks drops as the number of cores grows, especially in the presence of multicast and broadcast traffic. This not only limits the scalability of current multiprocessor architectures, but also sets a performance wall that prevents the development of architectures that generate moderate-to-high levels of multicast. In this paper, a Wireless Network-on-Chip (WNoC) in which all cores share a single broadband channel is presented. This design is conceived to provide low latency and ordered delivery for multicast/broadcast traffic, in an attempt to complement a wireline NoC that will transport the rest of the communication flows. To assess the feasibility of this approach, the network performance of the WNoC is analyzed as a function of the system size and the channel capacity, and then compared to that of wireline NoCs with embedded multicast support. Based on this evaluation, preliminary results on the potential performance of the proposed hybrid scheme are provided, together with guidelines for the design of MAC protocols for WNoC.
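    As a rough illustration of the scaling question the abstract raises (broadcast latency versus system size and channel capacity), here is a small back-of-the-envelope model that treats the single shared wireless channel as an M/D/1 queue. The slotted service model, packet size, per-core broadcast rate, and capacity values are assumptions for illustration and not the paper's evaluation methodology or results.

    ```python
    # Hypothetical scaling model (not the paper's results): broadcast on one
    # shared wireless channel modelled as an M/D/1 queue, showing how latency
    # grows with core count N and shrinks with channel capacity C.
    def broadcast_latency_ns(num_cores, channel_gbps, packet_bits=80, rate_per_core=1e6):
        service_s = packet_bits / (channel_gbps * 1e9)        # serialization time
        load = num_cores * rate_per_core * service_s          # offered load (utilization)
        if load >= 1.0:
            return float("inf")                               # channel saturated
        wait_s = load / (2.0 * (1.0 - load)) * service_s      # M/D/1 queueing delay
        return (service_s + wait_s) * 1e9                     # total latency in ns

    for n in (16, 64, 256, 1024):
        for cap in (10, 40):                                  # channel capacity in Gb/s
            print(f"N={n:4d}  C={cap} Gb/s  latency={broadcast_latency_ns(n, cap):8.2f} ns")
    ```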

    Distributive Stochastic Learning for Delay-Optimal OFDMA Power and Subband Allocation

    In this paper, we consider the distributive queue-aware power and subband allocation design for a delay-optimal OFDMA uplink system with one base station, K users and N_F independent subbands. Each mobile has an uplink queue with heterogeneous packet arrivals and delay requirements. We model the problem as an infinite-horizon average-reward Markov Decision Problem (MDP) where the control actions are functions of the instantaneous Channel State Information (CSI) as well as the joint Queue State Information (QSI). To address the distributive requirement and the issue of exponential memory requirement and computational complexity, we approximate the subband allocation Q-factor by the sum of the per-user subband allocation Q-factors, and derive a distributive online stochastic learning algorithm to estimate the per-user Q-factors and the Lagrange multipliers (LMs) simultaneously and determine the control actions using an auction mechanism. We show that under the proposed auction mechanism, the distributive online learning converges almost surely (with probability 1). For illustration, we apply the proposed distributive stochastic learning framework to an application example with exponential packet size distribution. We show that the delay-optimal power control has a multi-level water-filling structure where the CSI determines the instantaneous power allocation and the QSI determines the water level. The proposed algorithm has linear signaling overhead and computational complexity O(KN), which is desirable from an implementation perspective. Comment: To appear in Transactions on Signal Processing.
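    A tiny illustration of the structural result described above: in a multi-level water-filling rule, the channel gain fixes the instantaneous allocation while the water level grows with the local queue backlog. The specific water_level mapping and the numbers are assumptions chosen for illustration, not the paper's derived policy.

    ```python
    # Hypothetical multi-level water-filling sketch: CSI (channel gain) sets the
    # instantaneous allocation, QSI (queue length) sets the water level.
    def water_level(queue_len, base=0.5, slope=0.3):
        return base + slope * queue_len            # deeper backlog -> higher water level

    def tx_power(channel_gain, queue_len, noise=1.0):
        # Classic water-filling term, clipped at zero when the channel is too weak.
        return max(0.0, water_level(queue_len) - noise / channel_gain)

    for q in (0, 2, 5):
        for g in (0.5, 1.0, 4.0):
            print(f"queue={q}  gain={g:.1f}  power={tx_power(g, q):.2f}")
    ```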

    Optimal Distributed Scheduling in Wireless Networks under the SINR interference model

    Radio resource sharing mechanisms are key to ensuring good performance in wireless networks. In their seminal paper \cite{tassiulas1}, Tassiulas and Ephremides introduced the Maximum Weighted Scheduling algorithm and proved its throughput-optimality. Since then, there have been extensive research efforts to devise distributed implementations of this algorithm. Recently, distributed adaptive CSMA scheduling schemes \cite{jiang08} have been proposed and shown to be optimal, without the need for message passing among transmitters. However, their analysis relies on the assumption that interference can be accurately modelled by a simple interference graph. In this paper, we consider the more realistic and challenging SINR interference model. We present the first distributed scheduling algorithms that (i) are optimal under the SINR interference model and (ii) do not require any message passing. They are based on a combination of a simple and efficient power allocation strategy, referred to as Power Packing, and randomization techniques. We first devise algorithms that are rate-optimal in the sense that they perform as well as the best centralized scheduling schemes in scenarios where each transmitter is aware of the rate at which it should send packets to the corresponding receiver. We then extend these algorithms so that they reach throughput-optimality.
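    For concreteness, here is a small building block that any SINR-based scheduler relies on: checking whether a set of simultaneous transmissions satisfies the SINR thresholds under the physical interference model. This is a generic feasibility check, not the paper's Power Packing allocation or randomized scheduling; the gain matrix, noise level, and threshold are illustrative assumptions.

    ```python
    # Hypothetical SINR feasibility check under the physical interference model
    # (not the paper's Power Packing algorithm).
    import numpy as np

    def sinr_feasible(powers, gains, noise, threshold):
        """powers[i]: transmit power of link i (0 if idle);
        gains[i][j]: path gain from transmitter j to receiver i."""
        powers = np.asarray(powers, dtype=float)
        gains = np.asarray(gains, dtype=float)
        for i, p in enumerate(powers):
            if p == 0.0:
                continue                                  # link i is not scheduled
            signal = gains[i, i] * p
            interference = gains[i] @ powers - signal     # all other links' power at receiver i
            if signal / (noise + interference) < threshold:
                return False
        return True

    # Two links transmitting together: strong direct gains, weak cross gains.
    gains = [[1.0, 0.05],
             [0.08, 1.0]]
    print(sinr_feasible([1.0, 1.0], gains, noise=0.1, threshold=3.0))
    ```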

    EbbRT: a customizable operating system for cloud applications

    Efficient use of hardware requires that operating system components be customized to the application workload. Our general-purpose operating systems are ill-suited for this task. We present Genesis, a new operating system that enables per-application customization for cloud applications. Genesis achieves this through a novel heterogeneous distributed structure, a partitioned object model, and an event-driven execution environment. This paper describes the design and prototype implementation of Genesis, and evaluates its ability to improve the performance of common cloud applications. The evaluation of the Genesis prototype demonstrates that memcached, run within a VM, can outperform memcached run on an unvirtualized Linux. The prototype evaluation also demonstrates a 14% performance improvement on a V8 JavaScript engine benchmark, and a node.js webserver that achieves a 50% reduction in 99th-percentile latency compared to running on Linux.
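    To give a flavour of the partitioned object model mentioned above, here is a minimal user-space sketch in Python rather than systems code: an object keeps an independent per-core representative for the hot path and only aggregates across representatives on the rare read path. The class, its interface, and the use of threads to stand in for cores are assumptions for illustration, not Genesis/EbbRT code.

    ```python
    # Hypothetical partitioned-object sketch: per-core shards absorb updates
    # locally; the slow read path gathers state from every representative.
    import threading

    class PartitionedCounter:
        def __init__(self, num_cores):
            # One shard per "core"; increments never contend across shards.
            self.shards = [0] * num_cores
            self.locks = [threading.Lock() for _ in range(num_cores)]

        def increment(self, core_id):
            with self.locks[core_id]:        # local, uncontended in the common case
                self.shards[core_id] += 1

        def read(self):
            # Slow path: aggregate across all per-core representatives.
            return sum(self.shards)

    counter = PartitionedCounter(num_cores=4)
    threads = [threading.Thread(target=lambda c=c: [counter.increment(c) for _ in range(1000)])
               for c in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter.read())   # 4000
    ```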