18 research outputs found

    Communication efficient parallel algorithms for nonnumerical computations

    No full text
    The broad goal of this research is to develop a set of paradigms for mapping data-dependent symbolic computations on realistic models of parallel architectures. Within this goal, the thesis represents the initial effort to achieve efficient parallel solutions for a number of non-numerical problems on networks of processors. The specific contributions of the thesis are new parallel algorithms, exhibiting linear speedup on architectures consisting of fixed numbers of processors (i.e., bounded models). The following problems have been considered in the thesis: (1) Determine the minimum spanning tree (MST), and identify the bridges and articulation points (APs) of an undirected weighted graph represented by an $n \times n$ adjacency matrix. (2) The pattern matching problem: Given two strings of characters, of lengths $m$ and $n$ ($n \geq m$) respectively, mark all positions in the second string where there appears an instance of the first string. (3) Sort $n$ elements. For each problem, we use a processor-network consisting of $p$ processors. The network model used in the solution of the first set of problems is the linear array; while that used in the solutions of the second and third problems is a butterfly-connected system. The solutions on the butterfly-connected system apply also on a pipelined hypercube. The performances of the solutions are summarized below. (1) For a graph on $n$ vertices and represented by a distributed adjacency matrix, we present a solution for the MST problem that requires $O(n^2/p + n + p)$ time for execution. We present novel data reduction schemes for identifying the bridges and articulation points. (2) The string matching solution requires time $O((n + m)/p + \log^2 p)$, where $n$ and $m$ are the lengths of the two strings. No previous parallel solutions achieving linear speedups have been proposed on networks of processors. (3) The execution time requirements of the sorting algorithm are $O((n/p) \log n + \log^2 p)$, which represents a linear speedup up to the use of $n/\log n$ processors. A previous solution achieved linear speedup on a $2^{\sqrt{\log n}}$ processor binary-cube. A new parallel merging procedure is presented in the algorithm. Also, as part of the algorithm, a new routing operation called Forward-copy is shown to result in conflict-free communication on the butterfly. (Abstract shortened with permission of author.)
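The linear-speedup range quoted for the sorting algorithm can be checked directly from the stated bound. The short derivation below is only a sanity check; it assumes the usual $\Theta(n \log n)$ sequential comparison-sort time as the baseline, which the abstract does not name explicitly.

```latex
\begin{align*}
T_p &= O\!\left(\frac{n}{p}\log n + \log^2 p\right), \qquad T_1 = \Theta(n \log n) \quad \text{(assumed baseline)} \\
\text{For } p \le \frac{n}{\log n}:\quad
\frac{n}{p}\log n &\ge \log^2 n \ge \log^2 p \\
\Rightarrow\; T_p &= O\!\left(\frac{n \log n}{p}\right) = O\!\left(\frac{T_1}{p}\right)
\end{align*}
```

In other words, as long as $p \le n/\log n$, the $\log^2 p$ communication term never dominates the per-processor work term.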

    Extended queuing network modeling

    No full text
    Evaluating the performance of a system is of central concern in making engineering decisions. When direct measurement of performance is not possible or feasible, evaluation consists of two phases: specification of an appropriate performance model, and evaluation of the model to obtain the performance measures. Broadly, a performance model can be evaluated by exact or approximate analysis, or by simulation. A class of models popular for evaluation of a number of systems, computer systems in particular, is that of Extended Queuing Network (EQN) Models. Software tools are typically used for building EQN models for evaluation through analyses or simulation. This thesis describes an effort in experimenting with an approach to the design and implementation of a tool for performance evaluation of EQN models via simulation. The objective in this effort is to design a tool that is easy and intuitive to use, yet versatile and powerful in its modeling capabilities. The tool we have implemented is called Graphical Input Simulation Tool (GIST). GIST meets its design objectives by (1) providing a pair of user interfaces that are capable of accepting the abstract EQN model specification directly, are easy and intuitive to learn and use, and are helpful in quick model specification with reduced likelihood of semantic and syntactic specification errors, and (2) incorporating into the set of EQN objects it provides, the capabilities perceived necessary for realistic modeling of activities that characterize the systems of interest

    WrAP: Managing Byte-Addressable Persistent Memory

    No full text
    Advances in memory technology are promising the availability of byte-addressable persistent memory as an integral component of future computing platforms. This change has significant implications for software that has traditionally made a sharp distinction between durable and volatile storage. In this paper we describe a software-hardware architecture for persistent memory that provides atomicity and durability while simultaneously ensuring that fast paths through the cache, DRAM, and persistent memory layers are not slowed down by burdensome buffering or double-copying requirements.

    Continuous checkpointing of HTM transactions in NVM

    No full text
    This paper addresses the challenges of coupling byte-addressable non-volatile memory (NVM) and hardware transactional memory (HTM) in high-performance transaction processing. We first show that HTM transactions can be ordered using existing processor instructions without any hardware changes. In contrast, existing solutions posit changes to HTM mechanisms in the form of special instructions or modified functionality. We exploit the ordering mechanism to design a novel persistence method that decouples HTM concurrency from back-end NVM operations. Failure atomicity is achieved using redo logging coupled with aliasing to guard against mistimed cache evictions. Our algorithm uses efficient lock-free mechanisms with bounded static memory requirements. We evaluated our approach using both micro-benchmarks and benchmarks in the STAMP suite, and showed that it compares well with standard (volatile) HTM transactions. We also showed that it yields significant gains in throughput and latency in comparison with persistent transactional locking.

    Transaction local aliasing in storage class memory

    No full text
    This paper describes a lightweight software library to solve the challenges [6], [3], [1], [5], [2] of programming storage class memory (SCM). It provides primitives to demarcate failure-atomic code regions. SCM loads and stores within each demarcated code region (called a “wrap”) are routed through the library, which buffers updates and transmits them to SCM locations asynchronously while allowing their speedy propagation from writers to readers through CPU caches
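The idea of routing loads and stores inside a demarcated region through a buffering layer can be sketched in a few lines. This is a toy illustration only, not the library's API: the class name `Wrap`, its `load`/`store` methods, the dict standing in for SCM, and the list standing in for a redo log are all hypothetical, and the write-back here happens synchronously when the region closes rather than asynchronously as the abstract describes.

```python
class Wrap:
    """Toy failure-atomic region (hypothetical names, not the paper's API)."""

    def __init__(self, scm, log):
        self.scm = scm      # dict standing in for SCM locations (assumption)
        self.log = log      # list standing in for a persistent redo log (assumption)
        self.alias = {}     # region-local buffer: address -> pending value

    def store(self, addr, value):
        # Buffer the update instead of writing the SCM location directly.
        self.alias[addr] = value

    def load(self, addr):
        # Readers inside the region see buffered values first.
        if addr in self.alias:
            return self.alias[addr]
        return self.scm.get(addr)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            # Record all updates as one unit, then copy them to "SCM".
            self.log.append(dict(self.alias))
            self.scm.update(self.alias)
        # On an exception the buffered updates are simply discarded.
        self.alias.clear()
        return False


scm = {"x": 1}
log = []
with Wrap(scm, log) as w:
    w.store("x", 2)
    inside = w.load("x")       # buffered value, visible inside the region
    before_close = scm["x"]    # backing location is still unchanged here
```

After the region closes, `scm["x"]` holds the new value and `log` contains a single entry grouping the region's updates; a region that raises leaves `scm` untouched.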

    Workload Decomposition for QoS in Hosted Storage Services

    No full text
    The growing popularity of hosted storage services and shared storage infrastructure in data centers is driving the recent interest in performance isolation and QoS in storage systems. Due to the bursty nature of storage workloads, meeting the traditional response-time Service Level Agreements requires significant overprovisioning of the server capacity. We present a graduated, distribution-based QoS specification for storage servers that provides cost benefits over traditional QoS models. Our method, RTT, partitions the workload to minimize the capacity required to meet the response-time requirements of any specified fraction of the requests. Categories and Subject Descriptors: C.4 [Performance of Systems]: Modeling techniques.
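Why a bursty workload forces overprovisioning, and why serving only a fraction of requests within the deadline needs less capacity, can be seen with a small simulation. The sketch below is a simplified illustration and not the paper's RTT algorithm: it assumes unit-cost requests, a single FCFS server, and a greedy rule that diverts any request that would miss the deadline. The function name and parameters are hypothetical.

```python
def admitted_fraction(arrivals, capacity, delta):
    """Fraction of requests finished within `delta` seconds of arrival.

    Assumptions (illustrative only): every request costs 1/capacity seconds,
    the server is FCFS, and a request that would miss its deadline is
    diverted to an overflow class and consumes no primary capacity.
    """
    service = 1.0 / capacity
    free_at = 0.0      # time at which the admitted backlog drains
    admitted = 0
    for t in sorted(arrivals):
        finish = max(t, free_at) + service
        if finish - t <= delta + 1e-12:   # small tolerance for float rounding
            free_at = finish
            admitted += 1
    return admitted / len(arrivals)


# A burst of four requests at t=0 followed by one request at t=1.
arrivals = [0.0, 0.0, 0.0, 0.0, 1.0]
at_capacity_4 = admitted_fraction(arrivals, capacity=4, delta=0.5)
at_capacity_8 = admitted_fraction(arrivals, capacity=8, delta=0.5)
```

In this toy trace, doubling the capacity is what it takes to bring the whole burst under the deadline, while the lower capacity already serves the non-bursty part of the workload on time.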

    Persisting in-memory databases using SCM

    No full text
    Big Data applications need to be able to access large amounts of variable data as fast as possible. Emerging Storage Class Memory (SCM) fits this need by making memory available in large capacity while making changes endure as a seamless continuation of load-store accesses through processor caches. However, when writing values into a persistent memory tier, programmers are faced with the dual problems of controlling untimely cache evictions that might commit changes prematurely, and of grouping changes and making them durable as a unit so that consistency can be guaranteed in the event of sudden failure. In this paper, we present various methods to achieve high-performance byte-addressable persistence for an in-memory data store. We chose Redis, a popular high-performance memory-oriented key-value database. We modified its source code to use SCM such that updates to data and structures are performed in a failure-resilient manner. We evaluated the changes using both internal benchmarks and the Yahoo! Cloud Serving Benchmark (YCSB). We found that even though Redis uses many SCM read operations, it can benefit from highly optimized persistent SCM write-based approaches, especially when SCM write times are longer than DRAM write times. The paper presents an innovative Local Alias Table Batched (LATB) method, and shows that it outperforms the alternatives.

    Bridging the programming gap between persistent and volatile memory using WrAP

    No full text
    Advances in memory technology are promising the availability of byte-addressable persistent memory as an integral component of future computing platforms. This change has significant implications for software that has traditionally made a sharp distinction between durable and volatile storage. In this paper we describe a software-hardware architecture, WrAP, for persistent memory that provides atomicity and durability while simultaneously ensuring that fast paths through the cache, DRAM, and persistent memory layers are not slowed down by burdensome buffering or double-copying requirements. Trace-driven simulation of transactional data structures indicates the potential for significant performance gains using the WrAP approach.