610 research outputs found

    The "MIND" Scalable PIM Architecture

    Get PDF
    MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on each chip and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real time execution, and active power management. This paper describes the major elements and operational methods of the MIND architecture

    GPUs as Storage System Accelerators

    Full text link
    Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, a one order of magnitude higher peak performance than traditional CPUs. This drop in the cost of computation, as any order-of-magnitude drop in the cost per unit of performance for a class of system components, triggers the opportunity to redesign systems and to explore new ways to engineer them to recalibrate the cost-to-performance relation. This project explores the feasibility of harnessing GPUs' computational power to improve the performance, reliability, or security of distributed storage systems. In this context, we present the design of a storage system prototype that uses GPU offloading to accelerate a number of computationally intensive primitives based on hashing, and introduce techniques to efficiently leverage the processing power of GPUs. We evaluate the performance of this prototype under two configurations: as a content addressable storage system that facilitates online similarity detection between successive versions of the same file and as a traditional system that uses hashing to preserve data integrity. Further, we evaluate the impact of offloading to the GPU on competing applications' performance. Our results show that this technique can bring tangible performance gains without negatively impacting the performance of concurrently running applications.Comment: IEEE Transactions on Parallel and Distributed Systems, 201

    CARISMA: a context-sensitive approach to race-condition sample-instance selection for multithreaded applications

    Get PDF
    Dynamic race detectors can explore multiple thread schedules of a multithreaded program over the same input to detect data races. Although existing sampling-based precise race detectors reduce overheads effectively so that lightweight precise race detection can be performed in testing or post-deployment environments, they are ineffective in detecting races if the sampling rates are low. This paper presents CARISMA to address this problem. CARISMA exploits the insight that along an execution trace, a program may potentially handle many accesses to the memory locations created at the same site for similar purposes. Iterating over multiple execution trials of the same input, CARISMA estimates and distributes the sampling budgets among such location creation sites, and probabilistically collects a fraction of all accesses to the memory locations associated with such sites for subsequent race detection. Our experiment shows that, compared with PACER on the same platform and at the same sampling rate (such as 1%), CARISMA is significantly more effective. © 2012 ACM.postprin

    A Context-Sensitive Coverage Criterion for Test Suite Reduction

    Get PDF
    Modern software is increasingly developed using multi-language implementations, large supporting libraries and frameworks, callbacks, virtual function calls, reflection, multithreading, and object- and aspect-oriented programming. The predominant example of such software is the graphical user interface (GUI), which is used as a front-end to most of today's software applications. The characteristics of GUIs and other modern software present new challenges to software testing. Because recently developed techniques for automated test case generation can generate more tests than are practical to regularly execute, one important challenge is test suite reduction. Test suite reduction seeks to decrease the size of a test suite without overly compromising its original fault detection ability. This research advances the state-of-the-art in test suite reduction by empirically studying a coverage criterion which considers the context in which program concepts are covered. Conventional approaches to test suite reduction were developed and evaluated on batch-style applications and, due to the aforementioned considerations, are not always easily applicable to modern software. Furthermore, many existing techniques fail to consider the context in which code executes inside an event-driven paradigm, where programs wait for and interactively respond to user- and system-generated events. Consequently, they yield reduced test suites with severely impaired fault detection ability. The novel feature of this research is a test suite reduction technique based on the call stack coverage criterion which addresses many of the challenges associated with coverage-based test suite reduction in modern applications. Results show that reducing test suites while maintaining call stack coverage yields good tradeoffs between size reduction and fault detection effectiveness compared to traditional techniques. The output of this research includes models, metrics, algorithms, and techniques based upon this approach

    Rigorous concurrency analysis of multithreaded programs

    Get PDF
    technical reportThis paper explores the practicality of conducting program analysis for multithreaded software using constraint solv- ing. By precisely defining the underlying memory consis- tency rules in addition to the intra-thread program seman- tics, our approach orders a unique advantage for program ver- ification | it provides an accurate and exhaustive coverage of all thread interleavings for any given memory model. We demonstrate how this can be achieved by formalizing sequen- tial consistency for a source language that supports control branches and a monitor-style mutual exclusion mechanism. We then discuss how to formulate programmer expectations as constraints and propose three concrete applications of this approach: execution validation, race detection, and atom- icity analysis. Finally, we describe the implementation of a formal analysis tool using constraint logic programming, with promising initial results for reasoning about small but non-trivial concurrent programs

    A Configurable Transport Layer for CAF

    Full text link
    The message-driven nature of actors lays a foundation for developing scalable and distributed software. While the actor itself has been thoroughly modeled, the message passing layer lacks a common definition. Properties and guarantees of message exchange often shift with implementations and contexts. This adds complexity to the development process, limits portability, and removes transparency from distributed actor systems. In this work, we examine actor communication, focusing on the implementation and runtime costs of reliable and ordered delivery. Both guarantees are often based on TCP for remote messaging, which mixes network transport with the semantics of messaging. However, the choice of transport may follow different constraints and is often governed by deployment. As a first step towards re-architecting actor-to-actor communication, we decouple the messaging guarantees from the transport protocol. We validate our approach by redesigning the network stack of the C++ Actor Framework (CAF) so that it allows to combine an arbitrary transport protocol with additional functions for remote messaging. An evaluation quantifies the cost of composability and the impact of individual layers on the entire stack

    Pathway: a fast and flexible unified stream data processing framework for analytical and Machine Learning applications

    Full text link
    We present Pathway, a new unified data processing framework that can run workloads on both bounded and unbounded data streams. The framework was created with the original motivation of resolving challenges faced when analyzing and processing data from the physical economy, including streams of data generated by IoT and enterprise systems. These required rapid reaction while calling for the application of advanced computation paradigms (machinelearning-powered analytics, contextual analysis, and other elements of complex event processing). Pathway is equipped with a Table API tailored for Python and Python/SQL workflows, and is powered by a distributed incremental dataflow in Rust. We describe the system and present benchmarking results which demonstrate its capabilities in both batch and streaming contexts, where it is able to surpass state-of-the-art industry frameworks in both scenarios. We also discuss streaming use cases handled by Pathway which cannot be easily resolved with state-of-the-art industry frameworks, such as streaming iterative graph algorithms (PageRank, etc.)
    • …
    corecore