1,176 research outputs found

    Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems

    Full text link
    In modern Commercial Off-The-Shelf (COTS) multicore systems, each core can generate many parallel memory requests at a time. The processing of these parallel requests in the DRAM controller greatly affects the memory interference delay experienced by running tasks on the platform. In this paper, we model a modern COTS multicore system which has a nonblocking last-level cache (LLC) and a DRAM controller that prioritizes reads over writes. To minimize interference, we focus on LLC and DRAM bank partitioned systems. Based on the model, we propose an analysis that computes a safe upper bound for the worst-case memory interference delay. We validated our analysis on a real COTS multicore platform with a set of carefully designed synthetic benchmarks as well as SPEC2006 benchmarks. Evaluation results show that our analysis is more accurately capture the worst-case memory interference delay and provides safer upper bounds compared to a recently proposed analysis which significantly under-estimate the delay.Comment: Technical Repor

    File-System Workload on a Scientific Multiprocessor

    Get PDF
    Many scientific applications have intense computational and I/O requirements. Although multiprocessors have permitted astounding increases in computational performance, the formidable I/O needs of these applications cannot be met by current multiprocessors a their I/O subsystems. To prevent I/O subsystems from forever bottlenecking multiprocessors and limiting the range of feasible applications, new I/O subsystems must be designed. The successful design of computer systems (both hardware and software) depends on a thorough understanding of their intended use. A system designer optimizes the policies and mechanisms for the cases expected to most common in the user's workload. In the case of multiprocessor file systems, however, designers have been forced to build file systems based only on speculation about how they would be used, extrapolating from file-system characterizations of general-purpose workloads on uniprocessor and distributed systems or scientific workloads on vector supercomputers (see sidebar on related work). To help these system designers, in June 1993 we began the Charisma Project, so named because the project sought to characterize 1/0 in scientific multiprocessor applications from a variety of production parallel computing platforms and sites. The Charisma project is unique in recording individual read and write requests-in live, multiprogramming, parallel workloads (rather than from selected or nonparallel applications). In this article, we present the first results from the project: a characterization of the file-system workload an iPSC/860 multiprocessor running production, parallel scientific applications at NASA's Ames Research Center

    Scalable, reliable, power-efficient communication for hardware transactional memory

    Get PDF
    Journal ArticleIn a hardware transactional memory system with lazy versioning and lazy conflict detection, the process of transaction commit can emerge as a bottleneck. This is especially true for a large-scale distributed memory system where multiple transactions may attempt to commit simultaneously and co-ordination is required before allowing commits to proceed in parallel. In this paper, we propose novel algorithms to implement commit that are more scalable (in terms of delay and energy) and are free of deadlocks/livelocks. We show that these algorithms have similarities with the token cache coherence concept and leverage these similarities to extend the algorithms to handle message loss and starvation scenarios. The proposed algorithms improve upon the state-of-the-art by yielding up to a 7X reduction in commit delay and up to a 48X reduction in network messages. These translate into overall performance improvements of up to 66% (for synthetic workloads with average transaction length of 200 cycles), 35% (for average transaction length of 1000 cycles), 8% (for average transaction length of 4000 cycles), and 41% (for a collection of SPLASH-2 programs)

    C-MOS array design techniques: SUMC multiprocessor system study

    Get PDF
    The current capabilities of LSI techniques for speed and reliability, plus the possibilities of assembling large configurations of LSI logic and storage elements, have demanded the study of multiprocessors and multiprocessing techniques, problems, and potentialities. Evaluated are three previous systems studies for a space ultrareliable modular computer multiprocessing system, and a new multiprocessing system is proposed that is flexibly configured with up to four central processors, four 1/0 processors, and 16 main memory units, plus auxiliary memory and peripheral devices. This multiprocessor system features a multilevel interrupt, qualified S/360 compatibility for ground-based generation of programs, virtual memory management of a storage hierarchy through 1/0 processors, and multiport access to multiple and shared memory units

    A file server for the DistriX prototype : a multitransputer UNIX system

    Get PDF
    Bibliography: pages 90-94.The DISTRIX operating system is a multiprocessor distributed operating system based on UNIX. It consists of a number of satellite processors connected to central servers. The system is derived from the MINIX operating system, compatible with UNIX Version 7. A remote procedure call interface is used in conjunction with a system wide, end-to-end communication protocol that connects satellite processors to the central servers. A cached file server provides access to all files and devices at the UNIX system call level. The design of the file server is discussed in depth and the performance evaluated. Additional information is given about the software and hardware used during the development of the project. The MINIX operating system has proved to be a good choice as the software base, but certain features have proved to be poorer. The Inmos transputer emerges as a processor with many useful features that eased the implementation

    Efficiently mapping high-performance early vision algorithms onto multicore embedded platforms

    Get PDF
    The combination of low-cost imaging chips and high-performance, multicore, embedded processors heralds a new era in portable vision systems. Early vision algorithms have the potential for highly data-parallel, integer execution. However, an implementation must operate within the constraints of embedded systems including low clock rate, low-power operation and with limited memory. This dissertation explores new approaches to adapt novel pixel-based vision algorithms for tomorrow's multicore embedded processors. It presents : - An adaptive, multimodal background modeling technique called Multimodal Mean that achieves high accuracy and frame rate performance with limited memory and a slow-clock, energy-efficient, integer processing core. - A new workload partitioning technique to optimize the execution of early vision algorithms on multi-core systems. - A novel data transfer technique called cat-tail dma that provides globally-ordered, non-blocking data transfers on a multicore system. By using efficient data representations, Multimodal Mean provides comparable accuracy to the widely used Mixture of Gaussians (MoG) multimodal method. However, it achieves a 6.2x improvement in performance while using 18% less storage than MoG while executing on a representative embedded platform. When this algorithm is adapted to a multicore execution environment, the new workload partitioning technique demonstrates an improvement in execution times of 25% with only a 125 ms system reaction time. It also reduced the overall number of data transfers by 50%. Finally, the cat-tail buffering technique reduces the data-transfer latency between execution cores and main memory by 32.8% over the baseline technique when executing Multimodal Mean. This technique concurrently performs data transfers with code execution on individual cores, while maintaining global ordering through low-overhead scheduling to prevent collisions.Ph.D.Committee Chair: Wills, Scott; Committee Co-Chair: Wills, Linda; Committee Member: Bader, David; Committee Member: Davis, Jeff; Committee Member: Hamblen, James; Committee Member: Lanterman, Aaro

    Telemetry downlink interfaces and level-zero processing

    Get PDF
    The technical areas being investigated are as follows: (1) processing of space to ground data frames; (2) parallel architecture performance studies; and (3) parallel programming techniques. Additionally, the University administrative details and the technical liaison between New Mexico State University and Goddard Space Flight Center are addressed

    A Classification and Survey of Computer System Performance Evaluation Techniques

    Get PDF
    Classification and survey of computer system performance evaluation technique
    corecore