200 research outputs found

    SGER: dynamic partitioned global address spaces for future large scale systems

    Get PDF
    Issued as final reportNational Science Foundation (U.S.

    Dynamic partitioned global address spaces for high-efficiency computing

    Get PDF
    The current trend of ever larger clusters and data centers has coincided with a dramatic increase in the cost and power of these installations. While many efficiency improvements have focused on processor power and cooling costs, reducing the cost and power consumption of high-performance memory has mostly been overlooked. This thesis proposes a new address translation model called Dynamic Partitioned Global Address Space (DPGAS) that extends the ideas of NUMA and software-based approaches to create a high-performance hardware model that can be used to reduce the overall cost and power of memory in larger server installations. A memory model and hardware implementation of DPGAS is developed, and simulations of memory-intensive workloads are used to show potential cost and power reductions when DPGAS is integrated into a server environment.M.S.Committee Chair: Yalamanchili, Sudhakar; Committee Member: Riley, George; Committee Member: Schimmel, Davi

    Real-time systems on multicore platforms: managing hardware resources for predictable execution

    Full text link
    Shared hardware resources in commodity multicore processors are subject to contention from co-running threads. The resultant interference can lead to highly-variable performance for individual applications. This is particularly problematic for real-time applications, which require predictable timing guarantees. It also leads to a pessimistic estimate of the Worst Case Execution Time (WCET) for every real-time application. More CPU time needs to be reserved, thus less applications can enter the system. As the average execution time is usually far less than the WCET, a significant amount of reserved CPU resource would be wasted. Previous works have attempted partitioning the shared resources, amongst either CPUs or processes, to improve performance isolation. However, they have not proven to be both efficient and effective. In this thesis, we propose several mechanisms and frameworks that manage the shared caches and memory buses on multicore platforms. Firstly, we introduce a multicore real-time scheduling framework with the foreground/background scheduling model. Combining real-time load balancing with background scheduling, CPU utilization is greatly improved. Besides, a memory bus management mechanism is implemented on top of the background scheduling, making sure bus contention is under control while utilizing unused CPU cycles. Also, cache partitioning is thoroughly studied in this thesis, with a cache-aware load balancing algorithm and a dynamic cache partitioning framework proposed. Lastly, we describe a system architecture to integrate the above solutions all together. It tackles one of the toughest problems in OS innovation, legacy support, by converting existing OSes into libraries in a virtualized environment. Thus, within a single multicore platform, we benefit from the fine-grained resource control of a real-time OS and the richness of functionality of a general-purpose OS

    Improvement Energy Efficiency for a Hybrid Multibank Memory in Energy Critical Applications

    Get PDF
    High performance, low power multiprocessor/multibank memory system requires a compiler that provides efficient data partitioning and mapping procedures. This paper introduced two compiler techniques for the data mapping to multibank memory, since data mapping is still an open problem and needs a better solution. The multibank memory can be consisted of volatile and non-volatile memory components to support ultra-low powered wearable devices. This hybrid memory system including volatile and non-volatile memory components yields higher complexity to map data onto it. To efficiently solve this mapping problem, we formulate it to a simple decision problem. Based on the problem definition, we proposed two efficient algorithms to determine the placement of data to the multibank memory. The proposed techniques consider the characteristic of the non-volatile memory that its write operation consumes more energy than the same operation of a volatile memory even though it provides ultra-low operation power and nearly zero leakage current. The proposed technique solves this negative effect of non-volatile memory by using efficient data placement technique and hybrid memory architecture. In experimental section, the result shows that the proposed techniques improve energy saving up to 59.5% for the hybrid multibank memory architecture

    Castell: a heterogeneous cmp architecture scalable to hundreds of processors

    Get PDF
    Technology improvements and power constrains have taken multicore architectures to dominate microprocessor designs over uniprocessors. At the same time, accelerator based architectures have shown that heterogeneous multicores are very efficient and can provide high throughput for parallel applications, but with a high-programming effort. We propose Castell a scalable chip multiprocessor architecture that can be programmed as uniprocessors, and provides the high throughput of accelerator-based architectures. Castell relies on task-based programming models that simplify software development. These models use a runtime system that dynamically finds, schedules, and adds hardware-specific features to parallel tasks. One of these features is DMA transfers to overlap computation and data movement, which is known as double buffering. This feature allows applications on Castell to tolerate large memory latencies and lets us design the memory system focusing on memory bandwidth. In addition to provide programmability and the design of the memory system, we have used a hierarchical NoC and added a synchronization module. The NoC design distributes memory traffic efficiently to allow the architecture to scale. The synchronization module is a consequence of the large performance degradation of application for large synchronization latencies. Castell is mainly an architecture framework that enables the definition of domain-specific implementations, fine-tuned to a particular problem or application. So far, Castell has been successfully used to propose heterogeneous multicore architectures for scientific kernels, video decoding (using H.264), and protein sequence alignment (using Smith-Waterman and clustalW). It has also been used to explore a number of architecture optimizations such as enhanced DMA controllers, and architecture support for task-based programming models. ii

    Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips

    Full text link
    The trend in industry is towards heterogeneous multicore processors (HMCs), including chips with CPUs and massively-threaded throughput-oriented processors (MTTOPs) such as GPUs. Although current homogeneous chips tightly couple the cores with cache-coherent shared virtual memory (CCSVM), this is not the communication paradigm used by any current HMC. In this paper, we present a CCSVM design for a CPU/MTTOP chip, as well as an extension of the pthreads programming model, called xthreads, for programming this HMC. Our goal is to evaluate the potential performance benefits of tightly coupling heterogeneous cores with CCSVM
    corecore