105 research outputs found

    IMPROVING THE PERFORMANCE AND ENERGY EFFICIENCY OF EMERGING MEMORY SYSTEMS

    Get PDF
    Modern main memory is primarily built using dynamic random access memory (DRAM) chips. As DRAM chip scales to higher density, there are mainly three problems that impede DRAM scalability and performance improvement. First, DRAM refresh overhead grows from negligible to severe, which limits DRAM scalability and causes performance degradation. Second, although memory capacity has increased dramatically in past decade, memory bandwidth has not kept pace with CPU performance scaling, which has led to the memory wall problem. Third, DRAM dissipates considerable power and has been reported to account for as much as 40% of the total system energy and this problem exacerbates as DRAM scales up. To address these problems, 1) we propose Rank-level Piggyback Caching (RPC) to alleviate DRAM refresh overhead by servicing memory requests and refresh operations in parallel; 2) we propose a high performance and bandwidth efficient approach, called SELF, to breaking the memory bandwidth wall by exploiting die-stacked DRAM as a part of memory; 3) we propose a cost-effective and energy-efficient architecture for hybrid memory systems composed of high bandwidth memory (HBM) and phase change memory (PCM), called Dual Role HBM (DR-HBM). In DR-HBM, hot pages are tracked at a cost-effective way and migrated to the HBM to improve performance, while cold pages are stored at the PCM to save energy

    A New System Architecture for Heterogeneous Compute Units

    Get PDF
    The ongoing trend to more heterogeneous systems forces us to rethink the design of systems. In this work, I study a new system design that considers heterogeneous compute units (general-purpose cores with different instruction sets, DSPs, FPGAs, fixed-function accelerators, etc.) from the beginning instead of as an afterthought. The goal is to treat all compute units (CUs) as first-class citizens, enabling (1) isolation and secure communication between all types of CUs, (2) a direct interaction of all CUs, removing the conventional CPU from the critical path, and (3) access to operating system (OS) services such as file systems and network stacks for all CUs. To study this system design, I am using a hardware/software co-design based on two key ideas: 1) introduce a new hardware component next to each CU used by the OS as the CUs' common interface and 2) let the OS kernel control applications remotely from a different CU. The hardware component is called data transfer unit (DTU) and offers the minimal set of features to reach the stated goals: secure message passing and memory access. The OS is called M³ and runs its kernel on a dedicated CU and runs the OS services and applications on the remaining CUs. The kernel is responsible for establishing DTU-based communication channels between services and applications. After a channel has been set up, services and applications communicate directly without involving the kernel. This approach allows to support arbitrary CUs as aforementioned first-class citizens, ranging from fixed-function accelerators to complex general-purpose cores

    Task Activity Vectors: A Novel Metric for Temperature-Aware and Energy-Efficient Scheduling

    Get PDF
    This thesis introduces the abstraction of the task activity vector to characterize applications by the processor resources they utilize. Based on activity vectors, the thesis introduces scheduling policies for improving the temperature distribution on the processor chip and for increasing energy efficiency by reducing the contention for shared resources of multicore and multithreaded processors

    High Performance On-Chip Interconnects Design for Future Many-Core Architectures

    Get PDF
    Switch-based Network-on-Chip (NoC) is a widely accepted inter-core communication infrastructure for Chip Multiprocessors (CMPs). With the continued advance of CMOS technology, the number of cores on a single chip keeps increasing at a rapid pace. It is highly expected that many-core architectures with more than hundreds of processor cores will be commercialized in the near future. In such a large scale CMP system, NoC overheads are more dominant than computation power in determining overall system performance. Also, for modern computational workloads requiring abundant thread level parallelism (TLP), NoC design for highly-parallel, many-core accelerators such as General Purpose Graphics Processing Units (GPGPUs) is of primary importance in harnessing the potential of massive thread- and data-level parallelism. In these contexts, it is critical that NoC provides both low latency and high bandwidth within limited on-chip power and area budgets. In this dissertation, we explore various design issues inherent in future many-core architectures, CMPs and GPGPUs, to achieve both high performance and power efficiency. First, we deal with issues in using a promising next generation memory technology, Spin-Transfer Torque Magnetic RAM (STT-MRAM), for NoC input buffers in CMPs. Using a high density and low leakage memory offers more buffer capacities with the same die footprint, thus helping increase network throughput in NoC routers. However, its long latency and high power consumption in write operations still need to be addressed. Thus, we propose a hybrid design of input buffers using both SRAM and STT-MRAM to hide the long write latency efficiently. Considering that simple data migration in the hybrid buffer consumes more dynamic power compared to SRAM, we provide a lazy migration scheme that reduces the dynamic power consumption of the hybrid buffer. Second, we propose the first NoC router design that uses only STT-MRAM, providing much larger buffer space with less power consumption, while preserving data integrity. To hide the multicycle writes, we employ a multibank STT-MRAM buffer, a virtual channel with multiple banks where every incoming flit is seamlessly pipelined to each bank alternately. Our STT-MRAM design has aggressively reduced the retention time, resulting in a significant reduction in the latency and power overheads of write operations. To ensure data integrity against inadvertent bit flips from the thermal fluctuation during the given retention time, we propose a cost-efficient dynamic buffer refresh scheme combined with Error Correcting Codes (ECC) to detect and correct data corruption. Third, we present schemes for bandwidth-efficient on-chip interconnects in GPGPUs. GPGPUs place a heavy demand on the on-chip interconnect between the many cores and a few memory controllers (MCs). Thus, traffic is highly asymmetric, impacting on-chip resource utilization and system performance. Here, we analyze the communication demands of typical GPGPU applications, and propose efficient NoC designs to meet those demands

    Characterization and Optimization of Resource Utilization for Cellular Networks.

    Full text link
    Cellular data networks have experienced significant growth in the recent years particularly due to the emergence of smartphones. Despite its popularity, there remain two major challenges associated with cellular carriers and their customers: carriers operate under severe resource constraints, while many mobile applications are unaware of the cellular specific characteristics, leading to inefficient radio resource and handset energy utilization. My dissertation is dedicated to address both challenges, aiming at providing practical, effective, and efficient methods to monitor and to reduce the resource utilization and bandwidth consumption in cellular networks. Specifically, from carriers' perspective, we performed the first measurement study to understand the state-of-the-art of resource utilization for a commercial cellular network, and revealed that fundamental limitation of the current resource management policy is treating all traffic according to the same resource management policy globally configured for all users. On mobile applications' side, we developed a novel data analysis framework called ARO (mobile Application Resource Optimizer), the first tool that exposes the interaction between mobile applications and the radio resource management policy, to reveal inefficient resource usage due to a lack of transparency in the lower-layer protocol behavior. ARO revealed that many popular applications built by professional developers have significant resource utilization inefficiencies that are previously unknown. Motivated by the observations from both sides, we further proposed a novel resource management framework that enables the cooperation between handsets and the network to allow adaptive resource release, therefore better balancing the key tradeoffs in cellular networks. We also investigated the problem of reducing the bandwidth consumption in cellular networks by performing the first network-wide study of HTTP caching on smartphones due to its popularity. Our findings suggest that for web caching, there exists a huge gap between the protocol specification and the protocol implementation on today's mobile devices, leading to significant amount of redundant network traffic.PHDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/94024/1/fengqian_1.pd

    Dynamic contention management for distributed applications

    Get PDF
    PhD ThesisDistributed applications often make use of replicated state to afford a greater level of availability and throughput. This is achieved by allowing individual processes to progress without requiring prior synchronisation. This approach, termed optimistic replication, results in divergent replicas that must be reconciled to achieve an overall consistent state. Concurrent operations to shared objects in the replicas result in conflicting updates that require reconciliatory action to rectify. This typically takes the form of compensatory execution or simply undoing and rolling back client state. When considering user interaction with the application, there exists relationships and intent in the ordering and execution of these operations. The enactment of reconciliation that determines one action as conflicted may have far reaching implications with regards to the user’s original intent. In such scenarios, the compensatory action applied to a conflict may require previous operations to also be undone or compensated such that the user’s intent is maintained. Therefore, an ability to manage the contention to the shared data across the distributed application to pre-emptively lower conflicts resulting from these infringements is desirable. The aim is to not hinder throughput, achieved from the weaker consistency model known as eventual consistency. In this thesis, a model is presented for a contention management framework that schedules access using the expected execution inherent in the application domain to best inform the contention manager. A backoff scheme is employed to create an access schedule, preserving user intent for applications that require this high level of maintenance for user actions. By using such an approach, this results in a performance improvement seen in the reduction of the overall number of conflicts, while also improving overall system throughput. This thesis describes how the contention management scheme operates and, through experimentation, the performance benefits received

    Redundant disk arrays: Reliable, parallel secondary storage

    Get PDF
    During the past decade, advances in processor and memory technology have given rise to increases in computational performance that far outstrip increases in the performance of secondary storage technology. Coupled with emerging small-disk technology, disk arrays provide the cost, volume, and capacity of current disk subsystems, by leveraging parallelism, many times their performance. Unfortunately, arrays of small disks may have much higher failure rates than the single large disks they replace. Redundant arrays of inexpensive disks (RAID) use simple redundancy schemes to provide high data reliability. The data encoding, performance, and reliability of redundant disk arrays are investigated. Organizing redundant data into a disk array is treated as a coding problem. Among alternatives examined, codes as simple as parity are shown to effectively correct single, self-identifying disk failures

    Interactive Visualization on High-Resolution Tiled Display Walls with Network Accessible Compute- and Display-Resources

    Get PDF
    Papers number 2-7 and appendix B and C of this thesis are not available in Munin: 2. Hagen, T-M.S., Johnsen, E.S., Stødle, D., Bjorndalen, J.M. and Anshus, O.: 'Liberating the Desktop', First International Conference on Advances in Computer-Human Interaction (2008), pp 89-94. Available at http://dx.doi.org/10.1109/ACHI.2008.20 3. Tor-Magne Stien Hagen, Oleg Jakobsen, Phuong Hoai Ha, and Otto J. Anshus: 'Comparing the Performance of Multiple Single-Cores versus a Single Multi-Core' (manuscript)4. Tor-Magne Stien Hagen, Phuong Hoai Ha, and Otto J. Anshus: 'Experimental Fault-Tolerant Synchronization for Reliable Computation on Graphics Processors' (manuscript) 5. Tor-Magne Stien Hagen, Daniel Stødle and Otto J. Anshus: 'On-Demand High-Performance Visualization of Spatial Data on High-Resolution Tiled Display Walls', Proceedings of the International Conference on Imaging Theory and Applications and International Conference on Information Visualization Theory and Applications (2010), pages 112-119. Available at http://dx.doi.org/10.5220/0002849601120119 6. Bård Fjukstad, Tor-Magne Stien Hagen, Daniel Stødle, Phuong Hoai Ha, John Markus Bjørndalen and Otto Anshus: 'Interactive Weather Simulation and Visualization on a Display Wall with Many-Core Compute Nodes', Para 2010 – State of the Art in Scientific and Parallel Computing. Available at http://vefir.hi.is/para10/extab/para10-paper-60 7. Tor-Magne Stien Hagen, Daniel Stødle, John Markus Bjørndalen, and Otto Anshus: 'A Step towards Making Local and Remote Desktop Applications Interoperable with High-Resolution Tiled Display Walls', Lecture Notes in Computer Science (2011), Volume 6723/2011, 194-207. Available at http://dx.doi.org/10.1007/978-3-642-21387-8_15The vast volume of scientific data produced today requires tools that can enable scientists to explore large amounts of data to extract meaningful information. One such tool is interactive visualization. The amount of data that can be simultaneously visualized on a computer display is proportional to the display’s resolution. While computer systems in general have seen a remarkable increase in performance the last decades, display resolution has not evolved at the same rate. Increased resolution can be provided by tiling several displays in a grid. A system comprised of multiple displays tiled in such a grid is referred to as a display wall. Display walls provide orders of magnitude more resolution than typical desktop displays, and can provide insight into problems not possible to visualize on desktop displays. However, their distributed and parallel architecture creates several challenges for designing systems that can support interactive visualization. One challenge is compatibility issues with existing software designed for personal desktop computers. Another set of challenges include identifying characteristics of visualization systems that can: (i) Maintain synchronous state and display-output when executed over multiple display nodes; (ii) scale to multiple display nodes without being limited by shared interconnect bottlenecks; (iii) utilize additional computational resources such as desktop computers, clusters and supercomputers for workload distribution; and (iv) use data from local and remote compute- and data-resources with interactive performance. This dissertation presents Network Accessible Compute (NAC) resources and Network Accessible Display (NAD) resources for interactive visualization of data on displays ranging from laptops to high-resolution tiled display walls. A NAD is a display having functionality that enables usage over a network connection. A NAC is a computational resource that can produce content for network accessible displays. A system consisting of NACs and NADs is either push-based (NACs provide NADs with content) or pull-based (NADs request content from NACs). To attack the compatibility challenge, a push-based system was developed. The system enables several simultaneous users to mirror multiple regions from the desktop of their computers (NACs) onto nearby NADs (among others a 22 megapixel display wall) without requiring usage of separate DVI/VGA cables, permanent installation of third party software or opening firewall ports. The system has lower performance than that of a DVI/VGA cable approach, but increases flexibility such as the possibility to share network accessible displays from multiple computers. At a resolution of 800 by 600 pixels, the system can mirror dynamic content between a NAC and a NAD at 38.6 frames per second (FPS). At 1600x1200 pixels, the refresh rate is 12.85 FPS. The bottleneck of the system is frame buffer capturing and encoding/decoding of pixels. These two functional parts are executed in sequence, limiting the usage of additional CPU cores. By pipelining and executing these parts on separate CPU cores, higher frame rates can be expected and by a factor of two in the best case. To attack all presented challenges, a pull-based system, WallScope, was developed. WallScope enables interactive visualization of local and remote data sets on high-resolution tiled display walls. The WallScope architecture comprises a compute-side and a display-side. The compute-side comprises a set of static and dynamic NACs. Static NACs are considered permanent to the system once added. This type of NAC typically has strict underlying security and access policies. Examples of such NACs are clusters, grids and supercomputers. Dynamic NACs are compute resources that can register on-the-fly to become compute nodes in the system. Examples of this type of NAC are laptops and desktop computers. The display-side comprises of a set of NADs and a data set containing data customized for the particular application domain of the NADs. NADs are based on a sort-first rendering approach where a visualization client is executed on each display-node. The state of these visualization clients is provided by a separate state server, enabling central control of load and refresh-rate. Based on the state received from the state server, the visualization clients request content from the data set. The data set is live in that it translates these requests into compute messages and forwards them to available NACs. Results of the computations are returned to the NADs for the final rendering. The live data set is close to the NADs, both in terms of bandwidth and latency, to enable interactive visualization. WallScope can visualize the Earth, gigapixel images, and other data available through the live data set. When visualizing the Earth on a 28-node display wall by combining the Blue Marble data set with the Landsat data set using a set of static NACs, the bottleneck of WallScope is the computation involved in combining the data sets. However, the time used to combine data sets on the NACs decreases by a factor of 23 when going from 1 to 26 compute nodes. The display-side can decode 414.2 megapixels of images per second (19 frames per second) when visualizing the Earth. The decoding process is multi-threaded and higher frame rates are expected using multi-core CPUs. WallScope can rasterize a 350-page PDF document into 550 megapixels of image-tiles and display these image-tiles on a 28-node display wall in 74.66 seconds (PNG) and 20.66 seconds (JPG) using a single quad-core desktop computer as a dynamic NAC. This time is reduced to 4.20 seconds (PNG) and 2.40 seconds (JPG) using 28 quad-core NACs. This shows that the application output from personal desktop computers can be decoupled from the resolution of the local desktop and display for usage on high-resolution tiled display walls. It also shows that the performance can be increased by adding computational resources giving a resulting speedup of 17.77 (PNG) and 8.59 (JPG) using 28 compute nodes. Three principles are formulated based on the concepts and systems researched and developed: (i) Establishing the end-to-end principle through customization, is a principle stating that the setup and interaction between a display-side and a compute-side in a visualization context can be performed by customizing one or both sides; (ii) Personal Computer (PC) – Personal Compute Resource (PCR) duality states that a user’s computer is both a PC and a PCR, implying that desktop applications can be utilized locally using attached interaction devices and display(s), or remotely by other visualization systems for domain specific production of data based on a user’s personal desktop install; and (iii) domain specific best-effort synchronization stating that for distributed visualization systems running on tiled display walls, state handling can be performed using a best-effort synchronization approach, where visualization clients eventually will get the correct state after a given period of time. Compared to state-of-the-art systems presented in the literature, the contributions of this dissertation enable utilization of a broader range of compute resources from a display wall, while at the same time providing better control over where to provide functionality and where to distribute workload between compute-nodes and display-nodes in a visualization context
    • …
    corecore