Exploiting Inter- and Intra-Memory Asymmetries for Data Mapping in Hybrid Tiered-Memories
Modern computing systems are embracing hybrid memory comprising DRAM and
non-volatile memory (NVM) to combine the best properties of both memory
technologies, achieving low latency, high reliability, and high density. A
prominent characteristic of DRAM-NVM hybrid memory is that NVM access latency
is much higher than DRAM access latency. We call this inter-memory
asymmetry. We observe that parasitic components on a long bitline are a major
source of high latency in both DRAM and NVM, and a significant factor
contributing to high-voltage operations in NVM, which impact their reliability.
We propose an architectural change, where each long bitline in DRAM and NVM is
split into two segments by an isolation transistor. One segment can be accessed
with lower latency and operating voltage than the other. By introducing tiers,
we enable non-uniform accesses within each memory type (which we call
intra-memory asymmetry), leading to performance and reliability trade-offs in
DRAM-NVM hybrid memory. We extend existing NVM-DRAM OS support in three ways. First, we
exploit both inter- and intra-memory asymmetries to allocate and migrate memory
pages between the tiers in DRAM and NVM. Second, we improve the OS's page
allocation decisions by predicting the access intensity of a newly-referenced
memory page in a program and placing it in a matching tier during its initial
allocation. This minimizes page migrations during program execution, lowering
the performance overhead. Third, we propose a solution to migrate pages between
the tiers of the same memory without transferring data over the memory channel,
minimizing channel occupancy and improving performance. Our overall approach,
which we call MNEME, to enable and exploit asymmetries in DRAM-NVM hybrid
tiered memory improves both performance and reliability for both single-core
and multi-programmed workloads.
Comment: 15 pages, 29 figures, accepted at the ACM SIGPLAN International Symposium on Memory Management
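The initial-placement idea lends itself to a short illustration. Below is a minimal Python sketch of access-intensity-based tier selection; the tier names, thresholds, and predictor are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of intensity-based page placement in a DRAM-NVM
# hybrid tiered memory. Tiers are ordered fastest to slowest: the near
# (short-bitline) and far segments of DRAM, then those of NVM.
TIERS = ["dram_near", "dram_far", "nvm_near", "nvm_far"]

def predict_intensity(page_addr, history):
    """Guess accesses per epoch for a newly referenced page, here from
    the recent access counts of its neighboring pages (an assumption)."""
    neighbors = [history.get(page_addr + d, 0) for d in (-1, 1)]
    return sum(neighbors) / len(neighbors)

def choose_tier(intensity, thresholds=(1000, 100, 10)):
    """Hotter pages go to faster, lower-voltage (short-bitline) segments."""
    for tier, threshold in zip(TIERS, thresholds):
        if intensity >= threshold:
            return tier
    return TIERS[-1]
```

Placing a page in a matching tier at its first touch is what lets the OS avoid later migrations and the channel occupancy they cause.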
Towards scalable, energy-efficient, bus-based on-chip networks
It is expected that future on-chip networks for many-core processors will impose huge overheads in terms of energy, delay, complexity, verification effort, and area. There is a common belief that the bandwidth necessary for future applications can only be provided by employing packet-switched networks with complex routers and a scalable directory-based coherence protocol. We posit that such a scheme is likely overkill in a well-designed system, in addition to being expensive in terms of power because of a large number of power-hungry routers.

We show that bus-based networks with snooping protocols can significantly lower energy consumption and simplify network/protocol design and verification, with no loss in performance. We achieve these characteristics by dividing the chip into multiple segments, each having its own broadcast bus, with these buses further connected by a central bus. This helps eliminate expensive routers, but suffers from the energy overhead of long wires. We propose the use of multiple Bloom filters to effectively track data presence in the cache and restrict bus broadcasts to a subset of segments, significantly reducing energy consumption. We further show that the use of OS page coloring helps maximize locality and improves the effectiveness of the Bloom filters. We also employ low-swing wiring to further reduce the energy overheads of the links. Performance can also be improved at relatively low cost by utilizing more of the abundant on-chip metal budget and employing multiple address-interleaved buses rather than multiple routers.

Thus, with the combination of all the above innovations, we extend the scalability of buses and believe that buses can be a viable and attractive option for future on-chip networks. We show average energy reductions of up to 31X compared to state-of-the-art packet-switched networks.
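To make the broadcast-restriction mechanism concrete, here is a minimal Python sketch of per-segment Bloom filters gating snoops; the sizes and hashing are illustrative assumptions, not the paper's hardware.

```python
# Sketch: one Bloom filter per chip segment tracks block addresses that
# may be cached there, so a request is broadcast only where it may hit.
import hashlib

class SegmentFilter:
    def __init__(self, bits=4096, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.array = [False] * bits

    def _positions(self, addr):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{addr}:{i}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def insert(self, addr):
        """Called when a block is allocated in a cache of this segment."""
        for p in self._positions(addr):
            self.array[p] = True

    def may_contain(self, addr):
        """No false negatives; occasional false positives cost a snoop."""
        return all(self.array[p] for p in self._positions(addr))

def segments_to_snoop(filters, addr):
    """Restrict the bus broadcast to segments whose filter may hit."""
    return [i for i, f in enumerate(filters) if f.may_contain(addr)]
```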
Microarchitectural techniques to reduce energy consumption in the memory hierarchy
This thesis states that dynamic profiling of the memory reference stream can improve energy
and performance in the memory hierarchy. The research presented in this thesis provides
multiple instances of using lightweight hardware structures to profile the memory
reference stream. The objective of this research is to develop microarchitectural techniques
to reduce energy consumption at different levels of the memory hierarchy. Several simple
and implementable techniques were developed as a part of this research. One of the
techniques identifies and eliminates redundant refresh operations in DRAM and reduces
DRAM refresh power. Another reduces leakage energy in L2 and higher-level caches for
multiprocessor systems. The emphasis of this research has been to develop several techniques
of obtaining energy savings in caches using a simple hardware structure called the
counting Bloom filter (CBF). CBFs have been used to predict L2 cache misses and obtain
energy savings by not accessing the L2 cache on a predicted miss. A simple extension of
this technique allows CBFs to do way-estimation of set-associative caches to reduce energy
in cache lookups. Another technique uses CBFs to track addresses in a virtual cache and
reduce false synonym lookups. Finally, this thesis presents a technique to reduce dynamic
power consumption in level one caches using significance compression. The significant
energy and performance improvements demonstrated by the techniques presented in this
thesis suggest that this work will be of great value for designing memory hierarchies of
future computing platforms.
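As an illustration of the central structure, the following is a minimal Python sketch of a counting Bloom filter used for L2 miss prediction; sizes and hash functions are illustrative, not the thesis's hardware parameters.

```python
# Sketch of a counting Bloom filter (CBF): counters allow deletions, so
# the filter can track the live contents of a cache.
class CountingBloomFilter:
    def __init__(self, size=8192, hashes=3):
        self.size, self.hashes = size, hashes
        self.counts = [0] * size

    def _positions(self, addr):
        # Cheap multiplicative hashing; hardware would use XOR-based indices.
        return [(addr * (2 * i + 1) * 2654435761) % self.size
                for i in range(self.hashes)]

    def insert(self, addr):            # on L2 fill
        for p in self._positions(addr):
            self.counts[p] += 1

    def remove(self, addr):            # on L2 eviction
        for p in self._positions(addr):
            self.counts[p] -= 1

    def maybe_present(self, addr):
        return all(self.counts[p] > 0 for p in self._positions(addr))

# If maybe_present(addr) is False the block is definitely absent, so the
# L2 lookup can be skipped and the request sent directly to memory.
```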
Hardware acceleration for power efficient deep packet inspection
The rapid growth of the Internet has led to a massive spread of malicious attacks such as viruses and malware, making the safety of online activity a major concern. The use of Network Intrusion Detection Systems (NIDS) is an effective method to safeguard the Internet. One key procedure in NIDS is Deep Packet Inspection (DPI). DPI can examine the contents of a packet and take action on it based on predefined rules. In this thesis, DPI is mainly discussed in the context of security applications. However, DPI can also be used for bandwidth management and network surveillance.
DPI inspects the whole packet payload, and due to this and the complexity of the inspection rules, DPI algorithms consume significant amounts of resources, including time, memory, and energy. The aim of this thesis is to design hardware-accelerated methods for memory- and energy-efficient high-speed DPI.
The patterns in packet payloads, especially complex patterns, can be efficiently represented by regular expressions, which can be evaluated using Deterministic Finite Automata (DFA). DFA algorithms are fast but consume very large amounts of memory for certain kinds of regular expressions. In this thesis, memory-efficient algorithms are proposed based on transition compression of the DFAs.
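A toy example makes the DFA approach concrete. The Python automaton below matches the pattern "ab" anywhere in a payload; real rule sets compile thousands of regular expressions into one large transition table, which is exactly what transition compression shrinks.

```python
# Toy DFA for the pattern "ab"; state 2 means a match was found.
DFA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
}

def matches(payload: str) -> bool:
    state = 0
    for ch in payload:
        state = DFA.get((state, ch), 0)   # missing edges return to start
        if state == 2:
            return True
    return False

print(matches("xxabyy"))   # True: one state transition per input byte
```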
In this work, Bloom filters are used to implement DPI on an FPGA for hardware acceleration, with the design of a parallel architecture. Furthermore, to balance power and performance, an energy-efficient adaptive Bloom filter is designed that can adjust the number of active hash functions according to the current workload. In addition, a method is given for implementation on both two-stage and multi-stage platforms. However, false positives still prevent Bloom filters from being used more widely; a cache-based counting Bloom filter is presented in this work to eliminate false positives for fast and precise matching.
Finally, in future work, in order to estimate the effect of power savings, models will be built for routers and DPI, which will also analyze the latency impact of dynamic frequency adaptation to current traffic. In addition, a low-power DPI system will be designed with one or more DPI engines. Results and evaluation of the low-power DPI model and system will be produced in future work.
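The adaptive Bloom filter idea can be sketched briefly: insert with every hash function, but query with only an active subset, trading false positive rate for lookup energy. The policy and sizes below are assumptions, not the thesis design.

```python
# Sketch of an adaptive Bloom filter that varies its active hash count.
class AdaptiveBloomFilter:
    def __init__(self, bits=65536, max_hashes=8):
        self.bits, self.max_hashes = bits, max_hashes
        self.active = max_hashes
        self.array = [False] * bits

    def _pos(self, item, i):
        return hash((item, i)) % self.bits

    def insert(self, item):
        # Insert with all hash functions so that querying with any
        # prefix of them can never produce a false negative.
        for i in range(self.max_hashes):
            self.array[self._pos(item, i)] = True

    def query(self, item):
        # Fewer active lookups cost less energy but raise false positives.
        return all(self.array[self._pos(item, i)] for i in range(self.active))

    def adapt(self, load):
        # Hypothetical policy: shed hash lookups under heavy traffic.
        self.active = self.max_hashes if load < 0.5 else self.max_hashes // 2
```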
Scalable hardware memory disambiguation
This dissertation deals with one of the long-standing problems in Computer Architecture
– the problem of memory disambiguation. Microprocessors typically reorder
memory instructions during execution to improve concurrency. Such microprocessors
use hardware memory structures for memory disambiguation, known as Load-Store
Queues (LSQs), to ensure that memory instruction dependences are satisfied
even when the memory instructions execute out-of-order. A typical LSQ implementation
(circa 2006) holds all in-flight memory instructions in a physically centralized
LSQ and performs a fully associative search on all buffered instructions to ensure
that memory dependences are satisfied. These LSQ implementations do not scale
because they use large, fully associative structures, which are known to be slow and
power hungry. The increasing trend towards distributed microarchitectures further
exacerbates these problems. As on-chip wire delays increase and high-performance
processors become necessarily distributed, centralized structures such as the LSQ
can limit scalability.
This dissertation describes techniques to create scalable LSQs in both centralized
and distributed microarchitectures. The problems and solutions described
in this thesis are motivated and validated by real system designs. The dissertation
starts with a description of the partitioned primary memory system of the TRIPS
processor, of which the LSQ is an important component, and then through a series
of optimizations describes how the power, area, and centralization problems
of the LSQ can be solved with minor performance losses (if at all) even for large
numbers of in-flight memory instructions. The four solutions described in this dissertation
— partitioning, filtering, late binding and efficient overflow management —
enable power-, area-efficient, distributed and scalable LSQs, which in turn enable
aggressive large-window processors capable of simultaneously executing thousands
of instructions.
To mitigate the power problem, we replaced the power-hungry, fully associative
search with a power-efficient hash table lookup using a simple address-based
Bloom filter. Bloom filters are probabilistic data structures used for testing set
membership and can be used to quickly check if an instruction with the same data
address is likely to be found in the LSQ without performing the associative search.
Bloom filters typically eliminate more than 80% of the associative searches and they
are highly effective because in most programs, it is uncommon for loads and stores
to have the same data address and be in execution simultaneously.
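A minimal sketch of this filtering step, assuming a single cheap index function rather than the hash actually used in the TRIPS design:

```python
# Illustrative sketch of gating LSQ associative searches with an
# address-based Bloom filter; the size and index function are assumptions.
class LSQSearchFilter:
    def __init__(self, bits=1024):
        self.bits = bits
        self.bitmap = [False] * bits

    def _pos(self, addr):
        return (addr >> 3) % self.bits   # one cheap index per block address

    def record(self, addr):
        """Set the bit when a load/store with this address enters the LSQ."""
        self.bitmap[self._pos(addr)] = True

    def must_search(self, addr):
        """A clear bit means no in-flight instruction can match, so the
        power-hungry fully associative search is safely skipped."""
        return self.bitmap[self._pos(addr)]

    # A real design also needs deletion as instructions retire, e.g. with
    # counters (a counting Bloom filter) or periodic clearing of the bitmap.
```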
To rectify the area problem, we observe that only a small fraction
of all memory instructions are dependent, that only such dependent instructions
need to be buffered in the LSQ, and that these instructions need to be in the LSQ
only for certain parts of the pipelined execution. We propose two mechanisms to
exploit these observations. The first mechanism, area filtering, is a hardware mechanism
that couples Bloom filters and dependence predictors to dynamically identify
and buffer only those instructions which are likely to be dependent. The second
mechanism, late binding, reduces the occupancy and hence size of the LSQ. Both of
these optimizations allow the number of LSQ slots to be reduced by up to one-half
compared to a traditional organization without any performance degradation.
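The area-filtering decision sketched below combines the two structures described above; the dependence-predictor interface is hypothetical.

```python
# Illustrative area-filtering policy: allocate an LSQ slot only for
# instructions likely to matter for disambiguation (interfaces assumed).
def should_buffer(pc, addr, dep_predictor, search_filter):
    """Buffer a memory instruction only if the dependence predictor
    flags its PC or a same-address instruction may be in flight."""
    return (dep_predictor.predicts_dependence(pc)
            or search_filter.must_search(addr))
```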
Finally, we describe a new decentralized LSQ design for handling LSQ structural
hazards in distributed microarchitectures. Decentralization of LSQs, and to
a large extent distributed microarchitectures with memory speculation, have proved
to be impractical because of the high performance penalties associated with the
mechanisms for dealing with hazards. To solve this problem, we applied classic
flow-control techniques from interconnection networks for handling resource
conflicts. The first method, memory-side buffering, buffers the overflowing instructions
in a separate buffer near the LSQs. The second scheme, execution-side NACKing,
sends the overflowing instruction back to the issue window from which it is later
re-issued. The third scheme, network buffering, uses the buffers in the interconnection
network between the execution units and memory to hold instructions when the
LSQ is full, and uses virtual channel flow control to avoid deadlocks. The network
buffering scheme is the most robust of all the overflow schemes and shows less than
1% performance degradation due to overflows for a subset of SPEC CPU 2000 and
EEMBC benchmarks on a cycle-accurate simulator that closely models the TRIPS
processor.
The techniques proposed in this dissertation are independent, architecture-neutral,
and their cumulative benefits result in LSQs that can be partitioned at a
fine granularity and have low design complexity. Each of these partitions selectively
buffers only memory instructions with true dependences and can be closely coupled
with the execution units, thus minimizing power, area, and latency. Such LSQ
designs with near-ideal characteristics are well suited for microarchitectures with
thousands of instructions in-flight and may enable even more aggressive microarchitectures
in the future.
Text Preprocessing in Programmable Logic
There is a tremendous amount of information being generated and stored every year, and its growth rate is exponential. From 2008 to 2009, the growth rate was estimated to be 62%. In 2010, the amount of generated information is expected to grow by 50% to 1.2 zettabytes, and by 2020 this amount is expected to reach 35 zettabytes. By preprocessing text in programmable logic, high data processing rates could be achieved
with greater power efficiency than with an equivalent software solution, leading to a smaller carbon footprint.
This thesis presents an overview of the fields of Information Retrieval and Natural Language Processing, and the design and implementation of four text preprocessing modules in programmable logic: UTF-8 decoding, stop-word filtering, and stemming with both Lovins' and Porter's techniques. These extensively pipelined circuits were implemented in a high-performance FPGA and found to sustain maximum operating frequencies of 704 MHz, data throughputs in excess of 5 Gbps, and efficiencies in the range of 4.332–6.765 mW/Gbps and 34.66–108.2 µW/MHz. These circuits can be incorporated into larger systems, such as document classifiers and information extraction engines.
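A software analogue helps show what each module does. The sketch below chains the stages in Python; the stop-word list and the crude suffix stripper are placeholders for the hardware Lovins and Porter implementations, not the thesis's circuits.

```python
# Software analogue of the preprocessing pipeline (illustrative only).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}

def decode_utf8(raw: bytes) -> str:
    return raw.decode("utf-8")                         # stage 1: UTF-8 decode

def filter_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]  # stage 2

def crude_stem(token: str) -> str:
    # Naive suffix stripping standing in for Porter's algorithm (stage 3).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(raw: bytes):
    tokens = decode_utf8(raw).lower().split()
    return [crude_stem(t) for t in filter_stop_words(tokens)]

print(preprocess(b"Stemming and filtering of the tokens"))
# ['stemm', 'filter', 'token']
```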
Transparent management of scratchpad memories in shared memory programming models
Cache-coherent shared memory has traditionally been the favorite memory organization for chip multiprocessors thanks to its high programmability. In this organization the cache hierarchy is in charge of moving the data and keeping it coherent between all the caches, enabling the usage of shared memory programming models where the programmer does not need to carry out any data management operation. Unfortunately, performing all the data management operations in hardware causes severe problems, the primary concerns being the power consumption of the caches and the amount of coherence traffic in the interconnection network.
A good solution is to introduce ScratchPad Memories (SPMs) alongside the cache hierarchy, forming a hybrid memory hierarchy. SPMs are more power-efficient than caches and do not generate coherence traffic, but they degrade programmability. In particular, SPMs require the programmer to partition the data, to program data transfers, and to keep coherence between different copies of the data.
A promising solution to exploit the benefits of the SPMs without harming programmability is to allow programmers to use shared memory programming models and to automatically generate code that manages the SPMs. Unfortunately, current compilers and runtime systems encounter serious limitations to automatically generate code for hybrid memory hierarchies from shared memory programming models.
This thesis proposes to transparently manage the SPMs of hybrid memory hierarchies in shared memory programming models. In order to achieve this goal this thesis proposes a combination of hardware and compiler techniques to manage the SPMs in fork-join programming models and a set of runtime system techniques to manage the SPMs in task programming models. The proposed techniques make it possible to program hybrid memory hierarchies with these two well-known and easy-to-use forms of shared memory programming models, capitalizing on the benefits of hybrid memory hierarchies in power consumption and network traffic without harming programmability.
The first contribution of this thesis is a hardware/software co-designed coherence protocol to transparently manage the SPMs of hybrid memory hierarchies in fork-join programming models. The solution allows the compiler to always generate code to manage the SPMs with tiling software caches, even in the presence of unknown memory aliasing hazards between memory references to the SPMs and to the cache hierarchy. On the software side, the compiler generates a special form of memory instruction for memory references with possible aliasing hazards. On the hardware side, the special memory instructions are diverted to the correct copy of the data using a set of directories that track what data is mapped to the SPMs.
The second contribution of this thesis is a set of runtime system techniques to manage the SPMs of hybrid memory hierarchies in task programming models. The proposed runtime system techniques exploit the characteristics of these programming models to map the data specified in the task dependences to the SPMs. Different policies are proposed to mitigate the communication costs of the data transfers, overlapping them with other execution phases such as the task scheduling phase or the execution of the previous task. The runtime system can also reduce the number of data transfers by using a task scheduler that exploits data locality in the SPMs. In addition, the proposed techniques are combined with mechanisms that reduce the impact of fine-grained tasks, such as hardware runtime systems or large SPM sizes.
The accomplishment of this thesis is that hybrid memory hierarchies can be programmed with fork-join and task programming models. Consequently, architectures with hybrid memory hierarchies can be exposed to the programmer as a shared memory multiprocessor, taking advantage of the benefits of the SPMs while maintaining the programming simplicity of shared memory programming models.
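To illustrate the runtime-system contribution, here is a minimal Python sketch of mapping task input dependences into an SPM with asynchronous transfers; the DMA and task interfaces are hypothetical, not the thesis's runtime API.

```python
# Sketch: copy a task's input dependences into the SPM before it runs,
# overlapping the transfers with scheduling (interfaces are assumptions).
class SPMRuntime:
    def __init__(self, spm_size):
        self.spm_size = spm_size
        self.resident = {}          # buffer name -> SPM offset
        self.next_free = 0

    def map_dependences(self, task, dma):
        for name, size in task.inputs:
            if name in self.resident:             # locality: already on-chip
                continue
            if self.next_free + size > self.spm_size:
                break                 # fall back to the cache hierarchy
            dma.start_copy(src=name, dst_offset=self.next_free, size=size)
            self.resident[name] = self.next_free
            self.next_free += size

    def run(self, task, dma):
        self.map_dependences(task, dma)           # overlaps scheduling
        dma.wait_all()                            # data in place before compute
        task.execute(self.resident)
```

A locality-aware scheduler would pick the next task whose inputs are already in `resident`, reducing transfers further, as the thesis proposes.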
Bridging the Gap between Application and Solid-State-Drives
Data storage is one of the most important and often critical parts of a computing system in terms of performance, cost, reliability, and energy. Numerous new memory technologies, such as NAND flash, phase change memory (PCM), magnetic RAM (STT-RAM), and Memristor, have emerged recently. Many of them have already entered production systems. Traditional storage optimization and caching algorithms are far from optimal because storage I/Os do not show simple locality. To provide optimal storage we need accurate predictions of I/O behavior. However, workloads are increasingly dynamic and diverse, making long- and short-term I/O prediction challenging. Because of the evolution of storage technologies and the increasing diversity of workloads, storage software is becoming more and more complex. For example, a Flash Translation Layer (FTL) is added for NAND-flash-based Solid State Disks (NAND-SSDs). However, it introduces overheads such as address translation delay and garbage collection costs. Many recent studies aim to address this overhead. Unfortunately, there is no one-size-fits-all solution due to the variety of workloads. Despite rapid evolution in storage technologies, the increasing heterogeneity and diversity in machines and workloads, coupled with the continued data explosion, exacerbate the gap between computing and storage speeds.

In this dissertation, we improve data storage performance with both top-down and bottom-up approaches. First, we will investigate exposing storage-level parallelism so that applications can avoid I/O contention and workload skew when scheduling jobs. Second, we will study how architecture-aware task scheduling can improve application performance when the system is equipped with PCM-based NVRAM. Third, we will develop an I/O-correlation-aware flash translation layer for NAND-flash-based Solid State Disks. Fourth, we will build a DRAM-based correlation-aware FTL emulator and study its performance on various filesystems.
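The correlation-aware FTL idea can be sketched simply: logical pages from the same correlation group are written into the same flash block, so garbage collection tends to find their invalidations clustered. The Python below is illustrative (garbage collection omitted), not the dissertation's design.

```python
# Sketch of a correlation-aware page-level FTL (illustrative sizes).
PAGES_PER_BLOCK = 64

class CorrelationAwareFTL:
    def __init__(self, num_blocks):
        self.l2p = {}              # logical page -> (block, page) mapping
        self.group_block = {}      # correlation group -> its open block
        self.free_blocks = list(range(num_blocks))
        self.fill = {}             # block -> pages written so far

    def write(self, lpn, group):
        """Route a write to the open block of its correlation group."""
        block = self.group_block.get(group)
        if block is None or self.fill[block] == PAGES_PER_BLOCK:
            block = self.free_blocks.pop()       # open a fresh block
            self.group_block[group] = block
            self.fill[block] = 0
        self.l2p[lpn] = (block, self.fill[block])
        self.fill[block] += 1

    def read(self, lpn):
        return self.l2p[lpn]       # the address-translation step
```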