84 research outputs found

    A High-Speed Range-Matching TCAM for Storage-Efficient Packet Classification

    Abstract: A critical issue in the use of TCAMs for packet classification is how to efficiently represent rules with ranges, known as range matching. This paper presents a range-matching ternary content addressable memory (RM-TCAM) that includes a highly functional range-matching cell (RMC). By offering various range operators, the RM-TCAM reduces the storage expansion ratio from 4.21 to 1.01 compared with conventional TCAMs under real-world packet classification rule sets, which in turn reduces power consumption and die area. A new pre-discharging match-line scheme realizes high-speed searching in a dynamic match-line structure. An additional charge-recycling driver further reduces the power consumption of the search lines. In simulation, a 256 × 64-bit range-matching TCAM implemented in 0.13-µm CMOS technology achieves a 1.99-ns search time with an energy efficiency of 1.26 fJ/bit/search. Whereas a TCAM that uses a range encoding approach requires an additional SRAM or DRAM, the RM-TCAM improves storage efficiency without any extra components and also reduces the die area.
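    To make the storage expansion concrete, here is a minimal sketch of the standard range-to-prefix expansion that a conventional TCAM needs, since each ternary entry can only express a prefix. The function name `range_to_prefixes` and the 4-bit example are illustrative assumptions, not taken from the paper, and the sketch does not model the RM-TCAM's range operators.

```python
def range_to_prefixes(lo: int, hi: int, width: int) -> list[str]:
    """Split the inclusive range [lo, hi] of width-bit values into ternary
    prefixes ('*' = don't care). Hypothetical helper for illustration."""
    prefixes = []
    while lo <= hi:
        # Largest power-of-two block aligned at lo (whole space when lo == 0).
        size = lo & -lo if lo else 1 << width
        # Shrink the block until it fits inside what is left of the range.
        while size > hi - lo + 1:
            size >>= 1
        wild = size.bit_length() - 1   # number of trailing wildcard bits
        fixed = width - wild           # number of leading fixed bits
        head = format(lo >> wild, f"0{fixed}b") if fixed else ""
        prefixes.append(head + "*" * wild)
        lo += size
    return prefixes


if __name__ == "__main__":
    # One rule with the 4-bit range [1, 14] occupies six conventional entries.
    for entry in range_to_prefixes(1, 14, 4):
        print(entry)
```

    Under these assumptions a single range rule can occupy several TCAM entries, which is the kind of overhead the expansion ratio quoted in the abstract measures; a cell that evaluates the range directly would need only one entry per rule.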

    GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping

    Nanopore sequencing is a widely used high-throughput genome sequencing technology that can sequence long fragments of a genome into raw electrical signals at low cost. Nanopore sequencing requires two computationally costly processing steps for accurate downstream genome analysis. The first step, basecalling, translates the raw electrical signals into nucleotide bases (i.e., A, C, G, T). The second step, read mapping, finds the correct location of a read in a reference genome. In existing genome analysis pipelines, basecalling and read mapping are executed separately. We observe in this work that such separate execution of the two most time-consuming steps inherently leads to (1) significant data movement and (2) redundant computations on the data, slowing down the genome analysis pipeline. This paper proposes GenPIP, an in-memory genome analysis accelerator that tightly integrates basecalling and read mapping. GenPIP improves the performance of the genome analysis pipeline with two key mechanisms: (1) in-memory fine-grained collaborative execution of the major genome analysis steps in parallel; (2) a new technique for early rejection of low-quality and unmapped reads that stops the genome analysis for such reads in a timely manner, reducing inefficient computation. Our experiments show that, for the execution of the genome analysis pipeline, GenPIP provides 41.6X (8.4X) speedup and 32.8X (20.8X) energy savings with negligible accuracy loss compared to the state-of-the-art software genome analysis tools executed on a state-of-the-art CPU (GPU). Compared to a design that combines state-of-the-art in-memory basecalling and read mapping accelerators, GenPIP provides 1.39X speedup and 1.37X energy savings. Comment: 17 pages, 13 figures

    Memory Management for Emerging Memory Technologies

    The Memory Wall, or the gap between CPU speed and main memory latency, is ever increasing. The latency of Dynamic Random-Access Memory (DRAM) is now on the order of hundreds of CPU cycles. Additionally, DRAM main memory is experiencing power, performance, and capacity constraints that limit process technology scaling. At the same time, the workloads running on such systems are themselves changing because virtualization and cloud computing demand more performance from data centers. Not only do these workloads have larger working set sizes, but they are also changing the way memory gets used, resulting in higher sharing and increased bandwidth demands. New Non-Volatile Memory (NVM) technologies are emerging as an answer to the current main memory issues. This thesis looks at memory management issues as the emerging memory technologies get integrated into the memory hierarchy. We consider the problems at various levels in the memory hierarchy, including sharing of the CPU last-level cache (LLC), traffic management to future non-volatile memories behind the LLC, and extending main memory through the employment of NVM. The first solution we propose is “Adaptive Replacement and Insertion” (ARI), an adaptive approach to last-level CPU cache management that optimizes the cache miss rate and writeback rate simultaneously. Our specific focus is to reduce writebacks as much as possible while maintaining or improving the miss rate relative to the conventional LRU replacement policy, with minimal hardware overhead. ARI reduces writebacks on benchmarks from the SPEC2006 suite by 32.9% on average while also decreasing misses by 4.7% on average. In a PCM-based memory system, this decreases energy consumption by 23% compared to LRU and provides a 49% lifetime improvement beyond what is possible with randomized wear-leveling. Our second proposal is “Variable-Timeslice Thread Scheduling” (VATS), an OS kernel-level approach to CPU cache sharing. With modern, large LLCs, the time to fill the LLC is greater than the OS scheduling window. As a result, when a thread aggressively thrashes the LLC by replacing much of the data in it, another thread may not be able to recover its working set before being rescheduled. We isolate the threads in time by increasing their allotted time quanta and allowing larger periods of time between interfering threads. Our approach, compared to conventional scheduling, mitigates up to 100% of the performance loss caused by CPU LLC interference. The system throughput is boosted by up to 15%. As an unconventional approach to utilizing emerging memory technologies, we present a Ternary Content-Addressable Memory (TCAM) design with Flash transistors. TCAM is successfully used in network routing but can also be utilized in OS virtual memory applications. Based on our layout and circuit simulation experiments, we conclude that our FTCAM block achieves an area improvement of 7.9× and a power improvement of 1.64× compared to a CMOS approach. To lower the cost of main memory in systems with huge memory demand, it is becoming practical to extend the DRAM in the system with less-expensive NVMe Flash. However, given the relatively high access latency of Flash devices, naively using them as main memory leads to serious performance degradation. We propose OSVPP, a software-only, OS swap-based page prefetching scheme for managing such hybrid DRAM + NVM systems. We show that it is possible to regain about 50% of the performance lost due to swapping into the NVM and thus enable the utilization of such hybrid systems for memory-hungry applications, lowering the memory cost while keeping the performance comparable to the DRAM-only system.


    DEMANDS FOR SPIN-BASED NONVOLATILITY IN EMERGING DIGITAL LOGIC AND MEMORY DEVICES FOR LOW POWER COMPUTING

    Miniaturization of semiconductor devices is the main driving force behind the outstanding performance of modern integrated circuits. As the industry focuses on the development of the 3nm technology node, it is apparent that transistor scaling shows signs of saturation. At the same time, the critically high power consumption becomes incompatible with the global demands of sustaining and accelerating the vital industrial growth, prompting the introduction of new solutions for energy-efficient computation. Probably the only radically new option to reduce power consumption in novel integrated circuits is to introduce nonvolatility. Data retention without power sources eliminates leakage and refresh cycles. Because there is no need to spend time initializing the data in temporarily unused parts of the circuit, nonvolatility also supports an instant-on computing paradigm. The electron spin adds functionality to digital switches based on field-effect transistors. SpinFETs and SpinMOSFETs are promising devices, with the nonvolatility introduced through the relative magnetization orientation between the ferromagnetic source and drain. A successful demonstration of such devices requires resolving several fundamental problems, including spin injection from metal ferromagnets to a semiconductor, spin propagation and relaxation, and spin manipulation by the gate voltage. However, increasing the spin injection efficiency to boost the magnetoresistance ratio, as well as efficient spin control, remain challenges to be resolved before these devices appear on the market. Magnetic tunnel junctions with a large magnetoresistance ratio are perfectly suited as key elements of nonvolatile CMOS-compatible magnetoresistive embedded memory. Purely electrically manipulated spin-transfer torque and spin-orbit torque magnetoresistive memories are superior to flash and will potentially compete with DRAM and SRAM. All major foundries have announced near-future production of such memories. Two-terminal magnetic tunnel junctions possess a simple structure, long retention time, high endurance, and fast operation speed, and they yield a high integration density. Combining nonvolatile elements with CMOS devices allows for efficient power gating. Shifting data processing capabilities into the nonvolatile segment paves the way for a new low-power and high-performance computing paradigm based on an in-memory computing architecture, where the same nonvolatile elements are used to store and to process the information.

    Enable advanced QoS-aware network slicing in 5G networks for slice-based media use cases

    © 2019 IEEE. Media use cases for emergency services require mission-critical levels of reliability for the delivery of media-rich services, such as video streaming. With the upcoming deployment of fifth generation (5G) networks, a wide variety of applications and services with heterogeneous performance requirements are expected to be supported, and any migration of mission-critical services to 5G networks presents significant quality of service (QoS) challenges for emergency service operators. This paper presents a novel SliceNet framework, based on advanced and customizable network slicing, to address some of the highlighted challenges in migrating eHealth telemedicine services to 5G networks. An overview of the framework outlines the technical approaches that go beyond state-of-the-art network slicing. Subsequently, this paper emphasizes the design and prototyping of a media-centric eHealth use case, focusing on a set of innovative enablers toward achieving the end-to-end QoS-aware network slicing capabilities required by this demanding use case. Experimental results empirically validate the prototyped enablers and demonstrate the applicability of the proposed framework in such media-rich use cases. Peer reviewed. Postprint (author's final draft).

    Sensor-based machine olfaction with neuromorphic models of the olfactory system

    Electronic noses combine an array of cross-selective gas sensors with a pattern recognition engine to identify odors. Pattern recognition of the multivariate gas sensor response is usually performed using existing statistical and chemometric techniques. An alternative solution involves developing novel algorithms inspired by information processing in the biological olfactory system. The objective of this dissertation is to develop a neuromorphic architecture for pattern recognition for a chemosensor array, inspired by key signal processing mechanisms in the olfactory system. Our approach can be summarized as follows. First, a high-dimensional odor signal is generated from a chemical sensor array. Three approaches have been proposed to generate this combinatorial and high-dimensional odor signal: temperature modulation of a metal-oxide chemoresistor, a large population of optical microbead sensors, and infrared spectroscopy. The resulting high-dimensional odor signals are subject to dimensionality reduction using a self-organizing model of chemotopic convergence. This convergence transforms the initial combinatorial high-dimensional code into an organized spatial pattern (i.e., an odor image), which decouples odor identity from intensity. Two lateral inhibitory circuits subsequently process the highly overlapping odor images obtained after convergence. The first, a shunting lateral inhibition circuit, performs gain control, enabling identification of the odorant across a wide range of concentrations. It is followed by an additive lateral inhibition circuit with center-surround connections. These circuits improve the contrast between odor images, leading to sparser and more orthogonal patterns than those available at the input. The sharpened odor image is stored in a neurodynamic model of a cortex. Finally, anti-Hebbian/Hebbian inhibitory feedback from the cortical circuits to the contrast enhancement circuits performs mixture segmentation and weaker odor/background suppression, respectively. We validate the models using experimental datasets and show that our results are consistent with recent neurobiological findings.

    Fully Programming the Data Plane: A Hardware/Software Approach

    Software-Defined Networking (SDN) has emerged in recent years as a new network paradigm to de-ossify communication networks. By offering a clear separation of network concerns between the management, control, and data planes, SDN allows each of these planes to evolve independently, breaking the rigidity of traditional networks. However, while widespread in the control and management planes, this de-ossification has only recently reached the data plane with the advent of packet processing languages, e.g. P4, and novel programmable switch architectures, e.g. the Protocol Independent Switch Architecture (PISA). In this work, we focus on leveraging the PISA architecture, mainly by exploiting FPGA capabilities for efficient packet processing. We address this issue at three abstraction levels: i) microarchitectural; ii) programming; and iii) architectural. At the microarchitectural level, we have proposed an efficient FPGA-based architecture for the packet parser, a major component of PISA. The proposed packet parser follows a feed-forward pipeline architecture whose internal microarchitecture has been meticulously optimized for FPGA implementation. The architecture is automatically generated by a P4-to-C++ compiler after several rounds of graph optimizations, combined with a high-level synthesis tool. The proposed solution achieves a 100 Gb/s line rate with latency comparable to hand-written packet parsers. The throughput scales from 10 Gb/s to 160 Gb/s with a moderate increase in resource consumption. Both the compiler and the packet parser codebase have been open-sourced to permit reproducibility. At the programming level, we have proposed a novel High-Level Synthesis (HLS) design methodology aiming at jointly improving software and hardware quality. We employed this methodology when designing the packet parser. In our work, we have exploited features of modern C++ that improve code modularity and readability at the same time, while keeping (or improving) the results of the generated hardware. Design examples using our methodology have been publicly released.

    Self-timed field programmable gate array architectures
