A High-Speed Range-Matching TCAM for Storage-Efficient Packet Classification
Abstract: A critical issue in the use of TCAMs for packet
classification is how to efficiently represent rules with ranges,
known as range matching. A range-matching ternary content
addressable memory (RM-TCAM) including a highly functional
range-matching cell (RMC) is presented in this paper. By offering
various range operators, the RM-TCAM reduces the storage
expansion ratio from 4.21 to 1.01 compared with conventional
TCAMs under real-world packet classification rule sets, which
results in reduced power consumption and die area. A new pre-discharging
match-line scheme is used to realize high-speed searching
in a dynamic match-line structure. An additional charge-recycling
driver further reduces the power consumption of search lines.
Simulation results show that a 256 × 64-bit range-matching TCAM,
implemented in 0.13-µm CMOS technology, achieves a 1.99-ns
search time with an energy efficiency of 1.26 fJ/bit/search. Whereas
range-encoding TCAM approaches require an additional
SRAM or DRAM, the RM-TCAM improves storage efficiency
and reduces die area without any extra components.
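The storage expansion the abstract refers to comes from the standard prefix-expansion workaround: a conventional TCAM stores only ternary prefixes, so a single range rule must be split into several entries. A minimal sketch of that expansion (this illustrates the problem, not the RM-TCAM's mechanism, which avoids it with in-cell range operators):

```python
def range_to_prefixes(lo, hi, width):
    """Split the integer range [lo, hi] into the set of ternary
    prefixes a conventional TCAM needs to represent it ('*' = don't care)."""
    prefixes = []
    while lo <= hi:
        # grow the largest aligned power-of-two block starting at lo
        size = 1
        while lo % (size * 2) == 0 and lo + size * 2 - 1 <= hi:
            size *= 2
        wild = size.bit_length() - 1              # number of wildcard bits
        prefixes.append(format(lo >> wild, f'0{width - wild}b') + '*' * wild)
        lo += size
    return prefixes

# A single rule on the 3-bit range [1, 6] already needs 4 TCAM entries:
print(range_to_prefixes(1, 6, 3))   # ['001', '01*', '10*', '110']
```

Port-range rules in real classifiers behave the same way on 16-bit fields, which is where expansion ratios like the 4.21 quoted above come from.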
GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping
Nanopore sequencing is a widely-used high-throughput genome sequencing
technology that can sequence long fragments of a genome into raw electrical
signals at low cost. Nanopore sequencing requires two computationally-costly
processing steps for accurate downstream genome analysis. The first step,
basecalling, translates the raw electrical signals into nucleotide bases (i.e.,
A, C, G, T). The second step, read mapping, finds the correct location of a
read in a reference genome. In existing genome analysis pipelines, basecalling
and read mapping are executed separately. We observe in this work that such
separate execution of the two most time-consuming steps inherently leads to (1)
significant data movement and (2) redundant computations on the data, slowing
down the genome analysis pipeline. This paper proposes GenPIP, an in-memory
genome analysis accelerator that tightly integrates basecalling and read
mapping. GenPIP improves the performance of the genome analysis pipeline with
two key mechanisms: (1) in-memory fine-grained collaborative execution of the
major genome analysis steps in parallel; (2) a new technique for
early-rejection of low-quality and unmapped reads to timely stop the execution
of genome analysis for such reads, reducing inefficient computation. Our
experiments show that, for the execution of the genome analysis pipeline,
GenPIP provides 41.6X (8.4X) speedup and 32.8X (20.8X) energy savings with
negligible accuracy loss compared to the state-of-the-art software genome
analysis tools executed on a state-of-the-art CPU (GPU). Compared to a design
that combines state-of-the-art in-memory basecalling and read mapping
accelerators, GenPIP provides 1.39X speedup and 1.37X energy savings.
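The early-rejection idea can be illustrated with a toy sketch (hypothetical thresholds and data model, not GenPIP's actual hardware logic): basecall only a small probe of each read first, and skip the rest of the pipeline when the probe's quality is too low.

```python
def analyze_reads(reads, quality_threshold=0.9, probe_fraction=0.25):
    """Each read is modeled as a list of per-chunk quality scores in [0, 1].
    Inspect only a probe prefix; reject the read early if the probe's
    mean quality falls below the threshold, saving the downstream work."""
    accepted, rejected = [], []
    for read in reads:
        probe_len = max(1, int(len(read) * probe_fraction))
        probe_quality = sum(read[:probe_len]) / probe_len
        if probe_quality < quality_threshold:
            rejected.append(read)   # no full basecalling or read mapping
        else:
            accepted.append(read)   # proceed with the full pipeline
    return accepted, rejected
```

The point of the sketch is that the rejection decision uses only a fraction of each read, so low-quality reads never pay the full basecalling and mapping cost.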
Memory Management for Emerging Memory Technologies
The Memory Wall, or the gap between CPU speed and main memory latency, is ever increasing. The latency of Dynamic Random-Access Memory (DRAM) is now of the order of hundreds of CPU cycles. Additionally, the DRAM main memory is experiencing power, performance and capacity constraints that limit process technology scaling. On the other hand, the workloads running on such systems are themselves changing due to virtualization and cloud computing demanding more performance of the data centers. Not only do these workloads have larger working set sizes, but they are also changing the way memory gets used, resulting in higher sharing and increased bandwidth demands. New Non-Volatile Memory technologies (NVM) are emerging as an answer to the current main memory issues.
This thesis looks at memory management issues as the emerging memory technologies get integrated into the memory hierarchy. We consider the problems at various levels in the memory hierarchy, including sharing of CPU LLC, traffic management to future non-volatile memories behind the LLC, and extending main memory through the employment of NVM.
The first solution we propose is "Adaptive Replacement and Insertion" (ARI), an adaptive approach to last-level CPU cache management that optimizes the cache miss rate and writeback rate simultaneously. Our specific focus is to reduce writebacks as much as possible while maintaining or improving the miss rate relative to the conventional LRU replacement policy, with minimal hardware overhead. ARI reduces writebacks on benchmarks from the SPEC2006 suite by 32.9% on average while also decreasing misses by 4.7% on average. In a PCM-based memory system, this decreases energy consumption by 23% compared to LRU and provides a 49% lifetime improvement beyond what is possible with randomized wear-leveling.
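A minimal software model of the idea (a sketch of a writeback-aware policy in the spirit of ARI, not the thesis's actual mechanism): on a miss, prefer evicting a clean block among the oldest few, so dirty blocks get more chances to absorb further writes before being written back.

```python
from collections import OrderedDict

class WritebackAwareCache:
    """Toy LRU variant: evict the least-recently-used *clean* block among
    the oldest `window` blocks, falling back to plain LRU when all are
    dirty, to cut the number of dirty writebacks."""
    def __init__(self, capacity, window=4):
        self.capacity, self.window = capacity, window
        self.blocks = OrderedDict()   # addr -> dirty flag, in LRU order

    def access(self, addr, write=False):
        """Touch a block; return the address written back, if any."""
        writeback = None
        if addr in self.blocks:
            dirty = self.blocks.pop(addr) or write      # hit: refresh recency
        else:
            dirty = write
            if len(self.blocks) >= self.capacity:
                candidates = list(self.blocks.items())[:self.window]
                victim = next((a for a, d in candidates if not d),
                              candidates[0][0])         # all dirty: plain LRU
                if self.blocks.pop(victim):
                    writeback = victim                  # dirty victim costs a writeback
        self.blocks[addr] = dirty
        return writeback
```

With a PCM-backed main memory, every avoided writeback saves both energy and device wear, which is the motivation behind optimizing both rates at once.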
Our second proposal is "Variable-Timeslice Thread Scheduling" (VATS), an OS kernel-level approach to CPU cache sharing. With modern, large last-level caches (LLCs), the time to fill the LLC exceeds the OS scheduling window. As a result, when a thread aggressively thrashes the LLC by replacing much of the data in it, another thread may not be able to recover its working set before being rescheduled. We isolate threads in time by increasing their allotted time quanta, allowing larger periods of time between interfering threads. Compared to conventional scheduling, our approach mitigates up to 100% of the performance loss caused by LLC interference and boosts system throughput by up to 15%.
As an unconventional approach to utilizing emerging memory technologies, we present a Ternary Content-Addressable Memory (TCAM) design built with Flash transistors. TCAM is successfully used in network routing but can also be utilized in OS virtual memory applications. Based on our layout and circuit simulation experiments, we conclude that our FTCAM block achieves an area improvement of 7.9× and a power improvement of 1.64× compared to a CMOS approach.
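Functionally, a TCAM performs a priority-ordered ternary match across all entries in a single cycle; a behavioral sketch of that lookup semantics (software only; the thesis's contribution is the Flash-transistor circuit, not this logic):

```python
def tcam_lookup(key, entries):
    """Return the index of the first (highest-priority) entry whose
    pattern matches the key; 'x' bits are don't-cares."""
    for index, pattern in enumerate(entries):
        if all(p in ('x', k) for p, k in zip(pattern, key)):
            return index
    return None

table = ['10x1', '1xxx', 'xxxx']      # most-specific entries first
print(tcam_lookup('1001', table))     # 0
print(tcam_lookup('1100', table))     # 1
```

In a virtual-memory setting the key would be a virtual page number and the don't-care bits would let one entry cover a whole region, which is what makes TCAM attractive there.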
In order to lower the cost of main memory in systems with huge memory demand, it is becoming practical to extend the DRAM in the system with less-expensive NVMe Flash, for a much lower system cost. However, given the relatively high access latency of Flash devices, naively using them as main memory leads to serious performance degradation. We propose OSVPP, a software-only, OS swap-based page prefetching scheme for managing such hybrid DRAM + NVM systems. We show that it is possible to recover about 50% of the performance lost to swapping into the NVM, enabling the use of such hybrid systems for memory-hungry applications and lowering the memory cost while keeping performance comparable to a DRAM-only system.
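A minimal sketch of the swap-based prefetching idea (hypothetical interface; OSVPP's actual policy lives in the OS swap path): when a fault is serviced from the slow NVM device, pull in a few adjacent pages as part of the same request so that subsequent near-sequential faults hit DRAM instead.

```python
def handle_fault(faulting_page, resident, swap, prefetch_depth=2):
    """Swap in the faulting page plus up to `prefetch_depth` following
    pages that are currently swapped out, amortizing NVM access latency.
    `resident` and `swap` are sets of page numbers."""
    fetched = []
    for page in range(faulting_page, faulting_page + 1 + prefetch_depth):
        if page in swap and page not in resident:
            swap.discard(page)
            resident.add(page)
            fetched.append(page)    # one batched device read in practice
    return fetched
```

The win comes from turning several high-latency device round-trips into one, at the risk of polluting DRAM when the access pattern is not sequential; a real prefetcher would adapt `prefetch_depth` to the observed pattern.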
DEMANDS FOR SPIN-BASED NONVOLATILITY IN EMERGING DIGITAL LOGIC AND MEMORY DEVICES FOR LOW POWER COMPUTING
Miniaturization of semiconductor devices is the main driving force behind the outstanding performance of modern integrated circuits. As the industry focuses on the development of the 3nm technology node, it is apparent that transistor scaling shows signs of saturation. At the same time, critically high power consumption is becoming incompatible with the global demands of sustaining and accelerating vital industrial growth, prompting the introduction of new solutions for energy-efficient computation. Probably the only radically new option for reducing power consumption in novel integrated circuits is to introduce nonvolatility. Retaining data without a power source eliminates leakage and refresh cycles. Because there is no need to spend time re-initializing data in temporarily unused parts of the circuit, nonvolatility also supports an instant-on computing paradigm. The electron spin adds functionality to digital switches based on field-effect transistors. SpinFETs and SpinMOSFETs are promising devices in which nonvolatility is introduced through the relative magnetization orientation of the ferromagnetic source and drain. A successful demonstration of such devices requires resolving several fundamental problems, including spin injection from metal ferromagnets into a semiconductor, spin propagation and relaxation, and spin manipulation by the gate voltage. However, increasing the spin injection efficiency to boost the magnetoresistance ratio, as well as efficient spin control, are challenges to be resolved before these devices appear on the market. Magnetic tunnel junctions with a large magnetoresistance ratio are perfectly suited as key elements of nonvolatile, CMOS-compatible magnetoresistive embedded memory. Purely electrically manipulated spin-transfer torque and spin-orbit torque magnetoresistive memories are superior to flash and will potentially compete with DRAM and SRAM.
All major foundries have announced near-future production of such memories. Two-terminal magnetic tunnel junctions possess a simple structure, long retention time, high endurance, and fast operation speed, and they yield a high integration density. Combining nonvolatile elements with CMOS devices allows for efficient power gating. Shifting data processing capabilities into the nonvolatile segment paves the way for a new low-power, high-performance computing paradigm based on an in-memory computing architecture, where the same nonvolatile elements are used to store and to process the information.
Enable advanced QoS-aware network slicing in 5G networks for slice-based media use cases
© 2019 IEEE. Media use cases for emergency services require mission-critical levels of reliability for the delivery of media-rich services, such as video streaming. With the upcoming deployment of fifth generation (5G) networks, a wide variety of applications and services with heterogeneous performance requirements are expected to be supported, and any migration of mission-critical services to 5G networks presents significant quality of service (QoS) challenges for emergency service operators. This paper presents the novel SliceNet framework, based on advanced and customizable network slicing, to address some of the highlighted challenges in migrating eHealth telemedicine services to 5G networks. An overview of the framework outlines the technical approaches in beyond-state-of-the-art network slicing. Subsequently, this paper emphasizes the design and prototyping of a media-centric eHealth use case, focusing on a set of innovative enablers toward achieving the end-to-end QoS-aware network slicing capabilities required by this demanding use case. Experimental results empirically validate the prototyped enablers and demonstrate the applicability of the proposed framework in such media-rich use cases.
Sensor-based machine olfaction with neuromorphic models of the olfactory system
Electronic noses combine an array of cross-selective gas sensors with a pattern recognition engine to identify odors. Pattern recognition of multivariate gas sensor responses is usually performed using existing statistical and chemometric techniques. An alternative solution involves developing novel algorithms inspired by information processing in the biological olfactory system. The objective of this dissertation is to develop a neuromorphic pattern recognition architecture for a chemosensor array, inspired by key signal processing mechanisms in the olfactory system. Our approach can be summarized as follows. First, a high-dimensional odor signal is generated from a chemical sensor array. Three approaches have been proposed to generate this combinatorial, high-dimensional odor signal: temperature modulation of a metal-oxide chemoresistor, a large population of optical microbead sensors, and infrared spectroscopy. The resulting high-dimensional odor signals are subject to dimensionality reduction using a self-organizing model of chemotopic convergence. This convergence transforms the initial combinatorial high-dimensional code into an organized spatial pattern (i.e., an odor image), which decouples odor identity from intensity. Two lateral inhibitory circuits subsequently process the highly overlapping odor images obtained after convergence. The first, a shunting lateral inhibition circuit, performs gain control, enabling identification of an odorant across a wide range of concentrations. It is followed by an additive lateral inhibition circuit with center-surround connections, which improves contrast between odor images, leading to sparser and more orthogonal patterns than those available at the input. The sharpened odor image is stored in a neurodynamic model of the cortex.
Finally, anti-Hebbian/Hebbian inhibitory feedback from the cortical circuits to the contrast enhancement circuits performs mixture segmentation and weaker odor/background suppression, respectively. We validate the models using experimental datasets and show that our results are consistent with recent neurobiological findings.
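The two inhibition stages can be sketched numerically (a toy model with made-up parameters, not the dissertation's circuit equations): shunting inhibition divides each unit's activity by the total (making the pattern shape concentration-invariant), and center-surround subtraction sharpens contrast between overlapping patterns.

```python
def shunting_gain_control(odor_image):
    """Shunting lateral inhibition: divisive normalization by overall
    activity, decoupling the pattern (odor identity) from intensity."""
    total = sum(odor_image)
    return [v / (1.0 + total) for v in odor_image]

def center_surround(odor_image, surround_weight=0.5):
    """Additive lateral inhibition with center-surround connections:
    each unit is suppressed by the mean of its neighbors, yielding a
    sparser, higher-contrast pattern (rectified at zero)."""
    n = len(odor_image)
    out = []
    for i, v in enumerate(odor_image):
        surround = (sum(odor_image) - v) / (n - 1)
        out.append(max(0.0, v - surround_weight * surround))
    return out
```

Running the gain-control stage on the same pattern at two concentrations leaves the ratios between units unchanged, which is the concentration-invariance property described above; the center-surround stage then zeroes weakly driven units, orthogonalizing overlapping images.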
Fully Programming the Data Plane: A Hardware/Software Approach
Software-Defined Networking (SDN) has emerged in recent years as a new network paradigm to de-ossify communication networks. Indeed, by offering a clear separation of network concerns
between the management, control, and data planes, SDN allows each of these planes to evolve independently, breaking the rigidity of traditional networks. However, while well
spread in the control and management planes, this de-ossification has only recently reached the data plane with the advent of packet processing languages, e.g. P4, and novel programmable switch architectures, e.g. Protocol Independent Switch Architecture (PISA). In this work, we focus on leveraging the PISA architecture by mainly exploiting the FPGA capabilities for efficient packet processing. In this way, we address this issue at different
abstraction levels: i) microarchitectural; ii) programming; and iii) architectural. At the microarchitectural level, we have proposed an efficient FPGA-based packet parser architecture; the parser is a major component of PISA. The proposed packet parser follows a feed-forward pipeline architecture whose internal microarchitecture has been meticulously optimized for FPGA implementation. The architecture is automatically generated by a P4-to-C++ compiler after several rounds of graph optimizations. The proposed solution achieves a 100 Gb/s line rate with latency comparable to hand-written packet parsers. The throughput scales from 10 Gb/s to 160 Gb/s with a moderate increase in resource consumption. Both the compiler and the packet parser codebase have been open-sourced to permit reproducibility. At the programming level, we have proposed a novel High-Level Synthesis (HLS) design methodology aimed at improving software and hardware quality, and we employed it when designing the packet parser. We exploited features of modern C++ that improve code modularity and readability while maintaining (or improving) the quality of the generated hardware. Design examples using our methodology have been publicly released.
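The feed-forward parse-graph structure that such parsers implement can be sketched in software (standard Ethernet/IPv4 wire-format offsets; the actual design is an HLS-generated hardware pipeline): each state extracts a fixed-size header and selects the next state from a lookup field, with no back-edges.

```python
def parse(packet):
    """Parse Ethernet and (if present) IPv4 headers from raw bytes.
    The EtherType at bytes 12-13 selects the next parse state."""
    headers = {}
    ethertype = int.from_bytes(packet[12:14], 'big')
    headers['ethernet'] = {'ethertype': ethertype}
    if ethertype == 0x0800:                       # next state: IPv4
        ip = packet[14:34]                        # fixed 20-byte header
        headers['ipv4'] = {
            'protocol': ip[9],
            'src': '.'.join(str(b) for b in ip[12:16]),
            'dst': '.'.join(str(b) for b in ip[16:20]),
        }
    return headers
```

Because the state transitions only ever move forward through the packet, the same structure maps directly onto a pipelined hardware implementation with one stage per header, which is what makes the feed-forward property valuable for FPGA targets.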