A High-Speed Range-Matching TCAM for Storage-Efficient Packet Classification
Abstract: A critical issue in the use of TCAMs for packet
classification is how to efficiently represent rules with ranges,
known as range matching. A range-matching ternary content
addressable memory (RM-TCAM) including a highly functional
range-matching cell (RMC) is presented in this paper. By offering
various range operators, the RM-TCAM reduces the storage
expansion ratio from 4.21 to 1.01 compared with conventional
TCAMs under real-world packet classification rule sets, which
results in reduced power consumption and die area. A new pre-discharging
match-line scheme is used to realize high-speed searching
in a dynamic match-line structure. An additional charge-recycling
driver further reduces the power consumption of search lines.
Simulation results show that a 256 × 64-bit range-matching TCAM,
implemented in 0.13-µm CMOS technology, achieves a 1.99-ns
search time with an energy efficiency of 1.26 fJ/bit/search. Whereas
range-encoding TCAM approaches require an additional
SRAM or DRAM, the RM-TCAM improves storage efficiency
and reduces die area without any extra components.
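The storage expansion the abstract refers to comes from the standard prefix-expansion workaround: a conventional TCAM stores only ternary prefixes, so a single range rule must be split into several entries. A minimal sketch of that expansion (this illustrates the problem, not the RM-TCAM's mechanism, which avoids it with in-cell range operators):

```python
def range_to_prefixes(lo, hi, width):
    """Split the integer range [lo, hi] into the set of ternary
    prefixes a conventional TCAM needs to represent it ('*' = don't care)."""
    prefixes = []
    while lo <= hi:
        # grow the largest aligned power-of-two block starting at lo
        size = 1
        while lo % (size * 2) == 0 and lo + size * 2 - 1 <= hi:
            size *= 2
        wild = size.bit_length() - 1              # number of wildcard bits
        prefixes.append(format(lo >> wild, f'0{width - wild}b') + '*' * wild)
        lo += size
    return prefixes

# A single rule on the 3-bit range [1, 6] already needs 4 TCAM entries:
print(range_to_prefixes(1, 6, 3))   # ['001', '01*', '10*', '110']
```

Port-range rules in real classifiers behave the same way on 16-bit fields, which is where expansion ratios like the 4.21 quoted above come from.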
GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping
Nanopore sequencing is a widely-used high-throughput genome sequencing
technology that can sequence long fragments of a genome into raw electrical
signals at low cost. Nanopore sequencing requires two computationally-costly
processing steps for accurate downstream genome analysis. The first step,
basecalling, translates the raw electrical signals into nucleotide bases (i.e.,
A, C, G, T). The second step, read mapping, finds the correct location of a
read in a reference genome. In existing genome analysis pipelines, basecalling
and read mapping are executed separately. We observe in this work that such
separate execution of the two most time-consuming steps inherently leads to (1)
significant data movement and (2) redundant computations on the data, slowing
down the genome analysis pipeline. This paper proposes GenPIP, an in-memory
genome analysis accelerator that tightly integrates basecalling and read
mapping. GenPIP improves the performance of the genome analysis pipeline with
two key mechanisms: (1) in-memory fine-grained collaborative execution of the
major genome analysis steps in parallel; (2) a new technique for
early-rejection of low-quality and unmapped reads to timely stop the execution
of genome analysis for such reads, reducing inefficient computation. Our
experiments show that, for the execution of the genome analysis pipeline,
GenPIP provides 41.6X (8.4X) speedup and 32.8X (20.8X) energy savings with
negligible accuracy loss compared to the state-of-the-art software genome
analysis tools executed on a state-of-the-art CPU (GPU). Compared to a design
that combines state-of-the-art in-memory basecalling and read mapping
accelerators, GenPIP provides 1.39X speedup and 1.37X energy savings.
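The early-rejection idea can be illustrated with a toy sketch (hypothetical thresholds and data model, not GenPIP's actual hardware logic): basecall only a small probe of each read first, and skip the rest of the pipeline when the probe's quality is too low.

```python
def analyze_reads(reads, quality_threshold=0.9, probe_fraction=0.25):
    """Each read is modeled as a list of per-chunk quality scores in [0, 1].
    Inspect only a probe prefix; reject the read early if the probe's
    mean quality falls below the threshold, saving the downstream work."""
    accepted, rejected = [], []
    for read in reads:
        probe_len = max(1, int(len(read) * probe_fraction))
        probe_quality = sum(read[:probe_len]) / probe_len
        if probe_quality < quality_threshold:
            rejected.append(read)   # no full basecalling or read mapping
        else:
            accepted.append(read)   # proceed with the full pipeline
    return accepted, rejected
```

The point of the sketch is that the rejection decision uses only a fraction of each read, so low-quality reads never pay the full basecalling and mapping cost.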
Memory Management for Emerging Memory Technologies
The Memory Wall, or the gap between CPU speed and main memory latency, is ever increasing. The latency of Dynamic Random-Access Memory (DRAM) is now of the order of hundreds of CPU cycles. Additionally, the DRAM main memory is experiencing power, performance and capacity constraints that limit process technology scaling. On the other hand, the workloads running on such systems are themselves changing due to virtualization and cloud computing demanding more performance of the data centers. Not only do these workloads have larger working set sizes, but they are also changing the way memory gets used, resulting in higher sharing and increased bandwidth demands. New Non-Volatile Memory technologies (NVM) are emerging as an answer to the current main memory issues.
This thesis looks at memory management issues as the emerging memory technologies get integrated into the memory hierarchy. We consider the problems at various levels in the memory hierarchy, including sharing of CPU LLC, traffic management to future non-volatile memories behind the LLC, and extending main memory through the employment of NVM.
The first solution we propose is "Adaptive Replacement and Insertion" (ARI), an adaptive approach to last-level CPU cache management that optimizes the cache miss rate and writeback rate simultaneously. Our specific focus is to reduce writebacks as much as possible while maintaining or improving the miss rate relative to the conventional LRU replacement policy, with minimal hardware overhead. ARI reduces writebacks on benchmarks from the SPEC2006 suite by 32.9% on average while also decreasing misses by 4.7% on average. In a PCM-based memory system, this decreases energy consumption by 23% compared to LRU and provides a 49% lifetime improvement beyond what is possible with randomized wear-leveling.
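A minimal software model of the idea (a sketch of a writeback-aware policy in the spirit of ARI, not the thesis's actual mechanism): on a miss, prefer evicting a clean block among the oldest few, so dirty blocks get more chances to absorb further writes before being written back.

```python
from collections import OrderedDict

class WritebackAwareCache:
    """Toy LRU variant: evict the least-recently-used *clean* block among
    the oldest `window` blocks, falling back to plain LRU when all are
    dirty, to cut the number of dirty writebacks."""
    def __init__(self, capacity, window=4):
        self.capacity, self.window = capacity, window
        self.blocks = OrderedDict()   # addr -> dirty flag, in LRU order

    def access(self, addr, write=False):
        """Touch a block; return the address written back, if any."""
        writeback = None
        if addr in self.blocks:
            dirty = self.blocks.pop(addr) or write      # hit: refresh recency
        else:
            dirty = write
            if len(self.blocks) >= self.capacity:
                candidates = list(self.blocks.items())[:self.window]
                victim = next((a for a, d in candidates if not d),
                              candidates[0][0])         # all dirty: plain LRU
                if self.blocks.pop(victim):
                    writeback = victim                  # dirty victim costs a writeback
        self.blocks[addr] = dirty
        return writeback
```

With a PCM-backed main memory, every avoided writeback saves both energy and device wear, which is the motivation behind optimizing both rates at once.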
Our second proposal is "Variable-Timeslice Thread Scheduling" (VATS), an OS kernel-level approach to CPU cache sharing. With modern, large last-level caches (LLCs), the time to fill the LLC exceeds the OS scheduling window. As a result, when a thread aggressively thrashes the LLC by replacing much of the data in it, another thread may not be able to recover its working set before being rescheduled. We isolate threads in time by increasing their allotted time quanta, allowing larger periods of time between interfering threads. Compared to conventional scheduling, our approach mitigates up to 100% of the performance loss caused by LLC interference and boosts system throughput by up to 15%.
As an unconventional approach to utilizing emerging memory technologies, we present a Ternary Content-Addressable Memory (TCAM) design built with Flash transistors. TCAM is successfully used in network routing but can also be utilized in OS virtual memory applications. Based on our layout and circuit simulation experiments, we conclude that our FTCAM block achieves an area improvement of 7.9× and a power improvement of 1.64× compared to a CMOS approach.
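Functionally, a TCAM performs a priority-ordered ternary match across all entries in a single cycle; a behavioral sketch of that lookup semantics (software only; the thesis's contribution is the Flash-transistor circuit, not this logic):

```python
def tcam_lookup(key, entries):
    """Return the index of the first (highest-priority) entry whose
    pattern matches the key; 'x' bits are don't-cares."""
    for index, pattern in enumerate(entries):
        if all(p in ('x', k) for p, k in zip(pattern, key)):
            return index
    return None

table = ['10x1', '1xxx', 'xxxx']      # most-specific entries first
print(tcam_lookup('1001', table))     # 0
print(tcam_lookup('1100', table))     # 1
```

In a virtual-memory setting the key would be a virtual page number and the don't-care bits would let one entry cover a whole region, which is what makes TCAM attractive there.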
In order to lower the cost of main memory in systems with huge memory demand, it is becoming practical to extend the DRAM in the system with less-expensive NVMe Flash, for a much lower system cost. However, given the relatively high access latency of Flash devices, naively using them as main memory leads to serious performance degradation. We propose OSVPP, a software-only, OS swap-based page prefetching scheme for managing such hybrid DRAM + NVM systems. We show that it is possible to recover about 50% of the performance lost to swapping into the NVM, enabling the use of such hybrid systems for memory-hungry applications and lowering the memory cost while keeping performance comparable to a DRAM-only system.
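A minimal sketch of the swap-based prefetching idea (hypothetical interface; OSVPP's actual policy lives in the OS swap path): when a fault is serviced from the slow NVM device, pull in a few adjacent pages as part of the same request so that subsequent near-sequential faults hit DRAM instead.

```python
def handle_fault(faulting_page, resident, swap, prefetch_depth=2):
    """Swap in the faulting page plus up to `prefetch_depth` following
    pages that are currently swapped out, amortizing NVM access latency.
    `resident` and `swap` are sets of page numbers."""
    fetched = []
    for page in range(faulting_page, faulting_page + 1 + prefetch_depth):
        if page in swap and page not in resident:
            swap.discard(page)
            resident.add(page)
            fetched.append(page)    # one batched device read in practice
    return fetched
```

The win comes from turning several high-latency device round-trips into one, at the risk of polluting DRAM when the access pattern is not sequential; a real prefetcher would adapt `prefetch_depth` to the observed pattern.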
DEMANDS FOR SPIN-BASED NONVOLATILITY IN EMERGING DIGITAL LOGIC AND MEMORY DEVICES FOR LOW POWER COMPUTING
Miniaturization of semiconductor devices is the main driving force behind the outstanding performance of modern integrated circuits. As the industry focuses on the development of the 3nm technology node, it is apparent that transistor scaling shows signs of saturation. At the same time, critically high power consumption is becoming incompatible with the global demands of sustaining and accelerating vital industrial growth, prompting the introduction of new solutions for energy-efficient computation. Probably the only radically new option for reducing power consumption in novel integrated circuits is to introduce nonvolatility. Retaining data without a power source eliminates leakage and refresh cycles. Because there is no need to spend time re-initializing data in temporarily unused parts of the circuit, nonvolatility also supports an instant-on computing paradigm. The electron spin adds functionality to digital switches based on field-effect transistors. SpinFETs and SpinMOSFETs are promising devices in which nonvolatility is introduced through the relative magnetization orientation of the ferromagnetic source and drain. A successful demonstration of such devices requires resolving several fundamental problems, including spin injection from metal ferromagnets into a semiconductor, spin propagation and relaxation, and spin manipulation by the gate voltage. However, increasing the spin injection efficiency to boost the magnetoresistance ratio, as well as efficient spin control, are challenges to be resolved before these devices appear on the market. Magnetic tunnel junctions with a large magnetoresistance ratio are perfectly suited as key elements of nonvolatile, CMOS-compatible magnetoresistive embedded memory. Purely electrically manipulated spin-transfer torque and spin-orbit torque magnetoresistive memories are superior to flash and will potentially compete with DRAM and SRAM.
All major foundries have announced near-future production of such memories. Two-terminal magnetic tunnel junctions possess a simple structure, long retention time, high endurance, and fast operation speed, and they yield a high integration density. Combining nonvolatile elements with CMOS devices allows for efficient power gating. Shifting data processing capabilities into the nonvolatile segment paves the way for a new low-power, high-performance computing paradigm based on an in-memory computing architecture, where the same nonvolatile elements are used to store and to process the information.
Enable advanced QoS-aware network slicing in 5G networks for slice-based media use cases
© 2019 IEEE. Media use cases for emergency services require mission-critical levels of reliability for the delivery of media-rich services, such as video streaming. With the upcoming deployment of fifth generation (5G) networks, a wide variety of applications and services with heterogeneous performance requirements are expected to be supported, and any migration of mission-critical services to 5G networks presents significant quality of service (QoS) challenges for emergency service operators. This paper presents the novel SliceNet framework, based on advanced and customizable network slicing, to address some of the highlighted challenges in migrating eHealth telemedicine services to 5G networks. An overview of the framework outlines the technical approaches in beyond-state-of-the-art network slicing. Subsequently, this paper emphasizes the design and prototyping of a media-centric eHealth use case, focusing on a set of innovative enablers toward achieving the end-to-end QoS-aware network slicing capabilities required by this demanding use case. Experimental results empirically validate the prototyped enablers and demonstrate the applicability of the proposed framework in such media-rich use cases.
Sensor-based machine olfaction with neuromorphic models of the olfactory system
Electronic noses combine an array of cross-selective gas sensors with a pattern recognition engine to identify odors. Pattern recognition of multivariate gas sensor responses is usually performed using existing statistical and chemometric techniques. An alternative solution involves developing novel algorithms inspired by information processing in the biological olfactory system. The objective of this dissertation is to develop a neuromorphic pattern recognition architecture for a chemosensor array, inspired by key signal processing mechanisms in the olfactory system. Our approach can be summarized as follows. First, a high-dimensional odor signal is generated from a chemical sensor array. Three approaches have been proposed to generate this combinatorial, high-dimensional odor signal: temperature modulation of a metal-oxide chemoresistor, a large population of optical microbead sensors, and infrared spectroscopy. The resulting high-dimensional odor signals are subject to dimensionality reduction using a self-organizing model of chemotopic convergence. This convergence transforms the initial combinatorial high-dimensional code into an organized spatial pattern (i.e., an odor image), which decouples odor identity from intensity. Two lateral inhibitory circuits subsequently process the highly overlapping odor images obtained after convergence. The first, a shunting lateral inhibition circuit, performs gain control, enabling identification of an odorant across a wide range of concentrations. It is followed by an additive lateral inhibition circuit with center-surround connections, which improves contrast between odor images, leading to sparser and more orthogonal patterns than those available at the input. The sharpened odor image is stored in a neurodynamic model of the cortex.
Finally, anti-Hebbian/Hebbian inhibitory feedback from the cortical circuits to the contrast enhancement circuits performs mixture segmentation and weaker odor/background suppression, respectively. We validate the models using experimental datasets and show that our results are consistent with recent neurobiological findings.
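The two inhibition stages can be sketched numerically (a toy model with made-up parameters, not the dissertation's circuit equations): shunting inhibition divides each unit's activity by the total (making the pattern shape concentration-invariant), and center-surround subtraction sharpens contrast between overlapping patterns.

```python
def shunting_gain_control(odor_image):
    """Shunting lateral inhibition: divisive normalization by overall
    activity, decoupling the pattern (odor identity) from intensity."""
    total = sum(odor_image)
    return [v / (1.0 + total) for v in odor_image]

def center_surround(odor_image, surround_weight=0.5):
    """Additive lateral inhibition with center-surround connections:
    each unit is suppressed by the mean of its neighbors, yielding a
    sparser, higher-contrast pattern (rectified at zero)."""
    n = len(odor_image)
    out = []
    for i, v in enumerate(odor_image):
        surround = (sum(odor_image) - v) / (n - 1)
        out.append(max(0.0, v - surround_weight * surround))
    return out
```

Running the gain-control stage on the same pattern at two concentrations leaves the ratios between units unchanged, which is the concentration-invariance property described above; the center-surround stage then zeroes weakly driven units, orthogonalizing overlapping images.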
Fully Programming the Data Plane: A Hardware/Software Approach
Software-Defined Networking (SDN) has emerged in recent years as a new network paradigm to de-ossify communication networks. Indeed, by offering a clear separation of network concerns
between the management, control, and data planes, SDN allows each of these planes to evolve independently, breaking the rigidity of traditional networks. However, while well
spread in the control and management planes, this de-ossification has only recently reached the data plane with the advent of packet processing languages, e.g. P4, and novel programmable switch architectures, e.g. Protocol Independent Switch Architecture (PISA). In this work, we focus on leveraging the PISA architecture by mainly exploiting the FPGA capabilities for efficient packet processing. In this way, we address this issue at different
abstraction levels: i) microarchitectural; ii) programming; and iii) architectural. At the microarchitectural level, we have proposed an efficient FPGA-based packet parser architecture; the parser is a major component of PISA. The proposed packet parser follows a feed-forward pipeline architecture whose internal microarchitecture has been meticulously optimized for FPGA implementation. The architecture is automatically generated by a P4-to-C++ compiler after several rounds of graph optimizations. The proposed solution achieves a 100 Gb/s line rate with latency comparable to hand-written packet parsers. The throughput scales from 10 Gb/s to 160 Gb/s with a moderate increase in resource consumption. Both the compiler and the packet parser codebase have been open-sourced to permit reproducibility. At the programming level, we have proposed a novel High-Level Synthesis (HLS) design methodology aimed at improving software and hardware quality, and we employed it when designing the packet parser. We exploited features of modern C++ that improve code modularity and readability while maintaining (or improving) the quality of the generated hardware. Design examples using our methodology have been publicly released.
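The feed-forward parse-graph structure that such parsers implement can be sketched in software (standard Ethernet/IPv4 wire-format offsets; the actual design is an HLS-generated hardware pipeline): each state extracts a fixed-size header and selects the next state from a lookup field, with no back-edges.

```python
def parse(packet):
    """Parse Ethernet and (if present) IPv4 headers from raw bytes.
    The EtherType at bytes 12-13 selects the next parse state."""
    headers = {}
    ethertype = int.from_bytes(packet[12:14], 'big')
    headers['ethernet'] = {'ethertype': ethertype}
    if ethertype == 0x0800:                       # next state: IPv4
        ip = packet[14:34]                        # fixed 20-byte header
        headers['ipv4'] = {
            'protocol': ip[9],
            'src': '.'.join(str(b) for b in ip[12:16]),
            'dst': '.'.join(str(b) for b in ip[16:20]),
        }
    return headers
```

Because the state transitions only ever move forward through the packet, the same structure maps directly onto a pipelined hardware implementation with one stage per header, which is what makes the feed-forward property valuable for FPGA targets.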