    GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping

    Nanopore sequencing is a widely used high-throughput genome sequencing technology that can sequence long fragments of a genome into raw electrical signals at low cost. Nanopore sequencing requires two computationally costly processing steps for accurate downstream genome analysis. The first step, basecalling, translates the raw electrical signals into nucleotide bases (i.e., A, C, G, T). The second step, read mapping, finds the correct location of a read in a reference genome. In existing genome analysis pipelines, basecalling and read mapping are executed separately. We observe in this work that such separate execution of the two most time-consuming steps inherently leads to (1) significant data movement and (2) redundant computations on the data, slowing down the genome analysis pipeline. This paper proposes GenPIP, an in-memory genome analysis accelerator that tightly integrates basecalling and read mapping. GenPIP improves the performance of the genome analysis pipeline with two key mechanisms: (1) in-memory fine-grained collaborative execution of the major genome analysis steps in parallel; (2) a new technique for early rejection of low-quality and unmapped reads that stops the genome analysis of such reads in a timely manner, reducing wasted computation. Our experiments show that, for the execution of the genome analysis pipeline, GenPIP provides 41.6X (8.4X) speedup and 32.8X (20.8X) energy savings with negligible accuracy loss compared to the state-of-the-art software genome analysis tools executed on a state-of-the-art CPU (GPU). Compared to a design that combines state-of-the-art in-memory basecalling and read mapping accelerators, GenPIP provides 1.39X speedup and 1.37X energy savings. Comment: 17 pages, 13 figures

    Long-Term Memory for Cognitive Architectures: A Hardware Approach Using Resistive Devices

    A cognitive agent capable of reliably performing complex tasks over a long time will acquire a large store of knowledge. To interact with changing circumstances, the agent will need to quickly search and retrieve knowledge relevant to its current context. Real time knowledge search and cognitive processing like this is a challenge for conventional computers, which are not optimised for such tasks. This thesis describes a new content-addressable memory, based on resistive devices, that can perform massively parallel knowledge search in the memory array. The fundamental circuit block that supports this capability is a memory cell that closely couples comparison logic with non-volatile storage. By using resistive devices instead of transistors in both the comparison circuit and storage elements, this cell improves area density by over an order of magnitude compared to state of the art CMOS implementations. The resulting memory does not need power to maintain stored information, and is therefore well suited to cognitive agents with large long-term memories. The memory incorporates activation circuits, which bias the knowledge retrieval process according to past memory access patterns. This is achieved by approximating the widely used base-level activation function using resistive devices to store, maintain and compare activation values. By distributing an instance of this circuit to every row in memory, the activation for all memory objects can be updated in parallel. A test using the word sense disambiguation task shows this circuit-based activation model only incurs a small loss in accuracy compared to exact base-level calculations. A variation of spreading activation can also be achieved in-memory. Memory objects are encoded with high-dimensional vectors that create association between correlated representations. By storing these high-dimensional vectors in the new content-addressable memory, activation can be spread to related objects during search operations. 
The new memory is scalable, power and area efficient, and performs operations in parallel that are infeasible in real-time for a sequential processor with a conventional memory hierarchy. Thesis (Ph.D.) -- University of Adelaide, School of Electrical and Electronic Engineering, 201
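
The base-level activation function mentioned in the abstract above is commonly written (in the ACT-R literature) as B = ln(sum over past accesses of age^(-d)), where age is the time since each access and d is a decay parameter. The following is a minimal software sketch of that exact calculation, not of the thesis's resistive circuit; the function names, the default decay of 0.5, and the dictionary layout are assumptions made for illustration.

```python
import math


def base_level_activation(access_times, now, decay=0.5):
    """Exact base-level activation: ln(sum((now - t) ** -decay)).

    access_times: timestamps of past accesses to one memory object.
    now: current time; accesses at or after `now` are ignored.
    decay: assumed decay parameter (0.5 is a conventional default).
    """
    ages = [now - t for t in access_times if now > t]
    if not ages:
        # An object that was never accessed has no activation.
        return float("-inf")
    return math.log(sum(age ** -decay for age in ages))


def most_active(memory, now, decay=0.5):
    """Return the key whose access history yields the highest activation.

    memory: hypothetical dict mapping object name -> list of access times.
    """
    return max(memory, key=lambda name: base_level_activation(memory[name], now, decay))
```

A sequential processor has to evaluate this sum for every object in turn, which is the cost the abstract's per-row activation circuits are intended to avoid by updating all rows in parallel.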

    Hardware acceleration for power efficient deep packet inspection

    The rapid growth of the Internet has led to a massive spread of malicious attacks such as viruses and malware, making the safety of online activity a major concern. The use of Network Intrusion Detection Systems (NIDS) is an effective method to safeguard the Internet. One key procedure in NIDS is Deep Packet Inspection (DPI). DPI can examine the contents of a packet and take actions on the packet based on predefined rules. In this thesis, DPI is mainly discussed in the context of security applications. However, DPI can also be used for bandwidth management and network surveillance. DPI inspects the whole packet payload, and because of this and the complexity of the inspection rules, DPI algorithms consume significant amounts of resources including time, memory and energy. The aim of this thesis is to design hardware-accelerated methods for memory- and energy-efficient high-speed DPI. The patterns in packet payloads, especially complex patterns, can be efficiently represented by regular expressions, which can be implemented using Deterministic Finite Automata (DFA). DFA algorithms are fast but consume very large amounts of memory with certain kinds of regular expressions. In this thesis, memory-efficient algorithms are proposed based on compressing the transitions of the DFAs. In this work, Bloom filters are used to implement DPI on an FPGA for hardware acceleration with the design of a parallel architecture. Furthermore, to balance power and performance, an energy-efficient adaptive Bloom filter is designed with the capability of adjusting the number of active hash functions according to the current workload. In addition, a method is given for implementation on both two-stage and multi-stage platforms. Nevertheless, false positives still prevent the Bloom filter from being used extensively; a cache-based counting Bloom filter is presented in this work to eliminate the false positives for fast and precise matching. 
Finally, in future work, in order to estimate the effect of power savings, models will be built for routers and DPI, which will also analyze the latency impact of dynamic frequency adaption to current traffic. Besides, a low power DPI system will be designed with a single or multiple DPI engines. Results and evaluation of the low power DPI model and system will be produced in future
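
To make the idea of a Bloom filter with an adjustable number of active hash functions concrete, here is a small software sketch. It is an illustration under stated assumptions, not the thesis's FPGA design: the class name, the use of seeded SHA-256 as the hash family, and the choice to always insert with every hash while querying with only the active ones are all hypothetical. Checking fewer hashes saves work per lookup and never causes a false negative, but it raises the false positive rate.

```python
import hashlib


class AdaptiveBloomFilter:
    """Bloom filter whose queries can use fewer hash functions than inserts."""

    def __init__(self, num_bits=1024, max_hashes=4):
        if not 1 <= max_hashes <= 255:
            raise ValueError("max_hashes must be between 1 and 255")
        self.num_bits = num_bits
        self.max_hashes = max_hashes
        self.active_hashes = max_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item, count):
        # Derive `count` bit positions from seeded SHA-256 digests (assumed hash family).
        for seed in range(count):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Always set the bits for every hash so later queries may use any prefix of them.
        for pos in self._positions(item, self.max_hashes):
            self.bits[pos] = 1

    def set_active_hashes(self, count):
        # The caller chooses the policy, e.g. based on the current workload.
        if not 1 <= count <= self.max_hashes:
            raise ValueError("count must be between 1 and max_hashes")
        self.active_hashes = count

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item, self.active_hashes))
```

Items are passed as `bytes`, for example `f.add(b"signature")` followed by `f.might_contain(payload_slice)`. A positive answer still has to be confirmed by an exact match, which is the role the abstract assigns to its cache-based counting Bloom filter.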

    Techniques d'abstraction pour l'analyse et la mitigation des effets dus à la radiation

    The main objective of this thesis is to develop techniques that can be used to analyze and mitigate the effects of radiation-induced soft errors in industrial-scale integrated circuits. To achieve this goal, several methods have been developed based on analyzing the design at higher levels of abstraction. These techniques address both sequential and combinatorial SER. Fault-injection simulations remain the primary method for analyzing the effects of soft errors. In this thesis, techniques which significantly speed up fault-injection simulations are presented. Soft errors in flip-flops are typically mitigated by selectively replacing the most critical flip-flops with hardened implementations. Selecting an optimal set to harden is a compute-intensive problem, and the second contribution consists of a clustering technique which significantly reduces the number of fault injections required to perform selective mitigation. In terrestrial applications, the effect of soft errors in combinatorial logic has been fairly small. It is known that this effect is growing, yet there exist few techniques which can quickly estimate the extent of combinatorial SER for an entire integrated circuit. The third contribution of this thesis is a hierarchical approach to combinatorial soft error analysis. Systems-on-chip are often developed by re-using design blocks that come from multiple sources. In this context, there is a need to develop and exchange reliability models. The final contribution of this thesis consists of an application-specific modeling language called RIIF (Reliability Information Interchange Format). This language is able to model how faults at the gate level propagate up to the block and chip level. Work is underway to standardize the RIIF modeling language as well as to extend it beyond modeling of radiation-induced failures. In addition to the main axis of research, some tangential topics were studied in collaboration with other teams. One of these consisted in the development of a novel approach for protecting ternary content addressable memories (TCAMs), a special type of memory important in networking applications. The second supplemental project resulted in an algorithm for quickly generating approximate redundant logic which can protect combinatorial networks against permanent faults. Finally, an approach for reducing the detection time for errors in the configuration RAM for Field-Programmable Gate Arrays (FPGAs) was outlined.
Radiation effects can cause failures in integrated circuits. When a subatomic particle deposits charge in the sensitive regions of a transistor, it produces a current pulse. This pulse can then flip a bit or propagate through a network of combinatorial logic before being sampled by a downstream flip-flop. Depending on the state of the circuit at the moment of the particle strike and on the application, this may or may not cause an observable failure. Only a small portion of radiation-induced events generate failures. It is therefore essential to determine this fraction in order to predict the reliability of the system. There are many reasons why a disturbance may be masked, and it is sometimes difficult to specify what constitutes an error. In addition, integrated circuits contain billions of transistors. As is often the case in computer-aided design, hierarchical approaches and abstraction techniques make it possible to find solutions. This thesis therefore proposes several new techniques for analyzing radiation effects. The first technique speeds up fault-injection simulations by detecting when a fault has been removed from the system, which allows the simulation to be stopped. The second technique groups circuit elements with a similar function into sets. An analysis can then be performed at the level of the sets, identifying those that are the most critical and therefore need to be hardened. Computation time is thus greatly reduced. The third technique analyzes the effects of transient faults in combinatorial circuits. It is possible to compute in advance the sensitivity of cells to transient faults as well as the masking effects in frequently used blocks. These models can then be combined to analyze the sensitivity of large circuits. The final contribution of this thesis is the definition of a new modeling language called RIIF (Reliability Information Interchange Format). This language describes the fault rates of simple components as a function of their operating environment. These simple components can then be combined to model the propagation of their faults into system-level failures. Furthermore, using a standard language facilitates the exchange of reliability data between industrial partners. Beyond the main contributions, this thesis also addresses techniques for protecting ternary content-addressable memories (TCAMs). Classical protection approaches (error-correcting codes) do not apply directly. One of the proposed new techniques uses a data structure that can detect, statistically, when the result is incorrect. The detection probability can be controlled by the number of bits allocated to this structure. Another technique uses a built-in current sensor (BICS) to direct a background process straight to the region affected by an error. The final contribution is an algorithm that synthesizes combinatorial logic in order to protect combinatorial circuits against transient faults. Taken together, these techniques facilitate the analysis of errors caused by radiation effects in integrated circuits, in particular for very large circuits composed of blocks from various suppliers. Techniques for better selecting the flip-flops to harden and approaches for protecting TCAMs were also studied.

    Application Centric Networks-On-Chip Design Solutions for Future Multicore Systems

    With advances in technology, future multicore systems scaled to hundreds or thousands of cores and accelerators are being touted as an effective solution for extracting large performance gains using parallel programming paradigms. However, with the failure of Dennard scaling, all the components on the chip cannot be run simultaneously without breaking power and thermal constraints, which leads to strict chip power envelopes. The growing number of on-chip components has also brought about Network-on-Chip (NoC) based interconnect designs such as the 2D mesh. The contribution of the NoC to total on-chip power and overall performance has been increasing steadily, and hence high-performance, power-efficient NoC designs are becoming crucial. Future multicore paradigms can be broadly classified, based on the applications they are tailored to, into systems built around traditional Chip Multiprocessor (CMP) applications, characterized by low core and NoC utilization, and systems built around emerging big data applications, characterized by large amounts of data movement that demand high throughput. To this end, we propose NoC design solutions for power savings in future CMPs tailored to traditional applications and for higher effective throughput in multicore systems tailored to bandwidth-intensive applications. First, we propose Fly-over, a lightweight distributed mechanism for power-gating routers attached to switched-off cores to reduce NoC power consumption in low-load CMP environments. Secondly, we plan on utilizing a promising next-generation memory technology, Spin-Transfer Torque Magnetic RAM (STT-MRAM), to achieve enhanced NoC performance that satisfies the high throughput demands of emerging bandwidth-intensive applications while simultaneously reducing power consumption. 
Thirdly, we present a hardware data approximation framework for NoCs, APPROX-NoC, with an online data error control mechanism, which can leverage the approximate computing paradigm in the emerging data intensive big data applications to attain higher performance per watt

    SpiNNaker - A Spiking Neural Network Architecture

    20 years in conception and 15 in construction, the SpiNNaker project has delivered the world’s largest neuromorphic computing platform incorporating over a million ARM mobile phone processors and capable of modelling spiking neural networks of the scale of a mouse brain in biological real time. This machine, hosted at the University of Manchester in the UK, is freely available under the auspices of the EU Flagship Human Brain Project. This book tells the story of the origins of the machine, its development and its deployment, and the immense software development effort that has gone into making it openly available and accessible to researchers and students the world over. It also presents exemplar applications from ‘Talk’, a SpiNNaker-controlled robotic exhibit at the Manchester Art Gallery as part of ‘The Imitation Game’, a set of works commissioned in 2016 in honour of Alan Turing, through to a way to solve hard computing problems using stochastic neural networks. The book concludes with a look to the future, and the SpiNNaker-2 machine which is yet to come

    High Performance Network Evaluation and Testing

    Towards high quality and flexible future internet architectures

