13 research outputs found
Software and Hardware Implementations of the Aho-Corasick Algorithm for Intrusion Detection
This work proposes effective methods and architectures for the implementation of the Aho-Corasick algorithm. This algorithm can be used for pattern matching in network-based intrusion detection systems such as Snort.
Two versions are proposed: a software version and a hardware version. The first develops a software implementation in C/C++ for general-purpose processors. For this, new implementations of the algorithm, which account for memory resources and the processor's sequential execution, are proposed. The second develops an architecture in VHDL for a specialized processor on FPGA. For this, new architectures of the algorithm, which account for the available computing resources, the memory resources, and the fine-grained parallelism inherent to FPGAs, are proposed. Furthermore, a comparison with a modified software version is performed.
For both cases, we analyze the performance and cost trade-offs of selecting different data structures for the nodes in memory. A selection of parameters is used to maximize a performance objective function that combines the cycle count, the memory usage, and the system's clock frequency. The parameters determine which of two or three types of node data structures (depending on the version) is selected for each node of the state machine.
For the validation phase, test cases with diverse data are used to ensure that the algorithm operates properly. Furthermore, the Snort 2.9.7 rules are used. The state machine was built with around 26×10³ patterns, all extracted from these rules, and comprises around 381×10³ nodes.
The main contribution of this work is to show that it is possible, through architecture exploration, to select parameters that yield an optimal memory × time product. For the software version, memory consumption is reduced from 407 MB to 21 MB, an improvement of about 20× compared with the worst case using a single node type. For the hardware version, memory consumption is reduced from 11 MB to 4 MB, an improvement of about 3× compared with the modified software version. Throughput increases from 300 Mbps with the modified software version to 400 Mbps with the hardware version.
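As a concrete reference for the algorithm this thesis builds on, here is a minimal Python sketch of Aho-Corasick construction and search, using plain dictionaries for the goto function and breadth-first computation of failure links. It is a generic textbook illustration, not the memory-optimized node structures described above.

```python
from collections import deque

def build_automaton(patterns):
    """Build the Aho-Corasick goto/fail/output structures for a pattern set."""
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:                      # phase 1: trie of all patterns
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    queue = deque(goto[0].values())           # phase 2: failure links by BFS
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]                   # walk up until ch has an edge
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]            # inherit matches via failure link
    return goto, fail, out

def search(text, goto, fail, out):
    """Yield (end_index, pattern) for every match of any pattern in text."""
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]               # mismatch: follow failure links
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            yield i, pat

g, f, o = build_automaton(["he", "she", "his", "hers"])
print(sorted(search("ushers", g, f, o)))  # [(3, 'he'), (3, 'she'), (5, 'hers')]
```

The node-type trade-off explored in the thesis lives exactly in the `goto` rows here: dense states can afford full lookup tables while sparse states use compact structures, which is what the parameter selection optimizes.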
Energy Efficient Hardware Accelerators for Packet Classification and String Matching
This thesis focuses on the design of new algorithms and energy-efficient, high-throughput hardware accelerators that implement packet classification and fixed string matching. These computationally heavy and memory-intensive tasks are used by networking equipment to inspect all packets at wire speed. The constant growth in Internet usage has made them increasingly difficult to implement at core network line speeds. Packet classification sorts packets into flows by comparing their headers to a list of rules; a flow determines a packet's priority and the manner in which it is processed. Fixed string matching inspects a packet's payload to check whether it contains any strings associated with known viruses, attacks or other harmful activities.
The contributions of this thesis towards the area of packet classification are hardware accelerators that allow packet classification to be implemented at core network line speeds when classifying packets using rulesets containing tens of thousands of rules. The hardware accelerators use modified versions of the HyperCuts packet classification algorithm. An adaptive clocking unit is also presented that dynamically adjusts the clock speed of a packet classification hardware accelerator so that its processing capacity matches the processing needs of the network traffic. This keeps dynamic power consumption to a minimum.
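To make the classification task concrete, here is a minimal sketch of matching a packet header against a priority-ordered list of range-based rules; the two-field rule format and the numbers are invented for illustration. Decision-tree algorithms such as HyperCuts accelerate exactly this lookup by recursively cutting the rule space so that only a handful of rules are compared per packet.

```python
# Hypothetical 2-field rules: (src_lo, src_hi, dport_lo, dport_hi, action),
# listed in priority order. Real classifiers use five or more header fields.
RULES = [
    (10, 19, 80, 80, "drop"),       # src host in 10..19, dst port 80
    (0, 255, 80, 80, "inspect"),    # anyone else talking to port 80
    (0, 255, 0, 65535, "accept"),   # default catch-all
]

def classify(src, dport):
    """Return the action of the first (highest-priority) matching rule."""
    for s_lo, s_hi, d_lo, d_hi, action in RULES:
        if s_lo <= src <= s_hi and d_lo <= dport <= d_hi:
            return action

print(classify(12, 80))   # -> drop
print(classify(42, 80))   # -> inspect
print(classify(42, 443))  # -> accept
```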
Contributions made towards the area of fixed string matching include a new algorithm that builds a state machine used to search for strings with the aid of default transition pointers. The use of default transition pointers keeps memory consumption low, allowing state machines capable of searching for thousands of strings to be small enough to fit in the on-chip memory of devices such as FPGAs. A hardware accelerator is also presented that uses these state machines to search through packet payloads for strings at core network line speeds.
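The memory saving from default transitions can be sketched as follows: instead of storing a full 256-entry transition row per state, each state keeps only the transitions that differ from those of a designated default state, and lookups fall back along default pointers until an explicit entry is found. This is a hypothetical illustration in the spirit of default-transition automata, not the thesis's exact encoding.

```python
def step(state, ch, diff, default):
    """Resolve the transition for ch by walking default pointers.

    diff[s]    -- only the transitions state s stores explicitly
    default[s] -- state whose transitions s otherwise shares (None at the root)
    """
    while ch not in diff[state]:
        if default[state] is None:
            return 0                  # no edge anywhere: back to initial state
        state = default[state]        # borrow the default state's row
    return diff[state][ch]

# Toy automaton: state 1 stores one transition and defers the rest to state 0.
diff    = {0: {"a": 1, "b": 0}, 1: {"a": 2}, 2: {}}
default = {0: None, 1: 0, 2: 0}
print(step(1, "a", diff, default))  # 2: state 1 stores "a" itself
print(step(1, "b", diff, default))  # 0: falls back to state 0's "b" entry
```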
GPU-Acceleration of In-Memory Data Analytics
Hardware advances strongly influence database system design. The flattening speed of CPU cores makes many-core accelerators, such as GPUs, a vital alternative to explore for processing the ever-increasing amounts of data. GPUs have a significantly higher degree of parallelism than multi-core CPUs, but their cores are simpler. As a result, they do not face the power constraints limiting the parallelism of CPUs. Their trade-off, however, is increased implementation complexity. This thesis adapts and redesigns data analytics operators to better exploit the GPU's special memory and threading model. Due to increasing memory capacity and the user's need for fast interaction with the data, we focus on in-memory analytics.
Our techniques span different steps of the data processing pipeline: (1) data preprocessing, (2) query compilation, and (3) algorithmic optimization of the operators. Our data preprocessing techniques adapt the data layout of numeric and string columns to maximize the achieved GPU memory bandwidth. Our query compilation techniques compute the optimal execution plan for conjunctive filters. We formulate memory divergence for string matching algorithms and suggest how to eliminate it. Finally, we parallelize decompression algorithms in our compression framework Gompresso to fit more data into the limited GPU memory. Gompresso achieves high speed-ups on GPUs over state-of-the-art multi-core CPU libraries and is suitable for any massively parallel processor.
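For conjunctive filters, a classic planning rule gives a feel for what such an optimizer computes: with independent predicates, expected work is minimized by evaluating predicates in ascending order of rank = (selectivity − 1) / cost, so cheap, highly selective filters run first. This is the textbook heuristic, not necessarily the thesis's exact plan model, and the numbers below are invented.

```python
# Order conjunctive filter predicates by the classic rank metric:
# rank = (selectivity - 1) / cost. Ascending rank minimizes expected
# cost per tuple when predicates are independent.
predicates = [
    {"name": "price > 100",     "selectivity": 0.50, "cost": 1.0},
    {"name": "regex(desc)",     "selectivity": 0.10, "cost": 9.0},
    {"name": "category == 'x'", "selectivity": 0.05, "cost": 1.0},
]

def rank(p):
    return (p["selectivity"] - 1.0) / p["cost"]

for p in sorted(predicates, key=rank):
    print(f"{p['name']:16s} rank={rank(p):+.3f}")
# category == 'x' runs first (rank -0.950): cheap and highly selective;
# the expensive regex runs last (rank -0.100), on the fewest tuples.
```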
Computational Analysis of T Cell Receptor Repertoire and Structure
The human adaptive immune system has evolved to provide a sophisticated response to a vast body of pathogenic microbes and toxic substances. The primary mediators of this response are T and B lymphocytes. Antigenic peptides presented at the surface of infected cells by major histocompatibility complex (MHC) molecules are recognised by T cell receptors (TCRs) with exceptional specificity. This specificity arises from the enormous diversity in TCR sequence and structure generated through an imprecise process of somatic gene recombination that takes place during T cell development. Quantification of the TCR repertoire through the analysis of data produced by high-throughput RNA sequencing allows for a characterisation of the immune response to disease over time and between patients, and the development of methods for diagnosis and therapeutic design. The latest version of the software package Decombinator extracts and quantifies the TCR repertoire with improved accuracy and compatibility with complementary experimental protocols and external computational tools. The software has been extended for analysis of fragmented short-read data from single cells, comparing favourably with two alternative tools.
The development of cell-based therapeutics and vaccines is incomplete without an understanding of molecular-level interactions. The breadth of TCR diversity and cross-reactivity presents a barrier to comprehensive structural resolution of the repertoire by traditional means. Computational modelling of TCR structures and TCR-pMHC complexes provides an efficient alternative. Four general-purpose protein-protein docking platforms were compared in their ability to accurately model TCR-pMHC complexes. Each platform was evaluated against an expanded benchmark of docking test cases and in the context of varying additional information about the binding interface.
Continual innovation in structural modelling techniques sets the stage for novel automated tools for TCR design. A prototype platform has been developed, integrating structural modelling and an optimisation routine, to engineer desirable features into TCR and TCR-pMHC complex models.
Algorithms for string matching with applications in molecular biology
As the volume of genetic sequence data increases due to improved sequencing techniques and increased interest, the computational tools available to analyze the data are becoming inadequate. This thesis seeks to improve several of the computational methods available to access and analyze data in the genetic sequence databases. The first two results are parallel algorithms based on previously known sequential algorithms. The third result is a new approach, based on assumptions that we believe make sense in the biological context of the problem, to approximating an NP-complete problem. The final result is a fundamentally new approach to approximate string matching using the divide-and-conquer paradigm instead of the dynamic programming approach that has been used almost exclusively in the past.
Dynamic programming algorithms to measure the distance between sequences have been known since at least 1972. Recently there has been interest in developing parallel algorithms to measure the distance between two sequences. We have developed an optimal parallel algorithm to find the edit distance, a metric frequently used to measure distance, between two sequences.
It is often interesting to find the substrings of length k that appear most frequently in a given string. We give a simple sequential algorithm to solve this problem and an efficient parallel version of the algorithm. The parallel algorithm uses an efficient novel parallel bucket sort.
When sequencing a large segment of DNA, the original sequence is reconstructed from the results of sequencing fragments, which may or may not contain errors, of many copies of the original DNA. New algorithms are given to solve the problem of reconstructing the original DNA sequence with and without errors introduced into the fragments. A program based on this algorithm is used to reconstruct the human beta globin region (HUMHBB) when given a set of 300- to 500-mers drawn randomly from the HUMHBB region.
Approximate string matching is used in a biological context to model the steps of evolution. While such evolution may proceed base by base using the change, insert, or delete operators, there is also evidence that whole genes may be moved or inverted. We introduce a new problem, the string-to-string rearrangement problem, which allows movement and inversion of substrings. We give a divide-and-conquer algorithm for finding a rearrangement of one string within another.
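For reference, the edit distance targeted by the parallel algorithm is the standard dynamic program D[i][j] = min(D[i-1][j] + 1, D[i][j-1] + 1, D[i-1][j-1] + [a_i ≠ b_j]). A minimal sequential Python version with a rolling row:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (O(len(b)) memory)."""
    d = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        prev_diag, d[0] = d[0], i        # d[0]: deleting the first i chars of a
        for j, cb in enumerate(b, 1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,                # delete ca
                d[j - 1] + 1,            # insert cb
                prev_diag + (ca != cb),  # substitute, free on a match
            )
    return d[-1]

print(edit_distance("ACGT", "AGT"))  # 1: a single deletion of C
```

A standard source of parallelism in this table, independent of the thesis's specific construction, is that cells on the same anti-diagonal have no mutual dependencies and can be computed simultaneously.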
T-cell receptor repertoire sequencing in health and disease
The adaptive immune systems of jawed vertebrates are based upon lymphocytes bearing a huge variety of antigen receptors. Produced by somatic DNA recombination, these receptors are clonally expressed on T- and B-lymphocytes, where they are used to help detect and control infections and to help maintain regular bodily function. Full understanding of various aspects of the immune system relies upon accurate measurement of the individual receptors that make up these repertoires. In order to obtain such data, protocols were developed to permit unbiased amplification, high-throughput deep sequencing, and error-correcting bioinformatic analysis of T-cell receptor sequences. These techniques have been applied to peripheral blood samples to further characterise aspects of the TCR repertoire of healthy individuals, such as V(D)J TCR gene usage and pairing distributions. A large number of sequences are also found to be shared across multiple individuals, including sequences matching receptors belonging to known and proposed T-cell subsets making use of invariant rearrangements. The resolution provided also permitted detection of low-frequency recombination events that use unexpected gene segments or contain alternative splicing events. Deep sequencing was further used to study the effect of HIV infection, and subsequent antiretroviral therapy, upon the TCR repertoire. HIV-patient repertoires are typified by marked clonal inequality and perturbed population structures relative to healthy controls. The data presented support a model in which HIV infection drives expansion of a subset of CD8+ clones, which, in combination with the virally mediated loss of CD4+ cells, is responsible for driving repertoires towards an idiosyncratic population with low diversity. Moreover, these altered repertoire features do not significantly recover after three months of therapy. Deep sequencing therefore presents opportunities to investigate the properties of TCR repertoires both in health and disease, which could be useful when analysing a wide variety of immune phenomena.
Computational approaches to the analysis of the T cell receptor repertoire
The T cell receptor (TCR) repertoire has the potential to be a highly personalised biomarker of historic or current immune challenges, and may hold clinically relevant information. This thesis reviews aspects of the measurement and analysis of the TCR repertoire, including approaches to obtaining high-throughput sequencing data and using these data to investigate features of the repertoire in health and disease. The thesis then considers three topics related to computational and experimental analysis of the TCR repertoire.
First, this thesis explores a technical challenge in obtaining accurate quantitative TCR repertoire sequence data, observing substantial heterogeneity in the PCR amplification step essential for most current high-throughput sequencing protocols. An important conclusion of this chapter is that single-molecule barcoding before amplification is essential to obtain robust quantification of clone abundances from sequence data.
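To illustrate the barcoding idea, here is a toy sketch (invented read tuples, not the thesis's protocol or data format): clone abundance is estimated by counting distinct molecular barcodes rather than raw reads, so the number of PCR copies a molecule happens to receive no longer distorts the estimate.

```python
from collections import defaultdict

# Hypothetical (clone_sequence, molecular_barcode) pairs after sequencing.
# PCR duplicates of one input molecule share a barcode.
reads = [
    ("CASSLGQETQYF", "AATGC"), ("CASSLGQETQYF", "AATGC"),  # amplified twice
    ("CASSLGQETQYF", "GCATT"),
    ("CASSIRSSYEQYF", "TTGCA"),
]

raw_reads = defaultdict(int)
molecules = defaultdict(set)
for clone, barcode in reads:
    raw_reads[clone] += 1
    molecules[clone].add(barcode)    # one barcode == one input molecule

for clone in raw_reads:
    print(clone, "reads:", raw_reads[clone], "molecules:", len(molecules[clone]))
# CASSLGQETQYF looks 3x as abundant by reads, but only 2x by molecules.
```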
The second chapter considers the challenge of producing an effective TCR repertoire which can provide broad coverage of potential pathogens while maintaining tolerance to self-peptides. A computational model is explored which incorporates a linear programming representation of peripheral tolerance, with dendritic cells acting as the central agents reshaping the T cell population. The model is shown to maintain a population with restricted responsiveness to self-peptides while retaining a diverse and cross-reactive repertoire.
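As a toy illustration of what a linear-programming view of this trade-off looks like (all numbers invented, scipy used for convenience; the thesis's agent-based model is richer than this), one can maximize pathogen coverage over clone frequencies subject to a cap on total self-reactivity:

```python
import numpy as np
from scipy.optimize import linprog

coverage   = np.array([0.9, 0.6, 0.3])  # pathogen coverage per clone (invented)
self_react = np.array([0.8, 0.2, 0.1])  # self-reactivity per clone (invented)
tau = 0.3                               # tolerated total self-reactivity

res = linprog(
    c=-coverage,                        # linprog minimizes, so negate to maximize
    A_ub=[self_react], b_ub=[tau],      # tolerance constraint
    A_eq=[np.ones(3)], b_eq=[1.0],      # clone frequencies sum to one
    bounds=[(0.0, 1.0)] * 3,
)
print(res.x)  # frequencies favouring the most useful clones tolerance allows
```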
In the final results chapter, TCR repertoire data from immunised mice are used to demonstrate that, within a simplified animal model of immune response, the antigen-responsive CDR3βs are almost completely private. However, exploration of the protein sequences of the antigen-associated CDR3βs suggests that there may be amino acid motifs defining the antigen response.
Overall, this thesis demonstrates the application of computational and modelling approaches to address questions regarding the TCR repertoire, facilitating interpretation of high-throughput sequencing data and providing insight into the maintenance of diversity in the peripheral T cell population.
Proceedings of the 26th International Symposium on Theoretical Aspects of Computer Science (STACS'09)
The Symposium on Theoretical Aspects of Computer Science (STACS) is held alternately in France and in Germany. The conference of February 26-28, 2009, held in Freiburg, is the 26th in this series. Previous meetings took place in Paris (1984), Saarbrücken (1985), Orsay (1986), Passau (1987), Bordeaux (1988), Paderborn (1989), Rouen (1990), Hamburg (1991), Cachan (1992), Würzburg (1993), Caen (1994), München (1995), Grenoble (1996), Lübeck (1997), Paris (1998), Trier (1999), Lille (2000), Dresden (2001), Antibes (2002), Berlin (2003), Montpellier (2004), Stuttgart (2005), Marseille (2006), Aachen (2007), and Bordeaux (2008). ...