3 research outputs found

    Captura de patrones en archivos de logs mediante el uso de expresiones regulares en GPUs

    Full text link
    The information contained in a system is normally stored in log files. Most of the time, these files store the information as plain text, with much of it unformatted. It is therefore necessary to extract parts of this information in order to understand what is going on in the system. Currently, such information can be extracted using programs that make use of extended regular expressions. Regular expressions allow searching for patterns, but they can also be used to extract data from the matched pattern. Most programs that implement regular expressions are based on finite automata, either non-deterministic (NFA) or deterministic (DFA). We aim to explore the use of finite automata to extract data from log files using a Graphics Processing Unit (GPU) to speed up the process. Moreover, we also explore data parallelism over the lines of the log file. Currently, the work done on GPUs with regular expressions is limited to matching tasks only, without any capture feature. We present a solution that addresses this lack of pattern capture in current implementations. Our development uses an implementation of a TNFA as its base and converts it to a TDFA before running the GPU task. We explore the new CUDA feature named unified memory, supported since CUDA 6, together with streams to achieve the best possible performance in our GPU implementation. Using real log files and regular expressions designed to extract specific data, our evaluation shows that our implementation can be up to 9 times faster than the sequential implementation.
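
    As a rough illustration of the line-level data parallelism described above, here is a minimal CPU analogue in Python: each log line is matched and captured independently, so the work distributes naturally over workers (the thesis maps lines to GPU threads instead). The pattern and sample lines are hypothetical, not taken from the evaluation.

    import re
    from multiprocessing import Pool

    # Hypothetical pattern: capture timestamp and severity from a syslog-like line.
    PATTERN = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\]')

    def capture(line):
        """Return the captured groups for one line, or None if it does not match."""
        m = PATTERN.match(line)
        return m.groups() if m else None

    def capture_all(lines):
        """Data parallelism over lines: one independent capture task per line."""
        with Pool() as pool:
            return pool.map(capture, lines)

    if __name__ == '__main__':
        sample = [
            '2024-01-31 12:00:01 [ERROR] disk full',
            '2024-01-31 12:00:02 [INFO] retrying',
            'malformed line',
        ]
        print(capture_all(sample))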

    Stream Processing using Grammars and Regular Expressions

    Full text link
    In this dissertation we study regular-expression-based parsing and the use of grammatical specifications for the synthesis of fast, streaming string-processing programs. In the first part we develop two linear-time algorithms for regular-expression-based parsing with Perl-style greedy disambiguation. The first algorithm operates in two passes in a semi-streaming fashion, using a constant amount of working memory and an auxiliary tape storage which is written in the first pass and consumed by the second. The second is a single-pass, optimally streaming algorithm which outputs as much of the parse tree as is semantically possible based on the input prefix read so far, and resorts to buffering as many symbols as is required to resolve the next choice. Optimality is obtained by performing a PSPACE-complete pre-analysis on the regular expression. In the second part we present Kleenex, a language for expressing high-performance streaming string-processing programs as regular grammars with embedded semantic actions, and its compilation to streaming string transducers with worst-case linear-time performance. Its underlying theory is based on transducer decomposition into oracle and action machines, and a finite-state specialization of the streaming parsing algorithm presented in the first part. In the second part we also develop a new linear-time streaming parsing algorithm for parsing expression grammars (PEG) which generalizes the regular grammars of Kleenex. The algorithm is based on a bottom-up tabulation algorithm reformulated using least fixed points and evaluated using an instance of the chaotic iteration scheme by Cousot and Cousot.
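
    To make the transducer idea above concrete, here is a minimal sketch, assuming a hand-written deterministic streaming transducer rather than Kleenex's compiled oracle/action machines: transitions carry output actions, input is consumed in one pass with constant working memory, and output is emitted as early as possible. The task (collapsing runs of spaces) is a hypothetical example, not taken from the dissertation.

    def collapse_spaces(stream):
        """Yield each output character as soon as it is determined (streaming)."""
        in_run = False              # the single bit of transducer state
        for ch in stream:
            if ch == ' ':
                if not in_run:      # first space of a run: emit one space
                    yield ' '
                in_run = True       # later spaces in the run: emit nothing
            else:
                in_run = False
                yield ch            # ordinary character: copy through

    print(''.join(collapse_spaces('a   b    c')))   # -> 'a b c'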

    Fast submatch extraction using OBDDs

    No full text
    Network-based intrusion detection systems (NIDS) commonly use pattern languages to identify packets of interest. Similarly, security information and event management (SIEM) systems rely on pattern languages for real-time analysis of security alerts and event logs. Both NIDS and SIEM systems use pattern languages extended from regular expressions. One such extension, the submatch construct, allows the extraction of substrings from a string matching a pattern. Existing solutions for submatch extraction are based on non-deterministic finite automata (NFAs) or recursive backtracking. NFA-based algorithms are time-inefficient. Recursive backtracking algorithms perform poorly on pathological inputs generated by algorithmic complexity attacks. We propose a new approach for submatch extraction that uses ordered binary decision diagrams (OBDDs) to represent and perform pattern matching. Our evaluation using patterns from the Snort HTTP rule set and a commercial SIEM system shows that our approach achieves its ideal performance when patterns are combined. In the best case, our approach is faster than RE2 and PCRE by one to two orders of magnitude.
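
    The following sketch illustrates the frontier-set simulation that the OBDD approach accelerates, using Python integers as bitsets in place of OBDDs: the set of active NFA states is one Boolean vector, and a single image computation advances all active states at once. The tiny NFA (matching a(b|c)) and its encoding are hypothetical; the paper represents the frontier and the transition relation as OBDDs and computes the image with Boolean operations.

    # delta[symbol][state] = bitmask of successor states
    delta = {
        'a': {0: 0b0110},   # state 0 -a-> states 1 and 2
        'b': {1: 0b1000},   # state 1 -b-> state 3
        'c': {2: 0b1000},   # state 2 -c-> state 3
    }
    ACCEPT = 0b1000          # state 3 accepts (a submatch endpoint)

    def step(frontier, symbol):
        """One image computation: advance every active state simultaneously."""
        nxt = 0
        for state, successors in delta.get(symbol, {}).items():
            if frontier & (1 << state):
                nxt |= successors
        return nxt

    def matches(word):
        frontier = 0b0001    # start in state 0
        for ch in word:
            frontier = step(frontier, ch)
        return bool(frontier & ACCEPT)

    print(matches('ab'), matches('ac'), matches('aa'))  # True True False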