3 research outputs found
Captura de patrones en archivos de logs mediante el uso de expresiones regulares en GPUs
The information contained in a system is normally stored in log files. Most of
the time, these files store the information as plain text with little or no formatting.
It is then necessary to extract parts of this information to be able to understand what is going
on in such a system. Currently, such information can be extracted using programs that make use
of extended regular expressions. The use of regular expressions allows the search of patterns
but it can also be used to extract data from the matched pattern. Most of the programs that
implement regular expressions are based on finite automata, either non-deterministic (NFA)
or deterministic (DFA). We aim to explore the use of finite automata to extract data from log
files using a Graphics Processing Unit (GPU) to speed up the process. Moreover, we
also explore data parallelism over the lines present in the log file. Currently, the work done on
GPUs with regular expressions is limited to matching tasks only, without any capture feature.
We present a solution that addresses this lack of pattern capture in current implementations. Our
development uses a TNFA implementation as its base and converts it to a TDFA before running
the GPU task. We explore the new CUDA feature named unified memory, supported since CUDA
6, together with streams to achieve the best possible performance in our GPU implementation.
Using real log files and regular expressions written to extract specific data, our evaluation shows
that our implementation can be up to 9 times faster than the sequential implementation.
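The per-line data parallelism this abstract exploits can be sketched in plain Python as a CPU stand-in for the CUDA kernels (the log format, the pattern, and the thread-pool sizing below are invented for illustration and are not the thesis implementation): every log line is independent, so a compiled regular expression with capture groups can be applied to each line in parallel.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical access-log pattern: capture the client IP and the HTTP
# status code from each line. Format and pattern are made-up examples.
LOG_RE = re.compile(r"^(\d+\.\d+\.\d+\.\d+) \S+ \S+ .* (\d{3}) \d+$")

def capture(line):
    """Match a single line and return its captured submatches, or None."""
    m = LOG_RE.match(line)
    return m.groups() if m else None

def capture_all(lines, workers=4):
    # Every line is matched independently of the others: the same
    # structure a GPU implementation exploits by mapping lines to threads.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(capture, lines))
```

For example, `capture_all(['1.2.3.4 - - "GET / HTTP/1.1" 200 123'])` yields `[('1.2.3.4', '200')]`; lines that do not match simply produce `None`, so no coordination between lines is ever needed.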
Stream Processing using Grammars and Regular Expressions
In this dissertation we study regular expression based parsing and the use of
grammatical specifications for the synthesis of fast, streaming
string-processing programs.
In the first part we develop two linear-time algorithms for regular
expression based parsing with Perl-style greedy disambiguation. The first
algorithm operates in two passes in a semi-streaming fashion, using a constant
amount of working memory and an auxiliary tape storage which is written in the
first pass and consumed by the second. The second algorithm is a single-pass,
optimally streaming algorithm which outputs as much of the parse tree as is
semantically possible based on the input prefix read so far, and resorts to
buffering only as many symbols as are required to resolve the next choice. Optimality
is obtained by performing a PSPACE-complete pre-analysis on the regular
expression.
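The Perl-style greedy disambiguation policy that both algorithms implement can be illustrated with Python's backtracking `re` engine, which follows the same leftmost-greedy semantics; this shows only the disambiguation policy, not the dissertation's linear-time algorithms.

```python
import re

# Under greedy disambiguation the first (a*) consumes everything and the
# second is left with the empty string, although many other splits of
# "aaaa" between the two groups would also give a match.
m = re.fullmatch(r"(a*)(a*)", "aaaa")
print(m.group(1), repr(m.group(2)))  # prints: aaaa ''

# Inside a repetition, each iteration greedily prefers the first
# alternative and backtracks only when the overall match would fail:
# here the final iteration must take "ab" so the whole input is consumed.
m2 = re.fullmatch(r"(?:a|ab)*", "aab")
print(m2 is not None)  # prints: True
```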
In the second part we present Kleenex, a language for expressing
high-performance streaming string processing programs as regular grammars with
embedded semantic actions, and its compilation to streaming string transducers
with worst-case linear-time performance. Its underlying theory is based on
transducer decomposition into oracle and action machines, and a finite-state
specialization of the streaming parsing algorithm presented in the first part.
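A streaming string transducer, the compilation target mentioned above, processes the input in one left-to-right pass while maintaining a fixed set of string registers with copyless updates. The toy machine below (an illustration of the machine model, not Kleenex output) moves all digits in front of all other characters, something no register-free one-pass transducer can do.

```python
# Toy streaming string transducer (SST): one pass, two string registers,
# copyless updates (each input symbol extends exactly one register).
class MoveDigitsFirst:
    def __init__(self):
        self.x = []  # register X: digits seen so far
        self.y = []  # register Y: all other characters

    def step(self, c):
        # Copyless register update chosen by the current input symbol.
        (self.x if c.isdigit() else self.y).append(c)

    def finish(self):
        # Output function, applied once at end of input: F = X . Y
        return "".join(self.x) + "".join(self.y)

def run(s):
    t = MoveDigitsFirst()
    for c in s:
        t.step(c)
    return t.finish()
```

Running `run("a1b2c3")` gives `"123abc"`: the output is assembled in constant per-symbol work, with the buffering confined to the registers, which is what yields worst-case linear-time performance.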
In the second part we also develop a new linear-time streaming parsing
algorithm for parsing expression grammars (PEG) which generalizes the regular
grammars of Kleenex. The algorithm is based on a bottom-up tabulation algorithm
reformulated using least fixed points and evaluated using an instance of the
chaotic iteration scheme by Cousot and Cousot.
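The tabulation-as-least-fixed-point idea can be sketched for a tiny hand-written PEG, S <- '(' S ')' S / eps, for balanced parentheses (an invented example; the dissertation's algorithm is streaming and linear-time, whereas this is the naive iteration): a table of match lengths starts at the bottom element (fail everywhere) and the rule is re-applied until nothing changes.

```python
FAIL = -1  # bottom element of the table lattice

def parse_lengths(s):
    """T[i] = number of characters S consumes starting at position i."""
    n = len(s)
    T = [FAIL] * (n + 1)
    changed = True
    while changed:  # naive chaotic iteration up to the least fixed point
        changed = False
        for i in range(n + 1):
            # first alternative of the ordered choice: '(' S ')' S
            v = FAIL
            if i < n and s[i] == "(":
                m1 = T[i + 1]
                if m1 != FAIL:
                    j = i + 1 + m1
                    if j < n and s[j] == ")":
                        m2 = T[j + 1]
                        if m2 != FAIL:
                            v = (j + 1 + m2) - i
            # second alternative, tried only if the first fails: eps
            if v == FAIL:
                v = 0
            if v != T[i]:
                T[i] = v
                changed = True
    return T
```

On `"()()"` the table converges to `T[0] == 4` (the whole input), while on the unbalanced `"(()"` the ordered choice falls back to the empty alternative and `T[0] == 0`, matching PEG semantics.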
Fast submatch extraction using OBDDs
Network-based intrusion detection systems (NIDS) commonly use pattern languages to identify packets of interest. Similarly, security information and event management (SIEM) systems rely on pattern languages for real-time analysis of security alerts and event logs. Both NIDS and SIEM systems use pattern languages extended from regular expressions. One such extension, the submatch construct, allows the extraction of substrings from a string matching a pattern. Existing solutions for submatch extraction are based on non-deterministic finite automata (NFAs) or recursive backtracking. NFA-based algorithms are time-inefficient. Recursive backtracking algorithms perform poorly on pathological inputs generated by algorithmic complexity attacks. We propose a new approach for submatch extraction that uses ordered binary decision diagrams (OBDDs) to represent and perform pattern matching. Our evaluation using patterns from the Snort HTTP rule set and a commercial SIEM system shows that our approach achieves its ideal performance when patterns are combined. In the best case, our approach is faster than RE2 and PCRE by one to two orders of magnitude.
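The submatch construct and the benefit of combining rules can be illustrated with ordinary named capture groups; the sketch below uses Python's `re` engine rather than OBDDs, and the rule patterns are invented stand-ins, not real Snort signatures. Merging many rules into a single alternation, which is where the paper's combined-pattern performance applies, also identifies which rule fired.

```python
import re

# Invented stand-ins for Snort/SIEM-style rules, each declaring a named
# submatch to extract from a matching event.
RULES = {
    "http_uri":   r"GET (?P<uri>\S+) HTTP",
    "login_fail": r"failed login for (?P<user>\w+)",
}

# Combine all rules into one alternation; the named group that matched
# identifies the firing rule, and its value is the extracted submatch.
COMBINED = re.compile("|".join(f"(?:{p})" for p in RULES.values()))

def extract(event):
    """Return {group_name: submatch} for the rule that fired, or None."""
    m = COMBINED.search(event)
    if not m:
        return None
    return {k: v for k, v in m.groupdict().items() if v is not None}
```

For example, `extract("GET /index.html HTTP/1.1")` yields `{'uri': '/index.html'}` and `extract("failed login for root")` yields `{'user': 'root'}`; events matching no rule return `None`.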