571 research outputs found

    Normalization of Unstructured Log Data into Streams of Structured Event Objects

    Get PDF
    Monitoring plays a crucial role in the operation of any sizeable distributed IT infrastructure. Whether it is a university network or cloud datacenter, monitoring information is continuously used in a wide spectrum of ways ranging from mission-critical jobs, e.g. accounting or incident handling, to equally important development-related tasks, e.g. debugging or fault-detection. Whilst pursuing a novel vision of new-generation event-driven monitoring systems, we have identified that a particularly rich source of monitoring information, computer logs, is also one of the most problematic in terms of automated processing. Log data are predominantly generated in an ad-hoc manner using a variety of incompatible formats with the most important pieces of information, i.e. log messages, in the form of unstructured strings. This clashes with our long-term goal of designing a system enabling its users to transparently define real-time continuous queries over homogeneous streams of properly defined monitoring event objects with explicitly described structure. Our goal is to bridge this gap by normalizing the poorly structured log data into streams of structured event objects. The combined challenge of this goal is structuring the log data, whilst considering the high velocity with which they are generated in modern IT infrastructures. This paper summarizes the contributions of a dissertation thesis "Normalization of Unstructured Log Data into Streams of Structured Event Objects" dealing with the matter at hand in detail

    Exploiting Input Sanitization for Regex Denial of Service

    Get PDF
    Web services use server-side input sanitization to guard against harmful input. Some web services publish their sanitization logic to make their client interface more usable, e.g., allowing clients to debug invalid requests locally. However, this usability practice poses a security risk. Specifically, services may share the regexes they use to sanitize input strings — and regex-based denial of service (ReDoS) is an emerging threat. Although prominent service outages caused by ReDoS have spurred interest in this topic, we know little about the degree to which live web services are vulnerable to ReDoS. In this paper, we conduct the first black-box study measuring the extent of ReDoS vulnerabilities in live web services. We apply the Consistent Sanitization Assumption: that client-side sanitization logic, including regexes, is consistent with the sanitization logic on the server-side. We identify a service’s regex-based input sanitization in its HTML forms or its API, find vulnerable regexes among these regexes, craft ReDoS probes, and pinpoint vulnerabilities. We analyzed the HTML forms of 1,000 services and the APIs of 475 services. Of these, 355 services publish regexes; 17 services publish unsafe regexes; and 6 services are vulnerable to ReDoS through their APIs (6 domains; 15 subdomains). Both Microsoft and Amazon Web Services patched their web services as a result of our disclosure. Since these vulnerabilities were from API specifications, not HTML forms, we proposed a ReDoS defense for a popular API validation library, and our patch has been merged. To summarize: in client-visible sanitization logic, some web services advertise ReDoS vulnerabilities in plain sight. Our results motivate short-term patches and long-term fundamental solutions

    Facilitating High Performance Code Parallelization

    Get PDF
    With the surge of social media on one hand and the ease of obtaining information due to cheap sensing devices and open source APIs on the other hand, the amount of data that can be processed is as well vastly increasing. In addition, the world of computing has recently been witnessing a growing shift towards massively parallel distributed systems due to the increasing importance of transforming data into knowledge in today’s data-driven world. At the core of data analysis for all sorts of applications lies pattern matching. Therefore, parallelizing pattern matching algorithms should be made efficient in order to cater to this ever-increasing abundance of data. We propose a method that automatically detects a user’s single threaded function call to search for a pattern using Java’s standard regular expression library, and replaces it with our own data parallel implementation using Java bytecode injection. Our approach facilitates parallel processing on different platforms consisting of shared memory systems (using multithreading and NVIDIA GPUs) and distributed systems (using MPI and Hadoop). The major contributions of our implementation consist of reducing the execution time while at the same time being transparent to the user. In addition to that, and in the same spirit of facilitating high performance code parallelization, we present a tool that automatically generates Spark Java code from minimal user-supplied inputs. Spark has emerged as the tool of choice for efficient big data analysis. However, users still have to learn the complicated Spark API in order to write even a simple application. Our tool is easy to use, interactive and offers Spark’s native Java API performance. To the best of our knowledge and until the time of this writing, such a tool has not been yet implemented

    An Introduction to Programming for Bioscientists: A Python-based Primer

    Full text link
    Computing has revolutionized the biological sciences over the past several decades, such that virtually all contemporary research in the biosciences utilizes computer programs. The computational advances have come on many fronts, spurred by fundamental developments in hardware, software, and algorithms. These advances have influenced, and even engendered, a phenomenal array of bioscience fields, including molecular evolution and bioinformatics; genome-, proteome-, transcriptome- and metabolome-wide experimental studies; structural genomics; and atomistic simulations of cellular-scale molecular assemblies as large as ribosomes and intact viruses. In short, much of post-genomic biology is increasingly becoming a form of computational biology. The ability to design and write computer programs is among the most indispensable skills that a modern researcher can cultivate. Python has become a popular programming language in the biosciences, largely because (i) its straightforward semantics and clean syntax make it a readily accessible first language; (ii) it is expressive and well-suited to object-oriented programming, as well as other modern paradigms; and (iii) the many available libraries and third-party toolkits extend the functionality of the core language into virtually every biological domain (sequence and structure analyses, phylogenomics, workflow management systems, etc.). This primer offers a basic introduction to coding, via Python, and it includes concrete examples and exercises to illustrate the language's usage and capabilities; the main text culminates with a final project in structural bioinformatics. A suite of Supplemental Chapters is also provided. Starting with basic concepts, such as that of a 'variable', the Chapters methodically advance the reader to the point of writing a graphical user interface to compute the Hamming distance between two DNA sequences.Comment: 65 pages total, including 45 pages text, 3 figures, 4 tables, numerous exercises, and 19 pages of Supporting Information; currently in press at PLOS Computational Biolog

    Scalable Algorithms for NFA Multi-Striding and NFA-Based Deep Packet Inspection on GPUs

    Get PDF
    Finite state automata (FSA) are used by many network processing applications to match complex sets of regular expressions in network packets. In order to make FSA-based matching possible even at the ever-increasing speed of modern networks, multi-striding has been introduced. This technique increases input parallelism by transforming the classical FSA that consumes input byte by byte into an equivalent one that consumes input in larger units. However, the algorithms used today for this transformation are so complex that they often result unfeasible for large and complex rule sets. This paper presents a set of new algorithms that extend the applicability of multi-striding to complex rule sets. These algorithms can transform non-deterministic finite automata (NFA) into their multi-stride form with reduced memory and time requirements. Moreover, they exploit the massive parallelism of graphical processing units for NFA-based matching. The final result is a boost of the overall processing speed on typical regex-based packet processing applications, with a speedup of almost one order of magnitude compared to the current state-of-the-art algorithms

    A Parallel Computational Approach for String Matching- A Novel Structure with Omega Model

    Get PDF
    In r e cent day2019;s parallel string matching problem catch the attention of so many researchers because of the importance in different applications like IRS, Genome sequence, data cleaning etc.,. While it is very easily stated and many of the simple algorithms perform very well in practice, numerous works have been published on the subject and research is still very active. In this paper we propose a omega parallel computing model for parallel string matching. The algorithm is designed to work on omega model pa rallel architecture where text is divided for parallel processing and special searching at division point is required for consistent and complete searching. This algorithm reduces the number of comparisons and parallelization improves the time efficiency. Experimental results show that, on a multi - processor system, the omega model implementation of the proposed parallel string matching algorithm can reduce string matching time

    Mastering Regular Expressions

    Get PDF
    • …
    corecore