
    XML Document Parsing: Operational and Performance Characteristics


    Parsing XML Using Parallel Traversal of Streaming Trees

    XML has been widely adopted across a wide spectrum of applications. Its parsing efficiency, however, remains a concern and can be a bottleneck. With the current trend towards multicore CPUs, parallelization to improve performance is increasingly relevant. In many applications, the XML is streamed from the network, and thus the complete XML document is never in memory at any single moment in time. Parallel parsing of such a stream can be equated to parallel depth-first traversal of a streaming tree. Existing research on parallel tree traversal has assumed the entire tree was available in-memory, and thus cannot be directly applied. In this paper we investigate parallel, SAX-style parsing of XML via a parallel, depth-first traversal of the streaming document. We show good scalability up to about 6 cores on a Linux platform.
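
    As a point of reference, the sketch below shows the conventional single-threaded baseline that the paper sets out to parallelize: SAX-style parsing of an XML document that arrives in chunks from a stream, so the full document is never held in memory at once. It uses Python's xml.sax incremental parser; the chunked input and the TagCounter handler are illustrative only, and the paper's parallel depth-first traversal itself is not shown here.

import xml.sax

class TagCounter(xml.sax.ContentHandler):
    """Counts start-element events as a stand-in for real per-event processing."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1

parser = xml.sax.make_parser()          # incremental (feed-based) SAX parser
handler = TagCounter()
parser.setContentHandler(handler)

# Simulate a network stream: the document arrives in chunks and is never
# held in memory as a whole.
chunks = [b"<root><item id='1'>", b"hello</item>", b"<item id='2'/></root>"]
for chunk in chunks:
    parser.feed(chunk)
parser.close()
print(handler.count)                    # 3 start-element events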

    QUALITY IMPROVEMENT AND VALIDATION TECHNIQUES ON SOFTWARE SPECIFICATION AND DESIGN

    Ph.D. (Doctor of Philosophy)

    Improving Programming Support for Hardware Accelerators Through Automata Processing Abstractions

    The adoption of hardware accelerators, such as Field-Programmable Gate Arrays, into general-purpose computation pipelines continues to rise, driven by recent trends in data collection and analysis as well as pressure from challenging physical design constraints in hardware. The architectural designs of many of these accelerators stand in stark contrast to the traditional von Neumann model of CPUs. Consequently, existing programming languages, maintenance tools, and techniques are not directly applicable to these devices, meaning that additional architectural knowledge is required for effective programming and configuration. Current programming models and techniques are akin to assembly-level programming on a CPU, thus placing a significant burden on developers tasked with using these architectures. Because programming is currently performed at such low levels of abstraction, the software development process is tedious and challenging and hinders the adoption of hardware accelerators. This dissertation explores the thesis that theoretical finite automata provide a suitable abstraction for bridging the gap between high-level programming models and maintenance tools familiar to developers and the low-level hardware representations that enable high-performance execution on hardware accelerators. We adopt a principled hardware/software co-design methodology to develop a programming model providing the key properties that we observe are necessary for success, namely performance and scalability, ease of use, expressive power, and legacy support. First, we develop a framework that allows developers to port existing, legacy code to run on hardware accelerators by leveraging automata learning algorithms in a novel composition with software verification, string solvers, and high-performance automata architectures. Next, we design a domain-specific programming language to aid programmers writing pattern-searching algorithms and develop compilation algorithms to produce finite automata, which support efficient execution on a wide variety of processing architectures. Then, we develop an interactive debugger for our new language, which allows developers to accurately identify the locations of bugs in software while maintaining support for high-throughput data processing. Finally, we develop two new automata-derived accelerator architectures to support additional applications, including the detection of security attacks and the parsing of recursive and tree-structured data. Using empirical studies, logical reasoning, and statistical analyses, we demonstrate that our prototype artifacts scale to real-world applications, maintain manageable overheads, and support developers' use of hardware accelerators. Collectively, the research efforts detailed in this dissertation help ease the adoption and use of hardware accelerators for data analysis applications, while supporting high-performance computation. Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/155224/1/angstadt_1.pd
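
    To make the automata abstraction concrete, here is a minimal, hypothetical Python sketch of the kind of nondeterministic finite automaton (NFA) simulation that pattern-searching accelerators execute in hardware. The transition-table layout and state names are invented for illustration and are not taken from the dissertation's tools.

def run_nfa(transitions, start, accepting, text):
    """Report every position in `text` at which an accepting state is reached."""
    current = {start}
    matches = []
    for i, symbol in enumerate(text):
        nxt = set()
        for state in current:
            nxt |= transitions.get((state, symbol), set())
        nxt.add(start)                  # keep looking for new match starts
        current = nxt
        if current & accepting:
            matches.append(i)
    return matches

# Automaton for the pattern "ab": q0 --a--> q1 --b--> q2 (accepting).
transitions = {("q0", "a"): {"q1"}, ("q1", "b"): {"q2"}}
print(run_nfa(transitions, "q0", {"q2"}, "xabyab"))   # [2, 5]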

    Well-Formed and Scalable Invasive Software Composition

    Software components provide essential means to structure and organize software effectively. However, frequently, required component abstractions are not available in a programming language or system, or are not adequately combinable with each other. Invasive software composition (ISC) is a general approach to software composition that unifies component-like abstractions such as templates, aspects and macros. ISC is based on fragment composition, and composes programs and other software artifacts at the level of syntax trees. To this end, a unifying fragment component model is related to the context-free grammar of a language to identify extension and variation points in syntax trees as well as valid component types. By doing so, fragment components can be composed by transformations at the respective extension and variation points, so that composition always yields results that are valid with respect to the underlying context-free grammar. However, given a language’s context-free grammar, the composition result may still be incorrect. Context-sensitive constraints such as type constraints may be violated, so that the program cannot be compiled and/or interpreted correctly. While a compiler can detect such errors after composition, it is difficult to relate them back to the original transformation step in the composition system, especially in the case of complex compositions with several hundreds of such steps. To tackle this problem, this thesis proposes well-formed ISC, an extension to ISC that uses reference attribute grammars (RAGs) to specify fragment component models and fragment contracts to guard compositions with context-sensitive constraints. Additionally, well-formed ISC provides composition strategies as a means to configure composition algorithms and handle interferences between composition steps. Developing ISC systems for complex languages such as programming languages is a complex undertaking. Composition-system developers need to supply or develop adequate language and parser specifications that can be processed by an ISC composition engine. Moreover, the specifications may need to be extended with rules for the intended composition abstractions. Current approaches to ISC require complete grammars to be able to compose fragments in the respective languages. Hence, the specifications need to be developed exhaustively before any component model can be supplied. To tackle this problem, this thesis introduces scalable ISC, a variant of ISC that uses island component models as a means to define component models for partially specified languages while still supporting the whole language. Additionally, a scalable workflow for agile composition-system development is proposed which supports the development of ISC systems in small increments using modular extensions. All theoretical concepts introduced in this thesis are implemented in the Skeletons and Application Templates framework SkAT. It supports “classic”, well-formed and scalable ISC by leveraging RAGs as its main specification and implementation language. Moreover, several composition systems based on SkAT are discussed, e.g., a well-formed composition system for Java and a C preprocessor-like macro language. In turn, those composition systems are used as composers in several example applications, such as a library of parallel algorithmic skeletons.
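
    The toy Python sketch below illustrates the core idea of fragment composition on syntax trees: a fragment is spliced into a named variation point, and a simple contract check guards the result. The Node, bind, and well_formed names are hypothetical and do not reflect the SkAT API.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    kind: str
    children: List["Node"] = field(default_factory=list)
    slot: Optional[str] = None        # name of a variation/extension point, if any

def bind(tree: Node, slot_name: str, fragment: Node) -> Node:
    """Splice `fragment` into every child slot named `slot_name` (in place)."""
    for i, child in enumerate(tree.children):
        if child.slot == slot_name:
            tree.children[i] = fragment
        else:
            bind(child, slot_name, fragment)
    return tree

def well_formed(tree: Node) -> bool:
    """Contract check: no unbound slots may remain after composition."""
    return tree.slot is None and all(well_formed(c) for c in tree.children)

template = Node("class", [Node("method", [Node("hole", slot="body")])])
fragment = Node("block", [Node("stmt"), Node("stmt")])
composed = bind(template, "body", fragment)
print(well_formed(composed))          # True: every slot has been filled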

    Development of a multi-layered botmaster based analysis framework

    Botnets are networks of compromised machines called bots that come together to form the tool of choice for hackers in the exploitation and destruction of computer networks. Most malicious botnets have the ability to be rented out to a broad range of potential customers, with each customer having an attack agenda different from the others. The result is a botnet that is under the control of multiple botmasters, each of whom implements their own attacks and transactions at different times in the botnet. In order to fight botnets, details about their structure, their users, and their users' motives need to be discovered. Since current botnets require information about the initial bootstrapping of a bot into the botnet, monitoring of botnets is possible. Botnet monitoring is used to discover the details of a botnet, but current botnet monitoring projects mainly identify the magnitude of the botnet problem and tend to overlook some fundamental problems, such as the diversified sources of the attacks. To understand the use of botnets in more detail, the botmasters that command the botnets need to be studied. In this thesis we focus on identifying the threat of botnets based on each individual botmaster. We present a multi-layered analysis framework which identifies the transactions of each botmaster and then correlates those transactions with the physical evolution of the botnet. With these characteristics we discover what role each botmaster plays in the overall botnet operation. We demonstrate our results in our system, MasterBlaster, which discovers the level of interaction between each botmaster and the botnet. Our system has been evaluated on real network traces. Our results show that investigating the role of each botmaster in a botnet is essential, and demonstrate the potential benefit of further research on analyzing botmaster interactions. We believe our work will pave the way for more fine-grained analysis of botnets, which will lead to better protection capabilities and more rapid attribution of cyber crimes committed using botnets.
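
    As a rough, hypothetical illustration of per-botmaster correlation, the Python sketch below groups observed command events by the issuing botmaster and aligns each with the botnet's population in the same time window; the data layout and function names are invented and are not MasterBlaster's actual interfaces.

from collections import defaultdict

def correlate(commands, population):
    """commands: (timestamp_s, botmaster, command) tuples;
    population: {hour_index: active_bot_count}."""
    by_master = defaultdict(list)
    for ts, master, cmd in commands:
        hour = ts // 3600
        by_master[master].append((cmd, population.get(hour, 0)))
    return dict(by_master)

commands = [(3600, "bm1", "ddos start"), (7200, "bm2", "spam run"), (7300, "bm1", "update")]
population = {1: 1200, 2: 950}        # bots observed alive in each hour
print(correlate(commands, population))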

    Toward Automated Network Management and Operations.

    Network management plays a fundamental role in the operation and well-being of today's networks. Despite the best effort of existing support systems and tools, management operations in large service provider and enterprise networks remain mostly manual. Due to the larger scale of modern networks, more complex network functionalities, and higher network dynamics, human operators are increasingly short-handed. As a result, network misconfigurations are frequent, and can result in violated service-level agreements and degraded user experience. In this dissertation, we develop various tools and systems to understand, automate, augment, and evaluate network management operations. Our thesis is that by introducing formal abstractions, like deterministic finite automata, Petri-Nets and databases, we can build new support systems that systematically capture domain knowledge, automate network management operations, enforce network-wide properties to prevent misconfigurations, and simultaneously reduce manual effort. The theme for our systems is to build a knowledge plane based on the proposed abstractions, allowing network-wide reasoning and guidance for network operations. More importantly, the proposed systems require no modification to the existing Internet infrastructure and network devices, simplifying adoption. We show that our systems improve both timeliness and correctness in performing realistic and large-scale network operations. Finally, to address the current limitations and difficulty of evaluating novel network management systems, we have designed a distributed network testing platform that relies on network and device virtualization to provide realistic environments and isolation from production networks. Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/78837/1/chenxu_1.pd
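
    As a small illustration of how a finite-automaton abstraction can encode operational domain knowledge, the hypothetical Python sketch below models a maintenance procedure as a DFA and checks whether a proposed operation sequence respects the required ordering. The states and operations are invented, not taken from the dissertation's systems.

# Transition table of the procedure DFA: (state, operation) -> next state.
PROCEDURE = {
    ("start", "drain_traffic"): "drained",
    ("drained", "upgrade_firmware"): "upgraded",
    ("upgraded", "restore_traffic"): "done",
}

def valid_sequence(ops, start="start", accept="done"):
    """Return True if `ops` is accepted by the procedure DFA."""
    state = start
    for op in ops:
        state = PROCEDURE.get((state, op))
        if state is None:             # operation not allowed in this state
            return False
    return state == accept

print(valid_sequence(["drain_traffic", "upgrade_firmware", "restore_traffic"]))  # True
print(valid_sequence(["upgrade_firmware", "drain_traffic"]))                     # False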

    Pattern capture in log files using regular expressions on GPUs

    The information contained in a system is normally stored in log files. Most of the time, these files store the information in plain text, with much of it unformatted. It is then necessary to extract parts of this information to be able to understand what is going on in such a system. Currently, such information can be extracted using programs that make use of extended regular expressions. The use of regular expressions allows the search for patterns, but it can also be used to extract data from the matched pattern. Most of the programs that implement regular expressions are based on finite automata, either non-deterministic (NFA) or deterministic (DFA). We aim to explore the use of finite automata to extract data from log files using a Graphics Processing Unit (GPU) to speed up the process. Moreover, we also explore data parallelism over the lines present in the log file. Currently, the work done on GPUs with regular expressions is limited to matching tasks only, without any capture feature. We present a solution that addresses this lack of pattern capture in current implementations. Our development uses a TNFA implementation as its base and converts it to a TDFA before running the GPU task. We explore the new CUDA feature named unified memory, supported since CUDA 6, together with streams, to achieve the best possible performance in our GPU implementation. Using real log files and regular expressions written to extract specific data, our evaluation shows that our implementation can be up to 9 times faster than the sequential implementation.
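
    The thesis converts a tagged NFA (TNFA) into a tagged DFA (TDFA) before the GPU stage. The Python sketch below shows plain subset construction (NFA to DFA) to illustrate that conversion step; the tag and capture-register bookkeeping that makes the automata "tagged", as well as the CUDA implementation, are omitted.

def nfa_to_dfa(nfa, start, alphabet):
    """nfa: dict mapping (state, symbol) -> set of successor states (no epsilon moves).
    Returns a DFA whose states are frozensets of NFA states."""
    start_set = frozenset({start})
    dfa, todo = {}, [start_set]
    while todo:
        current = todo.pop()
        if current in dfa:
            continue
        dfa[current] = {}
        for symbol in alphabet:
            target = frozenset(s for q in current for s in nfa.get((q, symbol), ()))
            dfa[current][symbol] = target
            if target not in dfa:
                todo.append(target)
    return dfa

# Toy NFA over {a, b}: q0 loops on both symbols, and "ab" additionally leads to q2.
nfa = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}, ("q1", "b"): {"q2"}}
dfa = nfa_to_dfa(nfa, "q0", "ab")
print(len(dfa))                       # 3 reachable DFA states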

    Resource Optimization of SOA Technologies in Embedded Networks

    Embedded networks are fundamental infrastructures of many different kinds of domains, such as home or industrial automation, the automotive industry, and future smart grids. Yet they can be very heterogeneous, containing wired and wireless nodes with different kinds of resources and service capabilities, such as sensing, acting, and processing. Driven by new opportunities and business models, embedded networks will play an ever more important role in the future, interconnecting more and more devices, even from other network domains. Realizing applications for such types of networks, however, is a highly challenging task, since various aspects have to be considered, including communication between a diverse assortment of resource-constrained nodes, such as microcontrollers, as well as a flexible node infrastructure. Service Oriented Architecture (SOA) with Web services would perfectly meet these unique characteristics of embedded networks and ease the development of applications. Standardized Web services, however, are based on plain-text XML, which is not suitable for microcontroller-based devices with their very limited resources, due to XML's verbosity, its memory and bandwidth usage, as well as its associated significant processing overhead. This thesis presents methods and strategies for realizing efficient XML-based Web service communication in embedded networks by means of binary XML using the EXI format. We present a code generation approach to create optimized and dedicated service applications in resource-constrained embedded networks. In so doing, we demonstrate how EXI grammars can be optimally constructed and applied to the Web service and service requester context. In addition, to realize optimized service interaction in embedded networks, we design and develop an optimized filter-enabled service data dissemination that takes into account the individual resource capabilities of the nodes and the connection quality within embedded networks. We show different approaches for efficiently evaluating binary XML data and applying it to resource-constrained devices, such as microcontrollers. Furthermore, we present the effective placement of binary XML filters in embedded networks with the aim of reducing both the computational load of constrained nodes and the network traffic. Various evaluation results for V2G applications demonstrate the efficiency of our approach compared to existing solutions, as well as the seamless and successful applicability of SOA-based technologies in microcontroller-based environments.
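
    The sketch below is a toy, grammar-informed binary encoding in Python, not the actual EXI format: element names known from a shared grammar are replaced by one-byte codes, which illustrates why grammar-driven encodings shrink XML messages for constrained devices. The element names are hypothetical V2G-flavoured examples.

import struct

GRAMMAR = ["chargeRequest", "maxCurrent", "departureTime"]   # hypothetical element names
CODE = {name: i for i, name in enumerate(GRAMMAR)}

def encode(events):
    """events: list of (element_name, small_int_value) pairs."""
    out = bytearray()
    for name, value in events:
        out += struct.pack("!BH", CODE[name], value)   # 1-byte name code + 2-byte value
    return bytes(out)

def decode(data):
    return [(GRAMMAR[code], value) for code, value in struct.iter_unpack("!BH", data)]

msg = encode([("maxCurrent", 16), ("departureTime", 730)])
print(len(msg), decode(msg))          # 6 bytes instead of dozens of plain-text XML bytes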

    An information extraction model for recommending the most applied case

    The amount of information produced by different domains is constantly increasing. One domain that particularly produces large amounts of information is the legal domain, where information is mainly used for research purposes. However, too much time is spent by legal researchers on searching for useful information. Information is found by using special search engines or by consulting hard copies of legal literature. The main research question that this study addressed is “What techniques can be incorporated into a model that recommends the most applied case for a field of law?”. The Design Science Research (DSR) methodology was used to address the research objectives. The model developed is the theoretical contribution produced from following the DSR methodology. A case study organisation, LexisNexis, was used to help investigate the real-world problem. The initial investigation into the real-world problem revealed that too much time is spent on searching for the Most Applied Case (MAC) and that no formal or automated processes were used. An analysis of an informal process followed by legal researchers enabled the identification of different concepts that could be combined to create a prescriptive model to recommend the MAC. A critical analysis of the literature was conducted to obtain a better understanding of the legal domain and the techniques that can be applied to assist with problems faced in this domain, related to information retrieval and extraction. This resulted in the creation of an IE Model based only on theory. Questionnaires were sent to experts to obtain a further understanding of the legal domain, highlight problems faced, and identify which attributes of a legal case can be used to help recommend the MAC. During the Design and Development activity of the DSR methodology, a prescriptive MAC Model for recommending the MAC was created based on findings from the literature review and questionnaires. The MAC Model consists of processes concerning: Information retrieval (IR); Information extraction (IE); Information storage; and Query-independent ranking. Analysis of IR and IE helped to identify problems experienced when processing text. Furthermore, appropriate techniques and algorithms were identified that can process legal documents and extract specific facts. The extracted facts were then further processed to allow for storage and processing by query-independent ranking algorithms. The processes incorporated into the model were then used to create a proof-of-concept prototype called the IE Prototype. The IE Prototype implements two processes called the IE process and the Database process. The IE process analyses different sections of a legal case to extract specific facts. The Database process then ensures that the extracted facts are stored in a document database for future querying purposes. The IE Prototype was evaluated using the technical risk and efficacy strategy from the Framework for Evaluation of Design Science. Both formative and summative evaluations were conducted. Formative evaluations were conducted to identify functional issues of the prototype, whilst summative evaluations made use of real-world legal cases to test the prototype. Multiple experiments were conducted on legal cases, known as source cases, that resulted in facts from the source cases being extracted. For the purpose of the experiments, the term “source case” was used to distinguish between a legal case in its entirety and a legal case’s list of cases referred to. Two types of NoSQL databases were investigated for implementation, namely a graph database and a document database. Setting up the graph database required little time; however, development issues prevented the graph database from being successfully implemented in the proof-of-concept prototype. A document database was successfully implemented as an alternative for the proof-of-concept prototype. Analysis of the source cases used to evaluate the IE Prototype revealed that 96% of the source cases were categorised as being partially extracted. The results also revealed that the IE Prototype was capable of processing large numbers of source cases at a given time.
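
    As a hypothetical illustration of the IE process together with a crude query-independent ranking, the Python sketch below extracts case references with a regular expression and ranks them by how many source cases cite them. The citation pattern and example data are invented and do not reflect the prototype's actual extraction rules.

import re
from collections import Counter

CITATION = re.compile(r"[A-Z][\w.]+ v\. [A-Z][\w.]+ \d{4}")   # toy citation pattern

def extract_references(case_text):
    """IE step: pull case references out of the text of one source case."""
    return CITATION.findall(case_text)

def most_applied(source_cases):
    """Rank referenced cases by how many source cases cite them."""
    counts = Counter()
    for text in source_cases:
        counts.update(set(extract_references(text)))   # count each case once per source
    return counts.most_common(1)

cases = [
    "... as held in Smith v. Jones 2001, the court found ...",
    "Applying Smith v. Jones 2001 and Brown v. State 1999, it follows ...",
]
print(most_applied(cases))            # [('Smith v. Jones 2001', 2)]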