4 research outputs found

    Building A Big Data Analytical Pipeline With Hadoop For Processing Enterprise XML Data

    This paper presents an end-to-end approach to processing XML files in the Hadoop ecosystem. The work demonstrates a way to handle problems faced during the analysis of large amounts of XML files. The paper presents a complete Extract, Load and Transform (ELT) cycle based on the open source software stack Apache Hadoop, which has become a standard for processing huge amounts of data. This work shows that applying open source solutions to a particular set of problems may not be enough. In fact, most open source big data processing tools were implemented to address only a limited number of use cases. This work explains and shows why specific use cases may require significant extension with multiple self-developed software components. The use case described in the paper deals with huge amounts of semi-structured XML files, which are supposed to be persisted and processed daily.

    Parallelizing XML data-streaming workflows via MapReduce

    In prior work it has been shown that the design of scientific workflows can benefit from a collection-oriented modeling paradigm which views scientific workflows as pipelines of XML stream processors. In this paper, we present approaches for exploiting data parallelism in XML processing pipelines through novel compilation strategies to the MapReduce framework. Pipelines in our approach consist of sequences of processing steps that receive XML-structured data and produce, often through calls to “black-box” (scientific) functions, modified (i.e., updated) XML structures. Our main contributions are (i) the development of a set of strategies for compiling scientific workflows, modeled as XML processing pipelines, into parallel MapReduce networks, and (ii) a discussion of their advantages and trade-offs, based on a thorough experimental evaluation of the various translation strategies. Our evaluation uses the Hadoop MapReduce system as an implementation platform. Our results show that execution times of XML workflow pipelines can be significantly reduced using our compilation strategies. These efficiency gains, together with the benefits of MapReduce (e.g., fault tolerance), make our approach ideal for executing large-scale, compute-intensive XML-based scientific workflows.
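The general idea of running one XML pipeline step in a map/reduce style can be sketched as follows. This is a minimal single-process illustration, not the compilation strategies from the paper: the names `black_box`, `split`, `map_phase`, `reduce_phase`, and `run_step` are hypothetical, and the "black-box" function is a trivial stand-in. It assumes the step can process each child of the collection root independently.

```python
import xml.etree.ElementTree as ET


def black_box(text):
    # Hypothetical stand-in for a scientific function called by a pipeline step.
    return text.upper()


def split(xml_doc):
    """Split a collection document into (position, fragment) pairs, one per child."""
    root = ET.fromstring(xml_doc)
    fragments = [(i, ET.tostring(child, encoding="unicode"))
                 for i, child in enumerate(root)]
    return root.tag, fragments


def map_phase(fragments):
    """Map: process each fragment independently and emit (position, updated fragment)."""
    for position, xml_text in fragments:
        elem = ET.fromstring(xml_text)
        # The step updates the XML structure by annotating it with the result.
        elem.set("result", black_box(elem.text or ""))
        yield position, ET.tostring(elem, encoding="unicode")


def reduce_phase(pairs, root_tag):
    """Reduce: reassemble the updated fragments in their original order."""
    root = ET.Element(root_tag)
    for _, xml_text in sorted(pairs, key=lambda pair: pair[0]):
        root.append(ET.fromstring(xml_text))
    return ET.tostring(root, encoding="unicode")


def run_step(xml_doc):
    """Run one pipeline step as split -> map -> reduce."""
    root_tag, fragments = split(xml_doc)
    return reduce_phase(map_phase(fragments), root_tag)


if __name__ == "__main__":
    print(run_step("<samples><s>acgt</s><s>ttga</s></samples>"))
```

Because each fragment carries its position as the key, the map calls could in principle run on different workers and the reduce step would still restore document order; on a real Hadoop cluster the splitting, shuffling, and fault handling would be done by the framework rather than by these helper functions.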

    Data provisioning in simulation workflows

    Computer-based simulations are becoming more and more important, e.g., to imitate real-world experiments such as crash tests, which would otherwise be too expensive or not feasible at all. Simulation workflows may be used to control the interaction with simulation tools performing the necessary numerical calculations. The input data needed by these tools often come from diverse data sources that manage their data in a multiplicity of proprietary formats. Hence, simulation workflows additionally have to carry out many complex data provisioning tasks. These tasks filter and transform heterogeneous input data in such a way that the underlying simulation tools can properly ingest them. Furthermore, some simulations use different tools that need to exchange data with each other. Here, even more complex data transformations are needed to cope with the differences in data formats and data granularity expected by the involved tools. Nowadays, scientists conducting simulations typically have to design their simulation workflows on their own. So, they have to implement many low-level data transformations that realize the data provisioning for and the data exchange between simulation tools. In doing so, they spend time on workflow design that keeps them from concentrating on their core issue, i.e., the simulation itself. This thesis introduces several novel concepts and methods that significantly ease the design of the complex data provisioning in simulation workflows. Firstly, it addresses the issue that most existing workflow systems offer multiple and diverse data provisioning techniques. So, scientists are frequently overwhelmed when selecting techniques that are appropriate for their workflows. This thesis discusses how to conquer the multiplicity and diversity of available techniques by their systematic classification. 
The resulting classes of techniques are then compared with each other considering relevant functional and non-functional requirements for data provisioning in simulation workflows. The major outcome of this classification and comparison is a set of guidelines that assist scientists in choosing proper data provisioning techniques. Another problem with existing workflow systems is that they often do not support all kinds of data resources or data management operations required by concrete computer-based simulations. So, this thesis proposes extensions of conventional workflow languages that offer a generic solution to data provisioning in arbitrary simulation workflows. These extensions allow for specifying any data management operation that may be described via the query or command languages of the involved data resources, e.g., arbitrary SQL statements or shell commands. The proposed extensions of workflow languages still do not relieve scientists of the burden of specifying many complex data management operations using low-level query and command languages. Hence, this thesis introduces a novel pattern-based approach that further enhances the abstraction support for simulation workflow design. Instead of specifying many workflow tasks, scientists only need to select a small number of abstract patterns to describe the high-level simulation process they have in mind. Furthermore, scientists are familiar with the parameters to be specified for the patterns, because these parameters correspond to terms or concepts that are related to their domain-specific simulation methodology. A rule-based transformation approach offers flexible means to finally map high-level patterns onto executable simulation workflows. Another major contribution is a pattern hierarchy arranging different kinds of patterns according to clearly distinguished abstraction levels. 
This facilitates a holistic separation of concerns and provides a systematic framework to incorporate different kinds of people and their various skills into workflow design, e.g., not only scientists, but also data engineers. Altogether, the pattern-based approach conquers the data complexity associated with simulation workflows, which allows scientists to concentrate on their core issue again, namely the simulation itself. The last contribution is a complementary optimization method to increase the performance of local data processing in simulation workflows. This method introduces various techniques that partition relevant local data processing tasks between the components of a workflow system in a smart way. Such tasks are assigned either to the workflow execution engine or to a tightly integrated local database system. Corresponding experiments revealed that, even for a moderate data size of about 0.5 MB, this method is able to reduce workflow duration by nearly a factor of 9.