
    Big Data Management Using Scientific Workflows

    Humanity is rapidly approaching a new era in which every sphere of activity will be informed by ever-increasing amounts of data. Making use of big data has the potential to improve numerous avenues of human activity, including scientific research, healthcare, energy, education, transportation, environmental science, and urban planning, to name a few. Such progress, however, requires managing terabytes and even petabytes of data generated by billions of devices, products, and events, often in real time and in diverse protocols, formats, and types. The volume, velocity, and variety of big data, known as the "3 Vs," present formidable challenges that traditional data management approaches cannot meet. Many data analyses have traditionally been performed using scientific workflows, tools for formalizing and structuring complex computational processes. While scientific workflows have been used extensively to structure complex scientific data analysis processes, little work has been done to enable scientific workflows to cope with the three big data challenges on the one hand, and to leverage the dynamic resource provisioning capability of cloud computing to analyze big data on the other. In this dissertation, to facilitate efficient composition, verification, and execution of distributed large-scale scientific workflows, we first propose a formal approach to scientific workflow verification, including a workflow model and the notion of a well-typed workflow. Our approach translates a scientific workflow into an equivalent typed lambda expression and type-checks the workflow. We then propose a type-theoretic approach to the shimming problem in scientific workflows, which arises when connecting related but incompatible components. We reduce the shimming problem to a runtime coercion problem in the theory of type systems and propose a fully automated and transparent solution: our technique algorithmically inserts invisible shims into the workflow specification, thereby resolving the shimming problem for any well-typed workflow. Next, we identify a set of important challenges for running big data workflows in the cloud, and propose a generic, implementation-independent system architecture that addresses many of these challenges. Finally, we develop a cloud-enabled big data workflow management system, called DATAVIEW, that delivers a specific implementation of our proposed architecture. To further validate the architecture, we conduct a case study in which we design and run a big data workflow from the automotive domain in the Amazon EC2 cloud environment.
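
    The shimming approach lends itself to a compact illustration. Below is a minimal Python sketch, not the DATAVIEW implementation, of reducing shimming to type coercion: the names Component, SHIMS, and compose are hypothetical, and the coercion table stands in for whatever shim library a real system would provide. An edge between two components is accepted as-is when the types match, silently repaired when a registered coercion exists, and rejected as ill-typed otherwise.

```python
# Illustrative sketch only: shimming as automatic type coercion.
# Component, SHIMS, and compose are hypothetical names, not DATAVIEW APIs.
from typing import Any, Callable

class Component:
    def __init__(self, name: str, in_type: type, out_type: type,
                 fn: Callable[[Any], Any]):
        self.name, self.in_type, self.out_type, self.fn = name, in_type, out_type, fn

# Coercion table: (from_type, to_type) -> shim function.
SHIMS: dict[tuple[type, type], Callable[[Any], Any]] = {
    (int, float): float,
    (float, str): lambda x: f"{x:.6g}",
}

def compose(a: Component, b: Component) -> Component:
    """Connect a -> b, inserting an invisible shim when the types
    differ but a coercion is registered; reject ill-typed edges."""
    if a.out_type is b.in_type:
        glue = lambda x: b.fn(a.fn(x))
    elif (a.out_type, b.in_type) in SHIMS:
        shim = SHIMS[(a.out_type, b.in_type)]
        glue = lambda x: b.fn(shim(a.fn(x)))   # invisible shim insertion
    else:
        raise TypeError(f"ill-typed edge: {a.name}:{a.out_type.__name__} "
                        f"-> {b.name}:{b.in_type.__name__}")
    return Component(f"{a.name};{b.name}", a.in_type, b.out_type, glue)

# Usage: an int -> int component feeds a float -> float component;
# the int -> float shim is applied without appearing in the workflow spec.
double = Component("double", int, int, lambda x: 2 * x)
halve = Component("halve", float, float, lambda x: x / 2.0)
print(compose(double, halve).fn(3))   # 3.0
```

    Keeping coercions in a separate table is what makes the shims "invisible": the workflow specification never mentions them, yet any well-typed workflow composed this way executes with the necessary conversions in place.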

    Benchmarking Bottom-Up and Top-Down Strategies to SPARQL-to-SQL Query Translation

    Many researchers have proposed using conventional relational databases to store and query large Semantic Web datasets. The most complex component of this approach is SPARQL-to-SQL query translation. Existing algorithms perform this translation using either a bottom-up or a top-down strategy and produce semantically equivalent but syntactically different relational queries. Do relational query optimizers always produce identical query execution plans for semantically equivalent bottom-up and top-down queries? Which of the two strategies yields faster SQL queries? To address these questions, this work studies bottom-up and top-down translations of SPARQL queries with nested optional graph patterns. This work presents: (1) a basic graph pattern translation algorithm that yields flat SQL queries, (2) a bottom-up nested optional graph pattern translation algorithm, (3) a top-down nested optional graph pattern translation algorithm, and (4) a performance study featuring SPARQL queries with nested optional graph patterns over RDF databases created in Oracle, DB2, and PostgreSQL.
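
    To make the translation concrete, here is a minimal Python sketch of the widely used triple-table technique behind item (1): each triple pattern becomes an alias of a table triples(subj, pred, obj), constants become filters, and a variable shared between patterns becomes a join condition, yielding a single flat SQL query. The schema name "triples" and the function below are assumptions for illustration, not the paper's algorithms; nested OPTIONAL patterns, the focus of items (2) and (3), would additionally map to left outer joins.

```python
# Illustrative sketch: basic graph pattern (BGP) -> flat SQL over a
# single triple table triples(subj, pred, obj). Assumed schema, not
# the paper's bottom-up/top-down algorithms.

def bgp_to_sql(patterns: list[tuple[str, str, str]]) -> str:
    """Strings starting with '?' are variables; everything else is a constant."""
    select, where, seen = [], [], {}          # seen: variable -> first column ref
    for i, triple in enumerate(patterns):
        alias = f"t{i}"
        for col, term in zip(("subj", "pred", "obj"), triple):
            ref = f"{alias}.{col}"
            if term.startswith("?"):
                if term in seen:
                    where.append(f"{seen[term]} = {ref}")   # shared variable -> join
                else:
                    seen[term] = ref
                    select.append(f"{ref} AS {term[1:]}")
            else:
                where.append(f"{ref} = '{term}'")           # constant -> filter
    froms = ", ".join(f"triples t{i}" for i in range(len(patterns)))
    cond = " AND ".join(where) if where else "1=1"
    return f"SELECT {', '.join(select)} FROM {froms} WHERE {cond}"

# Usage: whom does ?x advise, and what is the advisee's name?
print(bgp_to_sql([("?x", "advises", "?y"), ("?y", "name", "?n")]))
# SELECT t0.subj AS x, t0.obj AS y, t1.obj AS n
# FROM triples t0, triples t1
# WHERE t0.pred = 'advises' AND t0.obj = t1.subj AND t1.pred = 'name'
```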