
    Scientific Workflows for Metabolic Flux Analysis

    Metabolic engineering is a highly interdisciplinary research domain at the interface of biology, mathematics, computer science, and engineering. Metabolic flux analysis with carbon tracer experiments (13C-MFA) is a particularly challenging metabolic engineering application that consists of several tightly interwoven building blocks such as modeling, simulation, and experimental design. While several general-purpose workflow solutions have emerged in recent years to support the realization of complex scientific applications, these approaches are only partially transferable to 13C-MFA workflows. Whereas problems in other research fields (e.g., bioinformatics) are primarily centered around scientific data processing, 13C-MFA workflows have more in common with business workflows. For instance, many bioinformatics workflows are designed to identify, compare, and annotate genomic sequences by "pipelining" them through standard tools such as BLAST; typically, the next workflow task in the pipeline can be determined automatically from the outcome of the previous step. Five computational challenges have been identified in conducting 13C-MFA studies: organization of heterogeneous data, standardization of processes and unification of tools and data, interactive workflow steering, distributed computing, and service orientation. The outcome of this thesis is a scientific workflow framework (SWF) that is custom-tailored to the specific requirements of 13C-MFA applications. The proposed approach of designing the SWF as a collection of loosely coupled modules glued together with web services eases the realization of 13C-MFA workflows in several ways. By design, existing tools are integrated into the SWF using web service interfaces and foreign programming language bindings (e.g., Java or Python). Although the attributes "easy-to-use" and "general-purpose" are rarely associated with distributed computing software, the presented use cases show that the proposed Hadoop MapReduce framework eases the deployment of computationally demanding simulations on cloud and cluster computing resources. An important building block for interactive, researcher-driven workflows is the ability to track all data needed to understand and reproduce a workflow. The standardization of 13C-MFA studies using a folder structure template, together with the corresponding services and web interfaces, improves the exchange of information within a group of researchers. Finally, several auxiliary tools, ranging from simple helper scripts to visualization and data conversion programs, are developed in the course of this work to complement the SWF modules. This solution distinguishes itself from other scientific workflow approaches by offering a system of loosely coupled components that can be flexibly arranged to match the typical requirements of the metabolic engineering domain. Because the SWF is a modern, service-oriented software framework, new applications are easily composed by reusing existing components.
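    As a minimal sketch of the loosely coupled, web-service-based tool integration described above (not code from the thesis), the following shows how an existing simulator could be wrapped behind an HTTP endpoint so that any workflow component can call it; the endpoint path, payload fields, and the simulate_flux placeholder are illustrative assumptions.

```python
# Hedged sketch: expose a simulation tool as a loosely coupled web-service module.
from flask import Flask, request, jsonify

app = Flask(__name__)

def simulate_flux(model, measurements):
    """Placeholder for a call into an existing 13C-MFA simulator
    (e.g., via a Java or Python binding); returns dummy flux values."""
    return {"v_" + name: 1.0 for name in model.get("reactions", [])}

@app.route("/simulate", methods=["POST"])
def simulate():
    payload = request.get_json()            # network model + labeling measurements
    fluxes = simulate_flux(payload["model"], payload["measurements"])
    # Plain JSON keeps the module language-agnostic: Java, Python, or a BPEL
    # engine can invoke it over HTTP without sharing any code.
    return jsonify({"fluxes": fluxes})

if __name__ == "__main__":
    app.run(port=8080)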

    MapReduce to couple a bio-mechanical and a systems-biological simulation

    Workflow technology has recently raised hopes in the scientific community that complex scientific simulations could become easier to implement and maintain. The subject of this thesis is an existing workflow for a multi-scale simulation that calculates mass flux in porous human bone. The simulation consists of separate systems-biological and bio-mechanical simulation steps coupled through additional data processing steps. The workflow exhibits a high potential for parallelism which is so far exploited only to a marginal degree. We therefore investigate whether "Big Data" concepts such as MapReduce or NoSQL can be integrated into the workflow. A prototype of the workflow is developed using the Apache Hadoop ecosystem to parallelize the simulation, and this prototype is compared against a hand-parallelized baseline prototype in terms of performance and scalability. NoSQL concepts for storing inputs and results are applied, with an emphasis on HDFS, the Hadoop Distributed File System, as a schemaless distributed file system, and on MySQL Cluster as a hybrid between a classic database system and a NoSQL system. Lastly, the MapReduce-based prototype is implemented in the WS-BPEL workflow language using the SIMPL framework [RRS+11] and a custom Web Service to access Hadoop functionality. We show the simplicity of the resulting workflow model and argue that the approach greatly decreases implementation effort while at the same time enabling simulations to scale to very large data volumes with ease.
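    To illustrate the MapReduce parallelization idea in the abstract (this is a hedged sketch, not the thesis prototype), the embarrassingly parallel per-element part of such a coupled simulation could be phrased as a Hadoop Streaming mapper; the tab-separated input format, the element_id key, and the stand-in computation are all assumptions.

```python
#!/usr/bin/env python3
# Hedged sketch of a Hadoop Streaming mapper for the per-element simulation step.
import sys

def systems_biology_step(params):
    """Placeholder for the per-element systems-biological computation."""
    return sum(params) / len(params)

def mapper():
    # Assumed input line format: "<element_id>\t<comma-separated parameters>"
    for line in sys.stdin:
        element_id, raw = line.rstrip("\n").split("\t", 1)
        params = [float(x) for x in raw.split(",")]
        # Emit "key\tvalue" pairs, which Hadoop Streaming groups by key.
        print(f"{element_id}\t{systems_biology_step(params)}")

if __name__ == "__main__":
    mapper()
```

    A matching reducer would collect the per-element results and write them to HDFS, from where the bio-mechanical step could read its inputs.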

    Virtual Cluster Management for Analysis of Geographically Distributed and Immovable Data

    Thesis (Ph.D.) - Indiana University, Informatics and Computing, 2015.
    Scenarios exist in the era of Big Data where computational analysis needs to utilize widely distributed and remote compute clusters, especially when the data sources are sensitive or extremely large and thus unable to move. A large dataset in Malaysia could be ecologically sensitive, for instance, and could not be moved outside the country's boundaries. Controlling an analysis experiment in this virtual cluster setting can be difficult on multiple levels: setup and control, managing the behavior of the virtual cluster, and interoperability across the compute clusters. Further, datasets can be distributed among clusters, or even across data centers, so it becomes critical to utilize data locality information to optimize the performance of data-intensive jobs. Finally, datasets are increasingly sensitive and tied to certain administrative boundaries, although once the data has been processed, the aggregated or statistical results can be shared across those boundaries. This dissertation addresses management and control of a widely distributed virtual cluster holding sensitive or otherwise immovable data sets through a controller. The Virtual Cluster Controller (VCC) gives control back to the researcher. It creates virtual clusters across multiple cloud platforms and, in recognition of sensitive data, can establish a single network overlay over widely distributed clusters. We define a novel class of data, immovable data that we call "pinned data", where the data is treated as a first-class citizen instead of being moved to where it is needed. We draw on our earlier work with a hierarchical data processing model, Hierarchical MapReduce (HMR), to process geographically distributed data, some of which is pinned. The applications implemented in HMR use an extended MapReduce model in which computations are expressed as three functions: Map, Reduce, and GlobalReduce. Further, by facilitating information sharing among resources, applications, and data, the overall performance is improved. Experimental results show that the overhead of VCC is minimal and that HMR outperforms the traditional MapReduce model for a particular class of applications. The evaluations also show that information sharing between resources and applications through the VCC shortens the hierarchical data processing time while satisfying the constraints on the pinned data.
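    The following is an illustrative sketch of the extended Map/Reduce/GlobalReduce model described above; the word-count-style payload and the two in-memory "clusters" are only an example, not the HMR implementation. Local map and reduce run next to the pinned data, and only the compact per-cluster results cross administrative boundaries into the global reduce.

```python
# Hedged sketch of the three-function HMR model (Map, Reduce, GlobalReduce).
from collections import defaultdict

def map_fn(record):
    # Runs inside the cluster that holds the pinned data.
    for word in record.split():
        yield word, 1

def reduce_fn(key, values):
    # Local reduce: only this compact result leaves the administrative boundary.
    return key, sum(values)

def global_reduce_fn(key, values):
    # Runs centrally over the per-cluster partial results.
    return key, sum(values)

def run_local(records):
    grouped = defaultdict(list)
    for record in records:
        for k, v in map_fn(record):
            grouped[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in grouped.items())

def run_global(partials):
    grouped = defaultdict(list)
    for partial in partials:
        for k, v in partial.items():
            grouped[k].append(v)
    return dict(global_reduce_fn(k, vs) for k, vs in grouped.items())

# Two clusters, each with pinned data that never moves:
cluster_a = run_local(["malaria cases kuala lumpur", "malaria cases penang"])
cluster_b = run_local(["dengue cases sarawak"])
print(run_global([cluster_a, cluster_b]))
```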

    Web technologies for environmental big data

    Recent evolutions in computing science and web technology provide the environmental community with continuously expanding resources for data collection and analysis that pose unprecedented challenges to the design of analysis methods, workflows, and interaction with data sets. In the light of the recent UK Research Council funded Environmental Virtual Observatory pilot project, this paper gives an overview of currently available implementations related to web-based technologies for processing large and heterogeneous datasets and discusses their relevance within the context of environmental data processing, simulation, and prediction. We found that the processing of the simple datasets used in the pilot proved to be relatively straightforward using a combination of R, RPy2, PyWPS, and PostgreSQL. However, the use of NoSQL databases and more versatile frameworks such as OGC-standard-based implementations may provide a wider and more flexible set of features that particularly facilitate working with larger volumes and more heterogeneous data sources.
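    A minimal sketch of the kind of R + Python + PostgreSQL glue mentioned above follows; the table name, column names, and connection settings are illustrative assumptions and not details from the paper. In a PyWPS deployment, this logic would typically sit inside a process handler.

```python
# Hedged sketch: fetch environmental data from PostgreSQL and analyze it in R via rpy2.
import psycopg2
import rpy2.robjects as robjects

# Fetch a time series from PostgreSQL (hypothetical database and table).
conn = psycopg2.connect(dbname="evo_pilot", user="evo", host="localhost")
cur = conn.cursor()
cur.execute("SELECT discharge FROM river_flow ORDER BY observed_at")
values = [row[0] for row in cur.fetchall()]
cur.close()
conn.close()

# Hand the series to R and compute a simple summary statistic.
r_vector = robjects.FloatVector(values)
r_mean = robjects.r["mean"](r_vector)[0]
print(f"mean discharge: {r_mean}")
```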

    Web service composition: A survey of techniques and tools

    Web services are a consolidated reality of the modern Web with a tremendous and still increasing impact on everyday computing tasks. They have turned the Web into the largest, most accepted, and most vivid distributed computing platform ever. Yet, the use and integration of Web services into composite services or applications, which is a delicate and conceptually non-trivial task, has still not unleashed its full potential. A consolidated analysis framework that advances the fundamental understanding of Web service composition building blocks in terms of concepts, models, languages, productivity support techniques, and tools is required. Such a framework is necessary to enable effective exploration, understanding, assessment, comparison, and selection of service composition models, languages, techniques, platforms, and tools. This article establishes such a framework and reviews the state of the art in service composition from an unprecedented, holistic perspective.
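    As a toy illustration of what "composing services" means in practice (an orchestration-style sketch with hypothetical endpoint URLs, not an example from the survey), one component can call two existing services in sequence and expose the combined conversation as a single operation.

```python
# Hedged sketch: a composite service orchestrating two existing REST services.
import requests

def composite_forecast(city: str) -> dict:
    # Step 1: resolve the city to coordinates via a (hypothetical) geocoding service.
    geo = requests.get("https://example.org/geocode", params={"q": city}).json()
    # Step 2: feed the coordinates into a (hypothetical) weather service.
    weather = requests.get(
        "https://example.org/forecast",
        params={"lat": geo["lat"], "lon": geo["lon"]},
    ).json()
    # The composite hides the two-step interaction behind one operation.
    return {"city": city, "forecast": weather["summary"]}

if __name__ == "__main__":
    print(composite_forecast("Trento"))
```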

    CHOReOS Middleware Specification (D3.1)

    This deliverable specifies the main concepts of the CHOReOS middleware architecture. Starting from the Future Internet (FI) challenges of scalability, heterogeneity, mobility, awareness, and adaptation investigated in prior work in WP1, we introduce the aforementioned concepts to address the requirements derived from the FI challenges. In particular, we propose an extensible and scalable service discovery approach for the organization and discovery of services that relies on multiple service discovery protocols. Moreover, we introduce an extensible and scalable approach, based on the service bus paradigm, for service access that features the integration and adaptation of multiple interaction protocols. Furthermore, we propose solutions that enable the execution of FI service compositions, ranging from compositions of choreographed services developed according to the CHOReOS development process to massive compositions of things. Finally, we detail the Cloud & Grid middleware facilities that support the overall middleware and the choreographies built on it, via a unified API that provides access to multiple cloud infrastructures (e.g., Amazon EC2, HP Open Cirrus, private clouds).
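    The following is a hedged sketch of the plug-in style that a multi-protocol service discovery layer of this kind could take; the class names, protocol choices, and in-memory registry are assumptions for illustration, not the CHOReOS implementation.

```python
# Hedged sketch: one discovery facade delegating to multiple discovery protocols.
from abc import ABC, abstractmethod

class DiscoveryPlugin(ABC):
    @abstractmethod
    def lookup(self, service_type: str) -> list[str]:
        """Return endpoint URLs for services of the given type."""

class DnsSdPlugin(DiscoveryPlugin):
    def lookup(self, service_type: str) -> list[str]:
        # Placeholder: a real plug-in would issue DNS-SD queries on the local network.
        return [f"http://local.example/{service_type}"]

class RegistryPlugin(DiscoveryPlugin):
    def __init__(self, registry: dict[str, list[str]]):
        self.registry = registry
    def lookup(self, service_type: str) -> list[str]:
        return self.registry.get(service_type, [])

class ServiceDiscovery:
    """Facade that merges results from every registered protocol plug-in."""
    def __init__(self, plugins: list[DiscoveryPlugin]):
        self.plugins = plugins
    def lookup(self, service_type: str) -> list[str]:
        endpoints: list[str] = []
        for plugin in self.plugins:
            endpoints.extend(plugin.lookup(service_type))
        return endpoints

discovery = ServiceDiscovery([
    DnsSdPlugin(),
    RegistryPlugin({"printing": ["http://hub.example/print"]}),
])
print(discovery.lookup("printing"))
```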

    Toward Customizable Multi-tenant SaaS Applications

    Nowadays, computing is so pervasive that it has indeed become the fifth utility (after water, electricity, gas, and telephony), as Leonard Kleinrock once envisioned. Evolved from utility computing, cloud computing has emerged as a computing infrastructure that enables rapid delivery of computing resources as a utility in a dynamically scalable, virtualized manner. However, current industrial cloud computing implementations promote segregation among different cloud providers, which leads to user lock-in because of prohibitive migration costs. On the other hand, Service-Oriented Computing (SOC), including service-oriented architecture (SOA) and Web Services (WS), promotes standardization and openness through its enabling standards and communication protocols. This thesis proposes a Service-Oriented Cloud Computing Architecture (SOCCA) that combines the best attributes of the two paradigms to promote an open, interoperable environment for cloud computing development. Multi-tenant SaaS applications built on top of SOCCA have more flexibility and are not locked into a particular platform. Each tenant of a multi-tenant application appears to be the sole owner of the application and is not aware of the existence of others. A multi-tenant SaaS application accommodates each tenant's unique requirements by allowing tenant-level customization. A complex SaaS application that supports hundreds or even thousands of tenants could have hundreds of customization points, each offering multiple options, and this could result in a huge number of ways to customize the application. This dissertation therefore also proposes innovative customization approaches that study similar tenants' customization choices and individual users' behavior and then provide a guided, semi-automated customization process for future tenants. Such a semi-automated customization process enables tenants to quickly implement the customization that best suits their business needs.
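    To make the idea of guided, tenant-level customization concrete, here is a hedged sketch of one possible heuristic: recommend, for each customization point, the option most often chosen by similar tenants. The customization points, tenant records, and the industry-based similarity criterion are illustrative assumptions, not the approaches proposed in the dissertation.

```python
# Hedged sketch: recommend customization options based on similar tenants' choices.
from collections import Counter

# Existing tenants: an industry label plus their choices at each customization point.
tenants = [
    {"industry": "retail",    "choices": {"invoice_layout": "compact",  "currency": "EUR"}},
    {"industry": "retail",    "choices": {"invoice_layout": "compact",  "currency": "USD"}},
    {"industry": "logistics", "choices": {"invoice_layout": "detailed", "currency": "USD"}},
]

def recommend(industry: str, point: str) -> str | None:
    """Suggest the option most frequently picked by tenants in the same industry."""
    votes = Counter(
        t["choices"][point]
        for t in tenants
        if t["industry"] == industry and point in t["choices"]
    )
    return votes.most_common(1)[0][0] if votes else None

# A new retail tenant gets a pre-filled suggestion it can accept or override.
print(recommend("retail", "invoice_layout"))  # -> "compact"
```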