
    Framework for Automated Partitioning of Scientific Workflows on the Cloud

    Scientific workflows have become a standardized way for scientists to represent a set of tasks that together solve a particular problem. These workflows typically consist of many jobs that are both CPU-heavy and I/O-intensive and are executed through a workflow management system on clouds, grids or supercomputers. Previous work has shown that applying a k-way partitioning algorithm to distribute a workflow's tasks across multiple machines in the cloud reduces the overall data communication and therefore lowers bandwidth costs. In this thesis, a framework was built to automate this process: it partitions any workflow submitted for execution on the Pegasus workflow management system in the cloud. The framework provisions instances in the cloud using CloudML, configures and installs all the software needed for execution, runs and partitions the scientific workflow, and finally displays a time-estimation visualization, giving the user approximate guidance on how many resources to provision in order to finish an experiment within a given time frame.
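    The partitioning step described above can be illustrated with a small sketch: a workflow is treated as a task graph with data sizes on its edges, and tasks are greedily assigned to k partitions so that as little data as possible crosses partition boundaries. This is not the framework's actual algorithm; the greedy heuristic, task names and data sizes below are illustrative assumptions.

```python
# Sketch of k-way partitioning of a workflow task graph to reduce
# inter-partition data transfer. The DAG and edge sizes are hypothetical.
import math
from collections import defaultdict

def greedy_kway_partition(edges, k):
    """edges: {(src_task, dst_task): data_size_MB}. Returns {task: partition_id}."""
    tasks = sorted({t for edge in edges for t in edge})
    capacity = math.ceil(len(tasks) / k)          # keep partitions roughly balanced
    assignment, load = {}, defaultdict(int)
    for task in tasks:
        best_part, best_cut = None, None
        for p in range(k):
            if load[p] >= capacity:
                continue
            # Data that would cross partition boundaries if `task` is placed on p.
            cut = sum(size for (u, v), size in edges.items()
                      if task in (u, v)
                      and assignment.get(v if u == task else u, p) != p)
            if best_cut is None or cut < best_cut:
                best_part, best_cut = p, cut
        assignment[task] = best_part
        load[best_part] += 1
    return assignment

# Hypothetical workflow: two independent branches fanning out of t1.
workflow_edges = {("t1", "t2"): 500, ("t1", "t3"): 500,
                  ("t2", "t4"): 200, ("t3", "t5"): 200}
print(greedy_kway_partition(workflow_edges, k=2))
```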

    Adaptive Optimizations for Stream-based Workflows


    Simulation of the performance of complex data-intensive workflows

    PhD thesis. Recently, cloud computing has been used for analytical and data-intensive processes, as it offers many attractive features, including resource pooling, on-demand capability and rapid elasticity. Scientific workflows use these features to tackle the problems of complex data-intensive applications. Data-intensive workflows are composed of many tasks that may involve large input data sets and produce large amounts of output data, and they typically run in highly dynamic environments. Resources should therefore be allocated dynamically as the demand of the workflow changes, since over-provisioning increases cost and under-provisioning causes Service Level Agreement (SLA) violations and poor Quality of Service (QoS). Performance prediction of complex workflows is a necessary step prior to their deployment. Performance analysis of complex data-intensive workflows is challenging due to the complexity of their structure, the diversity of big data, and data dependencies, in addition to the effort required to examine the performance and challenges of running such workflows in a real cloud. In this thesis, a solution is explored to address these challenges, using a Next Generation Sequencing (NGS) workflow pipeline as a case study, which may require hundreds or thousands of CPU hours to process a terabyte of data. We propose a methodology to model, simulate and predict the runtime and the number of resources used by complex data-intensive workflows. One contribution of our simulation methodology is that it can extract the simulation parameters (e.g., MIPS and bandwidth values) required for constructing a training set, and it yields a fairly accurate runtime prediction for cluster sizes much larger than those used to train the prediction model. The proposed methodology derives runtime predictions from historical data in the provenance files. We present runtime predictions of the complex workflow under different execution scenarios in the cloud, such as execution failure and library deployment time. In the case of failure, the framework can apply the prediction partially, considering only the successful parts of the pipeline; in the latter case, it can predict with or without the time needed to deploy libraries. To further improve the accuracy of prediction, we propose a simulation model that handles I/O contention.
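    As a loose illustration of predicting runtime from provenance history (not the simulation-based methodology of the thesis itself), the sketch below fits a simple regression on hypothetical provenance records and extrapolates to a cluster larger than any used for training; all field names and numbers are made up.

```python
# Minimal sketch of runtime prediction from workflow provenance history,
# assuming runtime is roughly proportional to input size divided by core count.
# The provenance fields (input_gb, cores, runtime_s) are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical provenance records gathered from small-scale runs.
provenance = [
    {"input_gb": 10, "cores": 4,  "runtime_s": 5200},
    {"input_gb": 10, "cores": 8,  "runtime_s": 2900},
    {"input_gb": 50, "cores": 8,  "runtime_s": 14100},
    {"input_gb": 50, "cores": 16, "runtime_s": 7600},
]

X = np.array([[r["input_gb"] / r["cores"]] for r in provenance])
y = np.array([r["runtime_s"] for r in provenance])
model = LinearRegression().fit(X, y)

# Extrapolate to a cluster much larger than those used for training:
# roughly 1 TB of input processed on 64 cores.
pred = model.predict(np.array([[1000 / 64]]))
print(f"Predicted runtime: {pred[0] / 3600:.1f} hours")
```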

    Workflow scheduling for service oriented cloud computing

    Service Orientation (SO) and grid computing are two computing paradigms that, when put together using Internet technologies, promise to provide a scalable yet flexible computing platform for a diverse set of distributed computing applications. This practice gives rise to the notion of a computing cloud that addresses some previous limitations of interoperability, resource sharing and utilization within distributed computing. In such a Service Oriented Computing Cloud (SOCC), applications are formed by composing a set of services together. In addition, hierarchical service layers are also possible, where general-purpose services at lower layers are composed to deliver more domain-specific services at higher layers. In general, an SOCC is a horizontally scalable computing platform that offers its resources as services in a standardized fashion. Workflow-based applications are a suitable target for an SOCC, where workflow tasks are executed via service calls within the cloud. One or more workflows can be deployed over an SOCC, and their execution requires scheduling services to workflow tasks as the tasks become ready according to their interdependencies. In this thesis, heuristic-based scheduling policies are evaluated for scheduling workflows over a collection of services offered by the SOCC. Various execution scenarios and workflow characteristics are considered to understand the implications of heuristic-based workflow scheduling.
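    A minimal sketch of the kind of heuristic scheduling evaluated above is shown below: ready tasks are mapped to the service that yields the earliest estimated finish time (a min-min-style rule). The specific policy, task and service names, and execution times are assumptions for illustration, not the thesis's actual heuristics.

```python
# Minimal sketch of a heuristic workflow scheduler over cloud services:
# ready tasks are assigned to the service with the earliest estimated
# finish time. Task/service names and times are hypothetical.
def schedule(tasks, deps, exec_time, services):
    """deps: {task: set(prerequisite tasks)}; exec_time: {(task, service): seconds}."""
    service_free = {s: 0.0 for s in services}   # when each service becomes idle
    finish = {}                                 # finish time of scheduled tasks
    plan, remaining = [], set(tasks)

    def estimated_finish(t, s):
        ready_at = max((finish[p] for p in deps.get(t, set())), default=0.0)
        return max(service_free[s], ready_at) + exec_time[(t, s)]

    while remaining:
        ready = [t for t in remaining if deps.get(t, set()) <= finish.keys()]
        t, s = min(((t, s) for t in ready for s in services),
                   key=lambda ts: estimated_finish(*ts))
        finish[t] = service_free[s] = estimated_finish(t, s)
        plan.append((t, s, finish[t]))
        remaining.remove(t)
    return plan

# Hypothetical diamond-shaped workflow scheduled onto two services.
deps = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
exec_time = {(t, secs_pair[0]): secs_pair[1] for t in [] for secs_pair in []}  # placeholder
exec_time = {(t, s): secs for t, secs in [("A", 5), ("B", 10), ("C", 8), ("D", 4)]
             for s in ["svc1", "svc2"]}
print(schedule(["A", "B", "C", "D"], deps, exec_time, ["svc1", "svc2"]))
```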

    Helmholtz Portfolio Theme Large-Scale Data Management and Analysis (LSDMA)

    The Helmholtz Association funded the "Large-Scale Data Management and Analysis" portfolio theme from 2012 to 2016. Four Helmholtz centres, six universities and another research institution in Germany joined to enable data-intensive science by optimising data life cycles in selected scientific communities. In our Data Life Cycle Labs, data experts performed joint R&D together with scientific communities. The Data Services Integration Team focused on generic solutions applied by several communities.

    Big Geospatial Data processing in the IQmulus Cloud

    Remote sensing instruments are continuously evolving in terms of spatial, spectral and temporal resolution and hence provide exponentially increasing amounts of raw data. These volumes increase significantly faster than computing speeds. All these techniques record large amounts of data, yet in different data models and representations; therefore, the resulting datasets require harmonization and integration prior to deriving meaningful information from them. All in all, huge datasets are available, but raw data is of almost no value if not processed, semantically enriched and quality checked. The derived information needs to be transferred and published to all levels of possible users (from decision makers to citizens). Up to now, there are only limited automatic procedures for this; thus, a wealth of information remains latent in many datasets. This paper presents the first achievements of the IQmulus EU FP7 research and development project with respect to processing and analysis of big geospatial data in the context of flood and waterlogging detection.

    SparkFlow : towards high-performance data analytics for Spark-based genome analysis

    The recent advances in DNA sequencing technology have triggered next-generation sequencing (NGS) research at full scale. Big Data (BD) is becoming the main driver in analyzing these large-scale bioinformatic data. However, this complicated process has become the system bottleneck, requiring an amalgamation of scalable approaches to deliver the needed performance and hide the deployment complexity. Utilizing cutting-edge scientific workflows can robustly address these challenges. This paper presents a Spark-based alignment workflow called SparkFlow for massive NGS analysis over Singularity containers. SparkFlow is highly scalable, reproducible, and capable of parallelizing computation by utilizing data-level parallelism and load-balancing techniques in HPC and cloud environments. The proposed workflow is benchmarked on two state-of-the-art NGS workflow stages, i.e., BaseRecalibrator and ApplyBQSR. SparkFlow accelerates large-scale cancer genomic analysis by scaling vertically (HyperThreading) and horizontally (on-demand provisioning). Our results demonstrate an inevitable trade-off between the targeted applications and the processor architecture. SparkFlow achieves a decisive improvement in NGS computation performance, throughput, and scalability while keeping deployment complexity manageable. The paper's findings aim to pave the way for a wide range of enhancements and future trends within the High-performance Data Analytics (HPDA) genome analysis realm.
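    The data-level parallelism SparkFlow relies on can be sketched roughly as follows: genomic intervals are distributed across Spark partitions and each executor runs the recalibration tool on its share of the data. This is an illustrative sketch, not SparkFlow's implementation; the interval scheme, file paths and container setup are assumptions.

```python
# Sketch of Spark data-level parallelism for an NGS recalibration step.
# Assumes gatk is available on each executor (e.g. inside its Singularity
# container); paths, interval sizes and partition counts are placeholders.
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ngs-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical genomic intervals to process independently on the executors.
intervals = [f"chr1:{i * 10_000_000 + 1}-{(i + 1) * 10_000_000}" for i in range(24)]

def recalibrate(partition):
    # Each executor processes its own intervals with GATK BaseRecalibrator.
    for interval in partition:
        out = f"recal_{interval.replace(':', '_')}.table"
        subprocess.run(
            ["gatk", "BaseRecalibrator",
             "-I", "sample.bam", "-R", "reference.fasta",
             "-L", interval, "--known-sites", "known_sites.vcf",
             "-O", out],
            check=True)
        yield interval

done = sc.parallelize(intervals, numSlices=8).mapPartitions(recalibrate).collect()
print(f"Processed {len(done)} intervals")
```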