
    Trusted resource allocation in volunteer edge-cloud computing for scientific applications

    Data-intensive science applications in fields such as bioinformatics, health sciences, and materials discovery are becoming increasingly dynamic and demanding in their resource requirements. Researchers using these applications, which are based on advanced scientific workflows, frequently require a diverse set of resources that are often not available within private servers or a single Cloud Service Provider (CSP). For example, a user working with precision medicine applications may accept only those CSPs that follow HIPAA (Health Insurance Portability and Accountability Act) guidelines for implementing their data services, yet may want services from other CSPs for economic viability. As ever more data is generated, these workflows often require deployment and dynamic scaling of multi-cloud resources in an efficient and high-performance manner (e.g., quick setup, reduced computation time, and increased application throughput). At the same time, users seek to minimize the cost of configuring the related multi-cloud resources. While performance and cost are among the key factors in CSP resource selection, scientific workflows often process proprietary or confidential data, which introduces additional security constraints. Users therefore have to make an informed decision on the selection of resources best suited to their applications while trading off the key factors of resource selection: performance, agility, cost, and security (PACS). Furthermore, even with the most efficient resource allocation across multiple clouds, the cost to solution may not be economical for all users, which has led to new computing paradigms such as volunteer computing, where users employ volunteered cyber resources to meet their computing requirements. For such resources to be economical and readily available, they must integrate well with cloud resources to provide the most efficient computing infrastructure for users. This dissertation tackles the individual stages in the lifecycle of resource brokering for users: collection of user requirements, capture of users' resource preferences, resource brokering, and task scheduling. For the collection of user requirements, a novel approach based on an iterative design interface is proposed. In addition, a fuzzy inference-based approach is proposed to capture users' biases and expertise and guide resource selection for their applications. The results showed improved performance, i.e., time to execute, in 98 percent of the studied applications. The data collected on users' requirements and preferences is then used by an optimizer engine and machine learning algorithms for resource brokering. For resource brokering, a new integer linear programming-based solution (OnTimeURB) is proposed, which creates multi-cloud template solutions for resource allocation while optimizing performance, agility, cost, and security, as sketched below. The solution was further improved by adding a machine learning model based on a Naive Bayes classifier that captures the true QoS of cloud resources to guide template solution creation. The proposed solution improved the time to execute for as much as 96 percent of the largest applications.
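    The sketch below illustrates, with the PuLP solver, the flavor of an integer linear program that selects one cloud offering per workflow task while trading off the PACS factors; the offerings, preference weights, and security floor are illustrative assumptions, and this is a minimal sketch rather than the OnTimeURB formulation.

```python
# Minimal sketch (not OnTimeURB): pick one cloud offering per workflow task
# while weighing cost and execution time, subject to a security floor.
# All offering data and weights below are illustrative assumptions.
from pulp import LpProblem, LpVariable, LpMinimize, LpBinary, lpSum

tasks = ["align", "assemble"]
offerings = {                      # (cost $/h, runtime h, security score 0-1)
    "csp_a_small": (0.10, 4.0, 0.90),
    "csp_b_large": (0.40, 1.0, 0.70),
    "csp_c_gpu":   (0.90, 0.5, 0.95),
}
w_cost, w_perf = 0.5, 0.5          # assumed user preference weights
min_security = 0.8                 # e.g., a HIPAA-motivated floor

prob = LpProblem("pacs_resource_selection", LpMinimize)
x = {(t, o): LpVariable(f"x_{t}_{o}", cat=LpBinary)
     for t in tasks for o in offerings}

# Objective: weighted monetary cost plus execution time across all tasks.
prob += lpSum(x[t, o] * (w_cost * offerings[o][0] * offerings[o][1]
                         + w_perf * offerings[o][1])
              for t in tasks for o in offerings)

for t in tasks:
    # Exactly one offering per task.
    prob += lpSum(x[t, o] for o in offerings) == 1
    # Offerings below the security floor may not be selected.
    for o in offerings:
        if offerings[o][2] < min_security:
            prob += x[t, o] == 0

prob.solve()
for (t, o), var in x.items():
    if var.value() == 1:
        print(t, "->", o)
```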
    As discussed above, to fulfill the need for economical computing resources, a new computing paradigm, namely Volunteer Edge Computing (VEC), is proposed, which reduces cost and improves performance and security by creating edge clusters of volunteered computing resources close to users. Initial results show improved execution times for application workflows compared with state-of-the-art solutions while utilizing only the most secure VEC resources. We then apply reinforcement learning-based solutions to characterize volunteered resources in terms of their availability and their flexibility in implementing security policies. This characterization facilitates efficient resource allocation and scheduling of workflow tasks, improving the performance and throughput of workflow executions. The VEC architecture is further validated with state-of-the-art bioinformatics and manufacturing workflows.
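    Below is a minimal epsilon-greedy sketch of how reinforcement learning could be used to characterize volunteer nodes by their observed availability; the node names, availability probabilities, and reward definition are illustrative assumptions rather than the dissertation's actual formulation.

```python
# Minimal epsilon-greedy sketch of learning which volunteer edge nodes are
# reliably available; node names and probabilities are made-up placeholders.
import random

nodes = {"vol_node_1": 0.95, "vol_node_2": 0.60, "vol_node_3": 0.80}  # true availability (hidden from the scheduler)
q = {n: 0.0 for n in nodes}        # estimated reliability per node
counts = {n: 0 for n in nodes}
epsilon = 0.1

def schedule_task():
    # Explore occasionally, otherwise pick the node estimated most reliable.
    if random.random() < epsilon:
        node = random.choice(list(nodes))
    else:
        node = max(q, key=q.get)
    reward = 1.0 if random.random() < nodes[node] else 0.0  # did the task finish?
    counts[node] += 1
    q[node] += (reward - q[node]) / counts[node]            # incremental mean update
    return node, reward

for _ in range(1000):
    schedule_task()
print({n: round(v, 2) for n, v in q.items()})
```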

    Performance Evaluation of Parallel Haemodynamic Computations on Heterogeneous Clouds

    The article presents a performance evaluation of parallel haemodynamic flow computations on heterogeneous resources of an OpenStack cloud infrastructure. The main focus is on the parallel performance analysis, energy consumption, and virtualization overhead of the developed software service, which is based on the ANSYS Fluent platform and runs on Docker containers of the private university cloud. The haemodynamic aortic valve flow described by the incompressible Navier-Stokes equations is considered as a target application of the hosted cloud infrastructure. The parallel performance of the developed software service is assessed by measuring the parallel speedup of computations carried out on virtualized heterogeneous resources. The performance measured on Docker containers is compared with that obtained on the native hardware. Alternative solution algorithms are explored in terms of parallel performance and power consumption. The trade-off between computing speed and consumed energy is investigated using Pareto front analysis and a linear scalarization method.
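    The snippet below sketches the linear scalarization step for weighing computing time against consumed energy; the core counts, time and energy figures, and equal weights are placeholders for illustration, not measurements from the article.

```python
# Sketch of linear scalarization over two objectives (wall time, energy).
# All numbers below are illustrative placeholders, not the article's results.
configs = {
    # cores: (wall time in s, energy in J)
    4:  (820.0, 41_000.0),
    8:  (450.0, 46_000.0),
    16: (260.0, 55_000.0),
    32: (170.0, 72_000.0),
}

def scalarize(time_s, energy_j, w_time=0.5, w_energy=0.5):
    # Normalize both objectives before taking the weighted sum.
    t_max = max(t for t, _ in configs.values())
    e_max = max(e for _, e in configs.values())
    return w_time * time_s / t_max + w_energy * energy_j / e_max

best = min(configs, key=lambda c: scalarize(*configs[c]))
print("Preferred core count under equal weights:", best)
```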

    Addendum to Informatics for Health 2017: Advancing both science and practice

    This article presents the presentation and poster abstracts that were mistakenly omitted from the original publication.

    Scientific Computing Meets Big Data Technology: An Astronomy Use Case

    Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to parallelize these analyses. In this work, we investigate an alternative approach that uses Apache Spark, a modern big data platform, to parallelize many-task applications. We present Kira, a flexible and distributed astronomy image processing toolkit using Apache Spark. We then use the Kira toolkit to implement a Source Extractor application for astronomy images, called Kira SE. With Kira SE as the use case, we study the programming flexibility, dataflow richness, scheduling capacity, and performance of Apache Spark running on the EC2 cloud. By exploiting data locality, Kira SE achieves a 2.5x speedup over an equivalent C program when analyzing a 1 TB dataset using 512 cores on the Amazon EC2 cloud. Furthermore, we show that by leveraging software originally designed for big data infrastructure, Kira SE achieves performance competitive with the C implementation running on the NERSC Edison supercomputer. Our experience with Kira indicates that emerging big data platforms such as Apache Spark are a performant alternative for many-task scientific applications.
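    The following PySpark fragment sketches the many-task pattern described above, applying an independent per-image routine across a distributed set of files; the bucket path and the extract_sources() routine are hypothetical stand-ins rather than Kira SE's actual API.

```python
# Sketch of a many-task application on Spark: each image is processed
# independently by a single-process routine. Path and extract_sources()
# are hypothetical placeholders, not part of Kira SE.
from pyspark import SparkContext

def extract_sources(image_bytes):
    # Placeholder for a per-image source-extraction step; would return a
    # list of detected sources in a real pipeline.
    return []

sc = SparkContext(appName="many-task-source-extraction")
images = sc.binaryFiles("s3://example-bucket/fits/")        # (path, bytes) pairs
catalog = images.map(lambda kv: (kv[0], extract_sources(kv[1])))
print(catalog.count(), "images processed")
sc.stop()
```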

    A Study of Scalability and Cost-effectiveness of Large-scale Scientific Applications over Heterogeneous Computing Environment

    Recent advances in large-scale experimental facilities have ushered in an era of data-driven science. These large-scale data increase the opportunity to answer many fundamental questions in basic science. However, they also pose new challenges to the scientific community in terms of optimal processing and transfer. Consequently, scientists are in dire need of robust high performance computing (HPC) solutions that can scale with terabytes of data. In this thesis, I address the challenges in three major aspects of scientific big data processing: 1) developing scalable software and algorithms for data- and compute-intensive scientific applications; 2) proposing new cluster architectures that these software tools need for good performance; and 3) transferring big scientific datasets among clusters situated at geographically disparate locations. In the first part, I develop scalable algorithms to process huge amounts of scientific big data using recent analytic tools such as Hadoop, Giraph, and NoSQL stores. At a broader level, these algorithms take advantage of locality-based computing that can scale with increasing amounts of data. The thesis mainly addresses the challenges of large-scale genome analysis applications, such as genomic error correction and genome assembly, which have recently moved to the forefront of big data challenges. In the second part of the thesis, I perform a systematic benchmark study using the above-mentioned algorithms on different distributed cyberinfrastructures to pinpoint the limitations of a traditional HPC cluster for big data processing. I then propose a solution to those limitations by balancing the I/O bandwidth of the solid state drive (SSD) with the computational speed of high-performance CPUs. A theoretical model is also proposed to help HPC system designers who are striving for system balance. In the third part of the thesis, I develop a high-throughput architecture for transferring big scientific datasets among geographically disparate clusters. The architecture leverages Ethereum's blockchain technology and Swarm's peer-to-peer (P2P) storage technology to transfer the data in a secure, tamper-proof fashion. Rather than optimizing computation in a single cluster, the major motivation in this part is to foster translational research and data interoperability in collaboration with multiple institutions.
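    As a back-of-envelope illustration of the system-balance idea, the snippet below checks whether an SSD can feed data as fast as the CPUs can consume it; the bandwidth, throughput, and bytes-per-operation numbers are assumed for illustration and do not reproduce the thesis's theoretical model.

```python
# Back-of-envelope sketch of I/O vs. compute balance: a node is roughly
# balanced for a data-intensive kernel when the SSD can supply data as fast
# as the CPUs consume it. All figures here are illustrative assumptions.
def io_bound(ssd_bandwidth_gbs, cpu_gops, bytes_per_op):
    demanded_gbs = cpu_gops * bytes_per_op       # data rate the CPUs could consume
    return demanded_gbs > ssd_bandwidth_gbs      # True -> storage is the bottleneck

# Example: 3 GB/s SSD, 200 Gop/s of usable CPU throughput, 0.05 bytes moved per op.
print("I/O bound:", io_bound(3.0, 200.0, 0.05))
```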

    Automated, Systematic and Parallel Approaches to Software Testing in Bioinformatics

    Software quality assurance becomes especially critical if bioinformatics tools are to be used in a translational medical setting, such as the analysis and interpretation of biological data. We must ensure that only validated algorithms are used and that they are implemented correctly in the analysis pipeline, undisturbed by hardware or software failure. In this thesis, I review common quality assurance practice and guidelines for bioinformatics software testing. Furthermore, I present a novel cloud-based framework to enable automated testing of genetic sequence alignment programs. This framework performs testing based on gold standard simulation data sets and on metamorphic testing. I demonstrate the effectiveness of this cloud-based framework using two widely used sequence alignment programs, BWA and Bowtie, and several fault-seeded 'mutant' versions of BWA and Bowtie. This preliminary study demonstrates that this type of cloud-based software testing framework is an effective and promising way to implement quality assurance in bioinformatics software used in genomic medicine.
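    The fragment below sketches one common metamorphic relation for read aligners such as BWA or Bowtie: a read and its reverse complement should map to the same reference locus on opposite strands. The run_aligner() function is a hypothetical wrapper around the tool under test, not part of the framework described in the thesis.

```python
# Sketch of a metamorphic test for a sequence aligner: aligning a read and
# its reverse complement should yield the same position with the strand
# flipped. run_aligner() is a hypothetical wrapper around the tool under test.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def run_aligner(read):
    # Hypothetical wrapper: invoke the aligner on `read` and return
    # (reference position, strand) parsed from its output.
    raise NotImplementedError

def test_reverse_complement_relation(read):
    pos_fwd, strand_fwd = run_aligner(read)
    pos_rev, strand_rev = run_aligner(reverse_complement(read))
    # Metamorphic relation: same locus, opposite strand.
    assert pos_fwd == pos_rev and strand_fwd != strand_rev
```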