82,206 research outputs found

    Teaching data science and cloud computing in low and middle income countries

    Large, publicly available data sets present both a challenge and an opportunity for researchers based in Low and Middle Income Countries (LMIC). The challenge is how these researchers can make use of such data sets given poor connectivity and infrastructure. The opportunity is the ability to perform leading-edge research with these data sets without having to invest substantial resources in generating them. This, in turn, would produce solutions to substantial local problems in these countries and create a workforce educated in data science. Cloud computing in particular may well close the infrastructural gap. In this paper we discuss our experiences of teaching a variety of summer schools on data-intensive analysis in bioinformatics in China, Namibia and Malaysia. On the basis of these experiences, we propose that a larger series of summer schools in data science and cloud computing in LMIC would create a cadre of data scientists to start this process. Finally, we discuss the possibility of providing cloud computing resources with controlled usage costs so that they are affordable for LMIC researchers.

    Performance optimization of big data computing workflows for batch and stream data processing in multi-clouds

    Workflow techniques have been widely used as a major computing solution in many science domains. With the rapid deployment of cloud infrastructures around the globe and the economic benefits of cloud-based computing and storage services, an increasing number of scientific workflows have migrated or are in active transition to clouds. As the scale of scientific applications continues to grow, it is now common to deploy various data- and network-intensive computing workflows, such as serial computing workflows, MapReduce/Spark-based workflows, and Storm-based stream data processing workflows, in multi-cloud environments, where inter-cloud data transfer often plays a significant role in both workflow performance and financial cost. Rigorous mathematical models are constructed to analyze the intra- and inter-cloud execution of scientific workflows, and a class of budget-constrained workflow mapping problems is formulated to optimize the network performance of big data workflows in multi-cloud environments. These problems are shown to be NP-complete, and a heuristic solution that accounts for module execution, data transfer, and I/O operations is designed for each. The performance superiority of the proposed solutions over existing methods is illustrated through extensive simulations and further verified by real-life workflow experiments deployed in public clouds.
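
    The abstract does not reproduce the heuristics themselves. As a rough, illustrative sketch of the kind of budget-constrained mapping it describes, the following Python fragment greedily assigns each workflow module to the fastest cloud whose cost still fits the remaining budget, folding execution, data transfer and I/O into a single time estimate. All field names and figures are assumptions for the example, not the paper's algorithm or model.

    # Illustrative sketch only: a greedy, budget-constrained mapping of workflow
    # modules onto clouds. Field names and numbers are invented for the example.
    def map_workflow(modules, clouds, budget):
        """Assign each module to the fastest cloud whose cost fits the remaining budget."""
        placement, remaining = {}, budget
        for module in modules:
            options = []
            for cloud in clouds:
                exec_time = module["work"] / cloud["speed"]        # module execution
                io_time = module["data"] / cloud["bandwidth"]      # data transfer and I/O
                cost = exec_time * cloud["price_per_hour"]
                options.append((exec_time + io_time, cost, cloud["name"]))
            # Prefer the fastest affordable option; if nothing fits the remaining
            # budget, fall back to the fastest option overall.
            affordable = [o for o in options if o[1] <= remaining]
            time_est, cost, name = min(affordable or options, key=lambda o: (o[0], o[1]))
            placement[module["name"]] = name
            remaining -= cost
        return placement

    if __name__ == "__main__":
        modules = [{"name": "m1", "work": 100, "data": 50},
                   {"name": "m2", "work": 400, "data": 10}]
        clouds = [{"name": "cloudA", "speed": 50, "bandwidth": 100, "price_per_hour": 2.0},
                  {"name": "cloudB", "speed": 20, "bandwidth": 200, "price_per_hour": 0.5}]
        print(map_workflow(modules, clouds, budget=10.0))

    A solution along the lines the paper describes would also account for inter-module dependencies and inter-cloud transfer costs rather than treating modules independently, as this simplified sketch does.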

    Genomic cloud computing: Legal and ethical points to consider

    The biggest challenge in twenty-first-century data-intensive genomic science is developing the vast computer infrastructure and advanced software tools needed to perform comprehensive analyses of genomic data sets for biomedical research and clinical practice. Researchers are increasingly turning to cloud computing both as a solution to integrate data from genomics, systems biology and biomedical data mining and as an approach to analyze data to solve biomedical problems. Although cloud computing provides several benefits such as lower costs and greater efficiency, it also raises legal and ethical issues. In this article, we discuss three key 'points to consider' (data control; data security, confidentiality and transfer; and accountability) based on a preliminary review of several publicly available cloud service providers' Terms of Service. These 'points to consider' should be borne in mind by genomic research organizations when negotiating legal arrangements to store genomic data on a large commercial cloud service provider's servers. Diligent genomic cloud computing means leveraging security standards and evaluation processes as a means to protect data, and it entails many of the same good practices that researchers should always consider in securing their local infrastructure.

    A survey of the European Open Science Cloud services for expanding the capacity and capabilities of multidisciplinary scientific applications

    Open Science is a paradigm in which scientific data, procedures, tools and results are shared transparently and reused by society as a whole. The initiative known as the European Open Science Cloud (EOSC) is an effort in Europe to provide an open, trusted, virtual and federated computing environment to execute scientific applications, and to store, share and re-use research data across borders and scientific disciplines. Additionally, scientific services are becoming increasingly data-intensive, not only in terms of computationally intensive tasks but also in terms of storage resources. Computing paradigms such as High Performance Computing (HPC) and Cloud Computing are applied to e-science applications to meet these demands. However, adapting applications and services to these paradigms is not a trivial task, commonly requiring a deep knowledge of the underlying technologies, which often constitutes a barrier to their uptake by scientists in general. In this context, EOSC-SYNERGY, a collaborative project involving more than 20 institutions from eight European countries pooling their knowledge and experience to enhance EOSC's capabilities and capacities, aims to bring EOSC closer to the scientific communities. This article provides a summary analysis of the adaptations made in the ten thematic services of EOSC-SYNERGY to embrace this paradigm. These services are grouped into four categories: Earth Observation, Environment, Biomedicine, and Astrophysics. The analysis will lead to the identification of commonalities, best practices and common requirements, regardless of the thematic area of the service. Experience gained from the thematic services could be transferred to new services for the adoption of the EOSC ecosystem framework.

    Choreo: network-aware task placement for cloud applications

    Cloud computing infrastructures are increasingly being used by network-intensive applications that transfer significant amounts of data between the nodes on which they run. This paper shows that tenants can do a better job placing applications by understanding the underlying cloud network as well as the demands of the applications. To do so, tenants must be able to quickly and accurately measure the cloud network and profile their applications, and then use a network-aware placement method to place applications. This paper describes Choreo, a system that solves these problems. Our experiments measure Amazon's EC2 and Rackspace networks and use three weeks of network data from applications running on the HP Cloud network. We find that Choreo reduces application completion time by an average of 8%-14% (max improvement: 61%) when applications are placed all at once, and 22%-43% (max improvement: 79%) when they arrive in real-time, compared to alternative placement schemes. Supported by National Science Foundation (U.S.) Grants 0645960, 1065219, and 1040072.
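
    Choreo's exact placement algorithm is not given in the abstract. As an illustration of the general idea of network-aware placement it describes, the sketch below takes measured pairwise bandwidth between nodes and the application's task-to-task traffic profile, then greedily places the chattiest task pairs on the best-connected nodes. The data structures, the capacity model and the greedy rule are assumptions for the example, not Choreo's implementation.

    # Illustrative sketch of network-aware task placement: measure the network,
    # profile the application's traffic, then put chatty tasks on well-connected
    # nodes. Not Choreo's actual algorithm.
    def place_tasks(traffic, bandwidth, nodes, slots_per_node=1):
        """traffic[(a, b)]   : data volume exchanged between tasks a and b
        bandwidth[(x, y)]    : measured throughput between nodes x and y (symmetric)
        Returns {task: node}."""
        placement = {}
        load = {n: 0 for n in nodes}

        def bw(x, y):
            # Co-located tasks communicate over loopback / shared memory.
            return float("inf") if x == y else bandwidth.get((x, y), bandwidth.get((y, x), 0.0))

        def assign(task, node):
            if task not in placement:
                placement[task] = node
                load[node] += 1

        # Handle the heaviest-communicating task pairs first.
        for (a, b), _vol in sorted(traffic.items(), key=lambda kv: kv[1], reverse=True):
            best = None
            xs = [placement[a]] if a in placement else [n for n in nodes if load[n] < slots_per_node]
            for x in xs:
                taken = 1 if a not in placement else 0  # slot that x loses if a lands there
                ys = ([placement[b]] if b in placement else
                      [n for n in nodes if load[n] + (taken if n == x else 0) < slots_per_node])
                for y in ys:
                    if best is None or bw(x, y) > bw(*best):
                        best = (x, y)
            if best is None:
                continue  # out of capacity; a real placer would rebalance or queue
            assign(a, best[0])
            assign(b, best[1])
        return placement

    if __name__ == "__main__":
        nodes = ["vm1", "vm2", "vm3"]
        bandwidth = {("vm1", "vm2"): 900, ("vm1", "vm3"): 300, ("vm2", "vm3"): 250}  # Mbit/s
        traffic = {("ingest", "parse"): 50_000, ("parse", "index"): 5_000}           # MB exchanged
        print(place_tasks(traffic, bandwidth, nodes))  # {'ingest': 'vm1', 'parse': 'vm2', 'index': 'vm3'}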

    A survey of the European Open Science Cloud services for expanding the capacity and capabilities of multidisciplinary scientific applications

    Open Science is a paradigm in which scientific data, procedures, tools and results are shared transparently and reused by society. The European Open Science Cloud (EOSC) initiative is an effort in Europe to provide an open, trusted, virtual and federated computing environment to execute scientific applications and store, share and reuse research data across borders and scientific disciplines. Additionally, scientific services are becoming increasingly data-intensive, not only in terms of computationally intensive tasks but also in terms of storage resources. To meet those resource demands, computing paradigms such as High-Performance Computing (HPC) and Cloud Computing are applied to e-science applications. However, adapting applications and services to these paradigms is a challenging task, commonly requiring a deep knowledge of the underlying technologies, which often constitutes a general barrier to their uptake by scientists. In this context, EOSC-Synergy, a collaborative project involving more than 20 institutions from eight European countries pooling their knowledge and experience to enhance EOSC’s capabilities and capacities, aims to bring EOSC closer to the scientific communities. This article provides a summary analysis of the adaptations made in the ten thematic services of EOSC-Synergy to embrace this paradigm. These services are grouped into four categories: Earth Observation, Environment, Biomedicine, and Astrophysics. The analysis will lead to the identification of commonalities, best practices and common requirements, regardless of the thematic area of the service. Experience gained from the thematic services can be transferred to new services for the adoption of the EOSC ecosystem framework. The article makes several recommendations for the integration of thematic services in the EOSC ecosystem regarding Authentication and Authorization (mainly federated regional or thematic solutions based on EduGAIN), FAIR data and metadata preservation solutions (both cataloguing and data preservation, such as EUDAT’s B2SHARE), cloud platform-agnostic resource management services (such as Infrastructure Manager) and workload management solutions. This work was supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 857647, EOSC-Synergy, European Open Science Cloud - Expanding Capacities by building Capabilities. Moreover, this work is partially funded by grant No 2015/24461-2, São Paulo Research Foundation (FAPESP). Francisco Brasileiro is a CNPq/Brazil researcher (grant 308027/2020-5). Article signed by 20 authors: Amanda Calatrava, Hernán Asorey, Jan Astalos, Alberto Azevedo, Francesco Benincasa, Ignacio Blanquer, Martin Bobak, Francisco Brasileiro, Laia Codó, Laura del Cano, Borja Esteban, Meritxell Ferret, Josef Handl, Tobias Kerzenmacher, Valentin Kozlov, Aleš Křenek, Ricardo Martins, Manuel Pavesio, Antonio Juan Rubio-Montero, Juan Sánchez-Ferrero.

    Easy Edges: Automating Connectivity and Scheduling at the Edge

    Today, cloud computing is a well-established paradigm that enjoys a moderate level of popularity among Science Gateways and Virtual Research Environments. Historically, Science Gateways have empowered researchers with traditional on-premise High-Performance Computing (HPC), but recent years have seen many turning to the cloud to fill that space. One often-cited cloud complaint is the distance that data must travel. Sensors and input devices generate data locally in the lab, but even for a basic calculation, all those bits must make the costly journey to a data centre hundreds of miles away. The proliferation of the Internet of Things, though, is quickly changing this. As microchips decrease in size, the compute power of those sensors and input devices is increasing, along with every other compute or networking device in the lab. These devices, at the edge of the network, can now be exploited for less intensive computations, such as pre-processing or analytics. This creates an ideal opportunity for Science Gateways. As well as connecting research communities to the cloud, they can bring edges into the mix and get the advantages of both. In that, though, lies a considerable challenge. Edge devices come in all shapes and sizes – or, from a computing point of view, all operating systems and system architectures. Many such devices are connected by Wi-Fi, so connection issues are common and unpredictable. Once deployed, communications between edge and cloud must be secured. DevOps engineers are solving these issues with OCI-compliant containers (Docker), configuration management (Ansible®), container orchestrators (Kubernetes), and edge-enabling add-ons (k0s, k3s, KubeEdge or microk8s). But configuring a container orchestration cluster in the cloud takes expert knowledge and effort, and navigating the landscape of edge add-ons takes even more. These are lengthy, manual processes, not suited to a Science Gateway administrator. This abstract proposes a generic solution to the problem of edge connectivity by abstracting away this complexity and automating the process. Users bring their containers and edge devices, and Kubernetes, Ansible and KubeEdge or k3s do the rest, with some help from MiCADO, a dynamic cloud orchestrator. Figure 1 shows this process. An Ansible Playbook creates a Kubernetes cluster in the cloud, deploying an appropriate Kubernetes edge add-on in the process. Then, a user provides a description of their application (container images and configuration) and edge nodes (connection details). MiCADO supports these descriptions in TOSCA, but higher-level tools can be used to generate TOSCA from descriptive metadata. Actioning this description triggers Kubernetes and a second Ansible Playbook, which create the connection to the edge and schedule containers appropriately across edge and cloud. In MiCADO, the TOSCASubmitter is responsible for these triggers. A KubernetesAdaptor transpiles the relevant container descriptions to Kubernetes Manifests and executes them. An AnsibleAdaptor, newly created for connecting edges on the fly, automates configuration and execution of the Playbook. This approach is being used for a Science Gateway in the PITHIA-NRF Project, and for an Industry Gateway in the DIGITbrain Project. It supports a simpler route to edge computing that can be exploited by Science Gateways for efficient computing along the cloud-to-edge continuum.
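
    The abstract names TOSCA as the description format but does not reproduce it. Purely as a hypothetical illustration of the workflow it outlines (the user supplies container images plus edge connection details, and an orchestrator turns them into Kubernetes manifests and an edge-joining Ansible run), the sketch below assembles such a description as a Python dictionary and posts it to a made-up submitter endpoint. The field names, structure and URL are invented for this example; they are not MiCADO's TOSCA schema or REST API.

    # Hypothetical sketch only: the description format and submitter URL below are
    # invented for illustration and are NOT MiCADO's real TOSCA schema or API.
    import json
    import urllib.request

    # What a user might supply: container images plus edge-node connection details.
    app_description = {
        "containers": [
            {"name": "preprocessor", "image": "registry.example.org/lab/preprocess:1.2",
             "placement": "edge"},    # run close to the sensors
            {"name": "analytics", "image": "registry.example.org/lab/analytics:3.0",
             "placement": "cloud"},   # heavier computation stays in the cloud
        ],
        "edge_nodes": [
            {"name": "lab-rpi-01", "host": "10.0.0.15", "ssh_user": "pi", "arch": "arm64"},
        ],
    }

    def submit(description, submitter_url="http://orchestrator.example.org/submit"):
        """Post the description to a (hypothetical) submitter, which would generate the
        Kubernetes manifests and run the Ansible playbook that joins the edge nodes."""
        req = urllib.request.Request(
            submitter_url,
            data=json.dumps(description).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.status

    if __name__ == "__main__":
        print(json.dumps(app_description, indent=2))  # the payload submit() would send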

    Towards an Environment for doing Data Science that runs in Browsers

    This article proposes a path for doing Data Science using browsers as computing and data nodes. This novel idea is motivated by the cross-fertilized fields of desktop grid computing, data management in grids and clouds, Web technologies such as NoSQL tools, and models of interaction and programming models in grids, clouds and Web technologies. We propose a methodology for the modeling, analysis, implementation and simulation of a prototype able to run a MapReduce job in browsers. This work allows us to better understand how to envision the big picture of Data Science in a setting where JavaScript is the language for programming the middleware and the interactions between components, and browsers act as the operating system. We explain what types of applications may be impacted by this novel approach and, from a general point of view, how formal modeling of the interactions serves as a general guideline for the implementation. Formal modeling in our methodology is a necessary condition, but it is not sufficient. We also make round trips between the model and the JavaScript code or the tools used, either to enrich the interaction model, which is the key point, or to put more details into the implementation. To the best of our knowledge, this is the first time that Data Science is carried out in the context of browsers that exchange code and data to solve computational and data-intensive problems. The terms computational and data-intensive should be understood in the context of the applications that we consider suitable for our system.
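
    In the paper's prototype, browsers exchange JavaScript code and data to run a MapReduce job; the middleware itself is beyond the scope of this summary. As a minimal, language-neutral sketch of the structure of such a job (word count here), the following Python fragment shows the map, shuffle and reduce steps that the browsers would collectively execute. It illustrates the programming model only, not the paper's JavaScript middleware or its message exchange.

    # Minimal model of a MapReduce job that a browser-based system could distribute:
    # word count. Illustrative only; the paper's prototype ships JavaScript to browsers.
    from collections import defaultdict

    def map_phase(document):
        """Each worker (a browser in the paper's setting) emits (word, 1) pairs."""
        return [(word.lower(), 1) for word in document.split()]

    def shuffle(emitted):
        """Group intermediate pairs by key before reduction."""
        groups = defaultdict(list)
        for key, value in emitted:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        """Sum the counts for each word."""
        return {word: sum(counts) for word, counts in groups.items()}

    if __name__ == "__main__":
        documents = ["the cloud meets the browser", "the browser computes"]
        emitted = [pair for doc in documents for pair in map_phase(doc)]
        print(reduce_phase(shuffle(emitted)))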

    High Resolution Nature Runs and the Big Data Challenge

    NASA's Global Modeling and Assimilation Office at Goddard Space Flight Center is undertaking a series of very computationally intensive Nature Runs and a downscaled reanalysis. The Nature Runs use GEOS-5 as an Atmospheric General Circulation Model (AGCM), while the reanalysis uses GEOS-5 in Data Assimilation mode. This paper presents computational challenges from three runs, two of which are AGCM runs and one of which is a downscaled reanalysis using the full DAS. The Nature Runs will be completed at two surface grid resolutions, 7 and 3 kilometers, with 72 vertical levels. The 7 km run spanned two years (2005-2006) and produced 4 PB of data, while the 3 km run will span one year and generate 4 PB of data. The downscaled reanalysis (MERRA-II, Modern-Era Reanalysis for Research and Applications) will cover 15 years and generate 1 PB of data. In our efforts to address the big data challenges of climate science, we are moving toward a notion of Climate Analytics-as-a-Service (CAaaS), a specialization of the concept of business process-as-a-service that is an evolving extension of IaaS, PaaS, and SaaS enabled by cloud computing. In this presentation, we will describe two projects that demonstrate this shift. MERRA Analytic Services (MERRA/AS) is an example of cloud-enabled CAaaS. MERRA/AS enables MapReduce analytics over the MERRA reanalysis data collection by bringing together high-performance computing, scalable data management, and a domain-specific climate data services API. NASA's High-Performance Science Cloud (HPSC) is an example of the type of compute-storage fabric required to support CAaaS. The HPSC comprises a high-speed InfiniBand network, high-performance file systems and object storage, and virtual system environments tailored for data-intensive science applications. These technologies provide a new tier in the data and analytic services stack that helps connect earthbound, enterprise-level data and computational resources to new customers and new mobility-driven applications and modes of work. In our experience, CAaaS lowers the barriers and risk to organizational change, fosters innovation and experimentation, and provides the agility required to meet our customers' increasing and changing needs.
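
    The abstract does not detail the MERRA/AS analytics API. As a rough illustration of the MapReduce-style, near-data analytics it describes, for example computing a multi-year time mean of one variable over a gridded reanalysis collection, the sketch below maps over per-file partial sums and reduces them to a single mean field. The array shapes, variable and file layout are assumptions for the example, not the MERRA/AS interface.

    # Illustrative sketch of MapReduce-style climate analytics run close to the data:
    # a multi-file time mean of one variable. Shapes, layout and values are invented
    # for the example; this is not the MERRA/AS API.
    import numpy as np

    def map_partial_sum(grid_stack):
        """Map step: one worker handles one file's stack of (time, lat, lon) grids
        and emits a partial sum plus the number of time steps it covered."""
        return grid_stack.sum(axis=0), grid_stack.shape[0]

    def reduce_mean(partials):
        """Reduce step: combine partial sums into the overall time-mean field."""
        total = sum(p[0] for p in partials)
        count = sum(p[1] for p in partials)
        return total / count

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Stand-ins for, say, monthly near-surface temperature grids from two files.
        files = [rng.normal(288, 5, size=(12, 91, 144)) for _ in range(2)]
        climatology = reduce_mean([map_partial_sum(f) for f in files])
        print(climatology.shape)  # (91, 144): the time-mean field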