178 research outputs found

    Scientific Computing Meets Big Data Technology: An Astronomy Use Case

    Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to parallelize these analyses. In this work, we investigate an alternate approach that uses Apache Spark -- a modern big data platform -- to parallelize many-task applications. We present Kira, a flexible and distributed astronomy image processing toolkit using Apache Spark. We then use the Kira toolkit to implement a Source Extractor application for astronomy images, called Kira SE. With Kira SE as the use case, we study the programming flexibility, dataflow richness, scheduling capacity and performance of Apache Spark running on the EC2 cloud. By exploiting data locality, Kira SE achieves a 2.5x speedup over an equivalent C program when analyzing a 1TB dataset using 512 cores on the Amazon EC2 cloud. Furthermore, we show that by leveraging software originally designed for big data infrastructure, Kira SE achieves competitive performance to the C implementation running on the NERSC Edison supercomputer. Our experience with Kira indicates that emerging Big Data platforms such as Apache Spark are a performant alternative for many-task scientific applications.
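    The many-task pattern described above can be sketched in a few lines of PySpark. This is not Kira's actual API; `extract_sources` and the input path are hypothetical placeholders for a per-image source-extraction step, but the structure (one independent task per image, scheduled close to the data) is the pattern the abstract refers to.

```python
# Sketch of the many-task pattern: each input image is processed independently
# by a single-process routine. NOT Kira's actual API; `extract_sources` is a
# hypothetical stand-in for a per-image source-extraction step.
from pyspark.sql import SparkSession

def extract_sources(path_and_bytes):
    path, raw = path_and_bytes
    # ... decode the FITS image and run source extraction here ...
    return (path, len(raw))  # placeholder result

spark = SparkSession.builder.appName("many-task-sketch").getOrCreate()
sc = spark.sparkContext

# binaryFiles lets Spark schedule tasks close to the data (e.g. HDFS blocks),
# which is the data-locality effect the abstract credits for the speedup.
images = sc.binaryFiles("hdfs:///data/images/*.fits")  # hypothetical path
results = images.map(extract_sources).collect()
spark.stop()
```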

    Applying big data paradigms to a large scale scientific workflow: lessons learned and future directions

    The increasing amounts of data related to the execution of scientific workflows have raised awareness of their shift towards parallel data-intensive problems. In this paper, we deliver our experience combining the traditional high-performance computing and grid-based approaches with Big Data analytics paradigms, in the context of scientific ensemble workflows. Our goal was to assess and discuss the suitability of such data-oriented mechanisms for production-ready workflows, especially in terms of scalability. We focused on two key elements in the Big Data ecosystem: the data-centric programming model, and the underlying infrastructure that integrates storage and computation in each node. We experimented with a representative MPI-based iterative workflow from the hydrology domain, EnKF-HGS, which we re-implemented using the Spark data analysis framework. We conducted experiments on a local cluster, a private cloud running OpenNebula, and the Amazon Elastic Compute Cloud (Amazon EC2). The results we obtained were analysed to synthesize the lessons we learned from this experience, while discussing promising directions for further research. This work was supported by the Spanish Ministry of Economy and Competitiveness grant TIN-2013-41350-P, the IC1305 COST Action “Network for Sustainable Ultrascale Computing Platforms” (NESUS), and the FPU Training Program for Academic and Teaching Staff Grant FPU15/00422 by the Spanish Ministry of Education.
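    As a rough illustration of the data-centric programming model discussed above, the sketch below expresses an iterative ensemble loop in Spark. It is not the actual EnKF-HGS port; `forecast` and `analysis` are hypothetical stand-ins for the model step and the ensemble update.

```python
# Schematic sketch (not the EnKF-HGS implementation) of an iterative ensemble
# workflow in Spark: forecast steps run in parallel across ensemble members,
# the driver performs the analysis/update step, and the loop repeats.
from pyspark.sql import SparkSession

def forecast(member_state):
    # ... advance one ensemble member by one model step ...
    return member_state + 1.0  # placeholder

def analysis(states):
    # ... combine ensemble states (e.g. an EnKF-style update) ...
    return sum(states) / len(states)  # placeholder

spark = SparkSession.builder.appName("ensemble-sketch").getOrCreate()
sc = spark.sparkContext

ensemble = [float(i) for i in range(64)]          # initial ensemble states
for step in range(10):                            # outer iterative loop
    states = sc.parallelize(ensemble).map(forecast).collect()
    mean = analysis(states)                       # driver-side update
    ensemble = [s + 0.5 * (mean - s) for s in states]
spark.stop()
```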

    An Integrated Big and Fast Data Analytics Platform for Smart Urban Transportation Management

    Smart urban transportation management can be considered a multifaceted big data challenge. It strongly relies on information collected from multiple, widespread, and heterogeneous data sources, as well as on the ability to extract actionable insights from them. Besides data, full-stack (from platform to services and applications) Information and Communications Technology (ICT) solutions need to be specifically adopted to address smart city challenges. Smart urban transportation management is one of the key use cases addressed in the context of the EUBra-BIGSEA (Europe-Brazil Collaboration of Big Data Scientific Research through Cloud-Centric Applications) project. This paper specifically focuses on the City Administration Dashboard, a public transport analytics application developed on top of the EUBra-BIGSEA platform and used by the municipality stakeholders of Curitiba, Brazil, to tackle urban traffic data analysis and planning challenges. The solution proposed in this paper joins together a scalable big and fast data analytics platform, a flexible and dynamic cloud infrastructure, data quality and entity matching algorithms, as well as security and privacy techniques. By exploiting an interoperable programming framework based on a Python Application Programming Interface (API), it allows easy, rapid and transparent development of smart city applications. This work was supported by the European Commission through the Cooperation Programme under EUBra-BIGSEA Horizon 2020 Grant 690116 [this project results from the 3rd BR-EU Coordinated Call in Information and Communication Technologies (ICT), announced by the Ministério de Ciência, Tecnologia e Inovação (MCTI)]. Fiore, S.; Elia, D.; Pires, C.E.; Mestre, D.G.; Cappiello, C.; Vitali, M.; Andrade, N. ... (2019). An Integrated Big and Fast Data Analytics Platform for Smart Urban Transportation Management. IEEE Access, 7:117652-117677. https://doi.org/10.1109/ACCESS.2019.2936941
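    To make the kind of analytics concrete, the sketch below shows a generic PySpark DataFrame aggregation over a hypothetical trip table, e.g. average trip duration per bus line. It uses plain Spark SQL rather than the EUBra-BIGSEA Python framework itself; the column names and input path are assumptions.

```python
# Illustrative PySpark DataFrame sketch (not the EUBra-BIGSEA API) of a simple
# public-transport analytics query: average trip duration and trip count per
# bus line. The CSV path and schema are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transport-analytics-sketch").getOrCreate()

trips = spark.read.csv("hdfs:///curitiba/trips.csv", header=True, inferSchema=True)

per_line = (
    trips.groupBy("line_id")
         .agg(F.avg("duration_min").alias("avg_duration_min"),
              F.count("*").alias("num_trips"))
         .orderBy(F.desc("num_trips"))
)
per_line.show(10)   # top ten lines by number of trips
spark.stop()
```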

    Web technologies for environmental big data

    Recent evolutions in computing science and web technology provide the environmental community with continuously expanding resources for data collection and analysis that pose unprecedented challenges to the design of analysis methods, workflows, and interaction with data sets. In the light of the recent UK Research Council funded Environmental Virtual Observatory pilot project, this paper gives an overview of currently available implementations related to web-based technologies for processing large and heterogeneous datasets and discusses their relevance within the context of environmental data processing, simulation and prediction. We found that the processing of the simple datasets used in the pilot proved to be relatively straightforward using a combination of R, RPy2, PyWPS and PostgreSQL. However, the use of NoSQL databases and more versatile frameworks such as OGC standard-based implementations may provide a wider and more flexible set of features that particularly facilitate working with larger volumes and more heterogeneous data sources.
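    A minimal sketch of the R-from-Python combination mentioned above (R driven through rpy2) is shown below. In the pilot, such a routine would typically sit behind a PyWPS process and read its data from PostgreSQL; those parts are omitted here, and the data values are made up.

```python
# Minimal rpy2 sketch: push Python data into R, fit a linear model, and print
# the R summary. Hypothetical data; the PyWPS/PostgreSQL plumbing mentioned in
# the abstract is intentionally left out.
import rpy2.robjects as robjects
from rpy2.robjects import FloatVector

r = robjects.r
x = FloatVector([1.0, 2.0, 3.0, 4.0, 5.0])
y = FloatVector([2.1, 3.9, 6.2, 8.1, 9.8])

robjects.globalenv["x"] = x     # make the vectors visible to R
robjects.globalenv["y"] = y
fit = r("lm(y ~ x)")            # run an R linear regression
print(r("summary")(fit))        # print the R model summary
```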

    Data Placement And Task Mapping Optimization For Big Data Workflows In The Cloud

    Data-centric workflows naturally process and analyze huge volumes of datasets. In this new era of Big Data there is a growing need to enable data-centric workflows to perform computations at a scale far exceeding a single workstation's capabilities. Therefore, this type of application can benefit from distributed high performance computing (HPC) infrastructures like cluster, grid or cloud computing. Although data-centric workflows have been applied extensively to structure complex scientific data analysis processes, they fail to address the big data challenges as well as leverage the capability of dynamic resource provisioning in the Cloud. The concept of “big data workflows” is proposed by our research group as the next generation of data-centric workflow technologies to address the limitations of existing workflow technologies in addressing big data challenges. Executing big data workflows in the Cloud is a challenging problem, as workflow tasks and data are required to be partitioned, distributed and assigned to the cloud execution sites (multiple virtual machines). When running such big data workflows in a cloud distributed across several physical locations, the workflow execution time and the cloud resource utilization efficiency depend highly on the initial placement and distribution of the workflow tasks and datasets across the multiple virtual machines in the Cloud. Several workflow management systems have been developed to facilitate scientists' use of workflows; however, the data and workflow task placement issue has not yet been sufficiently addressed. In this dissertation, I propose BDAP (Big Data Placement strategy) for data placement and TPS (Task Placement Strategy) for task placement, which improve workflow performance by minimizing data movement across multiple virtual machines in the Cloud during workflow execution. In addition, I propose CATS (Cultural Algorithm Task Scheduling) for workflow scheduling, which improves workflow performance by minimizing workflow execution cost. In this dissertation, I 1) formalize the data and task placement problems in workflows, 2) propose a data placement algorithm that considers both the initial input datasets and the intermediate datasets obtained during a workflow run, 3) propose a task placement algorithm that considers the placement of workflow tasks before the workflow runs, 4) propose a workflow scheduling strategy to minimize the workflow execution cost given a user-provided deadline, and 5) perform extensive experiments in a distributed environment to validate that our proposed strategies provide an effective data and task placement solution to distribute and place big datasets and tasks onto the appropriate virtual machines in the Cloud within reasonable time.
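    To give a flavour of the placement problem being formalized, the sketch below implements a simple greedy co-location heuristic: each dataset goes to the virtual machine that already holds the most datasets sharing a task with it, subject to a capacity limit. This is an illustrative toy, not the BDAP, TPS or CATS algorithms proposed in the dissertation; the task/dataset inputs are hypothetical.

```python
# Toy greedy data-placement heuristic (NOT BDAP/TPS/CATS): co-locate datasets
# that are read by the same tasks to reduce cross-VM data movement.
from collections import defaultdict

tasks = {                      # task -> datasets it reads (hypothetical input)
    "t1": {"d1", "d2"},
    "t2": {"d2", "d3"},
    "t3": {"d3", "d4"},
}
vm_capacity = 2                # max datasets per VM
num_vms = 3
placement = {}                 # dataset -> vm id
vm_load = defaultdict(set)     # vm id -> datasets placed on it

def affinity(dataset, vm_datasets):
    """Count tasks that read `dataset` and already have data on this VM."""
    return sum(1 for reads in tasks.values()
               if dataset in reads and vm_datasets & reads)

for d in sorted({d for reads in tasks.values() for d in reads}):
    candidates = [f"vm{i}" for i in range(num_vms)
                  if len(vm_load[f"vm{i}"]) < vm_capacity]
    best = max(candidates, key=lambda v: affinity(d, vm_load[v]))
    placement[d] = best
    vm_load[best].add(d)

print(placement)   # e.g. {'d1': 'vm0', 'd2': 'vm0', 'd3': 'vm1', 'd4': 'vm1'}
```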

    Orchestrating Complex Application Architectures in Heterogeneous Clouds

    Private cloud infrastructures are now widely deployed and adopted across technology industries and research institutions. Although cloud computing has emerged as a reality, it is now known that a single cloud provider cannot fully satisfy complex user requirements. This has resulted in a growing interest in developing hybrid cloud solutions that bind together distinct and heterogeneous cloud infrastructures. In this paper we describe the orchestration approach for heterogeneous clouds that has been implemented and used within the INDIGO-DataCloud project. This orchestration model uses existing open-source software like OpenStack and leverages the OASIS Topology and Orchestration Specification for Cloud Applications (TOSCA) open standard as the modeling language. Our approach uses virtual machines and Docker containers in a homogeneous and transparent way, providing consistent application deployment for the users. This approach is illustrated by means of two different use cases in different scientific communities, implemented using the INDIGO-DataCloud solutions. The authors want to acknowledge the support of the INDIGO-DataCloud project (grant number 653549), funded by the European Commission's Horizon 2020 Framework Program. Caballer Fernández, M.; Zala, S.; López, Á.; Moltó, G.; Orviz, P.; Velten, M. (2018). Orchestrating Complex Application Architectures in Heterogeneous Clouds. Journal of Grid Computing, 16(1):3-18. https://doi.org/10.1007/s10723-017-9418-y
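    As a small companion to the TOSCA-based approach described above, the sketch below loads and inspects a TOSCA template with the OpenStack tosca-parser library, one concrete way TOSCA descriptions can be consumed programmatically. The template path is hypothetical, and this is not the INDIGO-DataCloud PaaS Orchestrator itself.

```python
# Sketch: parse a TOSCA template and list its node templates using the
# OpenStack tosca-parser library. The file path is a hypothetical placeholder.
from toscaparser.tosca_template import ToscaTemplate

template = ToscaTemplate("examples/web_mysql_tosca.yaml")  # hypothetical path

for node in template.nodetemplates:
    # Each node template describes one component to deploy (VM, container, ...)
    print(node.name, node.type)
```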

    Introducing distributed dynamic data-intensive (D3) science: Understanding applications and infrastructure

    A common feature across many science and engineering applications is the amount and diversity of data and computation that must be integrated to yield insights. Data sets are growing larger and becoming distributed, and their location, availability and properties are often time-dependent. Collectively, these characteristics give rise to dynamic distributed data-intensive applications. While "static" data applications have received significant attention, the characteristics, requirements, and software systems for the analysis of large volumes of dynamic, distributed data have received relatively less attention. This paper surveys several representative dynamic distributed data-intensive application scenarios, provides a common conceptual framework to understand them, and examines the infrastructure used in support of these applications.