44 research outputs found

    COMP Superscalar, an interoperable programming framework

    COMPSs is a programming framework that aims to facilitate the parallelization of existing applications written in Java, C/C++ and Python. For that purpose, it offers a simple programming model based on sequential development in which the user is mainly responsible for identifying the functions to be executed as asynchronous parallel tasks and marking them with Java annotations or standard Python decorators. A runtime system is in charge of exploiting the inherent concurrency of the code, automatically detecting and enforcing the data dependencies between tasks and dispatching these tasks to the available resources, which can be nodes in a cluster, clouds or grids. In cloud environments, COMPSs provides scalability and elasticity features that allow the dynamic provision of resources. This work has been supported by the following institutions: the Spanish Government with grant SEV-2011-00067 of the Severo Ochoa Program and contract Computacion de Altas Prestaciones VI (TIN2012-34557); by the SGR programme (2014-SGR-1051) of the Catalan Government; by The Human Brain Project, funded by the European Commission under contract 604102; by the ASCETiC project, funded by the European Commission under contract 610874; by the EUBrazilCloudConnect project, funded by the European Commission under contract 614048; and by the Intel-BSC Exascale Lab collaboration. Peer Reviewed. Postprint (published version).
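
    The model described above is easiest to see in its Python flavour, PyCOMPSs. The following is a minimal, hedged sketch assuming the module paths published in the PyCOMPSs documentation; the program is written as ordinary sequential Python, and the runtime parallelizes the decorated functions by tracking their data dependencies.

        # Minimal PyCOMPSs-style sketch: the user writes sequential-looking
        # code and marks functions as tasks; the runtime builds the task
        # dependency graph and executes independent tasks in parallel.
        from pycompss.api.task import task
        from pycompss.api.api import compss_wait_on

        @task(returns=int)
        def square(x):
            # Runs asynchronously on whatever resource the runtime selects.
            return x * x

        @task(returns=int)
        def add(a, b):
            # Depends on the outputs of two square() tasks; the runtime
            # detects and enforces this dependency automatically.
            return a + b

        if __name__ == "__main__":
            partial_sums = [add(square(i), square(i + 1)) for i in range(10)]
            # Synchronization point: block until all tasks finish.
            results = compss_wait_on(partial_sums)
            print(sum(results))

    Such a script would typically be launched with the runcompss command, which starts the runtime that schedules the tasks on the available resources.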

    Programming models to support data science workflows

    Data Science workflows have become essential for progress in many scientific areas such as the life, health, and earth sciences. In contrast to traditional HPC workflows, they are more heterogeneous, combining binary executions, MPI simulations, multi-threaded applications, custom analyses (possibly written in Java, Python, C/C++ or R), and real-time processing. Furthermore, while in the past field experts were capable of programming and running small simulations, nowadays simulations requiring hundreds or thousands of cores are widely used, and efficiently programming them is a challenge even for computer scientists. Thus, programming languages and models make a considerable effort to ease programmability while maintaining acceptable performance. This thesis contributes to the adaptation of High-Performance frameworks to the needs and challenges of Data Science workflows by extending COMPSs, a mature, general-purpose, task-based, distributed programming model. First, we enhance our prototype to orchestrate different frameworks inside a single programming model, so that non-expert users can build complex workflows in which some steps require highly optimised, state-of-the-art frameworks. This extension includes the @binary, @OmpSs, @MPI, @COMPSs, and @MultiNode annotations for both Java and Python workflows. Second, we integrate container technologies to enable developers to easily port, distribute, and scale their applications to distributed computing platforms. This combination provides a straightforward methodology to parallelise applications from sequential codes, along with efficient image management and application deployment that ease the packaging and distribution of applications. We distinguish between static, HPC, and dynamic container management and provide representative use cases for each scenario using Docker, Singularity, and Mesos. Third, we design, implement, and integrate AutoParallel, a Python module that automatically finds an appropriate task-based parallelisation of affine loop nests and executes them in parallel on a distributed computing infrastructure. It is based on sequential programming and requires a single annotation (the @parallel Python decorator), so that anyone with intermediate-level programming skills can scale an application up to hundreds of cores. Finally, we propose a way to extend task-based management systems to support continuous input and output data, enabling the combination of task-based workflows and dataflows (Hybrid Workflows) in a single programming model. Hence, developers can build complex Data Science workflows with different approaches depending on the requirements, without the effort of combining several frameworks at the same time. Also, to illustrate the capabilities of Hybrid Workflows, we have built a Distributed Stream Library that can be easily integrated with existing task-based frameworks to provide support for dataflows. The library provides a homogeneous, generic, and simple representation of object and file streams in both Java and Python, enabling complex workflows to handle any data type without dealing directly with the streaming back-end. Postprint (published version).
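
    As an illustration of the first contribution, the sketch below shows how two of the listed annotations might be used from Python to orchestrate an external binary and an MPI application as tasks. It is a hedged example: the decorator parameter names (binary, runner, processes) follow our reading of the PyCOMPSs documentation and may vary across versions, and the external binaries (gzip, my_simulation) and file paths are hypothetical, not the thesis's exact code.

        from pycompss.api.task import task
        from pycompss.api.binary import binary
        from pycompss.api.mpi import mpi
        from pycompss.api.parameter import FILE_IN, FILE_OUT

        @binary(binary="gzip")          # each call runs the external binary as a task;
        @task(in_file=FILE_IN)          # function arguments map to command-line arguments
        def compress(flag, in_file):
            pass  # body is empty: the decorated binary does the work

        # 'processes' is used in recent PyCOMPSs releases (older ones used
        # computing_nodes); 'my_simulation' is a hypothetical MPI binary.
        @mpi(runner="mpirun", processes=4, binary="my_simulation")
        @task(config=FILE_IN, out=FILE_OUT)
        def simulate(config, out):
            pass

        if __name__ == "__main__":
            compress("-k", "data/input.txt")           # runs: gzip -k data/input.txt
            simulate("conf/sim.cfg", "out/result.dat")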

    Automatic, efficient and scalable provenance registration for FAIR HPC workflows

    Provenance registration is becoming more and more important as we increase the size and number of experiments performed using computers. In particular, when provenance is recorded in HPC environments, it must be efficient and scalable. In this paper, we propose a provenance registration method for scientific workflows that is efficient enough to run on supercomputers (and could thus also run in environments with more relaxed restrictions, such as distributed ones) and scalable enough to deal with the large workflows typical of HPC. We also target transparency for the user, shielding them from having to specify how provenance must be recorded. We implement our design using the COMPSs programming model as a Workflow Management System (WfMS) and use RO-Crate as a well-established specification to record and publish provenance. Experiments are provided, demonstrating the run-time efficiency and scalability of our solution. This work has been supported by the Spanish Government (PID2019-107255GB-C21), by Generalitat de Catalunya (contract 2017-SGR-01414) and by the EU's Horizon research and innovation programme under Grant agreement No 101058129 (DT-GEO). It has also been contributed to in the CECH project, co-funded at 50% by the European Regional Development Fund under the framework of the ERFD Operative Programme for Catalunya 2014-2020, with a grant of 1.527.637,88 €. LRN, JMF and SCG are partly supported by INB Grant (PT17/0009/0001 - ISCIII-SGEFI / ERDF), and their work received funding from the EU's Horizon 2020 research and innovation programme under grant agreements EOSC-Life No 824087 and EJP RD No 825575. Peer Reviewed. Postprint (author's final draft).
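
    The paper's contribution is to perform this registration automatically and scalably inside the COMPSs runtime; to make the target format concrete, the sketch below shows what manually assembling a workflow RO-Crate looks like with the ro-crate-py library. All names, paths and properties are illustrative assumptions, and the exact API (e.g. crate.write) should be checked against the installed library version.

        # Illustrative only: manual RO-Crate packaging with ro-crate-py
        # (pip install rocrate). The paper does the equivalent registration
        # transparently, without the user writing code like this.
        from rocrate.rocrate import ROCrate

        crate = ROCrate()
        crate.name = "COMPSs run of my_workflow"  # hypothetical run name
        # Attach the application source and one output as crate data entities.
        crate.add_file("my_workflow.py",
                       properties={"description": "workflow source"})
        crate.add_file("results/output.dat",      # hypothetical output path
                       properties={"description": "workflow result"})
        # Serialize ro-crate-metadata.json plus the files to a folder.
        crate.write("my_workflow_crate")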

    Supporting biodiversity studies with the EUBrazilOpenBio Hybrid Data Infrastructure

    [EN] EUBrazilOpenBio is a collaborative initiative addressing strategic barriers in biodiversity research by integrating open-access data and user-friendly tools widely available in Brazil and Europe. The project deploys the EU-Brazil Hybrid Data Infrastructure, which allows the sharing of hardware, software and data on demand. This infrastructure provides access to several integrated services and resources that seamlessly aggregate taxonomic, biodiversity and climate data, used by processing services implementing checklist cross-mapping and ecological niche modelling. A Virtual Research Environment was created to provide users with a single entry point to processing and data resources. This article describes the architecture, demonstration use cases, and some experimental results and validation. EUBrazilOpenBio - Open Data and Cloud Computing e-Infrastructure for Biodiversity (2011-2013) is a small or medium-scale focused research project (STREP) funded by the European Commission under the Cooperation Programme, Framework Programme Seven (FP7), Objective FP7-ICT-2011 EU-Brazil Research and Development cooperation, and by the National Council for Scientific and Technological Development of Brazil (CNPq) of the Brazilian Ministry of Science, Technology and Innovation (MCTI) under the corresponding matching Brazilian Call for proposals MCT/CNPq 066/2010. BSC authors also acknowledge the support of grant SEV-2011-00067 of the Severo Ochoa Program, awarded by the Spanish Government, the Spanish Ministry of Science and Innovation under contract TIN2012-34557, and the Generalitat de Catalunya (contract 2009-SGR-980). Amaral, R.; Badia, R. M.; Blanquer, I.; Braga-Neto, R.; Candela, L.; Castelli, D.; Flann, C. ... (2015). Supporting biodiversity studies with the EUBrazilOpenBio Hybrid Data Infrastructure. Concurrency and Computation: Practice and Experience, 27(2), 376-394. https://doi.org/10.1002/cpe.3238

    An elastic software architecture for extreme-scale big data analytics

    This chapter describes a software architecture for processing big-data analytics across the complete compute continuum, from the edge to the cloud. The new generation of smart systems requires processing a vast amount of diverse information from distributed data sources. The software architecture presented in this chapter addresses two main challenges. On the one hand, a new elasticity concept enables smart systems to satisfy the performance requirements of extreme-scale analytics workloads. By extending the elasticity concept (known on the cloud side) across the compute continuum in a fog computing environment, combined with the use of advanced heterogeneous hardware architectures on the edge side, the capabilities of extreme-scale analytics can significantly increase, integrating both responsive data-in-motion and latent data-at-rest analytics into a single solution. On the other hand, the software architecture also focuses on the fulfilment of the non-functional properties inherited from smart systems, such as real-time behaviour, energy efficiency, communication quality and security, which are of paramount importance for many application domains such as smart cities, smart mobility and smart manufacturing. The research leading to these results has received funding from the European Union's Horizon 2020 Programme under the ELASTIC Project (www.elastic-project.eu), grant agreement No 825473. Peer Reviewed. Postprint (published version).

    The DeepHealth Toolkit: A key European free and open-source software for deep learning and computer vision ready to exploit heterogeneous HPC and cloud architectures

    At the present time, we are immersed in the convergence of Big Data, High-Performance Computing and Artificial Intelligence. Technological progress in these three areas has accelerated in recent years, forcing different players, such as software companies and stakeholders, to move quickly. The European Union is dedicating substantial resources to maintaining its relevant position in this scenario, funding projects to implement large-scale pilot testbeds that combine the latest advances in Artificial Intelligence, High-Performance Computing, Cloud and Big Data technologies. The DeepHealth project is an example focused on the health sector. Its main outcome is the DeepHealth toolkit, a unified European framework that offers deep learning and computer vision capabilities, fully adapted to exploit underlying heterogeneous High-Performance Computing, Big Data and cloud architectures, and ready to be integrated into any software platform to facilitate the development and deployment of new applications for specific problems in any sector. This toolkit is intended to be one of the European contributions to the field of AI. This chapter introduces the toolkit with its main components and complementary tools, providing a clear view to facilitate and encourage its adoption and wide use by the European community of developers of AI-based solutions and data scientists working in the healthcare sector and beyond. This chapter describes work undertaken in the context of the DeepHealth project, "Deep-Learning and HPC to Boost Biomedical Applications for Health", which has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 825111. Peer Reviewed. Article signed by 19 authors: Marco Aldinucci, David Atienza, Federico Bolelli, Mónica Caballero, Iacopo Colonnelli, José Flich, Jon A. Gómez, David González, Costantino Grana, Marco Grangetto, Simone Leo, Pedro López, Dana Oniga, Roberto Paredes, Luca Pireddu, Eduardo Quiñones, Tatiana Silva, Enzo Tartaglione & Marina Zapater. Postprint (author's final draft).

    Task-based programming in COMPSs to converge from HPC to big data

    Task-based programming has proven to be a suitable model for high-performance computing (HPC) applications. Different implementations have been good demonstrators of this fact and have promoted the acceptance of task-based programming in the OpenMP standard. Furthermore, in recent years, Apache Spark has gained wide popularity in business and research environments as a programming model for addressing emerging big data problems. COMP Superscalar (COMPSs) is a task-based environment that tackles distributed computing (including clouds) and is a good alternative task-based programming model for big data applications. This article describes why we consider task-based programming models a good approach for big data applications, and includes a comparison of Spark and COMPSs in terms of architecture, programming model, and performance. It focuses on the differences between the two frameworks in structural terms, in their programming interfaces, and in their efficiency, by means of three widely known benchmarking kernels: Wordcount, Kmeans, and Terasort. These kernels enable the evaluation of the most important functionalities of both programming models and the analysis of different workflows and conditions. The main results of this comparison are that (1) COMPSs is able to extract the inherent parallelism from the user code with minimal coding effort, as opposed to Spark, which requires existing algorithms to be adapted and rewritten by explicitly using its predefined functions; (2) COMPSs improves on Spark in terms of performance; and (3) COMPSs has been shown to scale better than Spark in most cases. Finally, we discuss the advantages and disadvantages of both frameworks, highlighting the differences that make them unique and thereby helping to choose the right framework for each particular objective. This work is supported by the Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), and by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). Javier Conejero's postdoctoral contract is co-financed by the Ministry of Economy and Competitiveness under the Juan de la Cierva Formación postdoctoral fellowship number FJCI-2015-24651. This work is also supported by the Intel-BSC Exascale Lab. The Human Brain Project receives funding from the EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement No 604102. Peer Reviewed. Postprint (author's final draft).
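
    To make result (1) concrete, the sketch below shows a Wordcount written in the COMPSs style: plain Python functions annotated as tasks, with the runtime extracting the parallelism from the sequential-looking driver. It is a hedged illustration of the model, not the benchmark code used in the article; a Spark version would instead be expressed through predefined operations such as flatMap and reduceByKey.

        # Hedged Wordcount sketch in the COMPSs style: the algorithm stays an
        # ordinary Python program; parallelism comes from the @task annotations.
        from collections import Counter
        from pycompss.api.task import task
        from pycompss.api.api import compss_wait_on

        @task(returns=Counter)
        def count_block(text_block):
            # Each text block is counted as an independent task.
            return Counter(text_block.split())

        @task(returns=Counter)
        def merge(a, b):
            # Pairwise reduction; the runtime chains these via data dependencies.
            a.update(b)
            return a

        def wordcount(blocks):
            partials = [count_block(b) for b in blocks]
            while len(partials) > 1:  # reduction tree over partial counters
                merged = [merge(partials[i], partials[i + 1])
                          for i in range(0, len(partials) - 1, 2)]
                if len(partials) % 2:          # carry an unpaired counter over
                    merged.append(partials[-1])
                partials = merged
            return compss_wait_on(partials[0])  # synchronize on the final result

        if __name__ == "__main__":
            blocks = ["to be or not to be", "that is the question"]  # toy input
            print(wordcount(blocks))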