
    Scalable Declarative HEP Analysis Workflows for Containerised Compute Clouds

    We describe a novel approach to experimental High-Energy Physics (HEP) data analyses centred on a declarative rather than imperative paradigm for describing analysis computational tasks. The analysis process can be structured as a Directed Acyclic Graph (DAG), where each vertex represents a unit of computation with its inputs and outputs, and the edges describe the interconnection of the computational steps. We have developed REANA, a platform for reproducible data analyses, which supports several such DAG workflow specifications. The REANA platform parses the analysis workflow and dispatches its computational steps to various supported computing backends (Kubernetes, HTCondor, Slurm). The focus on declarative rather than imperative programming enables researchers to concentrate on the problem domain at hand without having to think about implementation details such as scalable job orchestration. The declarative programming approach is further exemplified by a multi-level job cascading paradigm implemented in the Yadage workflow specification language. We present two recent LHC particle physics analyses, ATLAS searches for dark matter and CMS jet energy correction pipelines, where the declarative approach was successfully applied. We argue that the declarative approach to data analyses, combined with recent advancements in container technology, facilitates the portability of computational data analyses to various compute backends, enhancing the reproducibility and knowledge preservation behind particle physics data analyses.
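    To make the declarative idea concrete, the following minimal Python sketch expresses an analysis as a DAG of steps and derives the execution order automatically. It is an illustration only, not REANA's actual specification format or dispatch code; the step names, commands, and backend labels are placeholders.

        # Minimal sketch: an analysis described declaratively as a DAG of steps.
        # The execution order is derived from the graph, so no orchestration
        # code is written by the researcher. Step names, commands, and the
        # "backend" labels are illustrative placeholders.
        from graphlib import TopologicalSorter
        import subprocess

        workflow = {
            "skim": {"command": "echo skim events",      "needs": [],       "backend": "kubernetes"},
            "fit":  {"command": "echo fit signal model", "needs": ["skim"], "backend": "htcondor"},
            "plot": {"command": "echo make final plots", "needs": ["fit"],  "backend": "slurm"},
        }

        def run_step(name: str, spec: dict) -> None:
            # A real platform would submit the command to the named compute
            # backend inside a container image; here it just runs locally.
            print(f"[{spec['backend']}] running step '{name}'")
            subprocess.run(spec["command"], shell=True, check=True)

        order = TopologicalSorter({name: spec["needs"] for name, spec in workflow.items()})
        for name in order.static_order():
            run_step(name, workflow[name])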

    NUScon: a community-driven platform for quantitative evaluation of nonuniform sampling in NMR

    Although the concepts of nonuniform sampling (NUS) and non-Fourier spectral reconstruction in multidimensional NMR began to emerge four decades ago (Bodenhausen and Ernst, 1981; Barna and Laue, 1987), it is only relatively recently that NUS has become more commonplace. Advantages of NUS include the ability to tailor experiments to reduce data collection time and to improve spectral quality, whether through detection of closely spaced peaks (i.e., “resolution”) or peaks of weak intensity (i.e., “sensitivity”). Wider adoption of these methods is the result of improvements in computational performance, a growing abundance and flexibility of software, support from NMR spectrometer vendors, and the increased data sampling demands imposed by higher magnetic fields. However, the identification of best practices remains a significant and unmet challenge. Unlike the discrete Fourier transform, non-Fourier methods used to reconstruct spectra from NUS data are nonlinear, depend on the complexity and nature of the signals, and lack a quantitative or formal theory describing their performance. Seemingly subtle algorithmic differences may lead to significant variability in spectral quality and artifacts. A community-based critical assessment of NUS challenge problems, the “Nonuniform Sampling Contest” (NUScon), has been initiated with the objective of determining best practices for processing and analyzing NUS experiments. We address this objective by constructing challenges from NMR experiments into which we inject synthetic signals, and we process these challenges using workflows submitted by the community. In the initial rounds of NUScon our aim is to establish objective criteria for evaluating the quality of spectral reconstructions. We present here a software package for performing the quantitative analyses, and we present the results from the first two rounds of NUScon. We discuss the challenges that remain and present a roadmap for continued community-driven development, with the ultimate aim of providing best practices in this rapidly evolving field. The NUScon software package and all data from evaluating the challenge problems are hosted on the NMRbox platform.
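    As an illustration of what one objective evaluation criterion might look like, the short Python sketch below scores a reconstruction by comparing the intensities recovered at the known positions of injected synthetic peaks with their ground-truth values. It is a toy example with an assumed metric, not NUScon's actual scoring code.

        # Toy example: score a 2D reconstruction by the root-mean-square error
        # between injected synthetic peak intensities and the values recovered
        # at the known peak positions. The metric is an assumption for
        # illustration, not NUScon's evaluation procedure.
        import numpy as np

        def peak_recovery_error(spectrum, injected_positions, injected_intensities):
            recovered = np.array([spectrum[r, c] for r, c in injected_positions])
            return float(np.sqrt(np.mean((recovered - injected_intensities) ** 2)))

        rng = np.random.default_rng(0)
        spectrum = rng.normal(0.0, 0.05, size=(128, 128))   # noise-only "spectrum"
        positions = [(10, 20), (64, 64), (100, 30)]
        truth = np.array([1.0, 0.5, 0.25])
        for (r, c), amp in zip(positions, truth):
            spectrum[r, c] += amp                            # inject synthetic signals
        print(peak_recovery_error(spectrum, positions, truth))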

    Methods Included: Standardizing Computational Reuse and Portability with the Common Workflow Language

    A widely used standard for portable multilingual data analysis pipelines would bring considerable benefits to scholarly publication reuse, research/industry collaboration, regulatory cost control, and the environment. Published research that used multiple computer languages for its analysis pipelines would include a complete and reusable description of that analysis, runnable on a diverse set of computing environments. Researchers would find it easier to collaborate on and reuse these pipelines, adding or exchanging components regardless of the programming language used; collaboration with and within industry would be easier; and approval of new medical interventions that rely on such pipelines would be faster. Time would be saved and environmental impact reduced, as these descriptions contain enough information for advanced optimization without user intervention. Workflows are widely used in data analysis pipelines, enabling innovation and decision-making for modern society. In many domains the analysis components are numerous and written in multiple different computer languages by third parties. Without a standard for reusable and portable multilingual workflows, however, reusing published multilingual workflows, collaborating on open problems, and optimizing their execution are severely hampered. Moreover, only a widely used standard for multilingual data analysis pipelines would deliver these benefits to research-industry collaboration, regulatory cost control, and the environment. Prior to the start of the CWL project, there was no standard for describing multilingual analysis pipelines in a portable and reusable manner. Even today, although there exist hundreds of single-vendor and other single-source systems that run workflows, none is a general, community-driven, consensus-built standard.
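    The information such a portable description has to capture can be sketched in a few lines. The Python example below is purely illustrative and is not CWL syntax; the field names, tools, and container images are assumptions chosen for the sketch.

        # Illustration only (not CWL): what a language-agnostic step description
        # must declare so that any engine can run, cache, or relocate it
        # without reading the tool's source code.
        from dataclasses import dataclass

        @dataclass
        class Step:
            command: list        # the tool invocation, in whatever language it is written
            container: str       # image pinning the runtime environment
            inputs: dict         # logical name -> file supplied by the user or an upstream step
            outputs: list        # files this step promises to produce

        pipeline = {
            "clean":  Step(command=["Rscript", "clean.R", "raw.csv"],
                           container="rocker/r-ver:4.3.1",
                           inputs={"raw": "raw.csv"}, outputs=["clean.csv"]),
            "report": Step(command=["python", "report.py", "clean.csv"],
                           container="python:3.11-slim",
                           inputs={"clean": "clean.csv"}, outputs=["report.html"]),
        }

        for name, step in pipeline.items():
            print(name, "->", " ".join(step.command), "in", step.container)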

    At the crossroads of big science, open science, and technology transfer

    Big science infrastructures are confronting increasing demands for public accountability, not only for their contribution to scientific discovery but also for their capacity to generate secondary economic value. To build and operate their sophisticated infrastructures, big science centres often generate frontier technologies by designing and building technical solutions to complex and unprecedented engineering problems. In parallel, the previous decade has seen the disruption of rapid technological changes affecting the way science is done and shared, which has led to the coining of the concept of Open Science (OS). Governments are quickly moving towards the OS paradigm and asking big science centres to "open up" the scientific process. Yet these two forces run in opposition, as the commercialization of scientific outputs usually requires significant financial investments and companies are willing to bear this cost only if they can protect the innovation from imitation or unfair competition. This PhD dissertation aims at understanding how new applications of ICT are affecting primary research outcomes and the resultant technology transfer in the context of big science and OS. It attempts to uncover the tensions between these two normative forces and to identify the mechanisms employed to overcome them.
    The dissertation comprises four separate studies: 1) a mixed-method study combining two large-scale global online surveys of research scientists (2016, 2018) with two case studies in the high energy physics and molecular biology scientific communities that assess the explanatory factors behind scientific data-sharing practices; 2) a case study of Open Targets, an information infrastructure based upon data commons, where the European Molecular Biology Laboratory-EBI and pharmaceutical companies collaborate and share scientific data and technological tools to accelerate drug discovery; 3) a study of a unique dataset of 170 projects funded under ATTRACT, a novel policy instrument of the European Commission led by European big science infrastructures, which aims to understand the nature of the serendipitous process behind transitioning big science technologies to previously unanticipated commercial applications; and 4) a case study of White Rabbit technology, sophisticated open-source hardware developed at the European Council for Nuclear Research (CERN) in collaboration with an extensive ecosystem of companies.

    Advanced Cloud Platforms for the Reproducibility of Computational Experiments (original title: Plataformes avançades en el Núvol per a la reproductibilitat d'experiments computacionals)

    Tesis por compendio (thesis by compendium of publications). This thesis is framed within computational science and focuses on the development of tools for running scientific computational experiments, whose impact keeps growing across all areas of science and engineering. Given the growing complexity of the computations involved, such experiments generally require large and complex computational infrastructures, and using these infrastructures efficiently demands deep knowledge of the available tools and techniques. Moreover, the popularity of cloud computing environments offers a wide variety of possible configurations for these infrastructures, which further complicates their setup. This increase in complexity has also exacerbated the well-known problem of reproducibility in science: missing documentation, such as the software versions used or the data required by an experiment, means that a non-negligible share of published computational results cannot be reproduced by other researchers, wasting resources invested in research. In response, several tools have been developed, and continue to be developed, to facilitate tasks such as infrastructure deployment and configuration, data access, and the design of computational workflows, so that researchers can focus on the problem at hand. The works presented in this thesis share that objective: developing tools that make the use of cloud computing environments for scientific experimentation easy, efficient, and reproducible.
    The first work is an exhaustive performance study of a relatively new service, serverless function execution, assessing its suitability for scientific computing by measuring CPU and network performance both in isolation and in combination, the latter through a MapReduce processing application built entirely on serverless services. The second work addresses the reproducibility of computational experiments: it presents an environment, based on Jupyter, that encapsulates the deployment and configuration of the computational infrastructure together with data access and the documentation of the experiment; all of this information is recorded in the Jupyter notebook in which the experiment runs, so other researchers can reproduce the results simply by sharing the corresponding notebook. The third work, returning to the performance fluctuations measured in the first study, which are well documented for shared environments such as cloud computing, develops a load-balancing system designed specifically for such environments; this component can accurately manage and correct unpredictable fluctuations in compute performance. Finally, building on that work, a completely serverless platform is designed to distribute and balance tasks executed on multiple independent computing infrastructures, motivated by the high computational cost of certain experiments, which forces researchers to use several infrastructures that generally belong to different organisations. The platform is shown to balance the workload accurately, to meet user-specified execution time constraints, and to help the computing infrastructures scale with the incoming workload, avoiding over- and under-provisioning and thus making efficient use of the available resources.
    This study was supported by the programme "Ayudas para la contratación de personal investigador en formación de carácter predoctoral, programa VALi+d" under grant number ACIF/2018/148 from the Conselleria d'Educació of the Generalitat Valenciana. The authors would also like to thank the Spanish "Ministerio de Economía, Industria y Competitividad" for the project "BigCLOE" with reference number TIN2016-79951-R. Giménez Alventosa, V. (2022). Plataformes avançades en el Núvol per a la reproductibilitat d'experiments computacionals [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/184010
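    The load-balancing idea in the third work can be sketched very simply: assign tasks to each infrastructure in proportion to its recently measured throughput, so that performance fluctuations in shared environments are absorbed automatically. The class and method names below are hypothetical and do not correspond to the thesis's actual library.

        # Hypothetical sketch of throughput-proportional load balancing across
        # shared infrastructures whose performance fluctuates over time.
        from collections import deque

        class ThroughputBalancer:
            def __init__(self, backends, window=5):
                # Sliding window of observed tasks-per-second for each backend.
                self.history = {b: deque([1.0], maxlen=window) for b in backends}

            def report(self, backend, tasks_done, seconds):
                self.history[backend].append(tasks_done / max(seconds, 1e-9))

            def split(self, n_tasks):
                # Assign tasks proportionally to the average recent throughput.
                rates = {b: sum(h) / len(h) for b, h in self.history.items()}
                total = sum(rates.values())
                share = {b: int(n_tasks * r / total) for b, r in rates.items()}
                # Give the rounding remainder to the currently fastest backend.
                share[max(rates, key=rates.get)] += n_tasks - sum(share.values())
                return share

        balancer = ThroughputBalancer(["site-a", "site-b"])
        balancer.report("site-a", tasks_done=40, seconds=10)  # 4 tasks/s observed
        balancer.report("site-b", tasks_done=10, seconds=10)  # 1 task/s observed
        print(balancer.split(100))  # the faster site receives proportionally more work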

    Support for HTCondor high-throughput computing workflows in the REANA reusable analysis platform

    REANA is a reusable and reproducible data analysis platform that allows researchers to structure their analysis pipelines and run them on remote containerised compute clouds. REANA supports several different workflow systems (CWL, Serial, Yadage) and uses Kubernetes as its job execution backend. We have designed an abstract job execution component that extends the REANA platform's job execution capabilities to support multiple compute backends. We have tested the abstract job execution component with HTCondor and verified the scalability of the designed solution. The results show that the REANA platform would be able to support hybrid scientific workflows where different parts of the analysis pipeline can be executed on multiple computing backends.
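    The abstract job execution idea can be pictured as a common interface with one implementation per compute backend. The Python sketch below is hypothetical (the class and method names are not REANA's actual code) and only illustrates the design.

        # Hypothetical sketch: a single job-execution interface hides the
        # differences between compute backends from the rest of the platform.
        from abc import ABC, abstractmethod

        class JobManager(ABC):
            @abstractmethod
            def execute(self, image: str, command: list) -> str:
                """Submit one containerised job and return a backend-specific job ID."""

        class KubernetesJobManager(JobManager):
            def execute(self, image, command):
                # A real implementation would create a Kubernetes Job object here.
                return f"k8s-job-for-{image}"

        class HTCondorJobManager(JobManager):
            def execute(self, image, command):
                # A real implementation would build and submit an HTCondor submit description.
                return f"condor-cluster-for-{image}"

        def dispatch(backend, image, command):
            managers = {"kubernetes": KubernetesJobManager(), "htcondor": HTCondorJobManager()}
            return managers[backend].execute(image, command)

        print(dispatch("htcondor", "reana-env:latest", ["./run_analysis.sh"]))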