
    Building Near-Real-Time Processing Pipelines with the Spark-MPI Platform

    Advances in detectors and computational technologies provide new opportunities for applied research and the fundamental sciences. Concurrently, dramatic increases in the three Vs (Volume, Velocity, and Variety) of experimental data and in the scale of computational tasks have produced demand for new real-time processing systems at experimental facilities. Recently, this demand was addressed by the Spark-MPI approach, which connects the Spark data-intensive platform with the MPI high-performance framework. In contrast with existing data management and analytics systems, Spark introduced a new middleware based on resilient distributed datasets (RDDs), which decoupled various data sources from high-level processing algorithms. The RDD middleware significantly advanced the scope of data-intensive applications, spreading from SQL queries to machine learning to graph processing. Spark-MPI further extended the Spark ecosystem with MPI applications using the Process Management Interface. The paper explores this integrated platform within the context of online ptychographic and tomographic reconstruction pipelines. Comment: New York Scientific Data Summit, August 6-9, 201

    Experimental Data Curation at Large Instrument Facilities with Open Source Software

    The National Synchrotron Light Source II (NSLS-II), operating at Brookhaven National Laboratory since 2014 for the US Department of Energy, is one of the newest and brightest storage-ring synchrotron facilities in the world. NSLS-II, like other facilities, provides pre-processing of the raw data and some analysis capabilities to its users. We describe the research collaborations and open source infrastructure developed at large instrument facilities such as NSLS-II for the purpose of curating high-value scientific data along the early stages of the data lifecycle. Data acquisition and curation tasks include storing experiment configuration and detector metadata, and acquiring raw data with infrastructure that converts proprietary instrument formats to industry standards. In addition, we describe a specific effort for discovering sample information at NSLS-II and tracing the provenance of analysis performed on acquired images. We show that curation tasks must be embedded into software along the data lifecycle for effectiveness and ease of use, and that loosely defined collaborations evolve around shared open source tools. Finally, we discuss best practices for experimental metadata capture in such facilities, data access, and the new challenges of scale and complexity posed by AI-based discovery for the synthesis of new materials.

    Status Report of the DPHEP Study Group: Towards a Global Effort for Sustainable Data Preservation in High Energy Physics

    Data from high-energy physics (HEP) experiments are collected with significant financial and human effort and are mostly unique. An inter-experimental study group on HEP data preservation and long-term analysis was convened as a panel of the International Committee for Future Accelerators (ICFA). The group was formed by large collider-based experiments and investigated the technical and organisational aspects of HEP data preservation. An intermediate report was released in November 2009 addressing the general issues of data preservation in HEP. This paper includes and extends the intermediate report. It provides an analysis of the research case for data preservation and a detailed description of the various projects at experiment, laboratory and international levels. In addition, the paper provides a concrete proposal for an international organisation in charge of the data management and policies in high-energy physics

    Data-intensive science

    Data-intensive science has the potential to transform scientific research and quickly translate scientific progress into complete solutions, policies, and economic success. But this collaborative science is still lacking the effective access and exchange of knowledge among scientists, researchers, and policy makers across a range of disciplines. Bringing together leaders from multiple scientific disciplines, Data-Intensive Science shows how a comprehensive integration of various techniques and technological advances can effectively harness the vast amount of data being generated and significa

    Enabling Technologies for Improved Data Management: Hardware

    The most valuable assets in every scientific community are the expert workforce and the research results and data produced. The last decade has seen new experimental and computational techniques developing at an ever-faster pace, encouraging the production of ever-larger quantities of data in ever-shorter time spans. Concurrently, the traditional scientific working environment has changed beyond recognition. Today scientists can use a wide spectrum of experimental, computational and analytical facilities, often widely distributed over the UK and Europe. In this environment new challenges are posed for the management of data every day, but are we ready to tackle them? Do we know exactly what the challenges are? Is the right technology available, and is it applied where necessary? This part of enabling technologies investigates current hardware techniques and their functionalities and provides a comparison between various products.