
    Building Near-Real-Time Processing Pipelines with the Spark-MPI Platform

    Advances in detectors and computational technologies provide new opportunities for applied research and the fundamental sciences. Concurrently, dramatic increases in the three Vs (Volume, Velocity, and Variety) of experimental data and the scale of computational tasks produced the demand for new real-time processing systems at experimental facilities. Recently, this demand was addressed by the Spark-MPI approach connecting the Spark data-intensive platform with the MPI high-performance framework. In contrast with existing data management and analytics systems, Spark introduced a new middleware based on resilient distributed datasets (RDDs), which decoupled various data sources from high-level processing algorithms. The RDD middleware significantly advanced the scope of data-intensive applications, spreading from SQL queries to machine learning to graph processing. Spark-MPI further extended the Spark ecosystem with the MPI applications using the Process Management Interface. The paper explores this integrated platform within the context of online ptychographic and tomographic reconstruction pipelines. Comment: New York Scientific Data Summit, August 6-9, 201

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

    Architecture of Environmental Risk Modelling: for a faster and more robust response to natural disasters

    Demands on the disaster response capacity of the European Union are likely to increase, as the impacts of disasters continue to grow both in size and frequency. This has resulted in intensive research on issues concerning spatially-explicit information and modelling and their multiple sources of uncertainty. Geospatial support is one of the forms of assistance frequently required by emergency response centres along with hazard forecast and event management assessment. Robust modelling of natural hazards requires dynamic simulations under an array of multiple inputs from different sources. Uncertainty is associated with meteorological forecast and calibration of the model parameters. Software uncertainty also derives from the data transformation models (D-TM) needed for predicting hazard behaviour and its consequences. On the other hand, social contributions have recently been recognized as valuable in raw-data collection and mapping efforts traditionally dominated by professional organizations. Here an architecture overview is proposed for adaptive and robust modelling of natural hazards, following the Semantic Array Programming paradigm to also include the distributed array of social contributors called Citizen Sensor in a semantically-enhanced strategy for D-TM modelling. The modelling architecture proposes a multicriteria approach for assessing the array of potential impacts with qualitative rapid assessment methods based on a Partial Open Loop Feedback Control (POLFC) schema and complementing more traditional and accurate a-posteriori assessment. We discuss the computational aspect of environmental risk modelling using array-based parallel paradigms on High Performance Computing (HPC) platforms, in order for the implications of urgency to be introduced into the systems (Urgent-HPC). Comment: 12 pages, 1 figure, 1 text box, presented at the 3rd Conference of Computational Interdisciplinary Sciences (CCIS 2014), Asuncion, Paragua

    Standardization Framework for Sustainability from Circular Economy 4.0

    The circular economy (CE) is widely known as a way to implement and achieve sustainability, mainly due to its contribution towards the separation of biological and technical nutrients under cyclic industrial metabolism. The incorporation of the principles of the CE in the links of the value chain of the various sectors of the economy strives to ensure circularity, safety, and efficiency. The framework proposed is aligned with the goals of the 2030 Agenda for Sustainable Development regarding the orientation towards the mitigation and regeneration of the metabolic rift by considering a double perspective. Firstly, it strives to conceptualize the CE as a paradigm of sustainability. Its principles are established, and its techniques and tools are organized into two frameworks oriented towards causes (cradle to cradle) and effects (life cycle assessment), and these are structured under the three pillars of sustainability, for their projection within the proposed framework. Secondly, a framework is established to facilitate the implementation of the CE with the use of standards, which constitute the requirements, tools, and indicators to control each life cycle phase, and of key enabling technologies (KETs) that add circular value 4.0 to the socio-ecological transition

    An evaluation of galaxy and ruffus-scripting workflows system for DNA-seq analysis

    Magister Scientiae - MSc. Functional genomics determines the biological functions of genes on a global scale by using large volumes of data obtained through techniques including next-generation sequencing (NGS). The application of NGS in biomedical research is gaining momentum, and with its adoption becoming more widespread, there is an increasing need for access to customizable computational workflows that can simplify, and offer access to, computationally intensive analyses of genomic data. In this study, the Galaxy and Ruffus frameworks were designed and implemented with a view to address the challenges faced in biomedical research. Galaxy, a graphical web-based framework, allows researchers to build a graphical NGS data analysis pipeline for accessible, reproducible, and collaborative data-sharing. Ruffus, a UNIX command-line framework used by bioinformaticians as a Python library to write scripts in object-oriented style, allows for building a workflow in terms of task dependencies and execution logic. In this study, a dual data analysis technique was explored which focuses on a comparative evaluation of Galaxy and Ruffus frameworks that are used in composing analysis pipelines. To this end, we developed an analysis pipeline in Galaxy, and Ruffus, for the analysis of Mycobacterium tuberculosis sequence data. Furthermore, this study aimed to compare the Galaxy framework to Ruffus with preliminary analysis revealing that the analysis pipeline in Galaxy displayed a higher percentage of load and store instructions. In comparison, pipelines in Ruffus tended to be CPU bound and memory intensive. The CPU usage, memory utilization, and runtime execution are graphically represented in this study. Our evaluation suggests that workflow frameworks have distinctly different features from ease of use, flexibility, and portability, to architectural designs

    Heterogeneity in pure microbial systems: experimental measurements and modeling

    Cellular heterogeneity influences bioprocess performance in ways that to date are not completely elucidated. In order to account for this phenomenon in the design and operation of bioprocesses, reliable analytical and mathematical descriptions are required. We present an overview of single-cell analysis and the mathematical modeling frameworks that have potential to be used in bioprocess control and optimization, in particular for microbial processes. In order to be suitable for bioprocess monitoring, experimental methods need to be high throughput and to require relatively short processing time. One such method used successfully under dynamic conditions is flow cytometry. Population balance and individual-based models are suitable modeling options, the latter having particularly good potential to integrate the various data collected through experimentation. This will be highly beneficial for appropriate process design and scale-up, as a more rigorous approach may prevent a priori unwanted performance losses. It will also help advance synthetic biology applications to industrial scale