Building Near-Real-Time Processing Pipelines with the Spark-MPI Platform
Advances in detectors and computational technologies provide new opportunities for applied research and the fundamental sciences. Concurrently, dramatic increases in the three Vs (Volume, Velocity, and Variety) of experimental data and in the scale of computational tasks have created demand for new real-time processing systems at experimental facilities. Recently, this demand was addressed by the Spark-MPI approach, which connects the Spark data-intensive platform with the MPI high-performance framework. In contrast with existing data management and analytics systems, Spark introduced a new middleware based on resilient distributed datasets (RDDs), which decouples various data sources from high-level processing algorithms. The RDD middleware significantly advanced the scope of data-intensive applications, ranging from SQL queries to machine learning and graph processing. Spark-MPI further extends the Spark ecosystem with MPI applications using the Process Management Interface (PMI). The paper explores this integrated platform within the context of online ptychographic and tomographic reconstruction pipelines.
Comment: New York Scientific Data Summit, August 6-9, 201
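As a rough illustration of the RDD decoupling described in this abstract, the following PySpark sketch maps a stand-in kernel over an RDD whose origin (disk, stream, detector) is hidden from the processing step. This is not Spark-MPI itself; the reconstruct_frames function and the toy data are hypothetical, and in Spark-MPI such a kernel would instead be delegated to MPI ranks via the Process Management Interface.

```python
# Minimal PySpark sketch (illustrative, not from the paper): an RDD
# decouples the data source from a high-level processing step.
from pyspark import SparkContext

def reconstruct_frames(frames):
    # Hypothetical stand-in for a compute-heavy reconstruction kernel;
    # Spark-MPI would hand this work to an MPI application via PMI
    # rather than computing it inside the Spark executor.
    for frame_id, pixels in frames:
        yield frame_id, sum(pixels) / len(pixels)  # toy "reconstruction"

sc = SparkContext("local[2]", "near-real-time-pipeline")
# The RDD hides whether frames came from disk, a stream, or a detector.
frames = sc.parallelize([(i, list(range(i, i + 4))) for i in range(8)])
results = frames.mapPartitions(reconstruct_frames).collect()
print(results)
sc.stop()
```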
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
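To make two of the five challenges concrete, here is a hedged scikit-learn sketch, not taken from the review, that handles missing values by imputation and counters class imbalance with class weighting. The feature matrix X is a synthetic stand-in for concatenated multi-omics measurements; all names are illustrative.

```python
# Illustrative sketch of handling missing data and class imbalance
# in an integrative-analysis setting (synthetic data, assumed setup).
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))           # e.g., genome + proteome features
X[rng.random(X.shape) < 0.1] = np.nan    # simulate missing omics values
y = (rng.random(200) < 0.2).astype(int)  # imbalanced clinical labels

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing entries
    ("scale", StandardScaler()),                   # put sources on one scale
    ("clf", LogisticRegression(class_weight="balanced")),  # reweight classes
])
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```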
Architecture of Environmental Risk Modelling: for a faster and more robust response to natural disasters
Demands on the disaster response capacity of the European Union are likely to increase as the impacts of disasters continue to grow in both size and frequency. This has resulted in intensive research on issues concerning spatially-explicit information and modelling and their multiple sources of uncertainty. Geospatial support is one of the forms of assistance frequently required by emergency response centres, along with hazard forecasting and event management assessment. Robust modelling of natural hazards requires dynamic simulations under an array of multiple inputs from different sources. Uncertainty is associated with the meteorological forecasts and the calibration of model parameters. Software uncertainty also derives from the data transformation models (D-TM) needed for predicting hazard behaviour and its consequences. On the other hand, social contributions have recently been recognized as valuable in raw-data collection and mapping efforts traditionally dominated by professional organizations. Here an architecture overview is proposed for adaptive and robust modelling of natural hazards, following the Semantic Array Programming paradigm to also include the distributed array of social contributors, called Citizen Sensor, in a semantically-enhanced strategy for D-TM modelling. The modelling architecture proposes a multicriteria approach for assessing the array of potential impacts, with qualitative rapid assessment methods based on a Partial Open Loop Feedback Control (POLFC) schema complementing more traditional and accurate a-posteriori assessment. We discuss the computational aspects of environmental risk modelling using array-based parallel paradigms on High Performance Computing (HPC) platforms, so that the implications of urgency can be introduced into the system (Urgent-HPC).
Comment: 12 pages, 1 figure, 1 text box; presented at the 3rd Conference of Computational Interdisciplinary Sciences (CCIS 2014), Asuncion, Paraguay
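One way to read the Semantic Array Programming idea above is that each data-transformation module (D-TM) checks semantic constraints on its array arguments before computing. The sketch below is a loose, assumed interpretation, not code from the paper; the check_* helpers and the toy impact formula are hypothetical.

```python
# Assumed sketch of semantic checks guarding a D-TM step over spatial arrays.
import numpy as np

def check_nonnegative(a, name):
    # Semantic precondition: physical quantities cannot be negative.
    if np.any(a < 0):
        raise ValueError(f"{name} must be non-negative")
    return a

def check_proportions(a, name):
    # Semantic precondition: vulnerability is a proportion in [0, 1].
    if np.any((a < 0) | (a > 1)):
        raise ValueError(f"{name} must lie in [0, 1]")
    return a

def expected_impact(hazard_intensity, exposure, vulnerability):
    """Toy D-TM: combine hazard, exposure, and vulnerability arrays."""
    h = check_nonnegative(np.asarray(hazard_intensity), "hazard_intensity")
    e = check_nonnegative(np.asarray(exposure), "exposure")
    v = check_proportions(np.asarray(vulnerability), "vulnerability")
    return h * e * v  # element-wise impact over the spatial grid

print(expected_impact([1.5, 0.2], [100.0, 40.0], [0.3, 0.9]))
```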
Standardization Framework for Sustainability from Circular Economy 4.0
The circular economy (CE) is widely known as a way to implement and achieve sustainability, mainly due to its contribution towards the separation of biological and technical nutrients under cyclic industrial metabolism. The incorporation of the principles of the CE into the links of the value chain of the various sectors of the economy aims to ensure circularity, safety, and efficiency. The framework proposed is aligned with the goals of the 2030 Agenda for Sustainable Development regarding the orientation towards the mitigation and regeneration of the metabolic rift, by considering a double perspective. Firstly, it strives to conceptualize the CE as a paradigm of sustainability. Its principles are established, and its techniques and tools are organized into two frameworks oriented towards causes (cradle to cradle) and effects (life cycle assessment), and these are structured under the three pillars of sustainability for their projection within the proposed framework. Secondly, a framework is established to facilitate the implementation of the CE with the use of standards, which constitute the requirements, tools, and indicators to control each life cycle phase, and of key enabling technologies (KETs) that add circular value 4.0 to the socio-ecological transition.
An evaluation of Galaxy and Ruffus scripting workflow systems for DNA-seq analysis
Magister Scientiae - MSc
Functional genomics determines the biological functions of genes on a global scale by using large volumes of data obtained through techniques including next-generation sequencing (NGS). The application of NGS in biomedical research is gaining momentum, and with its adoption becoming more widespread, there is an increasing need for access to customizable computational workflows that can simplify, and offer access to, compute-intensive analyses of genomic data. In this study, analysis pipelines were designed and implemented in the Galaxy and Ruffus frameworks with a view to addressing the challenges faced in biomedical research. Galaxy, a graphical web-based framework, allows researchers to build a graphical NGS data analysis pipeline for accessible, reproducible, and collaborative data sharing. Ruffus, a UNIX command-line framework used by bioinformaticians as a Python library for writing scripts in an object-oriented style, allows a workflow to be built in terms of task dependencies and execution logic. In this study, a dual data analysis technique was explored, focusing on a comparative evaluation of the Galaxy and Ruffus frameworks as used to compose analysis pipelines. To this end, we developed an analysis pipeline in both Galaxy and Ruffus for the analysis of Mycobacterium tuberculosis sequence data. Furthermore, this study aimed to compare the Galaxy framework to Ruffus, with preliminary analysis revealing that the analysis pipeline in Galaxy displayed a higher percentage of load and store instructions. In comparison, pipelines in Ruffus tended to be CPU bound and memory intensive. CPU usage, memory utilization, and runtime execution are graphically represented in this study. Our evaluation suggests that workflow frameworks have distinctly different features, ranging from ease of use, flexibility, and portability to architectural design.
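To illustrate the task-dependency style that Ruffus uses, here is a minimal sketch; it is not the thesis pipeline, and the file names and toy "alignment" step are hypothetical stand-ins for real NGS tools.

```python
# Minimal Ruffus sketch (illustrative): tasks declared as file dependencies.
from ruffus import originate, transform, suffix, pipeline_run

@originate(["sample1.fastq", "sample2.fastq"])
def make_reads(output_file):
    # Placeholder inputs; a real pipeline would start from sequencer output.
    with open(output_file, "w") as f:
        f.write("ACGT\n")

@transform(make_reads, suffix(".fastq"), ".sam")
def align_reads(input_file, output_file):
    # Stand-in for a real aligner invocation (e.g., a subprocess call).
    with open(input_file) as src, open(output_file, "w") as dst:
        dst.write("aligned:" + src.read())

if __name__ == "__main__":
    # Ruffus resolves the task graph and runs only out-of-date steps.
    pipeline_run([align_reads])
```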
Heterogeneity in pure microbial systems: experimental measurements and modeling
Cellular heterogeneity influences bioprocess performance in ways that, to date, are not completely elucidated. In order to account for this phenomenon in the design and operation of bioprocesses, reliable analytical and mathematical descriptions are required. We present an overview of single-cell analysis and of the mathematical modeling frameworks that have the potential to be used in bioprocess control and optimization, in particular for microbial processes. To be suitable for bioprocess monitoring, experimental methods need to be high-throughput and to require relatively short processing times. One such method used successfully under dynamic conditions is flow cytometry. Population balance and individual-based models are suitable modeling options, the latter in particular having good potential to integrate the various data collected through experimentation. This will be highly beneficial for appropriate process design and scale-up, as a more rigorous approach may prevent unwanted performance losses a priori. It will also help progress synthetic biology applications to industrial scale.
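As a toy illustration of the individual-based modeling option mentioned above, the following sketch simulates growth-rate heterogeneity in a cell population; it is an assumed example, not a model from the paper, and all parameters are arbitrary.

```python
# Toy individual-based model of microbial heterogeneity (assumed example):
# each cell carries its own growth rate, and division produces two
# daughters with slightly perturbed rates.
import random

random.seed(1)
rates = [1.0 + 0.2 * random.random() for _ in range(10)]  # per-cell growth rates
sizes = [1.0] * len(rates)

for step in range(50):
    new_rates, new_sizes = [], []
    for rate, size in zip(rates, sizes):
        size *= 1.0 + 0.05 * rate        # deterministic growth per step
        if size >= 2.0:                  # division threshold
            for _ in range(2):           # two daughters, rates perturbed
                new_rates.append(max(0.1, rate + random.gauss(0, 0.05)))
                new_sizes.append(size / 2)
        else:
            new_rates.append(rate)
            new_sizes.append(size)
    rates, sizes = new_rates, new_sizes

mean_rate = sum(rates) / len(rates)
print(f"population: {len(rates)} cells, mean growth rate: {mean_rate:.3f}")
```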