A Study in Hadoop Streaming with Matlab for NMR Data Processing
Applying Cloud computing techniques to the analysis of large data sets has shown promise in many data-driven scientific applications. Our approach is to use Cloud computing for Nuclear Magnetic Resonance (NMR) data analysis, which normally involves large amounts of data. Biologists often use third-party or commercial software for ease of use, and enabling this kind of software in a Cloud would be highly advantageous. Scripting languages designed especially for clouds may not have the flexibility biologists need. Biologists are, however, familiar with special software packages that allow them to write complex calculations with minimal effort but that are often not compatible with a Cloud environment. Biologists who analyze NMR data therefore gain several advantages from our proposed solution: it gives them the flexibility to Cloud-enable their familiar software, and it enables them to perform calculations on amounts of data that could not previously be handled. Our study is also applicable to any other environment in need of similar flexibility. We are currently in the initial stage of developing a framework for NMR data analysis.
Scientific Workflows for Metabolic Flux Analysis
Metabolic engineering is a highly interdisciplinary research domain that interfaces biology, mathematics, computer science, and engineering. Metabolic flux analysis with carbon tracer experiments (13C-MFA) is a particularly challenging metabolic engineering application that consists of several tightly interwoven building blocks such as modeling, simulation, and experimental design. While several general-purpose workflow solutions have emerged in recent years to support the realization of complex scientific applications, these approaches are only partially transferable to 13C-MFA workflows. While problems in other research fields (e.g., bioinformatics) are primarily centered around scientific data processing, 13C-MFA workflows have more in common with business workflows. For instance, many bioinformatics workflows are designed to identify, compare, and annotate genomic sequences by "pipelining" them through standard tools like BLAST. Typically, the next workflow task in the pipeline can be automatically determined by the outcome of the previous step. Five computational challenges have been identified in the endeavor of conducting 13C-MFA studies: organization of heterogeneous data, standardization of processes and the unification of tools and data, interactive workflow steering, distributed computing, and service orientation. The outcome of this thesis is a scientific workflow framework (SWF) that is custom-tailored for the specific requirements of 13C-MFA applications. The proposed approach (namely, designing the SWF as a collection of loosely-coupled modules that are glued together with web services) eases the realization of 13C-MFA workflows by offering several features. By design, existing tools are integrated into the SWF using web service interfaces and foreign programming language bindings (e.g., Java or Python).
Although the attributes "easy-to-use" and "general-purpose" are rarely associated with distributed computing software, the presented use cases show that the proposed Hadoop MapReduce framework eases the deployment of computationally demanding simulations on cloud and cluster computing resources. An important building block for allowing interactive researcher-driven workflows is the ability to track all data that is needed to understand and reproduce a workflow. The standardization of 13C-MFA studies using a folder structure template and the corresponding services and web interfaces improves the exchange of information within a group of researchers. Finally, several auxiliary tools are developed in the course of this work to complement the SWF modules, ranging from simple helper scripts to visualization or data conversion programs. This solution distinguishes itself from other scientific workflow approaches by offering a system of loosely-coupled components that are flexibly arranged to match the typical requirements in the metabolic engineering domain. Because the framework is modern and service-oriented, new applications are easily composed by reusing existing components.
High-Performance Modelling and Simulation for Big Data Applications
This open access book was prepared as a Final Publication of the COST Action IC1406 “High-Performance Modelling and Simulation for Big Data Applications (cHiPSet)” project. Long considered important pillars of the scientific method, Modelling and Simulation have evolved from traditional discrete numerical methods to complex data-intensive continuous analytical optimisations. Resolution, scale, and accuracy have become essential to predict and analyse natural and complex systems in science and engineering. When their level of abstraction rises to give a better discernment of the domain at hand, their representation becomes increasingly demanding of computational and data resources. On the other hand, High Performance Computing typically entails the effective use of parallel and distributed processing units coupled with efficient storage, communication and visualisation systems to underpin complex data-intensive applications in distinct scientific and technical domains. A seamless interaction of High Performance Computing with Modelling and Simulation is therefore arguably required in order to store, compute, analyse, and visualise large data sets in science and engineering. Funded by the European Commission, cHiPSet has provided a dynamic trans-European forum for its members and distinguished guests to openly discuss novel perspectives and topics of interest for these two communities. This cHiPSet compendium presents a set of selected case studies related to healthcare, biological data, computational advertising, multimedia, finance, bioinformatics, and telecommunications.
Providing Information by Resource-Constrained Data Analysis
The Collaborative Research Center SFB 876 (Providing Information by Resource-Constrained Data Analysis) brings together the research fields of data analysis (Data Mining, Knowledge Discovery in Databases, Machine Learning, Statistics) and embedded systems and enhances their methods such that information from distributed, dynamic masses of data becomes available anytime and anywhere. The research center approaches these problems with new algorithms that respect the resource constraints in the different scenarios. This Technical Report presents the work of the members of the integrated graduate school.
Identification of Data Structure with Machine Learning: From Fisher to Bayesian networks
This thesis proposes a theoretical framework to thoroughly analyse the structure of a dataset in terms of a) metric, b) density and c) feature associations. To look into the first aspect, Fisher's metric learning algorithms are the foundations of a novel manifold based on the information and complexity of a classification model. For the density aspect, Probabilistic Quantum Clustering, a Bayesian version of the original Quantum Clustering, is proposed. The clustering results depend on local density variations, which is a desired feature when dealing with heteroscedastic data. To address the third aspect, the starting point is the constraint-based PC-algorithm, which underlies many structure learning algorithms and focuses on finding feature associations by means of conditional independence tests. This is then used to select Bayesian networks, based on a regularized likelihood score. These three topics of data structure analysis were fully tested with synthetic data examples and real cases, which allowed us to unravel and discuss the advantages and limitations of these algorithms. One of the biggest challenges encountered was related to the application of these methods to a Big Data dataset that was analysed within the framework of a collaboration with a large UK retailer, where the interest was in the identification of the data structure underlying customer shopping baskets.
XSEDE: eXtreme Science and Engineering Discovery Environment Third Quarter 2012 Report
The Extreme Science and Engineering Discovery Environment (XSEDE) is the most advanced, powerful, and robust collection of integrated digital resources and services in the world. It is an integrated cyberinfrastructure ecosystem with singular interfaces for allocations, support, and other key services that researchers can use to interactively share computing resources, data, and expertise. This is a report of project activities and highlights from the third quarter of 2012. National Science Foundation, OCI-105357
Deep learning in mining biological data
Recent technological advancements in data acquisition tools have allowed life scientists to acquire multimodal data from different biological application domains. Categorised into three broad types (i.e., images, signals, and sequences), these data are huge in amount and complex in nature. Mining such an enormous amount of data for pattern recognition is a big challenge and requires sophisticated data-intensive machine learning techniques. Artificial neural network based learning systems are well known for their pattern recognition capabilities, and lately their deep architectures, known as deep learning (DL), have been successfully applied to solve many complex pattern recognition problems. To investigate how DL, and especially its different architectures, has contributed to and been utilised in the mining of biological data pertaining to those three types, a meta-analysis has been performed and the resulting resources have been critically analysed. Focusing on the use of DL to analyse patterns in data from diverse biological domains, this work investigates the applications of different DL architectures to these data. This is followed by an exploration of available open access data sources pertaining to the three data types along with popular open source DL tools applicable to these data. Comparative investigations of these tools from qualitative, quantitative, and benchmarking perspectives are also provided. Finally, some open research challenges in using DL to mine biological data are outlined and a number of possible future perspectives are put forward.
Data-Driven Rational Drug Design
A vast amount of experimental data in structural biology has been generated, collected and accumulated in the last few decades. This rich dataset is an invaluable mine of knowledge, from which deep insights can be obtained and practical applications can be developed. To achieve that goal, we must be able to manage such "Big Data" in science and investigate them expertly. Molecular docking is a field that can prominently make use of the large structural biology dataset. As an important component of rational drug design, molecular docking is used to perform large-scale screening of putative associations between small organic molecules and their pharmacologically relevant protein targets. Given a small molecule (ligand), a molecular docking program simulates its interaction with the target protein and reports the probable conformation of the protein-ligand complex and its binding affinity relative to other candidate ligands. This dissertation collects my contributions to several aspects of molecular docking. My early contribution focused on developing a novel metric to quantify the structural similarity between two protein-ligand complexes. Benchmarks show that my metric addresses several issues associated with the conventional metric. Furthermore, I extended the functionality of this metric across different systems, effectively utilizing the data at the proteome level. After developing the novel metric, I formulated a scoring function that can extract the biological information of the complex, integrate it with the physics components, and ultimately enhance performance. Through collaboration, I implemented my model in an ultra-fast, adaptive program, which can take advantage of a range of modern parallel architectures and handle the demanding data processing tasks in large-scale molecular docking applications.