bnstruct: an R package for Bayesian Network structure learning in the presence of missing data.
Abstract
Motivation
A Bayesian Network is a probabilistic graphical model that encodes probabilistic dependencies between a set of random variables. We introduce bnstruct, an open source R package to (i) learn the structure and the parameters of a Bayesian Network from data in the presence of missing values and (ii) perform reasoning and inference on the learned Bayesian Networks. To the best of our knowledge, there is no other open source software that provides methods for all of these tasks, particularly the handling of missing data, which is a common situation in practice.
Availability and Implementation
The software is implemented in R and C and is available on CRAN under a GPL licence.
Supplementary information
Supplementary data are available at Bioinformatics online
A rule-based model of insulin signalling pathway
BACKGROUND: The insulin signalling pathway (ISP) is an important biochemical pathway, which regulates some fundamental biological functions such as glucose and lipid metabolism, protein synthesis, cell proliferation, cell differentiation and apoptosis. In recent years, different mathematical models based on ordinary differential equations have been proposed in the literature to describe specific features of the ISP, thus providing a description of the behaviour of the system and its emerging properties. However, protein-protein interactions potentially generate a multiplicity of distinct chemical species, an issue referred to as "combinatorial complexity", which results in defining a high number of state variables equal to the number of possible protein modifications. This often leads to complex, error-prone and difficult-to-handle model definitions. RESULTS: In this work, we present a comprehensive model of the ISP, which integrates three models previously available in the literature by using the rule-based modelling (RBM) approach. RBM allows for a simple description of a number of signalling pathway characteristics, such as the phosphorylation of signalling proteins at multiple sites with different effects, the simultaneous interaction of many molecules of the signalling pathways with several binding partners, and the information about subcellular localization where reactions take place. Thanks to its modularity, it also allows an easy integration of different pathways. After RBM specification, we simulated the dynamic behaviour of the ISP model and validated it using experimental data. We then examined the predicted profiles of all the active species and clustered them into four groups according to their dynamic behaviour. Finally, we used parametric sensitivity analysis to show the role of negative feedback loops in controlling the robustness of the system.
CONCLUSIONS: The presented ISP model is a powerful tool for data simulation and can be used in combination with experimental approaches to guide the experimental design. The model is available at http://sysbiobig.dei.unipd.it/ and was submitted to the BioModels Database (https://www.ebi.ac.uk/biomodels-main/# MODEL 1604100005). ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12918-016-0281-4) contains supplementary material, which is available to authorized users.
Improving biomarker list stability by integration of biological knowledge in the learning process
BACKGROUND:
The identification of robust lists of molecular biomarkers related to a disease is a fundamental step for early diagnosis and treatment. However, methodologies for biomarker discovery using microarray data often provide results with limited overlap. It has been suggested that one reason for these inconsistencies may be that in complex diseases, such as cancer, multiple genes belonging to one or more physiological pathways are associated with the outcomes. Thus, a possible approach to improve list stability is to integrate biological information from genomic databases in the learning process; however, a comprehensive assessment based on different types of biological information is still lacking in the literature. In this work we have compared the effect of using different types of biological information in the learning process, such as functional annotations, protein-protein interactions and expression correlation among genes.
RESULTS:
Biological knowledge has been codified by means of gene similarity matrices, and expression data have been linearly transformed in such a way that the more similar two features are, the more closely they are mapped. Two semantic similarity matrices, based on Biological Process and Molecular Function Gene Ontology annotation, and geodesic distance applied on protein-protein interaction networks, are the best performers in improving list stability while maintaining almost equal prediction accuracy.
CONCLUSIONS:
The performed analysis supports the idea that when some features are strongly correlated to each other, for example because they are close in the protein-protein interaction network, then they might have similar importance and be equally relevant for the task at hand. The obtained results can be a starting point for additional experiments on combining similarity matrices in order to obtain even more stable lists of biomarkers. The implementation of the classification algorithm is available at the link: http://www.math.unipd.it/~dasan/biomarkers.html
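The core trick described in the Results, mapping similar features close together, can be sketched as a similarity-weighted linear transform of the expression matrix. The transform X' = X·S below and all names are illustrative assumptions, not the authors' published code:

```python
# Illustrative sketch (not the paper's implementation): push a sample-by-gene
# expression matrix X through a gene-by-gene similarity matrix S, so that
# highly similar genes end up with near-identical transformed values.
# The exact transform X' = X @ S is an assumption for illustration.

def similarity_transform(X, S):
    """Multiply sample-by-gene matrix X by gene-by-gene similarity matrix S."""
    n_genes = len(S)
    return [[sum(row[k] * S[k][j] for k in range(n_genes))
             for j in range(n_genes)]
            for row in X]

# Two maximally similar genes (off-diagonal similarity = 1) become identical
# after the transform, so a downstream classifier weighs them equally --
# the mechanism the abstract credits for more stable biomarker lists.
X = [[1.0, 0.0],
     [0.0, 2.0]]
S = [[1.0, 1.0],
     [1.0, 1.0]]
print(similarity_transform(X, S))
```

With a block-structured S (e.g. genes in the same pathway highly similar), the same idea spreads relevance across correlated features instead of arbitrarily picking one of them.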
Significance analysis of microarray transcript levels in time series experiments
Background:
Microarray time series studies are essential to understand the dynamics of molecular events. In order to limit the analysis to those genes that change expression over time, a first necessary step is to select differentially expressed transcripts. A variety of methods have been proposed for this purpose; however, these methods are seldom applicable in practice since they require a large number of replicates, often available only for a limited number of samples. In this data-poor context, we evaluate the performance of three selection methods, using synthetic data, over a range of experimental conditions. Application to real data is also discussed.
Results:
Three methods are considered for assessing differentially expressed genes in data-poor conditions. Method 1 uses a threshold on individual samples based on a model of the experimental error. Method 2 calculates the area of the region bounded by the time series expression profiles, and considers the gene differentially expressed if the area exceeds a threshold based on a model of the experimental error. These two methods are compared to Method 3, recently proposed in the literature, which fits splines to compare time series profiles. Application of the three methods to synthetic data indicates that Method 2 outperforms the other two in both Precision and Recall when short time series are analyzed, while Method 3 outperforms the other two for long time series.
Conclusion:
These results help to address the choice of the algorithm to be used in data-poor time series expression studies, depending on the length of the time series.
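The area criterion of Method 2 can be sketched as a numerical integral of the gap between two profiles, compared against an error-derived threshold. The trapezoidal rule and the threshold form below are assumptions for illustration, not the paper's exact procedure:

```python
# Hedged sketch of an area-based criterion in the spirit of Method 2:
# integrate the absolute difference between two expression profiles over time
# and flag the gene as differentially expressed when the area exceeds a
# threshold (which the paper derives from an experimental-error model).

def profile_area(times, a, b):
    """Trapezoidal area between profiles a and b sampled at `times`."""
    area = 0.0
    for i in range(len(times) - 1):
        dt = times[i + 1] - times[i]
        area += 0.5 * (abs(a[i] - b[i]) + abs(a[i + 1] - b[i + 1])) * dt
    return area

def is_differentially_expressed(times, a, b, threshold):
    return profile_area(times, a, b) > threshold

times = [0, 1, 2, 3]
control = [1.0, 1.0, 1.0, 1.0]
treated = [1.0, 2.0, 3.0, 1.5]
print(profile_area(times, control, treated))  # 3.25
```

Because the area accumulates evidence across all time points, a sustained moderate change can pass the threshold even when no single sample does, which is one plausible reason such a criterion helps on short, noisy series.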
A Boolean Approach to Linear Prediction for Signaling Network Modeling
The task of the DREAM4 (Dialogue for Reverse Engineering Assessments and Methods) "Predictive signaling network modeling" challenge was to develop a method that, from single-stimulus/inhibitor data, reconstructs a cause-effect network to be used to predict the protein activity level in multi-stimulus/inhibitor experimental conditions. The method presented in this paper, one of the best performing in this challenge, consists of three steps: 1. Boolean tables are inferred from single-stimulus/inhibitor data to classify whether a particular combination of stimulus and inhibitor affects the protein. 2. A cause-effect network is reconstructed starting from these tables. 3. Training data are linearly combined according to rules inferred from the reconstructed network. This method, although simple, achieves good performance, providing reasonable predictions based on a reconstructed network compatible with knowledge from the literature. It can potentially be used to predict how signaling pathways are affected by different ligands and how this response is altered by diseases.
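The three steps above can be caricatured in a few lines. The noise band, the data layout and the averaging rule below are all illustrative assumptions, not the authors' actual algorithm:

```python
# Illustrative sketch of the Boolean-table + linear-combination idea:
# (1) mark which (stimulus, inhibitor) pairs move a protein's activity beyond
# a noise band around basal, (2) those marks stand in for cause-effect edges,
# (3) predict a multi-stimulus condition by linearly combining (here: simply
# averaging) the single-condition training measurements marked as active.
# NOISE and the averaging rule are assumptions for illustration.

NOISE = 0.1  # assumed noise band around the basal activity level

def boolean_table(basal, single_cond):
    """Map (stimulus, inhibitor) -> True if activity departs from basal."""
    return {cond: abs(act - basal) > NOISE for cond, act in single_cond.items()}

def predict_combined(single_cond, table, conds):
    """Average the single-condition activities the table marks as effective."""
    active = [single_cond[c] for c in conds if table[c]]
    return sum(active) / len(active) if active else None

single = {("EGF", None): 0.9, ("TNF", None): 0.5, ("EGF", "MEKi"): 0.2}
table = boolean_table(basal=0.2, single_cond=single)
print(predict_combined(single, table, [("EGF", None), ("TNF", None)]))
```

The appeal of the approach is that each step is transparent: the Boolean table is directly inspectable against literature pathways before any prediction is made.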
A quantization method based on threshold optimization for microarray short time series
BACKGROUND: Reconstructing regulatory networks from gene expression profiles is a challenging problem of functional genomics. In microarray studies the number of samples is often very limited compared to the number of genes, thus the use of discrete data may help reduce the probability of finding random associations between genes. RESULTS: A quantization method is presented, based on a model of the experimental error and on a significance level that balances false positive and false negative classifications, which can be used as a preliminary step in discrete reverse engineering methods. The method is tested on continuous synthetic data with two discrete reverse engineering methods: Reveal and Dynamic Bayesian Networks. CONCLUSION: The quantization method, evaluated in comparison with two standard methods, a 5% threshold based on experimental error and rank sorting, improves the ability of Reveal and Dynamic Bayesian Networks to identify relations among genes.
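An error-model-based quantization of the kind described can be sketched as a three-level discretization gated by a significance threshold. The Gaussian error assumption and the z-score form below are illustrative, not the paper's exact model:

```python
# Minimal sketch of error-model-based quantization (assumed form, not the
# paper's code): a change relative to a reference level is quantized to
# +1 (up) or -1 (down) only when it exceeds z * sigma, where sigma models the
# experimental error; the significance level z trades false positives
# against false negatives, as the abstract describes.

def quantize(value, reference, sigma, z=1.96):
    """Return +1, -1, or 0 depending on whether `value` differs from
    `reference` by more than z standard deviations of the error model."""
    delta = value - reference
    if delta > z * sigma:
        return 1
    if delta < -z * sigma:
        return -1
    return 0

print([quantize(v, 1.0, sigma=0.1) for v in (1.5, 1.1, 0.6)])  # [1, 0, -1]
```

Discretizing this way before running Reveal or a Dynamic Bayesian Network keeps sub-noise fluctuations out of the dependency search, which is the stated motivation for quantization in data-poor microarray studies.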
Some like it hot: Thermal preference of the groundwater amphipod Niphargus longicaudatus (Costa, 1851) and climate change implications
Groundwater is a crucial resource for humans and the environment, but its global human demand currently exceeds available volumes by 3.5 times. Climate change is expected to exacerbate this situation by increasing the frequency of droughts along with human impacts on groundwater ecosystems. Despite prior research on the quantitative effects of climate change on groundwater, the direct impacts on groundwater biodiversity, especially obligate groundwater species, remain largely unexplored. Therefore, investigating the potential impacts of climate change, including groundwater temperature changes, is crucial for the survival of obligate groundwater species. This study aimed to determine the thermal niche breadth of the crustacean amphipod species Niphargus longicaudatus by using the chronic method. We found that N. longicaudatus has a wide thermal niche with a natural performance range of 7–9 °C, which corresponds to the thermal regime this species experiences within its distribution range in Italy. The observed range of preferred temperature (PT) was different from the mean annual temperature of the sites from which the species has been collected, challenging the idea that groundwater species are only adapted to narrow temperature ranges. Considering the significant threats of climate change to groundwater ecosystems, these findings provide crucial information for the conservation of obligate groundwater species, suggesting that some of them may be more resilient to temperature changes than previously thought. Understanding the fundamental thermal niche of these species can inform conservation efforts and management strategies to protect groundwater ecosystems and their communities.
An Optimized Data Structure for High Throughput 3D Proteomics Data: mzRTree
As an emerging field, MS-based proteomics still requires software tools for
efficiently storing and accessing experimental data. In this work, we focus on
the management of LC-MS data, which are typically made available in standard
XML-based portable formats. The structures that are currently employed to
manage these data can be highly inefficient, especially when dealing with
high-throughput profile data. LC-MS datasets are usually accessed through 2D
range queries. Optimizing this type of operation could dramatically reduce the
complexity of data analysis. We propose a novel data structure for LC-MS
datasets, called mzRTree, which embodies a scalable index based on the R-tree
data structure. mzRTree can be efficiently created from the XML-based data
formats and it is suitable for handling very large datasets. We experimentally
show that, on all range queries, mzRTree outperforms other known structures
used for LC-MS data, even on those queries these structures are optimized for.
Besides, mzRTree is also more space efficient. As a result, mzRTree reduces
data analysis computational costs for very large profile datasets. To be published in
Journal of Proteomics. Source code available at
http://www.dei.unipd.it/mzrtre
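The 2D range queries the abstract refers to select peaks inside a retention-time by m/z rectangle. The toy index below illustrates that access pattern only; it is not mzRTree (which uses a scalable R-tree-based index), and all names are assumptions:

```python
# Not mzRTree itself: a toy index showing the 2D (retention time x m/z)
# range-query access pattern that structures like mzRTree optimize.
# Scans are kept sorted by retention time; peaks within each scan are sorted
# by m/z, so a query reduces to binary searches instead of full scans.
import bisect

class ToyLCMSIndex:
    def __init__(self, scans):
        # scans: list of (rt, [(mz, intensity), ...]) sorted by rt
        self.rts = [rt for rt, _ in scans]
        self.peaks = [sorted(p) for _, p in scans]

    def range_query(self, rt_lo, rt_hi, mz_lo, mz_hi):
        """Return (rt, mz, intensity) for peaks inside the query rectangle."""
        out = []
        for i in range(bisect.bisect_left(self.rts, rt_lo),
                       bisect.bisect_right(self.rts, rt_hi)):
            mzs = [mz for mz, _ in self.peaks[i]]
            lo = bisect.bisect_left(mzs, mz_lo)
            hi = bisect.bisect_right(mzs, mz_hi)
            out.extend((self.rts[i], mz, it) for mz, it in self.peaks[i][lo:hi])
        return out

idx = ToyLCMSIndex([(10.0, [(100.0, 5), (200.0, 7)]),
                    (11.0, [(150.0, 9)])])
print(idx.range_query(9.5, 10.5, 90.0, 160.0))  # [(10.0, 100.0, 5)]
```

A real high-throughput index must also cope with data far larger than memory, which is where a disk-resident R-tree over bounding boxes, rather than per-scan sorted arrays, pays off.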
Deep learning methods to predict amyotrophic lateral sclerosis disease progression
Amyotrophic lateral sclerosis (ALS) is a highly complex and heterogeneous neurodegenerative disease that affects motor neurons. Since life expectancy is relatively short, it is essential to promptly understand the course of the disease to better target the patient's treatment. Predictive models for disease progression are thus of great interest. One of the most extensive and well-studied open-access data resources for ALS is the Pooled Resource Open-Access ALS Clinical Trials (PRO-ACT) repository. In 2015, the DREAM-Phil Bowen ALS Prediction Prize4Life Challenge was held on PRO-ACT data, where competitors were asked to develop machine learning algorithms to predict disease progression measured through the slope of the ALSFRS score between 3 and 12 months. However, although deep learning has already been successfully applied in several studies on ALS patients, to the best of our knowledge deep learning approaches remain unexplored for ALSFRS slope prediction in the PRO-ACT cohort. Here, we investigate how deep learning models perform in predicting ALS progression using the PRO-ACT data. We developed three models based on different architectures that showed comparable or better performance with respect to state-of-the-art models, thus representing a valid alternative to predict ALS disease progression.
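The prediction target described above, the ALSFRS slope between months 3 and 12, is typically computed as a rate of change per month; the exact convention used by the challenge tooling may differ, so the sketch below is illustrative:

```python
# Hedged sketch of the challenge target: the ALSFRS slope between month 3 and
# month 12, expressed as score points per month. The simple two-point formula
# is an assumption; challenge pipelines may fit a regression over all visits.

def alsfrs_slope(score_m3, score_m12):
    """Change in ALSFRS score per month between month 3 and month 12."""
    return (score_m12 - score_m3) / (12 - 3)

print(alsfrs_slope(30, 21))  # -1.0 points per month
```

Framing progression as this single scalar is what lets very different model families, from classical regression to the deep architectures studied here, be compared on the same footing.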
- …