120 research outputs found
A Large Scale Analysis of Information-Theoretic Network Complexity Measures Using Chemical Structures
This paper aims to investigate information-theoretic network complexity measures, which have already been used extensively in mathematical and medicinal chemistry, including drug design. Numerous such measures have been developed so far, but many of them lack a meaningful interpretation; we therefore want to examine which kind of structural information they detect. Our main contribution is to shed light on the relatedness between some selected information measures for graphs by performing a large-scale analysis using chemical networks. Starting from several sets containing real and synthetic chemical structures represented by graphs, we study the relatedness between a classical (partition-based) complexity measure called the topological information content of a graph and some others inferred by a different paradigm leading to partition-independent measures. Moreover, we evaluate the uniqueness of network complexity measures numerically. Generally, a high uniqueness is an important and desirable property when designing novel topological descriptors having the potential to be applied to large chemical databases.
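The classical partition-based measure named in this abstract, the topological information content, is commonly defined as the Shannon entropy of the vertex partition induced by the orbits of the graph's automorphism group. The following is a minimal sketch under that assumed definition, not the authors' code: the function names are hypothetical, and the brute-force search over all vertex permutations is only practical for very small graphs.

```python
from itertools import permutations
from math import log2


def automorphism_orbits(n, edges):
    """Brute-force vertex orbits of an undirected graph on vertices 0..n-1."""
    edge_set = {frozenset(e) for e in edges}
    # Start with every vertex in its own orbit; merge as automorphisms are found.
    orbit_of = list(range(n))
    for perm in permutations(range(n)):
        # perm is an automorphism if it maps the edge set onto itself.
        if {frozenset((perm[u], perm[v])) for u, v in edges} == edge_set:
            for v in range(n):
                a, b = orbit_of[v], orbit_of[perm[v]]
                if a != b:
                    orbit_of = [a if x == b else x for x in orbit_of]
    orbits = {}
    for v, label in enumerate(orbit_of):
        orbits.setdefault(label, []).append(v)
    return list(orbits.values())


def topological_information_content(n, edges):
    """Shannon entropy (in bits) of the orbit partition of the vertex set."""
    sizes = [len(orbit) for orbit in automorphism_orbits(n, edges)]
    return -sum((s / n) * log2(s / n) for s in sizes)


# Path graph 0-1-2: the two end vertices are equivalent, the middle one is not.
path_value = topological_information_content(3, [(0, 1), (1, 2)])
# Triangle: all vertices are equivalent, so the partition carries no information.
triangle_value = topological_information_content(3, [(0, 1), (1, 2), (0, 2)])
```

A highly symmetric graph collapses into few orbits and scores low, while a graph with no non-trivial automorphisms reaches the maximum of log2(n) bits.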
Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms
Background: Data generated using ‘omics’ technologies are characterized by high dimensionality, where the number of features measured per subject vastly exceeds the number of subjects in the study. In this paper, we consider issues relevant in the design of biomedical studies in which the goal is the discovery of a subset of features and an associated algorithm that can predict a binary outcome, such as disease status. We compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests and Support Vector Machines) in high-dimensionality data settings. We evaluate the effects of varying levels of signal-to-noise ratio in the dataset, imbalance in class distribution and choice of metric for quantifying performance of the classifier. To guide study design, we present a summary of the key characteristics of ‘omics’ data profiled in several human or animal model experiments utilizing high-content mass spectrometry and multiplexed immunoassay based techniques.
Results: The analysis of data from seven ‘omics’ studies revealed that the average magnitude of effect size observed in human studies was markedly lower when compared to that in animal studies. The data measured in human studies were characterized by higher biological variation and the presence of outliers. The results from simulation studies indicated that the classifier Prediction Analysis for Microarrays (PAM) had the highest power when the class conditional feature distributions were Gaussian and outcome distributions were balanced. Random Forests was optimal when feature distributions were skewed and when class distributions were unbalanced. We provide a free open-source R statistical software library (MVpower) that implements the simulation strategy proposed in this paper.
Conclusion: No single classifier had optimal performance under all settings. Simulation studies provide useful guidance for the design of biomedical studies involving high-dimensionality data.
Structural Measures for Network Biology Using QuACN
Background: Structural measures for networks have been extensively developed, but many of them have not yet demonstrated their sustainability. That is, it often remains unclear whether a particular measure is useful and feasible for solving a particular problem in network biology. One example is the classification of complex biological networks, for which structural measures are used to achieve a minimal classification error. Hence, there is a strong need to provide freely available software packages to calculate and demonstrate the appropriate usage of structural graph measures in network biology. Results: Here, we discuss topological network descriptors that are implemented in the R package QuACN and demonstrate their behavior and characteristics by applying them to a set of example graphs. Moreover, we show a representative application to illustrate their capabilities for classifying biological networks. In particular, we infer gene regulatory networks from microarray data and classify them by methods provided by QuACN. Note that QuACN is the first freely available software written in R containing a large number of structural graph measures. Conclusion: The R package QuACN is under ongoing development, and we continuously add promising groups of topological network descriptors. The package can be used to answer intriguing research questions in network biology, e.g., classifying biological data or identifying meaningful biological features, by analyzing the topology o
Effects of Pooling Samples on the Performance of Classification Algorithms: A Comparative Study
A pooling design can be used as a powerful strategy to compensate for limited amounts of samples or high biological variation. In this paper, we perform a comparative study to model and quantify the effects of virtual pooling on the performance of the widely applied classifiers support vector machines (SVMs), random forest (RF), k-nearest neighbors (k-NN), penalized logistic regression (PLR), and prediction analysis for microarrays (PAM). We evaluate a variety of experimental designs using mock omics datasets with varying pool sizes and considering effects from feature selection. Our results show that feature selection significantly improves classifier performance for non-pooled and pooled data. All investigated classifiers yield lower misclassification rates with smaller pool sizes. RF mainly outperforms the other investigated algorithms, while accuracy levels are comparable among all the remaining ones. Guidelines are derived to identify an optimal pooling scheme for obtaining adequate predictive power and, hence, to motivate a study design that best meets experimental objectives and budgetary conditions, including time constraints.
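The abstract does not spell out how virtual pooling is modeled. A common assumption is that a pooled measurement equals the arithmetic mean of the individual samples combined into the pool. The sketch below illustrates that assumption only; the function name `virtual_pool`, its parameters, and the choice to discard incomplete pools are hypothetical and not taken from the paper.

```python
import random
from statistics import fmean


def virtual_pool(samples, labels, pool_size, seed=0):
    """Average random same-class groups of `pool_size` samples into pooled samples."""
    rng = random.Random(seed)
    # Group feature vectors by class label so pools never mix classes.
    by_class = {}
    for features, label in zip(samples, labels):
        by_class.setdefault(label, []).append(features)

    pooled, pooled_labels = [], []
    for label, rows in by_class.items():
        rows = rows[:]          # copy so the caller's data is not reordered
        rng.shuffle(rows)
        # Drop an incomplete trailing group so every pool has the same size.
        for start in range(0, len(rows) - pool_size + 1, pool_size):
            group = rows[start:start + pool_size]
            # Feature-wise mean across the members of the pool.
            pooled.append([fmean(column) for column in zip(*group)])
            pooled_labels.append(label)
    return pooled, pooled_labels


# Eight mock samples (two features each), four per class, pooled in pairs.
mock_samples = [[1.0, 2.0]] * 4 + [[3.0, 4.0]] * 4
mock_labels = [0] * 4 + [1] * 4
pooled_samples, pooled_labels = virtual_pool(mock_samples, mock_labels, pool_size=2)
```

Larger pools reduce the number of training examples available to a classifier while averaging away part of the biological variation, which is the trade-off the study quantifies.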
Dynamic simulations on the mitochondrial fatty acid Beta-oxidation network
Background: The oxidation of fatty acids in mitochondria plays an important role in energy metabolism, and genetic disorders of this pathway may cause metabolic diseases. Enzyme deficiencies can block the metabolism at defined reactions in the mitochondrion and lead to accumulation of specific substrates causing severe clinical manifestations. Ten of the disorders directly affecting mitochondrial fatty acid oxidation have been well-defined, implicating episodic hypoketotic hypoglycemia provoked by catabolic stress, multiple organ failure, muscle weakness, or hypertrophic cardiomyopathy. Additionally, syndromes of severe maternal illness (HELLP syndrome and AFLP) have been associated with pregnancies carrying a fetus affected by fatty acid oxidation deficiencies. However, little is known about fatty acid kinetics, especially during fasting or exercise when the demand for fatty acid oxidation is increased (catabolic stress).
Results: A computational kinetic network of 64 reactions with 91 compounds and 301 parameters was constructed to study dynamic properties of mitochondrial fatty acid β-oxidation. Various deficiencies of acyl-CoA dehydrogenase were simulated and verified with measured concentrations of indicative metabolites of screened newborns in Middle Europe and South Australia. The simulated accumulation of specific acyl-CoAs according to the investigated enzyme deficiencies is in agreement with experimental data and findings in literature. Investigation of the dynamic properties of the fatty acid β-oxidation reveals that the formation of acetyl-CoA (the substrate for energy production) is highly impaired within the first hours of fasting, corresponding to the rapid progress to coma within 1–2 hours. LCAD deficiency exhibits the highest accumulation of fatty acids, along with a marked increase of these substrates during catabolic stress and the lowest production rate of acetyl-CoA. These findings might confirm gestational loss to be the explanation that no human cases of LCAD deficiency have been described.
Conclusion: In summary, this work provides a detailed kinetic model of mitochondrial metabolism with specific focus on fatty acid β-oxidation to simulate and predict the dynamic response of that metabolic network in the context of human disease. Our findings offer insight into the disease process (e.g. rapid progress to coma) and might confirm new explanations (no human cases of LCAD deficiency), which can hardly be obtained from experimental data alone.
Novel topological descriptors for analyzing biological networks
Background: Topological descriptors, other graph measures, and in a broader sense, graph-theoretical methods, have proven to be powerful tools for biological network analysis. However, the majority of the developed descriptors and graph-theoretical methods cannot take vertex and edge labels into account, e.g., atom and bond types when considering molecular graphs. Indeed, this feature is important to characterize biological networks more meaningfully instead of only considering pure topological information.
Results: In this paper, we put the emphasis on analyzing a special type of biological network, namely biochemical structures. First, we derive entropic measures to calculate the information content of vertex- and edge-labeled graphs and investigate some useful properties thereof. Second, we apply the mentioned measures combined with other well-known descriptors to supervised machine learning methods for predicting Ames mutagenicity. Moreover, we investigate the influence of our topological descriptors (measures for only unlabeled vs. measures for labeled graphs) on the prediction performance of the underlying graph classification problem.
Conclusions: Our study demonstrates that the application of entropic measures to molecules representing graphs is useful to characterize such structures meaningfully. For instance, we have found that if one extends the measures for determining the structural information content of unlabeled graphs to labeled graphs, the uniqueness of the resulting indices is higher. Because measures to structurally characterize labeled graphs are clearly underrepresented so far, the further development of such methods might be valuable and fruitful for solving problems within biological network analysis.
COMMODE: a large-scale database of molecular descriptors using compounds from PubChem
BACKGROUND: Molecular descriptors have been extensively used in the field of structure-oriented drug design and structural chemistry. They have been applied in QSPR and QSAR models to predict ADME-Tox properties, which specify essential features for drugs. Molecular descriptors capture chemical and structural information, but investigating their interpretation and meaning remains very challenging. RESULTS: This paper introduces a large-scale database of molecular descriptors called COMMODE, containing more than 25 million compounds originating from PubChem. About 2500 DRAGON descriptors have been calculated for all compounds and integrated into this database, which is accessible through a web interface at http://commode.i-med.ac.at