
Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms

Author: Balasubramanian Raji
Graber Armin
Guo Yu
McBurney Robert N
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Data generated using ‘omics’ technologies are characterized by high dimensionality, where the number of features measured per subject vastly exceeds the number of subjects in the study. In this paper, we consider issues relevant in the design of biomedical studies in which the goal is the discovery of a subset of features and an associated algorithm that can predict a binary outcome, such as disease status. We compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests and Support Vector Machines) in high-dimensionality data settings. We evaluate the effects of varying levels of signal-to-noise ratio in the dataset, imbalance in class distribution and choice of metric for quantifying performance of the classifier. To guide study design, we present a summary of the key characteristics of ‘omics’ data profiled in several human or animal model experiments utilizing high-content mass spectrometry and multiplexed immunoassay based techniques. Results: The analysis of data from seven ‘omics’ studies revealed that the average magnitude of effect size observed in human studies was markedly lower when compared to that in animal studies. The data measured in human studies were characterized by higher biological variation and the presence of outliers. The results from simulation studies indicated that the classifier Prediction Analysis for Microarrays (PAM) had the highest power when the class conditional feature distributions were Gaussian and outcome distributions were balanced. Random Forests was optimal when feature distributions were skewed and when class distributions were unbalanced. We provide a free open-source R statistical software library (MVpower) that implements the simulation strategy proposed in this paper. Conclusion: No single classifier had optimal performance under all settings. 
Simulation studies provide useful guidance for the design of biomedical studies involving high-dimensionality data.
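The simulation idea in this abstract can be illustrated with a minimal sketch. It is not the MVpower library and not the authors' procedure: the nearest-centroid classifier is only a simplified stand-in for PAM (no centroid shrinkage), classes are balanced, and "power" is assumed here to mean the fraction of simulated studies whose test accuracy reaches a chosen target. All function names and default parameters are hypothetical.

```python
import random


def simulate_dataset(n_per_class, n_features, n_informative, effect_size, rng):
    """Draw two balanced classes of unit-variance Gaussian features.

    Only the first n_informative features differ in mean between classes
    (by effect_size); the remaining features are pure noise.
    """
    rows = []
    for label in (0, 1):
        for _ in range(n_per_class):
            features = [
                rng.gauss(effect_size * label if j < n_informative else 0.0, 1.0)
                for j in range(n_features)
            ]
            rows.append((features, label))
    return rows


def fit_centroids(rows):
    """Compute the per-class mean vector (no shrinkage, unlike PAM)."""
    centroids = {}
    for label in (0, 1):
        members = [x for x, y in rows if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest in squared distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def estimate_power(n_per_class, effect_size, n_features=50, n_informative=5,
                   target_accuracy=0.75, n_sim=100, n_test=50, seed=0):
    """Fraction of simulated studies whose test accuracy meets the target."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        train = simulate_dataset(n_per_class, n_features, n_informative,
                                 effect_size, rng)
        test = simulate_dataset(n_test, n_features, n_informative,
                                effect_size, rng)
        centroids = fit_centroids(train)
        correct = sum(predict(centroids, x) == y for x, y in test)
        if correct / len(test) >= target_accuracy:
            hits += 1
    return hits / n_sim
```

Calling `estimate_power` over a grid of `n_per_class` and `effect_size` values gives a rough power curve for one classifier. Comparing classifiers, skewed feature distributions, or unbalanced classes would require extending the data generator and swapping in other models.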

ScholarWorks@UMass Amherst

A network-based approach to classify the three domains of life

Author: Dehmer Matthias
Graber Armin
Kugler Karl G
Mueller Laurin AJ
Netzer Michael
Publication venue: BioMed Central
Publication date: 01/01/2011

Queen's University Belfast Research Portal

Structural Measures for Network Biology Using QuACN

Author: Dehmer Matthias
Emmert-Streib Frank
Graber Armin
Kugler Karl G
Mueller Laurin AJ
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Structural measures for networks have been extensively developed, but many of them have not yet demonstrated their sustainability. That is, it often remains unclear whether a particular measure is useful and feasible for solving a particular problem in network biology. One example is the classification of complex biological networks, where structural measures are used to achieve a minimal classification error. Hence, there is a strong need for freely available software packages that calculate structural graph measures and demonstrate their appropriate usage in network biology. Results: Here, we discuss topological network descriptors that are implemented in the R package QuACN and demonstrate their behavior and characteristics by applying them to a set of example graphs. Moreover, we show a representative application to illustrate their capabilities for classifying biological networks. In particular, we infer gene regulatory networks from microarray data and classify them by methods provided by QuACN. Note that QuACN is the first freely available software written in R containing a large number of structural graph measures. Conclusion: The R package QuACN is under ongoing development and we add promising groups of topological network descriptors continuously. The package can be used to answer intriguing research questions in network biology, e.g., classifying biological data or identifying meaningful biological features, by analyzing the topology o
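As an illustration of what a topological network descriptor computes, the following sketch evaluates the Wiener index, a classic distance-based descriptor defined as the sum of shortest-path distances over all unordered vertex pairs. This is plain Python, not the QuACN API; the function names and the adjacency-dictionary input format are assumptions made for this example.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path (hop) distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def wiener_index(adj):
    """Sum of shortest-path distances over all unordered vertex pairs.

    adj maps each vertex to an iterable of its neighbours and must describe
    a connected, undirected graph.
    """
    total = 0
    for source in adj:
        dist = bfs_distances(adj, source)
        if len(dist) != len(adj):
            raise ValueError("graph must be connected")
        total += sum(dist.values())
    # Every pair was counted twice, once from each endpoint.
    return total // 2


# Example graphs: a path on four vertices and a star with three leaves.
path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star3 = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
```

A descriptor like this maps each network to a single number, so a collection of descriptors yields a feature vector that a standard classifier can consume.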

CiteSeerX

Effects of Pooling Samples on the Performance of Classification Algorithms: A Comparative Study

Author: Baumgartner Christian
Dehmer Matthias
Graber Armin
Kusonmano Kanthida
Liedl Klaus R.
Netzer Michael
Publication venue: The Scientific World Journal
Publication date: 01/01/2012

A pooling design can be used as a powerful strategy to compensate for limited amounts of samples or high biological variation. In this paper, we perform a comparative study to model and quantify the effects of virtual pooling on the performance of five widely applied classifiers: support vector machines (SVMs), random forest (RF), k-nearest neighbors (k-NN), penalized logistic regression (PLR), and prediction analysis for microarrays (PAM). We evaluate a variety of experimental designs using mock omics datasets with varying pool sizes, also considering the effects of feature selection. Our results show that feature selection significantly improves classifier performance for non-pooled and pooled data. All investigated classifiers yield lower misclassification rates with smaller pool sizes. RF mostly outperforms the other investigated algorithms, while accuracy levels are comparable among the remaining ones. Guidelines are derived to identify an optimal pooling scheme for obtaining adequate predictive power and, hence, to motivate a study design that best meets experimental objectives and budgetary conditions, including time constraints.
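Virtual pooling can be sketched as averaging the feature vectors of samples from the same class in disjoint groups. The snippet below is only an illustration under that assumption; the abstract does not specify the exact pooling procedure, and the function name, the dropping of leftover samples, and the optional shuffling are choices made for this example.

```python
def pool_samples(samples, pool_size, rng=None):
    """Average disjoint groups of pool_size samples, feature by feature.

    samples is a list of equal-length feature vectors from ONE class.
    If rng (e.g. a random.Random instance) is given, samples are shuffled
    before grouping. Leftover samples that do not fill a pool are dropped.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    order = list(range(len(samples)))
    if rng is not None:
        rng.shuffle(order)
    pooled = []
    for start in range(0, len(order) - pool_size + 1, pool_size):
        group = [samples[i] for i in order[start:start + pool_size]]
        pooled.append([sum(col) / pool_size for col in zip(*group)])
    return pooled
```

Pooling each class separately and then training a classifier on the pooled vectors gives one way to compare misclassification rates across pool sizes. Averaging reduces within-class variance but also reduces the number of training examples, which is the trade-off the study quantifies.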

Dynamic simulations on the mitochondrial fatty acid Beta-oxidation network

Author: Graber Armin
Modre-Osprian Robert
Osprian Ingrid
Schreier Günter
Tilg Bernhard
Weinberger Klaus M
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: The oxidation of fatty acids in mitochondria plays an important role in energy metabolism, and genetic disorders of this pathway may cause metabolic diseases. Enzyme deficiencies can block the metabolism at defined reactions in the mitochondrion and lead to accumulation of specific substrates causing severe clinical manifestations. Ten of the disorders directly affecting mitochondrial fatty acid oxidation have been well-defined, implicating episodic hypoketotic hypoglycemia provoked by catabolic stress, multiple organ failure, muscle weakness, or hypertrophic cardiomyopathy. Additionally, syndromes of severe maternal illness (HELLP syndrome and AFLP) have been associated with pregnancies carrying a fetus affected by fatty acid oxidation deficiencies. However, little is known about fatty acid kinetics, especially during fasting or exercise when the demand for fatty acid oxidation is increased (catabolic stress). Results: A computational kinetic network of 64 reactions with 91 compounds and 301 parameters was constructed to study dynamic properties of mitochondrial fatty acid β-oxidation. Various deficiencies of acyl-CoA dehydrogenase were simulated and verified with measured concentrations of indicative metabolites of screened newborns in Middle Europe and South Australia. The simulated accumulation of specific acyl-CoAs according to the investigated enzyme deficiencies is in agreement with experimental data and findings in the literature. Investigation of the dynamic properties of fatty acid β-oxidation reveals that the formation of acetyl-CoA (the substrate for energy production) is highly impaired within the first hours of fasting, corresponding to the rapid progress to coma within 1–2 hours. LCAD deficiency exhibits the highest accumulation of fatty acids, along with a marked increase of these substrates during catabolic stress and the lowest production rate of acetyl-CoA.
These findings might confirm gestational loss as the explanation for why no human cases of LCAD deficiency have been described. Conclusion: In summary, this work provides a detailed kinetic model of mitochondrial metabolism with specific focus on fatty acid β-oxidation to simulate and predict the dynamic response of that metabolic network in the context of human disease. Our findings offer insight into the disease process (e.g. rapid progress to coma) and might confirm new explanations (no human cases of LCAD deficiency), which can hardly be obtained from experimental data alone.
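The qualitative effect described here, substrate accumulation upstream of a deficient enzyme, can be shown with a toy model. The sketch below is not the 64-reaction network from the paper: it assumes a two-step chain with Michaelis-Menten kinetics, a constant influx, forward Euler integration, and arbitrary parameter values, all chosen only for illustration.

```python
def simulate_chain(vmax1, vmax2, km=1.0, influx=1.0, dt=0.01, t_end=50.0):
    """Integrate a toy pathway: influx -> A -(enzyme 1)-> B -(enzyme 2)-> product.

    Both enzymatic steps follow Michaelis-Menten kinetics with the same km.
    Returns the concentrations of A and B at time t_end (arbitrary units).
    """
    a = 0.0
    b = 0.0
    for _ in range(int(t_end / dt)):
        v1 = vmax1 * a / (km + a)  # rate of A -> B
        v2 = vmax2 * b / (km + b)  # rate of B -> product
        a += dt * (influx - v1)
        b += dt * (v1 - v2)
    return a, b


# Hypothetical scenarios: full enzyme 2 capacity vs. a reduced capacity
# that only barely exceeds the influx.
a_normal, b_normal = simulate_chain(vmax1=2.0, vmax2=2.0)
a_deficient, b_deficient = simulate_chain(vmax1=2.0, vmax2=1.1)
```

In this toy setting the intermediate B builds up when the capacity of enzyme 2 is lowered, while A is unaffected because nothing downstream feeds back on it. A realistic model would need measured kinetic parameters and a stiff ODE solver rather than fixed-step Euler integration.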

Novel topological descriptors for analyzing biological networks

Author: Armin A Graber
Kurt K Varmuza
Matthias M Dehmer
Nicola N Barbarini
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Topological descriptors, other graph measures, and in a broader sense, graph-theoretical methods, have proven to be powerful tools for biological network analysis. However, most of the developed descriptors and graph-theoretical methods cannot take vertex and edge labels into account, e.g., atom and bond types when considering molecular graphs. Indeed, this feature is important for characterizing biological networks more meaningfully instead of only considering pure topological information. Results: In this paper, we put the emphasis on analyzing a special type of biological network, namely biochemical structures. First, we derive entropic measures to calculate the information content of vertex- and edge-labeled graphs and investigate some useful properties thereof. Second, we apply the mentioned measures combined with other well-known descriptors to supervised machine learning methods for predicting Ames mutagenicity. Moreover, we investigate the influence of our topological descriptors (measures for unlabeled graphs only vs. measures for labeled graphs) on the prediction performance of the underlying graph classification problem. Conclusions: Our study demonstrates that the application of entropic measures to graphs representing molecules is useful to characterize such structures meaningfully. For instance, we have found that if one extends the measures for determining the structural information content of unlabeled graphs to labeled graphs, the uniqueness of the resulting indices is higher. Because measures to structurally characterize labeled graphs are clearly underrepresented so far, the further development of such methods might be valuable and fruitful for solving problems within biological network analysis.
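To make the idea of an entropic measure on a labeled graph concrete, the sketch below computes the Shannon entropy of the vertex-label and edge-label frequency distributions. This is a deliberately simple stand-in, not the measures derived in the paper; the function names and the dictionary-based graph representation are assumptions for this example.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts)


def vertex_label_entropy(vertex_labels):
    """Entropy of the vertex-label distribution, e.g. atom types.

    vertex_labels maps each vertex to its label.
    """
    return shannon_entropy(Counter(vertex_labels.values()).values())


def edge_label_entropy(edge_labels):
    """Entropy of the edge-label distribution, e.g. bond types.

    edge_labels maps each edge (a pair of vertices) to its label.
    """
    return shannon_entropy(Counter(edge_labels.values()).values())


# Two hypothetical molecular graphs with the same topology (a 3-vertex path)
# but different atom labels on the heavy atoms.
propane_atoms = {0: "C", 1: "C", 2: "C"}
ethanol_atoms = {0: "C", 1: "C", 2: "O"}
single_bonds = {(0, 1): "single", (1, 2): "single"}
```

Any descriptor that ignores labels assigns the same value to both example graphs because their topology is identical, whereas the label entropy separates them. This is a small instance of the higher uniqueness that the abstract reports for measures on labeled graphs.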

CiteSeerX

Springer

[COMMODE] a large-scale database of molecular descriptors using compounds from PubChem

Author: Andreas Dander
Armin Graber
Frank Emmert-Streib
Laurin AJ Mueller
Matthias Dehmer
Ralf Gallasch
Stephan Pabinger
Publication venue: Springer Nature
Publication date: 01/01/2013

BACKGROUND: Molecular descriptors have been extensively used in the field of structure-oriented drug design and structural chemistry. They have been applied in QSPR and QSAR models to predict ADME-Tox properties, which specify essential features for drugs. Molecular descriptors capture chemical and structural information, but investigating their interpretation and meaning remains very challenging. RESULTS: This paper introduces a large-scale database of molecular descriptors called COMMODE, containing more than 25 million compounds originating from PubChem. About 2500 DRAGON descriptors have been calculated for all compounds and integrated into this database, which is accessible through a web interface at http://commode.i-med.ac.at

Profiling the human response to physical exercise: a computational strategy for the identification and kinetic analysis of metabolic biomarkers

Author: Baumgartner Christian
Fang Xiaocong
Graber Armin
Handler Michael
Kugler Karl G
Netzer Michael
Seger Michael
Weinberger Klaus M
Publication venue: BioMed Central
Publication date: 01/01/2011

MADAM - An open source meta-analysis toolbox for R and Bioconductor

Author: Armin Graber
DR Rhodes
F Hong
I Borozan
JD Storey
JK Choi
Karl G Kugler
Laurin AJ Mueller
O Troyanskaya
RA Fisher
RC Gentleman
VG Tusher
Y Moreau
Publication venue: BioMed Central
Publication date: 01/01/2010