1,407 research outputs found

    An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests

    Get PDF
    Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Especially random forests, that can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine and bioinformatics within the past few years. High dimensional problems are common not only in genetics, but also in some areas of psychological research, where only few subjects can be measured due to time or cost constraints, yet a large amount of data is generated for each subject. Random forests have been shown to achieve a high prediction accuracy in such applications, and provide descriptive variable importance measures reflecting the impact of each variable in both main effects and interactions. The aim of this work is to introduce the principles of the standard recursive partitioning methods as well as recent methodological improvements, to illustrate their usage for low and high dimensional data exploration, but also to point out limitations of the methods and potential pitfalls in their practical application. Application of the methods is illustrated using freely available implementations in the R system for statistical computing

    A novel approach to detect hot-spots in large-scale multivariate data

    Get PDF
    Background: Progressive advances in the measurement of complex multifactorial components of biological processes involving both spatial and temporal domains have made it difficult to identify the variables (genes, proteins, neurons etc.) significantly changed activities in response to a stimulus within large data sets using conventional statistical approaches. The set of all changed variables is termed hot-spots. The detection of such hot spots is considered to be an NP hard problem, but by first establishing its theoretical foundation we have been able to develop an algorithm that provides a solution. Results: Our results show that a first-order phase transition is observable whose critical point separates the hot-spot set from the remaining variables. Its application is also found to be more successful than existing approaches in identifying statistically significant hot-spots both with simulated data sets and in real large-scale multivariate data sets from gene arrays, electrophysiological recording and functional magnetic resonance imaging experiments. Conclusion: In summary, this new statistical algorithm should provide a powerful new analytical tool to extract the maximum information from complex biological multivariate data

    Physico-chemical foundations underpinning microarray and next-generation sequencing experiments

    Get PDF
    Hybridization of nucleic acids on solid surfaces is a key process involved in high-throughput technologies such as microarrays and, in some cases, next-generation sequencing (NGS). A physical understanding of the hybridization process helps to determine the accuracy of these technologies. The goal of a widespread research program is to develop reliable transformations between the raw signals reported by the technologies and individual molecular concentrations from an ensemble of nucleic acids. This research has inputs from many areas, from bioinformatics and biostatistics, to theoretical and experimental biochemistry and biophysics, to computer simulations. A group of leading researchers met in Ploen Germany in 2011 to discuss present knowledge and limitations of our physico-chemical understanding of high-throughput nucleic acid technologies. This meeting inspired us to write this summary, which provides an overview of the state-of-the-art approaches based on physico-chemical foundation to modeling of the nucleic acids hybridization process on solid surfaces. In addition, practical application of current knowledge is emphasized

    A Computational Pipeline for the Development of Multi-Marker Bio-Signature Panels and Ensemble Classifiers

    Get PDF
    BACKGROUND:Biomarker panels derived separately from genomic and proteomic data and with a variety of computational methods have demonstrated promising classification performance in various diseases. An open question is how to create effective proteo-genomic panels. The framework of ensemble classifiers has been applied successfully in various analytical domains to combine classifiers so that the performance of the ensemble exceeds the performance of individual classifiers. Using blood-based diagnosis of acute renal allograft rejection as a case study, we address the following question in this paper: Can acute rejection classification performance be improved by combining individual genomic and proteomic classifiers in an ensemble?RESULTS:The first part of the paper presents a computational biomarker development pipeline for genomic and proteomic data. The pipeline begins with data acquisition (e.g., from bio-samples to microarray data), quality control, statistical analysis and mining of the data, and finally various forms of validation. The pipeline ensures that the various classifiers to be combined later in an ensemble are diverse and adequate for clinical use. Five mRNA genomic and five proteomic classifiers were developed independently using single time-point blood samples from 11 acute-rejection and 22 non-rejection renal transplant patients. The second part of the paper examines five ensembles ranging in size from two to 10 individual classifiers. Performance of ensembles is characterized by area under the curve (AUC), sensitivity, and specificity, as derived from the probability of acute rejection for individual classifiers in the ensemble in combination with one of two aggregation methods: (1) Average Probability or (2) Vote Threshold. One ensemble demonstrated superior performance and was able to improve sensitivity and AUC beyond the best values observed for any of the individual classifiers in the ensemble, while staying within the range of observed specificity. The Vote Threshold aggregation method achieved improved sensitivity for all 5 ensembles, but typically at the cost of decreased specificity.CONCLUSION:Proteo-genomic biomarker ensemble classifiers show promise in the diagnosis of acute renal allograft rejection and can improve classification performance beyond that of individual genomic or proteomic classifiers alone. Validation of our results in an international multicenter study is currently underway

    Algorithmic and Statistical Perspectives on Large-Scale Data Analysis

    Full text link
    In recent years, ideas from statistics and scientific computing have begun to interact in increasingly sophisticated and fruitful ways with ideas from computer science and the theory of algorithms to aid in the development of improved worst-case algorithms that are useful for large-scale scientific and Internet data analysis problems. In this chapter, I will describe two recent examples---one having to do with selecting good columns or features from a (DNA Single Nucleotide Polymorphism) data matrix, and the other having to do with selecting good clusters or communities from a data graph (representing a social or information network)---that drew on ideas from both areas and that may serve as a model for exploiting complementary algorithmic and statistical perspectives in order to solve applied large-scale data analysis problems.Comment: 33 pages. To appear in Uwe Naumann and Olaf Schenk, editors, "Combinatorial Scientific Computing," Chapman and Hall/CRC Press, 201

    Gene set based ensemble methods for cancer classification

    Get PDF
    Diagnosis of cancer very often depends on conclusions drawn after both clinical and microscopic examinations of tissues to study the manifestation of the disease in order to place tumors in known categories. One factor which determines the categorization of cancer is the tissue from which the tumor originates. Information gathered from clinical exams may be partial or not completely predictive of a specific category of cancer. Further complicating the problem of categorizing various tumors is that the histological classification of the cancer tissue and description of its course of development may be atypical. Gene expression data gleaned from micro-array analysis provides tremendous promise for more accurate cancer diagnosis. One hurdle in the classification of tumors based on gene expression data is that the data space is ultra-dimensional with relatively few points; that is, there are a small number of examples with a large number of genes. A second hurdle is expression bias caused by the correlation of genes. Analysis of subsets of genes, known as gene set analysis, provides a mechanism by which groups of differentially expressed genes can be identified. We propose an ensemble of classifiers whose base classifiers are ℓ1-regularized logistic regression models with restriction of the feature space to biologically relevant genes. Some researchers have already explored the use of ensemble classifiers to classify cancer but the effect of the underlying base classifiers in conjunction with biologically-derived gene sets on cancer classification has not been explored

    Supervised machine learning and heterotic classification of maize (Zea mays L.) using molecular marker data

    Get PDF
    The development of molecular techniques for genetic analysis has enabled great advances in cereal breeding. However, their usefulness in hybrid breeding, particularly in assigning new lines to heterotic groups previously established, still remains unsolved. In this work we evaluate the performance of several state-of-art multiclass classifiers onto three molecular marker datasets representing a broad spectrum of maize heterotic patterns. Even though results are variable, they suggest supervised learning algorithms as a valuable complement to traditional breeding programs.Fil: Ornella, Leonardo Alfredo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas. Universidad Nacional de Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y de Sistemas; ArgentinaFil: Tapia, Elizabeth. Universidad Nacional de Rosario. Facultad de Ciencias Exactas, Ingeniería y Agrimensura; Argentin

    Unsupervised Algorithms for Microarray Sample Stratification

    Get PDF
    The amount of data made available by microarrays gives researchers the opportunity to delve into the complexity of biological systems. However, the noisy and extremely high-dimensional nature of this kind of data poses significant challenges. Microarrays allow for the parallel measurement of thousands of molecular objects spanning different layers of interactions. In order to be able to discover hidden patterns, the most disparate analytical techniques have been proposed. Here, we describe the basic methodologies to approach the analysis of microarray datasets that focus on the task of (sub)group discovery.Peer reviewe
    corecore