
    Random Forests for Big Data

    Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involves massive data, but it also often includes online data and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods, and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles, in a single and versatile framework, regression problems as well as two-class and multi-class classification problems. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities -- such as out-of-bag error and variable importance -- are addressed in these methods. We then formulate various remarks about random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets (15 and 120 million observations), one simulated and one drawn from real-world data. One variant relies on subsampling, while three others are related to parallel implementations of random forests and involve either various adaptations of the bootstrap to Big Data or "divide-and-conquer" approaches. The fifth variant is based on online learning of random forests. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.
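    As an illustration of the "divide-and-conquer" flavour of such variants, the following sketch (hypothetical, assuming scikit-learn and its RandomForestClassifier; it is not any of the implementations reviewed in the paper) fits one forest per data chunk and then pools the trees into a single ensemble.

        # Hypothetical "divide-and-conquer" random forest sketch, assuming scikit-learn.
        # Each chunk of a large dataset gets its own forest; the chunk-level forests
        # are then merged by pooling their trees into one ensemble.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier

        X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

        chunks = np.array_split(np.arange(len(X)), 10)   # 10 disjoint blocks of row indices
        sub_forests = []
        for idx in chunks:
            rf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)
            rf.fit(X[idx], y[idx])                       # forest trained on one chunk only
            sub_forests.append(rf)

        # Merge by pooling the trees of all chunk-level forests into one ensemble.
        merged = sub_forests[0]
        merged.estimators_ = [t for rf in sub_forests for t in rf.estimators_]
        merged.n_estimators = len(merged.estimators_)

        print("pooled trees:", merged.n_estimators)
        print("training accuracy:", merged.score(X, y))

    In practice, each chunk-level forest could be trained on a separate machine, with only the fitted trees shipped back for pooling.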

    An Introduction to Programming for Bioscientists: A Python-based Primer

    Computing has revolutionized the biological sciences over the past several decades, such that virtually all contemporary research in the biosciences utilizes computer programs. The computational advances have come on many fronts, spurred by fundamental developments in hardware, software, and algorithms. These advances have influenced, and even engendered, a phenomenal array of bioscience fields, including molecular evolution and bioinformatics; genome-, proteome-, transcriptome- and metabolome-wide experimental studies; structural genomics; and atomistic simulations of cellular-scale molecular assemblies as large as ribosomes and intact viruses. In short, much of post-genomic biology is increasingly becoming a form of computational biology. The ability to design and write computer programs is among the most indispensable skills that a modern researcher can cultivate. Python has become a popular programming language in the biosciences, largely because (i) its straightforward semantics and clean syntax make it a readily accessible first language; (ii) it is expressive and well-suited to object-oriented programming, as well as other modern paradigms; and (iii) the many available libraries and third-party toolkits extend the functionality of the core language into virtually every biological domain (sequence and structure analyses, phylogenomics, workflow management systems, etc.). This primer offers a basic introduction to coding, via Python, and it includes concrete examples and exercises to illustrate the language's usage and capabilities; the main text culminates with a final project in structural bioinformatics. A suite of Supplemental Chapters is also provided. Starting with basic concepts, such as that of a 'variable', the Chapters methodically advance the reader to the point of writing a graphical user interface to compute the Hamming distance between two DNA sequences. Comment: 65 pages total, including 45 pages of text, 3 figures, 4 tables, numerous exercises, and 19 pages of Supporting Information; currently in press at PLOS Computational Biology
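    As a taste of the kind of exercise the primer builds toward, here is a minimal, self-contained Python function computing the Hamming distance between two equal-length DNA sequences; the function name and example sequences are illustrative and not taken from the primer.

        # Minimal sketch: Hamming distance between two equal-length DNA sequences,
        # i.e. the number of positions at which the sequences differ.
        def hamming_distance(seq1: str, seq2: str) -> int:
            """Return the number of positions at which seq1 and seq2 differ."""
            if len(seq1) != len(seq2):
                raise ValueError("Hamming distance requires sequences of equal length")
            return sum(a != b for a, b in zip(seq1, seq2))

        if __name__ == "__main__":
            print(hamming_distance("GATTACA", "GACTATA"))   # prints 2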

    A computational pipeline for the diagnosis of CVID patients

    Common variable immunodeficiency (CVID) is one of the most frequently diagnosed primary antibody deficiencies (PADs), a group of disorders characterized by a decrease in one or more immunoglobulin (sub)classes and/or impaired antibody responses caused by inborn defects in B cells in the absence of other major immune defects. CVID patients suffer from recurrent infections and disease-related, non-infectious complications such as autoimmune manifestations, lymphoproliferation, and malignancies. A timely diagnosis is essential for optimal follow-up and treatment. However, CVID is by definition a diagnosis of exclusion, thereby covering a heterogeneous patient population and making it difficult to establish a definite diagnosis. To aid the diagnosis of CVID patients, and to distinguish them from other PADs, we developed an automated machine learning pipeline which performs automated diagnosis based on flow cytometric immunophenotyping. Using this pipeline, we analyzed the immunophenotypic profile in a pediatric and adult cohort of 28 patients with CVID, 23 patients with idiopathic primary hypogammaglobulinemia, 21 patients with IgG subclass deficiency, six patients with isolated IgA deficiency, one patient with isolated IgM deficiency, and 100 unrelated healthy controls. Flow cytometry analysis is traditionally done by manual identification of the cell populations of interest. Yet, this approach has severe limitations, including the subjectivity of manual gating and a bias toward known populations. To overcome these limitations, we here propose an automated computational flow cytometry pipeline that successfully distinguishes CVID phenotypes from other PADs and healthy controls. Compared to the traditional manual analysis, our pipeline is fully automated: it performs automated quality control and data pre-processing, automated population identification (gating), and derives features from these populations to build a machine learning classifier that distinguishes CVID from other PADs and healthy controls. This results in a more reproducible flow cytometry analysis and improves the diagnosis compared to manual analysis: our pipeline achieves on average a balanced accuracy score of 0.93 (+/- 0.07), whereas using the manually extracted populations, an average balanced accuracy score of 0.72 (+/- 0.23) is achieved.
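    A minimal, hypothetical sketch of the final classification step (not the authors' pipeline): once per-population features have been derived from the gated flow cytometry data, a standard classifier can be cross-validated with the balanced accuracy score quoted above. The data below are synthetic placeholders, so the printed score is not meaningful.

        # Hypothetical sketch, assuming scikit-learn: classify subjects from
        # per-population features and report cross-validated balanced accuracy,
        # a metric robust to the unequal group sizes in the cohorts described above.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import StratifiedKFold, cross_val_score

        rng = np.random.default_rng(0)
        n_features = 30                            # e.g., frequencies of gated populations
        X = rng.normal(size=(179, n_features))     # 179 subjects, as in the cohorts above
        y = rng.integers(0, 2, size=179)           # toy labels: 1 = CVID, 0 = other PAD / control

        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
        scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                                 cv=cv, scoring="balanced_accuracy")
        print("balanced accuracy: %.2f (+/- %.2f)" % (scores.mean(), scores.std()))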

    A Multi-Gene Genetic Programming Application for Predicting Students Failure at School

    Despite several efforts, accurately predicting student failure rate (SFR) at school remains a core problem faced by many in the educational sector. The procedures for forecasting SFR are rigid and often require data scaling or conversion into binary form, as in the logistic model, which may lead to loss of information and effect-size attenuation. In addition, the high number of factors, incomplete and unbalanced datasets, and the black-box nature of Artificial Neural Networks and Fuzzy logic systems expose the need for more efficient tools. Currently, the application of Genetic Programming (GP) holds great promise and has produced tremendous positive results in different sectors. In this regard, this study developed GPSFARPS, a software application that provides a robust solution to the prediction of SFR using an evolutionary algorithm known as multi-gene genetic programming. The approach is validated by feeding a testing data set to the evolved GP models. Results obtained from GPSFARPS simulations show its unique ability to evolve a suitable failure rate expression with fast convergence at 30 generations out of a maximum specified 500. The multi-gene system was also able to minimize the evolved model expression and accurately predict student failure rate using a subset of the original expression. Comment: 14 pages, 9 figures, journal paper. arXiv admin note: text overlap with arXiv:1403.0623 by another author
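    To illustrate the multi-gene idea (not GPSFARPS itself), the sketch below combines several fixed toy "genes" -- small sub-expressions that a GP run might have evolved -- linearly, with the combining weights fitted by least squares; all names and data are illustrative assumptions.

        # Minimal sketch of the multi-gene GP idea: each "gene" is a small
        # sub-expression of the input factors, and the genes are combined linearly
        # with weights fitted by least squares. Here the genes are fixed toy
        # expressions rather than evolved ones.
        import numpy as np

        rng = np.random.default_rng(1)
        X = rng.uniform(0, 1, size=(200, 3))                        # toy student factors x0, x1, x2
        y = 0.8 * X[:, 0] * X[:, 1] - 0.3 * np.sin(X[:, 2]) + 0.1   # toy failure rate

        # Candidate "genes": sub-expressions a GP run might have produced.
        genes = [
            lambda x: x[:, 0] * x[:, 1],
            lambda x: np.sin(x[:, 2]),
            lambda x: np.ones(len(x)),              # constant (bias) gene
        ]

        G = np.column_stack([g(X) for g in genes])  # gene outputs as regression features
        weights, *_ = np.linalg.lstsq(G, y, rcond=None)
        pred = G @ weights

        print("gene weights:", np.round(weights, 3))
        print("RMSE:", np.sqrt(np.mean((y - pred) ** 2)))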