4,741 research outputs found

    Random Forests for Big Data

    Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involves massive data, but it also often includes online data and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles, in a single versatile framework, regression problems as well as two-class and multi-class classification problems. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities, such as out-of-bag error and variable importance, are addressed in these methods. Then, we formulate various remarks for random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets (15 and 120 million observations), a simulated one as well as real-world data. One variant relies on subsampling, while three others are related to parallel implementations of random forests and involve either adaptations of the bootstrap to Big Data or "divide-and-conquer" approaches. The fifth variant relies on online learning of random forests. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.
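
The bootstrap, out-of-bag and "divide-and-conquer" ideas mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`bootstrap_indices`, `out_of_bag`, `divide_and_conquer_chunks`), the sample size and the number of chunks are all hypothetical choices made for illustration, and no trees are actually grown.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]


def out_of_bag(n, sample):
    """Return the indices that were never drawn in the bootstrap sample.

    A tree grown on `sample` can be evaluated on these rows, which is
    the basis of the out-of-bag error estimate.
    """
    drawn = set(sample)
    return [i for i in range(n) if i not in drawn]


def divide_and_conquer_chunks(n, k, rng):
    """Shuffle the row indices and split them into k disjoint chunks.

    In a divide-and-conquer scheme, a separate forest would be grown on
    each chunk and the resulting forests merged afterwards.
    """
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[j::k] for j in range(k)]


rng = random.Random(0)      # fixed seed, assumption for reproducibility
n = 10_000                  # illustrative sample size
sample = bootstrap_indices(n, rng)
oob = out_of_bag(n, sample)
oob_fraction = len(oob) / n
chunks = divide_and_conquer_chunks(n, 4, rng)
```

For large `n`, the out-of-bag fraction of a standard bootstrap sample is close to 1/e (about 0.368), so roughly a third of the rows remain available to evaluate each tree.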

    Evolving Ensemble Fuzzy Classifier

    The concept of ensemble learning offers a promising avenue in learning from data streams under complex environments because it addresses the bias and variance dilemma better than its single-model counterpart and features a reconfigurable structure, which is well suited to the given context. While various extensions of ensemble learning for mining non-stationary data streams can be found in the literature, most of them are crafted under a static base classifier and revisit preceding samples in the sliding window for a retraining step. This feature causes computationally prohibitive complexity and is not flexible enough to cope with rapidly changing environments. Their complexity is often demanding because they involve a large collection of offline classifiers, owing to the absence of structural complexity reduction mechanisms and the lack of an online feature selection mechanism. A novel evolving ensemble classifier, namely Parsimonious Ensemble (pENsemble), is proposed in this paper. pENsemble differs from existing architectures in that it is built upon an evolving classifier from data streams, termed Parsimonious Classifier (pClass). pENsemble is equipped with an ensemble pruning mechanism, which estimates a localized generalization error of a base classifier. A dynamic online feature selection scenario is integrated into pENsemble. This method allows for dynamic selection and deselection of input features on the fly. pENsemble adopts a dynamic ensemble structure to output a final classification decision, where it features a novel drift detection scenario to grow the ensemble structure. The efficacy of pENsemble has been demonstrated through rigorous numerical studies with dynamic and evolving data streams, where it delivers the most encouraging performance in attaining a tradeoff between accuracy and complexity. Comment: this paper has been published in IEEE Transactions on Fuzzy Systems.

    Learning from distributed data sources using random vector functional-link networks

    One of the main characteristics of many real-world big data scenarios is their distributed nature. In a machine learning context, distributed data, together with the requirements of preserving privacy and scaling up to large networks, brings the challenge of designing fully decentralized training protocols. In this paper, we explore the problem of distributed learning when the features of every pattern are available throughout multiple agents (as happens, for example, in a distributed database scenario). We propose an algorithm for a particular class of neural networks, known as Random Vector Functional-Link (RVFL), which is based on the Alternating Direction Method of Multipliers optimization algorithm. The proposed algorithm allows an RVFL network to be learned from multiple distributed data sources, while restricting communication to the single operation of computing a distributed average. Our experimental simulations show that the algorithm is able to achieve a generalization accuracy comparable to a fully centralized solution, while at the same time being extremely efficient.
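
The "distributed average" primitive that this abstract names as the only communication step can be sketched as follows. This is not the paper's ADMM-based training algorithm: the ring topology, the equal 1/3 weights, the step count and the function name `ring_consensus` are assumptions chosen only to show how agents can agree on a global mean by talking to their neighbours.

```python
def ring_consensus(values, steps=200):
    """Average consensus on a ring of agents.

    At every step each agent replaces its value with the mean of its own
    value and those of its two ring neighbours. The implied weight matrix
    is doubly stochastic, so the global mean is preserved and every agent
    converges towards it without any central coordinator.
    """
    n = len(values)
    x = list(values)
    for _ in range(steps):
        x = [
            (x[(i - 1) % n] + x[i] + x[(i + 1) % n]) / 3.0
            for i in range(n)
        ]
    return x


# Hypothetical local quantities held by five agents.
local = [2.0, 4.0, 6.0, 8.0, 10.0]
estimates = ring_consensus(local)
global_mean = sum(local) / len(local)
```

After enough steps every entry of `estimates` is numerically indistinguishable from `global_mean`, even though no agent ever sees the other agents' raw values directly.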

    Statewide analysis of brook trout (Salvelinus fontinalis) population status and reach-scale conservation priorities in West Virginia Watersheds

    The Eastern Brook Trout Joint Venture (EBTJV) was formed to implement range-wide strategies that sustain healthy, fishable brook trout populations. Hudy et al. (2006) recently completed a comprehensive analysis of eastern brook trout distributions, representing a critical first step towards fully integrating brook trout conservation efforts in this region. This study was designed to supplement and complement existing data on brook trout distributions and status within West Virginia. We examined recently obtained data for the entire state to update the EBTJV distributional map published in Hudy et al. (2006). We then used fish sample data along with GIS-acquired landscape data to create models to predict and extrapolate brook trout distributions and population types within the historical distribution of brook trout in West Virginia. We also used these data to identify critical reach-scale priorities for brook trout protection, restoration and enhancement.

    Concept Drift Detection in Data Stream Mining: The Review of Contemporary Literature

    Mining processes such as classification and clustering of progressive or dynamic data are a critical objective of information retrieval and knowledge discovery; they are particularly sensitive in data stream mining models because the type and dimensionality of the data may change significantly over time. The influence of these changes on the mining process is termed concept drift. Concept drift, which occurs often in streaming data, causes unbalanced performance of the mining models adopted. Hence, mining models need to be strengthened to predict and analyse concept drift in order to achieve the best possible performance. The contemporary literature shows significant contributions to handling concept drift, which fall into supervised learning, unsupervised learning, and statistical assessment approaches. This manuscript contributes a detailed review of the contemporary concept-drift detection models described in recent literature. The contribution of the manuscript includes the nomenclature of the concept drift models and the impact of imbalanced data tuples.

    Autonomous Deep Learning: Continual Learning Approach for Dynamic Environments

    The feasibility of deep neural networks (DNNs) for addressing data stream problems still requires intensive study because of the static and offline nature of conventional deep learning approaches. A deep continual learning algorithm, namely autonomous deep learning (ADL), is proposed in this paper. Unlike traditional deep learning methods, ADL features a flexible structure: its network can be constructed from scratch, in the absence of an initial network structure, via a self-constructing network structure. ADL specifically addresses catastrophic forgetting by having a different-depth structure, which is capable of achieving a trade-off between plasticity and stability. A network significance (NS) formula is proposed to drive the hidden-node growing and pruning mechanism. A drift detection scenario (DDS) is put forward to signal distributional changes in data streams, which induce the creation of a new hidden layer. The maximum information compression index (MICI) method plays an important role as a complexity reduction module that eliminates redundant layers. The efficacy of ADL is numerically validated under the prequential test-then-train procedure in lifelong environments using nine popular data stream problems. The numerical results demonstrate that ADL consistently outperforms recent continual learning methods while featuring the automatic construction of network structures.
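
The prequential test-then-train procedure mentioned in this abstract can be sketched in a few lines. This is only an illustration of the evaluation protocol, not of ADL itself: the `MajorityClassLearner` stand-in, the `prequential_accuracy` helper and the toy stream are hypothetical, chosen so the example stays self-contained.

```python
from collections import Counter


class MajorityClassLearner:
    """Trivial online learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # No label has been seen yet, so no prediction can be made.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test-then-train: each sample is first predicted, then learned from."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:   # test on the unseen sample first
            correct += 1
        learner.learn(x, y)           # only then use it for training
        total += 1
    return correct / total if total else 0.0


# Toy stream with an abrupt label change after six samples.
stream = [(i, 0) for i in range(6)] + [(i, 1) for i in range(6, 10)]
accuracy = prequential_accuracy(stream, MajorityClassLearner())
```

Because every sample is scored before the learner sees its label, the resulting accuracy reflects performance on unseen data throughout the stream, including the period right after the label change, where a static learner such as this one keeps making errors.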