
    Random Forests for Big Data

    Big Data is one of the major challenges of statistical science, with numerous algorithmic and theoretical consequences. Big Data always involves massive data, but it also often involves online data and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods, and bootstrapping schemes. Random forests, introduced by Breiman in 2001, are based on decision trees combined with aggregation and bootstrap ideas. They are a powerful nonparametric statistical method that handles, in a single versatile framework, regression problems as well as two-class and multi-class classification problems. Focusing on classification, this paper proposes a selective review of available proposals for scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities, such as out-of-bag error and variable importance, are addressed in these methods. We then formulate various remarks on random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets (15 and 120 million observations), one simulated and one drawn from real-world data. One variant relies on subsampling, while three others correspond to parallel implementations of random forests and involve either adaptations of the bootstrap to Big Data or "divide-and-conquer" approaches. The fifth variant relies on online learning of random forests. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.
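
    The subsampling variant can be illustrated with a minimal sketch: each base learner sees only a small m-out-of-n random subsample, and predictions are aggregated by majority vote. This is not the paper's implementation — the base learner here is a one-level decision stump standing in for a full tree, and all names and parameters are invented for illustration.

```python
import random
from collections import Counter

def best_stump(X, y):
    """Fit a one-level tree (stump): the (feature, threshold, left-label)
    split that minimises misclassification on this subsample."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            for left_label in (0, 1):
                err = sum(1 for x, lab in zip(X, y)
                          if (left_label if x[j] <= t else 1 - left_label) != lab)
                if err < best_err:
                    best_err, best = err, (j, t, left_label)
    return best

def predict_stump(stump, x):
    j, t, left_label = stump
    return left_label if x[j] <= t else 1 - left_label

def fit_subsampled_forest(X, y, n_trees=25, subsample=0.5, seed=0):
    """m-out-of-n variant: each learner is fit on a small random subsample
    (drawn with replacement), so no single learner touches the full data."""
    rng = random.Random(seed)
    m = max(2, int(subsample * len(X)))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(m)]
        forest.append(best_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Aggregate the ensemble by majority vote."""
    return Counter(predict_stump(s, x) for s in forest).most_common(1)[0][0]

# Toy data: class 1 exactly when the first feature exceeds 0.5.
X = [[i / 20.0, (i * 7 % 20) / 20.0] for i in range(20)]
y = [1 if x[0] > 0.5 else 0 for x in X]
forest = fit_subsampled_forest(X, y)
acc = sum(predict_forest(forest, x) == lab for x, lab in zip(X, y)) / len(X)
```

    In a genuine Big Data setting each subsample fit would run on a separate worker; only the fitted learners, not the data, need to be gathered for voting.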

    K-nearest Neighbor Search by Random Projection Forests

    K-nearest neighbor (kNN) search has wide applications in many areas, including data mining, machine learning, statistics, and many applied domains. Inspired by the success of ensemble methods and the flexibility of tree-based methodology, we propose random projection forests (rpForests) for kNN search. rpForests finds kNNs by aggregating results from an ensemble of random projection trees, each constructed recursively through a series of carefully chosen random projections. rpForests achieves remarkable accuracy, in terms of the fast decay of both the kNN miss rate and the discrepancy in kNN distances, at very low computational complexity. The ensemble nature of rpForests makes it easy to run in parallel on multicore or clustered computers; the running time is expected to be nearly inversely proportional to the number of cores or machines. We give theoretical insights by showing that, as the ensemble size increases, the probability that neighboring points are separated by the ensemble of random projection trees decays exponentially. Our theory can be used to refine the choice of random projections in the growth of trees, and experiments show that the effect is remarkable. Comment: 15 pages, 4 figures, 2018 IEEE Big Data Conference
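
    The mechanism can be sketched roughly as follows (this is not the authors' implementation; all names and parameters are invented): each tree splits the data at the median projection along a random Gaussian direction, the query's leaf is collected from every tree in the ensemble, and exact distances are computed only over the union of candidates.

```python
import math
import random

def build(points, idx, leaf_size, rng):
    """Grow one random projection tree: project onto a random Gaussian
    direction and split at the median projection, recursively."""
    if len(idx) <= leaf_size:
        return ("leaf", idx)
    d = len(points[0])
    w = [rng.gauss(0, 1) for _ in range(d)]
    dot = lambda p: sum(p[k] * w[k] for k in range(d))
    vals = sorted(dot(points[i]) for i in idx)
    thr = vals[len(vals) // 2]
    left = [i for i in idx if dot(points[i]) < thr]
    right = [i for i in idx if dot(points[i]) >= thr]
    if not left or not right:                  # degenerate split: stop here
        return ("leaf", idx)
    return ("node", w, thr,
            build(points, left, leaf_size, rng),
            build(points, right, leaf_size, rng))

def leaf_of(tree, q):
    """Route a query down one tree to its leaf of candidate indices."""
    while tree[0] == "node":
        _, w, thr, l, r = tree
        tree = l if sum(q[k] * w[k] for k in range(len(q))) < thr else r
    return tree[1]

def rp_forest_knn(points, q, k, n_trees=10, leaf_size=8, seed=0):
    """Union the query's leaves across the ensemble, then rank exactly."""
    rng = random.Random(seed)
    cand = set()
    for _ in range(n_trees):
        tree = build(points, list(range(len(points))), leaf_size, rng)
        cand.update(leaf_of(tree, q))
    return sorted(cand, key=lambda i: math.dist(q, points[i]))[:k]

# Deterministic scattered 2-D points; query an exact data point.
points = [[(i * 13 % 97) / 97.0, (i * 29 % 89) / 89.0] for i in range(60)]
neighbors = rp_forest_knn(points, points[30], k=5)
```

    Because each tree is independent, the loop over trees parallelizes trivially, which is the source of the near-inverse speedup the abstract describes.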

    Distributed Stochastic Aware Random Forests -- Efficient Data Mining for Big Data


    Using Biophysical Geospatial and Remotely Sensed Data to Classify Ecological Sites and States

    Monitoring and identifying the state of rangelands on a landscape scale can be a time-consuming process. In this thesis, remote sensing imagery is used to show how different ecological sites and states can be classified on a per-pixel basis for a large landscape. Twenty-seven years' worth of remotely sensed imagery was collected, atmospherically corrected, and radiometrically normalized. Several vegetation indices were extracted from the imagery, along with derivatives from a digital elevation model. Dominant vegetation components from five major ecological sites in Rich County, Utah, were chosen for study: Aspen, Douglas-fir, Utah juniper, mountain big sagebrush, and Wyoming big sagebrush. Training sites were extracted from within map units dominated by one of the five ecological sites. A Random Forests decision tree model was developed using an attribute table populated with spectral and biophysical variables derived from the training sites. The overall out-of-bag accuracy for the Random Forests model was 97.2%. The model was then applied to the predictor spectral and biophysical variables to spatially map the five major vegetation components for all of Rich County. Each vegetation class had greater than 90% accuracy, except for Utah juniper at 81%. This process is further explained in chapter 2. As a follow-on effort, we attempted to classify vegetation ecological states within a single ecological site (Wyoming big sagebrush), using field data collected by previous studies as training data for all five ecological states documented for that site. A Maximum Likelihood classifier was applied to four years of Landsat 5 Thematic Mapper imagery to map each ecological state onto pixels coincident with the map units correlated with the Wyoming big sagebrush ecological site. We used the Mahalanobis distance metric as an indicator of pixel membership in the Wyoming big sagebrush ecological site. Overall classification accuracy for the different ecological states was 64.7% for pixels with a low Mahalanobis distance and less than 25% for higher distances.
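
    The Mahalanobis membership test can be sketched for a pixel with two spectral bands; the class mean and covariance below are placeholders standing in for statistics that would be estimated from training pixels, not values from the thesis.

```python
import math

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance sqrt((x-m)^T S^-1 (x-m)) for a 2-band pixel,
    inverting the 2x2 class covariance matrix in closed form."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mean[0], x[1] - mean[1]]
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

# With identity covariance the measure reduces to Euclidean distance:
d1 = mahalanobis_2d([3.0, 4.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
# A class with high variance in band 1 discounts deviation in that band,
# so the same pixel sits "closer" to the class:
d2 = mahalanobis_2d([3.0, 4.0], [0.0, 0.0], [[9.0, 0.0], [0.0, 1.0]])
```

    Thresholding this distance is what lets pixels with a low value be treated as members of the ecological site and the rest be flagged as poor matches.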

    A scalable expressive ensemble learning using Random Prism: a MapReduce approach

    The induction of classification rules from previously unseen examples is one of the most important data mining tasks in science as well as in commercial applications. In order to reduce the influence of noise in the data, ensemble learners are often applied. However, most ensemble learners are based on decision tree classifiers, which are affected by noise. The Random Prism classifier has recently been proposed as an alternative to the popular, decision-tree-based Random Forests classifier. Random Prism is based on the Prism family of algorithms, which is more robust to noise. However, like most ensemble classification approaches, Random Prism does not scale well to large training data. This paper presents a thorough discussion of Random Prism and of a recently proposed parallel version of it called Parallel Random Prism, which is based on the MapReduce programming paradigm. The paper provides, for the first time, a novel theoretical analysis of the proposed technique and an in-depth experimental study showing that Parallel Random Prism scales well with the number of training examples, the number of data features, and the number of processors. The expressiveness of the decision rules our technique produces makes it a natural choice for Big Data applications where informed decision making increases the user's trust in the system.
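
    The map/reduce split can be caricatured in a few lines: the map phase learns one modular rule per data chunk, and the reduce phase aggregates the rules by majority vote. This is a drastic in-process simplification, not the actual Parallel Random Prism system (which runs full Prism base learners on Hadoop MapReduce); every identifier below is invented for illustration.

```python
from collections import Counter

def learn_rule(chunk):
    """Map step: on one chunk, learn a single modular rule
    'attribute j == value v -> class c, else default' by maximising
    precision of the covered examples (a toy stand-in for Prism)."""
    X, y = chunk
    best, best_score = None, -1.0
    for j in range(len(X[0])):
        for v in sorted({x[j] for x in X}):
            covered = [lab for x, lab in zip(X, y) if x[j] == v]
            rest = [lab for x, lab in zip(X, y) if x[j] != v] or covered
            cls, hits = Counter(covered).most_common(1)[0]
            score = hits / len(covered)
            if score > best_score:
                default = Counter(rest).most_common(1)[0][0]
                best_score, best = score, (j, v, cls, default)
    return best

def apply_rule(rule, x):
    j, v, cls, default = rule
    return cls if x[j] == v else default

def ensemble_predict(rules, x):
    """Reduce step: majority vote over the per-chunk rules."""
    return Counter(apply_rule(r, x) for r in rules).most_common(1)[0][0]

# Categorical toy data: class is 1 exactly when attribute 0 is 'a'.
X = [("a", "p"), ("a", "q"), ("b", "p"), ("b", "q")] * 3
y = [1, 1, 0, 0] * 3
chunks = [(X[i::3], y[i::3]) for i in range(3)]   # partition across "mappers"
rules = [learn_rule(c) for c in chunks]           # map phase
preds = [ensemble_predict(rules, x) for x in X]   # reduce phase per query
```

    The point of the structure is that each mapper needs only its own chunk, so the training data never has to fit on one machine; only the compact rules travel to the reducer.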