
    Improving random forests by feature dependence analysis

    Random forests (RFs) have been widely used for supervised learning tasks because of their high prediction accuracy, good model interpretability, and fast training process. However, they are not able to learn from local structures as convolutional neural networks (CNNs) do when there exists high dependency among features. They also cannot utilize features that are jointly dependent on the label but marginally independent of it. In this dissertation, we present two approaches to address these two problems, respectively, by dependence analysis. First, a local feature sampling (LFS) approach is proposed to learn and use the locality information of features to group dependent/correlated features to train each tree. For image data, the local information of features (pixels) is defined by the 2-D grid of the image. For non-image data, we provide multiple ways of estimating this local structure. Our experiments show that RF with LFS has reduced correlation and improved accuracy on multiple UCI datasets. To address the latter issue, we propose a way to categorize features as marginally dependent features and jointly dependent features; the latter are defined by minimum dependence sets (MDS's) or by stronger dependence sets (SDS's). Algorithms to identify MDS's and SDS's are provided. We then present a feature dependence mapping (FDM) approach to map the jointly dependent features to another feature space where they are marginally dependent. We show that, by using FDM, decision trees and RFs have improved prediction performance on artificial datasets and a protein expression dataset.
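The idea of features that are jointly dependent on the label but marginally independent of it can be illustrated with a toy XOR example. The sketch below is not the dissertation's MDS, SDS, or FDM algorithm; the `mutual_information` helper, the XOR data, and the choice to treat the feature pair as one combined feature are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Toy data (assumed): all four combinations of two binary features,
# with the label defined as their XOR.
rows = list(product([0, 1], repeat=2))
x1 = [a for a, _ in rows]
x2 = [b for _, b in rows]
y = [a ^ b for a, b in rows]

# Each feature alone carries no information about the label ...
mi_x1 = mutual_information(x1, y)
mi_x2 = mutual_information(x2, y)

# ... but the pair, treated as one combined feature (a hypothetical
# stand-in for a mapped feature space), determines the label fully.
mi_pair = mutual_information(rows, y)

print(mi_x1, mi_x2, mi_pair)
```

A split criterion that scores one feature at a time sees no gain from `x1` or `x2` here, which is why a tree built greedily on marginal scores can miss such features.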

    Improving deep random forests

    The most frequently used deep learning models are deep neural networks. Although they have been successfully applied to various problems, they require large training sets and careful tuning of parameters. An alternative to deep neural networks is the deep forest model, which we independently implemented to verify the replicability of results in (Zhou and Feng, 2017). We test if the accuracy of deep forest can be improved by including random subspace forests or by using stacking to combine predictions of cascade forest's last layer. We evaluate the original implementation and our improvements on five data sets. The algorithm with added stacking achieves equal or better results on all five data sets, whereas the addition of random subspace forests brings worse results on three data sets and better results on two data sets.

    A Framework to Adjust Dependency Measure Estimates for Chance

    Estimating the strength of dependency between two variables is fundamental for exploratory analysis and many other applications in data mining. For example, non-linear dependencies between two continuous variables can be explored with the Maximal Information Coefficient (MIC), and categorical variables that are dependent on the target class are selected using Gini gain in random forests. Nonetheless, because dependency measures are estimated on finite samples, the interpretability of their quantification and the accuracy when ranking dependencies become challenging. Dependency estimates are not equal to 0 when variables are independent, cannot be compared if computed on different sample sizes, and are inflated by chance on variables with more categories. In this paper, we propose a framework to adjust dependency measure estimates on finite samples. Our adjustments, which are simple and applicable to any dependency measure, are helpful in improving interpretability when quantifying dependency and in improving accuracy on the task of ranking dependencies. In particular, we demonstrate that our approach enhances the interpretability of MIC when used as a proxy for the amount of noise between variables, and improves accuracy when ranking variables during the splitting procedure in random forests. Comment: In Proceedings of the 2016 SIAM International Conference on Data Minin
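The finite-sample inflation described in this abstract can be illustrated with a generic permutation baseline. This is not the paper's adjustment framework, only a common stand-in: the `chance_adjusted` helper, the use of mutual information as the dependency measure, the sample size, and the number of categories are all assumptions chosen for illustration.

```python
import random
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def chance_adjusted(xs, ys, measure, n_permutations=200, seed=0):
    """Return (observed, baseline, observed - baseline).

    The baseline is the mean of the measure after shuffling ys, which
    breaks any real dependency while keeping both marginals fixed.
    """
    rng = random.Random(seed)
    observed = measure(xs, ys)
    shuffled = list(ys)
    total = 0.0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        total += measure(xs, shuffled)
    baseline = total / n_permutations
    return observed, baseline, observed - baseline


rng = random.Random(1)
x = [i % 20 for i in range(200)]               # 20 categories, 200 samples
y_indep = [rng.randint(0, 1) for _ in x]       # generated independently of x
y_dep = [v % 2 for v in x]                     # fully determined by x

raw_i, base_i, adj_i = chance_adjusted(x, y_indep, mutual_information)
raw_d, base_d, adj_d = chance_adjusted(x, y_dep, mutual_information)
print(raw_i, base_i, adj_i)
print(raw_d, base_d, adj_d)
```

For the independent pair, the raw estimate is positive even though there is no real dependency, and subtracting the baseline moves it toward zero. For the dependent pair, the adjusted value stays large.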

    Remote sensing technology applications in forestry and REDD+

    Advances in close-range and remote sensing technologies drive innovations in forest resource assessments and monitoring at varying scales. Data acquired with airborne and spaceborne platforms provide us with higher spatial resolution, more frequent coverage, and increased spectral information. Recent developments in ground-based sensors have advanced three-dimensional (3D) measurements, low-cost permanent systems, and community-based monitoring of forests. The REDD+ mechanism has driven the remote sensing community to advance and develop forest geospatial products that countries can use for international reporting and national forest monitoring. However, there is still an urgent need to better understand the options and limitations of remote and close-range sensing techniques in the field of degradation and forest change assessment. This Special Issue contains 12 studies that provide insight into new advances in the field of remote sensing for forest management and REDD+. These include advances in algorithm development using satellite data; synthetic aperture radar (SAR); airborne and terrestrial LiDAR; and forest reference emissions level (FREL) frameworks.

    Life Expectancy at Birth in Europe: An Econometric Approach Based on Random Forests Methodology

    The objective of this work is to identify and classify the relative importance of several socioeconomic factors which explain life expectancy at birth in the European Union (EU) countries in the period 2008–2017, paying special attention to greenhouse gas emissions and public environmental expenditures. Methods: The Random Forests methodology was employed, which allows classification of the socioeconomic variables considered in the analysis according to their relative importance in explaining health outcomes. Results: Per capita income, the educational level of the population, and the variable AREA (which reflects the subdivision of Europe into four relatively homogeneous areas), followed by the public expenditures on environmental and social protection, are the variables with the highest relevance in explaining life expectancy at birth in Europe over the period 2008–2017. Conclusions: We have identified seven sectors as the main sources of greenhouse gas emissions: electricity, gas, steam, and air conditioning supply; manufacturing; transportation and storage; agriculture, forestry, and fishing; construction; wholesale and retail trade, repair of motor vehicles and motorcycles; and mining and quarrying. Therefore, any public intervention related to environmental policy should be aimed at these economic sectors. Furthermore, it will be more effective to focus on public programs with higher relevance to the health status of the population, such as environmental and social protection expenditures.

    A random forest system combination approach for error detection in digital dictionaries

    When digitizing a print bilingual dictionary, whether via optical character recognition or manual entry, it is inevitable that errors are introduced into the electronic version that is created. We investigate automating the process of detecting errors in an XML representation of a digitized print dictionary using a hybrid approach that combines rule-based, feature-based, and language model-based methods. We investigate combining methods and show that using random forests is a promising approach. We find that, in isolation, unsupervised methods rival the performance of supervised methods. Random forests typically require training data, so we investigate how we can apply random forests to combine individual base methods that are themselves unsupervised without requiring large amounts of training data. Experiments reveal empirically that a relatively small amount of data is sufficient and can potentially be further reduced through specific selection criteria. Comment: 9 pages, 7 figures, 10 tables; appeared in Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data, April 201