Improving random forests by feature dependence analysis
Random forests (RFs) have been widely used for supervised learning tasks because of their high prediction accuracy, good model interpretability, and fast training process. However, they are not able to learn from local structures as convolutional neural networks (CNNs) do when there exists high dependency among features. They also cannot utilize features that are jointly dependent on the label but marginally independent of it. In this dissertation, we present two approaches that address these two problems, respectively, by dependence analysis. First, a local feature sampling (LFS) approach is proposed to learn and use the locality information of features to group dependent/correlated features to train each tree. For image data, the local information of features (pixels) is defined by the 2-D grid of the image. For non-image data, we provide multiple ways of estimating this local structure. Our experiments show that RF with LFS has reduced correlation and improved accuracy on multiple UCI datasets. To address the second problem, we propose a way to categorize features as marginally dependent features and jointly dependent features; the latter are defined by minimum dependence sets (MDS's) or by stronger dependence sets (SDS's). Algorithms to identify MDS's and SDS's are provided. We then present a feature dependence mapping (FDM) approach to map the jointly dependent features to another feature space where they are marginally dependent. We show that by using FDM, decision trees and RFs have improved prediction performance on artificial datasets and a protein expression dataset.
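The notion of features that are jointly dependent on the label but marginally independent of it can be illustrated with a small XOR example. The sketch below is not the dissertation's FDM algorithm: the helper `mutual_information` and the joint encoding `2 * a + b` are hypothetical choices made only to show why mapping such features to another space can make the dependence visible to a single split.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    # Sum only over observed (x, y) pairs, so no log(0) can occur.
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Balanced XOR data: the label depends on x1 and x2 only jointly.
rows = [(a, b) for a in (0, 1) for b in (0, 1)] * 25
x1 = [a for a, _ in rows]
x2 = [b for _, b in rows]
y = [a ^ b for a, b in rows]

# Hypothetical mapping (an assumption for illustration): encode the pair
# (x1, x2) as one categorical feature without looking at the label.
joint = [2 * a + b for a, b in rows]

print("I(x1; y)    =", mutual_information(x1, y))
print("I(x2; y)    =", mutual_information(x2, y))
print("I(joint; y) =", mutual_information(joint, y))
```

Taken one at a time, `x1` and `x2` carry no information about `y`, so a greedy split criterion has no reason to choose either. The mapped feature `joint` determines `y` completely.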
Improving deep random forests
The most frequently used deep learning models are deep neural networks.
Although they have been successfully applied to various problems, they
require large training sets and careful tuning of parameters. An alternative
to deep neural networks is the deep forest model, which we independently
implemented to verify the replicability of results in (Zhou and Feng, 2017).
We test if the accuracy of deep forest can be improved by including
random subspace forests or by using stacking to combine predictions of cascade
forest's last layer.
We evaluate the original implementation and our improvements on five
data sets. The algorithm with added stacking achieves equal or better results
on all five data sets, whereas the addition of random subspace forests brings
worse results on three data sets and better results on two data sets.
A Framework to Adjust Dependency Measure Estimates for Chance
Estimating the strength of dependency between two variables is fundamental
for exploratory analysis and many other applications in data mining. For
example: non-linear dependencies between two continuous variables can be
explored with the Maximal Information Coefficient (MIC); and categorical
variables that are dependent on the target class are selected using Gini gain
in random forests. Nonetheless, because dependency measures are estimated on
finite samples, the interpretability of their quantification and the accuracy
when ranking dependencies become challenging. Dependency estimates are not
equal to 0 when variables are independent, cannot be compared if computed on
different sample sizes, and are inflated by chance on variables with more
categories. In this paper, we propose a framework to adjust dependency measure
estimates on finite samples. Our adjustments, which are simple and applicable
to any dependency measure, are helpful in improving interpretability when
quantifying dependency and in improving accuracy on the task of ranking
dependencies. In particular, we demonstrate that our approach enhances the
interpretability of MIC when used as a proxy for the amount of noise between
variables, and improves accuracy when ranking variables during the splitting
procedure in random forests.
Comment: In Proceedings of the 2016 SIAM International Conference on Data
Mining
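The claim that dependency estimates are nonzero on independent variables, and inflated for variables with more categories, can be made concrete with a small sketch. The code below is not the paper's framework: it uses one simple, commonly used adjustment (subtracting the mean of the measure under a permutation null), and the names `mutual_information` and `adjusted_for_chance`, the sample size, and the number of permutations are all assumptions chosen for illustration.

```python
import random
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def adjusted_for_chance(xs, ys, n_permutations=200, seed=0):
    """Observed measure minus its average over random permutations of ys.

    Shuffling ys breaks any real dependence, so the average approximates
    the value the measure takes by chance alone on this sample.
    """
    rng = random.Random(seed)
    observed = mutual_information(xs, ys)
    shuffled = list(ys)
    null_total = 0.0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        null_total += mutual_information(xs, shuffled)
    return observed - null_total / n_permutations


# Synthetic data (assumed): a 10-category variable and two binary targets.
data_rng = random.Random(1)
xs = [data_rng.randrange(10) for _ in range(100)]
ys_independent = [data_rng.randrange(2) for _ in range(100)]
ys_dependent = [x % 2 for x in xs]

raw = mutual_information(xs, ys_independent)
adjusted = adjusted_for_chance(xs, ys_independent)
adjusted_dependent = adjusted_for_chance(xs, ys_dependent)

print("raw estimate, independent target:     ", raw)
print("adjusted estimate, independent target:", adjusted)
print("adjusted estimate, dependent target:  ", adjusted_dependent)
```

The raw estimate for the independent target is positive even though no dependence exists. Subtracting the permutation baseline moves it toward zero while leaving a truly dependent target clearly separated.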
Remote sensing technology applications in forestry and REDD+
Advances in close-range and remote sensing technologies drive innovations in forest resource assessments and monitoring at varying scales. Data acquired with airborne and spaceborne platforms provide us with higher spatial resolution, more frequent coverage and increased spectral information. Recent developments in ground-based sensors have advanced three-dimensional (3D) measurements, low-cost permanent systems and community-based monitoring of forests. The REDD+ mechanism has spurred the remote sensing community to advance and develop forest geospatial products that countries can use for international reporting and national forest monitoring. However, there is still an urgent need to better understand the options and limitations of remote and close-range sensing techniques in the field of degradation and forest change assessment. This Special Issue contains 12 studies that provide insight into new advances in the field of remote sensing for forest management and REDD+. These include advances in algorithm development using satellite data; synthetic aperture radar (SAR); airborne and terrestrial LiDAR; as well as forest reference emissions level (FREL) frameworks.
Life Expectancy at Birth in Europe: An Econometric Approach Based on Random Forests Methodology
The objective of this work is to identify and classify the relative importance of several socioeconomic factors which explain life expectancy at birth in the European Union (EU) countries in the period 2008–2017, paying special attention to greenhouse gas emissions and public environmental expenditures. Methods: The Random Forests methodology was employed, which allows classification of the socioeconomic variables considered in the analysis according to their relative importance to explain health outcomes. Results: Per capita income, the educational level of the population, and the variable AREA (which reflects the subdivision of Europe into four relatively homogeneous areas), followed by the public expenditures on environmental and social protection, are the variables with the highest relevance in explaining life expectancy at birth in Europe over the period 2008–2017. Conclusions: We have identified seven sectors as the main sources of greenhouse gas emissions: Electricity, gas, steam, and air conditioning supply; manufacturing; transportation and storage; agriculture, forestry, and fishing; construction; wholesale and retail trade, repair of motor vehicles and motorcycles; and mining and quarrying. Therefore, any public intervention related to environmental policy should be aimed at these economic sectors. Furthermore, it will be more effective to focus on public programs with higher relevance to the health status of the population, such as environmental and social protection expenditures.
A random forest system combination approach for error detection in digital dictionaries
When digitizing a print bilingual dictionary, whether via optical character
recognition or manual entry, it is inevitable that errors are introduced into
the electronic version that is created. We investigate automating the process
of detecting errors in an XML representation of a digitized print dictionary
using a hybrid approach that combines rule-based, feature-based, and language
model-based methods. We investigate combining methods and show that using
random forests is a promising approach. We find that in isolation, unsupervised
methods rival the performance of supervised methods. Random forests typically
require training data so we investigate how we can apply random forests to
combine individual base methods that are themselves unsupervised without
requiring large amounts of training data. Experiments reveal empirically that a
relatively small amount of data is sufficient and can potentially be further
reduced through specific selection criteria.
Comment: 9 pages, 7 figures, 10 tables; appeared in Proceedings of the
Workshop on Innovative Hybrid Approaches to the Processing of Textual Data,
April 201