
    MissForest - nonparametric missing value imputation for mixed-type data

    Modern data acquisition based on high-throughput technology is often facing the problem of missing data. Algorithms commonly used in the analysis of such large-scale data often depend on a complete data set. Missing value imputation offers a solution to this problem. However, the majority of available imputation methods are restricted to one type of variable only: continuous or categorical. For mixed-type data the different types are usually handled separately. Therefore, these methods ignore possible relations between variable types. We propose a nonparametric method which can cope with different types of variables simultaneously. We compare several state-of-the-art methods for the imputation of missing values. We propose and evaluate an iterative imputation method (missForest) based on a random forest. By averaging over many unpruned classification or regression trees, random forest intrinsically constitutes a multiple imputation scheme. Using the built-in out-of-bag error estimates of random forest, we are able to estimate the imputation error without the need for a test set. Evaluation is performed on multiple data sets coming from a diverse selection of biological fields with artificially introduced missing values ranging from 10% to 30%. We show that missForest can successfully handle missing values, particularly in data sets including different types of variables. In our comparative study missForest outperforms other methods of imputation, especially in data settings where complex interactions and nonlinear relations are suspected. The out-of-bag imputation error estimates of missForest prove to be adequate in all settings. Additionally, missForest exhibits attractive computational efficiency and can cope with high-dimensional data.
    Comment: Submitted to Oxford Journal's Bioinformatics on 3rd of May 2011
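    The iterative scheme described above is compact enough to sketch directly. The following is a simplified, numeric-only illustration of missForest-style imputation built on scikit-learn's random forest, not the authors' R implementation; the function name, iteration cap, and forest settings are our own.

```python
# Simplified sketch of missForest-style iterative imputation for numeric
# data only (the paper's method also handles categorical variables via
# classification trees). Illustrative, not the authors' code.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def miss_forest_numeric(X, n_iter=10, random_state=0):
    X = X.astype(float)
    mask = np.isnan(X)                               # remember missing cells
    col_means = np.nanmean(X, axis=0)
    X[mask] = np.take(col_means, np.where(mask)[1])  # initial mean imputation
    # Visit columns in order of increasing missingness, as in missForest.
    order = np.argsort(mask.sum(axis=0))
    prev_diff = np.inf
    for _ in range(n_iter):
        X_old = X.copy()
        for j in order:
            miss_rows = mask[:, j]
            if not miss_rows.any():
                continue
            others = np.delete(np.arange(X.shape[1]), j)
            rf = RandomForestRegressor(n_estimators=100,
                                       random_state=random_state)
            rf.fit(X[~miss_rows][:, others], X[~miss_rows, j])
            X[miss_rows, j] = rf.predict(X[miss_rows][:, others])
        # Stop when the change in imputed values first increases
        # (the paper's stopping criterion) and return the previous matrix.
        diff = ((X - X_old) ** 2).sum() / (X ** 2).sum()
        if diff > prev_diff:
            return X_old
        prev_diff = diff
    return X
```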

    Dealing with Missing Data and Uncertainty in the Context of Data Mining

    Missing data is an issue in many real-world datasets, yet robust methods for dealing with missing data appropriately still need development. In this paper we investigate how some methods for handling missing data perform as the uncertainty increases. Using benchmark datasets from the UCI Machine Learning repository, we generate datasets for our experimentation with increasing amounts of data Missing Completely At Random (MCAR), both at the attribute level and at the record level. We then apply four classification algorithms: C4.5, Random Forest, Naïve Bayes and Support Vector Machines (SVMs). We measure the performance of each classifier under complete case analysis and simple imputation, and then study the performance of the algorithms that can handle missing data. We find that complete case analysis has a detrimental effect because it renders many datasets infeasible as missing data increases, particularly for high-dimensional data. We find that increasing missing data does have a negative effect on the performance of all the algorithms tested, but the algorithms, whether relying on preprocessing in the form of simple imputation or handling the missing data directly, do not show a significant difference in performance.
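    For reproducing this kind of experiment, the MCAR corruption step is simple to write down. The helpers below are hypothetical illustrations of attribute-level and record-level MCAR injection; the paper's exact generation procedure and rates may differ.

```python
# Hypothetical MCAR-injection helpers in the spirit of the setup above;
# names and defaults are illustrative, not the paper's protocol.
import numpy as np

rng = np.random.default_rng(42)

def inject_mcar_attribute(X, rate):
    """Blank out each cell independently with probability `rate`."""
    X = X.astype(float)
    X[rng.random(X.shape) < rate] = np.nan
    return X

def inject_mcar_record(X, rate, n_cols=1):
    """Blank out `n_cols` random attributes in a `rate` fraction of records."""
    X = X.astype(float)
    hit = np.where(rng.random(X.shape[0]) < rate)[0]
    for i in hit:
        cols = rng.choice(X.shape[1], size=n_cols, replace=False)
        X[i, cols] = np.nan
    return X
```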

    Improved Time Series Land Cover Classification by Missing-Observation-Adaptive Nonlinear Dimensionality Reduction

    Dimensionality reduction (DR) is a widely used technique to address the curse of dimensionality when high-dimensional remotely sensed data, such as multi-temporal or hyperspectral imagery, are analyzed. Nonlinear DR algorithms, also referred to as manifold learning algorithms, have been successfully applied to hyperspectral data and provide improved performance compared with linear DR algorithms. However, DR algorithms cannot handle missing data that are common in multi-temporal imagery. In this paper, the Laplacian Eigenmaps (LE) nonlinear DR algorithm was refined for application to multi-temporal satellite data with large proportions of missing data. Refined LE algorithms were applied to 52-week Landsat time series for three study areas in Texas, Kansas and South Dakota that have different amounts of missing data and land cover complexity. A series of random forest classifications was conducted on the refined LE DR bands using varying proportions of training data provided by the United States Department of Agriculture (USDA) National Agricultural Statistics Service (NASS) Cropland Data Layer (CDL); these classification results were compared with conventional metrics-based random forest classifications. Experimental results show that compared with the metrics approach, higher per-class and overall classification accuracies were obtained using the refined LE DR bands of multispectral reflectance time series, and the number of training samples required to achieve a given degree of classification accuracy was also reduced. The approach of applying the refined LE to multispectral reflectance time series is promising in that it is automated and provides dimensionality-reduced data with desirable classification properties. The implications of this research and possibilities for future algorithm development and application are discussed.
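    As a point of reference, the unrefined pipeline can be approximated with off-the-shelf tools. The sketch below uses scikit-learn's SpectralEmbedding (an implementation of Laplacian Eigenmaps) followed by a random forest. Stock LE cannot handle missing observations, so the sketch simply interpolates gaps along the time axis first; the paper's refined LE handles missingness inside the algorithm itself, and the dimensions and names here are assumptions.

```python
# Baseline LE-then-classify sketch; not the paper's refined algorithm.
import numpy as np
import pandas as pd
from sklearn.manifold import SpectralEmbedding
from sklearn.ensemble import RandomForestClassifier

def le_classify(ts, labels, n_bands=10, n_trees=200):
    """ts: (n_pixels, n_weeks) reflectance series with NaNs for cloud gaps;
    labels: per-pixel land cover classes (e.g. drawn from the CDL)."""
    # Crude gap-filling along the time axis, since stock LE needs complete rows.
    filled = (pd.DataFrame(ts)
                .interpolate(axis=1, limit_direction="both")
                .to_numpy())
    # Laplacian Eigenmaps embedding as the dimensionality-reduced "bands".
    bands = SpectralEmbedding(n_components=n_bands).fit_transform(filled)
    clf = RandomForestClassifier(n_estimators=n_trees).fit(bands, labels)
    return clf, bands
```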

    MissForest—non-parametric missing value imputation for mixed-type data

    Motivation: Modern data acquisition based on high-throughput technology is often facing the problem of missing data. Algorithms commonly used in the analysis of such large-scale data often depend on a complete data set. Missing value imputation offers a solution to this problem. However, the majority of available imputation methods are restricted to one type of variable only: continuous or categorical. For mixed-type data, the different types are usually handled separately. Therefore, these methods ignore possible relations between variable types. We propose a non-parametric method which can cope with different types of variables simultaneously. Results: We compare several state-of-the-art methods for the imputation of missing values. We propose and evaluate an iterative imputation method (missForest) based on a random forest. By averaging over many unpruned classification or regression trees, random forest intrinsically constitutes a multiple imputation scheme. Using the built-in out-of-bag error estimates of random forest, we are able to estimate the imputation error without the need for a test set. Evaluation is performed on multiple datasets coming from a diverse selection of biological fields with artificially introduced missing values ranging from 10% to 30%. We show that missForest can successfully handle missing values, particularly in datasets including different types of variables. In our comparative study, missForest outperforms other methods of imputation, especially in data settings where complex interactions and non-linear relations are suspected. The out-of-bag imputation error estimates of missForest prove to be adequate in all settings. Additionally, missForest exhibits attractive computational efficiency and can cope with high-dimensional data. Availability: The R package missForest is freely available from http://stat.ethz.ch/CRAN/. Contact: [email protected]; [email protected]
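    The paper's two imputation error measures are short formulas: normalized root mean squared error (NRMSE) for continuous variables and the proportion of falsely classified entries (PFC) for categorical ones, both evaluated over the originally missing cells. A direct transcription:

```python
# missForest's error measures, evaluated on the artificially removed cells.
import numpy as np

def nrmse(X_true, X_imp, mask):
    """NRMSE over imputed continuous entries; `mask` marks missing cells."""
    num = np.mean((X_true[mask] - X_imp[mask]) ** 2)
    den = np.var(X_true[mask])
    return np.sqrt(num / den)

def pfc(X_true, X_imp, mask):
    """Proportion of falsely classified imputed categorical entries."""
    return np.mean(X_true[mask] != X_imp[mask])
```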

    Novel Random Forest Methods and Algorithms for Autism Spectrum Disorders Research

    Random Forest (RF) is a flexible, easy-to-use machine learning algorithm, proposed by Leo Breiman in 2001, that builds a predictor ensemble from a set of decision trees grown in randomly selected subspaces of the data. Its superior prediction accuracy has made it one of the most used algorithms in machine learning. In this dissertation, we use the random forest as the main building block for creating a proximity matrix for multivariate matching and for diagnostic classification problems in autism research (as an exemplary application). In observational studies, matching is used to optimize the balance between treatment groups. Although many matching algorithms can achieve this goal, in some fields matching faces its own challenges: datasets with small sample sizes and limited control reservoirs are prone to this issue. This problem applies to many ongoing research fields, such as autism spectrum disorder (ASD). We are interested in eliminating the effect of undesirable variables using two types of algorithms: 1:k nearest matching and full matching. We therefore first introduce three types of 1:k nearest matching algorithms and two full-matching-based methods to compare group-wise versus pairwise matching for creating an optimal balance and sample size. These proposed methods were applied to a dataset from the Brain Development Imaging Lab (BDIL) at San Diego State University. Next, we introduce the iterMatch R package. This package finds a 1:1 matched subsample of the data that is balanced on all matching variables while incorporating missing values in an iterative manner. Missing values in a dataset must otherwise be imputed, or matching restricted to complete cases; losing data because of limitations in a matching algorithm can decrease the power of a study and omit important information. In addition to introducing the iterMatch package, we discuss tuning its input parameters using medium and large datasets from the Autism Brain Imaging Data Exchange (ABIDE). We then propose two mixed-effects random-forest-based classification algorithms applicable to multi-site (clustered) data using resting-state fMRI (rs-fMRI) and structural MRI (sMRI). These algorithms internally control for the random effect of the confounding site factor and the fixed effect of the age phenotype variable while building the prediction model. Beyond controlling for confounding variables, these algorithms remove the need for a separate nonlinear dimension-reduction step for high-dimensional data such as functional connectivity. We show the proposed algorithms can achieve prediction accuracy over 80 percent on test data.
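    The proximity matrix at the heart of these methods can be sketched from any fitted forest. Below is a hedged illustration using the classical Breiman-style proximity (the fraction of trees in which two observations share a terminal node) together with a naive 1:k nearest matching step; the dissertation's actual matching algorithms are more elaborate, and the function names here are ours.

```python
# Breiman-style RF proximities plus naive 1:k nearest matching; a sketch.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_proximity(X, y, n_trees=500):
    rf = RandomForestClassifier(n_estimators=n_trees).fit(X, y)
    leaves = rf.apply(X)  # (n_samples, n_trees) terminal-node indices
    # Fraction of trees in which each pair lands in the same leaf.
    # Note: materializes an (n, n, n_trees) boolean array; fine for small n.
    return (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

def match_1_to_k(prox, treated_idx, control_idx, k=3):
    """For each treated unit, pick its k most proximate controls."""
    control_idx = np.asarray(control_idx)
    return {t: control_idx[np.argsort(prox[t, control_idx])[::-1][:k]]
            for t in treated_idx}
```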

    Geometry- and Accuracy-Preserving Random Forest Proximities with Applications

    Many machine learning algorithms use calculated distances or similarities between data observations to make predictions, cluster similar data, visualize patterns, or generally explore the data. Most distance or similarity measures do not incorporate known data labels and are thus considered unsupervised. Supervised methods for measuring distance exist which incorporate data labels and thereby exaggerate separation between data points of different classes; this approach tends to distort the natural structure of the data. Instead of following similar approaches, we leverage a popular algorithm used for making data-driven predictions, known as random forests, to naturally incorporate data labels into similarity measures known as random forest proximities. In this dissertation, we explore previously defined random forest proximities and demonstrate their weaknesses in popular proximity-based applications. Additionally, we develop a new proximity definition that can be used to recreate the random forest’s predictions. We call these Random Forest-Geometry- and Accuracy-Preserving proximities, or RF-GAP. We show, by proof and empirical demonstration, that RF-GAP proximities can be used to perfectly reconstruct the random forest’s predictions and, as a result, we argue that RF-GAP proximities provide a truer representation of the random forest’s learning when used in proximity-based applications. We provide evidence to suggest that RF-GAP proximities improve applications including imputing missing data, detecting outliers, and visualizing the data. We also introduce a new random forest proximity-based technique that can be used to generate 2- or 3-dimensional data representations as a tool to visually explore the data. We show that this method does well at portraying the relationship between data variables and the data labels, and we show quantitatively and qualitatively that it surpasses other existing methods for this task.
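    The reconstruction property is easiest to see for regression: a forest prediction is rewritten as a proximity-weighted average of training labels. With classical proximities, such as the Breiman-style definition sketched above, this weighting only approximates the forest; the RF-GAP definition is constructed so that the equality holds exactly. A minimal sketch of the weighted-average form:

```python
# Prediction as a proximity-weighted mean of training labels; with
# classical proximities this approximates the forest, while RF-GAP
# proximities make it exact by construction.
import numpy as np

def proximity_predict(prox, y_train):
    """prox: (n, n_train) proximities; y_train: (n_train,) regression labels."""
    prox = prox.astype(float).copy()
    if prox.shape[0] == prox.shape[1]:
        np.fill_diagonal(prox, 0.0)  # drop self-proximity for training points
    weights = prox / prox.sum(axis=1, keepdims=True)  # row-normalize
    return weights @ y_train
```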

    Performance evaluation of supervised learning algorithms with various training data sizes and missing attributes

    Supervised learning is a machine learning technique used for creating a data prediction model. This article focuses on finding high-performance supervised learning algorithms under varied training data sizes, varied numbers of attributes, and the time spent on prediction. This study evaluated seven algorithms, Boosting, Random Forest, Bagging, Naive Bayes, K-Nearest Neighbours (K-NN), Decision Tree, and Support Vector Machine (SVM), on seven datasets that are standard benchmarks from the University of California, Irvine (UCI), with two evaluation metrics and experimental settings of various training data sizes and missing key attributes. Our findings reveal that Bagging, Random Forest, and SVM are overall the three most accurate algorithms. However, when missing key attribute values are a concern, K-NN is recommended, as its performance is affected the least. Alternatively, when training data sizes may not be large enough, Naive Bayes is preferable, since it is the algorithm least sensitive to training data size. The algorithms are characterized on a two-dimensional chart based on prediction performance and computation time. This chart is expected to guide a novice user in choosing an appropriate method for his/her needs. Based on this chart, in general, Bagging and Random Forest are the two most recommended algorithms because of their high performance and speed.
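    The experimental grid is straightforward to reproduce in outline. The loop below is a sketch rather than the paper's exact protocol: it scores a few scikit-learn classifiers across training fractions while recording wall-clock time, with illustrative model settings and fractions.

```python
# Sketch of a training-size/accuracy/time benchmark; settings illustrative.
import time
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

def benchmark(X, y, fractions=(0.1, 0.3, 0.5, 0.7, 0.9)):
    models = {"RF": RandomForestClassifier(), "Bagging": BaggingClassifier(),
              "NB": GaussianNB(), "SVM": SVC()}
    for frac in fractions:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=frac, stratify=y, random_state=0)
        for name, model in models.items():
            t0 = time.perf_counter()
            acc = model.fit(X_tr, y_tr).score(X_te, y_te)
            print(f"train={frac:.0%} {name}: acc={acc:.3f} "
                  f"time={time.perf_counter() - t0:.2f}s")
```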

    Predicting Heart Ailment in Patients with Varying number of Features using Data Mining Techniques

    Data mining can be defined as a process of extracting unknown, verifiable, and potentially helpful information from data. Among the various ailments, heart disease is one of the primary reasons behind the death of individuals around the globe; hence, in order to curb this, a detailed analysis is done using data mining. Analyses often limit themselves to the minimal attributes required to predict heart disease in a patient, thereby missing many important attributes that are major causes of heart disease. Hence, this research aims at considering almost all of the important features affecting heart disease and performs the analysis step by step, from minimal to maximal sets of attributes, using data mining techniques to predict heart ailments. The classification methods used are the Naïve Bayes classifier, Random Forest, and Random Tree, which are applied to three datasets with different numbers of attributes but a common class label. The analysis shows a gradual increase in prediction accuracy as attributes are added, irrespective of the classifier used, and that Naïve Bayes and Random Forest comparatively outperform the others on these datasets.
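    The minimal-to-maximal attribute experiment reduces to training the same classifiers on nested feature subsets. A sketch under the assumption that each dataset arrives as a dataframe with a common class label; the column groupings and function name are hypothetical placeholders, and Random Tree has no direct scikit-learn analogue, so only two of the three classifiers appear.

```python
# Accuracy versus number of attributes on nested feature subsets; a sketch.
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

def accuracy_by_attribute_set(df, target, attribute_sets):
    """attribute_sets: list of column-name lists, from minimal to maximal."""
    for cols in attribute_sets:
        X, y = df[cols], df[target]
        for name, clf in [("NaiveBayes", GaussianNB()),
                          ("RandomForest", RandomForestClassifier())]:
            score = cross_val_score(clf, X, y, cv=5).mean()
            print(f"{len(cols):2d} attributes {name}: {score:.3f}")
```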

    The hidden sexual minorities: machine learning approaches to estimate the sexual minority orientation among Beijing college students

    Based on the fourth-wave Beijing College Students Panel Survey (BCSPS), this study aims to provide an accurate estimate of the percentage of potential sexual minorities among Beijing college students by using machine learning methods. Specifically, we employ random forest (RF), an ensemble learning approach for classification and regression, to predict the sexual orientation of those who were not willing to disclose their sexual identity. To overcome the imbalance problem arising from the very different numerical proportions of sexual minority and majority members, we adopt repeated random sub-sampling for the training set, partitioning those who expressed a heterosexual orientation into different numbers of splits and combining each split with those who expressed a sexual minority orientation. The prediction from a 24-split random forest suggests that youths in Beijing with a sexual minority orientation amount to 5.71%, almost twice the original estimate of 3.03%. The results are robust to alternative learning methods and covariate sets. It is also suggested that random forest outperforms other learning algorithms, including AdaBoost, Naive Bayes, support vector machine (SVM), and logistic regression, in dealing with missing data, showing higher accuracy, F1 score, and area under the curve (AUC) value.
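    The repeated random sub-sampling scheme described above can be sketched directly: the majority class is partitioned into splits, each split is paired with the full minority class to train one forest, and the forests' probabilities are averaged. The code below assumes a 0/1 label with 1 as the minority class; the 24-split setting follows the paper, while forest sizes and names are illustrative.

```python
# Split-ensemble for class imbalance: partition the majority class,
# pair each part with the full minority class, average predictions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def split_ensemble_predict(X, y, X_new, n_splits=24, random_state=0):
    rng = np.random.default_rng(random_state)
    majority = np.where(y == 0)[0]          # assumed majority label: 0
    minority = np.where(y == 1)[0]          # assumed minority label: 1
    parts = np.array_split(rng.permutation(majority), n_splits)
    probs = []
    for part in parts:
        idx = np.concatenate([part, minority])
        rf = RandomForestClassifier(n_estimators=200).fit(X[idx], y[idx])
        probs.append(rf.predict_proba(X_new)[:, 1])
    return np.mean(probs, axis=0)           # averaged minority-class probability
```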