156,536 research outputs found
Improved Weighted Random Forest for Classification Problems
Several studies have shown that combining machine learning models in an
appropriate way will introduce improvements in the individual predictions made
by the base models. The key to make well-performing ensemble model is in the
diversity of the base models. Of the most common solutions for introducing
diversity into the decision trees are bagging and random forest. Bagging
enhances the diversity by sampling with replacement and generating many
training data sets, while random forest adds selecting a random number of
features as well. This has made the random forest a winning candidate for many
machine learning applications. However, assuming equal weights for all base
decision trees does not seem reasonable as the randomization of sampling and
input feature selection may lead to different levels of decision-making
abilities across base decision trees. Therefore, we propose several algorithms
that intend to modify the weighting strategy of regular random forest and
consequently make better predictions. The designed weighting frameworks include
optimal weighted random forest based on ac-curacy, optimal weighted random
forest based on the area under the curve (AUC), performance-based weighted
random forest, and several stacking-based weighted random forest models. The
numerical results show that the proposed models are able to introduce
significant improvements compared to regular random forest
Recommended from our members
Improving music genre classification using automatically induced harmony rules
We present a new genre classification framework using both low-level signal-based features and high-level harmony features. A state-of-the-art statistical genre classifier based on timbral features is extended using a first-order random forest containing for each genre rules derived from harmony or chord sequences. This random forest has been automatically induced, using the first-order logic induction algorithm TILDE, from a dataset, in which for each chord the degree and chord category are identified, and covering classical, jazz and pop genre classes. The audio descriptor-based genre classifier contains 206 features, covering spectral, temporal, energy, and pitch characteristics of the audio signal. The fusion of the harmony-based classifier with the extracted feature vectors is tested on three-genre subsets of the GTZAN and ISMIR04 datasets, which contain 300 and 448 recordings, respectively. Machine learning classifiers were tested using 5 Ă— 5-fold cross-validation and feature selection. Results indicate that the proposed harmony-based rules combined with the timbral descriptor-based genre classification system lead to improved genre classification rates
Recommended from our members
Improving music genre classification using automatically induced harmony rules
We present a new genre classification framework using both low-level signal-based features and high-level harmony features. A state-of-the-art statistical genre classifier based on timbral features is extended using a first-order random forest containing for each genre rules derived from harmony or chord sequences. This random forest has been automatically induced, using the first-order logic induction algorithm TILDE, from a dataset, in which for each chord the degree and chord category are identified, and covering classical, jazz and pop genre classes. The audio descriptor-based genre classifier contains 206 features, covering spectral, temporal, energy, and pitch characteristics of the audio signal. The fusion of the harmony-based classifier with the extracted feature vectors is tested on three-genre subsets of the GTZAN and ISMIR04 datasets, which contain 300 and 448 recordings, respectively. Machine learning classifiers were tested using 5 Ă— 5-fold cross-validation and feature selection. Results indicate that the proposed harmony-based rules combined with the timbral descriptor-based genre classification system lead to improved genre classification rates
Random forest for gene selection and microarray data classification
A random forest method has been selected to perform both gene selection and classification of the microarray data. In this
embedded method, the selection of smallest possible sets of genes with lowest error rates is the key factor in achieving highest
classification accuracy. Hence, improved gene selection method using random forest has been proposed to obtain the smallest
subset of genes as well as biggest subset of genes prior to classification. The option for biggest subset selection is done to assist
researchers who intend to use the informative genes for further research. Enhanced random forest gene selection has performed
better in terms of selecting the smallest subset as well as biggest subset of informative genes with lowest out of bag error rates
through gene selection. Furthermore, the classification performed on the selected subset of genes using random forest has lead to
lower prediction error rates compared to existing method and other similar available methods
Predicting Changes in Earnings: A Walk Through a Random Forest
This paper investigates whether the accuracy of models used in accounting research to predict categorical dependent variables (classification) can be improved by using a data analytics approach. This topic is important because accounting research makes extensive use of classification in many different research streams that are likely to benefit from improved accuracy. Specifically, this paper investigates whether the out-of-sample accuracy of models used to predict future changes in earnings can be improved by considering whether the assumptions of the models are likely to be violated and whether alternative techniques have strengths that are likely to make them a better choice for the classification task. I begin my investigation using logistic regression to predict positive changes in earnings using a large set of independent variables. Next, I implement two separate modifications to the standard logistic regression model, stepwise logistic regression and elastic net, and examine whether these modifications improve the accuracy of the classification task. Lastly, I relax the logistic regression parametric assumption and examine whether random forest, a nonparametric machine learning technique, improves the accuracy of the classification task. I find little difference in the accuracy of the logistic regression-based models; however, I find that random forest has consistently higher out-of-sample accuracy than the other models. I also find that a hedge portfolio formed on predicted probabilities using random forest earns larger abnormal returns than hedge portfolios formed using the logistic regression-based models. In subsequent analysis, I consider whether the documented improvements exist in an alternative classification setting: financial misstatements. I find that random forest’s out-of-sample area under the receiver operating characteristic (AUC) is significantly higher than the logistic-based models. Taken together, my findings suggest that the accuracy of classification models used in accounting research can be improved by considering the strengths and weaknesses of different classification models and considering whether machine learning models are appropriate
A Data-driven Approach For Winter Precipitation Classification Using Weather Radar And NWP Data
This study describes a framework that provides qualitative weather information on winter precipitation types using a data-driven approach. The framework incorporates the data retrieved from weather radars and the numerical weather prediction (NWP) model to account for relevant precipitation microphysics. To enable multimodel-based ensemble classification, we selected six supervised machine learning models: k-nearest neighbors, logistic regression, support vector machine, decision tree, random forest, and multi-layer perceptron. Our model training and cross-validation results based on Monte Carlo Simulation (MCS) showed that all the models performed better than our baseline method, which applies two thresholds (surface temperature and atmospheric layer thickness) for binary classification (i.e., rain/snow). Among all six models, random forest presented the best classification results for the basic classes (rain, freezing rain, and snow) and the further refinement of the snow classes (light, moderate, and heavy). Our model evaluation, which uses an independent dataset not associated with model development and learning, led to classification performance consistent with that from the MCS analysis. Based on the visual inspection of the classification maps generated for an individual radar domain, we confirmed the improved classification capability of the developed models (e.g., random forest) compared to the baseline one in representing both spatial variability and continuity
NeuroSVM: A Graphical User Interface for Identification of Liver Patients
Diagnosis of liver infection at preliminary stage is important for better
treatment. In todays scenario devices like sensors are used for detection of
infections. Accurate classification techniques are required for automatic
identification of disease samples. In this context, this study utilizes data
mining approaches for classification of liver patients from healthy
individuals. Four algorithms (Naive Bayes, Bagging, Random forest and SVM) were
implemented for classification using R platform. Further to improve the
accuracy of classification a hybrid NeuroSVM model was developed using SVM and
feed-forward artificial neural network (ANN). The hybrid model was tested for
its performance using statistical parameters like root mean square error (RMSE)
and mean absolute percentage error (MAPE). The model resulted in a prediction
accuracy of 98.83%. The results suggested that development of hybrid model
improved the accuracy of prediction. To serve the medicinal community for
prediction of liver disease among patients, a graphical user interface (GUI)
has been developed using R. The GUI is deployed as a package in local
repository of R platform for users to perform prediction.Comment: 9 pages, 6 figure
- …