5,716 research outputs found
A Multi-Gene Genetic Programming Application for Predicting Students Failure at School
Several efforts to predict student failure rate (SFR) at school accurately
still remains a core problem area faced by many in the educational sector. The
procedure for forecasting SFR are rigid and most often times require data
scaling or conversion into binary form such as is the case of the logistic
model which may lead to lose of information and effect size attenuation. Also,
the high number of factors, incomplete and unbalanced dataset, and black boxing
issues as in Artificial Neural Networks and Fuzzy logic systems exposes the
need for more efficient tools. Currently the application of Genetic Programming
(GP) holds great promises and has produced tremendous positive results in
different sectors. In this regard, this study developed GPSFARPS, a software
application to provide a robust solution to the prediction of SFR using an
evolutionary algorithm known as multi-gene genetic programming. The approach is
validated by feeding a testing data set to the evolved GP models. Result
obtained from GPSFARPS simulations show its unique ability to evolve a suitable
failure rate expression with a fast convergence at 30 generations from a
maximum specified generation of 500. The multi-gene system was also able to
minimize the evolved model expression and accurately predict student failure
rate using a subset of the original expressionComment: 14 pages, 9 figures, Journal paper. arXiv admin note: text overlap
with arXiv:1403.0623 by other author
Overview of Random Forest Methodology and Practical Guidance with Emphasis on Computational Biology and Bioinformatics
The Random Forest (RF) algorithm by Leo Breiman has become a
standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables and returns measures of variable importance. This paper synthesizes ten years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is given to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics as well as some representative examples of RF applications in this context and possible directions for future research
Chance-Constrained Day-Ahead Hourly Scheduling in Distribution System Operation
This paper aims to propose a two-step approach for day-ahead hourly
scheduling in a distribution system operation, which contains two operation
costs, the operation cost at substation level and feeder level. In the first
step, the objective is to minimize the electric power purchase from the
day-ahead market with the stochastic optimization. The historical data of
day-ahead hourly electric power consumption is used to provide the forecast
results with the forecasting error, which is presented by a chance constraint
and formulated into a deterministic form by Gaussian mixture model (GMM). In
the second step, the objective is to minimize the system loss. Considering the
nonconvexity of the three-phase balanced AC optimal power flow problem in
distribution systems, the second-order cone program (SOCP) is used to relax the
problem. Then, a distributed optimization approach is built based on the
alternating direction method of multiplier (ADMM). The results shows that the
validity and effectiveness method.Comment: 5 pages, preprint for Asilomar Conference on Signals, Systems, and
Computers 201
Multiobjective Evolutionary Optimization of Type-2 Fuzzy Rule-Based Systems for Financial Data Classification
Classification techniques are becoming essential in the financial world for reducing risks and possible disasters. Managers are interested in not only high accuracy, but in interpretability and transparency as well. It is widely accepted now that the comprehension of how inputs and outputs are related to each other is crucial for taking operative and strategic decisions. Furthermore, inputs are often affected by contextual factors and characterized by a high level of uncertainty. In addition, financial data are usually highly skewed toward the majority class. With the aim of achieving high accuracies, preserving the interpretability, and managing uncertain and unbalanced data, this paper presents a novel method to deal with financial data classification by adopting type-2 fuzzy rule-based classifiers (FRBCs) generated from data by a multiobjective evolutionary algorithm (MOEA). The classifiers employ an approach, denoted as scaled dominance, for defining rule weights in such a way to help minority classes to be correctly classified. In particular, we have extended PAES-RCS, an MOEA-based approach to learn concurrently the rule and data bases of FRBCs, for managing both interval type-2 fuzzy sets and unbalanced datasets. To the best of our knowledge, this is the first work that generates type-2 FRBCs by concurrently maximizing accuracy and minimizing the number of rules and the rule length with the objective of producing interpretable models of real-world skewed and incomplete financial datasets. The rule bases are generated by exploiting a rule and condition selection (RCS) approach, which selects a reduced number of rules from a heuristically generated rule base and a reduced number of conditions for each selected rule during the evolutionary process. The weight associated with each rule is scaled by the scaled dominance approach on the fuzzy frequency of the output class, in order to give a higher weight to the minority class. As regards the data base learning, the membership function parameters of the interval type-2 fuzzy sets used in the rules are learned concurrently to the application of RCS. Unbalanced datasets are managed by using, in addition to complexity, selectivity and specificity as objectives of the MOEA rather than only the classification rate. We tested our approach, named IT2-PAES-RCS, on 11 financial datasets and compared our results with the ones obtained by the original PAES-RCS with three objectives and with and without scaled dominance, the FRBCs, fuzzy association rule-based classification model for high-dimensional dataset (FARC-HD) and fuzzy unordered rules induction algorithm (FURIA), the classical C4.5 decision tree algorithm, and its cost-sensitive version. Using nonparametric statistical tests, we will show that IT2-PAES-RCS generates FRBCs with, on average, accuracy statistically comparable with and complexity lower than the ones generated by the two versions of the original PAES-RCS. Further, the FRBCs generated by FARC-HD and FURIA and the decision trees computed by C4.5 and its cost-sensitive version, despite the highest complexity, result to be less accurate than the FRBCs generated by IT2-PAES-RCS. Finally, we will highlight how these FRBCs are easily interpretable by showing and discussing one of them
Ensemble Pruning for Glaucoma Detection in an Unbalanced Data Set
Background: Random forests are successful classifier ensemble methods consisting of typically 100 to 1000 classification trees. Ensemble pruning techniques reduce the computational cost, especially the memory demand, of random forests by reducing the number of trees without relevant loss of performance or even with increased performance of the sub-ensemble. The application to the problem of an early detection of glaucoma, a severe eye disease with low prevalence, based on topographical measurements of the eye background faces specific challenges. Objectives: We examine the performance of ensemble pruning strategies for glaucoma detection in an unbalanced data situation. Methods: The data set consists of 102 topographical features of the eye background of 254 healthy controls and 55 glaucoma patients. We compare the area under the receiver operating characteristic curve (AUC), and the Brier score on the total data set, in the majority class, and in the minority class of pruned random forest ensembles obtained with strategies based on the prediction accuracy of greedily grown sub-ensembles, the uncertainty weighted accuracy, and the similarity between single trees. To validate the findings and to examine the influence of the prevalence of glaucoma in the data set, we additionally perform a simulation study with lower prevalences of glaucoma. Results: In glaucoma classification all three pruning strategies lead to improved AUC and smaller Brier scores on the total data set with sub-ensembles as small as 30 to 80 trees compared to the classification results obtained with the full ensemble consisting of 1000 trees. In the simulation study, we were able to show that the prevalence of glaucoma is a critical factor and lower prevalence decreases the performance of our pruning strategies. Conclusions: The memory demand for glaucoma classification in an unbalanced data situation based on random forests could effectively be reduced by the application of pruning strategies without loss of performance in a population with increased risk of glaucoma
Coupling different methods for overcoming the class imbalance problem
Many classification problems must deal with imbalanced datasets where one class \u2013 the majority class \u2013 outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical.
Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches.
To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature: as a consequence, it is possible to report a fair comparison among different approaches in the literature.
Our best approach (MATLAB code and datasets not easily accessible elsewhere) will be available at https://www.dei.unipd.it/node/2357
Feature Selection via Robust Weighted Score for High Dimensional Binary Class-Imbalanced Gene Expression Data
In this paper, a robust weighted score for unbalanced data (ROWSU) is
proposed for selecting the most discriminative feature for high dimensional
gene expression binary classification with class-imbalance problem. The method
addresses one of the most challenging problems of highly skewed class
distributions in gene expression datasets that adversely affect the performance
of classification algorithms. First, the training dataset is balanced by
synthetically generating data points from minority class observations. Second,
a minimum subset of genes is selected using a greedy search approach. Third, a
novel weighted robust score, where the weights are computed by support vectors,
is introduced to obtain a refined set of genes. The highest-scoring genes based
on this approach are combined with the minimum subset of genes selected by the
greedy search approach to form the final set of genes. The novel method ensures
the selection of the most discriminative genes, even in the presence of skewed
class distribution, thus improving the performance of the classifiers. The
performance of the proposed ROWSU method is evaluated on gene expression
datasets. Classification accuracy and sensitivity are used as performance
metrics to compare the proposed ROWSU algorithm with several other
state-of-the-art methods. Boxplots and stability plots are also constructed for
a better understanding of the results. The results show that the proposed
method outperforms the existing feature selection procedures based on
classification performance from k nearest neighbours (kNN) and random forest
(RF) classifiers.Comment: 25 page
- ā¦