Hellinger Distance Trees for Imbalanced Streams
Classifiers trained on data sets with an imbalanced class distribution
are known to exhibit poor generalisation performance, a phenomenon known as the
imbalanced learning problem. The problem becomes particularly acute for
incremental classifiers operating on imbalanced data streams,
especially when the learning objective is rare-class identification. Because
accuracy can give a misleading impression of performance on imbalanced data,
existing stream classifiers driven by accuracy may suffer poor minority-class
performance on imbalanced streams, resulting in low minority-class
recall rates. In this paper we address this deficiency by proposing the
Hellinger distance measure as a split criterion for very fast decision trees.
We demonstrate that using the Hellinger distance yields a statistically
significant improvement in recall rates on imbalanced data streams, with an
acceptable increase in the false positive rate.
Comment: 6 pages, 2 figures, to be published in Proceedings of the 22nd International
Conference on Pattern Recognition (ICPR) 201
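A skew-insensitive split criterion of the kind described above can be sketched for a two-way binary-class split as follows (the function name and two-class setup are illustrative assumptions, not the authors' implementation):

```python
from math import sqrt

def hellinger_split_score(left_pos, left_neg, right_pos, right_neg):
    """Hellinger distance between the positive- and negative-class
    distributions induced by a binary split (higher = better separation).
    The criterion uses class-conditional proportions only, so it is
    insensitive to the class prior, which is why it suits skewed streams."""
    total_pos = left_pos + right_pos
    total_neg = left_neg + right_neg
    score = 0.0
    for pos, neg in ((left_pos, left_neg), (right_pos, right_neg)):
        score += (sqrt(pos / total_pos) - sqrt(neg / total_neg)) ** 2
    return sqrt(score)

# A split that isolates most minority examples scores higher:
print(hellinger_split_score(9, 10, 1, 90))   # well-separating split
print(hellinger_split_score(5, 50, 5, 50))   # uninformative split -> 0.0
```

In a streaming tree the candidate split with the highest score would be chosen, in place of an accuracy- or gain-based criterion.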
A hybrid algorithm to improve the accuracy of support vector machines on skewed data-sets
Over the past few years, it has been shown that the generalization power of Support Vector Machines (SVM) falls dramatically on imbalanced data-sets. In this paper, we propose a new method to improve the accuracy of SVM on imbalanced data-sets. To achieve this, we first used undersampling and SVM to obtain the initial support vectors and a sketch of the hyperplane. These support vectors help to generate new artificial instances, which take part as the initial population of a genetic algorithm. The genetic algorithm improves the population of artificial instances from one generation to the next and eliminates instances that produce noise in the hyperplane. Finally, the generated and evolved data were added to the original data-set to minimize the imbalance and improve the generalization ability of the SVM on skewed data-sets.
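The undersampling step that opens this pipeline can be illustrated with a minimal sketch (pure Python; the function name and signature are our assumptions, not the authors' code):

```python
import random

def undersample_majority(X, y, majority_label, ratio=1.0, seed=0):
    """Keep all minority examples and a random subset of the majority
    class of size ~ ratio * len(minority).  Sketch only."""
    rng = random.Random(seed)
    maj = [i for i, lbl in enumerate(y) if lbl == majority_label]
    mino = [i for i, lbl in enumerate(y) if lbl != majority_label]
    keep = rng.sample(maj, min(len(maj), int(ratio * len(mino))))
    idx = sorted(keep + mino)
    return [X[i] for i in idx], [y[i] for i in idx]

X = [[float(i)] for i in range(100)]
y = [0] * 90 + [1] * 10          # 9:1 imbalance
Xb, yb = undersample_majority(X, y, majority_label=0)
print(yb.count(0), yb.count(1))  # 10 10 -> balanced before SVM training
```

In the paper's method the SVM trained on this balanced subset supplies the initial support vectors that seed the genetic algorithm.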
Epileptic Seizure Detection in EEGs by Using Random Tree Forest, Naïve Bayes and KNN Classification
Epilepsy is a disease that attacks the nerves. To detect epilepsy, it is necessary to
analyze the results of an EEG test. In this study, we compared the naïve Bayes, random tree forest, and K-nearest neighbour (KNN) classification algorithms for detecting epilepsy. The raw EEG data were pre-processed before feature extraction. We then trained the three algorithms: KNN classification, naïve Bayes classification, and random tree forest. The last step was validation of the trained models. Comparing the three classifiers, we calculated accuracy, sensitivity, specificity, and precision. The best-performing classifier is the KNN classifier (accuracy: 92.7%), ahead of random tree forest (accuracy: 86.6%) and the naïve Bayes classifier (accuracy: 55.6%). In terms of precision, KNN classification also gives the best result (82.5%), compared with naïve Bayes classification (25.3%) and random tree forest (68.2%). For sensitivity, however, naïve Bayes classification is best at 80.3%, compared with KNN (73.2%) and random tree forest (42.2%). For specificity, KNN classification gives 96.7%, followed by random tree forest (95.9%) and naïve Bayes (50.4%). The training time of naïve Bayes was 0.166030 sec, the training time of random tree forest was 2.4094 sec, and KNN was the slowest to train at 4.789 sec. Overall, KNN classification gives better performance than naïve Bayes and random tree forest classification.
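The four metrics reported in such comparisons follow directly from the binary confusion matrix; a small illustrative helper (names and example counts are ours, not the authors'):

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall), specificity and precision
    from binary confusion-matrix counts."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision":   tp / (tp + fp),
    }

# e.g. 3 seizures caught, 1 false alarm, 5 correct rejections, 1 miss:
m = binary_metrics(tp=3, fp=1, tn=5, fn=1)
print(m["accuracy"], m["sensitivity"])  # 0.8 0.75
```

Reporting all four together, as the study does, matters on imbalanced medical data, since a classifier can score high accuracy while missing most seizures.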
Combine vector quantization and support vector machine for imbalanced datasets
In cases of extremely imbalanced datasets with high dimensions, standard machine learning techniques tend to be overwhelmed by the large classes. This paper rebalances skewed datasets by compressing the majority class. The approach combines Vector Quantization and Support Vector Machine into a new method, VQ-SVM, to rebalance datasets without significant information loss. Issues such as distortion and support vectors are discussed to address the trade-off between information loss and undersampling. Experiments compare VQ-SVM and standard SVM on imbalanced datasets with varied imbalance ratios, and the results show that the performance of VQ-SVM is superior to SVM, especially in the case of extremely imbalanced large datasets.
IFIP International Conference on Artificial Intelligence in Theory and Practice - Integration of AI with other Technologies
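The vector-quantization step, compressing the majority class to a small codebook, can be sketched with plain k-means (an illustrative stand-in for the paper's VQ design, which may differ):

```python
import random

def vector_quantize(points, k, iters=20, seed=0):
    """Compress a point set to k codebook vectors with plain k-means,
    standing in for the VQ step that shrinks the majority class."""
    rng = random.Random(seed)
    codebook = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, codebook[c])))
            clusters[nearest].append(p)
        for j, cl in enumerate(clusters):
            if cl:  # leave empty clusters' codewords untouched
                codebook[j] = [sum(dim) / len(cl) for dim in zip(*cl)]
    return codebook

majority = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]
print(vector_quantize(majority, k=1))  # [[2.0, 2.0]] (the centroid)
```

Training the SVM on the k codewords plus the untouched minority class then yields a far smaller, roughly balanced training set.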
A post-processing strategy for SVM learning from unbalanced data
Standard learning algorithms may perform poorly when learning
from unbalanced datasets. Based on Fisher's discriminant analysis,
a post-processing strategy is introduced to deal with datasets exhibiting
significant imbalance in the data distribution. A new bias is defined, which
reduces skew towards the minority class. Empirical results from experiments
with a learned SVM model on twelve UCI datasets indicate that the proposed
solution improves on the original SVM, and also on results reported for
z-SVM, in terms of g-mean and sensitivity.
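A post-processing bias adjustment of this general flavour can be sketched as a threshold search that maximizes the g-mean over held-out SVM scores (a hypothetical sketch, not the paper's Fisher-discriminant-based bias):

```python
from math import sqrt

def retune_bias(scores, labels):
    """Pick the decision threshold on validation-set SVM scores that
    maximizes the g-mean sqrt(sensitivity * specificity)."""
    best_t, best_g = 0.0, -1.0
    pos = sum(labels)
    neg = len(labels) - pos
    for t in sorted(set(scores)):
        pred = [1 if s >= t else 0 for s in scores]
        tp = sum(p * l for p, l in zip(pred, labels))
        tn = sum((1 - p) * (1 - l) for p, l in zip(pred, labels))
        g = sqrt((tp / pos) * (tn / neg)) if pos and neg else 0.0
        if g > best_g:
            best_t, best_g = t, g
    return best_t

# Scores from a hypothetical SVM; minority label is 1:
print(retune_bias([-2.0, -1.0, 0.5, 1.0, 2.0], [0, 0, 1, 1, 1]))  # 0.5
```

The appeal of such post-processing is that the SVM itself is left untouched; only its decision boundary offset is shifted after training.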
Equity Forecast: Predicting Long Term Stock Price Movement using Machine Learning
Long term investment is one of the major investment strategies. However, calculating the intrinsic value of a company and evaluating its shares for long term investment is not easy, since analysts have to consider a large number of financial indicators and evaluate them correctly. So far, machines have provided little help in predicting the direction of a company's value over a longer period of time. In this paper we present a machine learning aided approach to evaluating an equity's future price over the long term. Our method correctly predicts whether a company's value will be 10% higher or not over a period of one year in 76.5% of cases.
Keywords: Machine learning, Long term investment, Equity, Stock price prediction.
JEL: H54, D92, E20
Bridging the Gap: Simultaneous Fine Tuning for Data Re-Balancing
There are many real-world classification problems wherein the issue of data
imbalance (the case when a data set contains substantially more samples for
one/many classes than the rest) is unavoidable. While under-sampling the
problematic classes is a common solution, this is not a compelling option when
the large data class is itself diverse and/or the limited data class is
especially small. We suggest a strategy based on recent work concerning limited
data problems which utilizes a supplemental set of images with similar
properties to the limited data class to aid in the training of a neural
network. We show results for our model against other typical methods on a
real-world synthetic aperture sonar data set. Code can be found at
github.com/JohnMcKay/dataImbalance.
Comment: Submitted to IGARSS 2018, 4 pages, 8 figures