Hellinger Distance Trees for Imbalanced Streams
Classifiers trained on data sets with an imbalanced class distribution
are known to exhibit poor generalisation performance, a phenomenon known as the
imbalanced learning problem. The problem becomes particularly acute for
incremental classifiers operating on imbalanced data streams,
especially when the learning objective is rare-class identification. Because
accuracy can give a misleading impression of performance on imbalanced data,
existing stream classifiers driven by accuracy may suffer poor minority-class
performance on imbalanced streams, resulting in low minority-class
recall rates. In this paper we address this deficiency by proposing the
Hellinger distance measure as a split criterion for very fast decision trees.
We demonstrate that using the Hellinger distance yields a statistically
significant improvement in recall rates on imbalanced data streams, with an
acceptable increase in the false positive rate.
Comment: 6 pages, 2 figures, to be published in Proceedings of the 22nd International
Conference on Pattern Recognition (ICPR) 201
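A skew-insensitive split criterion of the kind described above can be sketched for a two-way binary-class split as follows (the function name and two-class setup are illustrative assumptions, not the authors' implementation):

```python
from math import sqrt

def hellinger_split_score(left_pos, left_neg, right_pos, right_neg):
    """Hellinger distance between the positive- and negative-class
    distributions induced by a binary split (higher = better separation).
    The criterion uses class-conditional proportions only, so it is
    insensitive to the class prior, which is why it suits skewed streams."""
    total_pos = left_pos + right_pos
    total_neg = left_neg + right_neg
    score = 0.0
    for pos, neg in ((left_pos, left_neg), (right_pos, right_neg)):
        score += (sqrt(pos / total_pos) - sqrt(neg / total_neg)) ** 2
    return sqrt(score)

# A split that isolates most minority examples scores higher:
print(hellinger_split_score(9, 10, 1, 90))   # well-separating split
print(hellinger_split_score(5, 50, 5, 50))   # uninformative split -> 0.0
```

In a streaming tree the candidate split with the highest score would be chosen, in place of an accuracy- or gain-based criterion.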
A hybrid algorithm to improve the accuracy of support vector machines on skewed data-sets
Over the past few years, it has been shown that the generalization power of Support Vector Machines (SVM) falls dramatically on imbalanced data-sets. In this paper, we propose a new method to improve the accuracy of SVM on imbalanced data-sets. To achieve this, we first used undersampling and SVM to obtain the initial support vectors and a sketch of the hyperplane. These support vectors help to generate new artificial instances, which take part as the initial population of a genetic algorithm. The genetic algorithm improves the population of artificial instances from one generation to the next and eliminates instances that produce noise in the hyperplane. Finally, the generated and evolved data were added to the original data-set to minimize the imbalance and improve the generalization ability of the SVM on skewed data-sets.
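The undersampling step that opens this pipeline can be illustrated with a minimal sketch (pure Python; the function name and signature are our assumptions, not the authors' code):

```python
import random

def undersample_majority(X, y, majority_label, ratio=1.0, seed=0):
    """Keep all minority examples and a random subset of the majority
    class of size ~ ratio * len(minority).  Sketch only."""
    rng = random.Random(seed)
    maj = [i for i, lbl in enumerate(y) if lbl == majority_label]
    mino = [i for i, lbl in enumerate(y) if lbl != majority_label]
    keep = rng.sample(maj, min(len(maj), int(ratio * len(mino))))
    idx = sorted(keep + mino)
    return [X[i] for i in idx], [y[i] for i in idx]

X = [[float(i)] for i in range(100)]
y = [0] * 90 + [1] * 10          # 9:1 imbalance
Xb, yb = undersample_majority(X, y, majority_label=0)
print(yb.count(0), yb.count(1))  # 10 10 -> balanced before SVM training
```

In the paper's method the SVM trained on this balanced subset supplies the initial support vectors that seed the genetic algorithm.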
Epileptic Seizure Detection in EEGs by Using Random Tree Forest, Naïve Bayes and KNN Classification
Epilepsy is a disease that attacks the nerves. To detect epilepsy, it is necessary to
analyze the results of an EEG test. In this study, we compared the naïve Bayes, random tree forest, and K-nearest neighbour (KNN) classification algorithms for detecting epilepsy. The raw EEG data were pre-processed before feature extraction. We then trained the three algorithms: KNN classification, naïve Bayes classification, and random tree forest. The last step was validation of the trained models. Comparing the three classifiers, we calculated accuracy, sensitivity, specificity, and precision. The best-performing classifier is the KNN classifier (accuracy: 92.7%), ahead of random tree forest (accuracy: 86.6%) and the naïve Bayes classifier (accuracy: 55.6%). In terms of precision, KNN classification also gives the best result (82.5%), compared with naïve Bayes classification (25.3%) and random tree forest (68.2%). For sensitivity, however, naïve Bayes classification is best at 80.3%, compared with KNN (73.2%) and random tree forest (42.2%). For specificity, KNN classification gives 96.7%, followed by random tree forest (95.9%) and naïve Bayes (50.4%). The training time of naïve Bayes was 0.166030 sec, the training time of random tree forest was 2.4094 sec, and KNN was the slowest to train at 4.789 sec. Overall, KNN classification gives better performance than naïve Bayes and random tree forest classification.
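The four metrics reported in such comparisons follow directly from the binary confusion matrix; a small illustrative helper (names and example counts are ours, not the authors'):

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall), specificity and precision
    from binary confusion-matrix counts."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision":   tp / (tp + fp),
    }

# e.g. 3 seizures caught, 1 false alarm, 5 correct rejections, 1 miss:
m = binary_metrics(tp=3, fp=1, tn=5, fn=1)
print(m["accuracy"], m["sensitivity"])  # 0.8 0.75
```

Reporting all four together, as the study does, matters on imbalanced medical data, since a classifier can score high accuracy while missing most seizures.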
Combine vector quantization and support vector machine for imbalanced datasets
In cases of extremely imbalanced datasets with high dimensions, standard machine learning techniques tend to be overwhelmed by the large classes. This paper rebalances skewed datasets by compressing the majority class. The approach combines Vector Quantization and Support Vector Machine into a new method, VQ-SVM, to rebalance datasets without significant information loss. Issues such as distortion and support vectors are discussed to address the trade-off between information loss and undersampling. Experiments compare VQ-SVM and standard SVM on imbalanced datasets with varied imbalance ratios, and the results show that the performance of VQ-SVM is superior to SVM, especially in the case of extremely imbalanced large datasets.
IFIP International Conference on Artificial Intelligence in Theory and Practice - Integration of AI with other Technologies
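The vector-quantization step, compressing the majority class to a small codebook, can be sketched with plain k-means (an illustrative stand-in for the paper's VQ design, which may differ):

```python
import random

def vector_quantize(points, k, iters=20, seed=0):
    """Compress a point set to k codebook vectors with plain k-means,
    standing in for the VQ step that shrinks the majority class."""
    rng = random.Random(seed)
    codebook = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, codebook[c])))
            clusters[nearest].append(p)
        for j, cl in enumerate(clusters):
            if cl:  # leave empty clusters' codewords untouched
                codebook[j] = [sum(dim) / len(cl) for dim in zip(*cl)]
    return codebook

majority = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]
print(vector_quantize(majority, k=1))  # [[2.0, 2.0]] (the centroid)
```

Training the SVM on the k codewords plus the untouched minority class then yields a far smaller, roughly balanced training set.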
A post-processing strategy for SVM learning from unbalanced data
Standard learning algorithms may perform poorly when learning
from unbalanced datasets. Based on Fisher's discriminant analysis,
a post-processing strategy is introduced to deal with datasets exhibiting
significant imbalance in the data distribution. A new bias is defined, which
reduces skew towards the minority class. Empirical results from experiments
with a learned SVM model on twelve UCI datasets indicate that the proposed
solution improves on the original SVM, and also on results reported for
z-SVM, in terms of g-mean and sensitivity.
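A post-processing bias adjustment of this general flavour can be sketched as a threshold search that maximizes the g-mean over held-out SVM scores (a hypothetical sketch, not the paper's Fisher-discriminant-based bias):

```python
from math import sqrt

def retune_bias(scores, labels):
    """Pick the decision threshold on validation-set SVM scores that
    maximizes the g-mean sqrt(sensitivity * specificity)."""
    best_t, best_g = 0.0, -1.0
    pos = sum(labels)
    neg = len(labels) - pos
    for t in sorted(set(scores)):
        pred = [1 if s >= t else 0 for s in scores]
        tp = sum(p * l for p, l in zip(pred, labels))
        tn = sum((1 - p) * (1 - l) for p, l in zip(pred, labels))
        g = sqrt((tp / pos) * (tn / neg)) if pos and neg else 0.0
        if g > best_g:
            best_t, best_g = t, g
    return best_t

# Scores from a hypothetical SVM; minority label is 1:
print(retune_bias([-2.0, -1.0, 0.5, 1.0, 2.0], [0, 0, 1, 1, 1]))  # 0.5
```

The appeal of such post-processing is that the SVM itself is left untouched; only its decision boundary offset is shifted after training.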
Equity Forecast: Predicting Long Term Stock Price Movement using Machine Learning
Long term investment is one of the major investment strategies. However, calculating the intrinsic value of a company and evaluating its shares for long term investment is not easy, since analysts have to consider a large number of financial indicators and evaluate them correctly. So far, machines have provided little help in predicting the direction of a company's value over a longer period of time. In this paper we present a machine learning aided approach to evaluating an equity's future price over the long term. Our method correctly predicts whether a company's value will be 10% higher or not over a period of one year in 76.5% of cases.
Keywords: Machine learning, Long term investment, Equity, Stock price prediction.
JEL: H54, D92, E20
Bridging the Gap: Simultaneous Fine Tuning for Data Re-Balancing
There are many real-world classification problems wherein the issue of data
imbalance (the case when a data set contains substantially more samples for
one/many classes than the rest) is unavoidable. While under-sampling the
problematic classes is a common solution, this is not a compelling option when
the large data class is itself diverse and/or the limited data class is
especially small. We suggest a strategy based on recent work concerning limited
data problems which utilizes a supplemental set of images with similar
properties to the limited data class to aid in the training of a neural
network. We show results for our model against other typical methods on a
real-world synthetic aperture sonar data set. Code can be found at
github.com/JohnMcKay/dataImbalance.
Comment: Submitted to IGARSS 2018, 4 pages, 8 figures