4,051 research outputs found
An empirical evaluation of imbalanced data strategies from a practitioner's point of view
This research tested the following well known strategies to deal with binary
imbalanced data on 82 different real life data sets (sampled to imbalance rates
of 5%, 3%, 1%, and 0.1%): class weight, SMOTE, Underbagging, and a baseline
(just the base classifier). As base classifiers we used SVM with RBF kernel,
random forests, and gradient boosting machines and we measured the quality of
the resulting classifier using 6 different metrics (Area under the curve,
Accuracy, F-measure, G-mean, Matthew's correlation coefficient and Balanced
accuracy). The best strategy strongly depends on the metric used to measure the
quality of the classifier. For AUC and accuracy class weight and the baseline
perform better; for F-measure and MCC, SMOTE performs better; and for G-mean
and balanced accuracy, underbagging
Ensemble classifier and resampling for imbalanced multiclass learning
An ensemble classifier called DECIML has previously reported that the classifier is able to perform on benchmark data compared to several single classifiers and ensemble classifiers such as AdaBoost, Bagging
and Random Forest.The implementation of the ensemble using sampling was carried out in order to investigate if there are any improvements in the classification performances of the DECIML.Random sampling with replacement (SWR) method is applied to minority class in the imbalanced multiclass data. Results show that the SWR is able to increase the average performance of the ensemble classifie
Estudio de métodos de construcción de ensembles de clasificadores y aplicaciones
La inteligencia artificial se dedica a la creación de sistemas informáticos con un comportamiento inteligente. Dentro de este área el aprendizaje computacional estudia la creación de sistemas que aprenden por sà mismos.
Un tipo de aprendizaje computacional es el aprendizaje supervisado, en el cual, se le proporcionan al sistema tanto las entradas como la salida esperada y el sistema aprende a partir de estos datos. Un sistema de este tipo se denomina clasificador.
En ocasiones ocurre, que en el conjunto de ejemplos que utiliza el sistema para aprender, el número de ejemplos de un tipo es mucho mayor que el número de ejemplos de otro tipo. Cuando esto ocurre se habla de conjuntos desequilibrados.
La combinación de varios clasificadores es lo que se denomina "ensemble", y a menudo ofrece mejores resultados que cualquiera de los miembros que lo forman. Una de las claves para el buen funcionamiento de los ensembles es la diversidad.
Esta tesis, se centra en el desarrollo de nuevos algoritmos de construcción de ensembles, centrados en técnicas de incremento de la diversidad y en los problemas desequilibrados. Adicionalmente, se aplican estas técnicas a la solución de varias problemas industriales.Ministerio de EconomÃa y Competitividad, proyecto TIN-2011-2404
Data mining for detecting Bitcoin Ponzi schemes
Soon after its introduction in 2009, Bitcoin has been adopted by
cyber-criminals, which rely on its pseudonymity to implement virtually
untraceable scams. One of the typical scams that operate on Bitcoin are the
so-called Ponzi schemes. These are fraudulent investments which repay users
with the funds invested by new users that join the scheme, and implode when it
is no longer possible to find new investments. Despite being illegal in many
countries, Ponzi schemes are now proliferating on Bitcoin, and they keep
alluring new victims, who are plundered of millions of dollars. We apply data
mining techniques to detect Bitcoin addresses related to Ponzi schemes. Our
starting point is a dataset of features of real-world Ponzi schemes, that we
construct by analysing, on the Bitcoin blockchain, the transactions used to
perform the scams. We use this dataset to experiment with various machine
learning algorithms, and we assess their effectiveness through standard
validation protocols and performance metrics. The best of the classifiers we
have experimented can identify most of the Ponzi schemes in the dataset, with a
low number of false positives
- …