4 research outputs found
An insight into imbalanced Big Data classification: outcomes and challenges
Big Data applications have emerged in recent years, and researchers from many disciplines are aware of the advantages of extracting knowledge from this type of problem. However, traditional learning approaches cannot be applied directly because of scalability issues. To overcome this, the MapReduce framework has arisen as a “de facto” solution. In essence, it carries out a “divide-and-conquer” distributed procedure in a fault-tolerant way suited to commodity hardware. Because the discipline is still recent, little research has been conducted on imbalanced classification for Big Data. The main reason is the difficulty of adapting standard techniques to the MapReduce programming style. Additionally, intrinsic problems of imbalanced data, namely lack of data and small disjuncts, are accentuated when the data are partitioned to fit the MapReduce programming style. This paper rests on three main pillars. First, it presents the first outcomes for imbalanced classification in Big Data problems and introduces the current state of research in this area. Second, it analyzes the behavior of standard pre-processing techniques in this particular framework. Finally, taking into account the experimental results obtained throughout this work, it discusses the challenges and future directions for the topic.
This work has been partially supported by the Spanish Ministry of Science and Technology under Projects TIN2014-57251-P and TIN2015-68454-R, the Andalusian Research Plan P11-TIC-7765, the Foundation BBVA Project 75/2016 BigDaPTOOLS, and the National Science Foundation (NSF) Grant IIS-1447795.
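The way partitioning thins out the minority class can be illustrated with a minimal sketch in plain Python. It is not the paper's experimental setup and uses neither Hadoop nor Spark; the function names, the toy labels, and the number of partitions are assumptions made purely for illustration. Each partition is counted on its own (the map step) and the partial counts are then summed (the reduce step).

```python
from collections import Counter
from functools import reduce


def split_into_partitions(labels, n_partitions):
    """Deal labels round-robin into n_partitions chunks (hypothetical splitter)."""
    return [labels[i::n_partitions] for i in range(n_partitions)]


def map_count(partition):
    """Map step: count class labels inside one partition only."""
    return Counter(partition)


def reduce_counts(left, right):
    """Reduce step: merge two partial class counts."""
    return left + right


# Toy data (assumed): 1000 majority instances and 10 minority instances.
labels = ["majority"] * 1000 + ["minority"] * 10
partitions = split_into_partitions(labels, n_partitions=5)

partial_counts = [map_count(p) for p in partitions]
total_counts = reduce(reduce_counts, partial_counts, Counter())

# Each mapper sees only a fraction of the already scarce minority class.
minority_per_partition = [c["minority"] for c in partial_counts]
print(minority_per_partition, dict(total_counts))
```

With these assumed numbers, every mapper works with only a fifth of the ten minority instances, which is one way to picture how the lack-of-data problem is accentuated by partitioning.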
Do you really follow them? Automatic detection of credulous Twitter users
Online Social Media represent a pervasive source of information able to reach
a huge audience. Sadly, recent studies show how online social bots (automated,
often malicious accounts, populating social networks and mimicking genuine
users) are able to amplify the dissemination of (fake) information by orders of
magnitude. Using Twitter as a benchmark, in this work we focus on what we
define as credulous users, i.e., human-operated accounts with a high percentage of
bots among their followings. Being more exposed to the harmful activities of
social bots, credulous users may run the risk of being more influenced than
other users; even worse, although unknowingly, they could become spreaders of
misleading information (e.g., by retweeting bots). We design and develop a
supervised classifier to automatically recognize credulous users. The best
tested configuration achieves an accuracy of 93.27% and AUC-ROC of 0.93, thus
leading to positive and encouraging results.
Comment: 8 pages, 2 tables. Accepted for publication at IDEAL 2019 (20th
International Conference on Intelligent Data Engineering and Automated
Learning, Manchester, UK, 14-16 November 2019). The present version is the
accepted version, and it is not the final published version.
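The definition of a credulous user given in the abstract (a human-operated account with a high percentage of bots among its followings) can be sketched as a simple ratio. This is only an illustration of the definition, not the supervised classifier the authors built; the function names, the example account names, and the 0.5 threshold are all assumptions, since the abstract does not state what counts as "high".

```python
def bot_ratio(followings, known_bots):
    """Fraction of an account's followings that appear in a set of known bots."""
    if not followings:
        return 0.0
    bots_followed = sum(1 for account in followings if account in known_bots)
    return bots_followed / len(followings)


def is_credulous(followings, known_bots, threshold=0.5):
    """Flag an account whose bot ratio reaches the (assumed) threshold."""
    return bot_ratio(followings, known_bots) >= threshold


# Hypothetical data: account names are made up for the example.
known_bots = {"bot_a", "bot_b", "bot_c"}
user_one = ["bot_a", "alice", "bot_b", "bob"]   # 2 of 4 followings are bots
user_two = ["alice", "bob", "bot_c"]            # 1 of 3 followings is a bot

print(is_credulous(user_one, known_bots), is_credulous(user_two, known_bots))
```

A ratio like this presupposes that the bots among the followings are already known, so in practice it would depend on a separate bot-detection step.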
Handling Imbalanced Classes: Feature Based Variance Ranking Techniques for Classification
Obtaining good predictions in the presence of imbalanced classes has posed significant challenges to the data science community. Imbalanced class data describes a situation in which the classes or groups in a dataset contain unequal numbers of instances. In most real-life datasets, one class is far larger than the others and is called the majority class, while the smaller classes are called the minority classes. Even when classification accuracy is very high, the number of minority instances correctly classified is usually very small compared with the total number of minority instances in the dataset and, more often than not, the minority classes are what is being sought. This work is specifically concerned with providing techniques that improve classification performance by eliminating or reducing the negative effects of class imbalance. Real-life datasets have been found to contain different types of error in combination with class imbalance. While these errors are easily corrected, solutions to class imbalance have remained elusive.
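The gap between high accuracy and poor minority detection can be made concrete with a small worked example. The figures below (990 majority instances, 10 minority instances, and a classifier that finds only one of the ten) are assumed for illustration and do not come from the work itself.

```python
def accuracy(y_true, y_pred):
    """Share of all instances whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive):
    """Share of the instances of class `positive` that were predicted as such."""
    predictions_for_class = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predictions_for_class:
        return 0.0
    hits = sum(1 for p in predictions_for_class if p == positive)
    return hits / len(predictions_for_class)


# Assumed toy data: label 0 is the majority class, label 1 the minority class.
y_true = [0] * 990 + [1] * 10
# A classifier that labels everything as majority except one minority instance.
y_pred = [0] * 990 + [1] + [0] * 9

overall_accuracy = accuracy(y_true, y_pred)           # 991 of 1000 correct
minority_recall = recall(y_true, y_pred, positive=1)  # 1 of 10 found
print(overall_accuracy, minority_recall)
```

Here 99.1% of the predictions are correct, yet only one minority instance in ten is recovered.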
Previously, machine learning (ML) techniques have been used to address the problem of class imbalance. Notable shortcomings have been identified in this approach. It mostly involves fine-tuning and changing the parameters of the algorithms, and this process is not standardised because of the countless algorithms and parameters available. In general, the results obtained from these unstandardised ML techniques are very inconsistent and cannot be replicated with similar datasets and algorithms.
We present a novel technique for dealing with imbalanced classes, called variance ranking feature selection, that enables machine learning algorithms to classify more of the minority class instances, hence reducing the negative effects of class imbalance. Our approach utilises an intrinsic property of the datasets, the variance, which is a measure of how the data items are spread within the dataset's vector space. We demonstrate the selection of features at different performance thresholds, thereby providing an opportunity for performance and feature significance to be assessed and correlated at different levels of prediction. In the evaluation, we compared our feature selection with some of the best-known feature selection techniques using proximity distance comparison techniques, and verified all the results on different datasets, both binary and multi-class, with varying degrees of class imbalance. In all the experiments, the results showed a significant improvement over previous work on class imbalance.
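The general idea of ranking features by their variance can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' exact procedure: the abstract does not specify how the ranking is computed or how thresholds are chosen, so the function names, the toy rows, and the use of the population variance are all hypothetical choices.

```python
from statistics import pvariance


def rank_features_by_variance(rows, feature_names):
    """Return (feature, variance) pairs sorted from highest to lowest variance."""
    columns = list(zip(*rows))  # transpose rows into per-feature columns
    variances = {
        name: pvariance(column) for name, column in zip(feature_names, columns)
    }
    return sorted(variances.items(), key=lambda item: item[1], reverse=True)


def select_top_features(rows, feature_names, k):
    """Keep the names of the k highest-variance features (assumed cut-off rule)."""
    ranked = rank_features_by_variance(rows, feature_names)
    return [name for name, _ in ranked[:k]]


# Hypothetical toy dataset: three instances, three features.
feature_names = ["f_constant", "f_wide", "f_narrow"]
rows = [
    (1.0, 10.0, 5.0),
    (1.0, 20.0, 6.0),
    (1.0, 30.0, 7.0),
]

ranking = rank_features_by_variance(rows, feature_names)
selected = select_top_features(rows, feature_names, k=2)
print(ranking, selected)
```

Raw variances are only comparable when the features share a scale, so a sketch like this would normally be applied after normalising the data.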
Recent Advances in Indoor Localization Systems and Technologies
Despite the enormous technical progress seen in the past few years, indoor localization technologies have not yet reached the maturity of GNSS solutions. The 23 selected papers in this book present recent advances and new developments in indoor localization systems and technologies, propose novel or improved methods with increased performance, provide insight into various aspects of quality control, and introduce some unorthodox positioning methods.