2 research outputs found

Enabling Transparent Acceleration of Big Data Frameworks Using Heterogeneous Hardware

Authors: Bitsakos Constantinos; Doka Katerina; Fumero Alfonso Juan; Katsakioris Christos; Kotselidis Christos-Efthymios; Koziris Nectarios; Stratikopoulos Athanasios; Xekalaki Maria
Publication date: 01/01/2022

The University of Manchester - Institutional Repository

Learning from Multi-Class Imbalanced Big Data with Apache Spark

Author: Sleeman William C, IV
Publication venue: VCU Scholars Compass
Publication date: 01/01/2021

With data becoming a new form of currency, its analysis has become a top priority in both academia and industry, furthering advancements in high-performance computing and machine learning. However, these large, real-world datasets come with additional complications such as noise and class overlap. Problems are magnified when multi-class data is presented, especially since many of the popular algorithms were originally designed for binary data. Another challenge arises when the number of examples is not evenly distributed across all classes in a dataset. This often causes classifiers to favor the majority class over the minority classes, leading to undesirable results, as learning from the rare cases may be the primary goal. Many of the classic machine learning algorithms were not designed for multi-class, imbalanced data or parallelism, and so their effectiveness has been hindered. This dissertation addresses some of these challenges through in-depth experimentation with novel implementations of machine learning algorithms on Apache Spark, a distributed computing framework based on the MapReduce model designed to handle very large datasets. Experimentation showed that many of the traditional classifier algorithms do not translate well to a distributed computing environment, indicating the need for a new generation of algorithms targeting modern high-performance computing. A collection of popular oversampling methods, originally designed for small binary-class datasets, has been implemented using Apache Spark for the first time to improve parallelism and add multi-class support. An extensive study on how instance-level difficulty affects learning from large datasets was also performed.

VCU Scholars Compass