8 research outputs found

    An Instance Selection Algorithm for Big Data in High imbalanced datasets based on LSH

    Training of Machine Learning (ML) models in real contexts often involves big data sets with high class imbalance, where the class of interest (the minority class) is underrepresented. Practical solutions using classical ML models address the problem of large data sets with parallel/distributed implementations of training algorithms, approximate model-based solutions, or instance selection (IS) algorithms that eliminate redundant information. However, the combined problem of big and highly imbalanced data sets has been less addressed. This work proposes three new IS methods able to deal with large and imbalanced data sets. The proposed methods use Locality Sensitive Hashing (LSH) as a base clustering technique; three different sampling methods are then applied on top of the clusters (or buckets) generated by LSH. The algorithms were developed in the Apache Spark framework, guaranteeing their scalability. Experiments carried out on three different datasets suggest that the proposed IS methods can improve the performance of a base ML model by between 5% and 19% in terms of the geometric mean.
    Comment: 23 pages, 15 figures
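    The bucket-then-sample idea described above can be sketched as follows. The random-hyperplane hash, the fixed per-bucket sampling rate, and the keep-all-minority rule are illustrative assumptions for a single-machine sketch; they are not the paper's three actual sampling methods or its Spark implementation.

    ```python
    import numpy as np

    def lsh_bucket_sample(X, y, n_planes=8, rate=0.5, seed=0):
        """Hash instances into LSH buckets via random hyperplanes, then
        randomly sample a fraction of the majority class in each bucket,
        always keeping every minority-class instance (an assumption here)."""
        rng = np.random.default_rng(seed)
        planes = rng.standard_normal((X.shape[1], n_planes))
        # Bucket key: the sign pattern of projections onto the planes.
        codes = (X @ planes > 0).astype(int)
        minority = min(set(y), key=list(y).count)

        buckets = {}
        for i, code in enumerate(codes):
            buckets.setdefault(tuple(code), []).append(i)

        keep = []
        for idx in buckets.values():
            maj = [i for i in idx if y[i] != minority]
            mino = [i for i in idx if y[i] == minority]
            if maj:
                # Keep at least one majority instance per bucket.
                n_keep = max(1, int(rate * len(maj)))
                keep.extend(rng.choice(maj, size=n_keep, replace=False).tolist())
            keep.extend(mino)  # minority class preserved in full
        return sorted(keep)
    ```

    Nearby points tend to share a sign pattern, so sampling within buckets discards redundant majority instances while leaving the decision boundary regions represented.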

    Instance selection of linear complexity for big data

    Over recent decades, database sizes have grown considerably. Larger sizes present new challenges, because machine learning algorithms are not prepared to process such large volumes of information. Instance selection methods can alleviate this problem when the size of the data set is medium to large. However, even these methods face similar problems with very large or massive data sets. In this paper, two new algorithms of linear complexity for instance selection are presented. Both algorithms use locality-sensitive hashing to find similarities between instances. While the complexity of conventional methods (usually quadratic, O(n²), or log-linear, O(n log n)) means that they are unable to process large data sets, the new proposal shows competitive results in terms of accuracy. Even more remarkably, it shortens execution time, as it reduces complexity to linear with respect to the data set size. The new proposal has been compared with some of the best-known instance selection methods and has also been evaluated on large data sets (up to a million instances).
    Supported by the Research Projects TIN 2011-24046 and TIN 2015-67534-P from the Spanish Ministry of Economy and Competitiveness.
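    The linear-time property comes from replacing pairwise comparisons with a single hashing pass. A hedged sketch in that spirit (keep only the first instance of each class that lands in each LSH bucket; this illustrates the idea, not the paper's exact pair of algorithms):

    ```python
    import numpy as np

    def lsh_is_sketch(X, y, n_planes=6, seed=0):
        """One-pass instance selection: hash each instance to an LSH
        bucket (random-hyperplane signature) and keep only the first
        instance of each class seen in that bucket. For a fixed number
        of planes this is O(n) in the data set size."""
        rng = np.random.default_rng(seed)
        planes = rng.standard_normal((X.shape[1], n_planes))
        seen = set()
        keep = []
        for i, x in enumerate(X):
            # (bucket signature, class label) identifies a kept slot.
            key = (tuple((x @ planes > 0).astype(int)), y[i])
            if key not in seen:
                seen.add(key)
                keep.append(i)
        return keep
    ```

    The selected set is bounded by (number of buckets) × (number of classes), independent of n, which is why execution time scales linearly rather than quadratically.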

    Comparison of instance selection algorithms I. Algorithms survey

    Abstract. Several methods have been proposed to reduce the number of instances (vectors) in the learning set. Some of them extract only bad vectors, while others try to remove as many instances as possible without significant degradation of the reduced dataset's usefulness for learning. Several strategies to shrink training sets are compared here using different neural and machine learning classification algorithms. In part II (the accompanying paper), results on benchmark databases are presented.

    Advances in Data Mining Knowledge Discovery and Applications

    Advances in Data Mining Knowledge Discovery and Applications aims to help data miners, researchers, scholars, and PhD students who wish to apply data mining techniques. The primary contribution of this book is highlighting frontier fields and implementations of knowledge discovery and data mining. It may seem that the same things are repeated; in general, however, the same approaches and techniques can help in different fields and areas of expertise. This book presents knowledge discovery and data mining applications in two sections. As is well known, data mining covers areas of statistics, machine learning, data management and databases, pattern recognition, artificial intelligence, and other fields. In this book, most of these areas are covered by different data mining applications. The eighteen chapters have been classified into two parts: Knowledge Discovery and Data Mining Applications.

    Algorithm Selection in Multimodal Medical Image Registration

    Medical image acquisition technology has improved significantly over the last several decades, and clinicians now rely on medical images to diagnose illnesses, determine treatment protocols, and plan surgery. Researchers divide medical images into two types: functional and anatomical. Anatomical imaging, such as magnetic resonance imaging (MRI), computed tomography (CT), ultrasound, and other systems, enables medical personnel to examine a body internally with great accuracy, thereby avoiding the risks associated with exploratory surgery. Functional (or physiological) imaging systems include single-photon emission computed tomography (SPECT), positron emission tomography (PET), and other methods, which detect or evaluate variations in absorption, blood flow, metabolism, and regional chemical composition. Notably, one of these medical imaging modalities alone cannot usually supply doctors with adequate information, and data obtained from several images of the same subject generally provide complementary information, combined via a process called medical image registration. Image registration may be defined as the process of geometrically mapping one image's coordinate system to the coordinate system of another image acquired from a different perspective and with a different sensor. Registration plays a crucial role in medical image assessment because it helps clinicians observe the developing trend of a disease and take proper measures accordingly. Medical image registration (MIR) has several applications: radiation therapy, tumour diagnosis and recognition, template atlas application, and surgical guidance systems. There are two types of registration: manual registration and computer-based registration.
In manual registration, the radiologist/physician completes all registration tasks interactively with visual feedback provided by the computer system, which can lead to serious problems. For instance, investigations conducted by two experts are not identical, and registration correctness is determined by the user's assessment of the relationship between anatomical features. Furthermore, it may take a long time for the user to achieve proper alignment, and the outcomes vary from user to user. As a result, the outcomes of manual alignment are doubtful and unreliable. The second approach is computer-based multimodal medical image registration, which targets various medical images and an array of application types. Automatic registration matches standard recognized characteristics or voxels in pre- and intra-operative imaging without user input, and registration of multimodal images is the initial step in integrating data from several images. Automatic image processing has emerged to improve on manual image registration in reliability, robustness, accuracy, and processing time. While such registration algorithms offer advantages when applied to some medical images, their use with others is accompanied by disadvantages. No registration technique outperforms all others on every input dataset, due to the variability of medical imaging and the diverse demands of applications; given the many available algorithms, choosing the one that best adapts to the task at hand is vital. The Algorithm Selection Problem has emerged in numerous research disciplines, including medical diagnosis, machine learning, optimization, and computation, and choosing the most suitable strategy for a particular problem seeks to minimize these issues.
This study delivers a universal and practical framework for multimodal registration algorithm choice. Its primary goal is to introduce a generic structure for constructing a medical image registration system capable of selecting the best registration process, from a range of registration algorithms, for a given dataset. Three strategies were constructed to examine the framework. The first strategy transforms the problem of algorithm selection into a classification problem. The second strategy investigates the effect of various parameters, such as optimization control points, on the optimal selection. The third strategy establishes a framework for choosing the optimal registration algorithm for a delivered dataset based on two primary criteria: registration algorithm applicability and performance measures. The approach relies on machine learning methods and artificial neural networks to determine which candidate is most promising. Several experiments and scenarios have been conducted, and the results reveal that the novel framework strategy achieves the best performance: high accuracy, reliability, robustness, and efficiency, with low processing time.
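The first strategy, casting algorithm selection as a classification problem, can be illustrated with a minimal meta-learning sketch. The meta-features and algorithm names below are hypothetical placeholders, not the study's actual feature set or its neural-network selector:

```python
import numpy as np

def select_algorithm(meta_features, train_meta, train_best):
    """1-NN meta-classifier: predict which registration algorithm to
    use by finding the previously seen dataset whose meta-features
    (e.g. resolution, noise level -- hypothetical here) are closest,
    and returning the algorithm that performed best on it."""
    dists = np.linalg.norm(train_meta - np.asarray(meta_features), axis=1)
    return train_best[int(np.argmin(dists))]

# Hypothetical history: meta-features of past datasets and the
# algorithm that registered each one best.
train_meta = np.array([[1.0, 0.1],
                       [0.2, 0.9],
                       [0.9, 0.2]])
train_best = ["mutual-information", "cross-correlation", "mutual-information"]
```

A real selector would replace the 1-NN rule with a trained classifier or neural network over richer meta-features, but the mapping (dataset description → best algorithm) is the same.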

    Visual Scene Understanding by Deep Fisher Discriminant Learning

    Modern deep learning has recently revolutionized several fields of classic machine learning and computer vision, such as scene understanding, natural language processing, and machine translation. The substitution of feature hand-crafting with automatic feature learning provides an excellent opportunity for gaining an in-depth understanding of large-scale data statistics. Deep neural networks generally train models with huge numbers of parameters, facilitating efficient search of highly non-convex objective functions for optimal and sub-optimal solutions. On the other hand, Fisher discriminant analysis has been widely employed to impose class discrepancy for segmentation, classification, and recognition tasks. This thesis bridges contemporary deep learning and classic discriminant analysis to address some important challenges in visual scene understanding, i.e. semantic segmentation, texture classification, and object recognition. The aim is to accomplish specific tasks in new high-dimensional spaces covered by the statistical information of the datasets under study. Inspired by a new formulation of Fisher discriminant analysis, this thesis introduces novel arrangements of well-known deep learning architectures to achieve better performance on the targeted tasks. The theoretical justifications rest on a large body of experimental work and consolidate the contribution of the proposed idea, Deep Fisher Discriminant Learning, to several challenges in visual scene understanding.
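    The Fisher criterion underlying this line of work rewards representations whose between-class separation is large relative to their within-class scatter. A minimal two-class, per-feature sketch of that ratio (not the thesis's deep formulation):

    ```python
    import numpy as np

    def fisher_ratio(X, y):
        """Per-feature Fisher criterion for two classes (labels 0/1):
        squared difference of class means divided by the sum of
        within-class variances. Larger values mean a feature is more
        discriminative in the Fisher sense."""
        X0, X1 = X[y == 0], X[y == 1]
        m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
        between = (m0 - m1) ** 2
        within = X0.var(axis=0) + X1.var(axis=0)
        return between / (within + 1e-12)  # guard against zero variance
    ```

    Deep Fisher discriminant learning pushes this same ratio through a learned feature map, so the network itself produces spaces where between-class scatter dominates within-class scatter.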