
    Fast and Scalable Approaches to Accelerate the Fuzzy k Nearest Neighbors Classifier for Big Data

    One of the best-known and most effective methods in supervised classification is the k nearest neighbors algorithm (kNN). Several approaches have been proposed to improve its accuracy, among which fuzzy approaches have proved to be some of the most successful, notably the classical Fuzzy k Nearest Neighbors (FkNN) algorithm. However, these traditional algorithms fail to tackle the large amounts of data available today. Multiple alternatives exist to enable kNN classification on big datasets, most notably the approximate kNN variant known as the Hybrid Spill Tree. Nevertheless, the existing FkNN proposals for big data problems are not fully scalable, because a high computational load is required to reproduce the behavior of the original FkNN algorithm. This work proposes Global Approximate Hybrid Spill Tree FkNN and Local Hybrid Spill Tree FkNN, two approximate approaches that speed up runtime without losing quality in the classification process. The experimentation compares various FkNN approaches for big data on datasets of up to 11 million instances. The results show an improvement in runtime and accuracy over algorithms from the literature.
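
    As a point of reference for the decision rule that both approximate variants approximate, the following is a minimal single-machine sketch of the classical FkNN rule with crisp training memberships and fuzzifier m = 2; it is an illustrative, assumption-laden sketch, not the Global or Local Hybrid Spill Tree methods proposed in the paper.

```python
import numpy as np

def fknn_predict(X_train, y_train, x_query, k=5, m=2):
    """Return per-class fuzzy memberships of a single query point (classical FkNN rule)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nn = np.argsort(dists)[:k]                                   # k nearest neighbours
    weights = 1.0 / (dists[nn] ** (2.0 / (m - 1)) + 1e-12)       # inverse-distance weights
    memberships = {}
    for c in np.unique(y_train):
        u = (y_train[nn] == c).astype(float)                     # crisp neighbour memberships
        memberships[c] = float(np.sum(weights * u) / np.sum(weights))
    return memberships

# Toy usage: the predicted class is the one with the highest membership degree.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y = np.array([0, 0, 1, 1])
print(max(fknn_predict(X, y, np.array([0.2, 0.1]), k=3).items(), key=lambda kv: kv[1]))
```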

    Fast Machine Learning Algorithms for Massive Datasets with Applications in the Biomedical Domain

    The continuous increase in the size of datasets introduces computational challenges for machine learning algorithms. In this dissertation, we cover machine learning algorithms and applications for large-scale data analysis in manufacturing and healthcare. We begin by introducing a multilevel framework to scale the support vector machine (SVM), a popular supervised learning algorithm with a few tunable hyperparameters and highly accurate predictions. The computational complexity of the nonlinear SVM is prohibitive on large-scale datasets compared to the linear SVM, which is more scalable for massive datasets. The nonlinear SVM has been shown to produce significantly higher classification quality on complex and highly imbalanced datasets. However, this higher classification quality requires a computationally expensive quadratic programming solver and extra kernel parameters for model selection. We introduce a generalized fast multilevel framework for regular, weighted, and instance-weighted SVM that achieves similar or better classification quality compared to state-of-the-art SVM libraries such as LIBSVM. Our framework improves the runtime by more than two orders of magnitude on some of the well-known benchmark datasets. We cover multiple versions of our proposed framework and its implementation in detail. The framework is implemented using the PETSc library, which allows easy integration with scientific computing tasks. Next, we propose an adaptive multilevel learning framework for SVM to reduce the variance between prediction qualities across the levels, improve the overall prediction accuracy, and boost the runtime. We implement multi-threaded support to speed up the parameter-fitting runtime, resulting in more than an order of magnitude speed-up. We design an early stopping criterion to reduce the extra computational cost once the expected prediction quality is achieved. This approach provides significant speed-up, especially for massive datasets. Finally, we propose an efficient low-dimensional feature extraction over massive knowledge networks. Knowledge networks are becoming more popular in the biomedical domain for knowledge representation. Each layer in a knowledge network can store information from one or multiple sources of data, and the relationships between concepts or between layers represent valuable information. The proposed feature engineering approach provides efficient and highly accurate prediction of the relationships between biomedical concepts on massive datasets. Our approach uses semantics and probabilities to reduce the potential search space for the exploration and learning of machine learning algorithms. The calculation of probabilities scales well with the size of the knowledge network, and the number of features is fixed and equal to the number of relationships or classes in the data. A comprehensive comparison of well-known classifiers such as random forest, SVM, and deep learning over various features extracted from the same dataset provides an overview of performance and computational trade-offs. Our source code, documentation, and parameters will be available at https://github.com/esadr/
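
    To make the coarsen/train/refine idea behind multilevel SVM training concrete, here is a rough two-level caricature in Python (scikit-learn); the cluster-based coarsening, the neighbourhood-based refinement, and all parameter values are illustrative assumptions and do not reproduce the PETSc-based framework described above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.svm import SVC

def multilevel_svm(X, y, n_coarse=50, refine_per_sv=20):
    # 1) Coarsen: summarise the training set by cluster centroids with majority labels.
    km = KMeans(n_clusters=min(n_coarse, len(X)), n_init=10, random_state=0).fit(X)
    centroids = km.cluster_centers_
    coarse_labels = np.array([
        np.bincount(y[km.labels_ == c]).argmax() if np.any(km.labels_ == c) else 0
        for c in range(centroids.shape[0])
    ])  # assumes small non-negative integer class labels
    # 2) Train a nonlinear SVM on the (much smaller) coarse problem.
    coarse_model = SVC(kernel="rbf", gamma="scale").fit(centroids, coarse_labels)
    # 3) Refine: retrain on the original points closest to the coarse support vectors.
    keep = set()
    for sv in coarse_model.support_vectors_:
        keep.update(np.argsort(np.linalg.norm(X - sv, axis=1))[:refine_per_sv].tolist())
    idx = np.array(sorted(keep))
    return SVC(kernel="rbf", gamma="scale").fit(X[idx], y[idx])

# Toy usage on synthetic data; the real framework targets millions of points.
X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
model = multilevel_svm(X, y)
```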

    Shared Nearest-Neighbor Quantum Game-Based Attribute Reduction with Hierarchical Coevolutionary Spark and Its Application in Consistent Segmentation of Neonatal Cerebral Cortical Surfaces

    © 2012 IEEE. The unprecedented increase in data volume has become a severe challenge for conventional patterns of data mining and learning systems tasked with handling big data. The recently introduced Spark platform is a new processing framework for big data analysis and related learning systems, which has attracted increasing attention from both the scientific community and industry. In this paper, we propose a shared nearest-neighbor quantum game-based attribute reduction (SNNQGAR) algorithm that incorporates the hierarchical coevolutionary Spark model. We first present a shared coevolutionary nearest-neighbor hierarchy with self-evolving compensation that considers the features of nearest-neighborhood attribute subsets and calculates the similarity between attribute subsets according to the shared-neighbor information of attribute sample points. We then present a novel attribute weight tensor model to generate ranking vectors of attributes and apply them to balance the relative contributions of different neighborhood attribute subsets. To optimize the model, we propose an embedded quantum equilibrium game paradigm (QEGP) to ensure that noisy attributes do not degrade the big data reduction results. A combination of the hierarchical coevolutionary Spark model and an improved MapReduce framework is then constructed so that the SNNQGAR can be parallelized to efficiently determine the preferred reduction solutions of the distributed attribute subsets. The experimental comparisons demonstrate the superior performance of the SNNQGAR, which outperforms most of the state-of-the-art attribute reduction algorithms. Moreover, the results indicate that the SNNQGAR can be successfully applied to segment overlapping and interdependent fuzzy cerebral tissues, and it exhibits stable and consistent segmentation performance for neonatal cerebral cortical surfaces.
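
    One concrete ingredient above, the shared-nearest-neighbour similarity between attribute subsets, can be illustrated with a heavily simplified single-machine sketch: each subset induces k-nearest-neighbour lists over the samples, and the subsets are scored by how many neighbours those lists share. This is only one plausible reading of the idea; the quantum game optimization, the weight tensor model, and the Spark parallelization are not reproduced.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, subset_a, subset_b, k=10):
    """Score two attribute subsets by the overlap of the kNN lists they induce."""
    def knn_sets(cols):
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X[:, cols])
        _, idx = nn.kneighbors(X[:, cols])
        return [set(row[1:]) for row in idx]        # drop each point's self-neighbour
    sets_a, sets_b = knn_sets(subset_a), knn_sets(subset_b)
    return float(np.mean([len(a & b) / k for a, b in zip(sets_a, sets_b)]))

# Toy usage: compare two hypothetical attribute subsets of a random data matrix.
X = np.random.default_rng(0).normal(size=(300, 12))
print(snn_similarity(X, subset_a=[0, 1, 2], subset_b=[0, 3, 4], k=8))
```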

    Fuzzy Integration to Standard Calculation of K-Nearest Neighbour Attributes

    Information and data are growing very rapidly in the era of Industry 4.0. Researchers, institutions, and industry are competing to find and apply data-processing methods that are more effective and efficient. In data mining classification, several methods are widely used by researchers; one of the best known is K-Nearest Neighbor (KNN). The calculation process in the KNN algorithm compares the testing data against all existing training data, and this comparison is generally expressed as a closeness or similarity value between attribute records. The KNN method has proven to handle large datasets and datasets with many attributes well. A drawback of the KNN similarity calculation is that attributes with a large value range produce large similarity contributions, while attributes with a small range produce small ones. This is clearly unbalanced, given how widely attribute types vary in current data. One solution to this problem is to standardize all the data attributes. Fuzzy set theory, introduced by Zadeh, allows a membership value to lie anywhere between 0 and 1. In this study, the fuzzy model is integrated into the KNN similarity calculation to standardize all data attributes. The results show that using the KNN algorithm for the classification of credit approval achieves an accuracy rate of 91.83%.
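
    The normalization idea can be sketched as follows: each numeric attribute is mapped to a membership degree in [0, 1] with a simple linear min-max membership function before the kNN similarity is computed, so attributes with large ranges no longer dominate. The membership function, k, and dataset below are assumptions for illustration, not the study's credit-approval data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def fuzzify_min_max(X, lo=None, hi=None):
    """Map every attribute to a [0, 1] membership degree with a linear function."""
    lo = X.min(axis=0) if lo is None else lo
    hi = X.max(axis=0) if hi is None else hi
    span = np.where(hi - lo == 0, 1.0, hi - lo)          # guard against constant attributes
    return np.clip((X - lo) / span, 0.0, 1.0), lo, hi

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X[:, 0] *= 1000.0                                        # exaggerate one attribute's range
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

X_tr_f, lo, hi = fuzzify_min_max(X_tr)                   # fit membership bounds on training data
X_te_f, _, _ = fuzzify_min_max(X_te, lo, hi)             # reuse the same bounds for test data
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr_f, y_tr)
print("accuracy:", knn.score(X_te_f, y_te))
```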

    Transforming big data into smart data: An insight on the use of the k-nearest neighbors algorithm to obtain quality data

    The k-nearest neighbours algorithm is characterised as a simple yet effective data mining technique. Its main drawback appears when massive amounts of data, likely to contain noise and imperfections, are involved, turning this algorithm into an imprecise and especially inefficient technique. These disadvantages have been the subject of research for many years, and among other approaches, data preprocessing techniques such as instance reduction or missing-values imputation have targeted these weaknesses. As a result, these issues have turned into strengths, and the k-nearest neighbours rule has become a core algorithm for identifying and correcting imperfect data, removing noisy and redundant samples, or imputing missing values, transforming Big Data into Smart Data, that is, data of sufficient quality to expect a good outcome from any data mining algorithm. The role of this smart data gleaning algorithm in a supervised learning context will be investigated. This will include a brief overview of Smart Data, current and future trends for the k-nearest neighbour algorithm in the Big Data context, and the existing data preprocessing techniques based on this algorithm. We present the emerging big-data-ready versions of these algorithms and develop some new methods to cope with Big Data. We carry out a thorough experimental analysis on a series of big datasets that provides guidelines on how to use the k-nearest neighbour algorithm to obtain Smart/Quality Data for a high-quality data mining process. Moreover, multiple Spark Packages have been developed, including all the Smart Data algorithms analysed.
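
    Two of the kNN-based preprocessing steps mentioned above, missing-value imputation and noise filtering by instance reduction, can be sketched on a single machine as follows; the edited-nearest-neighbour filter and the toy data are illustrative assumptions, and the distributed Spark packages are not reproduced.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.neighbors import NearestNeighbors

def knn_clean(X, y, k=5):
    """Impute missing values with kNN, then drop instances whose neighbours disagree with their label."""
    X_filled = KNNImputer(n_neighbors=k).fit_transform(X)        # missing-value imputation
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_filled)
    _, idx = nn.kneighbors(X_filled)
    neigh_labels = y[idx[:, 1:]]                                 # k neighbours, excluding the point itself
    majority = np.array([np.bincount(row).argmax() for row in neigh_labels])
    keep = majority == y                                         # assumes integer class labels
    return X_filled[keep], y[keep]

# Toy usage: inject 5% missing values into synthetic data and clean it.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan
X_clean, y_clean = knn_clean(X, y)
```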

    MRPR: a MapReduce solution for prototype reduction in big data classification

    In the era of big data, analyzing and extracting knowledge from large-scale data sets is a very interesting and challenging task. The application of standard data mining tools to such data sets is not straightforward. Hence, a new class of scalable mining methods that embrace the huge storage and processing capacity of cloud platforms is required. In this work, we propose a novel distributed partitioning methodology for prototype reduction techniques in nearest neighbor classification. These methods aim to represent original training data sets with a reduced number of instances. Their main purposes are to speed up the classification process and to reduce the storage requirements and sensitivity to noise of the nearest neighbor rule. However, standard prototype reduction methods cannot cope with very large data sets. To overcome this limitation, we develop a MapReduce-based framework to distribute the functioning of these algorithms over a cluster of computing elements, proposing several algorithmic strategies to integrate multiple partial solutions (reduced sets of prototypes) into a single one. The proposed model enables prototype reduction algorithms to be applied to big data classification problems without significant accuracy loss. We test the speed-up capabilities of our model with data sets of up to 5.7 million instances. The results show that this model is a suitable tool for enhancing the performance of the nearest neighbor classifier with big data.
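
    The map/reduce split described above can be caricatured in a few lines of single-process Python: each "map" task reduces its own partition of the training set (class centroids stand in here for a real prototype reduction technique), and the "reduce" step joins the partial prototype sets. The actual MRPR framework runs on Hadoop MapReduce with several integration strategies; none of that is reproduced here.

```python
import numpy as np

def reduce_partition(X_part, y_part):
    """Map phase: shrink one partition to one prototype (class centroid) per class."""
    protos = [X_part[y_part == c].mean(axis=0) for c in np.unique(y_part)]
    return np.array(protos), np.unique(y_part)

def mrpr_join(X, y, n_partitions=8, seed=0):
    """Split the data, reduce each partition, and join the partial prototype sets."""
    parts = np.array_split(np.random.default_rng(seed).permutation(len(X)), n_partitions)
    partial = [reduce_partition(X[idx], y[idx]) for idx in parts]                       # map phase
    return np.vstack([p[0] for p in partial]), np.concatenate([p[1] for p in partial])  # reduce phase

# Toy usage: 10,000 instances collapse to (n_partitions x classes) prototypes.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_red, y_red = mrpr_join(X, y)
```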

    Detecting Heart Attacks Using Learning Classifiers

    Cardiovascular diseases (CVDs) have emerged as a critical global threat to human life. Diagnosing these diseases is a complex challenge, particularly for inexperienced doctors, as their symptoms can be mistaken for signs of aging or similar conditions. Early detection of heart disease can help prevent heart failure, making it crucial to develop effective diagnostic techniques. Machine learning (ML) techniques have gained popularity among researchers for identifying new patients based on past data. While various forecasting techniques have been applied to different medical datasets, accurate and timely detection of heart attacks remains elusive. This article presents a comprehensive comparative analysis of various ML techniques, including Decision Tree, Support Vector Machines, Random Forest, Extreme Gradient Boosting (XGBoost), Adaptive Boosting, Multilayer Perceptron, Gradient Boosting, K-Nearest Neighbor, and Logistic Regression. These classifiers are implemented and evaluated in Python using data from over 300 patients obtained from the Kaggle cardiovascular repository in CSV format. The classifiers categorize patients into two groups: those with a heart attack and those without. Performance evaluation metrics such as recall, precision, accuracy, and the F1-measure are employed to assess the classifiers’ effectiveness. The results highlight the XGBoost classifier as a promising tool in the medical domain for accurate diagnosis, demonstrating the highest predictive accuracy (95.082%) with a computation time of 0.07995 s on the dataset, compared to the other classifiers.
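
    A minimal sketch of this kind of comparison is shown below: several scikit-learn classifiers trained on the same binary-classification data and scored with accuracy, precision, recall, and F1. A scikit-learn toy dataset stands in for the Kaggle cardiovascular CSV, default hyperparameters are used, and XGBoost is omitted to avoid an extra dependency, so the numbers will not match those reported in the article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)               # stand-in binary dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=5000),
    "kNN": KNeighborsClassifier(),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name:18s} acc={accuracy_score(y_te, pred):.3f} "
          f"prec={precision_score(y_te, pred):.3f} "
          f"rec={recall_score(y_te, pred):.3f} "
          f"f1={f1_score(y_te, pred):.3f}")
```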

    KFREAIN: Design of A Kernel-Level Forensic Layer for Improving Real-Time Evidence Analysis Performance in IoT Networks

    An exponential increase in the number of attacks on IoT networks makes it essential to formulate attack-level mitigation strategies. This paper proposes the design of a scalable kernel-level forensic layer that improves real-time evidence analysis performance and supports efficient pattern analysis of the collected data samples. It has an inbuilt Temporal Blockchain Cache (TBC), which is refreshed after every set of evidence is analyzed. The model uses a multidomain feature extraction engine that combines lightweight Fourier, Wavelet, Convolutional, Gabor, and Cosine feature sets, which are selected by a stochastic Bacterial Foraging Optimizer (BFO) to identify high-variance features. The selected features are processed by an ensemble learning (EL) classifier that uses low-complexity classifiers, reducing the energy consumption during analysis by 8.3% compared with application-level forensic models. The model also showed 3.5% higher accuracy, 4.9% higher precision, and 4.3% higher recall in attack-event identification compared with standard forensic techniques. Due to its kernel-level integration, the model also reduces the delay needed for forensic analysis on different network types by 9.5%, making it useful for real-time and heterogeneous network scenarios.
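
    The feature pipeline above can be caricatured as follows: Fourier and cosine-transform features are extracted from each sample window, a random pick among high-variance features stands in for the Bacterial Foraging Optimizer, and a soft-voting ensemble of light classifiers stands in for the ensemble learner. The kernel-level integration, the blockchain cache, and the wavelet, Gabor, and convolutional features are not reproduced, and all data below is synthetic.

```python
import numpy as np
from scipy.fft import dct
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def extract_features(windows):
    """Concatenate Fourier-magnitude and cosine-transform features per window."""
    fourier = np.abs(np.fft.rfft(windows, axis=1))
    cosine = dct(windows, axis=1, norm="ortho")
    return np.hstack([fourier, cosine])

def pick_high_variance(F, n_keep=64, seed=0):
    """Stochastic stand-in for BFO: sample features from the high-variance candidates."""
    pool = np.argsort(F.var(axis=0))[::-1][: 2 * n_keep]
    return np.random.default_rng(seed).choice(pool, size=min(n_keep, len(pool)), replace=False)

# Synthetic traffic windows with binary attack labels, for illustration only.
rng = np.random.default_rng(0)
windows = rng.normal(size=(200, 128))
labels = rng.integers(0, 2, size=200)

F = extract_features(windows)
cols = pick_high_variance(F)
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(max_depth=5)),
                ("nb", GaussianNB())],
    voting="soft",
).fit(F[:, cols], labels)
```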

    Big data analytics for preventive medicine

    © 2019, Springer-Verlag London Ltd., part of Springer Nature. Medical data is among the most rewarding, yet most complicated, data to analyze. How can healthcare providers use modern data analytics tools and technologies to analyze and create value from complex data? Data analytics promises to efficiently discover valuable patterns by analyzing large amounts of unstructured, heterogeneous, non-standard, and incomplete healthcare data. It not only forecasts but also supports decision making, and it is increasingly seen as a breakthrough whose goal is to improve the quality of patient care and reduce healthcare costs. The aim of this study is to provide a comprehensive and structured overview of the extensive research on the advancement of data analytics methods for disease prevention. This review first introduces disease prevention and its challenges, followed by traditional prevention methodologies. We summarize state-of-the-art data analytics algorithms used for the classification of disease, clustering (for example, detecting an unusually high incidence of a particular disease), anomaly detection (detection of disease), and association, as well as their respective advantages, drawbacks, and guidelines for selecting a specific model, followed by a discussion of recent developments and successful applications of disease prevention methods. The article concludes with open research challenges and recommendations.