
    Fast and Scalable Approaches to Accelerate the Fuzzy k Nearest Neighbors Classifier for Big Data

    One of the best-known and most effective methods in supervised classification is the k nearest neighbors algorithm (kNN). Several approaches have been proposed to improve its accuracy, among which fuzzy approaches have proved to be some of the most successful, notably the classical Fuzzy k Nearest Neighbors (FkNN) algorithm. However, these traditional algorithms fail to tackle the large amounts of data available today. Multiple alternatives exist to enable kNN classification on big datasets, most notably the approximate kNN variant known as the Hybrid Spill Tree. Nevertheless, the existing FkNN proposals for big data problems are not fully scalable, because a high computational load is required to reproduce the behavior of the original FkNN algorithm. This work proposes Global Approximate Hybrid Spill Tree FkNN and Local Hybrid Spill Tree FkNN, two approximate approaches that speed up runtime without losing quality in the classification process. The experimentation compares various FkNN approaches for big data on datasets of up to 11 million instances. The results show an improvement in runtime and accuracy over algorithms from the literature.
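
    As a point of reference for the decision rule that both approximate variants approximate, the following is a minimal single-machine sketch of the classical FkNN rule with crisp training memberships and fuzzifier m = 2; it is an illustrative, assumption-laden sketch, not the Global or Local Hybrid Spill Tree methods proposed in the paper.

```python
import numpy as np

def fknn_predict(X_train, y_train, x_query, k=5, m=2):
    """Return per-class fuzzy memberships of a single query point (classical FkNN rule)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nn = np.argsort(dists)[:k]                                   # k nearest neighbours
    weights = 1.0 / (dists[nn] ** (2.0 / (m - 1)) + 1e-12)       # inverse-distance weights
    memberships = {}
    for c in np.unique(y_train):
        u = (y_train[nn] == c).astype(float)                     # crisp neighbour memberships
        memberships[c] = float(np.sum(weights * u) / np.sum(weights))
    return memberships

# Toy usage: the predicted class is the one with the highest membership degree.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y = np.array([0, 0, 1, 1])
print(max(fknn_predict(X, y, np.array([0.2, 0.1]), k=3).items(), key=lambda kv: kv[1]))
```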

    Fast Machine Learning Algorithms for Massive Datasets with Applications in the Biomedical Domain

    The continuous increase in the size of datasets introduces computational challenges for machine learning algorithms. In this dissertation, we cover machine learning algorithms and applications for large-scale data analysis in manufacturing and healthcare. We begin by introducing a multilevel framework to scale the support vector machine (SVM), a popular supervised learning algorithm with a few tunable hyperparameters and highly accurate predictions. The computational complexity of the nonlinear SVM is prohibitive on large-scale datasets compared to the linear SVM, which is more scalable for massive datasets. The nonlinear SVM has been shown to produce significantly higher classification quality on complex and highly imbalanced datasets. However, this higher classification quality requires a computationally expensive quadratic programming solver and extra kernel parameters for model selection. We introduce a generalized fast multilevel framework for regular, weighted, and instance-weighted SVM that achieves similar or better classification quality compared to state-of-the-art SVM libraries such as LIBSVM. Our framework improves the runtime by more than two orders of magnitude on some of the well-known benchmark datasets. We cover multiple versions of our proposed framework and its implementation in detail. The framework is implemented using the PETSc library, which allows easy integration with scientific computing tasks. Next, we propose an adaptive multilevel learning framework for SVM to reduce the variance between prediction qualities across the levels, improve the overall prediction accuracy, and boost the runtime. We implement multi-threaded support to speed up the parameter-fitting runtime, resulting in more than an order of magnitude speed-up. We design an early stopping criterion to reduce the extra computational cost once the expected prediction quality is achieved. This approach provides significant speed-up, especially for massive datasets. Finally, we propose an efficient low-dimensional feature extraction over massive knowledge networks. Knowledge networks are becoming more popular in the biomedical domain for knowledge representation. Each layer in a knowledge network can store information from one or multiple sources of data, and the relationships between concepts or between layers represent valuable information. The proposed feature engineering approach provides efficient and highly accurate prediction of the relationships between biomedical concepts on massive datasets. Our approach uses semantics and probabilities to reduce the potential search space for the exploration and learning of machine learning algorithms. The calculation of probabilities scales well with the size of the knowledge network, and the number of features is fixed and equal to the number of relationships or classes in the data. A comprehensive comparison of well-known classifiers such as random forest, SVM, and deep learning over various features extracted from the same dataset provides an overview of performance and computational trade-offs. Our source code, documentation, and parameters will be available at https://github.com/esadr/
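
    To make the coarsen/train/refine idea behind multilevel SVM training concrete, here is a rough two-level caricature in Python (scikit-learn); the cluster-based coarsening, the neighbourhood-based refinement, and all parameter values are illustrative assumptions and do not reproduce the PETSc-based framework described above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.svm import SVC

def multilevel_svm(X, y, n_coarse=50, refine_per_sv=20):
    # 1) Coarsen: summarise the training set by cluster centroids with majority labels.
    km = KMeans(n_clusters=min(n_coarse, len(X)), n_init=10, random_state=0).fit(X)
    centroids = km.cluster_centers_
    coarse_labels = np.array([
        np.bincount(y[km.labels_ == c]).argmax() if np.any(km.labels_ == c) else 0
        for c in range(centroids.shape[0])
    ])  # assumes small non-negative integer class labels
    # 2) Train a nonlinear SVM on the (much smaller) coarse problem.
    coarse_model = SVC(kernel="rbf", gamma="scale").fit(centroids, coarse_labels)
    # 3) Refine: retrain on the original points closest to the coarse support vectors.
    keep = set()
    for sv in coarse_model.support_vectors_:
        keep.update(np.argsort(np.linalg.norm(X - sv, axis=1))[:refine_per_sv].tolist())
    idx = np.array(sorted(keep))
    return SVC(kernel="rbf", gamma="scale").fit(X[idx], y[idx])

# Toy usage on synthetic data; the real framework targets millions of points.
X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
model = multilevel_svm(X, y)
```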

    Shared Nearest-Neighbor Quantum Game-Based Attribute Reduction with Hierarchical Coevolutionary Spark and Its Application in Consistent Segmentation of Neonatal Cerebral Cortical Surfaces

    © 2012 IEEE. The unprecedented increase in data volume has become a severe challenge for conventional patterns of data mining and learning systems tasked with handling big data. The recently introduced Spark platform is a new processing framework for big data analysis and related learning systems, which has attracted increasing attention from both the scientific community and industry. In this paper, we propose a shared nearest-neighbor quantum game-based attribute reduction (SNNQGAR) algorithm that incorporates the hierarchical coevolutionary Spark model. We first present a shared coevolutionary nearest-neighbor hierarchy with self-evolving compensation that considers the features of nearest-neighborhood attribute subsets and calculates the similarity between attribute subsets according to the shared-neighbor information of attribute sample points. We then present a novel attribute weight tensor model to generate ranking vectors of attributes and apply them to balance the relative contributions of different neighborhood attribute subsets. To optimize the model, we propose an embedded quantum equilibrium game paradigm (QEGP) to ensure that noisy attributes do not degrade the big data reduction results. A combination of the hierarchical coevolutionary Spark model and an improved MapReduce framework is then constructed so that the SNNQGAR can be parallelized to efficiently determine the preferred reduction solutions of the distributed attribute subsets. The experimental comparisons demonstrate the superior performance of the SNNQGAR, which outperforms most of the state-of-the-art attribute reduction algorithms. Moreover, the results indicate that the SNNQGAR can be successfully applied to segment overlapping and interdependent fuzzy cerebral tissues, and it exhibits stable and consistent segmentation performance for neonatal cerebral cortical surfaces.
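
    One concrete ingredient above, the shared-nearest-neighbour similarity between attribute subsets, can be illustrated with a heavily simplified single-machine sketch: each subset induces k-nearest-neighbour lists over the samples, and the subsets are scored by how many neighbours those lists share. This is only one plausible reading of the idea; the quantum game optimization, the weight tensor model, and the Spark parallelization are not reproduced.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_similarity(X, subset_a, subset_b, k=10):
    """Score two attribute subsets by the overlap of the kNN lists they induce."""
    def knn_sets(cols):
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X[:, cols])
        _, idx = nn.kneighbors(X[:, cols])
        return [set(row[1:]) for row in idx]        # drop each point's self-neighbour
    sets_a, sets_b = knn_sets(subset_a), knn_sets(subset_b)
    return float(np.mean([len(a & b) / k for a, b in zip(sets_a, sets_b)]))

# Toy usage: compare two hypothetical attribute subsets of a random data matrix.
X = np.random.default_rng(0).normal(size=(300, 12))
print(snn_similarity(X, subset_a=[0, 1, 2], subset_b=[0, 3, 4], k=8))
```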

    Fuzzy Integration to Standard Calculation of K-Nearest Neighbour Attributes

    Information and data are growing very rapidly in the era of Industry 4.0. Researchers, institutions, and industry are competing to find and apply data-processing methods that are more effective and efficient. In data mining classification, several methods are widely used by researchers; one of the best known is K-Nearest Neighbor (KNN). The calculation process in the KNN algorithm compares the testing data against all existing training data, and this comparison is generally expressed as a closeness or similarity value between attribute records. The KNN method has proven to handle large datasets and datasets with many attributes well. A drawback of the KNN similarity calculation is that attributes with a large value range produce large similarity contributions, while attributes with a small range produce small ones. This is clearly unbalanced, given how widely attribute types vary in current data. One solution to this problem is to standardize all the data attributes. Fuzzy set theory, introduced by Zadeh, allows a membership value to lie anywhere between 0 and 1. In this study, the fuzzy model is integrated into the KNN similarity calculation to standardize all data attributes. The results show that using the KNN algorithm for the classification of credit approval achieves an accuracy rate of 91.83%.
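
    The normalization idea can be sketched as follows: each numeric attribute is mapped to a membership degree in [0, 1] with a simple linear min-max membership function before the kNN similarity is computed, so attributes with large ranges no longer dominate. The membership function, k, and dataset below are assumptions for illustration, not the study's credit-approval data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def fuzzify_min_max(X, lo=None, hi=None):
    """Map every attribute to a [0, 1] membership degree with a linear function."""
    lo = X.min(axis=0) if lo is None else lo
    hi = X.max(axis=0) if hi is None else hi
    span = np.where(hi - lo == 0, 1.0, hi - lo)          # guard against constant attributes
    return np.clip((X - lo) / span, 0.0, 1.0), lo, hi

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X[:, 0] *= 1000.0                                        # exaggerate one attribute's range
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

X_tr_f, lo, hi = fuzzify_min_max(X_tr)                   # fit membership bounds on training data
X_te_f, _, _ = fuzzify_min_max(X_te, lo, hi)             # reuse the same bounds for test data
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr_f, y_tr)
print("accuracy:", knn.score(X_te_f, y_te))
```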

    Transforming big data into smart data: An insight on the use of the k-nearest neighbors algorithm to obtain quality data

    The k-nearest neighbours algorithm is characterised as a simple yet effective data mining technique. Its main drawback appears when massive amounts of data, likely to contain noise and imperfections, are involved, turning this algorithm into an imprecise and especially inefficient technique. These disadvantages have been the subject of research for many years, and among other approaches, data preprocessing techniques such as instance reduction or missing-values imputation have targeted these weaknesses. As a result, these issues have turned into strengths, and the k-nearest neighbours rule has become a core algorithm for identifying and correcting imperfect data, removing noisy and redundant samples, or imputing missing values, transforming Big Data into Smart Data, that is, data of sufficient quality to expect a good outcome from any data mining algorithm. The role of this smart data gleaning algorithm in a supervised learning context will be investigated. This will include a brief overview of Smart Data, current and future trends for the k-nearest neighbour algorithm in the Big Data context, and the existing data preprocessing techniques based on this algorithm. We present the emerging big-data-ready versions of these algorithms and develop some new methods to cope with Big Data. We carry out a thorough experimental analysis on a series of big datasets that provides guidelines on how to use the k-nearest neighbour algorithm to obtain Smart/Quality Data for a high-quality data mining process. Moreover, multiple Spark Packages have been developed, including all the Smart Data algorithms analysed.
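
    Two of the kNN-based preprocessing steps mentioned above, missing-value imputation and noise filtering by instance reduction, can be sketched on a single machine as follows; the edited-nearest-neighbour filter and the toy data are illustrative assumptions, and the distributed Spark packages are not reproduced.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.neighbors import NearestNeighbors

def knn_clean(X, y, k=5):
    """Impute missing values with kNN, then drop instances whose neighbours disagree with their label."""
    X_filled = KNNImputer(n_neighbors=k).fit_transform(X)        # missing-value imputation
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_filled)
    _, idx = nn.kneighbors(X_filled)
    neigh_labels = y[idx[:, 1:]]                                 # k neighbours, excluding the point itself
    majority = np.array([np.bincount(row).argmax() for row in neigh_labels])
    keep = majority == y                                         # assumes integer class labels
    return X_filled[keep], y[keep]

# Toy usage: inject 5% missing values into synthetic data and clean it.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan
X_clean, y_clean = knn_clean(X, y)
```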

    MRPR: a MapReduce solution for prototype reduction in big data classification

    In the era of big data, analyzing and extracting knowledge from large-scale data sets is a very interesting and challenging task. The application of standard data mining tools to such data sets is not straightforward. Hence, a new class of scalable mining methods that embrace the huge storage and processing capacity of cloud platforms is required. In this work, we propose a novel distributed partitioning methodology for prototype reduction techniques in nearest neighbor classification. These methods aim to represent original training data sets with a reduced number of instances. Their main purposes are to speed up the classification process and to reduce the storage requirements and sensitivity to noise of the nearest neighbor rule. However, standard prototype reduction methods cannot cope with very large data sets. To overcome this limitation, we develop a MapReduce-based framework to distribute the functioning of these algorithms over a cluster of computing elements, proposing several algorithmic strategies to integrate multiple partial solutions (reduced sets of prototypes) into a single one. The proposed model enables prototype reduction algorithms to be applied to big data classification problems without significant accuracy loss. We test the speed-up capabilities of our model with data sets of up to 5.7 million instances. The results show that this model is a suitable tool for enhancing the performance of the nearest neighbor classifier with big data.
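
    The map/reduce split described above can be caricatured in a few lines of single-process Python: each "map" task reduces its own partition of the training set (class centroids stand in here for a real prototype reduction technique), and the "reduce" step joins the partial prototype sets. The actual MRPR framework runs on Hadoop MapReduce with several integration strategies; none of that is reproduced here.

```python
import numpy as np

def reduce_partition(X_part, y_part):
    """Map phase: shrink one partition to one prototype (class centroid) per class."""
    protos = [X_part[y_part == c].mean(axis=0) for c in np.unique(y_part)]
    return np.array(protos), np.unique(y_part)

def mrpr_join(X, y, n_partitions=8, seed=0):
    """Split the data, reduce each partition, and join the partial prototype sets."""
    parts = np.array_split(np.random.default_rng(seed).permutation(len(X)), n_partitions)
    partial = [reduce_partition(X[idx], y[idx]) for idx in parts]                       # map phase
    return np.vstack([p[0] for p in partial]), np.concatenate([p[1] for p in partial])  # reduce phase

# Toy usage: 10,000 instances collapse to (n_partitions x classes) prototypes.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_red, y_red = mrpr_join(X, y)
```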

    Detecting Heart Attacks Using Learning Classifiers

    Cardiovascular diseases (CVDs) have emerged as a critical global threat to human life. Diagnosing these diseases is a complex challenge, particularly for inexperienced doctors, as their symptoms can be mistaken for signs of aging or similar conditions. Early detection of heart disease can help prevent heart failure, making it crucial to develop effective diagnostic techniques. Machine learning (ML) techniques have gained popularity among researchers for identifying new patients based on past data. While various forecasting techniques have been applied to different medical datasets, accurate and timely detection of heart attacks remains elusive. This article presents a comprehensive comparative analysis of various ML techniques, including Decision Tree, Support Vector Machines, Random Forest, Extreme Gradient Boosting (XGBoost), Adaptive Boosting, Multilayer Perceptron, Gradient Boosting, K-Nearest Neighbor, and Logistic Regression. These classifiers are implemented and evaluated in Python using data from over 300 patients obtained from the Kaggle cardiovascular repository in CSV format. The classifiers categorize patients into two groups: those with a heart attack and those without. Performance evaluation metrics such as recall, precision, accuracy, and the F1-measure are employed to assess the classifiers’ effectiveness. The results highlight the XGBoost classifier as a promising tool in the medical domain for accurate diagnosis, demonstrating the highest predictive accuracy (95.082%) with a computation time of 0.07995 s on the dataset, compared to the other classifiers.
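
    A minimal sketch of this kind of comparison is shown below: several scikit-learn classifiers trained on the same binary-classification data and scored with accuracy, precision, recall, and F1. A scikit-learn toy dataset stands in for the Kaggle cardiovascular CSV, default hyperparameters are used, and XGBoost is omitted to avoid an extra dependency, so the numbers will not match those reported in the article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)               # stand-in binary dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=5000),
    "kNN": KNeighborsClassifier(),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name:18s} acc={accuracy_score(y_te, pred):.3f} "
          f"prec={precision_score(y_te, pred):.3f} "
          f"rec={recall_score(y_te, pred):.3f} "
          f"f1={f1_score(y_te, pred):.3f}")
```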

    KFREAIN: Design of A Kernel-Level Forensic Layer for Improving Real-Time Evidence Analysis Performance in IoT Networks

    An exponential increase in the number of attacks on IoT networks makes it essential to formulate attack-level mitigation strategies. This paper proposes the design of a scalable kernel-level forensic layer that improves real-time evidence analysis performance and supports efficient pattern analysis of the collected data samples. It has an inbuilt Temporal Blockchain Cache (TBC), which is refreshed after every set of evidence is analyzed. The model uses a multidomain feature extraction engine that combines lightweight Fourier, Wavelet, Convolutional, Gabor, and Cosine feature sets, which are selected by a stochastic Bacterial Foraging Optimizer (BFO) to identify high-variance features. The selected features are processed by an ensemble learning (EL) classifier that uses low-complexity classifiers, reducing the energy consumption during analysis by 8.3% compared with application-level forensic models. The model also showed 3.5% higher accuracy, 4.9% higher precision, and 4.3% higher recall in attack-event identification compared with standard forensic techniques. Due to its kernel-level integration, the model also reduces the delay needed for forensic analysis on different network types by 9.5%, making it useful for real-time and heterogeneous network scenarios.
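
    The feature pipeline above can be caricatured as follows: Fourier and cosine-transform features are extracted from each sample window, a random pick among high-variance features stands in for the Bacterial Foraging Optimizer, and a soft-voting ensemble of light classifiers stands in for the ensemble learner. The kernel-level integration, the blockchain cache, and the wavelet, Gabor, and convolutional features are not reproduced, and all data below is synthetic.

```python
import numpy as np
from scipy.fft import dct
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def extract_features(windows):
    """Concatenate Fourier-magnitude and cosine-transform features per window."""
    fourier = np.abs(np.fft.rfft(windows, axis=1))
    cosine = dct(windows, axis=1, norm="ortho")
    return np.hstack([fourier, cosine])

def pick_high_variance(F, n_keep=64, seed=0):
    """Stochastic stand-in for BFO: sample features from the high-variance candidates."""
    pool = np.argsort(F.var(axis=0))[::-1][: 2 * n_keep]
    return np.random.default_rng(seed).choice(pool, size=min(n_keep, len(pool)), replace=False)

# Synthetic traffic windows with binary attack labels, for illustration only.
rng = np.random.default_rng(0)
windows = rng.normal(size=(200, 128))
labels = rng.integers(0, 2, size=200)

F = extract_features(windows)
cols = pick_high_variance(F)
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(max_depth=5)),
                ("nb", GaussianNB())],
    voting="soft",
).fit(F[:, cols], labels)
```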

    Big data analytics for preventive medicine

    © 2019, Springer-Verlag London Ltd., part of Springer Nature. Medical data is among the most rewarding, yet most complicated, data to analyze. How can healthcare providers use modern data analytics tools and technologies to analyze and create value from complex data? Data analytics promises to efficiently discover valuable patterns by analyzing large amounts of unstructured, heterogeneous, non-standard, and incomplete healthcare data. It not only forecasts but also supports decision making, and it is increasingly seen as a breakthrough whose goal is to improve the quality of patient care and reduce healthcare costs. The aim of this study is to provide a comprehensive and structured overview of the extensive research on the advancement of data analytics methods for disease prevention. This review first introduces disease prevention and its challenges, followed by traditional prevention methodologies. We summarize state-of-the-art data analytics algorithms used for the classification of disease, clustering (for example, detecting an unusually high incidence of a particular disease), anomaly detection (detection of disease), and association, as well as their respective advantages, drawbacks, and guidelines for selecting a specific model, followed by a discussion of recent developments and successful applications of disease prevention methods. The article concludes with open research challenges and recommendations.