2,046 research outputs found

    Fast Clustering Using a Grid-Based Underlying Density Function Approximation

    Clustering is an unsupervised machine learning task that seeks to partition a set of data into smaller groupings, referred to as “clusters”, where items within the same cluster are somehow alike, while differing from those in other clusters. There are many different algorithms for clustering, but many of them are overly complex and scale poorly with larger data sets. In this paper, a new algorithm for clustering is proposed to solve some of these issues. Density-based clustering algorithms use a concept called the “underlying density function”, which is a conceptual higher-dimension function that describes the possible results from the continuous data set that our input data is just a discrete sample of. The algorithm proposed in this paper seeks to use this concept by creating a piecewise approximation of the underlying density function, and then merging points towards local density maxima from this higher-dimensioned space. First, the data space is divided into a grid-based structure and the density of each grid is calculated. Second, each of these “grid-squares” determines the densest space in its local area. Finally, the grid squares are merged together in the direction of their local density maximum, ultimately merging with one of the density maxima that form the root of a cluster. The experimental results show significant time improvements over standard algorithms such as DBSCAN with no accuracy penalty. Furthermore, the algorithm is also suitable for use with parallel and distributed systems, as an implementation with Apache Spark showed proper parallel scaling with low data set sizes required to overtake the serial implementation
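The three steps in the abstract above (grid the data space, let each cell find the densest cell nearby, merge cells uphill toward a density maximum) can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: the function name `grid_density_cluster`, the `cell_size` parameter, the choice of the immediate 3^d neighbourhood as the "local area", and the tie-breaking rule are all assumptions.

```python
from collections import Counter
from itertools import product
from math import floor


def grid_density_cluster(points, cell_size):
    """Label each point with the grid cell of the density maximum it merges into.

    Hypothetical sketch: names, neighbourhood and tie-breaking are assumptions.
    """
    # Step 1: bin points into grid cells; the point count per cell is its density.
    cells = [tuple(floor(x / cell_size) for x in p) for p in points]
    density = Counter(cells)

    # Total order on cells: density first, coordinates as a tie-breaker,
    # so the pointer graph built below cannot contain cycles.
    def rank(cell):
        return (density[cell], cell)

    # Step 2: each occupied cell points to the densest occupied cell in its
    # 3^d neighbourhood (the cell itself included).
    parent = {}
    for cell in density:
        neighbours = (
            tuple(c + o for c, o in zip(cell, offset))
            for offset in product((-1, 0, 1), repeat=len(cell))
        )
        parent[cell] = max((n for n in neighbours if n in density), key=rank)

    # Step 3: follow the pointers uphill until a cell points to itself;
    # that root cell identifies the cluster.
    def root(cell):
        while parent[cell] != cell:
            cell = parent[cell]
        return cell

    return [root(c) for c in cells]


if __name__ == "__main__":
    sample = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (1.5, 0.5), (5.1, 5.2), (5.3, 5.4)]
    print(grid_density_cluster(sample, cell_size=1.0))
```

Because every point is touched once and each cell inspects a fixed number of neighbours, the work grows with the number of points plus the number of occupied cells, which is the kind of scaling a grid-based approach is after.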

    Some Clustering Methods, Algorithms and their Applications

    Clustering is a type of unsupervised learning [15]. In an unsupervised learning task, no target values (or "supervisors") are known, so the purpose is to produce training data from the inputs themselves. Data mining and machine learning would be useless without clustering. Using it to categorize datasets according to their similarities makes it possible to predict user behavior more accurately. The purpose of this research is to compare and contrast three widely-used data-clustering methods. Clustering techniques include partitioning, hierarchy, density, grid, and fuzzy clustering. Machine learning, data mining, pattern recognition, image analysis, and bioinformatics are just a few of the many fields where clustering is utilized as an analytical technique. In addition to defining the various algorithms, specialized forms of cluster analysis, and linking methods, we offer a review of the clustering techniques used in the big data setting

    Study of Clustering Data Mining Techniques

    Data mining's primary purpose is to take a massive data set and break it down into a more manageable form for analysis and application. Exploratory data analysis and data mining applications frequently center on clustering. The term "clustering" refers to the process of categorizing data points into groupings (clusters) in which the objects within each cluster have more similarities than differences. Each approach serves a unique purpose, determined by the nature of the data at hand and the needs of the application. Nonetheless, our research has led us to the conclusion that the K-means approach outperforms the alternatives in a wide variety of settings. In this study, senior undergraduate and master's degree students from the Faculty of Economics and Business Administration at Babeș-Bolyai University of Cluj-Napoca participated via questionnaires in a collaborative effort, with the gathered data being processed through data mining clustering techniques, graphical and percentage representations, using algorithms implemented in the software program Wek

    A Review of Subsequence Time Series Clustering

    Clustering of subsequence time series remains an open issue in time series clustering. Subsequence time series clustering is used in different fields, such as e-commerce, outlier detection, speech recognition, biological systems, DNA recognition, and text mining. One of the useful fields in the domain of subsequence time series clustering is pattern recognition. To improve this field, a sequence of time series data is used. This paper reviews some definitions and backgrounds related to subsequence time series clustering. The categorization of the literature reviews is divided into three groups: preproof, interproof, and postproof period. Moreover, various state-of-the-art approaches in performing subsequence time series clustering are discussed under each of the following categories. The strengths and weaknesses of the employed methods are evaluated as potential issues for future studies

    Two and three dimensional segmentation of multimodal imagery

    The role of segmentation in the realms of image understanding/analysis, computer vision, pattern recognition, remote sensing and medical imaging in recent years has been significantly augmented due to accelerated scientific advances made in the acquisition of image data. This low-level analysis protocol is critical to numerous applications, with the primary goal of expediting and improving the effectiveness of subsequent high-level operations by providing a condensed and pertinent representation of image information. In this research, we propose a novel unsupervised segmentation framework for facilitating meaningful segregation of 2-D/3-D image data across multiple modalities (color, remote-sensing and biomedical imaging) into non-overlapping partitions using several spatial-spectral attributes. Initially, our framework exploits the information obtained from detecting edges inherent in the data. To this effect, by using a vector gradient detection technique, pixels without edges are grouped and individually labeled to partition some initial portion of the input image content. Pixels that contain higher gradient densities are included by the dynamic generation of segments as the algorithm progresses to generate an initial region map. Subsequently, texture modeling is performed and the obtained gradient, texture and intensity information along with the aforementioned initial partition map are used to perform a multivariate refinement procedure, to fuse groups with similar characteristics yielding the final output segmentation. Experimental results obtained in comparison to published/state-of-the-art segmentation techniques for color as well as multi/hyperspectral imagery demonstrate the advantages of the proposed method. Furthermore, for the purpose of achieving improved computational efficiency we propose an extension of the aforestated methodology in a multi-resolution framework, demonstrated on color images. Finally, this research also encompasses a 3-D extension of the aforementioned algorithm demonstrated on medical (Magnetic Resonance Imaging / Computed Tomography) volumes

    Cancer diagnosis using deep learning: A bibliographic review

    In this paper, we first describe the basics of the field of cancer diagnosis, which includes steps of cancer diagnosis followed by the typical classification methods used by doctors, providing a historical idea of cancer classification techniques to the readers. These methods include Asymmetry, Border, Color and Diameter (ABCD) method, seven-point detection method, Menzies method, and pattern analysis. They are used regularly by doctors for cancer diagnosis, although they are not considered very efficient for obtaining better performance. Moreover, considering all types of audience, the basic evaluation criteria are also discussed. The criteria include the receiver operating characteristic curve (ROC curve), Area under the ROC curve (AUC), F1 score, accuracy, specificity, sensitivity, precision, dice-coefficient, average accuracy, and Jaccard index. Previously used methods are considered inefficient, asking for better and smarter methods for cancer diagnosis. Artificial intelligence and cancer diagnosis are gaining attention as a way to define better diagnostic tools. In particular, deep neural networks can be successfully used for intelligent image analysis. The basic framework of how this machine learning works on medical imaging is provided in this study, i.e., pre-processing, image segmentation and post-processing. The second part of this manuscript describes the different deep learning techniques, such as convolutional neural networks (CNNs), generative adversarial models (GANs), deep autoencoders (DANs), restricted Boltzmann’s machine (RBM), stacked autoencoders (SAE), convolutional autoencoders (CAE), recurrent neural networks (RNNs), long short-term memory (LSTM), multi-scale convolutional neural network (M-CNN), multi-instance learning convolutional neural network (MIL-CNN). For each technique, we provide Python codes, to allow interested readers to experiment with the cited algorithms on their own diagnostic problems. The third part of this manuscript compiles the successfully applied deep learning models for different types of cancers. Considering the length of the manuscript, we restrict ourselves to the discussion of breast cancer, lung cancer, brain cancer, and skin cancer. The purpose of this bibliographic review is to provide researchers opting to work in implementing deep learning and artificial neural networks for cancer diagnosis a knowledge from scratch of the state-of-the-art achievements
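Several of the evaluation criteria listed in the abstract above reduce to simple ratios of confusion-matrix counts for a binary task. The sketch below uses the standard textbook definitions and is not taken from the reviewed paper; the function name `binary_metrics` and its argument names are hypothetical, and it assumes every denominator is non-zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary metrics from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives. Assumes all denominators are non-zero.
    """
    return {
        "sensitivity": tp / (tp + fn),            # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        # For binary labels or masks the F1 score and the Dice coefficient coincide.
        "f1_dice": 2 * tp / (2 * tp + fp + fn),
        "jaccard": tp / (tp + fp + fn),
    }


if __name__ == "__main__":
    for name, value in binary_metrics(tp=8, fp=2, fn=2, tn=8).items():
        print(f"{name}: {value:.3f}")
```

ROC curves and AUC are left out because they need per-sample scores across thresholds rather than a single set of counts.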

    K-Means and Alternative Clustering Methods in Modern Power Systems

    As power systems evolve by integrating renewable energy sources, distributed generation, and electric vehicles, the complexity of managing these systems increases. With the increase in data accessibility and advancements in computational capabilities, clustering algorithms, including K-means, are becoming essential tools for researchers in analyzing, optimizing, and modernizing power systems. This paper presents a comprehensive review of over 440 articles published through 2022, emphasizing the application of K-means clustering, a widely recognized and frequently used algorithm, along with its alternative clustering methods within modern power systems. The main contributions of this study include a bibliometric analysis to understand the historical development and wide-ranging applications of K-means clustering in power systems. This research also thoroughly examines K-means, its various variants, potential limitations, and advantages. Furthermore, the study explores alternative clustering algorithms that can complete or substitute K-means. Some prominent examples include K-medoids, Time-series K-means, BIRCH, Bayesian clustering, HDBSCAN, CLIQUE, SPECTRAL, SOMs, TICC, and swarm-based methods, broadening the understanding and applications of clustering methodologies in modern power systems. The paper highlights the wide-ranging applications of these techniques, from load forecasting and fault detection to power quality analysis and system security assessment. Throughout the examination, it has been observed that the number of publications employing clustering algorithms within modern power systems is following an exponential upward trend. This emphasizes the necessity for professionals to understand various clustering methods, including their benefits and potential challenges, to incorporate the most suitable ones into their studies

    Efficient Approximate Big Data Clustering: Distributed and Parallel Algorithms in the Spectrum of IoT Architectures

    Clustering, the task of grouping together similar items, is a frequently used method for processing data, with numerous applications. Clustering the data generated by sensors in the Internet of Things, for instance, can be useful for monitoring and making control decisions. For example, a cyber physical environment can be monitored by one or more 3D laser-based sensors to detect the objects in that environment and avoid critical situations, e.g. collisions. With the advancements in IoT-based systems, the volume of data produced by, typically high-rate, sensors has become immense. For example, a 3D laser-based sensor with a spinning head can produce hundreds of thousands of points in each second. Clustering such a large volume of data using conventional clustering methods takes too long, violating the time-sensitivity requirements of applications leveraging the outcome of the clustering. For example, collisions in a cyber physical environment must be prevented as fast as possible. The thesis contributes to efficient clustering methods for distributed and parallel computing architectures, representative of the processing environments in IoT-based systems. To that end, the thesis proposes MAD-C (abbreviating Multi-stage Approximate Distributed Cluster-Combining) and PARMA-CC (abbreviating Parallel Multiphase Approximate Cluster Combining). MAD-C is a method for distributed approximate data clustering. MAD-C employs an approximation-based data synopsis that drastically lowers the required communication bandwidth among the distributed nodes and achieves multiplicative savings in computation time, compared to a baseline that centrally gathers and clusters the data. PARMA-CC is a method for parallel approximate data clustering on multi-cores. Employing approximation-based data synopsis, PARMA-CC achieves scalability on multi-cores by increasing the synergy between the work-sharing procedure and data structures to facilitate highly parallel execution of threads. The thesis provides analytical and empirical evaluation for MAD-C and PARMA-CC

    Data Mining in Internet of Things Systems: A Literature Review

    The Internet of Things (IoT) and cloud technologies have been the main focus of recent research, allowing for the accumulation of a vast amount of data generated from this diverse environment. These data without any doubt include priceless knowledge if it can be correctly discovered and correlated in an efficient manner. Data mining algorithms can be applied to the Internet of Things (IoT) to extract hidden information from the massive amounts of data that are generated by IoT and are thought to have high business value. In this paper, the most important data mining approaches covering classification, clustering, association analysis, time series analysis, and outlier analysis from the knowledge will be covered. Additionally, a survey of recent work in this direction is included. Other significant challenges in the field are collecting, storing, and managing the large number of devices along with their associated features. In this paper, an in-depth look at data mining for IoT platforms will be given, concentrating on real applications found in the literature