
    Reducing the Time Requirement of k-Means Algorithm

    Traditional k-means and most of its variants are still computationally expensive for large datasets, such as microarray data, which are large and have a high dimension size d. In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k. The problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm that is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is based on the recently established relationship between principal component analysis and k-means clustering. We provide a correctness proof for this algorithm. Results obtained from testing the algorithm on three biological datasets and six non-biological datasets (three of these are real, while the other three are simulated) also indicate that our algorithm is empirically faster than other known k-means algorithms. We assessed the quality of our algorithm's clusters against clusters of known structure using the Hubert-Arabie Adjusted Rand index (ARIHA). We found that when k is close to d, the quality is good (ARIHA of 0.8), and when k is not close to d, the quality of our new k-means algorithm is excellent (ARIHA of 0.9). In this paper, the emphasis is on reducing the time requirement of the k-means algorithm and on its application to microarray data, motivated by the desire to create a tool for clustering and malaria research. However, the new clustering algorithm can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the members is used. This has been demonstrated in this work on six non-biological datasets.
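
    As context for the objective described above, here is a minimal NumPy sketch of standard Lloyd-style k-means, which minimizes the mean squared distance from each point to its nearest center. It illustrates only the baseline that the paper accelerates, not the authors' PCA-based variant; the function name and defaults are illustrative assumptions.

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        # Baseline Lloyd's algorithm (not the paper's variant): alternate
        # assignment and center updates until the centers stop moving.
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        centers = X[rng.choice(n, size=k, replace=False)]  # random initial centers
        for _ in range(n_iter):
            # Assignment step: squared distance of every point to every center
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each center moves to the mean of its assigned points
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        # The objective being minimized: mean squared distance to nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        mse = dists[np.arange(n), labels].mean()
        return centers, labels, mse

    centers, labels, mse = kmeans(np.random.default_rng(1).random((100, 5)), k=3)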

    A Review on Dimension Reduction Techniques in Data Mining

    Real-world data, such as images and speech signals, is high-dimensional, requiring many dimensions to represent each instance. Higher-dimensional data makes it more complex to detect and exploit the relationships among terms. Dimensionality reduction is a technique used to reduce this complexity when analyzing high-dimensional data. Many methodologies are used to find the critical dimensions of a dataset, significantly reducing the number of dimensions relative to the original input data. Dimensionality reduction methods are of two types: feature extraction and feature selection techniques. Feature extraction is a distinct form of dimensionality reduction that derives important features from the input dataset. Two different approaches are available for dimensionality reduction: the supervised approach and the unsupervised approach. One exclusive purpose of this survey is to provide an adequate comprehension of the different dimensionality reduction techniques that currently exist and to indicate which of the described methods is applicable for a given set of parameters and varying conditions. This paper surveys the schemes that are most commonly used for dimensionality reduction of high-dimensional datasets. A comparative analysis of the surveyed methodologies is also provided, based on which the best methodology for a certain type of dataset can be chosen. Keywords: Data Mining, Dimensionality Reduction, Clustering, Feature Selection, Curse of Dimensionality, Critical Dimension
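
    To make the feature extraction versus feature selection distinction concrete, here is a small illustrative sketch (not taken from the survey): PCA-style extraction derives new features as linear combinations of the original dimensions, while a variance-based selector keeps a subset of the original dimensions unchanged. The function names and the variance criterion are assumptions chosen for illustration.

    import numpy as np

    def pca_extract(X, n_components):
        # Feature extraction: project the data onto its top principal components
        Xc = X - X.mean(axis=0)                      # center the data
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:n_components].T              # derived (combined) features

    def variance_select(X, n_features):
        # Feature selection: keep the original columns with the highest variance
        idx = np.argsort(X.var(axis=0))[::-1][:n_features]
        return X[:, idx], idx

    X = np.random.default_rng(0).normal(size=(200, 50))   # 200 samples, 50 dimensions
    Z = pca_extract(X, 5)             # 5 new features, each a mix of all 50 originals
    Xs, kept = variance_select(X, 5)  # 5 of the original features, left unchanged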

    Optimized Parameter Search Approach for Weight Modification Attack Targeting Deep Learning Models

    Deep neural network models have been developed in different fields, bringing many advances in several tasks. However, they have also started to be incorporated into tasks with critical risks. This worries researchers, who have been interested in studying possible attacks on these models and have discovered a long list of threats from which every model should be defended. The weight modification attack has been presented and discussed among researchers, who have put forward several versions of, and analyses of, this threat. It focuses on detecting multiple vulnerable weights whose modification misclassifies the desired input data. Analyzing the different approaches to this attack therefore helps in understanding how to defend against such a vulnerability. This work presents a new version of the weight modification attack. Our approach is based on three processes: input data clusterization, weight selection, and modification of the weights. Data clusterization allows an attack directed at a selected class. Weight selection uses the gradient given by the input data to identify the most vulnerable parameters. The modifications are incorporated at each step via limited noise. Finally, this paper shows how this new version of the fault injection attack is capable of completely misclassifying the desired cluster, converting the 100% accuracy of the targeted cluster to 0–2.7% accuracy, while the rest of the data continues to be well classified. It therefore demonstrates that this attack is a real threat to neural networks. This research has been partially funded by the European Union's Horizon 2020 research and innovation programme project SPARTA and by the Basque Government under the ELKARTEK project (LANTEGI4.0 KK-2020/00072).
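
    The three processes are only described at a high level above, so the following is a simplified, hypothetical NumPy sketch of the gradient-guided weight selection and limited-noise modification steps, applied to a plain linear softmax classifier rather than a deep network. It is not the authors' implementation; the function names, noise budget, and placeholder data are assumptions.

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)        # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def modify_weights(W, X_cluster, y_cluster, n_weights=10, noise=0.05):
        # Forward pass of a linear softmax classifier on the targeted cluster
        probs = softmax(X_cluster @ W)
        onehot = np.eye(W.shape[1])[y_cluster]
        # Gradient of the mean cross-entropy loss with respect to the weights
        grad = X_cluster.T @ (probs - onehot) / len(X_cluster)
        # Weight selection: pick the entries with the largest gradient magnitude
        flat_grad = grad.ravel()
        idx = np.argsort(np.abs(flat_grad))[::-1][:n_weights]
        # Modification: bounded ("limited") noise in the loss-increasing direction
        delta = np.zeros_like(flat_grad)
        delta[idx] = noise * np.sign(flat_grad[idx])
        return W + delta.reshape(W.shape)

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(20, 5))          # 20 input features, 5 classes
    X_cluster = rng.normal(size=(30, 20))            # inputs of the targeted cluster
    y_cluster = np.full(30, 2)                       # all belong to the targeted class
    W_attacked = modify_weights(W, X_cluster, y_cluster)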

    Online content clustering using variant K-Means Algorithms

    Thesis (MTech)--Cape Peninsula University of Technology, 2019.
    We live at a time when a great deal of information is created, and unfortunately much of it is redundant. There is a huge amount of online information in the form of news articles that discuss similar stories, and the number of articles is projected to grow. This growth makes it difficult for a person to process all of that information in order to stay up to date on a subject. There is an overwhelming amount of similar information on the internet, and there is a need for a solution that can organize it into specific themes. The solution comes from a branch of Artificial Intelligence (AI) called machine learning (ML), using clustering algorithms: groups of similar information are clustered into containers. Once the information is clustered, people can be presented with information on their subject of interest grouped together, and the information in a group can be further processed into a summary. This research focuses on unsupervised learning. The literature indicates that K-Means is one of the most widely used unsupervised clustering algorithms: it is easy to learn, easy to implement, and efficient. However, there are many variations of K-Means. The research seeks to find a variant of K-Means that can cluster duplicate or similar news articles into correct semantic groups with acceptable performance. The research is an experiment. News articles were collected from the internet using gocrawler, a program that takes Uniform Resource Locators (URLs) as an argument and collects a story from the website each URL points to. The URLs are read from a repository. The stories arrive riddled with adverts and images from the web page; this is referred to as dirty text. The dirty text is sanitized, that is, the collected news articles are cleaned by removing the adverts and images from the web page. The clean text is stored in a repository and is the input for the algorithm. The other input is the K value; all K-Means based variants take a K value that defines the number of clusters to be produced. The stories are manually classified and labelled, each story with the class to which it belongs, so that the accuracy of machine clustering can be checked. The data collection process itself was not unsupervised, but the algorithms used to cluster are entirely unsupervised. A total of 45 stories were collected and 9 manual clusters were identified. Under each manual cluster there are sub-clusters of stories about one specific event. The performance of all the variants is compared to see which gives the best clustering results. Performance was checked by comparing the manual classification with the clustering results from the algorithm. Each K-Means variant is run on the same settings and the same data set of 45 stories. The settings used, each of which maps onto a parameter in the implementation sketched after this abstract, are:
    • dimensionality of the feature vectors,
    • window size,
    • maximum distance between the current and predicted word in a sentence,
    • minimum word frequency,
    • specified range of words to ignore,
    • number of threads to train the model,
    • the training algorithm, either distributed memory (PV-DM) or distributed bag of words (PV-DBOW),
    • the initial learning rate (the learning rate decreases to a minimum alpha as training progresses),
    • number of iterations per cycle,
    • final learning rate,
    • number of clusters to form,
    • the number of times the algorithm will be run,
    • the method used for initialization.
    The results obtained show that K-Means can perform better than K-Modes. The results are tabulated and presented in graphs in chapter six. Clustering can be improved by incorporating Named Entity Recognition (NER) into the K-Means algorithms. Results can also be improved by implementing a multi-stage clustering technique, where initial clustering is done first and each resulting cluster group is then clustered again to achieve finer results.
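
    The settings listed above correspond closely to the parameters of a Doc2Vec (PV-DM/PV-DBOW) document embedding followed by K-Means. The sketch below shows one plausible mapping using gensim and scikit-learn; the specific parameter values, variable names, and the tiny placeholder stories are assumptions for illustration, not values taken from the thesis.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.cluster import KMeans

    # `stories` holds the sanitized article texts; in the experiment above there
    # were 45 of them and K was set to 9. Two small placeholders are used here.
    stories = ["cleaned text of the first news article",
               "cleaned text of the second news article"]
    corpus = [TaggedDocument(words=s.lower().split(), tags=[i])
              for i, s in enumerate(stories)]

    model = Doc2Vec(
        vector_size=100,   # dimensionality of the feature vectors
        window=5,          # maximum distance between current and predicted word
        min_count=1,       # minimum word frequency
        workers=4,         # number of threads to train the model
        dm=1,              # training algorithm: 1 = PV-DM, 0 = PV-DBOW
        alpha=0.025,       # initial learning rate
        min_alpha=0.0001,  # final learning rate
        epochs=40,         # number of iterations per cycle
    )
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

    vectors = [model.dv[i] for i in range(len(stories))]
    k = min(9, len(stories))                        # number of clusters to form
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)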

    SEQUENCE-STRATIGRAPHIC AND FACIES CONTROLS ON RESERVOIR QUALITY AND PRODUCTIVITY OF EARLY TO MIDDLE MIOCENE FLUVIAL AND TIDE-DOMINATED DELTAIC DEPOSITS, FORMATION 2, GULF OF THAILAND

    The Early to Middle Miocene Formation 2 is the main contributor to hydrocarbon production in the Gulf of Thailand. Formation 2 consists of nine key lithofacies deposited in fluvial and tide-dominated deltaic environments. These lithofacies include 1) coal, 2) organic claystone, 3) bioturbated and laminated claystone, 4) heterolithic sandstone, 5) parallel-laminated sandstone, 6) ripple cross-laminated sandstone, 7) cross-bedded sandstone, 8) structureless sandstone, and 9) conglomerate. Two methods of electrofacies classification, Artificial Neural Networks (ANNs) and K-means clustering, were used to estimate rock types in non-cored wells. For mapping purposes, the lithofacies are combined into four lithologies: 1) coal, 2) claystone, 3) heterolithic sandstone, and 4) sandstone. Using the ANN classification, which achieved an overall accuracy of 85%, lithology logs were estimated to establish a sequence-stratigraphic framework and to map reservoir properties. Formation 2 strata form a subset of a large first-order transgressive sequence that includes the underlying Formation 0 and Formation 1 and the overlying Formation 3. The Formation 2 stratigraphic framework consists of five third-order stratigraphic cycles named, from deepest to shallowest, units 2A-E. A moderate eustatic sea-level rise at approximately 19 Ma resulted in a variety of depositional environments, facies distributions, and associated reservoir properties. Units 2A-C represent a continuous transgression and landward shift of facies. The top of unit 2C possibly indicates the maximum landward extent of the shoreline. Unit 2D records a major regression and basinward shift of facies resulting from the combination of a glacio-eustatic sea-level fall and tectonic uplift in this region. Three-dimensional reservoir models illustrate the spatial distribution of lithology, porosity, permeability, and pore volume of the fluvial and tide-dominated deltaic deposits. Sandstone percentage and reservoir quality relate directly to the regressive cycle, unit 2D, while transgressive cycles 2A-C exhibit lower sandstone content and reservoir quality. A combination of the stratigraphic variability of fluvial and deltaic sandstones and fault compartmentalization controls hydrocarbon accumulation.
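
    As a rough illustration of the electrofacies workflow mentioned above (estimating lithology in non-cored wells from wireline-log curves with a neural network), here is a hedged scikit-learn sketch; the log curves, lithology codes, network size, and random placeholder data are assumptions and do not reproduce the study's model or its reported 85% accuracy.

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import train_test_split

    # X: log samples from cored wells, e.g. columns such as [GR, RHOB, NPHI, DT]
    # y: core-described lithology codes (0 coal, 1 claystone, 2 heterolithic, 3 sandstone)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 4))        # placeholder log data
    y = rng.integers(0, 4, size=1000)     # placeholder lithology labels

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)
    clf = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0))
    clf.fit(X_train, y_train)
    print("hold-out accuracy:", clf.score(X_test, y_test))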