103 research outputs found

    Effective Spell Checking Methods Using Clustering Algorithms

    Get PDF
    This paper presents a novel approach to spell checking using dictionary clustering. The main goal is to reduce the number of times distances have to be calculated when finding target words for misspellings. The method is unsupervised and combines the application of anomalous pattern initialization and partition around medoids (PAM). To evaluate the method, we used an English misspelling list compiled using real examples extracted from the Birkbeck spelling error corpus.Final Published versio

    Feature Relevance in Ward’s Hierarchical Clustering Using the Lp Norm

    Get PDF
    In this paper we introduce a new hierarchical clustering algorithm called Ward p . Unlike the original Ward, Ward p generates feature weights, which can be seen as feature rescaling factors thanks to the use of the L p norm. The feature weights are cluster dependent, allowing a feature to have different degrees of relevance at different clusters. We validate our method by performing experiments on a total of 75 real-world and synthetic datasets, with and without added features made of uniformly random noise. Our experiments show that: (i) the use of our feature weighting method produces results that are superior to those produced by the original Ward method on datasets containing noise features; (ii) it is indeed possible to estimate a good exponent p under a totally unsupervised framework. The clusterings produced by Ward p are dependent on p. This makes the estimation of a good value for this exponent a requirement for this algorithm, and indeed for any other also based on the Lp norm.Peer reviewedFinal Accepted Versio

    A new perceptual dissimilarity measure for image retrieval and clustering

    Get PDF
    Image retrieval and clustering are two important tools for analysing and organising images. Dissimilarity measure is central to both image retrieval and clustering. The performance of image retrieval and clustering algorithms depends on the effectiveness of the dissimilarity measure. ‘Minkowski’ distance, or more specifically, ‘Euclidean’ distance, is the most widely used dissimilarity measure in image retrieval and clustering. Euclidean distance depends only on the geometric position of two data instances in the feature space and completely ignores the data distribution. However, data distribution has an effect on human perception. The argument that two data instances in a dense area are more perceptually dissimilar than the same two instances in a sparser area, is proposed by psychologists. Based on this idea, a dissimilarity measure called, ‘mp’, has been proposed to address Euclidean distance’s limitation of ignoring the data distribution. Here, mp relies on data distribution to calculate the dissimilarity between two instances. As prescribed in mp, higher data mass between two data instances implies higher dissimilarity, and vice versa. mp relies only on data distribution and completely ignores the geometric distance in its calculations. In the aggregation of dissimilarities between two instances over all the dimensions in feature space, both Euclidean distance and mp give same priority to all the dimensions. This may result in a situation that the final dissimilarity between two data instances is determined by a few dimensions of feature vectors with relatively much higher values. As a result, the dissimilarity derived may not align well with human perception. The need to address the limitations of Minkowski distance measures, along with the importance of a dissimilarity measure that considers both geometric distance and the perceptual effect of data distribution in measuring dissimilarity between images motivated this thesis. It studies the performance of mp for image retrieval. It investigates a new dissimilarity measure that combines both Euclidean distance and data distribution. In addition to these, it studies the performance of such a dissimilarity measure for image retrieval and clustering. Our performance study of mp for image retrieval shows that relying only on data distribution to measure the dissimilarity results in some situations, where the mp’s measurement is contrary to human perception. This thesis introduces a new dissimilarity measure called, perceptual dissimilarity measure (PDM). PDM considers the perceptual effect of data distribution in combination with Euclidean distance. PDM has two variants, PDM1 and PDM2. PDM1 focuses on improving mp by weighting it using Euclidean distance in situations where mp may not retrieve accurate results. PDM2 considers the effect of data distribution on the perceived dissimilarity measured by Euclidean distance. PDM2 proposes a weighting system for Euclidean distance using a logarithmic transform of data mass. The proposed PDM variants have been used as alternatives to Euclidean distance and mp to improve the accuracy in image retrieval. Our results show that PDM2 has consistently performed the best, compared to Euclidean distance, mp and PDM1. PDM1’s performance was not consistent, although it has performed better than mp in all the experiments, but it could not outperform Euclidean distance in some cases. Following the promising results of PDM2 in image retrieval, we have studied its performance for image clustering. k-means is the most widely used clustering algorithm in scientific and industrial applications. k-medoids is the closest clustering algorithm to k-means. Unlike k-means which works only with Euclidean distance, k-medoids gives the option to choose the arbitrary dissimilarity measure. We have used Euclidean distance, mp and PDM2 as the dissimilarity measure in k-medoids and compared the results with k-means. Our clustering results show that PDM2 has perfromed overally the best. This confirms our retrieval results and identifies PDM2 as a suitable dissimilarity measure for image retrieval and clustering.Doctor of Philosoph

    Fuzzy C-ordered medoids clustering of interval-valued data

    Get PDF
    Fuzzy clustering for interval-valued data helps us to find natural vague boundaries in such data. The Fuzzy c-Medoids Clustering (FcMdC) method is one of the most popular clustering methods based on a partitioning around medoids approach. However, one of the greatest disadvantages of this method is its sensitivity to the presence of outliers in data. This paper introduces a new robust fuzzy clustering method named Fuzzy c-Ordered-Medoids clustering for interval-valued data (FcOMdC-ID). The Huber's M-estimators and the Yager's Ordered Weighted Averaging (OWA) operators are used in the method proposed to make it robust to outliers. The described algorithm is compared with the fuzzy c-medoids method in the experiments performed on synthetic data with different types of outliers. A real application of the FcOMdC-ID is also provided

    Performance analysis in text clustering using k-means and k-medoids algorithms for Malay crime documents

    Get PDF
    Few studies on text clustering for the Malay language have been conducted due to some limitations that need to be addressed. The purpose of this article is to compare the two clustering algorithms of k-means and k-medoids using Euclidean distance similarity to determine which method is the best for clustering documents. Both algorithms are applied to 1000 documents pertaining to housebreaking crimes involving a variety of different modus operandi. Comparability results indicate that the k-means algorithm performed the best at clustering the relevant documents, with a 78% accuracy rate. K-means clustering also achieves the best performance for cluster evaluation when comparing the average within-cluster distance to the k-medoids algorithm. However, k-medoids perform exceptionally well on the Davis Bouldin index (DBI). Furthermore, the accuracy of k-means is dependent on the number of initial clusters, where the appropriate cluster number can be determined using the elbow method

    Effect of different distance measures in result of cluster analysis

    Get PDF
    The objective of this master’s thesis was to explore different distance measures that could be used in clustering and to evaluate how different distance measures in K-medoid clustering method would affect the clustering output. The different distance measures used in this research includes Euclidean, Squared Euclidean, Manhattan, Chebyshev and Mahalanobis distance. To achieve the research objective, K-medoid method with different distance measures was applied to a spatial dataset to explore relative information revealed by each distance measure. The effect of each distance measure on output is documented and the output was further compared with each other to reveal the differences between each distance measure. The study starts with literature review of cluster analysis process where necessary steps for performing cluster analysis are explained. In literature section, different clustering methods with particular characteristics of each method are described that would serve as basis for choice of clustering method. Data description and data analysis is included thereafter which is followed by interpretation of clustering result and its use for Terrain analysis. Terrain analysis has its significance in forest industry, military as well as crisis management and is usually concerned with off-road mobility of a vehicle or a group of vehicles between given locations. In case of terrain analysis, clustering could be used to group the similar areas and determine the off-road mobility of a particular vehicle. This result could be further categorized according to suitability of the item in the cluster and interpreted using expert evaluation in order to reveal useful information about mobility in a terrain. Cluster Validation measures were applied to output of clustering to determine the differences between different distance measures. The findings of this study indicate that in the study area, there exists some level of differences in the result of clustering when different distance measures are used. This difference is then interpreted with the help of input dataset and expert opinion to understand the effect of different distance measures in the dataset. Finally, the study provides basis for mobility analysis with help of clustering output

    An exploration of methodologies to improve semi-supervised hierarchical clustering with knowledge-based constraints

    Get PDF
    Clustering algorithms with constraints (also known as semi-supervised clustering algorithms) have been introduced to the field of machine learning as a significant variant to the conventional unsupervised clustering learning algorithms. They have been demonstrated to achieve better performance due to integrating prior knowledge during the clustering process, that enables uncovering relevant useful information from the data being clustered. However, the research conducted within the context of developing semi-supervised hierarchical clustering techniques are still an open and active investigation area. Majority of current semi-supervised clustering algorithms are developed as partitional clustering (PC) methods and only few research efforts have been made on developing semi-supervised hierarchical clustering methods. The aim of this research is to enhance hierarchical clustering (HC) algorithms based on prior knowledge, by adopting novel methodologies. [Continues.

    Solving a hard Cutting Stock Problem by machine learning and optimisation

    Get PDF
    We are working with a company on a hard industrial optimisation problem: a version of the well-known Cutting Stock Problem in which a paper mill must cut rolls of paper following certain cutting patterns to meet customer demands. In our problem each roll to be cut may have a different size, the cutting patterns are semi-automated so that we have only indirect control over them via a list of continuous parameters called a request, and there are multiple mills each able to use only one request. We solve the problem using a combination of machine learning and optimisation techniques. First we approximate the distribution of cutting patterns via Monte Carlo simulation. Secondly we cover the distribution by applying a k-medoids algorithm. Thirdly we use the results to build an ILP model which is then solved

    An integrated clustering analysis framework for heterogeneous data

    Get PDF
    Big data is a growing area of research with some important research challenges that motivate our work. We focus on one such challenge, the variety aspect. First, we introduce our problem by defining heterogeneous data as data about objects that are described by different data types, e.g., structured data, text, time-series, images, etc. Through our work we use five datasets for experimentation: a real dataset of prostate cancer data and four synthetic dataset that we have created and made them publicly available. Each dataset covers different combinations of data types that are used to describe objects. Our strategy for clustering is based on fusion approaches. We compare intermediate and late fusion schemes. We propose an intermediary fusion approach, Similarity Matrix Fusion (SMF), where the integration process takes place at the level of calculating similarities. SMF produces a single distance fusion matrix and two uncertainty expression matrices. We then propose a clustering algorithm, Hk-medoids, a modified version of the standard k-medoids algorithm that utilises uncertainty calculations to improve on the clustering performance. We evaluate our results by comparing them to clustering produced using individual elements and show that the fusion approach produces equal or significantly better results. Also, we show that there are advantages in utilising the uncertainty information as Hkmedoids does. In addition, from a theoretical point of view, our proposed Hk-medoids algorithm has less computation complexity than the popular PAM implementation of the k-medoids algorithm. Then, we employed late fusion that aggregates the results of clustering by individual elements by combining cluster labels using an object co-occurrence matrix technique. The final cluster is then derived by a hierarchical clustering algorithm. We show that intermediate fusion for clustering of heterogeneous data is a feasible and efficient approach using our proposed Hk-medoids algorithm
    • …
    corecore