2,176 research outputs found

    A survey of kernel and spectral methods for clustering

    Clustering algorithms are a useful tool to explore data structures and have been employed in many disciplines. The focus of this paper is the partitioning clustering problem with a special interest in two recent approaches: kernel and spectral methods. The aim of this paper is to present a survey of kernel and spectral clustering methods, two approaches able to produce nonlinear separating hypersurfaces between clusters. The presented kernel clustering methods are the kernel versions of many classical clustering algorithms, e.g., K-means, SOM and neural gas. Spectral clustering arises from concepts in spectral graph theory, and the clustering problem is configured as a graph cut problem where an appropriate objective function has to be optimized. Since it has been proven that these two seemingly different approaches have the same mathematical foundation, an explicit proof that the two paradigms have the same objective is reported. Besides, fuzzy kernel clustering methods are presented as extensions of the kernel K-means clustering algorithm. © 2007 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.
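As a rough illustration of the kernel K-means idea mentioned in this abstract, the sketch below computes the squared feature-space distance from a point to a cluster mean using only a precomputed kernel (Gram) matrix, via the standard identity K_ii - (2/|C|) Σ_j K_ij + (1/|C|²) Σ_jl K_jl. The function names and the toy data are illustrative assumptions and are not taken from the surveyed paper.

```python
def kernel_distance_to_centroid(K, i, cluster):
    """Squared feature-space distance from point i to the mean of `cluster`.

    K is a precomputed kernel (Gram) matrix as a list of lists;
    `cluster` is a non-empty list of point indices.
    """
    n = len(cluster)
    cross = sum(K[i][j] for j in cluster) / n
    within = sum(K[j][l] for j in cluster for l in cluster) / (n * n)
    return K[i][i] - 2.0 * cross + within


def assign(K, clusters):
    """One kernel K-means assignment step: nearest cluster index per point."""
    return [
        min(range(len(clusters)),
            key=lambda c: kernel_distance_to_centroid(K, i, clusters[c]))
        for i in range(len(K))
    ]


# Toy 1-D data with a linear kernel (hypothetical example values).
# With a linear kernel the result must match the ordinary squared distance.
xs = [0.0, 2.0, 10.0]
K = [[a * b for b in xs] for a in xs]
labels = assign(K, [[0, 1], [2]])
```

Swapping the linear kernel for a nonlinear one (for example a Gaussian kernel) changes only how `K` is built; the distance computation stays the same, which is what allows nonlinear cluster boundaries.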

    Recent Developments in Document Clustering

    This report aims to give a brief overview of the current state of document clustering research and present recent developments in a well-organized manner. Clustering algorithms are considered with two hypothetical scenarios in mind: online query clustering with tight efficiency constraints, and offline clustering with an emphasis on accuracy. A comparative analysis of the algorithms is performed along with a table summarizing important properties, and open problems as well as directions for future research are discussed

    Fuzzy spectral clustering methods for textual data

    Nowadays, the development of advanced information technologies has led to an increase in the production of textual data. This inevitable growth accentuates the need to identify new methods and tools able to efficiently analyse this kind of data. Against this background, unsupervised classification techniques can play a key role, since most of this data is not classified. Document clustering, which is used for identifying a partition of clusters in a corpus of documents, has proven to perform efficiently in the analysis of textual documents and has been extensively applied in different fields, from topic modelling to information retrieval tasks. Recently, spectral clustering methods have gained success in the field of text classification. These methods have gained popularity due to their solid theoretical foundations, which do not require any specific assumption on the global structure of the data. However, even though they prove to perform well in text classification problems, little has been done in the field of clustering. Moreover, depending on the type of documents analysed, textual documents often do not contain only information related to a single topic: indeed, there might be an overlap of contents characterizing different knowledge domains. Consequently, documents may contain information that is relevant to different areas of interest to some degree. The first part of this work critically analyses the main clustering algorithms used for text data, covering also the mathematical representation of documents and the pre-processing phase. Then, three novel fuzzy versions of spectral clustering algorithms for text data are introduced. The first one uses fuzzy K-medoids instead of K-means. The second one derives directly from the first one but is used in combination with Kernel and Set Similarity (KS2M), which takes into account the Jaccard index. Finally, in the third one, in order to enhance the clustering performance, a new similarity measure S∗ is proposed. This last one exploits the inherent sequential nature of text data by means of a weighted combination of the Spectrum string kernel function and a measure of set similarity. The second part of the thesis focuses on spectral bi-clustering algorithms for text mining tasks, which represent an interesting and partially unexplored field of research. In particular, two novel versions of fuzzy spectral bi-clustering algorithms are introduced. The two algorithms differ from each other in the approach followed to identify the document and the word partitions: the first one follows a simultaneous approach, while the second one follows a sequential approach. This difference also leads to a difference in the choice of the number of clusters. The adequacy of all the proposed fuzzy (bi-)clustering methods is evaluated by experiments performed on both real and benchmark data sets.
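To make the idea of a weighted combination of a spectrum string kernel and a set similarity more concrete, here is a minimal sketch. The abstract does not give the definition of S∗, so everything specific below is an assumption: the convex weight `alpha`, the use of word p-grams, the cosine-style normalization of the kernel, and the choice of the Jaccard index as the set similarity.

```python
from collections import Counter
from math import sqrt


def spectrum_kernel(a, b, p=2):
    """p-spectrum kernel: inner product of the p-gram count vectors of two token lists."""
    grams_a = Counter(tuple(a[i:i + p]) for i in range(len(a) - p + 1))
    grams_b = Counter(tuple(b[i:i + p]) for i in range(len(b) - p + 1))
    return sum(count * grams_b[g] for g, count in grams_a.items())


def normalized_spectrum_kernel(a, b, p=2):
    """Scale the kernel into [0, 1] so it is comparable with the Jaccard index."""
    denom = sqrt(spectrum_kernel(a, a, p) * spectrum_kernel(b, b, p))
    return spectrum_kernel(a, b, p) / denom if denom > 0 else 0.0


def jaccard(a, b):
    """Jaccard index of the two token sets (order and repetition are ignored)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def combined_similarity(a, b, alpha=0.5, p=2):
    """Hypothetical stand-in for S*: alpha weights sequence information, 1 - alpha set overlap."""
    return alpha * normalized_spectrum_kernel(a, b, p) + (1 - alpha) * jaccard(a, b)


# Hypothetical toy documents, tokenized on whitespace.
doc1 = "fuzzy spectral clustering of text".split()
doc2 = "spectral clustering of documents".split()
similarity = combined_similarity(doc1, doc2)
```

The sequence term rewards documents that share word order, while the set term rewards shared vocabulary regardless of order; `alpha` trades one off against the other.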

    Methods for fast and reliable clustering

    Text Classification Aided by Clustering: a Literature Review

    Clustering and its Application in Requirements Engineering

    Large scale software systems challenge almost every activity in the software development life-cycle, including tasks related to eliciting, analyzing, and specifying requirements. Fortunately many of these complexities can be addressed through clustering the requirements in order to create abstractions that are meaningful to human stakeholders. For example, the requirements elicitation process can be supported through dynamically clustering incoming stakeholders’ requests into themes. Cross-cutting concerns, which have a significant impact on the architectural design, can be identified through the use of fuzzy clustering techniques and metrics designed to detect when a theme cross-cuts the dominant decomposition of the system. Finally, traceability techniques, required in critical software projects by many regulatory bodies, can be automated and enhanced by the use of cluster-based information retrieval methods. Unfortunately, despite a significant body of work describing document clustering techniques, there is almost no prior work which directly addresses the challenges, constraints, and nuances of requirements clustering. As a result, the effectiveness of software engineering tools and processes that depend on requirements clustering is severely limited. This report directly addresses the problem of clustering requirements through surveying standard clustering techniques and discussing their application to the requirements clustering process

    A framework for regime identification and asset allocation

    The purpose of this thesis is to examine a regime-based asset allocation strategy and evaluate whether accounting for regime-dependent risk and return of asset classes provides any significant improvement in portfolio performance. The South African market and economy are considered as a proxy for the analysis. The motivation for this thesis stems from the growing body of research by practitioners devoted to models that reflect the interdependency between financial assets and the real economy. The asset classes under consideration are domestic and foreign cash, domestic and foreign bonds, domestic and foreign equity, inflation-linked bonds, property, gold and commodities. In order to evaluate the performance of the regime-based strategy, this thesis proposes a framework based on Principal Component Analysis and Fuzzy Cluster Analysis for regime identification and asset allocation. The performance of the strategy is tested against two strategies that are not cognizant of regime changes: an equally weighted portfolio and a buy-and-hold strategy. Furthermore, relative performance analysis was performed by comparing the regime-based strategy proposed in this thesis against the Alexander Forbes Large Manager Watch Index. Due to data limitations, the analysis is done on an in-sample basis without out-of-sample testing. The results showed the extent of outperformance of the proposed regime-based strategy relative to an equally weighted strategy and a buy-and-hold strategy. These results were consistent with existing literature on regime-based strategies. Furthermore, the results provided strong motivation for the use of the regime identification framework together with the tactical asset allocation proposed in this thesis.

    USING LATENT SEMANTIC INDEXING FOR DOCUMENT CLUSTERING

    Documents with various contents are easily obtained from URLs which are associated with their titles. However, the titles of documents may not describe their contents and may merely attract readers to buy and read them. Therefore, clustering documents by category is important to help users retrieve the information they need. Document clustering is a data mining task. By measuring the similarity of documents' characteristics, documents can be clustered by category or topic. The document representation has high dimensionality because all substantial words are represented in the vector space model. This is one of the problems in document clustering that decreases cluster quality, as measured by F-measure, entropy and accuracy. In the categorical domain, much research has been conducted to reduce the dimension of the term-document matrix representation, including by using a keyword base. However, these approaches obtain low accuracy on document collections with various class sizes. Therefore, this research is intended to improve the quality and accuracy of document clustering by using a method from information retrieval. Latent Semantic Indexing (LSI) is proposed to reduce the dimension of the term-document matrix used for document representation. In this work, the LSI method is used to produce patterns of terms, so that documents can be mapped into a concept space. Based on the new representation, the documents are then clustered with the Fuzzy c-Means algorithm. A variant of distance measurement, cosine similarity, is also embedded in this algorithm. The results are then compared with some existing algorithms, which are used for benchmark purposes. The results show that the proposed method obtains high-quality clusters and is superior to the other fuzzy clustering algorithms for the categorical domain, namely FCCM, FSKWIC, and Fuzzy CoDoK, with an accuracy rate of over 90%.
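The clustering step described in this abstract, Fuzzy c-Means with a cosine-based distance, can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the cosine distance (1 minus cosine similarity) is used directly as the dissimilarity in place of the squared Euclidean distance, the fuzzifier defaults to `m = 2`, the initial centers are supplied by the caller, and the input vectors stand in for documents already mapped into the LSI concept space (the LSI step itself is not shown).

```python
from math import sqrt


def cosine_distance(x, y):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm if norm > 0 else 1.0


def update_memberships(points, centers, m=2.0, eps=1e-12):
    """Return U[i][j], the membership of point i in cluster j; each row sums to 1."""
    memberships = []
    for x in points:
        # Clamp distances away from zero so a point lying on a center is still handled.
        dists = [max(cosine_distance(x, c), eps) for c in centers]
        inv = [d ** (-1.0 / (m - 1.0)) for d in dists]
        total = sum(inv)
        memberships.append([v / total for v in inv])
    return memberships


def update_centers(points, memberships, m=2.0):
    """Each center is the mean of all points weighted by membership ** m."""
    dim = len(points[0])
    centers = []
    for j in range(len(memberships[0])):
        weights = [row[j] ** m for row in memberships]
        total = sum(weights)
        centers.append([
            sum(w * x[d] for w, x in zip(weights, points)) / total
            for d in range(dim)
        ])
    return centers


def fuzzy_c_means(points, init_centers, m=2.0, iters=20):
    """Alternate membership and center updates for a fixed number of iterations."""
    centers = [list(c) for c in init_centers]
    for _ in range(iters):
        memberships = update_memberships(points, centers, m)
        centers = update_centers(points, memberships, m)
    return update_memberships(points, centers, m), centers


# Hypothetical 2-D "concept space" vectors forming two obvious directions.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
U, centers = fuzzy_c_means(docs, init_centers=[[1.0, 0.0], [0.0, 1.0]])
```

Unlike hard K-means, each document receives a degree of membership in every cluster, so a document that mixes topics is not forced into a single category.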