1,783 research outputs found

    Fuzzy spectral clustering methods for textual data

    Get PDF
    Nowadays, the development of advanced information technologies has determined an increase in the production of textual data. This inevitable growth accentuates the need to advance in the identification of new methods and tools able to efficiently analyse such kind of data. Against this background, unsupervised classification techniques can play a key role in this process since most of this data is not classified. Document clustering, which is used for identifying a partition of clusters in a corpus of documents, has proven to perform efficiently in the analyses of textual documents and it has been extensively applied in different fields, from topic modelling to information retrieval tasks. Recently, spectral clustering methods have gained success in the field of text classification. These methods have gained popularity due to their solid theoretical foundations which do not require any specific assumption on the global structure of the data. However, even though they prove to perform well in text classification problems, little has been done in the field of clustering. Moreover, depending on the type of documents analysed, it might be often the case that textual documents do not contain only information related to a single topic: indeed, there might be an overlap of contents characterizing different knowledge domains. Consequently, documents may contain information that is relevant to different areas of interest to some degree. The first part of this work critically analyses the main clustering algorithms used for text data, involving also the mathematical representation of documents and the pre-processing phase. Then, three novel fuzzy versions of spectral clustering algorithms for text data are introduced. The first one exploits the use of fuzzy K-medoids instead of K-means. The second one derives directly from the first one but is used in combination with Kernel and Set Similarity (KS2M), which takes into account the Jaccard index. Finally, in the third one, in order to enhance the clustering performance, a new similarity measure S∗ is proposed. This last one exploits the inherent sequential nature of text data by means of a weighted combination between the Spectrum string kernel function and a measure of set similarity. The second part of the thesis focuses on spectral bi-clustering algorithms for text mining tasks, which represent an interesting and partially unexplored field of research. In particular, two novel versions of fuzzy spectral bi-clustering algorithms are introduced. The two algorithms differ from each other for the approach followed in the identification of the document and the word partitions. Indeed, the first one follows a simultaneous approach while the second one a sequential approach. This difference leads also to a diversification in the choice of the number of clusters. The adequacy of all the proposed fuzzy (bi-)clustering methods is evaluated by experiments performed on both real and benchmark data sets

    Trapezoidal Spherical Fuzzy Numbers and its Application to Fuzzy Risk Analysis

    Get PDF
    Spherical fuzzy sets are a broader type of fuzzy sets that have the ability to handle various scenarios using their membership, non-membership, and neutral membership grades. These sets require that the total of the squares of these grades be no greater than one. This condition extends the possible values for the three grades and enables decision makers to have a wider range of options when assessing a situation. In solving real life problems, it is necessary to describe a real number as a spherical fuzzy set to incorporate the fuzziness, thus, the need to use trapezoidal spherical fuzzy numbers (TSFN).  In this paper, the membership functions of the TSFN, their arithmetic operations and their properties are discussed. Also, a ranking function is proposed to order the TSFNs. All these are used to solve a fuzzy risk analysis problem whose parameters are presented as TSFNs

    A cosine based validation measure for Document Clustering

    Get PDF
    Document Clustering is the peculiar application of cluster analysis methods on huge documentary databases. Document Clustering aims at organizing a large quantity of unlabelled documents into a smaller number of meaningful and coherent clusters, similar in content. One of the main unsolved problems in clustering literature is the lack of a reliable methodology to evaluate results, although a wide variety of validation measures has been proposed. If those measures are often unsatisfactory when dealing with numerical databases, they definitely underperform in Document Clustering. This paper proposes a new validation measure. After introducing the most common approaches to Document Clustering, our attention is focused on Spherical K-means, do to its strict connection with the Vector Space Model, typical of Information Retrieval. Since Spherical K-means adopts a cosine-based similarity measure, we propose a validation measure based on the same criterion. The new measure effectiveness is shown in the frame of a comparative study, by involving 13 different corpora (usually used in literature for comparing different proposals) and 15 validation measures

    Circular Pythagorean fuzzy sets and applications to multi-criteria decision making

    Full text link
    In this paper, we introduce the concept of circular Pythagorean fuzzy set (value) (C-PFS(V)) as a new generalization of both circular intuitionistic fuzzy sets (C-IFSs) proposed by Atannassov and Pythagorean fuzzy sets (PFSs) proposed by Yager. A circular Pythagorean fuzzy set is represented by a circle that represents the membership degree and the non-membership degree and whose center consists of non-negative real numbers μ\mu and ν\nu with the condition μ2+ν21\mu^2+\nu^2\leq 1. A C-PFS models the fuzziness of the uncertain information more properly thanks to its structure that allows modelling the information with points of a circle of a certain center and a radius. Therefore, a C-PFS lets decision makers to evaluate objects in a larger and more flexible region and thus more sensitive decisions can be made. After defining the concept of C-PFS we define some fundamental set operations between C-PFSs and propose some algebraic operations between C-PFVs via general tt-norms and tt-conorms. By utilizing these algebraic operations, we introduce some weighted aggregation operators to transform input values represented by C-PFVs to a single output value. Then to determine the degree of similarity between C-PFVs we define a cosine similarity measure based on radius. Furthermore, we develop a method to transform a collection of Pythagorean fuzzy values to a PFS. Finally, a method is given to solve multi-criteria decision making problems in circular Pythagorean fuzzy environment and the proposed method is practiced to a problem about selecting the best photovoltaic cell from the literature. We also study the comparison analysis and time complexity of the proposed method

    Recent Developments in Document Clustering

    Get PDF
    This report aims to give a brief overview of the current state of document clustering research and present recent developments in a well-organized manner. Clustering algorithms are considered with two hypothetical scenarios in mind: online query clustering with tight efficiency constraints, and offline clustering with an emphasis on accuracy. A comparative analysis of the algorithms is performed along with a table summarizing important properties, and open problems as well as directions for future research are discussed

    Techniques for clustering gene expression data

    Get PDF
    Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state of the art applications which recognises these limitations and implements procedures to overcome them. It provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered
    corecore