41,004 research outputs found

    Fuzzy spectral clustering methods for textual data

    Get PDF
    Nowadays, the development of advanced information technologies has determined an increase in the production of textual data. This inevitable growth accentuates the need to advance in the identification of new methods and tools able to efficiently analyse such kind of data. Against this background, unsupervised classification techniques can play a key role in this process since most of this data is not classified. Document clustering, which is used for identifying a partition of clusters in a corpus of documents, has proven to perform efficiently in the analyses of textual documents and it has been extensively applied in different fields, from topic modelling to information retrieval tasks. Recently, spectral clustering methods have gained success in the field of text classification. These methods have gained popularity due to their solid theoretical foundations which do not require any specific assumption on the global structure of the data. However, even though they prove to perform well in text classification problems, little has been done in the field of clustering. Moreover, depending on the type of documents analysed, it might be often the case that textual documents do not contain only information related to a single topic: indeed, there might be an overlap of contents characterizing different knowledge domains. Consequently, documents may contain information that is relevant to different areas of interest to some degree. The first part of this work critically analyses the main clustering algorithms used for text data, involving also the mathematical representation of documents and the pre-processing phase. Then, three novel fuzzy versions of spectral clustering algorithms for text data are introduced. The first one exploits the use of fuzzy K-medoids instead of K-means. The second one derives directly from the first one but is used in combination with Kernel and Set Similarity (KS2M), which takes into account the Jaccard index. Finally, in the third one, in order to enhance the clustering performance, a new similarity measure S∗ is proposed. This last one exploits the inherent sequential nature of text data by means of a weighted combination between the Spectrum string kernel function and a measure of set similarity. The second part of the thesis focuses on spectral bi-clustering algorithms for text mining tasks, which represent an interesting and partially unexplored field of research. In particular, two novel versions of fuzzy spectral bi-clustering algorithms are introduced. The two algorithms differ from each other for the approach followed in the identification of the document and the word partitions. Indeed, the first one follows a simultaneous approach while the second one a sequential approach. This difference leads also to a diversification in the choice of the number of clusters. The adequacy of all the proposed fuzzy (bi-)clustering methods is evaluated by experiments performed on both real and benchmark data sets

    A new classification technique based on hybrid fuzzy soft set theory and supervised fuzzy c-means

    Get PDF
    Recent advances in information technology have led to significant changes in today‟s world. The generating and collecting data have been increasing rapidly. Popular use of the World Wide Web (www) as a global information system led to a tremendous amount of information, and this can be in the form of text document. This explosive growth has generated an urgent need for new techniques and automated tools that can assist us in transforming the data into more useful information and knowledge. Data mining was born for these requirements. One of the essential processes contained in the data mining is classification, which can be used to classify such text documents and utilize it in many daily useful applications. There are many classification methods, such as Bayesian, K-Nearest Neighbor, Rocchio, SVM classifier, and Soft Set Theory used to classify text document. Although those methods are quite successful, but accuracy and efficiency are still outstanding for text classification problem. This study is to propose a new approach on classification problem based on hybrid fuzzy soft set theory and supervised fuzzy c-means. It is called Hybrid Fuzzy Classifier (HFC). The HFC used the fuzzy soft set as data representation and then using the supervised fuzzy c-mean as classifier. To evaluate the performance of HFC, two well-known datasets are used i.e., 20 Newsgroups and Reuters-21578, and compared it with the performance of classic fuzzy soft set classifiers and classic text classifiers. The results show that the HFC outperforms up to 50.42% better as compared to classic fuzzy soft set classifier and up to 0.50% better as compare classic text classifier

    Document preprocessing and fuzzy unsupervised character classification

    Get PDF
    This dissertation presents document preprocessing and fuzzy unsupervised character classification for automatically reading daily-received office documents that have complex layout structures, such as multiple columns and mixed-mode contents of texts, graphics and half-tone pictures. First, the block segmentation algorithm is performed based on a simple two-step run-length smoothing to decompose a document into single-mode blocks. Next, the block classification is performed based on the clustering rules to classify each block into one of the types such as text, horizontal or vertical lines, graphics, and pictures. The mean white-to-black transition is shown as an invariance for textual blocks, and is useful for block discrimination. A fuzzy model for unsupervised character classification is designed to improve the robustness, correctness, and speed of the character recognition system. The classification procedures are divided into two stages. The first stage separates the characters into seven typographical categories based on word structures of a text line. The second stage uses pattern matching to classify the characters in each category into a set of fuzzy prototypes based on a nonlinear weighted similarity function. A fuzzy model of unsupervised character classification, which is more natural in the representation of prototypes for character matching, is defined and the weighted fuzzy similarity measure is explored. The characteristics of the fuzzy model are discussed and used in speeding up the classification process. After classification, the character recognition procedure is simply applied on the limited versions of the fuzzy prototypes. To avoid information loss and extra distortion, an topography-based approach is proposed to apply directly on the fuzzy prototypes to extract the skeletons. First, a convolution by a bell-shaped function is performed to obtain a smooth surface. Second, the ridge points are extracted by rule-based topographic analysis of the structure. Third, a membership function is assigned to ridge points with values indicating the degrees of membership with respect to the skeleton of an object. Finally, the significant ridge points are linked to form strokes of skeleton, and the clues of eigenvalue variation are used to deal with degradation and preserve connectivity. Experimental results show that our algorithm can reduce the deformation of junction points and correctly extract the whole skeleton although a character is broken into pieces. For some characters merged together, the breaking candidates can be easily located by searching for the saddle points. A pruning algorithm is then applied on each breaking position. At last, a multiple context confirmation can be applied to increase the reliability of breaking hypotheses

    Enhanced sentence extraction through neuro-fuzzy technique for text document summarization

    Get PDF
    A summary system comprises a subtraction of text documents to generate a new form that delivers the essentials contents of the documents. Due to the hassle of documents overload, getting the right information and effectively-developed summaries are essential in retrieving information. Reduction of information allows users to find the information needed quickly without the need to read the full document collection, in particular, multi documents. In the recent past, soft computing-based approaches have gained popularity in its ability to determine important information across documents. A number of studies have modelled summarization systems based on fuzzy logic reasoning in order to select important sentences to be included in the summary. Although past studies support the benefits of employing fuzzy based reasoning for extracting important sentences from the document, there is a limitation concerning this method. Human or linguistic experts are required to determine the rules for the fuzzy system. Furthermore, the membership functions need to be manually tuned. These can be a very tedious and time-consuming process. Moreover, the performance of the fuzzy system can be affected by the choice of rules and parameters of membership function. Therefore, this study proposes a text summarization model based on classification using neuro-fuzzy approach. A classifier is first trained to identify summary sentences. Then, we use the proposed model to score and filter high-quality summary sentences. We compare the performance of our proposed model with the existing approaches, which are based on fuzzy logic and neural network techniques. In this study, we also evaluate the performance of sentence scoring and clustering in the process of generating text summaries. The proposed neuro-fuzzy model was used to score the sentences and clustering were performed using K-Means and Hierarchical Clustering (HC) approaches. The proposed approach showed improved results compared to the previous techniques in terms of precision, recall and F-measure on the Document Understanding Conference (DUC) data corpus. However, it was found that no improvements in the quality of the generated summaries obtained by simply performing clustering

    Taming Wild High Dimensional Text Data with a Fuzzy Lash

    Full text link
    The bag of words (BOW) represents a corpus in a matrix whose elements are the frequency of words. However, each row in the matrix is a very high-dimensional sparse vector. Dimension reduction (DR) is a popular method to address sparsity and high-dimensionality issues. Among different strategies to develop DR method, Unsupervised Feature Transformation (UFT) is a popular strategy to map all words on a new basis to represent BOW. The recent increase of text data and its challenges imply that DR area still needs new perspectives. Although a wide range of methods based on the UFT strategy has been developed, the fuzzy approach has not been considered for DR based on this strategy. This research investigates the application of fuzzy clustering as a DR method based on the UFT strategy to collapse BOW matrix to provide a lower-dimensional representation of documents instead of the words in a corpus. The quantitative evaluation shows that fuzzy clustering produces superior performance and features to Principal Components Analysis (PCA) and Singular Value Decomposition (SVD), two popular DR methods based on the UFT strategy

    The category proliferation problem in ART neural networks

    Get PDF
    This article describes the design of a new model IKMART, for classification of documents and their incorporation into categories based on the KMART architecture. The architecture consists of two networks that mutually cooperate through the interconnection of weights and the output matrix of the coded documents. The architecture retains required network features such as incremental learning without the need of descriptive and input/output fuzzy data, learning acceleration and classification of documents and a minimal number of user-defined parameters. The conducted experiments with real documents showed a more precise categorization of documents and higher classification performance in comparison to the classic KMART algorithm.Web of Science145634
    corecore