    Text Categorization of Documents using K-Means and K-Means++ Clustering Algorithm

    Text categorization is the task of sorting a set of documents into categories from a predefined set. It supports better management of text documents and makes document retrieval a simple task. Clustering is an unsupervised learning technique aimed at grouping a set of objects into clusters. Text document clustering means grouping related text documents based on their content. Various clustering algorithms are available for text categorization. This paper presents categorization of text documents using two clustering algorithms, K-means and K-means++, and compares them to find which is better suited to categorizing text documents. The project also introduces a pre-processing phase, which includes tokenization, stop-word removal and stemming, followed by Tf-Idf calculation. In addition, the impact of three distance/similarity measures (Cosine Similarity, Jaccard coefficient, Euclidean distance) on the results of both clustering algorithms (K-means and K-means++) is evaluated. The dataset considered for evaluation consists of 600 text documents in three categories: Festivals, Sports and Tourism in India. Our observations show that the K-Means++ clustering algorithm with the Cosine Similarity measure gives better results than K-means. For the K-Means++ algorithm using the Cosine Similarity measure, the purity of the clusters obtained is 0.8216.
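
The difference between the two algorithms compared above lies in how the initial centroids are chosen. The following is a minimal sketch of that step only, not the paper's implementation: plain K-means picks seeds uniformly at random, while K-means++ picks each new seed with probability proportional to its squared distance from the nearest seed already chosen. The function names and the toy 2-D points are hypothetical, and squared Euclidean distance is assumed here even though the paper also evaluates cosine and Jaccard measures.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def random_init(points, k, rng):
    """Plain K-means seeding: k distinct points chosen uniformly at random."""
    return rng.sample(points, k)


def kmeans_pp_init(points, k, rng):
    """K-means++ seeding: spread the initial centroids apart (D^2 weighting)."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from every point to its nearest chosen centroid.
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(d2) == 0:
            # Degenerate case: every point coincides with a chosen centroid.
            centroids.append(rng.choice(points))
        else:
            # Far-away points are proportionally more likely to be picked;
            # already-chosen points have weight 0 and cannot be picked again.
            centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids


# Hypothetical toy data: three well-separated pairs of points.
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.1, 0.0)]
seeds = kmeans_pp_init(pts, 3, random.Random(0))
```

Only the seeding differs; the assignment and centroid-update iterations that follow are the same for both algorithms.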

    Graph based text representation for document clustering

    Advances in digital technology and the World Wide Web have led to an increase in digital documents that are used for various purposes such as publishing and digital libraries. This phenomenon raises awareness of the need for effective techniques that can help during the search and retrieval of text. One of the most needed tasks is clustering, which automatically groups documents into meaningful groups. Clustering is an important task in data mining and machine learning. The accuracy of clustering depends strongly on the choice of text representation method. Traditional methods of text representation model documents as bags of words using term frequency-inverse document frequency (TFIDF). This method ignores the relationships and meanings of words in the document. As a result, the sparsity and semantic problems that are prevalent in textual documents are not resolved. In this study, the problems of sparsity and semantics are reduced by proposing a graph-based text representation method, namely the dependency graph, with the aim of improving the accuracy of document clustering. The dependency graph representation scheme is created through an accumulation of syntactic and semantic analysis. A sample of the 20 Newsgroups dataset was used in this study. The text documents undergo pre-processing and syntactic parsing in order to identify the sentence structure. Then the semantics of words are modeled using the dependency graph. The resulting dependency graph is then used in the cluster analysis, for which the K-means clustering technique was used. The dependency graph based clustering results were compared with the popular text representation methods, i.e. TFIDF and ontology-based text representation. The results show that the dependency graph outperforms both TFIDF and ontology-based text representation. The findings show that the proposed text representation method leads to more accurate document clustering results.
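
The claim that a bag-of-words TFIDF representation ignores relationships between words can be illustrated with a small sketch. The code below is an assumption-laden toy, not the study's pipeline: it uses whitespace tokenization, the unsmoothed weight tf × log(N / df), and three made-up sentences. Two sentences with opposite meanings but the same words receive identical vectors.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for non-empty, whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            [(tf[t] / len(toks)) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


# Hypothetical example sentences.
docs = ["dog bites man", "man bites dog", "stocks fall sharply"]
vocab, vecs = tfidf_vectors(docs)

# Word order is lost: the first two documents are indistinguishable.
same_vector = vecs[0] == vecs[1]
```

A representation that records who does what to whom, such as the dependency graph proposed in the study, is one way to keep these two sentences apart.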

    Visualising Arabic sentiments and association rules in financial text

    Text mining methods involve various techniques, such as text categorization, summarisation, information retrieval, document clustering, topic detection, and concept extraction. In addition, because of the difficulties involved in text mining, visualisation techniques can play a paramount role in the analysis and pre-processing of textual data. This paper presents two novel frameworks for the classification and extraction of association rules and the visualisation of financial Arabic text, in order to reveal both the general structure and the sentiment within an accumulated corpus. However, mining unstructured data with natural language processing (NLP) and machine learning techniques can be arduous, especially where the Arabic language is concerned, because of limited research in this area. The results show that our frameworks can readily classify Arabic tweets. Furthermore, they can handle many antecedent text association rules for the positive class and the negative class.

    A SEMANTICS-BASED CLUSTERING APPROACH FOR SIMILAR RESEARCH AREA DETECTION: A CASE STUDY OF NIGERIAN UNIVERSITIES

    Research collaboration is indispensable in producing research publications. The task of detecting similar research areas is crucial to the development and furtherance of research. Prominent and rookie researchers alike are predisposed to seek existing research publications in a research field of interest before coming up with a thesis. The manual process of searching out individuals in an already existing research field is cumbersome and time-consuming. Besides, it tends not to capture publications with keywords that do not match a keyword query, which leads to inaccurate results. From extant literature, automated similar research area detection systems have been developed to solve this problem. However, most of them use keyword matching techniques, which do not sufficiently capture the implicit semantics of keywords, thereby leaving out some research articles. In this work, we propose a similar research area detection framework to address this problem. The aim of this study is to develop a semantics-based clustering method for similar research area detection. This study employs a number of techniques, such as ontology-based pre-processing, Latent Semantic Indexing and K-Means clustering, to develop a prototype similar research area detection system that can be used to determine similar research domain publications. However, traditional document clustering techniques suffer from high dimensionality and data sparsity problems. In a bid to solve these problems, a domain ontology is used in the pre-processing stage to weight concepts and determine semantically similar concepts, while Latent Semantic Analysis is used as the topic modelling technique in order to capture the implicit semantic relationships between terms in the text corpus. To test our framework, publications from a number of Nigerian university faculties were randomly selected and used as the dataset for our clustering model. A proof-of-concept implementation was developed using the Python programming language. From the evaluation of our system, we were able to derive more accurate clustering results from the integration of ontologies in the pre-processing stage, in comparison with documents that were not pre-processed with the ontology.

    DocSCAN: Unsupervised Text Classification via Learning from Neighbors

    We introduce DocSCAN, a completely unsupervised text classification approach using Semantic Clustering by Adopting Nearest-Neighbors (SCAN). For each document, we obtain semantically informative vectors from a large pre-trained language model. Similar documents have proximate vectors, so neighbors in the representation space tend to share topic labels. Our learnable clustering approach uses pairs of neighboring datapoints as a weak learning signal. The proposed approach learns to assign classes to the whole dataset without provided ground-truth labels. On five topic classification benchmarks, we improve on various unsupervised baselines by a large margin. In datasets with relatively few and balanced outcome classes, DocSCAN approaches the performance of supervised classification. The method fails for other types of classification, such as sentiment analysis, pointing to important conceptual and practical differences between classifying images and texts. Comment: in Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022). Potsdam, Germany
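
The neighbor-mining idea described above can be sketched in a few lines. This is an illustrative toy, not the DocSCAN code: the 2-D vectors stand in for language-model embeddings, cosine similarity is an assumed choice of proximity measure, and the function names are hypothetical. The sketch only produces the (document, neighbor) pairs; the step that trains a classifier on those pairs is omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def neighbor_pairs(embeddings, k):
    """Pair each vector with its k most similar other vectors."""
    pairs = []
    for i, vi in enumerate(embeddings):
        scored = [
            (cosine(vi, vj), j) for j, vj in enumerate(embeddings) if j != i
        ]
        scored.sort(reverse=True)  # most similar first
        pairs.extend((i, j) for _, j in scored[:k])
    return pairs


# Hypothetical embeddings: documents 0 and 1 are close, as are 2 and 3.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
pairs = neighbor_pairs(toy, k=1)
```

Each pair acts as a weak hint that the two documents probably share a label, which is the signal the abstract says the clustering model learns from.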

    Fuzzy spectral clustering methods for textual data

    Nowadays, the development of advanced information technologies has led to an increase in the production of textual data. This inevitable growth accentuates the need to identify new methods and tools able to analyse this kind of data efficiently. Against this background, unsupervised classification techniques can play a key role in this process, since most of this data is not classified. Document clustering, which is used for identifying a partition of clusters in a corpus of documents, has proven to perform efficiently in the analysis of textual documents and has been extensively applied in different fields, from topic modelling to information retrieval tasks. Recently, spectral clustering methods have gained success in the field of text classification. These methods have gained popularity due to their solid theoretical foundations, which do not require any specific assumption on the global structure of the data. However, even though they prove to perform well in text classification problems, little has been done in the field of clustering. Moreover, depending on the type of documents analysed, it may often be the case that textual documents do not contain only information related to a single topic: indeed, there might be an overlap of contents characterizing different knowledge domains. Consequently, documents may contain information that is relevant to different areas of interest to some degree. The first part of this work critically analyses the main clustering algorithms used for text data, covering also the mathematical representation of documents and the pre-processing phase. Then, three novel fuzzy versions of spectral clustering algorithms for text data are introduced. The first one uses fuzzy K-medoids instead of K-means. The second one derives directly from the first one but is used in combination with Kernel and Set Similarity (KS2M), which takes into account the Jaccard index. Finally, in the third one, in order to enhance the clustering performance, a new similarity measure S∗ is proposed. This last one exploits the inherent sequential nature of text data by means of a weighted combination of the Spectrum string kernel function and a measure of set similarity. The second part of the thesis focuses on spectral bi-clustering algorithms for text mining tasks, which represent an interesting and partially unexplored field of research. In particular, two novel versions of fuzzy spectral bi-clustering algorithms are introduced. The two algorithms differ in the approach followed in the identification of the document and the word partitions: the first one follows a simultaneous approach, while the second follows a sequential approach. This difference also leads to a diversification in the choice of the number of clusters. The adequacy of all the proposed fuzzy (bi-)clustering methods is evaluated by experiments performed on both real and benchmark data sets.
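
To make the idea of combining a string kernel with a set similarity concrete, here is one possible reading in code. It is a sketch under several assumptions and should not be taken as the thesis's definition of S∗: the abstract does not give the formula, so the character-level p-spectrum kernel, its cosine-style normalization, the word-level Jaccard index, the convex weight `alpha`, and all function names are choices made here for illustration.

```python
import math
from collections import Counter


def spectrum_kernel(a, b, p):
    """p-spectrum kernel: dot product of length-p substring count vectors."""
    ca = Counter(a[i:i + p] for i in range(len(a) - p + 1))
    cb = Counter(b[i:i + p] for i in range(len(b) - p + 1))
    return sum(ca[s] * cb[s] for s in ca if s in cb)


def normalized_spectrum(a, b, p):
    """Scale the kernel into [0, 1]; returns 0.0 if either string is too short."""
    denom = math.sqrt(spectrum_kernel(a, a, p) * spectrum_kernel(b, b, p))
    return spectrum_kernel(a, b, p) / denom if denom else 0.0


def jaccard(a, b):
    """Jaccard index over the sets of whitespace-separated words."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def combined_similarity(a, b, p=3, alpha=0.5):
    """Assumed form: alpha * sequence similarity + (1 - alpha) * set similarity."""
    return alpha * normalized_spectrum(a, b, p) + (1 - alpha) * jaccard(a, b)


# Hypothetical example: same words, different order.
score = combined_similarity("k means clustering", "clustering k means")
```

The sequence term is sensitive to word order and the set term is not, so two texts with the same words in a different order score below 1 under this combination.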

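    Aplikasi Text Mining untuk Klasterisasi Aduan Masyarakat Kota Semarang Menggunakan Algoritma K-means (Text Mining Application for Clustering Public Complaints in Semarang City Using the K-means Algorithm)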

    Social media is a service that strongly supports government activities, especially in providing openness and community-based government. One example is the Semarang City government's Center for Community Complaints Management (P3M), whose task is to manage community complaints that arrive through its communication channels, one of which is the social media platform Twitter. The public complaints that arrive every day vary widely, which makes it difficult for managers to categorize complaint reports according to the relevant Local Government Organizations (OPD). This paper focuses on the problem of how to cluster community complaints. The data come from Twitter, collected using the keyword "Laporhendi". Text data from community complaint tweets were analyzed with text mining methods. Pre-processing of the text data consists of case folding, tokenizing, stemming, stopword removal and word weighting with tf-idf. The k-means algorithm is used to divide the complaints into clusters. The cluster results are evaluated using purity to determine the accuracy of the grouping.
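
Purity, the evaluation measure named above, is the fraction of documents that belong to the majority class of their assigned cluster. A minimal sketch follows; the function name, cluster ids and category labels are hypothetical and are not data from the paper.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Fraction of items whose label is the majority label of their cluster."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, true_labels):
        members[cid].append(label)
    # For each cluster, count how many items carry its most common label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(true_labels)


# Hypothetical example: six complaints, two clusters, two true categories.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["road", "road", "waste", "waste", "waste", "road"]
score = purity(clusters, labels)  # 4 of 6 items match their cluster majority
```

Purity ranges from 0 to 1 and rises trivially as the number of clusters grows, so it is usually compared across runs with the same number of clusters.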