336 research outputs found

    Adaptive firefly algorithm for hierarchical text clustering

    Get PDF
    Text clustering is essentially used by search engines to increase the recall and precision in information retrieval. As search engine operates on Internet content that is constantly being updated, there is a need for a clustering algorithm that offers automatic grouping of items without prior knowledge on the collection. Existing clustering methods have problems in determining optimal number of clusters and producing compact clusters. In this research, an adaptive hierarchical text clustering algorithm is proposed based on Firefly Algorithm. The proposed Adaptive Firefly Algorithm (AFA) consists of three components: document clustering, cluster refining, and cluster merging. The first component introduces Weight-based Firefly Algorithm (WFA) that automatically identifies initial centers and their clusters for any given text collection. In order to refine the obtained clusters, a second algorithm, termed as Weight-based Firefly Algorithm with Relocate (WFAR), is proposed. Such an approach allows the relocation of a pre-assigned document into a newly created cluster. The third component, Weight-based Firefly Algorithm with Relocate and Merging (WFARM), aims to reduce the number of produced clusters by merging nonpure clusters into the pure ones. Experiments were conducted to compare the proposed algorithms against seven existing methods. The percentage of success in obtaining optimal number of clusters by AFA is 100% with purity and f-measure of 83% higher than the benchmarked methods. As for entropy measure, the AFA produced the lowest value (0.78) when compared to existing methods. The result indicates that Adaptive Firefly Algorithm can produce compact clusters. This research contributes to the text mining domain as hierarchical text clustering facilitates the indexing of documents and information retrieval processes

    GF-CLUST: A nature-inspired algorithm for automatic text clustering

    Get PDF
    Text clustering is a task of grouping similar documents into a cluster while assigning the dissimilar ones in other clusters.A well-known clustering method which is the K-means algorithm is extensively employed in many disciplines.However, there is a big challenge to determine the number of clusters using K-means. This paper presents a new clustering algorithm, termed Gravity Firefly Clustering (GF-CLUST) that utilizes Firefly Algorithm for dynamic document clustering. The GF-CLUST features the ability of identifying the appropriate number of clusters for a given text collection, which is a challenging problem in document clustering. It determines documents having strong force as centers and creates clusters based on cosine similarity measurement.This is followed by selecting potential clusters and merging small clusters to them. Experiments on various document datasets, such as 20 Newgroups, Reuters-21578 and TREC collection are conducted to evaluate the performance of the proposed GF-CLUST. The results of purity, F-measure and Entropy of GF-CLUST outperform the ones produced by existing clustering techniques, such as K-means, Particle Swarm Optimization (PSO) and Practical General Stochastic Clustering Method (pGSCM).Furthermore, the number of obtained clusters in GF-CLUST is near to the actual number of clusters as compared to pGSCM

    A Hotspot Discovery Method Based on Improved FIHC Clustering Algorithm

    Get PDF
    It was difficult to find the microblog hotspot because the characteristics of microblog were short, rapid, change and so on. A microblog hotspot detection method based on MFIHC and TOPSIS was proposed in order to solve the problem. Firstly, the calculation of HowNet similarity was used in the score function of FIHC, the semantic links between frequent words were considered, and the initial clusters based on frequent words were produced more accurately. Then the initial cluster of the text repletion of mircoblog was reduced, and the idea of Single-Pass clustering was used to the reduced topic cluster in order to get the Hotspot. At last, an improved TOPSIS model was used to sort the hot topics in order to get the rank of the hot topics. Compared with the other text clustering algorithms and hotspot detection methods, the method has good effect, and can be a more comprehensive response to the current hot topics

    Advances in Meta-Heuristic Optimization Algorithms in Big Data Text Clustering

    Full text link
    This paper presents a comprehensive survey of the meta-heuristic optimization algorithms on the text clustering applications and highlights its main procedures. These Artificial Intelligence (AI) algorithms are recognized as promising swarm intelligence methods due to their successful ability to solve machine learning problems, especially text clustering problems. This paper reviews all of the relevant literature on meta-heuristic-based text clustering applications, including many variants, such as basic, modified, hybridized, and multi-objective methods. As well, the main procedures of text clustering and critical discussions are given. Hence, this review reports its advantages and disadvantages and recommends potential future research paths. The main keywords that have been considered in this paper are text, clustering, meta-heuristic, optimization, and algorithm

    Segmentation of images by color features: a survey

    Get PDF
    En este articulo se hace la revisiĂłn del estado del arte sobre la segmentaciĂłn de imagenes de colorImage segmentation is an important stage for object recognition. Many methods have been proposed in the last few years for grayscale and color images. In this paper, we present a deep review of the state of the art on color image segmentation methods; through this paper, we explain the techniques based on edge detection, thresholding, histogram-thresholding, region, feature clustering and neural networks. Because color spaces play a key role in the methods reviewed, we also explain in detail the most commonly color spaces to represent and process colors. In addition, we present some important applications that use the methods of image segmentation reviewed. Finally, a set of metrics frequently used to evaluate quantitatively the segmented images is shown

    Focus: A Graph Approach for Data-Mining and Domain-Specific Assembly of Next Generation Sequencing Data

    Get PDF
    Next Generation Sequencing (NGS) has emerged as a key technology leading to revolutionary breakthroughs in numerous biomedical research areas. These technologies produce millions to billions of short DNA reads that represent a small fraction of the original target DNA sequence. These short reads contain little information individually but are produced at a high coverage of the original sequence such that many reads overlap. Overlap relationships allow for the reads to be linearly ordered and merged by computational programs called assemblers into long stretches of contiguous sequence called contigs that can be used for research applications. Although the assembly of the reads produced by NGS remains a difficult task, it is the process of extracting useful knowledge from these relatively short sequences that has become one of the most exciting and challenging problems in Bioinformatics. The assembly of short reads is an aggregative process where critical information is lost as reads are merged into contigs. In addition, the assembly process is treated as a black box, with generic assembler tools that do not adapt to input data set characteristics. Finally, as NGS data throughput continues to increase, there is an increasing need for smart parallel assembler implementations. In this dissertation, a new assembly approach called Focus is proposed. Unlike previous assemblers, Focus relies on a novel hybrid graph constructed from multiple graphs at different levels of granularity to represent the assembly problem, facilitating information capture and dynamic adjustment to input data set characteristics. This work is composed of four specific aims: 1) The implementation of a robust assembly and analysis tool built on the hybrid graph platform 2) The development and application of graph mining to extract biologically relevant features in NGS data sets 3) The integration of domain specific knowledge to improve the assembly and analysis process. 4) The construction of smart parallel computing approaches, including the application of energy-aware computing for NGS assembly and knowledge integration to improve algorithm performance. In conclusion, this dissertation presents a complete parallel assembler called Focus that is capable of extracting biologically relevant features directly from its hybrid assembly graph

    Genomic characterization of murine monocytes reveals C/EBPβ transcription factor dependence of Ly6C(-) cells

    Get PDF
    Monocytes are circulating, short-lived mononuclear phagocytes, which in mice and man comprise two main subpopulations. Murine Ly6C(+) monocytes display developmental plasticity and are recruited to complement tissue-resident macrophages and dendritic cells on demand. Murine vascular Ly6C(-) monocytes patrol the endothelium, act as scavengers, and support vessel wall repair. Here we characterized population and single cell transcriptomes, as well as enhancer and promoter landscapes of the murine monocyte compartment. Single cell RNA-seq and transplantation experiments confirmed homeostatic default differentiation of Ly6C(+) into Ly6C(-) monocytes. The main two subsets were homogeneous, but linked by a more heterogeneous differentiation intermediate. We show that monocyte differentiation occurred through de novo enhancer establishment and activation of pre-established (poised) enhancers. Generation of Ly6C(-) monocytes involved induction of the transcription factor C/EBP{beta} and C/EBP{beta}-deficient mice lacked Ly6C(-) monocytes. Mechanistically, C/EBP{beta} bound the Nr4a1 promoter and controlled expression of this established monocyte survival factor

    Development of a R package to facilitate the learning of clustering techniques

    Get PDF
    This project explores the development of a tool, in the form of a R package, to ease the process of learning clustering techniques, how they work and what their pros and cons are. This tool should provide implementations for several different clustering techniques with explanations in order to allow the student to get familiar with the characteristics of each algorithm by testing them against several different datasets while deepening their understanding of them through the explanations. Additionally, these explanations should adapt to the input data, making the tool not only adept for self-regulated learning but for teaching too.Grado en Ingeniería Informátic

    Opinion leader detection in Asian social networks using modified spider monkey optimization

    Get PDF
    The Asian social networks are dominated by the society’s collectivist culture, and this interestingly introduces a influence mechanism aided by word-of-mouth and opinion leaders. An opinion leader can help to generate and shape other people’s opinion and achieve a high information spread on any topic. In this work, a modified spider monkey optimization based opinion leader detection approach is proposed. Firstly, we employ the modified node2vec graph embedding to generate the lower dimensional vectors which act as the initial features for the nodes in a typical Asian social network. Next the entire population is broken down into several groups using the k-means++ algorithm where the number of clusters is equal to the number of opinion leaders to be selected. The local and global leaders are chosen by using the coordinates of the cluster centres of these clusters. The coordinates of the centroids of the clusters are then used to detect the local and global leaders in the network. The local leaders then form the seed set of opinion leaders for the network. The positions of the nodes in the network, including the local and global leaders, are updated over a number of iterations. At the end of these iterations, the seed set generating the maximum influence forms the set of opinion leaders in the network. We test our proposed approach using the popular information diffusion and cognitive opinion dynamics (COD) models. We perform intensive experiments on several real-life social networks based on various performance metrics. The results obtained reveal that the proposed approach outperforms several existing techniques of opinion leader detection
    • …
    corecore