6 research outputs found

    Adaptive firefly algorithm for hierarchical text clustering

    Text clustering is used by search engines to increase recall and precision in information retrieval. Because search engines operate on Internet content that is constantly being updated, there is a need for a clustering algorithm that groups items automatically without prior knowledge of the collection. Existing clustering methods have problems in determining the optimal number of clusters and in producing compact clusters. In this research, an adaptive hierarchical text clustering algorithm is proposed based on the Firefly Algorithm. The proposed Adaptive Firefly Algorithm (AFA) consists of three components: document clustering, cluster refining, and cluster merging. The first component introduces the Weight-based Firefly Algorithm (WFA), which automatically identifies initial centers and their clusters for any given text collection. To refine the obtained clusters, a second algorithm, termed Weight-based Firefly Algorithm with Relocate (WFAR), is proposed. This approach allows a pre-assigned document to be relocated into a newly created cluster. The third component, Weight-based Firefly Algorithm with Relocate and Merging (WFARM), aims to reduce the number of produced clusters by merging non-pure clusters into the pure ones. Experiments were conducted to compare the proposed algorithms against seven existing methods. The percentage of success in obtaining the optimal number of clusters by AFA is 100%, with purity and F-measure of 83% higher than the benchmarked methods. As for the entropy measure, AFA produced the lowest value (0.78) when compared to existing methods. The results indicate that the Adaptive Firefly Algorithm can produce compact clusters. This research contributes to the text mining domain, as hierarchical text clustering facilitates the indexing of documents and information retrieval processes.
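
The abstract above reports purity and entropy as quality measures. As a minimal sketch of how such scores are commonly computed, the Python below uses one widespread definition: purity as the share of documents in their cluster's majority class, and entropy as the size-weighted mean of per-cluster Shannon entropy in base 2. The function names and the toy labels are assumptions for illustration; the paper may normalize these measures differently.

```python
import math
from collections import Counter, defaultdict


def contingency(cluster_ids, class_labels):
    """Count how many documents of each class fall in each cluster."""
    table = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        table[cluster][label] += 1
    return table


def purity(cluster_ids, class_labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    table = contingency(cluster_ids, class_labels)
    majority_total = sum(max(counts.values()) for counts in table.values())
    return majority_total / len(class_labels)


def entropy(cluster_ids, class_labels):
    """Size-weighted mean of per-cluster Shannon entropy (log base 2)."""
    table = contingency(cluster_ids, class_labels)
    n = len(class_labels)
    total = 0.0
    for counts in table.values():
        size = sum(counts.values())
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        total += (size / n) * h
    return total


# Hypothetical toy example: six documents, two clusters, two true classes.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["sport", "sport", "tech", "tech", "tech", "tech"]
print("purity :", purity(clusters, labels))
print("entropy:", entropy(clusters, labels))
```

Higher purity and lower entropy both indicate clusters that are dominated by a single class, which is the sense in which the abstract compares methods.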

    Detectando anomalías en los metadatos de cartotecas [Detecting anomalies in map library metadata]

    The validation of the metadata associated with digitized maps is one of the main problems related to the preservation of cartographic information in digital map libraries. This article describes the first research phases of a new automatic method to ensure the validity of spatial metadata. The method is based on the combination of spatial clustering techniques and geographical statistical analysis.

    On the Use of Self-Organizing Map for Text Clustering in Engineering Change Process Analysis: A Case Study

    In modern industry, the development of complex products involves engineering changes that frequently require redesigning or altering the products or their components. In an engineering change process, engineering change requests (ECRs) are documents (forms) with parts written in natural language that describe a suggested enhancement or a problem with a product or a component. ECRs initiate the change process and promote discussions within an organization to help determine the impact of a change and the best possible solution. Although ECRs can contain important details, such as recurring problems or examples of good practice repeated across a number of projects, they are often stored but not consulted, so important opportunities to learn from previous projects are missed. This paper explores the application of the Self-Organizing Map (SOM) to the problem of unsupervised clustering of ECR texts. A case study is presented in which ECRs collected during the engineering change process in the railway industry are analyzed. The results show that SOM text clustering has good potential to improve overall knowledge reuse and exploitation.
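
For readers unfamiliar with the technique named in the abstract above, the following is a minimal sketch of a self-organizing map in plain Python. It is a generic one-dimensional SOM with a Gaussian neighbourhood and linearly decaying learning rate, not the configuration used in the case study. The function names, parameter defaults, and the toy two-dimensional vectors (standing in for document term vectors) are all assumptions for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bmu(weights, x):
    """Index of the best-matching unit: the node whose weights are closest to x."""
    return min(range(len(weights)), key=lambda k: euclidean(weights[k], x))


def train_som(data, n_nodes=4, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Train a one-dimensional SOM (a chain of n_nodes) with online updates."""
    rng = random.Random(seed)
    # Initialise each node from a randomly chosen sample.
    weights = [list(rng.choice(data)) for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.1)  # neighbourhood radius shrinks
        for x in rng.sample(data, len(data)):    # visit samples in random order
            winner = bmu(weights, x)
            for k, w in enumerate(weights):
                # Gaussian neighbourhood on the chain: nodes near the winner move more.
                h = math.exp(-((k - winner) ** 2) / (2.0 * sigma ** 2))
                for d in range(len(w)):
                    w[d] += lr * h * (x[d] - w[d])
    return weights


# Hypothetical toy vectors forming two loose groups.
data = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
weights = train_som(data)
print([bmu(weights, x) for x in data])  # map node assigned to each sample
```

Samples that share a best-matching unit, or land on neighbouring units, can be read as belonging to the same cluster. In a text setting each vector would be a document representation such as a weighted term vector.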

    Sustainability Performance Assessment Using Self-Organizing Maps (SOM) and Classification and Ensembles of Regression Trees (CART)

    This study aims to develop a new approach based on machine learning techniques to assess sustainability performance. Two main dimensions of sustainability, ecological sustainability and human sustainability, were considered. A set of sustainability indicators was used, and the research method was developed using cluster analysis and prediction learning techniques. A Self-Organizing Map (SOM) was applied for data clustering, while Classification and Regression Trees (CART) were applied to assess sustainability performance. The proposed method was evaluated on the Sustainability Assessment by Fuzzy Evaluation (SAFE) dataset, which comprises various indicators of sustainability performance in 128 countries. Eight clusters were found in the data through the SOM clustering technique. A prediction model was built in each cluster through the CART technique. In addition, an ensemble of CART was constructed in each SOM cluster to increase the prediction accuracy of CART. All prediction models were assessed using the adjusted coefficient of determination. The results demonstrated that the prediction accuracy values were high in all CART models, and that the method combining ensembles of CART with clustering provides higher prediction accuracy than individual CART models. The main advantage of the proposed method is its ability to automate the derivation of decision rules from big data for prediction models. The method proposed in this study could be implemented as an effective tool for sustainability performance assessment.
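
The abstract above assesses its models with the adjusted coefficient of determination. As a small sketch, the Python below computes it from the standard formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The function names and the sample values are hypothetical and are not taken from the study.

```python
def r_squared(y_true, y_pred):
    """Ordinary coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(y_true, y_pred, n_predictors):
    """R^2 penalised for the number of predictors used by the model."""
    n = len(y_true)
    r2 = r_squared(y_true, y_pred)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


# Hypothetical observed and predicted scores for five observations.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
print(adjusted_r_squared(y_true, y_pred, n_predictors=2))
```

Because the penalty grows with the number of predictors, the adjusted value allows a fairer comparison between models of different complexity than plain R².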

    Sustainability performance assessment using self-organizing maps (SOM) and classification and ensembles of regression trees (CART)

    This study aims to develop a new approach based on machine learning techniques to assess sustainability performance. Two main dimensions of sustainability, ecological sustainability and human sustainability, were considered. A set of sustainability indicators was used, and the research method was developed using cluster analysis and prediction learning techniques. A Self-Organizing Map (SOM) was applied for data clustering, while Classification and Regression Trees (CART) were applied to assess sustainability performance. The proposed method was evaluated on the Sustainability Assessment by Fuzzy Evaluation (SAFE) dataset, which comprises various indicators of sustainability performance in 128 countries. Eight clusters were found in the data through the SOM clustering technique. A prediction model was built in each cluster through the CART technique. In addition, an ensemble of CART was constructed in each SOM cluster to increase the prediction accuracy of CART. All prediction models were assessed using the adjusted coefficient of determination. The results demonstrated that the prediction accuracy values were high in all CART models, and that the method combining ensembles of CART with clustering provides higher prediction accuracy than individual CART models. The main advantage of the proposed method is its ability to automate the derivation of decision rules from big data for prediction models. The method proposed in this study could be implemented as an effective tool for sustainability performance assessment.

    Semantic feature reduction and hybrid feature selection for clustering of Arabic Web pages

    High-dimensional data is reported in the literature to reduce the efficiency of clustering algorithms. Clustering Arabic text is challenging because capturing its meaning requires deep semantic processing. To overcome these problems, feature selection and reduction methods have become essential for identifying appropriate features and reducing the high-dimensional space. There is a need for a suitable design of feature selection and reduction methods that yields a more relevant, meaningful and reduced representation of Arabic texts to ease the clustering process. The research developed three different methods for analyzing the features of Arabic Web text. The first method is a hybrid feature selection that selects the informative term representation within the Arabic Web pages. It combines three feature selection methods, Chi-square, Mutual Information and Term Frequency–Inverse Document Frequency, into a hybrid model. The second method is a latent document vectorization method that represents documents as probability distributions in the vector space. It overcomes the problem of high dimensionality by reducing the dimensional space. To extract the best features, two document vectorizer methods were implemented, known as the Bayesian vectorizer and the semantic vectorizer. The third method is an Arabic semantic feature analysis used to improve the capability of Arabic Web analysis. It ensures a good design for the clustering method to optimize clustering ability when analysing these Web pages. This is done by overcoming the problems of term representation, semantic modeling and dimensional reduction. Different experiments were carried out with k-means clustering on two different data sets. The methods provided solutions to reduce high-dimensional data and to identify the semantic features shared between similar Arabic Web pages that are grouped together in one cluster. These pages were clustered according to the semantic similarities between them, yielding a small Davies–Bouldin index and high accuracy. This study contributed to research on clustering algorithms by developing three methods to identify the most relevant features of Arabic Web pages.
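
The abstract above cites a small Davies–Bouldin index as evidence of good clustering. As a minimal sketch of the standard definition, the Python below takes, for each cluster, the worst-case ratio of combined within-cluster scatter to centroid separation, then averages these ratios over all clusters. The function names and the toy points are assumptions for illustration and are unrelated to the study's data.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a list of vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def davies_bouldin(clusters):
    """clusters: list of at least two clusters, each a list of numeric vectors."""
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance from each member to its own cluster centroid.
    scatter = [
        sum(euclidean(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    worst = []
    for i in range(len(clusters)):
        # Similarity of cluster i to every other cluster; keep the largest.
        ratios = [
            (scatter[i] + scatter[j]) / euclidean(centers[i], centers[j])
            for j in range(len(clusters))
            if j != i
        ]
        worst.append(max(ratios))
    return sum(worst) / len(worst)


# Hypothetical toy clusters: tight groups that lie far apart give a low index.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
print(davies_bouldin(tight))
```

Lower values mean clusters are compact relative to the distance between them, which is why the abstract treats a small index as a favourable result.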