45 research outputs found

    Adaptive firefly algorithm for hierarchical text clustering

    Text clustering is used by search engines to increase recall and precision in information retrieval. Because search engines operate on Internet content that is constantly being updated, there is a need for a clustering algorithm that groups items automatically without prior knowledge of the collection. Existing clustering methods have problems in determining the optimal number of clusters and in producing compact clusters. In this research, an adaptive hierarchical text clustering algorithm is proposed based on the Firefly Algorithm. The proposed Adaptive Firefly Algorithm (AFA) consists of three components: document clustering, cluster refining, and cluster merging. The first component introduces the Weight-based Firefly Algorithm (WFA), which automatically identifies initial centers and their clusters for any given text collection. To refine the obtained clusters, a second algorithm, termed Weight-based Firefly Algorithm with Relocate (WFAR), is proposed; it allows a pre-assigned document to be relocated into a newly created cluster. The third component, Weight-based Firefly Algorithm with Relocate and Merging (WFARM), aims to reduce the number of produced clusters by merging non-pure clusters into the pure ones. Experiments were conducted to compare the proposed algorithms against seven existing methods. The percentage of success in obtaining the optimal number of clusters by AFA is 100%, with purity and F-measure of 83% higher than the benchmarked methods. On the entropy measure, AFA produced the lowest value (0.78) among the compared methods. The results indicate that the Adaptive Firefly Algorithm can produce compact clusters. This research contributes to the text mining domain, as hierarchical text clustering facilitates document indexing and information retrieval.
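
Purity and entropy, two of the evaluation measures named in the abstract above, can be illustrated with a minimal sketch. The function names `purity` and `cluster_entropy` are hypothetical, and the entropy here is the unnormalised, base-2, size-weighted form; the thesis may use a different normalisation, so values from this sketch are not directly comparable to the reported 0.78.

```python
import math
from collections import Counter


def _group_by_cluster(clusters, labels):
    """Collect the true class labels of the documents in each cluster."""
    groups = {}
    for cluster_id, label in zip(clusters, labels):
        groups.setdefault(cluster_id, []).append(label)
    return groups


def purity(clusters, labels):
    """Fraction of documents that match the majority class of their cluster."""
    groups = _group_by_cluster(clusters, labels)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in groups.values()
    )
    return majority_total / len(labels)


def cluster_entropy(clusters, labels):
    """Size-weighted mean of per-cluster class entropy (lower is better).

    Assumption: base-2 logarithm and no normalisation by the number of classes.
    """
    groups = _group_by_cluster(clusters, labels)
    total = len(labels)
    result = 0.0
    for members in groups.values():
        n = len(members)
        h = -sum((k / n) * math.log2(k / n) for k in Counter(members).values())
        result += (n / total) * h
    return result


if __name__ == "__main__":
    # Toy example: six documents, two clusters, two true classes.
    assigned = [0, 0, 0, 1, 1, 1]
    truth = ["a", "a", "b", "b", "b", "b"]
    print(purity(assigned, truth))
    print(cluster_entropy(assigned, truth))
```

A perfectly pure clustering gives a purity of 1 and an entropy of 0 under these definitions.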

    ACO-based feature selection algorithm for classification

    A dataset with a small number of records but a large number of attributes exhibits a phenomenon called the “curse of dimensionality”. Classifying this type of dataset requires Feature Selection (FS) methods to extract useful information. The modified graph clustering ant colony optimisation (MGCACO) algorithm is an effective FS method that was developed based on grouping highly correlated features. However, the MGCACO algorithm has three main drawbacks in producing a feature subset, arising from its clustering method, its parameter sensitivity, and its final subset determination. An enhanced graph clustering ant colony optimisation (EGCACO) algorithm is proposed to solve these three MGCACO problems. The proposed improvement includes: (i) an ACO feature clustering method to obtain clusters of highly correlated features; (ii) an adaptive selection technique for subset construction from the clusters of features; and (iii) a genetic-based method for producing the final subset of features. The ACO feature clustering method uses mechanisms such as intensification and diversification for local and global optimisation to provide highly correlated features. The adaptive technique for ant selection enables the parameter to change adaptively based on feedback from the search space. The genetic method determines the final subset automatically, based on crossover and a subset quality calculation. The performance of the proposed algorithm was evaluated on 18 benchmark datasets from the University of California, Irvine (UCI) repository and nine (9) deoxyribonucleic acid (DNA) microarray datasets against 15 benchmark metaheuristic algorithms. The experimental results of the EGCACO algorithm on the UCI datasets are superior to the other benchmark optimisation algorithms in terms of the number of selected features for 16 out of the 18 UCI datasets (88.89%), and the best in eight (8) of the datasets (44.44%) for classification accuracy. Further, experiments on the nine (9) DNA microarray datasets showed that the EGCACO algorithm is superior to the benchmark algorithms in terms of classification accuracy (first rank) for seven (7) datasets (77.78%) and has the lowest number of selected features in six (6) datasets (66.67%). The proposed EGCACO algorithm can be utilised for FS in DNA microarray classification tasks that involve large dataset sizes in various application domains.

    The single row layout problem with clearances

    The single row layout problem (SRLP) is a specially structured instance of the classical facility layout problem, used especially in flexible manufacturing systems. The SRLP consists of finding the most efficient arrangement of a given number of machines along one side of the material handling path, with the purpose of minimising the total weighted sum of distances among all machine pairs. To reflect real manufacturing situations, a minimum space (a so-called clearance) between machines may be required by technological constraints, safety considerations and regulations. This thesis outlines the different concepts of clearances used in the literature and analyses their effects on modelling and solution approaches for the SRLP. In particular, the special characteristics of sequence-dependent, asymmetric clearances are discussed and finally extended to large-size clearances (machine-spanning clearances). For this, adjusted and novel model formulations and solution approaches are presented. Furthermore, a comprehensive survey of articles published in this research area since 2000 is provided, which identifies recent developments and emerging trends in the SRLP.
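
The SRLP objective described above can be sketched for a tiny instance as follows. This is an illustrative sketch only, not the thesis's model: the names `srlp_cost` and `brute_force_srlp` are hypothetical, distances are assumed to be measured between machine centres, the weight matrix is assumed symmetric, and `clearance[i][j]` is assumed to be the gap required when machine `i` stands immediately to the left of machine `j` (so it may differ from `clearance[j][i]`). Machine-spanning clearances are not modelled.

```python
from itertools import permutations


def srlp_cost(order, lengths, weights, clearance):
    """Total weighted centre-to-centre distance for one left-to-right order."""
    centres = {}
    position = 0.0  # running coordinate of the right edge of the last machine
    previous = None
    for machine in order:
        if previous is not None:
            # Sequence-dependent gap between the two adjacent machines.
            position += clearance[previous][machine]
        centres[machine] = position + lengths[machine] / 2
        position += lengths[machine]
        previous = machine

    cost = 0.0
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            i, j = order[a], order[b]
            cost += weights[i][j] * abs(centres[i] - centres[j])
    return cost


def brute_force_srlp(lengths, weights, clearance):
    """Enumerate every permutation; only practical for a handful of machines."""
    machines = range(len(lengths))
    return min(
        permutations(machines),
        key=lambda order: srlp_cost(order, lengths, weights, clearance),
    )


if __name__ == "__main__":
    lengths = [2, 4, 6]
    weights = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
    clearance = [[0, 1, 1], [2, 0, 1], [1, 1, 0]]  # asymmetric: [1][0] != [0][1]
    best = brute_force_srlp(lengths, weights, clearance)
    print(best, srlp_cost(best, lengths, weights, clearance))
```

Because the clearances are asymmetric, an order and its mirror image can have different costs, which is one reason such instances are harder to model than the clearance-free SRLP.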

    Practical approaches to mining of clinical datasets : from frameworks to novel feature selection

    Research has investigated clinical data that have embedded within them numerous complexities and uncertainties in the form of missing values, class imbalances and high dimensionality. The research in this thesis was motivated by these challenges: it aims to minimise these problems whilst, at the same time, maximising classification performance and selecting the significant subset of variables. This led to the proposal of a data mining framework and a feature selection method. The proposed framework, called the Handling Clinical Data Framework (HCDF), has a simple algorithmic structure and uses a modified form of existing frameworks to address a variety of data issues. The assessment of data mining techniques reveals that missing value imputation and resampling for class balancing can improve classification performance. Next, the proposed feature selection method is introduced; it involves projecting onto the principal component (FS-PPC) and draws on ideas from both feature extraction and feature selection to select a significant subset of features from the data. This method selects features that have high correlation with the principal component by applying symmetrical uncertainty (SU). Irrelevant and redundant features are then removed by using mutual information (MI). This method provides confidence that the selected subset of features will yield realistic results with less time and effort. FS-PPC retains classification performance and meaningful features while keeping the selected subset free of redundant features. The proposed methods have been applied in practice to the analysis of real clinical data, and their effectiveness has been assessed. The results show that the proposed methods are able to minimise the clinical data problems whilst, at the same time, maximising classification performance.
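
Symmetrical uncertainty and mutual information, the two measures named above, have standard definitions for discrete variables: MI(X; Y) = H(X) + H(Y) - H(X, Y) and SU(X, Y) = 2 MI(X; Y) / (H(X) + H(Y)). The sketch below computes both from empirical counts. It is not the FS-PPC method itself: the function names are hypothetical, and it assumes the inputs are already discrete, so a continuous principal component would first need to be binned.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Empirical Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """MI(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    joint = list(zip(x, y))
    return shannon_entropy(x) + shannon_entropy(y) - shannon_entropy(joint)


def symmetrical_uncertainty(x, y):
    """SU in [0, 1]: 0 for independent variables, 1 for fully dependent ones."""
    hx, hy = shannon_entropy(x), shannon_entropy(y)
    if hx + hy == 0:
        # Both variables are constant; define SU as 0 to avoid dividing by zero.
        return 0.0
    return 2 * mutual_information(x, y) / (hx + hy)


if __name__ == "__main__":
    feature = [0, 0, 1, 1]
    print(symmetrical_uncertainty(feature, [0, 0, 1, 1]))  # identical variables
    print(symmetrical_uncertainty(feature, [0, 1, 0, 1]))  # independent variables
```

Normalising by H(X) + H(Y) is what makes SU comparable across features with different numbers of distinct values, which raw mutual information is not.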

    Supplier evaluation and selection in fuzzy environments: a review of MADM approaches

    In past years, multi-attribute decision-making (MADM) approaches have been extensively applied by researchers to the supplier evaluation and selection problem. Many of these studies were performed in an uncertain environment described by fuzzy sets. This study provides a review of applications of MADM approaches for the evaluation and selection of suppliers in a fuzzy environment. To this aim, a total of 339 publications were examined, including papers in peer-reviewed journals and reputable conferences and also some book chapters, over the period 2001 to 2016. These publications were extracted from many online databases and classified into categories and subcategories according to the MADM approaches, and then they were analysed based on the frequency of approaches, number of citations, year of publication, country of origin and publishing journals. The results of this study show that the AHP and TOPSIS methods are the most popular approaches. Moreover, China and Taiwan are the top countries in terms of number of publications and number of citations, respectively. The top three journals with the highest number of publications were Expert Systems with Applications, International Journal of Production Research and The International Journal of Advanced Manufacturing Technology.

    A systematic literature review of data science, data analytics and machine learning applied to healthcare engineering systems

    The objective of this paper is to assess and synthesize the published literature related to the application of data analytics, big data, data mining, and machine learning to healthcare engineering systems. A systematic literature review (SLR) was conducted to obtain the most relevant papers related to the research study from three different platforms: EBSCOhost, ProQuest, and Scopus. The literature was assessed and synthesized through analyses of the publications, authors, and content. From the SLR, 576 publications were identified and analyzed. The research area seems to show the characteristics of a growing field, with new research areas evolving and applications being explored. In addition, the main authors and collaboration groups publishing in this research area were identified through a social network analysis. This could help new and current authors to identify researchers with common interests in the field. The use of the SLR methodology does not guarantee that all relevant publications related to the research are covered and analyzed. However, the authors’ previous knowledge and the nature of the publications were used to select different platforms. To the best of the authors’ knowledge, this paper represents the most comprehensive literature-based study on the fields of data analytics, big data, data mining, and machine learning applied to healthcare engineering systems.

    Full Issue
