A new unsupervised feature selection method for text clustering based on genetic algorithms

Authors: Saraee MH; Shamsinejadbabki P
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/06/2012

Nowadays a vast amount of textual information is collected and stored in various databases around the world, including the Internet as the largest database of all. This rapid growth of published text means that even the most avid reader cannot hope to keep up with all the reading in a field, and consequently nuggets of insight or new knowledge are at risk of languishing undiscovered in the literature. Text mining offers a solution to this problem by replacing or supplementing the human reader with automatic systems undeterred by the text explosion. It involves analyzing a large collection of documents to discover previously unknown information. Text clustering is one of the most important areas in text mining; it includes text preprocessing, dimension reduction by selecting some terms (features), and finally clustering using the selected terms. Feature selection appears to be the most important step in the process. Conventional unsupervised feature selection methods define a measure of the discriminating power of terms in order to select suitable terms from the corpus. However, up to now the evaluation of terms in groups has not been investigated in reported works. In this paper a new and robust unsupervised feature selection approach is proposed that evaluates terms in groups. In addition, a new Modified Term Variance measure is proposed for evaluating groups of terms. Furthermore, a genetic-algorithm-based method is designed and implemented for finding the most valuable groups of terms according to the new measure. These terms are then used to generate the final feature vector for the clustering process. To evaluate and justify our approach, the proposed method and a conventional term variance method are implemented and tested on the Reuters-21578 corpus collection. For a more accurate comparison, the methods have been tested on three corpora; for each corpus the clustering task has been run ten times and the results averaged. The comparison results are very promising and show that our method produces better average accuracy and F1-measure than the conventional term variance method.

CiteSeerX

University of Salford Institutional Repository
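
The conventional term variance baseline mentioned in the abstract above can be illustrated with a short sketch. It assumes one common definition of term variance, the sum of squared deviations of a term's frequency from its mean over all documents. The function names and the toy count matrix are hypothetical, and the paper's Modified Term Variance and genetic search are not reproduced because the abstract does not specify them.

```python
def term_variance(doc_term_counts):
    """Score each term by the sum of squared deviations of its frequency
    from its mean frequency across documents (assumed definition)."""
    n_docs = len(doc_term_counts)
    n_terms = len(doc_term_counts[0])
    scores = []
    for t in range(n_terms):
        column = [doc[t] for doc in doc_term_counts]
        mean = sum(column) / n_docs
        scores.append(sum((f - mean) ** 2 for f in column))
    return scores


def select_top_terms(doc_term_counts, k):
    """Return the indices of the k highest-variance terms, in index order."""
    scores = term_variance(doc_term_counts)
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])


# Toy corpus: 3 documents (rows) by 3 terms (columns); values are term counts.
counts = [
    [2, 0, 1],
    [0, 0, 1],
    [4, 1, 1],
]
scores = term_variance(counts)
selected = select_top_terms(counts, 2)
```

In this toy matrix the third term occurs equally often in every document, so its variance is zero and it is the one dropped when two terms are kept.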

A niching memetic algorithm for simultaneous clustering and feature selection

Authors: Fairhurst M; Liu X; Sheng W
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/07/2008

Clustering is inherently a difficult task, and is made even more difficult when the selection of relevant features is also an issue. In this paper we propose an approach for simultaneous clustering and feature selection using a niching memetic algorithm. Our approach (which we call NMA_CFS) makes feature selection an integral part of the global clustering search procedure and attempts to overcome the problem of identifying less promising locally optimal solutions in both clustering and feature selection, without making any a priori assumption about the number of clusters. Within the NMA_CFS procedure, a variable composite representation is devised to encode both feature selection and cluster centers with different numbers of clusters. Further, local search operations are introduced to refine the feature selection and cluster centers encoded in the chromosomes. Finally, a niching method is integrated to preserve population diversity and prevent premature convergence. In an experimental evaluation we demonstrate the effectiveness of the proposed approach and compare it with other related approaches, using both synthetic and real data.

Brunel University Research Archive

The Novel Approach of Adaptive Twin Probability for Genetic Algorithm

Authors: Khedkar Anagha P.; Subbaraman Shaila
Publication date: 01/01/2013

The performance of a genetic algorithm (GA) is measured and analyzed in terms of its performance parameters against variations in its genetic operators and associated parameters. Over the last four decades, a large number of researchers have worked on the performance of GAs and its enhancement. This earlier work on analyzing GA performance reinforces the need to further investigate the exploration and exploitation characteristics and to observe their impact on the behavior and overall performance of the GA. This paper introduces a novel approach of adaptive twin probability, associated with the advanced twin operator, that enhances the performance of the GA. The design of the advanced twin operator is extrapolated from twin offspring birth due to single ovulation in natural genetic systems, as mentioned in earlier works. The twin probability of this operator is varied adaptively based on the fitness of the best individual, thereby relieving the GA user from statically defining its value. The approach is tested on standard benchmark optimization test functions. The experimental results show increased accuracy in terms of the best individual and reduced convergence time.

Comment: 7 pages, International Journal of Advanced Studies in Computer Science and Engineering (IJASCSE), Volume 2, Special Issue 2, 201

arXiv.org e-Print Archive

CiteSeerX

A hybrid algorithm for k-medoid clustering of large data sets

Authors: Liu X; Sheng W
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1

In this paper, we propose a novel local search heuristic and then hybridize it with a genetic algorithm for k-medoid clustering of large data sets, which is an NP-hard optimization problem. The local search heuristic selects k medoids from the data set and tries to efficiently minimize the total dissimilarity within each cluster. To deal with local optimality, the local search heuristic is hybridized with a genetic algorithm, yielding the Hybrid K-medoid Algorithm (HKA). Our experiments show that, compared with previous genetic-algorithm-based k-medoid clustering approaches (GCA and RAR_wGA), HKA can provide better clustering solutions and do so more efficiently. The experiments use two gene expression data sets, which may involve large noise components.

CiteSeerX

Crossref

Brunel University Research Archive
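
The k-medoid objective described in the abstract above, the total dissimilarity of every point to its nearest medoid, can be sketched as follows. The local search shown is a generic swap-based improvement loop in the style of PAM, used here only as a stand-in. It is not the heuristic proposed in the paper, which the abstract does not detail, and all names and the one-dimensional data are hypothetical.

```python
def total_dissimilarity(points, medoids, dist):
    """Sum, over all points, of the distance to the nearest medoid.
    `medoids` holds indices into `points`."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def swap_local_search(points, medoids, dist):
    """Generic swap-based local search (an assumed stand-in): replace one
    medoid with a non-medoid point whenever that lowers the total cost."""
    medoids = list(medoids)
    best = total_dissimilarity(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(len(medoids)):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_dissimilarity(points, trial, dist)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids, best


# Two well-separated groups on a line; start from a poor choice of medoids.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
start = [0, 1]
start_cost = total_dissimilarity(points, start, lambda a, b: abs(a - b))
medoids, cost = swap_local_search(points, start, lambda a, b: abs(a - b))
```

A loop like this stops at the first configuration that no single swap improves, which may be only locally optimal. That limitation is the motivation the abstract gives for hybridizing local search with a genetic algorithm.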

High-speed detection of emergent market clustering via an unsupervised parallel genetic algorithm

Authors: Gebbie Tim; Hendricks Dieter; Wilcox Diane
Publication venue: 'Academy of Science of South Africa'
Publication date: 02/08/2015

We implement a master-slave parallel genetic algorithm (PGA) with a bespoke log-likelihood fitness function to identify emergent clusters within price evolutions. We use graphics processing units (GPUs) to implement a PGA and visualise the results using disjoint minimal spanning trees (MSTs). We demonstrate that our GPU PGA, implemented on a commercially available general purpose GPU, is able to recover stock clusters in sub-second speed, based on a subset of stocks in the South African market. This represents a pragmatic choice for low-cost, scalable parallel computing and is significantly faster than a prototype serial implementation in an optimised C-based fourth-generation programming language, although the results are not directly comparable due to compiler differences. Combined with fast online intraday correlation matrix estimation from high frequency data for cluster identification, the proposed implementation offers cost-effective, near-real-time risk assessment for financial practitioners.

Comment: 10 pages, 5 figures, 4 tables, More thorough discussion of implementatio

arXiv.org e-Print Archive

Crossref

Academy of Science of South Africa (ASSAf): Open Journal Systems

Directory of Open Access Journals

Genetic algorithms

Authors: Bayer Steven E.; Wang Lui

Genetic algorithms are mathematical, highly parallel, adaptive search procedures (i.e., problem-solving methods) based loosely on the processes of natural genetics and Darwinian survival of the fittest. Basic genetic algorithm concepts and applications are introduced, and results are presented from a project to develop a software tool that will enable the widespread use of genetic algorithm technology.

NASA Technical Reports Server
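
The basic concepts named in the abstract above (a population of candidate solutions, fitness-based selection, crossover, and mutation) can be illustrated with a minimal sketch. It assumes a toy "count the ones" fitness function, tournament selection, one-point crossover, and single-individual elitism. All names and parameter values are hypothetical, and the sketch is unrelated to the software tool the report describes.

```python
import random


def one_max(bits):
    """Toy fitness: the number of 1 bits in the chromosome."""
    return sum(bits)


def tournament(pop, rng):
    """Pick two distinct individuals at random and return the fitter one."""
    a, b = rng.sample(pop, 2)
    return a if one_max(a) >= one_max(b) else b


def run_ga(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    initial_best = max(one_max(ind) for ind in pop)
    for _ in range(generations):
        # Elitism: carry the current best individual over unchanged.
        children = [max(pop, key=one_max)[:]]
        while len(children) < pop_size:
            p1, p2 = tournament(pop, rng), tournament(pop, rng)
            cut = rng.randint(1, n_bits - 1)  # one-point crossover position
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation applied independently to each gene.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
    return initial_best, max(pop, key=one_max)


initial_best, best = run_ga()
```

Because the best individual is copied into each new generation, the best fitness in the population can never decrease from one generation to the next.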

Frequency planning for clustered jointly processed cellular multiple access channel

Authors: Hoshyar Reza; Imran Muhammad A.; Majid Muhammad I.
Publication venue: 'Institution of Engineering and Technology (IET)'
Publication date: 05/11/2013

Owing to limited resources, it is hard to guarantee minimum service levels to all users in conventional cellular systems. Although global cooperation of access points (APs) is considered promising, a practical means of enhancing the efficiency of cellular systems is to consider distributed or clustered jointly processed APs. The authors present a novel 'quality of service (QoS) balancing scheme' to maximise sum rate as well as achieve cell-based fairness for the clustered jointly processed cellular multiple access channel (referred to as CC-CMAC). A closed-form cell-level QoS balancing function is derived. Maximisation of this function is proved to be an NP-hard problem. Hence, using power-frequency granularity, a modified genetic algorithm (GA) is proposed. For inter-site distance (ISD) < 500 m, results show that with no fairness considered, the upper bound of the capacity region is achievable. Applying hard fairness restraints on users transmitting in a moderately dense AP system, a 20% reduction in sum rate contribution increases fairness by up to 10%. The flexible QoS can be applied on a GA-based centralised dynamic frequency planner architecture.

Enlighten

Surrey Research Insight