748 research outputs found
A hybrid LDA and genetic algorithm for gene selection and classification of microarray data
In supervised classification of microarray data, gene selection aims at identifying a (small) subset of informative genes from the initial data in order to obtain high predictive accuracy. This paper introduces a new embedded approach to this difficult task where a genetic algorithm (GA) is combined with Fisher's linear discriminant analysis (LDA). This LDA-based GA algorithm has the major characteristic that the GA uses not only an LDA classifier in its fitness function, but also LDA's discriminant coefficients in its dedicated crossover and mutation operators. Computational experiments on seven public datasets show that under an unbiased experimental protocol, the proposed algorithm is able to reach high prediction accuracies with a small number of selected genes.
Effect of Feature Selection on Gene Expression Datasets Classification Accuracy
Feature selection attracts researchers who work in machine learning and data mining. It consists of selecting the variables that have the greatest impact on the classification of the dataset and discarding the rest. This dimensionality reduction allows classifiers to be faster and more accurate. This paper examines the effect of feature selection on the accuracy of classifiers widely used in the literature. These classifiers are compared on three real datasets that are pre-processed with feature selection methods. An improvement of more than 9% in classification accuracy is observed, and k-means appears to be the classifier most sensitive to feature selection.
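The idea of keeping the most influential variables and discarding the rest can be illustrated with a simple filter. The sketch below ranks features by a Fisher score (between-class spread divided by within-class spread) and keeps the top k. The abstract does not say which selection methods the paper uses, so the scoring rule, the function names `fisher_scores` and `select_top_k`, and the toy data are all assumptions made for illustration.

```python
from statistics import mean, pvariance


def fisher_scores(X, y):
    """Score each feature by between-class spread over within-class spread."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall = mean(column)
        between = 0.0
        within = 0.0
        for c in classes:
            values = [row[j] for row, label in zip(X, y) if label == c]
            between += len(values) * (mean(values) - overall) ** 2
            within += len(values) * pvariance(values)
        if within > 0:
            scores.append(between / within)
        else:
            # No within-class variation: perfectly separating or constant.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def select_top_k(X, y, k):
    """Return the indices of the k best-scoring features and the reduced data."""
    scores = fisher_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])
    reduced = [[row[j] for j in keep] for row in X]
    return keep, reduced


# Hypothetical toy data: only the first feature separates the two classes.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 3.0, 0.1],
    [0.9, 4.0, 0.3],
    [3.0, 4.5, 0.2],
    [3.1, 3.5, 0.1],
    [2.9, 4.0, 0.3],
]
y = [0, 0, 0, 1, 1, 1]
keep, reduced = select_top_k(X, y, k=1)
```

On this toy data the second and third features have identical class means, so only the first feature receives a positive score and survives the filter. A classifier trained on `reduced` then sees one column instead of three.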
Identification of an Efficient Gene Expression Panel for Glioblastoma Classification.
We present here a novel genetic algorithm-based random forest (GARF) modeling technique that enables a reduction in the complexity of large gene disease signatures to highly accurate, greatly simplified gene panels. When applied to 803 glioblastoma multiforme samples, this method allowed the 840-gene Verhaak et al. gene panel (the standard in the field) to be reduced to a 48-gene classifier, while retaining 90.91% classification accuracy, and outperforming the best available alternative methods. Additionally, using this approach we produced a 32-gene panel which allows for better consistency between RNA-seq and microarray-based classifications, improving cross-platform classification retention from 69.67% to 86.07%. A webpage producing these classifications is available at http://simplegbm.semel.ucla.edu
Inverse Projection Representation and Category Contribution Rate for Robust Tumor Recognition
Sparse representation based classification (SRC) methods have achieved
remarkable results. SRC, however, still suffers from requiring enough training
samples, insufficient use of test samples and instability of representation. In
this paper, a stable inverse projection representation based classification
(IPRC) is presented to tackle these problems by effectively using test samples.
An IPR is firstly proposed and its feasibility and stability are analyzed. A
classification criterion named category contribution rate is constructed to
match the IPR and complete classification. Moreover, a statistical measure is
introduced to quantify the stability of representation-based classification
methods. Based on the IPRC technique, a robust tumor recognition framework is
presented by interpreting microarray gene expression data, where a two-stage
hybrid gene selection method is introduced to select informative genes.
Finally, the functional analysis of candidate's pathogenicity-related genes is
given. Extensive experiments on six public tumor microarray gene expression
datasets demonstrate the proposed technique is competitive with
state-of-the-art methods. Comment: 14 pages, 19 figures, 10 tables
Effective Prostate Cancer Detection using Enhanced Particle Swarm Optimization Algorithm with Random Forest on the Microarray Data
Prostate Cancer (PC) is the leading cause of mortality among males; an effective system is therefore required for identifying sensitive biomarkers for early recognition. The objective of this research is to find potential biomarkers for characterizing the different types of PC. In this article, the PC-related genes are acquired from the Gene Expression Omnibus (GEO) database. Gene selection is then performed using an enhanced Particle Swarm Optimization (PSO) algorithm to select the active genes that are related to PC. In the enhanced PSO algorithm, the interval-Newton approach is included to keep the search space adaptive by varying the swarm diversity, which helps the local search. The selected active genes are fed to a random forest classifier for the classification of PC (high and low risk). In the experimental investigation, the proposed model achieved an overall classification accuracy of 96.71%, which is better than traditional models such as naïve Bayes, support vector machine and neural network.
Molecular Signature as Optima of Multi-Objective Function with Applications to Prediction in Oncogenomics
This work consists of a theoretical introduction followed by a practical treatment of the topic Molecular Signature as Optima of Multi-Objective Function with Applications to Prediction in Oncogenomics. The opening chapters focus on cancer, in particular breast cancer and its subtype, triple-negative breast cancer. A literature review of optimization methods follows, concentrating on metaheuristic methods for multi-objective optimization and on machine learning. One part covers oncogenomics and the principles of microarrays, as well as statistical methods, with emphasis on the calculation of the p-value and the bimodality index. The practical part describes the course of the research and its findings, which lead to the next steps of the research. The selected methods were implemented in Matlab and R, with additional use of the programming languages Java and Python.
Evolutionary computation-based feature selection for finding a stable set of features in high-dimensional data
Evolutionary Computation (EC) algorithms have proved to work well for feature selection because they are powerful search techniques and can produce multiple good solutions. However, they suffer from some limitations in real-world applications. Firstly, ECs require high computation time as they evaluate many solutions at each iteration. Secondly, a classifier is usually used as their fitness function, which causes the selected subset to perform well only on the utilised classifier (i.e. classifier bias). Lastly, ECs, as stochastic search methods, return a different final subset in different runs, which poses a problem for finding a stable set of features (i.e. the stability issue). To address the computation time and classifier-bias limitations, this thesis proposes a new two-stage selection approach called filter/filter, in which two filter feature selection algorithms are combined. In the first stage, a ranking algorithm forms a reduced dataset by selecting the most informative features from the original dataset. In the second stage, the reduced dataset is fed to a novel EC algorithm to select the final feature subset. This new EC algorithm is a Tabu search hybridised with an Asexual Genetic Algorithm, called TAGA. TAGA benefits from new search components and a solution representation which can effectively reduce computation time. To select a classifier-unbiased final subset, a statistical criterion is used as the fitness function, which evaluates the subset independently of any classifier. Experiments show that the proposed filter/filter approach requires an acceptable computation time and selects more classifier-unbiased features than state-of-the-art methods. To find a stable set of features, a novel Generalisation Power Index (GPI) is proposed to analyse the generalisation power of the final subsets of an EC over several runs. Generalisation power refers to the performance capability of a subset over a wide range of classifiers.
Computational results confirm that GPI is able to find a stable set of features which achieves near-optimal accuracy when used to train various classifiers. To examine the suitability of the proposed methods for real-world applications, the filter/filter approach and GPI are integrated to select a stable set of features for the METABRIC breast cancer subtype classification problem. Experimental results show that this integration not only addresses the limitations of ECs for a real-world biomedical feature selection problem but also performs better than alternative methods.
PMP-SVM: A Hybrid Approach for effective Cancer Diagnosis using Feature Selection and Optimization
Cancer is becoming a prominent factor in the increasing death rate worldwide because of late diagnosis. Machine Learning (ML) plays a vital role in providing computer-aided diagnosis models for the early diagnosis of cancer, and microarray data has its own place in the diagnosis process. Microarray data contain the genetic information of patients, with a large number of dimensions (genes) and a small number of samples (patients). If the microarray data are fed directly to an ML model for classification without reducing the dimension, the small-sample-size problem results. The size of the microarray data therefore needs to be reduced, using either a dimensionality reduction technique or a feature selection technique, to increase the model's performance. This work proposes a hybrid model using Principal Component Analysis (PCA), Maximum Relevance Minimum Redundancy (MRMR), Particle Swarm Optimization (PSO) and Support Vector Machine (SVM) for cancer diagnosis. PCA and MRMR are used for feature selection, and PSO is applied to obtain the optimized feature set. Finally, SVM is applied as the classification model. The proposed model is evaluated on multiple cancer microarray datasets to measure its performance in terms of accuracy, precision, recall, and F1 score. The results show that the proposed model performs better than existing state-of-the-art models.
Protein fold recognition using genetic algorithm optimized voting scheme and profile bigram
In biology, identifying the tertiary structure of a protein helps determine its functions. A step towards tertiary structure identification is predicting a protein’s fold. Computational methods have been applied to determine a protein’s fold by assembling information from its structural, physicochemical and/or evolutionary properties. It has been shown that evolutionary information helps improve prediction accuracy. In this study, a scheme is proposed that uses the genetic algorithm (GA) to optimize a weighted voting scheme to improve protein fold recognition. This scheme incorporates k-separated bigram transition probabilities for feature extraction, which are based on the Position Specific Scoring Matrix (PSSM). A set of SVM classifiers is used for initial classification, whereupon their predictions are consolidated using the optimized weighted voting scheme. This scheme has been demonstrated on the Ding and Dubchak (DD), Extended Ding and Dubchak (EDD) and Taguchi and Gromiha (TG) benchmark datasets.
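The combination of weighted voting with a GA described above can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: it assumes the base classifiers have already produced label predictions, uses validation accuracy as the fitness, and evolves one weight per classifier with one-point crossover and Gaussian mutation. The function names, the GA settings, and the toy votes are all hypothetical, and PSSM bigram features and SVM training are left out.

```python
import random


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight among classifier votes."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    # Ties go to the smallest label so the result is deterministic.
    return max(sorted(totals), key=lambda label: totals[label])


def accuracy(weights, votes, truth):
    """Fraction of samples whose weighted vote matches the true label."""
    hits = sum(weighted_vote(v, weights) == t for v, t in zip(votes, truth))
    return hits / len(truth)


def optimise_weights(votes, truth, n_classifiers, generations=30,
                     pop_size=20, seed=0):
    """Evolve one weight per classifier (assumes n_classifiers >= 2)."""
    rng = random.Random(seed)
    # Seed the population with equal weights so the GA never does worse.
    population = [[1.0] * n_classifiers]
    population += [[rng.random() for _ in range(n_classifiers)]
                   for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=lambda w: accuracy(w, votes, truth), reverse=True)
        elite = population[: pop_size // 2]  # elitism: the best half survives
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_classifiers)  # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n_classifiers)  # mutate one weight
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0.0, 0.2)))
            children.append(child)
        population = elite + children
    return max(population, key=lambda w: accuracy(w, votes, truth))


# Hypothetical votes from three classifiers on five validation samples.
votes = [("a", "b", "b"), ("a", "a", "b"), ("b", "a", "a"),
         ("b", "b", "a"), ("a", "b", "a")]
truth = ["a", "a", "b", "b", "a"]
best = optimise_weights(votes, truth, n_classifiers=3)
```

In this toy data the first classifier is always correct while the other two often agree on a wrong label, so equal weights classify only three of the five samples correctly. Because the best half of each generation is carried over unchanged, the evolved weights are never less accurate on the validation set than the equal-weight baseline.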
- …