15 research outputs found

    A new genetic algorithm for multi-label correlation-based feature selection.

    This paper proposes a new Genetic Algorithm for Multi-Label Correlation-Based Feature Selection (GA-ML-CFS). This GA performs a global search in the space of candidate feature subsets in order to select a high-quality feature subset for use by a multi-label classification algorithm; in this work, the Multi-Label k-NN algorithm. We compare the results of GA-ML-CFS with the results of the previously proposed Hill-Climbing for Multi-Label Correlation-Based Feature Selection (HC-ML-CFS), across 10 multi-label datasets.
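To make the idea of a genetic search over feature subsets concrete, here is a minimal sketch. It is not the GA-ML-CFS implementation: the fitness is the standard single-label CFS merit, and averaging the feature-label correlation over labels is an assumption of this sketch for the multi-label case. The names `cfs_merit`, `ga_select`, `LABEL_CORR`, and `FEAT_CORR`, the toy correlation values, and all GA settings (tournament selection, one-point crossover, bit-flip mutation, elitism) are hypothetical choices for illustration.

```python
import math
import random

# Toy inputs (invented): |correlation| of each of 4 features with 2 labels,
# and pairwise |correlation| between features.
LABEL_CORR = [[0.8, 0.7], [0.8, 0.7], [0.6, 0.7], [0.05, 0.0]]
FEAT_CORR = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.1, 0.1, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

def cfs_merit(subset, label_corr, feat_corr):
    """CFS merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    r_cf averages feature-label correlation over labels and features
    (the label averaging is this sketch's assumption); r_ff is the mean
    pairwise feature-feature correlation within the subset.
    """
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(sum(label_corr[f]) / len(label_corr[f]) for f in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(feat_corr[a][b] for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def ga_select(n_features, fitness, pop_size=20, generations=30, p_mut=0.1, seed=0):
    """Search feature subsets (encoded as boolean masks) with a simple GA."""
    rng = random.Random(seed)

    def score(mask):
        return fitness([i for i, on in enumerate(mask) if on])

    pop = [[rng.random() < 0.5 for _ in range(n_features)] for _ in range(pop_size)]
    best = max(pop, key=score)
    for _ in range(generations):
        new_pop = [best[:]]  # elitism: carry the best individual forward
        while len(new_pop) < pop_size:
            p1 = max(rng.sample(pop, 2), key=score)  # binary tournament
            p2 = max(rng.sample(pop, 2), key=score)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [(not g) if rng.random() < p_mut else g for g in child]
            new_pop.append(child)
        pop = new_pop
        best = max(pop, key=score)
    return [i for i, on in enumerate(best) if on]

selected = ga_select(4, lambda s: cfs_merit(s, LABEL_CORR, FEAT_CORR))
```

In the toy data, features 0 and 1 are both relevant but highly redundant, so the merit of `[0, 2]` exceeds that of `[0, 1]`; this is the trade-off between relevance and redundancy that a correlation-based criterion is meant to capture.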

    Microarray gene expression ranking with Z-score for Cancer Classification

    Over the past few decades there has been an explosion in the amount of genomic data available to biomedical engineers, driven by advances in biotechnology. For example, using microarrays it is possible to obtain a person's expression profile for more than 30,000 genes. One of the most important gene selection problems is gene ranking. Here we describe Z-score ranking for microarray gene expression selection. The technique chooses the genes, applies Z-score ranking, divides the genes into subsets with Successive Feature Selection, and finally applies LDA to produce the result. Z-score ranking gives accurate results with less effort. The Lymphoma and Leukemia datasets are used. The proposed technique shows good classification accuracy on all of the test datasets.
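As a rough illustration of the ranking step only, the sketch below scores each gene with a two-class z-like statistic and keeps the top-ranked genes. The abstract does not define the Z-score it uses, so the statistic here (difference of class means divided by its standard error) is an assumption, as are the names `z_score`, `rank_genes`, `EXPR`, and `LABELS` and the toy expression values. The Successive Feature Selection and LDA steps are not shown.

```python
import math
from statistics import mean, variance

def z_score(values_a, values_b):
    """Two-sample z-like statistic for one gene (assumed definition)."""
    se = math.sqrt(variance(values_a) / len(values_a)
                   + variance(values_b) / len(values_b))
    return (mean(values_a) - mean(values_b)) / se if se > 0 else 0.0

def rank_genes(expr, labels, top_k):
    """Rank genes by |z| between class 0 and class 1 and return the top_k names.

    expr maps a gene name to its expression values, aligned with labels.
    """
    scores = {}
    for gene, values in expr.items():
        a = [v for v, y in zip(values, labels) if y == 0]
        b = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(z_score(a, b))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy data (invented): three genes measured on six samples, two classes.
LABELS = [0, 0, 0, 1, 1, 1]
EXPR = {
    "g1": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],  # strongly separates the classes
    "g2": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # no difference in class means
    "g3": [0.5, 1.5, 1.0, 1.2, 2.0, 1.6],  # weak separation
}

top_genes = rank_genes(EXPR, LABELS, top_k=2)
```

A filter of this kind scores each gene independently, which keeps it cheap on tens of thousands of genes but ignores redundancy between genes.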

    Feature Extraction of Chest X-ray Images and Analysis using PCA and kPCA

    Tuberculosis (TB) is an infectious disease caused by mycobacteria and can be diagnosed from symptoms such as fever and cough. It can also be assessed from the patient's chest x-ray, as interpreted by an expert physician. The chest x-ray image contains many features that cannot be used directly by a computer system for analyzing the disease. These features must be understood and extracted so that they can be converted into a form suitable as input to a computer system for disease analysis. This paper presents feature extraction from chest x-ray images for use as input to any data mining algorithm for TB analysis. Texture- and shape-based features are extracted from the x-ray image using image processing concepts. The extracted features are analyzed using principal component analysis (PCA) and kernel principal component analysis (kPCA). Filter and wrapper feature selection methods using a linear regression model were applied with these techniques. The accuracy of PCA with the wrapper approach was 96.07%, compared with 62.50% for kPCA, so PCA performs better than kPCA.

    Comparing Prediction Accuracy for Machine Learning and Other Classical Approaches in Gene Expression Data

    Microarray-based gene expression profiling has emerged as an efficient technique for cancer classification, as well as for diagnosis, prognosis, and treatment purposes. The classification of different tumor types is of great significance in cancer diagnosis and drug innovation. Using a large number of genes to classify samples based on a small number of microarrays remains a difficult problem. Feature selection techniques can be used to extract the marker genes that influence classification accuracy by eliminating unwanted noisy and redundant genes. Quite a number of methods have been proposed in recent years with promising results, but many issues still need to be addressed and understood. Diagonal discriminant analysis, regularized discriminant analysis, support vector machines and k-nearest neighbor have been suggested as among the best methods for small sample size situations. In this paper, we compare the performance of different discrimination methods for the classification of tumors based on gene expression data. The methods are applied to datasets from four recently published cancer gene expression studies. The performance of each classification technique is evaluated for varying numbers of selected features in terms of misclassification rate using hold-out cross validation. Our study shows that KNN, RDA and SVM with a linear kernel have lower misclassification rates than the other algorithms. Keywords: microarray, gene expression, KNN, DLDA, RDA, SVM
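The evaluation measure mentioned above, misclassification rate on a held-out set, can be sketched as follows. This is only an illustration under stated assumptions: it uses a plain k-nearest-neighbor classifier with Euclidean distance, a fixed hold-out split, and invented two-dimensional toy data; the names `knn_predict`, `holdout_error`, `X`, `Y`, and `TEST_IDX` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, x, k=3):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def holdout_error(x, y, test_idx, k=3):
    """Misclassification rate on the held-out indices; the rest is training data."""
    held_out = set(test_idx)
    train_x = [x[i] for i in range(len(x)) if i not in held_out]
    train_y = [y[i] for i in range(len(y)) if i not in held_out]
    wrong = sum(knn_predict(train_x, train_y, x[i], k) != y[i] for i in test_idx)
    return wrong / len(test_idx)

# Toy data (invented): two well-separated classes plus one point labeled "B"
# that sits inside the "A" cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1),
     (5, 5), (5, 6), (6, 5), (6, 6),
     (0.5, 0.5)]
Y = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]
TEST_IDX = [3, 7, 8]

error_rate = holdout_error(X, Y, TEST_IDX, k=3)
```

Here two of the three held-out samples are classified correctly and the point inside the wrong cluster is not, so the hold-out misclassification rate is one third.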

    Comparing Prediction Accuracy for Supervised Techniques in Gene Expression Data

    Classification is one of the most important tasks in applications such as text categorization, tone recognition, image classification, microarray gene expression analysis, protein structure prediction, and data classification. Microarray-based gene expression profiling has emerged as an efficient technique for cancer classification, as well as for diagnosis, prognosis, and treatment purposes. The classification of different tumor types is of great significance in cancer diagnosis and drug innovation. One challenging area in the study of gene expression data is the classification of different types of tumors into the correct classes. Diagonal discriminant analysis, regularized discriminant analysis, support vector machines and k-nearest neighbor have been suggested as among the best methods for small sample size situations. The methods are applied to datasets from four recently published cancer gene expression studies. The four publicly available microarray datasets are Leukemia, Lymphoma, SRBCT and Prostate. The performance of each classification technique is evaluated by the percentage of misclassification under hold-out cross validation.

    Effect of Feature Selection on Gene Expression Datasets Classification Accuracy

    Feature selection attracts researchers who work in machine learning and data mining. It consists of selecting the variables that have the greatest impact on the classification of a dataset and discarding the rest. This dimensionality reduction allows classifiers to be faster and more accurate. This paper examines the effect of feature selection on the accuracy of classifiers widely used in the literature. These classifiers are compared on three real datasets that are pre-processed with feature selection methods. An improvement of more than 9% in classification accuracy is observed, and k-means appears to be the classifier most sensitive to feature selection.

    EGFAFS: A Novel Feature Selection Algorithm Based on Explosion Gravitation Field Algorithm

    Feature selection (FS) is a vital step in data mining and machine learning, especially for analyzing data in a high-dimensional feature space. Gene expression data usually consist of a few samples characterized by a high-dimensional feature space. As a result, they are not suitable for processing by simple methods, such as filter-based methods. In this study, we propose a novel feature selection algorithm based on the Explosion Gravitation Field Algorithm, called EGFAFS. To reduce the feature space to an acceptable number of dimensions, we constructed a recommended feature pool using a series of Random Forests based on the Gini index. Furthermore, by paying more attention to the features in the recommended feature pool, we can find the best subset more efficiently. To verify the performance of EGFAFS for FS, we tested it on eight gene expression datasets against four heuristic-based FS methods (GA, PSO, SA, and DE) and four other FS methods (Boruta, HSICLasso, DNN-FS, and EGSG). The results show that, in terms of the evaluation metrics, EGFAFS performs better on gene expression data than the other eight FS algorithms. The genes selected by EGFAFS play an essential role in the differential co-expression network, and some biological functions further demonstrate the success of EGFAFS in solving FS problems on gene expression data.

    Multivariate approaches to variable selection for classifying and predicting sample properties

    Variable selection is an important step in data analysis, since it identifies the most informative subsets of variables for building accurate classification and prediction models. In addition, variable selection eases the interpretation and analysis of the resulting models, potentially reducing the computational time needed to build them and the cost and time of obtaining samples. In this context, this thesis presents novel variable selection approaches for classifying and predicting the properties of samples of various products. These approaches are presented in three papers in this thesis, with the aim of improving the accuracy of classification and prediction models in different areas. In the first paper, variable importance indices are integrated with hierarchical classification schemes to categorize sparkling wine samples by country of origin. In the second paper, to select the most informative variables for prediction via PLS, a variable importance index based on the Lambert-Beer law is combined with an iterative forward selection process. Finally, the third paper uses clustering of spectral variables and a variable importance index to select the variables that produce more consistent prediction models. In all papers of this thesis, the proposed methods obtained better results than other traditional methods from the literature aimed at identifying the most informative variables.

    Cell cycle and aging, morphogenesis, and response to stimuli genes are individualized biomarkers of glioblastoma progression and survival

    Background: Glioblastoma is a complex multifactorial disorder that has swift and devastating consequences. Few genes have been consistently identified as prognostic biomarkers of glioblastoma survival. The goal of this study was to identify general and clinical-dependent biomarker genes and biological processes of three complementary events: lifetime, overall and progression-free glioblastoma survival. Methods: A novel analytical strategy was developed to identify general associations between the biomarkers and glioblastoma, and associations that depend on cohort groups, such as race, gender, and therapy. Gene network inference, cross-validation and functional analyses further supported the identified biomarkers. Results: A total of 61, 47 and 60 gene expression profiles were significantly associated with lifetime, overall, and progression-free survival, respectively. The vast majority of these genes have been previously reported to be associated with glioblastoma (35, 24, and 35 genes, respectively) or with other cancers (10, 19, and 15 genes, respectively) and the rest (16, 4, and 10 genes, respectively) are novel associations. Pik3r1, E2f3, Akr1c3, Csf1, Jag2, Plcg1, Rpl37a, Sod2, Topors, Hras, Mdm2, Camk2g, Fstl1, Il13ra1, Mtap and Tp53 were associated with multiple survival events. Most genes (from 90 to 96%) were associated with survival in a general or cohort-independent manner, and thus the same trend is observed across all clinical levels studied. The most extreme associations between profiles and survival were observed for Syne1, Pdcd4, Ighg1, Tgfa, Pla2g7, and Paics. Several genes were found to have a cohort-dependent association with survival, and these associations are the basis for individualized prognostic and gene-based therapies. C2, Egfr, Prkcb, Igf2bp3, and Gdf10 had gender-dependent associations; Sox10, Rps20, Rab31, and Vav3 had race-dependent associations; Chi3l1, Prkcb, Polr2d, and Apool had therapy-dependent associations. Biological processes associated with glioblastoma survival included morphogenesis, cell cycle, aging, response to stimuli, and programmed cell death. Conclusions: Known biomarkers of glioblastoma survival were confirmed, and new general and clinical-dependent gene profiles were uncovered. The comparison of biomarkers across glioblastoma phases and functional analyses offered insights into the role of genes. These findings support the development of more accurate and personalized prognostic tools and gene-based therapies that improve the survival and quality of life of individuals afflicted by glioblastoma multiforme.

    Gene selection for cancer classification with the help of bees
