Search CORE

11,398 research outputs found

Mapping microarray gene expression data into dissimilarity spaces for tumor classification

Author: García Vicente
Sánchez Garreta Josep Salvador
Publication venue: 'Elsevier BV'
Publication date: 01/01/2015
Field of study

Microarray gene expression data sets usually contain a large number of genes, but a small number of samples. In this article, we present a two-stage classification model by combining feature selection with the dissimilarity-based representation paradigm. In the preprocessing stage, the ReliefF algorithm is used to generate a subset with a number of topranked genes; in the learning/classification stage, the samples represented by the previously selected genes are mapped into a dissimilarity space, which is then used to construct a classifier capable of separating the classes more easily than a feature-based model. The ultimate aim of this paper is not to find the best subset of genes, but to analyze the performance of the dissimilarity-based models by means of a comprehensive collection of experiments for the classification of microarray gene expression data. To this end, we compare the classification results of an artificial neural network, a support vector machine and the Fisher’s linear discriminant classifier built on the feature (gene) space with those on the dissimilarity space when varying the number of genes selected by ReliefF, using eight different microarray databases. The results show that the dissimilarity-based classifiers systematically outperform the feature-based models. In addition, classification through the proposed representation appears to be more robust (i.e. less sensitive to the number of genes) than that with the conventional feature-based representation

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Repositori Institucional de la Universitat Jaume I

Gene encoder: a feature selection technique through unsupervised deep learning-based clustering for large gene expression data

Author: A Ben-Dor
A Brazma
A Shah
AA Alizadeh
C Fraley
C Song
D Jiang
D Sathiaraj
EP Xing
G Gan
H Chen
H Salem
HM Alshamlan
J Oyelade
L Zhu
M Ghosh
MB Eisen
MH Law
MJ Rani
MK Pakhira
P Bihani
P Langley
P Pudil
PA Estévez
PY Chen
R Kohavi
S Mahajan
S Tiwari
SA Armstrong
SM Ayyad
T Kohonen
T Muhammad
TR Golub
W Mao
X Zhu
Z Halim
Z Halim
Z Halim
Z Halim
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020
Field of study

© 2020, Springer-Verlag London Ltd., part of Springer Nature. Cancer is a severe condition of uncontrolled cell division that results in a tumor formation that spreads to other tissues of the body. Therefore, the development of new medication and treatment methods for this is in demand. Classification of microarray data plays a vital role in handling such situations. The relevant gene selection is an important step for the classification of microarray data. This work presents gene encoder, an unsupervised two-stage feature selection technique for the cancer samples’ classification. The first stage aggregates three filter methods, namely principal component analysis, correlation, and spectral-based feature selection techniques. Next, the genetic algorithm is used, which evaluates the chromosome utilizing the autoencoder-based clustering. The resultant feature subset is used for the classification task. Three classifiers, namely support vector machine, k-nearest neighbors, and random forest, are used in this work to avoid the dependency on any one classifier. Six benchmark gene expression datasets are used for the performance evaluation, and a comparison is made with four state-of-the-art related algorithms. Three sets of experiments are carried out to evaluate the proposed method. These experiments are for the evaluation of the selected features based on sample-based clustering, adjusting optimal parameters, and for selecting better performing classifier. The comparison is based on accuracy, recall, false positive rate, precision, F-measure, and entropy. The obtained results suggest better performance of the current proposal

ZU Scholars (Zayed University)

Crossref

Identification of disease-causing genes using microarray data mining and gene ontology

Author: A Mohammadi
A Zhang
AA Alizadeh
Azadeh Mohammadi
B Duval
BF Souza
C Ambroise
C Ding
C Tago
D Lin
D Singh
E Martinez
FM Couto
I Guyon
I Inza
J Jaeger
JJ Jiang
L Li
L Yu
L Ziaei
Mansoor Salehi
Mohammad H Saraee
N Cristianini
P Pavlidis
P Resnik
PA Mundra
PA Mundra
PJ Park
R Genuer
RF Weaver
S Li
S Li
TM Huang
TR Golub
TS Furey
U Alon
W Xu
Y Ding
Y Saeys
Y Wang
YL Chin
Z Xie
Z Zhang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011
Field of study

Background: One of the best and most accurate methods for identifying disease-causing genes is monitoring gene expression values in different samples using microarray technology. One of the shortcomings of microarray data is that they provide a small quantity of samples with respect to the number of genes. This problem reduces the classification accuracy of the methods, so gene selection is essential to improve the predictive accuracy and to identify potential marker genes for a disease. Among numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVMRFE) has become one of the leading methods, but its performance can be reduced because of the small sample size, noisy data and the fact that the method does not remove redundant genes. Methods: We propose a novel framework for gene selection which uses the advantageous features of conventional methods and addresses their weaknesses. In fact, we have combined the Fisher method and SVMRFE to utilize the advantages of a filtering method as well as an embedded method. Furthermore, we have added a redundancy reduction stage to address the weakness of the Fisher method and SVMRFE. In addition to gene expression values, the proposed method uses Gene Ontology which is a reliable source of information on genes. The use of Gene Ontology can compensate, in part, for the limitations of microarrays, such as having a small number of samples and erroneous measurement results. Results: The proposed method has been applied to colon, Diffuse Large B-Cell Lymphoma (DLBCL) and prostate cancer datasets. The empirical results show that our method has improved classification performance in terms of accuracy, sensitivity and specificity. In addition, the study of the molecular function of selected genes strengthened the hypothesis that these genes are involved in the process of cancer growth. Conclusions: The proposed method addresses the weakness of conventional methods by adding a redundancy reduction stage and utilizing Gene Ontology information. It predicts marker genes for colon, DLBCL and prostate cancer with a high accuracy. The predictions made in this study can serve as a list of candidates for subsequent wet-lab verification and might help in the search for a cure for cancers

University of Salford Institutional Repository

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Correcting for selection bias via cross-validation in the classification of microarray data

Author: Chevelu J.
McLachlan G. J.
Zhu J.
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2008
Field of study

There is increasing interest in the use of diagnostic rules based on microarray data. These rules are formed by considering the expression levels of thousands of genes in tissue samples taken on patients of known classification with respect to a number of classes, representing, say, disease status or treatment strategy. As the final versions of these rules are usually based on a small subset of the available genes, there is a selection bias that has to be corrected for in the estimation of the associated error rates. We consider the problem using cross-validation. In particular, we present explicit formulae that are useful in explaining the layers of validation that have to be performed in order to avoid improperly cross-validated estimates.Comment: Published in at http://dx.doi.org/10.1214/193940307000000284 the IMS Collections (http://www.imstat.org/publications/imscollections.htm) by the Institute of Mathematical Statistics (http://www.imstat.org

arXiv.org e-Print Archive

Crossref

University of Queensland eSpace

Inverse Projection Representation and Category Contribution Rate for Robust Tumor Recognition

Author: Chen Yun-Mei
Tian Li
Wu Wen-Ming
Xu Shuang
Yang Li-Jun
Yang Xiao-Hui
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2018
Field of study

Sparse representation based classification (SRC) methods have achieved remarkable results. SRC, however, still suffer from requiring enough training samples, insufficient use of test samples and instability of representation. In this paper, a stable inverse projection representation based classification (IPRC) is presented to tackle these problems by effectively using test samples. An IPR is firstly proposed and its feasibility and stability are analyzed. A classification criterion named category contribution rate is constructed to match the IPR and complete classification. Moreover, a statistical measure is introduced to quantify the stability of representation-based classification methods. Based on the IPRC technique, a robust tumor recognition framework is presented by interpreting microarray gene expression data, where a two-stage hybrid gene selection method is introduced to select informative genes. Finally, the functional analysis of candidate's pathogenicity-related genes is given. Extensive experiments on six public tumor microarray gene expression datasets demonstrate the proposed technique is competitive with state-of-the-art methods.Comment: 14 pages, 19 figures, 10 table

arXiv.org e-Print Archive

Crossref

Analysis of microarray and next generation sequencing data for classification and biomarker discovery in relation to complex diseases

Author: Elyasigomari Vahid
Publication venue: 'Queen Mary University of London'
Publication date: 21/09/2017
Field of study

PhDThis thesis presents an investigation into gene expression profiling, using microarray and next generation sequencing (NGS) datasets, in relation to multi-category diseases such as cancer. It has been established that if the sequence of a gene is mutated, it can result in the unscheduled production of protein, leading to cancer. However, identifying the molecular signature of different cancers amongst thousands of genes is complex. This thesis investigates tools that can aid the study of gene expression to infer useful information towards personalised medicine. For microarray data analysis, this study proposes two new techniques to increase the accuracy of cancer classification. In the first method, a novel optimisation algorithm, COA-GA, was developed by synchronising the Cuckoo Optimisation Algorithm and the Genetic Algorithm for data clustering in a shuffle setup, to choose the most informative genes for classification purposes. Support Vector Machine (SVM) and Multilayer Perceptron (MLP) artificial neural networks are utilised for the classification step. Results suggest this method can significantly increase classification accuracy compared to other methods. An additional method involving a two-stage gene selection process was developed. In this method, a subset of the most informative genes are first selected by the Minimum Redundancy Maximum Relevance (MRMR) method. In the second stage, optimisation algorithms are used in a wrapper setup with SVM to minimise the selected genes whilst maximising the accuracy of classification. A comparative performance assessment suggests that the proposed algorithm significantly outperforms other methods at selecting fewer genes that are highly relevant to the cancer type, while maintaining a high classification accuracy. In the case of NGS, a state-of-the-art pipeline for the analysis of RNA-Seq data is investigated to discover differentially expressed genes and differential exon usages between normal and AIP positive Drosophila datasets, which are produced in house at Queen Mary, University of London. Functional genomic of differentially expressed genes were examined and found to be relevant to the case study under investigation. Finally, after normalising the RNA-Seq data, machine learning approaches similar to those in microarray was successfully implemented for these datasets

Queen Mary Research Online

Stable Feature Selection for Biomarker Discovery

Author: He Zengyou
Yu Weichuan
Publication venue
Publication date: 01/01/2010
Field of study

Feature selection techniques have been used as the workhorse in biomarker discovery applications for a long time. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered. It is only until recently that this issue has received more and more attention. In this article, we review existing stable feature selection methods for biomarker discovery using a generic hierarchal framework. We have two objectives: (1) providing an overview on this new yet fast growing topic for a convenient reference; (2) categorizing existing methods under an expandable framework for future research and development

arXiv.org e-Print Archive

CiteSeerX

Hong Kong University of Science and Technology Institutional Repository

Feature selection for microarray gene expression data using simulated annealing guided by the multivariate joint entropy

Author: Belanche Muñoz Luis Antonio
González Navarro Félix Fernando
Publication venue
Publication date: 01/01/2013
Field of study

In this work a new way to calculate the multivariate joint entropy is presented. This measure is the basis for a fast information-theoretic based evaluation of gene relevance in a Microarray Gene Expression data context. Its low complexity is based on the reuse of previous computations to calculate current feature relevance. The mu-TAFS algorithm --named as such to differentiate it from previous TAFS algorithms-- implements a simulated annealing technique specially designed for feature subset selection. The algorithm is applied to the maximization of gene subset relevance in several public-domain microarray data sets. The experimental results show a notoriously high classification performance and low size subsets formed by biologically meaningful genes.Postprint (published version

arXiv.org e-Print Archive

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Assessment of SVM Reliability for Microarray Data Analysis

Author: Blanzieri Enrico
Malossini Andrea
Ng Raymond
Publication venue
Publication date: 01/12/2004
Field of study

The goal of our research is to provide techniques that can assess and validate the results of SVM-based analysis of microarray data. We present preliminary results of the effect of mislabeled training samples. We conducted several systematic experiments on artificial and real medical data using SVMs. We systematically flipped the labels of a fraction of the training data. We show that a relatively small number of mislabeled examples can dramatically decrease the performance as visualized on the ROC graphs. This phenomenon persists even if the dimensionality of the input space is drastically decreased, by using for example feature selection. Moreover we show that for SVM recursive feature elimination, even a small fraction of mislabeled samples can completely change the resulting set of genes. This work is an extended version of the previous paper [MBN04]

Unitn-eprints Research