503 research outputs found

    Recursive Cluster Elimination (RCE) for classification and feature selection from gene expression data

    Background: Classification studies using gene expression datasets are usually based on small numbers of samples and tens of thousands of genes. Selecting the genes that are important for distinguishing the sample classes being compared poses a challenging problem in high-dimensional data analysis. We describe a new procedure for selecting significant genes, recursive cluster elimination (RCE), as an alternative to recursive feature elimination (RFE). We have tested this algorithm on six datasets and compared its performance with that of two related classification procedures with RFE. Results: We have developed a novel method for selecting significant genes in comparative gene expression studies. This method, which we refer to as SVM-RCE, combines K-means, a clustering method, to identify correlated gene clusters, and Support Vector Machines (SVMs), a supervised machine learning classification method, to identify and score (rank) those gene clusters for the purpose of classification. K-means is used initially to group genes into clusters. Recursive cluster elimination (RCE) is then applied to iteratively remove those clusters of genes that contribute the least to the classification performance. SVM-RCE identifies the clusters of correlated genes that are most significantly differentially expressed between the sample classes. Using gene clusters, rather than individual genes, improves supervised classification accuracy on the same data compared with SVM or Penalized Discriminant Analysis (PDA) with recursive feature elimination (SVM-RFE and PDA-RFE), which remove genes based on their individual discriminant weights. Conclusion: SVM-RCE provides improved classification accuracy with complex microarray data sets when compared to the classification accuracy of the same datasets using either SVM-RFE or PDA-RFE. SVM-RCE identifies clusters of correlated genes that, when considered together, provide greater insight into the structure of the microarray data. Clustering genes for classification appears to result in some concomitant clustering of samples into subgroups. Our present implementation of SVM-RCE groups genes using the correlation metric. The success of the SVM-RCE method in classification suggests that gene interaction networks or other biologically relevant metrics that group genes based on functional parameters might also be useful.
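
The cluster-then-eliminate loop described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a nearest-centroid classifier with leave-one-out accuracy stands in for the SVM, genes are clustered once with a deterministic K-means using Euclidean distance (rather than the correlation metric the abstract mentions), and one cluster is dropped per round. All function names and the toy data are hypothetical.

```python
from statistics import mean

def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, k, iters=20):
    """Cluster gene profiles with K-means; returns one cluster label per gene."""
    # Deterministic farthest-first initialisation (an assumption of this sketch).
    centers = [list(profiles[0])]
    while len(centers) < k:
        far = max(profiles, key=lambda p: min(sqdist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(profiles)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in profiles]
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centers[c] = [mean(col) for col in zip(*members)]
    return labels

def loo_accuracy(X, y, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier (SVM stand-in)."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for cls in sorted(set(y)):
            rows = [[X[j][g] for g in genes]
                    for j in range(len(X)) if j != i and y[j] == cls]
            centroids[cls] = [mean(col) for col in zip(*rows)]
        xi = [X[i][g] for g in genes]
        pred = min(centroids, key=lambda c: sqdist(xi, centroids[c]))
        correct += pred == y[i]
    return correct / len(X)

def rce(X, y, n_clusters, keep=1):
    """Drop the lowest-scoring gene cluster until only `keep` clusters remain."""
    n_genes = len(X[0])
    profiles = [[row[g] for row in X] for g in range(n_genes)]
    labels = kmeans(profiles, n_clusters)
    clusters = [[g for g in range(n_genes) if labels[g] == c]
                for c in range(n_clusters)]
    clusters = [c for c in clusters if c]
    while len(clusters) > keep:
        scores = [loo_accuracy(X, y, c) for c in clusters]
        clusters.pop(scores.index(min(scores)))
    return clusters

# Toy data: rows are samples, columns are genes.
# Genes 0 and 1 separate the classes; genes 2 and 3 do not.
X = [[0.0, 0.2, 1.0, 1.1], [0.1, 0.0, 3.0, 3.2], [0.2, 0.1, 1.0, 0.9],
     [5.0, 5.1, 3.0, 2.9], [5.2, 5.0, 1.0, 1.2], [4.9, 5.2, 3.0, 3.1]]
y = ["A", "A", "A", "B", "B", "B"]
selected = rce(X, y, n_clusters=2, keep=1)
print(selected)
```

Scoring whole clusters rather than single genes is the point of the sketch: correlated genes are kept or discarded together.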

    Learning from positive examples when the negative class is undetermined- microRNA gene identification

    Background: The application of machine learning to classification problems that depend only on positive examples is gaining attention in the computational biology community. We and others have described the use of two-class machine learning to identify novel miRNAs. These methods require the generation of an artificial negative class. However, designating the negative class can be problematic; if it is not done properly, it can dramatically affect the performance of the classifier and/or yield a biased estimate of performance. We present a study using one-class machine learning for microRNA (miRNA) discovery and compare one-class to two-class approaches using naïve Bayes and Support Vector Machines. These results are compared to published two-class miRNA prediction approaches. We also examine the ability of the one-class and two-class techniques to identify miRNAs in newly sequenced species. Results: Of all methods tested, we found that two-class naïve Bayes and Support Vector Machines gave the best accuracy using our selected features and optimally chosen negative examples. One-class methods showed average accuracies of 70–80% versus 90% for the two-class methods on the same feature sets. However, some one-class methods outperform some recently published two-class approaches with different selected features. Using the EBV genome as an external validation of the method, we found one-class machine learning to work as well as or better than a two-class approach in identifying true miRNAs as well as predicting new miRNAs. Conclusion: One-class and two-class methods can both give useful classification accuracies when the negative class is well characterized. The advantage of one-class methods is that they eliminate guessing at the optimal features for the negative class when these are not well defined. In these cases one-class methods can be superior to two-class methods, provided the features chosen as representative of the positive class are well defined. Availability: The OneClassmiRNA program is available at [1].
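
To illustrate what learning from positive examples only can look like, here is a minimal sketch of a one-class model. It is an assumption chosen for brevity (per-feature mean and standard deviation with a distance threshold), not necessarily one of the models evaluated in the study, and the two features and their values are hypothetical.

```python
from statistics import mean, pstdev

def fit_one_class(positives, quantile=0.9):
    """Fit on positive examples only; return a function that accepts or rejects."""
    cols = list(zip(*positives))
    mu = [mean(c) for c in cols]
    sd = [pstdev(c) or 1.0 for c in cols]  # guard against zero spread

    def score(x):
        # Sum of squared z-scores: small means "looks like the positives".
        return sum(((v - m) / s) ** 2 for v, m, s in zip(x, mu, sd))

    # Threshold taken from the training scores themselves; no negatives needed.
    train_scores = sorted(score(p) for p in positives)
    idx = min(len(train_scores) - 1, int(quantile * len(train_scores)))
    threshold = train_scores[idx]
    return lambda x: score(x) <= threshold

# Hypothetical features per candidate: [hairpin length, folding energy].
known_positives = [[80, -35.0], [85, -38.0], [78, -33.0], [90, -40.0], [82, -36.0]]
is_positive = fit_one_class(known_positives)

print(is_positive([84, -37.0]))   # candidate close to the known positives
print(is_positive([200, -5.0]))   # candidate far from the known positives
```

A two-class model would need an explicit set of negative examples at the fitting step; here the decision boundary is derived from the positives alone.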

    Isoform-level gene signature improves prognostic stratification and accurately classifies glioblastoma subtypes.

    Molecular stratification of tumors is essential for developing personalized therapies. Although patient stratification strategies have been successful, computational methods to accurately translate a gene signature from a high-throughput platform to a clinically adaptable low-dimensional platform are currently lacking. Here, we describe PIGExClass (platform-independent isoform-level gene-expression based classification-system), a novel computational approach to derive and then transfer gene signatures from one analytical platform to another. We applied PIGExClass to design a reverse transcriptase-quantitative polymerase chain reaction (RT-qPCR) based molecular-subtyping assay for glioblastoma multiforme (GBM), the most aggressive primary brain tumor. Unsupervised clustering of TCGA (The Cancer Genome Atlas Consortium) GBM samples, based on isoform-level gene-expression profiles, recaptured the four known molecular subgroups but switched the subtype for 19% of the samples, resulting in significant (P = 0.0103) survival differences among the refined subgroups. The PIGExClass-derived four-class classifier, which requires only 121 transcript variants, assigns GBM patients' molecular subtype with 92% accuracy. This classifier was translated to an RT-qPCR assay and validated in an independent cohort of 206 GBM samples. Our results demonstrate the efficacy of PIGExClass in the design of clinically adaptable molecular subtyping assays and have implications for developing robust diagnostic assays for cancer patient stratification.

    Classification and Prediction of Survival in Patients with the Leukemic Phase of Cutaneous T Cell Lymphoma

    We have used cDNA arrays to investigate gene expression patterns in peripheral blood mononuclear cells from patients with leukemic forms of cutaneous T cell lymphoma, primarily Sezary syndrome (SS). When expression data for patients with high blood tumor burden (Sezary cells >60% of the lymphocytes) and healthy controls are compared by Student's t test, at P < 0.01, we find 385 genes to be differentially expressed. Highly overexpressed genes include the Th2 cell–specific transcription factors Gata-3 and Jun B, as well as integrin β1, proteoglycan 2, the RhoB oncogene, and dual specificity phosphatase 1. Highly underexpressed genes include CD26, Stat-4, and the IL-1 receptors. Message for plastin-T, not normally expressed in lymphoid tissue, is detected only in patient samples and may provide a new marker for diagnosis. Using penalized discriminant analysis, we have identified a panel of eight genes that can distinguish SS in patients with as few as 5% circulating tumor cells. This suggests that, even in early disease, Sezary cells produce chemokines and cytokines that induce an expression profile in the peripheral blood distinctive to SS. Finally, we show that using 10 genes, we can identify a class of patients who will succumb within six months of sampling regardless of their tumor burden.

    Genetic and morphological analyses of Gracilaria firma and G. changii (Gracilariaceae, Rhodophyta), the commercially important agarophytes in western Pacific

    Many studies classifying Gracilaria species for the exploitation of agarophytes and the development of the agar industry were conducted before the prevalence of molecular tools, resulting in the description of many species based solely on their morphology. Gracilaria firma and G. changii are among the commercially important agarophytes from the western Pacific; both feature branches with basal constrictions that taper toward acute apices. In this study, we contrasted the morpho-anatomical circumscriptions of the two traditionally described species with molecular data from samples that included representatives of G. changii collected from its type locality. Concerted molecular analyses using the rbcL and cox1 gene sequences, coupled with morphological observations of the collections from the western Pacific, revealed no inherent differences to support the treatment of the two entities as distinct taxa. We propose merging G. changii (a later synonym) into G. firma and recognize G. firma based on thallus branches with abrupt basal constrictions that gradually taper toward acute (or sometimes broken) apices, cystocarps consisting of small gonimoblast cells and inconspicuous multinucleate tubular nutritive cells issuing from gonimoblasts extending into the inner pericarp at the cystocarp floor, as well as deep spermatangial conceptacles of the verrucosa type. Validating specimens held under different names as a single genetic species facilitates communication and knowledge transfer among groups from different fields. This study also revealed a considerably low number of haplotypes and low nucleotide diversity, with apparent phylogeographic patterns, for G. firma in the region. Populations from the Philippines and Taiwan were divergent from each other as well as from the populations from Malaysia, Thailand, Singapore and Vietnam. Establishment of baseline data on the genetic diversity of this commercially important agarophyte is relevant in the context of cultivation, as limited genetic diversity may jeopardize the potential for its genetic improvement over time.

    Transcriptome analysis of a respiratory Saccharomyces cerevisiae strain suggests the expression of its phenotype is glucose insensitive and predominantly controlled by Hap4, Cat8 and Mig1

    BACKGROUND: We previously described the first respiratory Saccharomyces cerevisiae strain, KOY.TM6*P, by integrating the gene encoding a chimeric hexose transporter, Tm6*, into the genome of an hxt null yeast. Subsequently we transferred this respiratory phenotype in the presence of up to 50 g/L glucose to a yeast strain, V5 hxt1-7Δ, in which only HXT1-7 had been deleted. In this study, we compared the transcriptome of the resultant strain, V5.TM6*P, with that of its wild-type parent, V5, at different glucose concentrations. RESULTS: cDNA array analyses revealed that alterations in gene expression that occur when transitioning from a respiro-fermentative (V5) to a respiratory (V5.TM6*P) strain are very similar to those in cells undergoing a diauxic shift. We also undertook an analysis of transcription factor binding sites in our dataset by examining previously published biological data for Hap4 (in complex with Hap2, 3, 5), Cat8 and Mig1, and used this in combination with verified binding consensus sequences to identify genes likely to be regulated by one or more of these. Of the induced genes in our dataset, 77% had binding sites for the Hap complex, with 72% having at least two. In addition, 13% were found to have a binding site for Cat8 and 21% had a binding site for Mig1. Unexpectedly, both the up- and down-regulation of many of the genes in our dataset had a clear glucose dependence in the parent V5 strain that was not present in V5.TM6*P. This indicates that the relief of glucose repression is already operable at much higher glucose concentrations than is widely accepted and suggests that glucose sensing might occur inside the cell. CONCLUSION: Our dataset gives a remarkably complete view of the involvement of genes in the TCA cycle, glyoxylate cycle and respiratory chain in the expression of the phenotype of V5.TM6*P. Furthermore, 88% of the transcriptional response of the induced genes in our dataset can be related to the potential activities of just three proteins: Hap4, Cat8 and Mig1. Overall, our data support genetic remodelling in V5.TM6*P consistent with a respiratory metabolism which is insensitive to external glucose concentrations.

    Genome-wide analysis of host-chromosome binding sites for Epstein-Barr Virus Nuclear Antigen 1 (EBNA1)

    The Epstein-Barr Virus (EBV) Nuclear Antigen 1 (EBNA1) protein is required for the establishment of EBV latent infection in proliferating B-lymphocytes. EBNA1 is a multifunctional DNA-binding protein that stimulates DNA replication at the viral origin of plasmid replication (OriP), regulates transcription of viral and cellular genes, and tethers the viral episome to the cellular chromosome. EBNA1 also provides a survival function to B-lymphocytes, potentially through its ability to alter cellular gene expression. To better understand these various functions of EBNA1, we performed a genome-wide analysis of the viral and cellular DNA sites associated with EBNA1 protein in a latently infected Burkitt lymphoma B-cell line. Chromatin-immunoprecipitation (ChIP) combined with massively parallel deep-sequencing (ChIP-Seq) was used to identify cellular sites bound by EBNA1. Sites identified by ChIP-Seq were validated by conventional real-time PCR, and ChIP-Seq provided quantitative, high-resolution detection of the known EBNA1 binding sites on the EBV genome at OriP and Qp. We identified at least one cluster of unusually high-affinity EBNA1 binding sites on chromosome 11, between the divergent FAM55D and FAM55B genes. A consensus for all cellular EBNA1 binding sites is distinct from that derived from the known viral binding sites, suggesting that some of these sites are indirectly bound by EBNA1. EBNA1 also bound close to the transcriptional start sites of a large number of cellular genes, including HDAC3, CDC7, and MAP3K1, which we show are positively regulated by EBNA1. EBNA1 binding sites were enriched in some repetitive elements, especially LINE 1 retrotransposons, and had weak correlations with histone modifications and ORC binding. We conclude that EBNA1 can interact with a large number of cellular genes and chromosomal loci in latently infected cells, but that these sites are likely to represent a complex ensemble of direct and indirect EBNA1 binding sites.

    Peripheral Immune Cell Gene Expression Predicts Survival of Patients with Non-Small Cell Lung Cancer

    Prediction of cancer recurrence in patients with non-small cell lung cancer (NSCLC) currently relies on the assessment of clinical characteristics including age, tumor stage, and smoking history. A better prediction of early stage cancer patients with poorer survival and late stage patients with better survival is needed to design patient-tailored treatment protocols. We analyzed gene expression in RNA from peripheral blood mononuclear cells (PBMC) of NSCLC patients to identify signatures predictive of overall patient survival. We find that PBMC gene expression patterns from NSCLC patients, like patterns from tumors, have information predictive of patient outcomes. We identify and validate a 26-gene prognostic panel that is independent of clinical stage. Many additional prognostic genes are specific to myeloid cells and are more highly expressed in patients with shorter survival. We also observe that significant numbers of prognostic genes change expression levels in PBMC collected after tumor resection. These post-surgery gene expression profiles may provide a means to re-evaluate prognosis over time. These studies further suggest that patient outcomes are not solely determined by tumor gene expression profiles but can also be influenced by the immune response as reflected in peripheral immune cells.

    Detection of chromosome aberrations in metaphase and interphase tumor cells by in situ hybridization using chromosome-specific library probes

    Chromosome aberrations in two glioma cell lines were analyzed using biotinylated DNA library probes that specifically decorate chromosomes 1, 4, 7, 18 and 22 from pter to qter. Numerical changes, deletions and rearrangements of these chromosomes were readily visualized in metaphase spreads, as well as in early prophase and interphase nuclei. Complete chromosomes, deleted chromosomes and segments of translocated chromosomes were rapidly delineated in very complex karyotypes. Simultaneous hybridizations with additional subregional probes were used to further define aberrant chromosomes. Digital image analysis was used to quantitate the total complement of specific chromosomal DNAs in individual metaphase and interphase cells of each cell line. Although both glioma lines have been passaged in vitro for many years, an under-representation of chromosome 22 and an over-representation of chromosome 7 (specifically 7p) were observed. These observations agree with previous studies on gliomas. In addition, sequences of chromosome 4 were also found to be under-represented, especially in TC 593. These analyses indicate the power of these methods for pinpointing chromosome segments that are altered in specific types of tumors.

    Classification and biomarker identification using gene network modules and support vector machines

    Background: Classification using microarray datasets is usually based on a small number of samples for which tens of thousands of gene expression measurements have been obtained. The selection of the genes most significant to the classification problem is a challenging issue in high-dimensional data analysis and interpretation. A previous study with SVM-RCE (Recursive Cluster Elimination) suggested that classification based on groups of correlated genes sometimes exhibits better performance than classification using single genes. Large databases of gene interaction networks provide an important resource for the analysis of genetic phenomena and for classification studies using interacting genes. We now demonstrate that an algorithm which integrates network information with recursive feature elimination based on SVM exhibits good performance and improves the biological interpretability of the results. We refer to the method as SVM with Recursive Network Elimination (SVM-RNE). Results: Initially, one thousand genes selected by t-test from a training set are filtered so that only genes that map to a gene network database remain. The Gene Expression Network Analysis Tool (GXNA) is applied to the remaining genes to form n clusters of genes that are highly connected in the network. Linear SVM is used to classify the samples using these clusters, and a weight is assigned to each cluster based on its importance to the classification. The least informative clusters are removed while retaining the remainder for the next classification step. This process is repeated until an optimal classification is obtained. Conclusion: More than 90% accuracy can be obtained in classification of selected microarray datasets by integrating the interaction network information with the gene expression information from the microarrays. The Matlab version of SVM-RNE can be downloaded from http://web.macam.ac.il/~myousef