1,793 research outputs found

    Knowledge about the presence or absence of miRNA isoforms (isomiRs) can successfully discriminate amongst 32 TCGA cancer types.

    Isoforms of human miRNAs (isomiRs) are constitutively expressed with tissue- and disease-subtype-dependencies. We studied 10 271 tumor datasets from The Cancer Genome Atlas (TCGA) to evaluate whether isomiRs can distinguish amongst 32 TCGA cancers. Unlike previous approaches, we built a classifier that relied solely on 'binarized' isomiR profiles: each isomiR is simply labeled as 'present' or 'absent'. The resulting classifier successfully labeled tumor datasets with an average sensitivity of 90% and a false discovery rate (FDR) of 3%, surpassing the performance of expression-based classification. The classifier maintained its power even after a 15× reduction in the number of isomiRs that were used for training. Notably, the classifier could correctly predict the cancer type in non-TCGA datasets from diverse platforms. Our analysis revealed that the most discriminatory isomiRs happen to also be differentially expressed between normal tissue and cancer. Even so, we find that these highly discriminating isomiRs have not been attracting the most research attention in the literature. Given their ability to successfully classify datasets from 32 cancers, isomiRs and our resulting 'Pan-cancer Atlas' of isomiR expression could serve as a suitable framework to explore novel cancer biomarkers.
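
The 'present'/'absent' labeling described in this abstract can be illustrated with a minimal sketch. The abstract does not state the criterion used to call an isomiR present, so the `threshold` parameter, the function name `binarize_profile`, and the isomiR names below are all hypothetical and chosen only for illustration.

```python
def binarize_profile(abundances, threshold=0.0):
    """Map each isomiR abundance to 1 (present) or 0 (absent).

    The threshold is an assumption of this sketch, not a value
    taken from the study.
    """
    return {name: int(value > threshold) for name, value in abundances.items()}


# Hypothetical abundances for three isomiRs in one sample.
sample = {"isomiR_a": 12.5, "isomiR_b": 0.0, "isomiR_c": 3.1}
binary = binarize_profile(sample)
```

A classifier trained on such profiles sees only which isomiRs occur in a sample, not how abundant they are.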

    Inference from binary gene expression data

    Microarrays provide a practical method for measuring the mRNA abundances of thousands of genes in a single experiment. Analysing such high-dimensional data is a challenge which attracts researchers from many different fields, machine learning among them. However, the biological properties of mRNA, such as its low stability and the fact that measurements are taken from a population of cells rather than from a single cell, should make researchers sceptical about the high numerical precision reported and thus the reproducibility of these measurements. In this study we explore data representation at lower numerical precision, down to binary (retaining only the information whether a gene is expressed or not), thereby improving the quality of inferences drawn from microarray studies. With binary representation, we propose a solution to reduce the effect of algorithmic choice in the pre-processing stages. First we compare the information loss if researchers made the inferences from quantized transcriptome data rather than the continuous values. Classification, clustering, periodicity detection and analysis of developmental time series data are considered here. Our results showed that there is not much information loss with binary data. Then, by focusing on the two most widely used inference tools, classification and clustering, we show that inferences drawn from transcriptome data can actually be improved with a metric suitable for binary data. This is explained by the uncertainties of the probe-level data. We also show that binary transcriptome data can be used in cross-platform studies and that, when used with the Tanimoto kernel, this increases the performance of inferences when compared to individual datasets. In the last part of this work we show that binary transcriptome data reduces the effect of algorithm choice for pre-processing raw data. While there are many different algorithms for the pre-processing stages, there are few guidelines for users as to which one to choose. Many studies have shown that the choice of algorithms has a significant impact on the overall results of microarray studies. Here we show in classification that, if transcriptome data is binarized after being pre-processed with any combination of algorithms, the variability of the results is reduced and the performance of the classifier is increased simultaneously.
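
The Tanimoto kernel mentioned in this abstract has a simple form on binary vectors: the number of genes expressed in both samples divided by the number expressed in at least one. The following is a minimal sketch of that formula only. The function name and the choice to return 1.0 for two all-zero vectors are assumptions of this sketch, not details from the study.

```python
def tanimoto_kernel(x, y):
    """Tanimoto similarity between two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    both = sum(a & b for a, b in zip(x, y))    # expressed in both samples
    either = sum(a | b for a, b in zip(x, y))  # expressed in at least one
    # Convention assumed here: two all-zero vectors count as identical.
    return both / either if either else 1.0


# Two hypothetical binarized expression profiles over four genes.
similarity = tanimoto_kernel([1, 1, 0, 0], [1, 0, 1, 0])
```

Because the value depends only on which genes are called expressed, it can be compared across platforms whose raw intensity scales differ.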

    rfTSP: A Non-parametric predictive model with order-based feature selection for transcriptomic data

    Gene expression data has strong potential for predicting biologic classifications. For example, tumor subtype can be determined using machine learning models and gene expression profiles. We propose the use of Top Scoring Pairs in combination with machine learning to improve inter-study prediction of genomic profiles. Inter-study prediction refers to prediction across two studies that are completely independent in terms of either platform or tissue. Top Scoring Pairs (TSPs) rank pairs of genes according to how well their relative expression discriminates between different groups of subjects. For example, gene A will be lowly expressed and gene B highly expressed in cases, while gene A will be highly expressed and gene B lowly expressed in controls. The genes in a pair demonstrate an inverse relationship with respect to one another. Using TSPs not only acts as a feature selection step but also provides a nonparametric method that transforms the continuous expression data to 0/1 values based on the rank order within each pair. Due to the robust nature of the transformed data, our methods demonstrate that TSP binary data is much more effective in prediction than continuous data, particularly in cross-study prediction. Furthermore, we extend the use of TSPs to not only binary and multi-class label prediction, but also continuous classification. The objective of this paper is to demonstrate how using dichotomized data from TSPs as the feature space for machine learning methods, particularly random forest, returns stronger prediction accuracy across independent studies than traditional machine learning techniques with log2 and quantile normalization of data. This work has significant public health impact, as accurate genomic prediction is crucial for early detection of many serious illnesses such as cancer.
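
The pair-based transformation described in this abstract can be sketched as follows. This is an illustration of the general Top Scoring Pairs idea, not the rfTSP implementation: a pair (i, j) is scored by how much the frequency of "gene i below gene j" differs between the two classes, and each selected pair then becomes one 0/1 feature. All function names and the toy data are hypothetical, and the random forest step is omitted.

```python
from itertools import combinations


def pair_indicator(sample, i, j):
    """Return 1 if gene i is expressed below gene j in this sample, else 0."""
    return int(sample[i] < sample[j])


def tsp_score(samples, labels, i, j):
    """Absolute difference between classes in how often gene i < gene j."""
    def freq(cls):
        group = [s for s, y in zip(samples, labels) if y == cls]
        return sum(pair_indicator(s, i, j) for s in group) / len(group)
    return abs(freq(1) - freq(0))


def top_scoring_pairs(samples, labels, k):
    """Rank all gene pairs by score and keep the k best."""
    n_genes = len(samples[0])
    ranked = sorted(combinations(range(n_genes), 2),
                    key=lambda p: tsp_score(samples, labels, *p),
                    reverse=True)
    return ranked[:k]


def tsp_features(sample, pairs):
    """Turn one continuous expression profile into 0/1 pair features."""
    return [pair_indicator(sample, i, j) for i, j in pairs]


# Toy data: rows are samples, columns are three genes; 1 = case, 0 = control.
samples = [
    [1.0, 5.0, 3.0],
    [2.0, 6.0, 2.5],
    [7.0, 1.0, 9.0],
    [8.0, 2.0, 2.0],
]
labels = [1, 1, 0, 0]
pairs = top_scoring_pairs(samples, labels, k=1)
features = [tsp_features(s, pairs) for s in samples]
```

Since each feature depends only on the ordering of two genes within a sample, it is unchanged by any monotone rescaling of that sample, which is one reason such features may transfer across platforms. The resulting 0/1 matrix could then be passed to any classifier.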

    Gene selection for optimal prediction of cell position in tissues from single-cell transcriptomics data.

    Single-cell RNA-sequencing (scRNAseq) technologies are rapidly evolving. Although standard scRNAseq experiments are very informative, the spatial organization of the cells in the tissue of origin is lost. Conversely, spatial RNA-seq technologies designed to maintain cell localization have limited throughput and gene coverage. Mapping scRNAseq to genes with spatial information increases coverage while providing spatial location. However, methods to perform such mapping have not yet been benchmarked. To fill this gap, we organized the DREAM Single-Cell Transcriptomics challenge focused on the spatial reconstruction of cells from the Drosophila embryo from scRNAseq data, leveraging as a silver standard genes with in situ hybridization data from the Berkeley Drosophila Transcription Network Project reference atlas. The 34 participating teams used diverse algorithms for gene selection and location prediction, while still being able to correctly localize clusters of cells. Selection of predictor genes was essential for this task. Predictor genes showed a relatively high expression entropy and high spatial clustering, and included prominent developmental genes such as gap and pair-rule genes and tissue markers. Application of the top 10 methods to a zebrafish embryo dataset yielded performance and statistical properties of the selected genes similar to those in the Drosophila data. This suggests that methods developed in this challenge are able to extract generalizable properties of genes that are useful to accurately reconstruct the spatial arrangement of cells in tissues.

    Using Pre-existing Microarray Datasets to Increase Experimental Power: Application to Insulin Resistance

    Although they have become a widely used experimental technique for identifying differentially expressed (DE) genes, DNA microarrays are notorious for generating noisy data. A common strategy for mitigating the effects of noise is to perform many experimental replicates. This approach is often costly and sometimes impossible given limited resources; thus, analytical methods are needed which increase accuracy at no additional cost. One inexpensive source of microarray replicates comes from prior work: to date, data from hundreds of thousands of microarray experiments are in the public domain. Although these data assay a wide range of conditions, they cannot be used directly to inform any particular experiment and are thus ignored by most DE gene methods. We present the SVD Augmented Gene expression Analysis Tool (SAGAT), a mathematically principled, data-driven approach for identifying DE genes. SAGAT increases the power of a microarray experiment by using observed coexpression relationships from publicly available microarray datasets to reduce uncertainty in individual genes' expression measurements. We tested the method on three well-replicated human microarray datasets and demonstrated that use of SAGAT increased effective sample sizes by as many as 2.72 arrays. We applied SAGAT to unpublished data from a microarray study investigating transcriptional responses to insulin resistance, resulting in a 50% increase in the number of significant genes detected. We evaluated 11 (58%) of these genes experimentally using qPCR, confirming the directions of expression change for all 11 and statistical significance for three. Use of SAGAT revealed coherent biological changes in three pathways: inflammation, differentiation, and fatty acid synthesis, furthering our molecular understanding of a type 2 diabetes risk factor. We envision SAGAT as a means to maximize the potential for biological discovery from subtle transcriptional responses, and we provide it as a freely available software package that is immediately applicable to any human microarray study.

    A study of the suitability of autoencoders for preprocessing data in breast cancer experimentation

    Breast cancer is the most common cause of cancer death in women. Today, post-transcriptional protein products of the genes involved in breast cancer can be identified by immunohistochemistry. However, this method has problems arising from the intra-observer and inter-observer variability in the assessment of pathologic variables, which may result in misleading conclusions. Using an optimal selection of preprocessing techniques may help to reduce observer variability. Deep learning has emerged as a powerful technique for any tasks related to machine learning such as classification and regression. The aim of this work is to use autoencoders (neural networks commonly used to feed deep learning architectures) to improve the quality of the data for developing immunohistochemistry signatures with prognostic value in breast cancer. Our testing on data from 222 patients with invasive non-special type breast carcinoma shows that an automatic binarization of experimental data after autoencoding could outperform other classical preprocessing techniques (such as human-dependent or automatic binarization only) when applied to the prognosis of breast cancer by immunohistochemical signatures. Ministerio de Economía y Competitividad TIN2014-55894-C2-1-

    Comparative genome analysis of a large Dutch Legionella pneumophila strain collection identifies five markers highly correlated with clinical strains

    Background: Discrimination between clinical and environmental strains within many bacterial species is currently underexplored. Genomic analyses have clearly shown the enormous variability in genome composition between different strains of a bacterial species. In this study we have used Legionella pneumophila, the causative agent of Legionnaires' disease, to search for genomic markers related to pathogenicity. During a large surveillance study in The Netherlands, well-characterized patient-derived strains and environmental strains were collected. We have used a mixed-genome microarray to perform comparative-genome analysis of 257 strains from this collection. Results: Microarray analysis indicated that 480 DNA markers (out of 3360 markers in total) showed clear variation in presence between individual strains, and these were therefore selected for further analysis. Unsupervised statistical analysis of these markers showed the enormous genomic variation within the species but did not show any correlation with a pathogenic phenotype. We therefore used supervised statistical analysis to identify discriminating markers. Genetic programming was used both to identify predictive markers and to define their interrelationships. A model consisting of five markers was developed that together correctly predicted 100% of the clinical strains and 69% of the environmental strains. Conclusions: A novel approach for identifying predictive markers enabling discrimination between clinical and environmental isolates of L. pneumophila is presented. Out of over 3000 possible markers, five were selected that together enabled correct prediction of all the clinical strains included in this study. This novel approach for identifying predictive markers can be applied to all bacterial species, allowing for better discrimination between strains well equipped to cause human disease and relatively harmless strains.

    Pancancer analysis of DNA methylation-driven genes using MethylMix.

    Aberrant DNA methylation is an important mechanism that contributes to oncogenesis. Yet, few algorithms exist that exploit this vast dataset to identify hypo- and hypermethylated genes in cancer. We developed a novel computational algorithm called MethylMix to identify differentially methylated genes that are also predictive of transcription. We apply MethylMix to 12 individual cancer sites, and additionally combine all cancer sites in a pancancer analysis. We discover pancancer hypo- and hypermethylated genes and identify novel methylation-driven subgroups with clinical implications. MethylMix analysis on combined cancer sites reveals 10 pancancer clusters reflecting new similarities across malignantly transformed tissues.

    Understanding cellular function and disease with comparative pathway analysis

    Pathway analysis is important in interpreting the functional implications of high-throughput experimental results, but robust comparison across platforms and species is problematic. A new approach, Pathprinting, provides a cross-platform, cross-species comparative analysis of pathway expression signatures. This method calculates pathway-level statistics from gene expression across nearly 180,000 microarrays in the Gene Expression Omnibus. Pathprinting can accurately retrieve phenotypically similar samples and identify sets of human and mouse genes that are prognostic in cancer.