36 research outputs found

    Seq-ing improved gene expression estimates from microarrays using machine learning

    BACKGROUND: Quantifying gene expression by RNA-Seq has several advantages over microarrays, including greater dynamic range and gene expression estimates on an absolute, rather than a relative scale. Nevertheless, microarrays remain in widespread use, demonstrated by the ever-growing numbers of samples deposited in public repositories. RESULTS: We propose a novel approach to microarray analysis that attains many of the advantages of RNA-Seq. This method, called Machine Learning of Transcript Expression (MaLTE), leverages samples for which both microarray and RNA-Seq data are available, using a Random Forest to learn the relationship between the fluorescence intensity of sets of microarray probes and RNA-Seq transcript expression estimates. We trained MaLTE on data from the Genotype-Tissue Expression (GTEx) project, consisting of Affymetrix gene arrays and RNA-Seq from over 700 samples across a broad range of human tissues. CONCLUSION: This approach can be used to accurately estimate absolute expression levels from microarray data, at both gene and transcript level, which has not previously been possible. This methodology will facilitate re-analysis of archived microarray data and broaden the utility of the vast quantities of data still being generated

    The regulatory effect of miRNAs is a heritable genetic trait in humans

    BACKGROUND: MicroRNAs (miRNAs) have been shown to regulate the expression of a large number of genes and play key roles in many biological processes. Several previous studies have quantified the inhibitory effect of a miRNA indirectly by considering the expression levels of genes that are predicted to be targeted by the miRNA, and this approach has been shown to be robust to the choice of prediction algorithm. Given a gene expression dataset, Cheng et al. defined the regulatory effect score (RE-score) of a miRNA as the difference in the gene expression rank of targets of the miRNA compared to non-targeted genes. RESULTS: Using microarray data from parent-offspring trios from the International HapMap project, we show that the RE-score of most miRNAs is correlated between parents and offspring and, thus, inter-individual variation in RE-score has a genetic component in humans. Indeed, the mean RE-score across miRNAs is correlated between parents and offspring, suggesting genetic differences in the overall efficiency of the miRNA biogenesis pathway between individuals. To explore the genetics of this quantitative trait further, we carried out a genome-wide association study of the mean RE-score separately in two HapMap populations (CEU and YRI). No genome-wide significant associations were discovered; however, a SNP, rs17409624, in an intron of DROSHA, was significantly associated with mean RE-score in the CEU population following permutation-based control for multiple testing based on all SNPs mapped to the canonical miRNA biogenesis pathway; of 244 individual miRNA RE-scores assessed in the CEU, 214 were associated (p < 0.05) with rs17409624. The SNP was also nominally significantly associated (p = 0.04) with mean RE-score in the YRI population. Interestingly, the same SNP was associated with the expression levels of 17 miRNAs (8.5% of all expressed miRNAs) in the CEU. We also show here that the expression of the targets of most miRNAs is more highly correlated with global changes in miRNA regulatory effect than with the expression of the miRNA itself. CONCLUSIONS: We present evidence that miRNA regulatory effect is a heritable trait in humans and that a polymorphism of the DROSHA gene contributes to the observed inter-individual differences
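
    The RE-score definition quoted in the abstract above (a difference in expression rank between a miRNA's targets and non-targeted genes) can be illustrated with a small sketch. The abstract does not give the exact formula, so the sign convention and the use of mean ranks here are assumptions: the score is taken as the mean rank of non-targets minus the mean rank of targets, with rank 1 assigned to the lowest-expressed gene. The function names `rank_genes` and `re_score` are hypothetical.

```python
from statistics import mean


def rank_genes(expression):
    """Map each gene to its expression rank (1 = lowest); ties share the average rank."""
    ordered = sorted(expression, key=expression.get)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover the run of genes tied with ordered[i].
        j = i
        while j + 1 < len(ordered) and expression[ordered[j + 1]] == expression[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ordered[k]] = average_rank
        i = j + 1
    return ranks


def re_score(expression, targets):
    """Assumed RE-score: mean rank of non-targets minus mean rank of predicted targets.

    Under this convention a positive score means the targets sit lower in the
    expression ranking than other genes, i.e. stronger apparent repression.
    """
    ranks = rank_genes(expression)
    target_ranks = [r for gene, r in ranks.items() if gene in targets]
    other_ranks = [r for gene, r in ranks.items() if gene not in targets]
    if not target_ranks or not other_ranks:
        raise ValueError("need at least one target and one non-target gene")
    return mean(other_ranks) - mean(target_ranks)


if __name__ == "__main__":
    # Toy sample: two lowly expressed predicted targets, two other genes.
    sample = {"g1": 1.0, "g2": 2.0, "g3": 3.0, "g4": 4.0}
    print(re_score(sample, {"g1", "g2"}))
```

    Computing this score per individual, and then correlating it between parents and offspring, would be one way to mirror the heritability analysis described; the statistical tests used in the paper are not reproduced here.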

    Machine learning and data mining frameworks for predicting drug response in cancer: An overview and a novel in silico screening process based on association rule mining

    Analysis of gene regulation using high throughput genomics

    The recent development of high-throughput genomics techniques and their subsequent applications have completely transformed the study of biology. The analysis, interpretation and storage of the resulting large volumes of data have created a wide range of computational challenges and opportunities that have driven the majority of recent bioinformatics research. In this thesis we focus on four research questions grounded in functional genomics and epigenomics, yielding novel methodologies and biological insights. The first research question relates to whether miRNA activity, as a general regulatory effect, is a heritable trait. To address this, we used Affymetrix Human Exon Microarray and RNA-seq data from the International HapMap project. We confirmed such an association in humans using the regulatory effect score (RE-score) of a miRNA, which has previously been defined as the difference in the gene expression rank of targets of the miRNA compared to non-targeted genes. We also identified a SNP in the miRNA processing gene DROSHA, which is associated with inter-individual difference in miRNA regulatory effect. During this analysis we noted that correlations between gene expression measures from RNA-seq and gene expression microarray platforms were often relatively poor. This led us to develop a method to improve the estimation of gene expression from microarrays. Our method uses samples for which there is both microarray and RNA-seq data available and builds statistical models which learn the relationship between probe level gene expression, as measured by the microarrays, and gene level expression, as measured by RNA-seq. These models can then be used to estimate gene expression on separate sets of microarray samples. We have assessed the performance of our method in comparison to Affymetrix Power Tools (APT). 
To do this, we fitted models for all genes on a training set of the HapMap YRI samples and tested performance on the HapMap CEU (both microarray and RNA-seq data are available for all of these samples). Overall, our method improves within-sample correlations with RNA-seq substantially, but does not achieve the same level of performance as APT in terms of across-sample correlations. The third research question aimed to determine whether or not it was possible to ascertain a consistent pattern of differential methylation in a limited number of ulcerative colitis (UC) biopsies, using data generated with the Agilent Human CpG Island microarray. Although there were no statistically significant differences between the sample groups at CpG island or probe level, we did uncover evidence of overall CpG island hypermethylation in UC. Subsequently, gene set analysis (GSA) revealed highly significant results for several GO biological processes. It became apparent that these results were a consequence of a sampling effect, which stems from the large differences in numbers of probes (targeting CpG sites) associated with genes in different gene sets. The fourth and final research question consisted of the development of a method to correct the bias in GSA of these data. We applied our method to both the UC microarray dataset and a previously published genome-wide CpG island study of DNA methylation in lung cancer. We obtained novel biological insights into both of these conditions, consistent with their respective pathologies. Finally, we showed that this bias is also found with next-generation sequencing-based methylation assays, which we demonstrated using a HELP-seq dataset. In conclusion, this thesis presents novel analytical strategies encompassing gene expression and genome-wide methylation, and it also introduces methodologies that link microarray and RNA-seq measures of expression. 
It documents for the first time a correction for an intrinsic bias in GSA associated with many CpG island methylation platforms, and yields results of biological consequence with regard to endogenous RNAi regulatory processes and the epigenetic characterization of several human diseases

    Cancer eQTLs

    pRRophetic: an R package for prediction of clinical chemotherapeutic response from tumor gene expression levels.

    We recently described a methodology that reliably predicted chemotherapeutic response in multiple independent clinical trials. The method worked by building statistical models from gene expression and drug sensitivity data in a very large panel of cancer cell lines, then applying these models to gene expression data from primary tumor biopsies. Here, to facilitate the development and adoption of this methodology we have created an R package called pRRophetic. This also extends the previously described pipeline, allowing prediction of clinical drug response for many cancer drugs in a user-friendly R environment. We have developed several other important use cases; as an example, we have shown that prediction of bortezomib sensitivity in multiple myeloma may be improved by training models on a large set of neoplastic hematological cell lines. We have also shown that the package facilitates model development and prediction using several different classes of data

    Additional file 1 of Seq-ing improved gene expression estimates from microarrays using machine learning

    Additional information containing online methods is provided as a PDF file. Figure S1. The effect of training size on predictions. A test set of 100 samples was used for all training sample sizes. Figure S2. Out-of-bag (OOB) filtering. Scatter plot of cross-sample correlation coefficients from OOB gene expression estimates and estimates obtained on test data for 1,000 randomly selected genes. Figure S3. Comparison of differential expression results. Comparison of the performance of MaLTE and median-polish on the problem of detecting differential gene expression between five heart-muscle and five skeletal-muscle samples in the GTEx dataset. Each method results in a list of genes, ranked by the q-value from the comparison of the gene expression level in the two groups of samples. We used the (a) cumulative Jaccard index and (b) concordance correlation to compare the similarity as a function of rank and the overall similarity, respectively, between lists of genes ranked by MaLTE/median-polish and by RNA-Seq. The set of genes assessed was defined by RNA-Seq: genes with RPKM above one in both tissues. Bootstrap re-sampling (100 pseudo-replicates) was used to assess the effect of sampling error for both cumulative Jaccard index and concordance correlations. Results of bootstrapping are shown as faint lines around the observed Jaccard index and yield the density plots shown in (b). Log of fold change estimates for (c) 120 differentially expressed genes common to MaLTE and median-polish and (d) MaLTE and PLIER for six common genes. Figure S4. Comparison of 10-sample DGE to 44-sample DGE. (a) Self-comparisons (e.g. MaLTE 10-sample DGE to MaLTE 44-sample DGE, with five and 22 samples in each tissue, respectively). Forty-four-sample DGE results are treated as putative true differential expressions at the indicated (top right of each plot) false discovery rates (FDRs). FDRs were computed using the q-value method (Storey and Tibshirani 2003). Comparisons are made using the Jaccard index. (b) Comparison of each method to 44-sample DGE using RNA-Seq. Figure S5. Transcript isoform expression estimates. Densities of (a) Pearson and (b) Spearman cross-sample correlations for transcript isoform expression estimates obtained using MaLTE. Filtered data corresponds to transcript isoforms with rOOB>0. Figure S6. Box plots of slopes β computed by linearly regressing RNA-Seq against array method. Dotted red line indicates unit gradient. Each box plot represents 12 slopes for brain samples. RNA-Seq expression is restricted to RPKM between one and 1000, a range representing 78% of genes quantified, because of high uncertainty at low values and few genes at high values. Using all genes results in the same medians but wider variance in slopes due to outlying genes. Figure S7. Estimation of tissue mixture proportions. (a) MaLTE, (b) median-polish (RMA), and (c) PLIER. Each plot shows comparisons of true and estimated proportions with key statistics indicated in the legends
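
    The cumulative Jaccard index mentioned for Figure S3 compares two ranked gene lists as a function of rank. A minimal sketch follows, assuming the index at depth k is the Jaccard similarity of the top-k genes of each list; the supplementary text does not spell out the computation, so this reading and the function name `cumulative_jaccard` are assumptions.

```python
def cumulative_jaccard(ranked_a, ranked_b):
    """Return the Jaccard index of the top-k prefixes of two ranked lists, for k = 1..n.

    n is the length of the shorter list. Each list is assumed to contain
    unique gene identifiers ordered from most to least significant.
    """
    depth = min(len(ranked_a), len(ranked_b))
    seen_a, seen_b = set(), set()
    scores = []
    for k in range(depth):
        # Grow both prefixes by one gene, then compare them as sets.
        seen_a.add(ranked_a[k])
        seen_b.add(ranked_b[k])
        shared = len(seen_a & seen_b)
        total = len(seen_a | seen_b)
        scores.append(shared / total)
    return scores


if __name__ == "__main__":
    # Hypothetical gene rankings from an array-based method and from RNA-Seq.
    array_ranking = ["A", "B", "C", "D"]
    rnaseq_ranking = ["B", "A", "C", "E"]
    print(cumulative_jaccard(array_ranking, rnaseq_ranking))
```

    Bootstrap re-sampling, as described for the figure, could be layered on top by recomputing the rankings on resampled data and calling the function on each pseudo-replicate.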