15 research outputs found

    RNA and DNA Sequence analysis of the human transcriptome

    The manifestation of phenotype at the cellular and organismal level is determined in large part by gene expression, or the transcription of DNA into RNA. As such, the study of the transcriptome, or the characterization and quantification of all RNA produced in the cell, is important. Recent advances in sequencing technology have allowed for unprecedented interrogation of the transcriptome at single-nucleotide resolution. In the first part of this thesis, we use RNA-Sequencing (RNA-Seq) to study the human B-cell transcriptome and determine the experimental parameters necessary for sequencing-based studies of gene expression. We discover that deep sequencing is necessary to detect fully and quantify accurately the complexity of human transcriptomes. Furthermore, we find that at high sequencing depths, the vast majority of transcribed elements in human B-cells are detected. In the second part of this thesis, we utilize the sequence information provided by RNA-Seq to analyze systematic differences between DNA and RNA sequence. The transmission of information from DNA to RNA is a critical process and is expected to occur in a one-to-one fashion. By comparing the DNA sequence to RNA sequence of the same individuals, we found all 12 types of RNA-DNA sequence differences (RDDs), the majority of which cannot be explained by known mechanisms such as RNA editing or transcriptional infidelity. We developed computational methods to robustly identify RDDs and control for false positives resulting from genotyping, sequencing, and alignment error. Finally, we explore the genetic basis of RDD levels, or the proportion of reads at a site bearing the sequence difference allele. In particular, we analyzed the levels of RNA editing in unrelated and related individuals and found that a significant portion of individual variation in A-to-G editing levels contains a genetic component. 
    In summary, our results demonstrate that RNA-Seq is a powerful technique for comprehensive and quantitative analysis of gene expression. In addition, the resolution offered by RNA-Seq enables a detailed view of sequence differences between RNA and DNA. Future work will focus on understanding the mechanisms and factors influencing RDDs. Our results suggest that RDD levels may be considered a quantitative and heritable phenotype; as such, a genetic approach may be a sensible method for finding the determinants and mechanism of RDDs.

    RNA-sequence analysis of human B-cells

    RNA-sequencing (RNA-seq) allows quantitative measurement of expression levels of genes and their transcripts. In this study, we sequenced complementary DNA fragments of cultured human B-cells and obtained 879 million 50-bp reads comprising 44 Gb of sequence. The results allowed us to study the gene expression profile of B-cells and to determine experimental parameters for sequencing-based expression studies. We identified 20,766 genes and 67,453 of their alternatively spliced transcripts. More than 90% of the genes with multiple exons are alternatively spliced; for most genes, one isoform is predominantly expressed. We found that while chromosomes differ in gene density, the percentage of transcribed genes in each chromosome is less variable. In addition, genes involved in related biological processes are expressed at more similar levels than genes with different functions. Besides characterizing gene expression, we also used the data to investigate the effect of sequencing depth on gene expression measurements. While 100 million reads are sufficient to detect most expressed genes and transcripts, about 500 million reads are needed to measure their expression levels accurately. We provide examples in which deep sequencing is needed to determine the relative abundance of genes and their isoforms. With data from 20 individuals and about 40 million sequence reads per sample, we uncovered only 21 alternatively spliced, multi-exon genes that are not in databases; this result suggests that at this sequence coverage, we can detect most of the known genes. Results from this project are available on the UCSC Genome Browser to allow readers to study the expression and structure of genes in human B-cells.
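As a quick consistency check on the sequencing yield quoted above, the total base count follows from the read count times the read length. The sketch below uses only the rounded figures from the abstract (879 million reads, 50 bp each), so the result is approximate; the variable names are illustrative.

```python
# Rounded figures taken from the abstract; the exact read count is not given.
reads = 879_000_000      # about 879 million reads
read_length_bp = 50      # 50-bp reads

total_bases = reads * read_length_bp
total_gb = total_bases / 1e9  # gigabases (1 Gb = 1e9 bases)

print(f"{total_gb:.2f} Gb")  # 43.95 Gb, which rounds to the reported 44 Gb
```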

    Detection Theory in Identification of RNA-DNA Sequence Differences Using RNA-Sequencing

    Advances in sequencing technology have allowed for detailed analyses of the transcriptome at single-nucleotide resolution, facilitating the study of RNA editing or sequence differences between RNA and DNA genome-wide. In humans, two types of post-transcriptional RNA editing processes are known to occur: A-to-I deamination by ADAR and C-to-U deamination by APOBEC1. In addition to these sequence differences, researchers have reported the existence of all 12 types of RNA-DNA sequence differences (RDDs); however, the validity of these claims is debated, as many studies claim that technical artifacts account for the majority of these non-canonical sequence differences. In this study, we used a detection theory approach to evaluate the performance of RNA-Sequencing (RNA-Seq) and associated aligners in accurately identifying RNA-DNA sequence differences. By generating simulated RNA-Seq datasets containing RDDs, we assessed the effect of alignment artifacts and sequencing error on the sensitivity and false discovery rate of RDD detection. Overall, we found that even in the presence of sequencing errors, false negative and false discovery rates of RDD detection can be contained below 10% with relatively lenient thresholds. We also assessed the ability of various filters to target false positive RDDs and found them to be effective in discriminating between true and false positives. Lastly, we used the optimal thresholds we identified from our simulated analyses to identify RDDs in a human lymphoblastoid cell line. We found approximately 6,000 RDDs, the majority of which are A-to-G edits and likely to be mediated by ADAR. Moreover, we found the majority of non A-to-G RDDs to be associated with poorer alignments and conclude from these results that the evidence for widespread non-canonical RDDs in humans is weak. Overall, we found RNA-Seq to be a powerful technique for surveying RDDs genome-wide when coupled with the appropriate thresholds and filters.
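The detection-theory quantities named in this abstract (sensitivity, false negative rate, false discovery rate) can be illustrated with a minimal sketch. It assumes that simulated (true) RDD sites and called sites are each represented as a set of `(chromosome, position)` tuples; that representation, the function name `detection_metrics`, and the toy data are hypothetical and not taken from the study.

```python
def detection_metrics(simulated_sites, called_sites):
    """Compare called RDD sites against the simulated truth set."""
    simulated = set(simulated_sites)
    called = set(called_sites)

    tp = len(simulated & called)   # simulated sites that were called
    fn = len(simulated - called)   # simulated sites that were missed
    fp = len(called - simulated)   # called sites that were never simulated

    sensitivity = tp / (tp + fn) if simulated else float("nan")
    fdr = fp / (tp + fp) if called else 0.0
    return {
        "sensitivity": sensitivity,
        "false_negative_rate": 1.0 - sensitivity,
        "false_discovery_rate": fdr,
    }


# Toy data (hypothetical): four simulated sites, four calls, three in common.
truth = {("chr1", 100), ("chr1", 200), ("chr2", 50), ("chr3", 75)}
calls = {("chr1", 100), ("chr1", 200), ("chr2", 50), ("chr5", 10)}
metrics = detection_metrics(truth, calls)
print(metrics)
```

In this toy case three of four simulated sites are recovered and one of four calls is spurious, so sensitivity is 0.75 and both the false negative rate and the false discovery rate are 0.25.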

    Number of RNA-DNA sequence differences removed by various bioinformatics filters.

    Number of RNA-DNA sequence differences removed by various bioinformatics filters.

    False discovery rate of RNA-DNA sequence difference detection.

    Here we depict the false discovery rate of RNA-DNA sequence difference detection under various thresholds on the coverage, level of sequence difference, and number of reads bearing the sequence difference base per the aligner. Calculations are averaged across the three replicates and error bars represent standard deviation values.
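The three thresholds mentioned in this caption (coverage, level of sequence difference, and number of reads bearing the difference base) can be sketched as a simple site filter. The `Site` class, the `passes_thresholds` function, and the default cutoff values below are hypothetical placeholders for illustration; the listing does not state the thresholds the authors selected. The level is computed as the fraction of reads at a site bearing the sequence difference allele, following the definition given elsewhere in this listing.

```python
from dataclasses import dataclass


@dataclass
class Site:
    coverage: int    # total reads covering the site
    diff_reads: int  # reads bearing the sequence difference base

    @property
    def level(self) -> float:
        """Fraction of reads at the site bearing the difference allele."""
        return self.diff_reads / self.coverage if self.coverage else 0.0


def passes_thresholds(site, min_coverage=10, min_level=0.10, min_diff_reads=2):
    """Apply all three thresholds; the default values are illustrative only."""
    return (
        site.coverage >= min_coverage
        and site.level >= min_level
        and site.diff_reads >= min_diff_reads
    )


candidates = [Site(20, 5), Site(8, 4), Site(100, 5), Site(50, 1)]
kept = [s for s in candidates if passes_thresholds(s)]
print(kept)
```

Only the first toy site survives: the second fails the coverage cutoff, the third has a level of 5%, and the fourth has a single supporting read.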

    Sensitivity of RDD detection versus the simulated RDD level.

    Here we depict the true positive rate of RDD detection versus the simulated RDD level, or the percentage of reads at the site bearing the sequence difference allele. A minimum of 1 read bearing the RNA-DNA sequence difference is sufficient for a site to be deemed correctly identified. Sites with coverage less than 10x per the simulated RNA-Seq dataset are removed from consideration.

    Sensitivity of RNA-DNA sequence difference detection versus coverage.

    The sensitivity or true positive rate of RNA-DNA sequence difference identification is shown versus various thresholds on the minimum depth of coverage required at the site of the simulated difference. For all four aligners, the true positive rate increases sharply upon raising the minimum depth of coverage required for detection from 0x to approximately 50x, after which it plateaus.

    Simulated versus observed levels of RNA-DNA sequence differences.

    Here we plot the simulated RDD level versus the observed level as determined by GSNAP, MapSplice, RUM, or Tophat for replicate 1. Sites with coverage less than 10x or an RDD level less than 10% per the simulated dataset are removed from consideration. Overall, we observed the correlation between simulated and observed levels to be approximately 98% in both datasets and across the various aligners and replicates.
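The comparison described in this caption can be sketched as follows, under two assumptions that the caption does not confirm: that the reported correlation is a Pearson coefficient, and that each site is available as a `(coverage, simulated_level, observed_level)` tuple. The function names and the toy records are hypothetical. The filter mirrors the stated exclusions (coverage below 10x, or simulated level below 10%).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def filtered_correlation(records, min_coverage=10, min_level=0.10):
    """Drop low-coverage and low-level sites, then correlate the levels."""
    kept = [
        (sim, obs)
        for coverage, sim, obs in records
        if coverage >= min_coverage and sim >= min_level
    ]
    simulated, observed = zip(*kept)
    return pearson(simulated, observed)


# Toy records (hypothetical): (coverage, simulated level, observed level).
records = [
    (50, 0.2, 0.2),
    (30, 0.5, 0.5),
    (5, 0.9, 0.1),    # removed: coverage below 10x
    (40, 0.05, 0.9),  # removed: simulated level below 10%
    (20, 0.8, 0.8),
]
r = filtered_correlation(records)
print(round(r, 3))
```

The two discordant toy sites are excluded by the filter, so the remaining levels agree exactly and the coefficient is 1 up to floating-point rounding.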

    Distribution of RNA-DNA sequence differences in GM12878.

    Here we depict the distribution of RNA-DNA sequence differences in GM12878 after removing sites using various filters.