6 research outputs found

    Sources of bias in measures of allele-specific expression derived from RNA-seq data aligned to a single reference genome

    Abstract
    Background: RNA-seq can be used to measure allele-specific expression (ASE) by assigning sequence reads to individual alleles; however, relative ASE is systematically biased when sequence reads are aligned to a single reference genome. Aligning sequence reads to both parental genomes can eliminate this bias, but this approach is not always practical, especially for non-model organisms. To improve accuracy of ASE measured using a single reference genome, we identified properties of differentiating sites responsible for biased measures of relative ASE.
    Results: We found that clusters of differentiating sites prevented sequence reads from an alternate allele from aligning to the reference genome, causing a bias in relative ASE favoring the reference allele. This bias increased with greater sequence divergence between alleles. Increasing the number of mismatches allowed when aligning sequence reads to the reference genome and restricting analysis to genomic regions with fewer differentiating sites than the number of mismatches allowed almost completely eliminated this systematic bias. Accuracy of allelic abundance was increased further by excluding differentiating sites within sequence reads that could not be aligned uniquely within the genome (imperfect mappability) and reads that overlapped one or more insertions or deletions (indels) between alleles.
    Conclusions: After aligning sequence reads to a single reference genome, excluding differentiating sites with at least as many neighboring differentiating sites as the number of mismatches allowed, imperfect mappability, and/or an indel(s) nearby resulted in measures of allelic abundance comparable to those derived from aligning sequence reads to both parental genomes.
    http://deepblue.lib.umich.edu/bitstream/2027.42/112895/1/12864_2013_Article_5263.pd
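The exclusion rule stated in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. The `Site` record, its `mappable` and `near_indel` flags, the function names, and the default read length of 36 are all assumptions introduced here. A "neighboring" site is assumed to mean any other differentiating site close enough to share a read with the focal site, and all sites are assumed to lie on one chromosome.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Site:
    """Hypothetical record for one differentiating site between two alleles."""
    pos: int                  # coordinate on a single chromosome
    mappable: bool = True     # False if overlapping reads do not align uniquely
    near_indel: bool = False  # True if an indel between alleles lies nearby


def count_neighbors(sites: List[Site], index: int, read_length: int) -> int:
    """Count other sites that could fall on the same read as sites[index]."""
    focal = sites[index].pos
    return sum(
        1
        for j, other in enumerate(sites)
        if j != index and abs(other.pos - focal) < read_length
    )


def filter_sites(sites: List[Site], max_mismatches: int,
                 read_length: int = 36) -> List[Site]:
    """Keep sites that pass the three exclusion criteria from the abstract."""
    ordered = sorted(sites, key=lambda s: s.pos)
    kept = []
    for i, site in enumerate(ordered):
        # Exclude imperfect mappability and sites with an indel nearby.
        if not site.mappable or site.near_indel:
            continue
        # Exclude sites with at least as many neighbors as allowed mismatches.
        if count_neighbors(ordered, i, read_length) >= max_mismatches:
            continue
        kept.append(site)
    return kept


example = [
    Site(100), Site(110), Site(120),  # a cluster of three nearby sites
    Site(500),                        # an isolated site
    Site(900, mappable=False),
    Site(1500, near_indel=True),
]
print([s.pos for s in filter_sites(example, max_mismatches=2)])
print([s.pos for s in filter_sites(example, max_mismatches=3)])
```

With two mismatches allowed, each site in the cluster has two neighbors and is dropped. Allowing three mismatches retains the cluster, which mirrors the abstract's point that a higher mismatch allowance makes more sites usable.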

    Correcting Reference Bias in High-throughput Sequencing Analysis

    Mapping reads to a reference sequence is a common step when analyzing high throughput sequencing data. The choice of reference is critical because its effect on quantitative sequence analysis is non-negligible. Recent studies suggest aligning to a single standard reference sequence, as is common practice, can lead to an underlying bias depending on the genetic distances of the target sequences from the reference. To avoid this bias researchers have resorted to using modified reference sequences. Even with this improvement, various limitations and problems remain unsolved, which include reduced mapping ratios, shifts in read mappings, and the selection of which variants to include to remove biases. To address these issues, I proposed novel and generic pipelines that integrate the genomic variations from known or suspected founders into reference sequences and then perform read alignment. Experiments show that my pipelines can align more reads with much lower reference bias than the traditional pipeline where reads are mapped against the standard reference sequence. They can be applied to a wide range of organisms, including inbreds, F1s, and outbreds, and various high throughput sequencing approaches, such as RNAseq, DNAseq, ChIPseq, etc.
    Doctor of Philosophy
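The idea of a modified reference sequence mentioned in this abstract can be sketched in its simplest form: substituting known single-nucleotide variants into a reference string before alignment. This is an illustrative assumption, not the thesis pipeline, which also has to deal with issues such as shifts in read mappings that SNP-only substitution avoids. The function name `apply_snps` and the 0-based variant format are hypothetical.

```python
from typing import Dict, Tuple


def apply_snps(reference: str, snps: Dict[int, Tuple[str, str]]) -> str:
    """Return a copy of `reference` with SNPs substituted in.

    `snps` maps a 0-based position to (expected reference base, alternate base).
    Only substitutions are handled, so coordinates do not shift.
    """
    seq = list(reference)
    for pos, (ref_base, alt_base) in snps.items():
        # Guard against variants that do not match the given reference.
        if seq[pos] != ref_base:
            raise ValueError(
                f"reference has {seq[pos]!r} at {pos}, expected {ref_base!r}"
            )
        seq[pos] = alt_base
    return "".join(seq)


founder_reference = apply_snps("ACGTACGT", {1: ("C", "T"), 6: ("G", "A")})
print(founder_reference)
```

Reads from a sample carrying these variants would then match the modified sequence without mismatches at the substituted positions.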

    The Evolution of Gene Regulation in Drosophila.

    Differences in gene expression drive phenotypic diversity. At the level of transcription, these differences are largely controlled by the complex interplay between trans-acting factors and the cis-regulatory sequences to which they bind. In this dissertation, I characterized how the regulation of gene expression has evolved both within and between several species of the Drosophila lineage. I began by describing current methodology to accurately quantify allele-specific expression (ASE) from RNA-seq data, comparing two different methods and highlighting sources of bias. Using this methodology, I measured allele-specific differences in F1 hybrids made by reciprocally crossing two strains of D. melanogaster. These two sets of genetically-identical hybrids differed only by which strain contributed the maternal or paternal allele, allowing me to test the hypothesis that D. melanogaster do not imprint their genome. Next, for that same intraspecific comparison as well as two interspecific comparisons, I measured total and allele-specific gene expression to categorize regulatory differences across divergence times ranging from 0.01-2.5 million years ago. This allowed me to test the hypothesis that cis-regulatory differences account for a higher proportion of the total regulatory differences between species, as well as to determine how patterns for inheritance of gene expression differ across an evolutionary timescale. Because all of these comparisons were made using female whole flies, I tested the prevalence of sex- and tissue-specific differences using gene expression data from female and male carcass and gonad tissues between D. pseudoobscura and its closely-related subspecies D. p. bogotana and their F1 hybrids. I determined that one must use caution when inferring patterns of regulatory divergence in whole flies, as the integration over all different tissue types can mask the complexity of gene regulation in individual tissues. 
    The work in this dissertation expands our knowledge of how the regulation of gene expression differs across a well-characterized lineage and will continue to drive further studies of these phenomena in even more distantly-related species.
    PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/110328/4/Kraig_Stevenson_Dissertation.pd