482 research outputs found

    SEESAW: detecting isoform-level allelic imbalance accounting for inferential uncertainty.

    Detecting allelic imbalance at the isoform level requires accounting for inferential uncertainty, caused by multi-mapping of RNA-seq reads. Our proposed method, SEESAW, uses Salmon and Swish to offer analysis at various levels of resolution, including the gene level, the isoform level, and isoforms aggregated into groups by transcription start site. The aggregation strategies strengthen the signal for transcripts with high uncertainty. The SEESAW suite of methods is shown to have higher power than other allelic imbalance methods when there is isoform-level allelic imbalance. We also introduce a new test for detecting imbalance that varies across a covariate, such as time.

    Highly accurate quantification of allelic gene expression for population and disease genetics

    Publisher Copyright: © 2022 Saukkonen et al. Analysis of allele-specific gene expression (ASE) is a powerful approach for studying gene regulation, particularly when sample sizes are small, such as for rare diseases, or when studying the effects of rare genetic variation. However, detection of ASE events relies on accurate alignment of RNA sequencing reads, where challenges still remain, particularly for reads containing genetic variants or those that align to many different genomic locations. We have developed the Personalised ASE Caller (PAC), a tool that combines multiple steps to improve the quantification of allelic reads, including personalized (i.e., diploid) read alignment with improved allocation of multimapping reads. Using simulated RNA sequencing data, we show that PAC outperforms standard alignment approaches for ASE detection, reducing the number of sites with incorrect biases (>10%) by ∼80% and increasing the number of sites that can be reliably quantified by ∼3%. Applying PAC to real RNA sequencing data from 670 whole-blood samples, we show that genetic regulatory signatures inferred from ASE data more closely match those from population-based methods that are less prone to alignment biases. Finally, we use PAC to characterize cell type–specific ASE events that would be missed by standard alignment approaches, and in doing so identify disease relevant genes that may modulate their effects through the regulation of gene expression. PAC can be applied to the vast quantity of existing RNA sequencing data sets to better understand a wide array of fundamental biological and disease processes. Peer reviewed

    Functional analysis of genetic risk markers

    Regulatory variants are the main factors responsible for genetic predisposition, for example in how humans react differently to the environment. Therefore, it is important to locate them and measure their effects, which can lead to pre-disease intervention, new drugs, or personalized medicine, where the selection and dose of a drug are based on a person’s genetic profile. In this thesis we have investigated the potential to link genetic markers to transcription using allele specific expression (ASE), which can avoid the influence of both population stratification bias and trans-factors, increasing the statistical power compared to total RNA based linkage methods. To quantify expression levels, we have used RNA-sequencing, which automatically makes it possible to measure ASE, provided that there is a heterozygous variant within the transcribed fragment, which in turn makes it possible to discern the expression between the two alleles. RNA sequencing data tend to be complex and must be summarized into count measures before being analyzed further for ASE. To facilitate this process and provide additional analytical support, we developed the software AllelicImbalance, which is now freely accessible within Bioconductor, a bioinformatics repository for code and data. Using this software we investigated ASE behavior on the individual level of a single transcribed variant, within a gene, and for connections between an ASE event and known risk markers, previously established from Genome Wide Association Studies (GWAS). In a dataset of 10 individuals, we showed, by measuring consistent ASE over consecutive exons within the same gene, that an ASE signature is robust against dissimilarities in sequence. Further, because we showed that ASE stability covered several SNPs, we established that short read sequencing is not a fundamental obstacle to the implementation of this technique. However, more individuals were needed to better assess a link to genetic variants.
We continued our analysis in a larger dataset, in which one of the sequenced tissues had a representation of 680 individuals. This was enough to measure ASE as a regression of allelic fraction by genotype (aeQTL), conceptually similar to the regression of expression by genotype commonly used in eQTL studies. In this data we were able to explain novel risk SNPs using the aeQTL method, and showed that any bias for the reference allele had no significant effect on the regression. We moved on to test if aeQTL could pick up unique signals for 205 individuals in a tissue previously investigated for eQTL using a large cohort of more than 5000 individuals. Indeed, we detected 15 novel aeQTLs, which were probably masked by trans-regulation in the previous investigation. In addition, we describe the software ClusterSignificance, which tests for separation of groups in data with reduced dimensionality. The algorithm brings statistical rigor to a task previously done by visual inspection. This thesis gives an overview of the progress made by us and others in ASE investigation, which is becoming more than just a complement to eQTL analysis. The future signals a more dominant role as more sequencing data becomes readily available, accessing the closest active link to cis-regulation.

    Is it time to change the reference genome?

    The use of the human reference genome has shaped methods and data across modern genomics. This has offered many benefits while creating a few constraints. In the following opinion, we outline the history, properties, and pitfalls of the current human reference genome. In a few illustrative analyses, we focus on its use for variant-calling, highlighting its nearness to a 'type specimen'. We suggest that switching to a consensus reference would offer important advantages over the continued use of the current reference with few disadvantages.

    Genomic data analysis workflows for tumors from patient-derived xenografts (PDXs): challenges and guidelines.

    BACKGROUND: Patient-derived xenograft (PDX) models are in vivo models of human cancer that have been used for translational cancer research and therapy selection for individual patients. The Jackson Laboratory (JAX) PDX resource comprises 455 models originating from 34 different primary sites (as of 05/08/2019). The models undergo rigorous quality control and are genomically characterized to identify somatic mutations, copy number alterations, and transcriptional profiles. Bioinformatics workflows for analyzing genomic data obtained from human tumors engrafted in a mouse host (i.e., Patient-Derived Xenografts; PDXs) must address challenges such as discriminating between mouse and human sequence reads and accurately identifying somatic mutations and copy number alterations when paired non-tumor DNA from the patient is not available for comparison. RESULTS: We report here data analysis workflows and guidelines that address these challenges and achieve reliable identification of somatic mutations, copy number alterations, and transcriptomic profiles of tumors from PDX models that lack genomic data from paired non-tumor tissue for comparison. Our workflows incorporate commonly used software and public databases but are tailored to address the specific challenges of PDX genomics data analysis through parameter tuning and customized data filters and result in improved accuracy for the detection of somatic alterations in PDX models. We also report a gene expression-based classifier that can identify EBV-transformed tumors. We validated our analytical approaches using data simulations and demonstrated the overall concordance of the genomic properties of xenograft tumors with data from primary human tumors in The Cancer Genome Atlas (TCGA). 
CONCLUSIONS: The analysis workflows that we have developed to accurately predict somatic profiles of tumors from PDX models that lack normal tissue for comparison enable the identification of the key oncogenic genomic and expression signatures to support model selection and/or biomarker development in therapeutic studies. A reference implementation of our analysis recommendations is available at https://github.com/TheJacksonLaboratory/PDX-Analysis-Workflows

    Reducing Reference Bias in Genomic Sequence Data Processing

    A reference genome facilitates genomic sequence data processing by serving as a matching template and providing a coordinate system. Despite the benefits, differences between the reference and donor genomes can result in "reference biases." Two major sources of reference bias are lack of genetic diversity and assembly artifacts. This thesis first presents computational methods that reduce reference bias by incorporating genetic diversity. We discuss an alignment method using population references to achieve an alignment accuracy close to that of a personalized solution. We develop an efficient alignment lift-over software to convert alignments from a customized genomic coordinate system to a standard one. We adapt a widely-used variant caller to consider genetic diversity and show substantial variant calling improvements. Further, we leverage the first complete human genome to reduce reference biases caused by assembly artifacts. We describe an improved lift-over method that handles structural variations between two references. We apply a selective strategy to improve efficiency and reduce false positives. The approach mitigates reference biases substantially in hard-to-map regions.

    Using Pan-Genomic Data Structures to Incoporate Diversity Into Genomic Analyses

    The alignment of sequencing reads to the reference genome is a process subject to reference bias, a phenomenon where reads containing alternative alleles have a smaller likelihood of aligning to the reference when compared to reads that are more similar to the reference. Because the human reference genome is largely composed of the genomic sequence of a single individual, it is apparent that either changing or modifying the representation of the reference genome in order to incorporate diversity from other individuals can reduce reference bias. We discuss methods for alleviating reference bias through the use of novel text indexing data structures and algorithms that can incorporate such diversity. First, we present data structures built on top of the Run-Length FM Index that can be used to index and query a pan-genome, i.e., a representation of the genome that incorporates known variation within the species. Then, we use pan-genome indexes in a workflow for constructing a personalized genome from a set of sequencing reads. This personalized genome can be used in lieu of the reference genome during alignment in order to alleviate reference bias. We also discuss how alignments against personalized genomes can be used in downstream analyses by "lifting" these alignments back over to the reference genome.