20 research outputs found

    Efficient Haplotype Block Matching in Bi-Directional PBWT

    Efficient haplotype matching search is of great interest now that large genotyped cohorts are becoming available. The Positional Burrows-Wheeler Transform (PBWT) enables efficient searching for blocks of haplotype matches. However, existing efficient PBWT algorithms sweep across the haplotype panel from left to right, capturing all exact matches. As a result, PBWT does not account for mismatches. It is also not easy to investigate the patterns of changes between the matching blocks. Here, we present an extension to PBWT, called bi-directional PBWT, that allows the information about the blocks of matches to be present at both sides of each site. We also present a set of algorithms to efficiently merge the matching blocks or examine the patterns of changes on both sides of each site. The time complexity of the algorithms to find and merge matching blocks using bi-directional PBWT is linear in the input size. Using real data from the UK Biobank, we demonstrate the run time and memory efficiency of our algorithms. More importantly, our algorithms can identify more blocks by tolerating mismatches. Moreover, by using mutual information (MI) between the forward and the reverse PBWT matching block sets as a measure of haplotype consistency, we found that the MI derived from European samples in the 1000 Genomes Project is highly correlated (Spearman correlation r=0.87) with the deCODE recombination map.
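
The left-to-right sweep mentioned in the abstract above can be sketched as follows. This is a minimal illustration of the standard forward PBWT update (positional prefix array plus divergence array) for bi-allelic data, not the bi-directional extension the paper proposes; the function name `pbwt` and the list-of-lists panel layout are assumptions made for this example.

```python
def pbwt(haps):
    """Forward PBWT sweep over a bi-allelic panel (hypothetical helper).

    haps: list of M haplotypes, each a list of N alleles (0 or 1).
    Returns (a, d) after the last site:
      a: haplotype indices sorted by their reversed prefixes
      d: d[i] is the first site from which a[i] and a[i-1] agree
         up to the end of the panel (N means no match at all)
    """
    m = len(haps)
    n = len(haps[0]) if m else 0
    a = list(range(m))   # initial order: input order
    d = [0] * m          # empty prefixes trivially match from site 0
    for k in range(n):
        a0, a1, d0, d1 = [], [], [], []
        # p / q: match start for the next haplotype placed in the 0 / 1 group
        p = q = k + 1
        for idx, div in zip(a, d):
            p = max(p, div)
            q = max(q, div)
            if haps[idx][k] == 0:
                a0.append(idx)
                d0.append(p)
                p = 0
            else:
                a1.append(idx)
                d1.append(q)
                q = 0
        # stable partition: all 0-allele haplotypes first, then 1-allele ones
        a, d = a0 + a1, d0 + d1
    return a, d
```

Each site costs one pass over the M haplotypes, so the whole sweep is linear in the size of the panel. Adjacent entries of `a` with a small `d` value are the long exact matches ending at the current site.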

    Multicolor CRISPR labeling of chromosomal loci in human cells

    The intranuclear location of genomic loci and the dynamics of these loci are important parameters for understanding the spatial and temporal regulation of gene expression. Recently it has proven possible to visualize endogenous genomic loci in live cells by the use of transcription activator-like effectors (TALEs), as well as modified versions of the bacterial immunity clustered regularly interspaced short palindromic repeat (CRISPR)/CRISPR-associated protein 9 (Cas9) system. Here we report the design of multicolor versions of CRISPR using catalytically inactive Cas9 endonuclease (dCas9) from three bacterial orthologs. Each pair of dCas9-fluorescent proteins and cognate single-guide RNAs (sgRNAs) efficiently labeled several target loci in live human cells. Using pairs of differently colored dCas9-sgRNAs, it was possible to determine the intranuclear distance between loci on different chromosomes. In addition, the fluorescence spatial resolution between two loci on the same chromosome could be determined and related to the linear distance between them on the chromosome's physical map, thereby permitting assessment of the DNA compaction of such regions in a live cell.

    CRISPR-Cas9 nuclear dynamics and target recognition in living cells

    The bacterial CRISPR-Cas9 system has been repurposed for genome engineering, transcription modulation, and chromosome imaging in eukaryotic cells. However, the nuclear dynamics of clustered regularly interspaced short palindromic repeats (CRISPR)-associated protein 9 (Cas9) guide RNAs and target interrogation are not well defined in living cells. Here, we deployed a dual-color CRISPR system to directly measure the stability of both Cas9 and guide RNA. We found that Cas9 is essential for guide RNA stability and that the nuclear Cas9-guide RNA complex levels limit the targeting efficiency. Fluorescence recovery after photobleaching measurements revealed that single mismatches in the guide RNA seed sequence reduce the target residence time from >3 h to as low as time

    Simultaneous Epigenetic Perturbation and Genome Imaging Reveal Distinct Roles of H3K9me3 in Chromatin Architecture and Transcription [preprint]

    Despite the long-observed correlation between H3K9me3, chromatin architecture and transcriptional repression, how H3K9me3 regulates genome higher-order organization and transcriptional activity in living cells remains unclear. Here we develop EpiGo (Epigenetic perturbation induced Genome organization)-KRAB to introduce H3K9me3 at hundreds of loci spanning megabases on human chromosome 19 and simultaneously track genome organization. EpiGo-KRAB is sufficient to induce de novo heterochromatin-like domain formation, which requires SETDB1, a methyltransferase of H3K9me3. Unexpectedly, the EpiGo-KRAB-induced heterochromatin-like domain does not result in widespread gene repression, except for a small set of genes with concurrent loss of H3K4me3 and H3K27ac. Ectopic H3K9me3 appears to spread in inactive regions but is largely restricted to transcriptional initiation sites in active regions. Finally, Hi-C analysis showed that EpiGo-KRAB reshapes existing compartments. These results reveal that the role of H3K9me3 in genome organization could be partially separated from its function in gene repression.

    Simultaneous epigenetic perturbation and genome imaging reveal distinct roles of H3K9me3 in chromatin architecture and transcription

    INTRODUCTION: Despite the long-observed correlation between H3K9me3, chromatin architecture, and transcriptional repression, how H3K9me3 regulates genome higher-order organization and transcriptional activity in living cells remains unclear. RESULT: Here, we develop EpiGo (Epigenetic perturbation induced Genome organization)-KRAB to introduce H3K9me3 at hundreds of loci spanning megabases on human chromosome 19 and simultaneously track genome organization. EpiGo-KRAB is sufficient to induce genomic clustering and de novo heterochromatin-like domain formation, which requires SETDB1, a methyltransferase of H3K9me3. Unexpectedly, the EpiGo-KRAB-induced heterochromatin-like domain does not result in widespread gene repression, except for a small set of genes with concurrent loss of H3K4me3 and H3K27ac. Ectopic H3K9me3 appears to spread in inactive regions but is largely restricted from transcriptional initiation sites in active regions. Finally, Hi-C analysis showed that EpiGo-KRAB reshapes existing compartments mainly at compartment boundaries. CONCLUSIONS: These results reveal that the role of H3K9me3 in genome organization could be partially separated from its function in gene repression.

    Analysis of Large-scale Population Genetic Data Using Efficient Algorithms and Data Structures

    With the availability of genotyping data of very large samples, there is an increasing need for tools that can efficiently identify genetic relationships among all individuals in the sample. Modern biobanks cover genotypes of up to 0.1%-1% of an entire large population. At this scale, genetic relatedness among samples is ubiquitous. However, current methods are not efficient for uncovering genetic relatedness at such a scale. We developed a new method, Random Projection for IBD Detection (RaPID), for detecting Identical-by-Descent (IBD) segments, a fundamental concept in genetics, in large panels. RaPID detects all IBD segments over a certain length in time linear in the sample size. We take advantage of an efficient population genotype index, the Positional BWT (PBWT), by Richard Durbin. PBWT achieves linear-time query of perfectly identical subsequences among all samples. However, the original PBWT is not tolerant of genotyping errors, which often break long IBD segments into short fragments. The key idea of RaPID is that the problem of approximate high-resolution matching over a long range can be mapped to the problem of exact matching of low-resolution subsampled sequences with high probability. PBWT provides an appropriate data structure for bi-allelic data. With increasing sample sizes, more multi-allelic sites are expected to be observed. Hence, there is a need to handle multi-allelic genotype data, and we also introduce a multi-allelic version of the original Positional Burrows-Wheeler Transform (mPBWT). The increasingly large cohorts of whole-genome genotype data present an opportunity to search a large cohort for people genetically related to a given individual. At the same time, doing so efficiently presents a challenge. The PBWT algorithm offers constant-time matching between one haplotype and an arbitrarily large panel at each position, but only for the maximal matches. We used the PBWT data structure to develop a method to search for all matches of a given query in a panel. The matches longer than a given length correspond to all shared IBD segments of certain lengths between the query and other individuals in the panel. The time complexity of the proposed method is independent of the number of individuals in the panel. To achieve a time complexity independent of the number of haplotypes, additional data structures are introduced. Some regions of the genome may be shared by multiple individuals rather than only a pair. Clusters of identical haplotypes could reveal information about the history of intermarriage and the isolation of a population, and could also be medically important. We propose an efficient method, called cPBWT, to find clusters of identical segments among individuals in a large panel using the PBWT data structure. The time complexity of finding all clusters of identical matches is linear in the sample size. The human genome harbors several runs of homozygous sites (ROHs) where identical haplotypes are inherited from each parent. We applied cPBWT to the UK Biobank and searched for clusters of ROH regions that are shared among multiple individuals. We discovered strong associations between ROH regions and some non-cancerous diseases, specifically auto-immune disorders.
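
The key idea stated in this abstract, mapping approximate matching at full resolution to exact matching on subsampled sequences, can be illustrated with a small sketch. It assumes one uniformly chosen site per non-overlapping window and several independent runs; the helper names (`sample_sites`, `project`, `matching_runs`) and parameters are hypothetical, and plain tuple comparison stands in for the PBWT-based exact matching the abstract describes.

```python
import random

def sample_sites(n_sites, window, rng):
    """Pick one random site index from each non-overlapping window."""
    return [rng.randrange(start, min(start + window, n_sites))
            for start in range(0, n_sites, window)]

def project(hap, sites):
    """Low-resolution copy of a haplotype: keep only the sampled sites."""
    return tuple(hap[s] for s in sites)

def matching_runs(hap_a, hap_b, window, runs, seed=0):
    """Count the runs in which the two projections match exactly."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        # the same sampled sites are used for both haplotypes in a run
        sites = sample_sites(len(hap_a), window, rng)
        if project(hap_a, sites) == project(hap_b, sites):
            hits += 1
    return hits
```

Under these assumptions, a single genotyping error inside a window of size w is sampled with probability 1/w in any one run, so two haplotypes that differ at only one site still match exactly in most projections, while unrelated haplotypes rarely match across many windows.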

    FiMAP: A fast identity-by-descent mapping test for biobank-scale cohorts.

    Although genome-wide association studies (GWAS) have identified tens of thousands of genetic loci, the genetic architecture is still not fully understood for many complex traits. Most GWAS and sequencing association studies have focused on single nucleotide polymorphisms or copy number variations, including common and rare genetic variants. However, phased haplotype information is often ignored in GWAS or variant set tests for rare variants. Here we leverage the identity-by-descent (IBD) segments inferred from a random projection-based IBD detection algorithm in the mapping of genetic associations with complex traits, to develop a computationally efficient statistical test for IBD mapping in biobank-scale cohorts. We used sparse linear algebra and random matrix algorithms to speed up the computation, and a genome-wide IBD mapping scan of more than 400,000 samples finished within a few hours. Simulation studies showed that our new method had well-controlled type I error rates under the null hypothesis of no genetic association in large biobank-scale cohorts, and outperformed traditional GWAS single-variant tests when the causal variants were untyped and rare, or in the presence of haplotype effects. We also applied our method to IBD mapping of six anthropometric traits using the UK Biobank data and identified a total of 3,442 associations, 2,131 (62%) of which remained significant after conditioning on suggestive tag variants in the ± 3 centimorgan flanking regions from GWAS

    RAFFI: Accurate and fast familial relationship inference in large scale biobank studies using RaPID.

    Inference of relationships from whole-genome genetic data of a cohort is a crucial prerequisite for genome-wide association studies. Typically, relationships are inferred by computing the kinship coefficients (ϕ) and the genome-wide probability of zero IBD sharing (π0) among all pairs of individuals. Current leading methods are based on pairwise comparisons, which may not scale up to very large cohorts (e.g., sample size >1 million). Here, we propose an efficient relationship inference method, RAFFI. RAFFI leverages the efficient RaPID method to call IBD segments first, then estimates ϕ and π0 from the detected IBD segments. This inference is achieved by a data-driven approach that adjusts the estimation based on phasing quality and genotyping quality. Using simulations, we showed that RAFFI is robust against phasing/genotyping errors, admixture events, and varying marker densities, and achieves higher accuracy compared to KING, the current leading method, especially for more distant relatives. When applied to the phased UK Biobank data with ~500K individuals, RAFFI is approximately 18 times faster than KING. We expect RAFFI will offer fast and accurate relatedness inference for even larger cohorts.
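
As a rough illustration of how ϕ and π0 relate to detected IBD segments, the sketch below uses the textbook identities ϕ = k1/4 + k2/2 and π0 = 1 − k1 − k2, where k1 and k2 are the fractions of the genome shared IBD1 and IBD2. It is not RAFFI's estimator: the data-driven adjustment for phasing and genotyping quality is not described in the abstract and is omitted. The function name and the `(start_cm, end_cm, state)` segment format are assumptions, and segments are assumed not to overlap.

```python
def relatedness_from_ibd(segments, genome_length_cm):
    """Estimate (phi, pi0) for one pair from non-overlapping IBD segments.

    segments: iterable of (start_cm, end_cm, state), state 1 = IBD1, 2 = IBD2
              (hypothetical input format for this sketch)
    genome_length_cm: total genetic length covered, in centimorgans
    """
    ibd1 = sum(end - start for start, end, state in segments if state == 1)
    ibd2 = sum(end - start for start, end, state in segments if state == 2)
    k1 = ibd1 / genome_length_cm    # fraction of genome sharing one haplotype
    k2 = ibd2 / genome_length_cm    # fraction sharing both haplotypes
    pi0 = 1.0 - k1 - k2             # fraction with zero IBD sharing
    phi = 0.25 * k1 + 0.5 * k2      # kinship coefficient
    return phi, pi0
```

For example, a pair sharing IBD1 across the whole genome (as a parent and child would) gives ϕ = 0.25 and π0 = 0 under these identities.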