17 research outputs found

    PRINCE: Accurate approximation of the copy number of tandem repeats

    Get PDF
    Variable-Number Tandem Repeats (VNTR) are genomic regions where a short sequence of DNA is repeated with no space in between repeats. While a fixed set of VNTRs is typically identified for a given species, the copy number at each VNTR varies between individuals within a species. Although VNTRs are found in both prokaryotic and eukaryotic genomes, the methodology called multi-locus VNTR analysis (MLVA) is widely used to distinguish different strains of bacteria, as well as cluster strains that might be epidemiologically related and investigate evolutionary rates. We propose PRINCE (Processing Reads to Infer the Number of Copies via Estimation), an algorithm that is able to accurately estimate the copy number of a VNTR given the sequence of a single repeat unit and a set of short reads from a whole-genome sequence (WGS) experiment. This is a challenging problem, especially in the cases when the repeat region is longer than the expected read length. Our proposed method computes a statistical approximation of the local coverage inside the repeat region. This approximation is then mapped to the copy number using a linear function whose parameters are fitted to simulated data. We test PRINCE on the genomes of three datasets of Mycobacterium tuberculosis strains and show that it is more than twice as accurate as a previous method. An implementation of PRINCE in the Python language is freely available at https://github.com/WGS-TB/PythonPRINCE

    WebSTR: a population-wide database of short tandem repeat variation in humans.

    Get PDF
    Short tandem repeats (STRs) are consecutive repetitions of one to six nucleotide motifs. They are hypervariable due to the high prevalence of repeat unit insertions or deletions primarily caused by polymerase slippage during replication. Genetic variation at STRs has been shown to influence a range of traits in humans, including gene expression, cancer risk, and autism. Until recently STRs have been poorly studied since they pose significant challenges to bioinformatics analyses. Moreover, genome-wide analysis of STR variation in population-scale cohorts requires large amounts of data and computational resources. However, the recent advent of genome-wide analysis tools has resulted in multiple large genome-wide datasets of STR variation spanning nearly two million genomic loci in thousands of individuals from diverse populations. Here we present WebSTR, a database of genetic variation and other characteristics of genome-wide STRs across human populations. WebSTR is based on reference panels of more than 1.7 million human STRs created with state of the art repeat annotation methods and can easily be extended to include additional cohorts or species. It currently contains data based on STR genotypes for individuals from the 1000 Genomes Project, H3Africa, the Genotype-Tissue Expression (GTEx) Project and colorectal cancer patients from the TCGA dataset. WebSTR is implemented as a relational database with programmatic access available through an API and a web portal for browsing data. The web portal is publicly available at http://webstr.ucsd.edu

    Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs.

    Get PDF
    Variable number tandem repeats (VNTRs) are composed of consecutive repetitive DNA with hypervariable repeat count and composition. They include protein coding sequences and associations with clinical disorders. It has been difficult to incorporate VNTR analysis in disease studies that use short-read sequencing because the traditional approach of mapping to the human reference is less effective for repetitive and divergent sequences. In this work, we solve VNTR mapping for short reads with a repeat-pangenome graph (RPGG), a data structure that encodes both the population diversity and repeat structure of VNTR loci from multiple haplotype-resolved assemblies. We develop software to build a RPGG, and use the RPGG to estimate VNTR composition with short reads. We use this to discover VNTRs with length stratified by continental population, and expression quantitative trait loci, indicating that RPGG analysis of VNTRs will be critical for future studies of diversity and disease

    A reference haplotype panel for genome-wide imputation of short tandem repeats.

    Get PDF
    Short tandem repeats (STRs) are involved in dozens of Mendelian disorders and have been implicated in complex traits. However, genotyping arrays used in genome-wide association studies focus on single nucleotide polymorphisms (SNPs) and do not readily allow identification of STR associations. We leverage next-generation sequencing (NGS) from 479 families to create a SNP + STR reference haplotype panel. Our panel enables imputing STR genotypes into SNP array data when NGS is not available for directly genotyping STRs. Imputed genotypes achieve mean concordance of 97% with observed genotypes in an external dataset compared to 71% expected under a naive model. Performance varies widely across STRs, with near perfect concordance at bi-allelic STRs vs. 70% at highly polymorphic repeats. Imputation increases power over individual SNPs to detect STR associations with gene expression. Imputing STRs into existing SNP datasets will enable the first large-scale STR association studies across a range of complex traits

    A study of the variability of minisatellite tandem repeat loci in the human genome based on high-throughput sequencing data

    Get PDF
    Variable Number Tandem Repeats (VNTRs) are repetitive sequences of DNA which exhibit polymorphism in the number of copies of the repeating pattern. As with the better known SNPs, CNVs, and other mutations, VNTRs are a form of variation in the genome. Diseases such as Fragile X syndrome, and even behavioral disorders, such as ADHD, have been attributed to VNTR polymorphisms, where changes in copy number affect chromosome and protein structure, and gene expression. Microsatellite (TRs with a pattern length <7 nt) VNTRs are well-characterized and have been used for DNA fingerprinting. Minisatellite VNTRs (pattern length ≥7 nt), however, are a relatively understudied source of genetic variation; computational complexity and the lack of specialized tools available make detecting and studying them difficult. The traditional method for examining these features involves targeted amplification and gel electrophoresis to distinguish array lengths. In this work, I discuss our effort to discover a comprehensive set of VNTRs using VNTRseek, a tool developed in Dr. Gary Benson's lab for detecting VNTRs in silico using whole genome sequencing reads. I further discuss the curation and analyses we have performed in order to build a researcher-oriented tool, the VNTRdb, which allows other researchers access to this work and enables them to perform similar analyses. Having a tool with which VNTRs can be detected with relative ease, alongside a well-curated resource for VNTR alleles, will help promote further research into how they may be related to complex diseases, natural variation, or other areas of study

    Characterizing VNTRs in human populations

    Get PDF
    Over half the human genome consists of repetitive sequences. One major class is the tandem repeats (TRs), which are defined by their location in the genome, repeat unit, and copy number. TRs loci that exhibit variant copy numbers are called Variable Number Tandem Repeats (VNTRs). High VNTR mutation rates of approximately 0.0001 per generation make them suitable for forensic studies, and of interest for potential roles in gene regulation and disease. TRs are generally divided into three classes: 1) microsatellites or short tandem repeats (STRs) with patterns 100 bp. To date, mini- and macrosatellites have been poorly characterized, mainly due to a lack of computational tools. In this thesis, I utilize a tool, VNTRseek, to identify human minisatellite VNTRs using short-read sequencing data from nearly 2,800 individuals and developed a new computational tool, MaSUD, to identify human macrosatellite VNTRs using data from 2,504 individuals. MaSUD is the first high-throughput tool to genotype macrosatellites using short reads. I identified over 35,000 minisatellite VNTRs and over 4,000 macrosatellite VNTRs, most previously unknown. A small subset in each VNTR class was validated experimentally and in silico. The detected VNTRs were further studied for their effects on gene expression, ability to distinguish human populations, and functional enrichment. Unlike STRs, mini- and macrosatellite VNTRs are enriched in regions with functional importance, e.g., introns, promoters, and transcription factor binding sites. A study of VNTRs across 26 populations shows that minisatellite VNTR genotypes can be used to predict super-populations with >90% accuracy. In addition, genotypes for 195 minisatellite VNTRs and 22 macrosatellite VNTRs were shown to be associated with differential expression in nearby genes (eQTLs). Finally, I developed a computational tool, mlZ, to infer undetected VNTR alleles and to detect false positive predictions. mlZ is applicable to other tools that use read support for predicting short variants. Overall, these studies provide the most comprehensive analysis of mini- and macrosatellites in human populations and will facilitate the application of VNTRs for clinical purposes

    A polymorphic transcriptional regulatory domain in the amyotrophic lateral sclerosis risk gene CFAP410 correlates with differential isoform expression

    Get PDF
    We describe the characterisation of a variable number tandem repeat (VNTR) domain within intron 1 of the amyotrophic lateral sclerosis (ALS) risk gene CFAP410 (Cilia and flagella associated protein 410) (previously known as C21orf2), providing insight into how this domain could support differential gene expression and thus be a modulator of ALS progression or risk. We demonstrated the VNTR was functional in a reporter gene assay in the HEK293 cell line, exhibiting both the properties of an activator domain and a transcriptional start site, and that the differential expression was directed by distinct repeat number in the VNTR. These properties embedded in the VNTR demonstrated the potential for this VNTR to modulate CFAP410 expression. We extrapolated these findings in silico by utilisation of tagging SNPs for the two most common VNTR alleles to establish a correlation with endogenous gene expression. Consistent with in vitro data, CFAP410 isoform expression was found to be variable in the brain. Furthermore, although the number of matched controls was low, there was evidence for one specific isoform being correlated with lower expression in those with ALS. To address if the genotype of the VNTR was associated with ALS risk, we characterised the variation of the CFAP410 VNTR in ALS cases and matched controls by PCR analysis of the VNTR length, defining eight alleles of the VNTR. No significant difference was observed between cases and controls, we noted, however, the cohort was unlikely to contain sufficient power to enable any firm conclusion to be drawn from this analysis. This data demonstrated that the VNTR domain has the potential to modulate CFAP410 expression as a regulatory element that could play a role in its tissue-specific and stimulus-inducible regulation that could impact the mechanism by which CFAP410 is involved in ALS

    Searching the dark genome for Alzheimer's disease risk variants

    Get PDF
    Sporadic Alzheimer’s disease (AD) is a complex genetic disease, and the leading cause of dementia worldwide. Over the past 3 decades, extensive pioneering research has discovered more than 70 common and rare genetic risk variants. These discoveries have contributed massively to our understanding of the pathogenesis of AD but approximately half of the heritability for AD remains unaccounted for. There are regions of the genome that are not assayed by mainstream genotype and sequencing technology. These regions, known as the Dark Genome, often harbour large structural DNA variants that are likely relevant to disease risk. Here, we describe the dark genome and review current technological and bioinformatics advances that will enable researchers to shed light on these hidden regions of the genome. We highlight the potential importance of the hidden genome in complex disease and how these strategies will assist in identifying the missing heritability of AD. Identification of novel protein-coding structural variation that increases risk of AD will open new avenues for translational research and new drug targets that have the potential for clinical benefit to delay or even prevent clinical symptoms of disease

    Paragraph: A graph-based structural variant genotyper for short-read sequence data

    Get PDF
    Accurate detection and genotyping of structural variations (SVs) from short-read data is a long-standing area of development in genomics research and clinical sequencing pipelines. We introduce Paragraph, an accurate genotyper that models SVs using sequence graphs and SV annotations. We demonstrate the accuracy of Paragraph on whole-genome sequence data from three samples using long-read SV calls as the truth set, and then apply Paragraph at scale to a cohort of 100 short-read sequenced samples of diverse ancestry. Our analysis shows that Paragraph has better accuracy than other existing genotypers and can be applied to population-scale studies. © 2019 The Author(s)

    A polymorphic transcriptional regulatory domain in the amyotrophic lateral sclerosis risk gene <i>CFAP410</i> correlates with differential isoform expression.

    Get PDF
    We describe the characterisation of a variable number tandem repeat (VNTR) domain within intron 1 of the amyotrophic lateral sclerosis (ALS) risk gene CFAP410 (Cilia and flagella associated protein 410) (previously known as C21orf2), providing insight into how this domain could support differential gene expression and thus be a modulator of ALS progression or risk. We demonstrated the VNTR was functional in a reporter gene assay in the HEK293 cell line, exhibiting both the properties of an activator domain and a transcriptional start site, and that the differential expression was directed by distinct repeat number in the VNTR. These properties embedded in the VNTR demonstrated the potential for this VNTR to modulate CFAP410 expression. We extrapolated these findings in silico by utilisation of tagging SNPs for the two most common VNTR alleles to establish a correlation with endogenous gene expression. Consistent with in vitro data, CFAP410 isoform expression was found to be variable in the brain. Furthermore, although the number of matched controls was low, there was evidence for one specific isoform being correlated with lower expression in those with ALS. To address if the genotype of the VNTR was associated with ALS risk, we characterised the variation of the CFAP410 VNTR in ALS cases and matched controls by PCR analysis of the VNTR length, defining eight alleles of the VNTR. No significant difference was observed between cases and controls, we noted, however, the cohort was unlikely to contain sufficient power to enable any firm conclusion to be drawn from this analysis. This data demonstrated that the VNTR domain has the potential to modulate CFAP410 expression as a regulatory element that could play a role in its tissue-specific and stimulus-inducible regulation that could impact the mechanism by which CFAP410 is involved in ALS
    corecore