2,882 research outputs found

    The variant call format and VCFtools

    Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API
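
As a rough illustration of the record layout the summary describes, the sketch below splits one tab-delimited VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The function name `parse_vcf_record` and the sample line are illustrative assumptions; this is not part of VCFtools or its Perl API, and it ignores header lines and per-sample genotype columns.

```python
def parse_vcf_record(line):
    """Parse the eight fixed columns of a single VCF data line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]

    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    if info != ".":
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            info_dict[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),                      # 1-based position on the reference
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),                # several ALT alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
    }


# Made-up example line, for illustration only.
record = parse_vcf_record("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB")
```

Values in the INFO dictionary are left as strings here, because their types are declared in the file header, which this sketch does not read.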

    estMOI: estimating multiplicity of infection using parasite deep sequencing data.

    Individuals living in endemic areas generally harbour multiple parasite strains. Multiplicity of infection (MOI) can be an indicator of immune status and transmission intensity. It has a potentially confounding effect on a number of population genetic analyses, which often assume isolates are clonal. Polymerase chain reaction-based approaches to estimate MOI can lack sensitivity. For example, in the human malaria parasite Plasmodium falciparum, genotyping of the merozoite surface protein (MSP1/2) genes is a standard method for assessing MOI, despite the apparent problem of underestimation. The availability of deep coverage data from massively parallelizable sequencing technologies means that MOI can be detected genome-wide by considering the abundance of heterozygous genotypes. Here, we present a method to estimate MOI, which considers unique combinations of polymorphisms from sequence reads. The method is implemented within the estMOI software. When applied to clinical P. falciparum isolates from three continents, we find that multiple infections are common, especially in regions with high transmission
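
The idea of counting unique combinations of polymorphisms on reads can be sketched as follows. This is a simplified illustration under stated assumptions, not the estMOI implementation: reads are modelled as dictionaries mapping SNP position to observed allele, the window size `k`, the support threshold `min_reads`, and the choice of the most frequent per-window count as the summary are all hypothetical.

```python
from collections import Counter


def haplotype_counts(reads, snp_positions, k=3, min_reads=2):
    """For each window of k consecutive SNPs, count the distinct allele
    combinations supported by at least min_reads reads spanning the window."""
    counts = []
    for i in range(len(snp_positions) - k + 1):
        window = snp_positions[i:i + k]
        combos = Counter()
        for read in reads:
            # Only reads covering every SNP in the window are informative.
            if all(pos in read for pos in window):
                combos[tuple(read[pos] for pos in window)] += 1
        supported = [combo for combo, n in combos.items() if n >= min_reads]
        if supported:
            counts.append(len(supported))
    return counts


def estimate_moi(counts):
    """Summarise per-window counts by their most frequent value (an assumption)."""
    return Counter(counts).most_common(1)[0][0] if counts else None


# Toy data: two strains plus one singleton read that looks like a sequencing error.
reads = [
    {100: "A", 150: "C", 200: "G"},
    {100: "A", 150: "C", 200: "G"},
    {100: "T", 150: "C", 200: "A"},
    {100: "T", 150: "C", 200: "A"},
    {150: "C", 200: "G", 250: "T"},
    {150: "C", 200: "G", 250: "T"},
    {150: "C", 200: "A", 250: "T"},
    {150: "C", 200: "A", 250: "T"},
    {150: "G", 200: "A", 250: "T"},
]
window_counts = haplotype_counts(reads, [100, 150, 200, 250])
moi = estimate_moi(window_counts)
```

Requiring a minimum number of supporting reads is one simple way to keep isolated sequencing errors from inflating the count of apparent strains.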

    An heuristic filtering tool to identify phenotype-associated genetic variants applied to human intellectual disability and canine coat colors

    Background: Identification of one or several disease causing variant(s) from the large collection of variants present in an individual is often achieved by the sequential use of heuristic filters. The recent development of whole exome sequencing enrichment designs for several non-model species created the need for a species-independent, fast and versatile analysis tool, capable of tackling a wide variety of standard and more complex inheritance models. With this aim, we developed "Mendelian", an R-package that can be used for heuristic variant filtering. Results: The R-package Mendelian offers fast and convenient filters to analyze putative variants for both recessive and dominant models of inheritance, with variable degrees of penetrance and detectance. Analysis of trios is supported. Filtering against variant databases and annotation of variants is also included. This package is not species specific and supports parallel computation. We validated this package by reanalyzing data from a whole exome sequencing experiment on intellectual disability in humans. In a second example, we identified the mutations responsible for coat color in the dog. This is the first example of whole exome sequencing without prior mapping in the dog. Conclusion: We developed an R-package that enables the identification of disease-causing variants from the long list of variants called in sequencing experiments. The software and a detailed manual are available at https://github.com/BartBroeckx/Mendelian
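
A heuristic filter for a fully penetrant recessive model in a trio might look like the sketch below. It is written in Python for illustration and does not reproduce the Mendelian R-package or its function names; the genotype encoding (number of alternate alleles: 0, 1 or 2), the field names, and the frequency threshold `max_freq` are assumptions.

```python
def recessive_trio_filter(variants, max_freq=0.01):
    """Keep variants where the affected child is homozygous for the alternate
    allele, both unaffected parents are heterozygous carriers, and the variant
    is rare in a reference database (hypothetical 'pop_freq' field)."""
    kept = []
    for v in variants:
        child_hom_alt = v["child"] == 2
        parents_carriers = v["mother"] == 1 and v["father"] == 1
        rare = v.get("pop_freq", 0.0) <= max_freq
        if child_hom_alt and parents_carriers and rare:
            kept.append(v["id"])
    return kept


# Made-up variants, for illustration only.
variants = [
    {"id": "var1", "child": 1, "mother": 1, "father": 0, "pop_freq": 0.001},
    {"id": "var2", "child": 2, "mother": 1, "father": 1, "pop_freq": 0.002},
    {"id": "var3", "child": 2, "mother": 1, "father": 1, "pop_freq": 0.200},
    {"id": "var4", "child": 2, "mother": 2, "father": 1, "pop_freq": 0.001},
]
candidates = recessive_trio_filter(variants)
```

Each condition removes a class of variants that cannot explain the phenotype under the chosen model, which is why such filters are usually applied in sequence to shrink a long variant list to a few candidates.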

    Second-generation PLINK: rising to the challenge of larger and richer datasets

    PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for even faster and more scalable implementations of key functions. In addition, GWAS and population-genetic data now frequently contain probabilistic calls, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data format capable of efficiently representing probabilities, phase, and multiallelic variants, and (b) extensions of many functions to account for the new types of information. The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

    Gene expression in Leishmania is regulated predominantly by gene dosage

    ABSTRACT Leishmania tropica, a unicellular eukaryotic parasite present in North and East Africa, the Middle East, and the Indian subcontinent, has been linked to large outbreaks of cutaneous leishmaniasis in displaced populations in Iraq, Jordan, and Syria. Here, we report the genome sequence of this pathogen and 7,863 identified protein-coding genes, and we show that the majority of clinical isolates possess high levels of allelic diversity, genetic admixture, heterozygosity, and extensive aneuploidy. By utilizing paired genome-wide high-throughput DNA sequencing (DNA-seq) with RNA-seq, we found that gene dosage, at the level of individual genes or chromosomal “somy” (a general term covering disomy, trisomy, tetrasomy, etc.), accounted for greater than 85% of total gene expression variation in genes with a 2-fold or greater change in expression. High gene copy number variation (CNV) among membrane-bound transporters, a class of proteins previously implicated in drug resistance, was found for the most highly differentially expressed genes. Our results suggest that gene dosage is an adaptive trait that confers phenotypic plasticity among natural Leishmania populations by rapid down- or upregulation of transporter proteins to limit the effects of environmental stresses, such as drug selection. IMPORTANCE Leishmania is a genus of unicellular eukaryotic parasites that is responsible for a spectrum of human diseases that range from cutaneous leishmaniasis (CL) and mucocutaneous leishmaniasis (MCL) to life-threatening visceral leishmaniasis (VL). Developmental and strain-specific gene expression is largely thought to be due to mRNA message stability or posttranscriptional regulatory networks for this species, whose genome is organized into polycistronic gene clusters in the absence of promoter-mediated regulation of transcription initiation of nuclear genes. 
    Genetic hybridization has been demonstrated to yield dramatic structural genomic variation, but whether such changes in gene dosage impact gene expression has not been formally investigated. Here we show that the predominant mechanism determining transcript abundance differences (>85%) in Leishmania tropica is that of gene dosage at the level of individual genes or chromosomal somy