
    Second-generation PLINK: rising to the challenge of larger and richer datasets

    PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for even faster and more scalable implementations of key functions. In addition, GWAS and population-genetic data now frequently contain probabilistic calls, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data format capable of efficiently representing probabilities, phase, and multiallelic variants, and (b) extensions of many functions to account for the new types of information. The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use. Comment: 2 figures, 1 additional file
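
The "bit-level parallelism" mentioned in the abstract can be illustrated with a small sketch. This is not PLINK's code: it assumes genotypes are packed two bits each, with the codes 00 (homozygous first allele), 01 (missing), 10 (heterozygous) and 11 (homozygous second allele), and it uses a Python integer as a stand-in for a machine word. The helper names `pack`, `popcount` and `count_genotypes` are hypothetical. The point is that a few mask and population-count operations tally all genotypes in a word at once, instead of inspecting them one by one.

```python
# Assumed 2-bit genotype codes (illustrative; chosen to mirror a 2-bit packed layout).
HOM_A1, MISSING, HET, HOM_A2 = 0b00, 0b01, 0b10, 0b11


def pack(genotypes):
    """Pack a list of 2-bit genotype codes into one integer, first genotype in the lowest bits."""
    word = 0
    for i, g in enumerate(genotypes):
        word |= g << (2 * i)
    return word


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def count_genotypes(word, n):
    """Count genotype classes among the first n packed genotypes using masks and popcounts."""
    low_mask = (4 ** n - 1) // 3        # bit pattern ...010101 covering n genotype slots
    lo = word & low_mask                # low bit of every 2-bit pair
    hi = (word >> 1) & low_mask         # high bit of every pair, shifted into the low position
    n_hom_a2 = popcount(lo & hi)        # pairs equal to 11
    n_missing = popcount(lo & ~hi)      # pairs equal to 01
    n_het = popcount(hi & ~lo)          # pairs equal to 10
    n_hom_a1 = n - n_hom_a2 - n_missing - n_het  # remaining pairs are 00
    return {"hom_a1": n_hom_a1, "het": n_het, "hom_a2": n_hom_a2, "missing": n_missing}


if __name__ == "__main__":
    sample = [HOM_A1, HET, HET, HOM_A2, MISSING, HOM_A1, HOM_A2, HET]
    print(count_genotypes(pack(sample), len(sample)))
```

In a compiled implementation the same idea would operate on fixed-width words with a hardware popcount instruction; the arbitrary-precision integer here only keeps the sketch short.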

    Analysis of five deep-sequenced trio-genomes of the Peninsular Malaysia Orang Asli and North Borneo populations

    Background: Recent advances in genomic technologies have facilitated genome-wide investigation of human genetic variation. However, most efforts have focused on the major populations, and trio genomes of indigenous populations from Southeast Asia remain under-investigated. Results: We analyzed the whole-genome deep sequencing data (30x) of five native trios from Peninsular Malaysia and North Borneo, and characterized the genomic variants, including single nucleotide variants (SNVs), small insertions and deletions (indels), and copy number variants (CNVs). We discovered approximately 6.9 million SNVs, 1.2 million indels, and 9000 CNVs in the 15 samples, of which 2.7% of SNVs, 2.3% of indels, and 22% of CNVs were novel, implying insufficient coverage of population diversity in existing databases. We identified a higher proportion of novel variants in the Orang Asli (OA) samples, i.e., the indigenous people from Peninsular Malaysia, than in the North Bornean (NB) samples, likely due to the more complex demographic history and long-term isolation of the OA groups. We used the pedigree information to identify de novo variants and estimated the autosomal mutation rates to be 0.81 × 10^-8 to 1.33 × 10^-8, 1.0 × 10^-9 to 2.9 × 10^-9, and 0.001 per site per generation for SNVs, indels, and CNVs, respectively. The trio genomes also allowed for haplotype phasing with high accuracy, which serves as a reference for future genomic studies of OA and NB populations. In addition, high-frequency inherited CNVs specific to OA or NB were identified. One example is a 50-kb duplication in DEFA1B detected only in the Negrito trios, implying plausible effects on host defense against exposure to the diverse microbes in the tropical rainforest environment of these hunter-gatherers. The CNVs shared between OA and NB groups were much fewer than those specific to each group. Nevertheless, we identified a 142-kb duplication in AMY1A in all 15 samples, and this gene is associated with a high-starch diet. Moreover, novel insertions shared with archaic hominids were identified in our samples. Conclusion: Our study presents a full catalogue of the genome variants of the native Malaysian populations, which complements the known genome diversity of Southeast Asians. It implies a specific population history of the native inhabitants and demonstrates the necessity of more genome sequencing efforts on the multi-ethnic native groups of Malaysia and Southeast Asia.

    Assessment of the genetic basis of rosacea by genome-wide association study.

    Rosacea is a common, chronic skin disease that is currently incurable. Although environmental factors influence rosacea, the genetic basis of rosacea is not established. In this genome-wide association study, a discovery group of 22,952 individuals (2,618 rosacea cases and 20,334 controls) was analyzed, leading to identification of two significant single-nucleotide polymorphisms (SNPs) associated with rosacea, one of which replicated in a new group of 29,481 individuals (3,205 rosacea cases and 26,262 controls). The confirmed SNP, rs763035 (P=8.0 × 10^-11 discovery group; P=0.00031 replication group), is intergenic between HLA-DRA and BTNL2. Exploratory immunohistochemical analysis of HLA-DRA and BTNL2 expression in papulopustular rosacea lesions from six individuals, including one with the rs763035 variant, revealed staining in the perifollicular inflammatory infiltrate of rosacea for both proteins. In addition, three HLA alleles, all MHC class II proteins, were significantly associated with rosacea in the discovery group and confirmed in the replication group: HLA-DRB1*03:01 (P=1.0 × 10^-8 discovery group; P=4.4 × 10^-6 replication group), HLA-DQB1*02:01 (P=1.3 × 10^-8 discovery group; P=7.2 × 10^-6 replication group), and HLA-DQA1*05:01 (P=1.4 × 10^-8 discovery group; P=7.6 × 10^-6 replication group). Collectively, the gene variants identified in this study support the concept of a genetic component for rosacea, and provide candidate targets for future studies to better understand and treat rosacea.

    Genetic affinities within a large global collection of pathogenic Leptospira: implications for strain identification and molecular epidemiology

    Leptospirosis is an important zoonosis with widespread human health implications. The lack of accurate methods for identifying individual Leptospira strains in outbreak investigations poses considerable problems for disease control. We applied fluorescent amplified fragment length polymorphism analysis (FAFLP) to Leptospira and investigated its utility in establishing genetic relationships among 271 isolates in the context of species-level assignments of our global collection of isolates and strains obtained from a diverse array of hosts. In addition, this method was compared to an in-house multilocus sequence typing (MLST) method based on polymorphisms in three housekeeping genes, the rrs locus and two envelope proteins. Phylogenetic relationships were deduced based on bifurcating Neighbor-joining trees as well as median joining network analyses integrating both the FAFLP data and MLST-based haplotypes. The phylogenetic relationships were also reproduced through Bayesian analysis of the multilocus sequence polymorphisms. We found FAFLP to be an important method for outbreak investigation and for clustering of isolates based on their geographical descent rather than by genome species types. The FAFLP method did not, however, offer enough taxonomic utility to replace the highly tedious serotyping procedures currently in use. MLST, on the other hand, was found to be highly robust and efficient in identifying ancestral relationships and segregating strains, outbreak-associated or otherwise, according to their genome species status and, therefore, could unambiguously be applied for investigating phylogenetics of Leptospira in the context of taxonomy as well as gene flow. For instance, MLST was more efficient than the FAFLP method in clustering strains from the Andaman Islands of India with their counterparts from mainland India and Sri Lanka, implying that such strains share genetic relationships and that leptospiral strains might be frequently circulating between the islands and the mainland.

    Jabba: hybrid error correction for long sequencing reads using maximal exact matches

    Third generation sequencing platforms produce longer reads with higher error rates than second generation sequencing technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is that this mapping is constructed with a seed and extend methodology, using maximal exact matches as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of maximal exact matches in the context of third generation reads are presented

    Jabba: hybrid error correction for long sequencing reads

    Background: Third generation sequencing platforms produce longer reads with higher error rates than second generation technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. Results: In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is the use of a pseudo alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of MEMs in the context of third generation reads are presented. Conclusion: Jabba produces highly reliable corrected reads: almost all corrected reads align to the reference, and these alignments have a very high identity. Many of the aligned reads are error-free. Additionally, Jabba corrects reads using a very low amount of CPU time. From this we conclude that pseudo alignment with MEMs is a fast and reliable method to map long highly erroneous sequences on a de Bruijn graph
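
To make the notion of a maximal exact match (MEM) concrete, the following is a minimal sketch that finds MEMs between a read and a single linear sequence. It is not Jabba's implementation: Jabba maps reads onto a de Bruijn graph, and practical tools use an index rather than the quadratic scan shown here. The function name `maximal_exact_matches` and the `min_len` parameter are hypothetical. A match is reported only if it cannot be extended to the left or to the right, which is what makes it maximal.

```python
def maximal_exact_matches(read, ref, min_len=3):
    """Return (read_start, ref_start, length) for every maximal exact match of at least min_len.

    Naive O(len(read) * len(ref)) scan, for illustration only.
    """
    mems = []
    for i in range(len(read)):
        for j in range(len(ref)):
            if read[i] != ref[j]:
                continue
            # Skip start positions that could be extended to the left:
            # those belong to a longer match that starts earlier.
            if i > 0 and j > 0 and read[i - 1] == ref[j - 1]:
                continue
            # Extend to the right until a mismatch or the end of either sequence.
            k = 0
            while i + k < len(read) and j + k < len(ref) and read[i + k] == ref[j + k]:
                k += 1
            if k >= min_len:
                mems.append((i, j, k))
    return mems


if __name__ == "__main__":
    # The read differs from the reference by one inserted base (T),
    # so two separate seeds flank the error.
    print(maximal_exact_matches("CCCTGGG", "AAACCCGGG", min_len=3))
```

In a seed-and-extend scheme, seeds such as these would then be chained and the gaps between them aligned; that step is omitted here.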

    Genome-wide Identity-by-Descent Sharing among CEPH Siblings

    The concept of genetic identity–by–descent (IBD) has markedly advanced our understanding of the genetic similarity among relatives and triggered a number of developments in epidemiological genetics. However, no empirical measure of this relatedness throughout the whole human genome has yet been published. Analyzing highly polymorphic genetic variations from the Centre d’études du polymorphisme humain (CEPH) database, we report the first genome–wide estimation of the mean and variation in IBD sharing among siblings. From 1,522 microsatellite markers spaced at an average of 2.3 cM on 498 sibling pairs, we estimated a mean of 0.4994 and a standard deviation of 0.0395. In order to account for the impact of varying chromosomal lengths and recombination rates, the analysis was also performed at the chromosomal and marker levels and for paternal and maternal DNA separately. Based on the variation, we estimate an “effective number of segregating loci” of around 80 for sibling pairs over the whole genome (i.e., the number of loci that would yield the same standard deviation in IBD sharing if all loci were segregating independently). Finally, we briefly assess the impact of genotyping errors on IBD estimations, compare our results to published theoretical and simulated expectations, and discuss some implications of our findings
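
The "effective number of segregating loci" can be reproduced approximately from the reported standard deviation. The sketch below is an illustration under stated assumptions, not the authors' procedure: it assumes that at a single fully informative locus a sibling pair shares 0, 1 or 2 alleles IBD with probabilities 1/4, 1/2 and 1/4, so the shared proportion has variance 1/8, and that the genome-wide proportion is the mean over n independent loci, giving variance 1/(8n). The function names are hypothetical.

```python
def single_locus_variance():
    """Variance of the IBD proportion at one locus for full siblings (assumed distribution)."""
    outcomes = {0.0: 0.25, 0.5: 0.5, 1.0: 0.25}  # proportion shared IBD -> probability
    mean = sum(x * p for x, p in outcomes.items())
    return sum(p * (x - mean) ** 2 for x, p in outcomes.items())


def effective_number_of_loci(sd_genome):
    """Number of independent loci whose mean IBD proportion would have the given standard deviation."""
    return single_locus_variance() / sd_genome ** 2


if __name__ == "__main__":
    # Standard deviation reported in the abstract: 0.0395
    print(round(effective_number_of_loci(0.0395), 1))
```

With the reported standard deviation of 0.0395 this gives a value close to 80, in line with the figure quoted in the abstract.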