39 research outputs found

    How do users design scientific workflows? The Case of Snakemake

    Scientific workflows automate the analysis of large-scale scientific data, fostering the reuse of data processing operators as well as the reproducibility and traceability of analysis results. In exploratory research, however, workflows are continuously adapted, utilizing a wide range of tools and software libraries, to test scientific hypotheses. Script-based workflow engines cater to the required flexibility through direct integration of programming primitives but lack abstractions for interactive exploration of the workflow design by a user during workflow execution. To derive requirements for such interactive workflows, we conduct an empirical study on the use of Snakemake, a popular Python-based workflow engine. Based on workflows collected from 1602 GitHub repositories, we present insights on common structures of Snakemake workflows, as well as the language features typically adopted in their specification.

    PopDel identifies medium-size deletions simultaneously in tens of thousands of genomes

    Thousands of genomic structural variants (SVs) segregate in the human population and can impact phenotypic traits and diseases. Their identification in whole-genome sequence data of large cohorts is a major computational challenge. Most current approaches identify SVs in single genomes and afterwards merge the identified variants into a joint call set across many genomes. We describe the approach PopDel, which directly identifies deletions of about 500 to at least 10,000 bp in length in data of many genomes jointly, eliminating the need for subsequent variant merging. PopDel scales to tens of thousands of genomes, as we demonstrate in evaluations on up to 49,962 genomes. We show that PopDel reliably reports common, rare and de novo deletions. On genomes with available high-confidence reference call sets, PopDel shows excellent recall and precision. Genotype inheritance patterns in up to 6794 trios indicate that genotypes predicted by PopDel are more reliable than those of previous SV callers. Furthermore, PopDel’s running time is competitive with the fastest tested previous tools. The demonstrated scalability and accuracy of PopDel enable routine scans for deletions in large-scale sequencing studies.

    Comprehensive population-wide analysis of Lynch syndrome in Iceland reveals founder mutations in MSH6 and PMS2.

    Lynch syndrome, caused by germline mutations in the mismatch repair genes, is associated with increased cancer risk. Here, using a large whole-genome sequencing data bank, cancer registry and colorectal tumour bank, we determine the prevalence of Lynch syndrome, associated cancer risks and pathogenicity of several variants in the Icelandic population. We use colorectal cancer samples from 1,182 patients diagnosed between 2000 and 2009. One hundred and thirty-two (11.2%) tumours are mismatch repair deficient per immunohistochemistry. Twenty-one (1.8%) have Lynch syndrome while 106 (9.0%) have somatic hypermethylation or mutations in the mismatch repair genes. The population prevalence of Lynch syndrome is 0.442%. We discover a translocation disrupting MLH1 and three mutations in MSH6 and PMS2 that increase endometrial, colorectal, brain and ovarian cancer risk. We find thirteen mismatch repair variants of uncertain significance that are not associated with cancer risk. We find that founder mutations in MSH6 and PMS2 prevail in Iceland, unlike in most other populations. Funding: Ohio State University (OSU) Comprehensive Cancer Center; OSU Colorectal Cancer Research fund; Obrine-Weaver Fund; Pelotonia Fellowship Award; deCODE genetic
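
    The percentages quoted in this abstract can be reproduced from the reported counts. The following is a minimal sketch, assuming each percentage is taken relative to the 1,182 colorectal cancer samples and rounded to one decimal place; the variable and function names are illustrative and do not come from the study.

```python
# Counts as reported in the abstract; labels are illustrative names only.
TOTAL_SAMPLES = 1182
REPORTED_COUNTS = {
    "mismatch_repair_deficient": 132,
    "lynch_syndrome": 21,
    "somatic_hypermethylation_or_mutation": 106,
}


def percent_of_total(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Assumption: every share is computed against all 1,182 samples.
shares = {
    label: percent_of_total(count, TOTAL_SAMPLES)
    for label, count in REPORTED_COUNTS.items()
}

for label, share in shares.items():
    print(f"{label}: {share}%")
```

    Under that assumption the computed shares agree with the 11.2%, 1.8% and 9.0% given in the text.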

    Whole genome characterization of sequence diversity of 15,220 Icelanders

    Understanding of sequence diversity is the cornerstone of analysis of genetic disorders, population genetics, and evolutionary biology. Here, we present an update of our sequencing set to 15,220 Icelanders whom we sequenced to an average genome-wide coverage of 34X. We identified 39,020,168 autosomal variants passing GATK filters: 31,079,378 SNPs and 7,940,790 indels. Calling de novo mutations (DNMs) is a formidable challenge given the high false positive rate in sequencing datasets relative to the mutation rate. Here we addressed this issue by using segregation of alleles in three-generation families. Using this transmission assay, we controlled the false positive rate and identified 108,778 high quality DNMs. Furthermore, we used our extended family structure and read pair tracing of DNMs to a panel of phased SNPs to determine the parent of origin of 42,961 DNMs. Peer reviewed.

    A sequence variant associating with educational attainment also affects childhood cognition

    Only a few common variants in the sequence of the genome have been shown to impact cognitive traits. Here we demonstrate that polygenic scores of educational attainment predict specific aspects of childhood cognition, as measured with IQ. Recently, three sequence variants were shown to associate with educational attainment, a confluence phenotype of genetic and environmental factors contributing to academic success. We show that one of these variants associating with educational attainment, rs4851266-T, also associates with Verbal IQ in dyslexic children (P = 4.3 × 10⁻⁴, beta = 0.16 s.d.). The effect of 0.16 s.d. corresponds to 1.4 IQ points for heterozygotes and 2.8 IQ points for homozygotes. We verified this association in independent samples consisting of adults (P = 8.3 × 10⁻⁵, beta = 0.12 s.d.; combined P = 2.2 × 10⁻⁷, beta = 0.14 s.d.). Childhood cognition is unlikely to be affected by education attained later in life, and the variant explains a greater fraction of the variance in verbal IQ than in educational attainment (0.7% vs 0.12%, P = 1.0 × 10⁻⁵).

    STELLAR: fast and exact local alignments

    Background: Large-scale comparison of genomic sequences requires reliable tools for the search of local alignments. Practical local aligners are in general fast, but heuristic, and hence sometimes miss significant matches. Results: We present here the local pairwise aligner STELLAR that has full sensitivity for ε-alignments, i.e. guarantees to report all local alignments of a given minimal length and maximal error rate. The aligner is composed of two steps, filtering and verification. We apply the SWIFT algorithm for lossless filtering, and have developed a new verification strategy that we prove to be exact. Our results on simulated and real genomic data confirm and quantify the conjecture that heuristic tools like BLAST or BLAT miss a large percentage of significant local alignments. Conclusions: STELLAR is very practical and fast on very long sequences, which makes it a suitable new tool for finding local alignments between genomic sequences under the edit distance model. Binaries are freely available for Linux, Windows, and Mac OS X at http://www.seqan.de/projects/stellar. The source code is freely distributed with the SeqAn C++ library version 1.3 and later at http://www.seqan.de.
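
    The acceptance criterion named in this abstract (a minimal length and a maximal error rate) can be illustrated with a short check. This is only a sketch, not STELLAR's filtering or verification code: it assumes that the alignment is given as two gapped rows of equal length, that "-" marks a gap, and that the error rate is the number of mismatch and gap columns divided by the number of alignment columns. The function name and parameters are hypothetical.

```python
def satisfies_length_and_error_bounds(
    row_a: str, row_b: str, min_length: int, max_error_rate: float
) -> bool:
    """Check a gapped pairwise alignment against a length and error-rate bound.

    Assumption: an error is any column where the two rows differ, which
    covers mismatches and gaps (insertions or deletions), as in edit distance.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same number of columns")
    if any(a == "-" and b == "-" for a, b in zip(row_a, row_b)):
        raise ValueError("a column must not contain a gap in both rows")

    columns = len(row_a)
    if columns < min_length:
        return False

    errors = sum(1 for a, b in zip(row_a, row_b) if a != b)
    # Compare counts rather than dividing, so an empty alignment cannot
    # cause a division by zero.
    return errors <= max_error_rate * columns


# One mismatch in ten columns: accepted at 20% error, rejected at 5%.
print(satisfies_length_and_error_bounds("ACGTACGTAC", "ACGTTCGTAC", 10, 0.20))
print(satisfies_length_and_error_bounds("ACGTACGTAC", "ACGTTCGTAC", 10, 0.05))
```

    A check like this only decides whether one given alignment qualifies; reporting all such alignments, as the abstract describes, requires the filtering and verification steps of the tool itself.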

    New basal cell carcinoma susceptibility loci.

    In an ongoing screen for DNA sequence variants that confer risk of cutaneous basal cell carcinoma (BCC), we conduct a genome-wide association study (GWAS) of 24,988,228 SNPs and small indels detected through whole-genome sequencing of 2,636 Icelanders and imputed into 4,572 BCC patients and 266,358 controls. Here we show the discovery of four new BCC susceptibility loci: 2p24 MYCN (rs57244888[C], OR = 0.76, P = 4.7 × 10⁻¹²), 2q33 CASP8-ALS2CR12 (rs13014235[C], OR = 1.15, P = 1.5 × 10⁻⁹), 8q21 ZFHX4 (rs28727938[G], OR = 0.70, P = 3.5 × 10⁻¹²) and 10p14 GATA3 (rs73635312[A], OR = 0.74, P = 2.4 × 10⁻¹⁶). Fine mapping reveals that two variants correlated with rs73635312[A] occur in conserved binding sites for the GATA3 transcription factor. In addition, expression microarrays and RNA-seq show that rs13014235[C] and a related SNP rs700635[C] are associated with expression of CASP8 splice variants in which sequences from intron 8 are retained. This article is open access. Funding: NCI\SAIC-Frederick, Inc. (SAIC-F) 10XS170 Roswell Park Cancer Institute 10XS171 Science Care Inc. X10S172 Laboratory, Data Analysis and Coordinating Center (LDACC) HHSN268201000029C SAIC-F 10ST1035 HHSN261200800001E Brain Bank DA006227 DA033684 N01MH000028 University of Geneva MH090941 MH101814 University of Chicago MH090951 MH090937 MH101820 MH101825 University of North Carolina-Chapel Hill MH090936 MH101819 Harvard University MH090948 Stanford University MH101782 Washington University St Louis MH101810 University of Pennsylvania MH10182

    Insertion of an SVA-E retrotransposon into the CASP8 gene is associated with protection against prostate cancer

    Transcriptional and splicing anomalies have been observed in intron 8 of the CASP8 gene (encoding procaspase-8) in association with cutaneous basal-cell carcinoma (BCC) and linked to a germline SNP rs700635. Here, we show that the rs700635[C] allele, which is associated with increased risk of BCC and breast cancer, is protective against prostate cancer [odds ratio (OR) = 0.91, P = 1.0 × 10⁻⁶]. rs700635[C] is also associated with failures to correctly splice out CASP8 intron 8 in breast and prostate tumours and in corresponding normal tissues. Investigation of rs700635[C] carriers revealed that they have a human-specific short interspersed element-variable number of tandem repeat-Alu (SINE-VNTR-Alu), subfamily-E retrotransposon (SVA-E) inserted into CASP8 intron 8. The SVA-E shows evidence of prior activity, because it has transduced some CASP8 sequences during subsequent retrotransposition events. Whole-genome sequence (WGS) data were used to tag the SVA-E with a surrogate SNP rs1035142[T] (r² = 0.999), which showed associations both with the splicing anomalies (P = 6.5 × 10⁻³²) and with protection against prostate cancer (OR = 0.91, P = 3.8 × 10⁻⁷). This article is open access. Funding: National Cancer Research Institute (NCRI) G0500966/75466 Department of Health, Medical Research Council Cancer Research UK University of Cambridge NIHR Department of Health Anniversary Fund of the Austrian National Bank 15079 Medical and Scientific Fund of the Mayor of the City of Vienna 10077 Common Fund of the Office of the Director of the National Institutes of Health NCI NHGRI NHLBI NIDA NIMH NINDS NCI\SAIC-Frederick, Inc. (SAIC-F) 10XS170 Roswell Park Cancer Institute 10XS171 Science Care, Inc. X10S172 SAIC-F 10ST1035 HHSN261200800001E deCODE genetics/AMGEN HHSN268201000029C DA006227 DA033684 N01MH000028 MH090941 MH101814 MH090951 MH090937 MH101820 MH101825 MH090936 MH101819 MH090948 MH101782 MH101810 MH10182