How do users design scientific workflows? The Case of Snakemake
Scientific workflows automate the analysis of large-scale scientific data,
fostering the reuse of data processing operators as well as the reproducibility
and traceability of analysis results. In exploratory research, however,
workflows are continuously adapted, utilizing a wide range of tools and
software libraries, to test scientific hypotheses. Script-based workflow
engines cater to the required flexibility through direct integration of
programming primitives but lack abstractions for interactive exploration of the
workflow design by a user during workflow execution. To derive requirements for
such interactive workflows, we conduct an empirical study on the use of
Snakemake, a popular Python-based workflow engine. Based on workflows collected
from 1602 GitHub repositories, we present insights on common structures of
Snakemake workflows, as well as the language features typically adopted in
their specification.
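To make "common structures" and "language features" concrete, the following is a minimal sketch of how rule names and directive usage could be tallied from the text of a Snakefile. It is not the method used in the study; the function name `summarize_snakefile`, the regular expressions, and the example workflow are all assumptions made for illustration, and a line-based regex will miss constructs that a real parser would handle.

```python
import re
from collections import Counter

# Hypothetical, simplified patterns: top-level rule/checkpoint headers and
# a handful of well-known Snakemake directives on indented lines.
RULE_RE = re.compile(r"^(?:rule|checkpoint)\s+(\w+)\s*:", re.MULTILINE)
DIRECTIVE_RE = re.compile(
    r"^[ \t]+(input|output|params|threads|resources|log|conda|shell|run|script|wrapper)\s*:",
    re.MULTILINE,
)


def summarize_snakefile(text):
    """Return (list of rule names, Counter of directive names) for Snakefile text."""
    rules = RULE_RE.findall(text)
    directives = Counter(DIRECTIVE_RE.findall(text))
    return rules, directives


# Illustrative two-rule workflow (assumed example, not taken from the study).
EXAMPLE_SNAKEFILE = '''
rule all:
    input:
        "results/summary.txt"

rule summarize:
    input:
        "data/sample.txt"
    output:
        "results/summary.txt"
    threads: 2
    shell:
        "wc -l {input} > {output}"
'''

if __name__ == "__main__":
    rule_names, directive_counts = summarize_snakefile(EXAMPLE_SNAKEFILE)
    print(rule_names)
    print(dict(directive_counts))
```

Applied across many repositories, counts like these would give a rough picture of which directives workflows rely on, though any real analysis would need to handle includes, modules, and Python code embedded in the workflow.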
PopDel identifies medium-size deletions simultaneously in tens of thousands of genomes
Thousands of genomic structural variants (SVs) segregate in the human population and can impact phenotypic traits and diseases. Their identification in whole-genome sequence data of large cohorts is a major computational challenge. Most current approaches identify SVs in single genomes and afterwards merge the identified variants into a joint call set across many genomes. We describe the approach PopDel, which directly identifies deletions of about 500 to at least 10,000 bp in length in data of many genomes jointly, eliminating the need for subsequent variant merging. PopDel scales to tens of thousands of genomes as we demonstrate in evaluations on up to 49,962 genomes. We show that PopDel reliably reports common, rare and de novo deletions. On genomes with available high-confidence reference call sets PopDel shows excellent recall and precision. Genotype inheritance patterns in up to 6794 trios indicate that genotypes predicted by PopDel are more reliable than those of previous SV callers. Furthermore, PopDel’s running time is competitive with the fastest tested previous tools. The demonstrated scalability and accuracy of PopDel enable routine scans for deletions in large-scale sequencing studies.
Comprehensive population-wide analysis of Lynch syndrome in Iceland reveals founder mutations in MSH6 and PMS2.
Lynch syndrome, caused by germline mutations in the mismatch repair genes, is associated with increased cancer risk. Here using a large whole-genome sequencing data bank, cancer registry and colorectal tumour bank we determine the prevalence of Lynch syndrome, associated cancer risks and pathogenicity of several variants in the Icelandic population. We use colorectal cancer samples from 1,182 patients diagnosed between 2000 and 2009. One hundred and thirty-two (11.2%) tumours are mismatch repair deficient per immunohistochemistry. Twenty-one (1.8%) have Lynch syndrome while 106 (9.0%) have somatic hypermethylation or mutations in the mismatch repair genes. The population prevalence of Lynch syndrome is 0.442%. We discover a translocation disrupting MLH1 and three mutations in MSH6 and PMS2 that increase endometrial, colorectal, brain and ovarian cancer risk. We find thirteen mismatch repair variants of uncertain significance that are not associated with cancer risk. We find that founder mutations in MSH6 and PMS2 prevail in Iceland unlike most other populations.
Ohio State University (OSU) Comprehensive Cancer Center
OSU Colorectal Cancer Research fund
Obrine-Weaver Fund
Pelotonia Fellowship Award
deCODE genetics
Whole genome characterization of sequence diversity of 15,220 Icelanders
Understanding of sequence diversity is the cornerstone of analysis of genetic disorders, population genetics, and evolutionary biology. Here, we present an update of our sequencing set to 15,220 Icelanders who we sequenced to an average genome-wide coverage of 34X. We identified 39,020,168 autosomal variants passing GATK filters: 31,079,378 SNPs and 7,940,790 indels. Calling de novo mutations (DNMs) is a formidable challenge given the high false positive rate in sequencing datasets relative to the mutation rate. Here we addressed this issue by using segregation of alleles in three-generation families. Using this transmission assay, we controlled the false positive rate and identified 108,778 high quality DNMs. Furthermore, we used our extended family structure and read pair tracing of DNMs to a panel of phased SNPs, to determine the parent of origin of 42,961 DNMs.
Peer Reviewed
A sequence variant associating with educational attainment also affects childhood cognition
Only a few common variants in the sequence of the genome have been shown to impact cognitive traits. Here we demonstrate that polygenic scores of educational attainment predict specific aspects of childhood cognition, as measured with IQ. Recently, three sequence variants were shown to associate with educational attainment, a confluence phenotype of genetic and environmental factors contributing to academic success. We show that one of these variants associating with educational attainment, rs4851266-T, also associates with Verbal IQ in dyslexic children (P=4.3 x 10^-4, beta=0.16 s.d.). The effect of 0.16 s.d. corresponds to 1.4 IQ points for heterozygotes and 2.8 IQ points for homozygotes. We verified this association in independent samples consisting of adults (P=8.3 x 10^-5, beta=0.12 s.d., combined P=2.2 x 10^-7, beta=0.14 s.d.). Childhood cognition is unlikely to be affected by education attained later in life, and the variant explains a greater fraction of the variance in verbal IQ than in educational attainment (0.7% vs 0.12%, P=1.0 x 10^-5).
STELLAR: fast and exact local alignments
Background: Large-scale comparison of genomic sequences requires reliable tools for the search of local alignments. Practical local aligners are in general fast, but heuristic, and hence sometimes miss significant matches. Results: We present here the local pairwise aligner STELLAR that has full sensitivity for ε-alignments, i.e. guarantees to report all local alignments of a given minimal length and maximal error rate. The aligner is composed of two steps, filtering and verification. We apply the SWIFT algorithm for lossless filtering, and have developed a new verification strategy that we prove to be exact. Our results on simulated and real genomic data confirm and quantify the conjecture that heuristic tools like BLAST or BLAT miss a large percentage of significant local alignments. Conclusions: STELLAR is very practical and fast on very long sequences which makes it a suitable new tool for finding local alignments between genomic sequences under the edit distance model. Binaries are freely available for Linux, Windows, and Mac OS X at http://www.seqan.de/projects/stellar. The source code is freely distributed with the SeqAn C++ library version 1.3 and later at http://www.seqan.de.
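The acceptance criterion in the STELLAR abstract (a given minimal length and maximal error rate) can be illustrated with a small check on a gapped pairwise alignment. This is only a sketch under stated assumptions: the function name `is_epsilon_match` is hypothetical, the alignment is given as two equal-length rows with `-` for gaps, and the error rate is assumed to be the number of mismatch or gap columns divided by the number of alignment columns. It does not reproduce STELLAR's filtering or verification steps.

```python
def is_epsilon_match(row_a, row_b, min_length, max_error_rate):
    """Check whether a gapped alignment meets a length and error-rate threshold.

    row_a, row_b: alignment rows of equal length, '-' marks a gap (assumed format).
    min_length: minimal number of alignment columns.
    max_error_rate: maximal fraction of columns that are mismatches or gaps.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    columns = len(row_a)
    if columns == 0:
        return False
    # Under edit distance, every mismatch or gap column counts as one error.
    errors = sum(1 for a, b in zip(row_a, row_b) if a != b)
    return columns >= min_length and errors / columns <= max_error_rate


if __name__ == "__main__":
    # One gap column in ten: error rate 0.1.
    print(is_epsilon_match("ACG-TACGTA", "ACGTTACGTA", 10, 0.1))
    # Two mismatches in ten: error rate 0.2, above the threshold.
    print(is_epsilon_match("ACGTACGTAC", "ACCTACGTTC", 10, 0.1))
```

A heuristic aligner may fail to report some alignments that pass this check; the abstract's claim of full sensitivity is that every alignment satisfying the thresholds is reported.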
New basal cell carcinoma susceptibility loci.
This article is open access.
In an ongoing screen for DNA sequence variants that confer risk of cutaneous basal cell carcinoma (BCC), we conduct a genome-wide association study (GWAS) of 24,988,228 SNPs and small indels detected through whole-genome sequencing of 2,636 Icelanders and imputed into 4,572 BCC patients and 266,358 controls. Here we show the discovery of four new BCC susceptibility loci: 2p24 MYCN (rs57244888[C], OR=0.76, P=4.7 × 10^-12), 2q33 CASP8-ALS2CR12 (rs13014235[C], OR=1.15, P=1.5 × 10^-9), 8q21 ZFHX4 (rs28727938[G], OR=0.70, P=3.5 × 10^-12) and 10p14 GATA3 (rs73635312[A], OR=0.74, P=2.4 × 10^-16). Fine mapping reveals that two variants correlated with rs73635312[A] occur in conserved binding sites for the GATA3 transcription factor. In addition, expression microarrays and RNA-seq show that rs13014235[C] and a related SNP rs700635[C] are associated with expression of CASP8 splice variants in which sequences from intron 8 are retained.
NCI\SAIC-Frederick, Inc. (SAIC-F) 10XS170
Roswell Park Cancer Institute 10XS171
Science Care Inc. X10S172
Laboratory, Data Analysis and Coordinating Center (LDACC) HHSN268201000029C
SAIC-F 10ST1035, HHSN261200800001E
Brain Bank DA006227, DA033684, N01MH000028
University of Geneva MH090941, MH101814
University of Chicago MH090951, MH090937, MH101820, MH101825
University of North Carolina-Chapel Hill MH090936, MH101819
Harvard University MH090948
Stanford University MH101782
Washington University St Louis MH101810
University of Pennsylvania MH10182
Insertion of an SVA-E retrotransposon into the CASP8 gene is associated with protection against prostate cancer
This article is open access.
Transcriptional and splicing anomalies have been observed in intron 8 of the CASP8 gene (encoding procaspase-8) in association with cutaneous basal-cell carcinoma (BCC) and linked to a germline SNP rs700635. Here, we show that the rs700635[C] allele, which is associated with increased risk of BCC and breast cancer, is protective against prostate cancer [odds ratio (OR) = 0.91, P = 1.0 × 10^-6]. rs700635[C] is also associated with failures to correctly splice out CASP8 intron 8 in breast and prostate tumours and in corresponding normal tissues. Investigation of rs700635[C] carriers revealed that they have a human-specific short interspersed element-variable number of tandem repeat-Alu (SINE-VNTR-Alu), subfamily-E retrotransposon (SVA-E) inserted into CASP8 intron 8. The SVA-E shows evidence of prior activity, because it has transduced some CASP8 sequences during subsequent retrotransposition events. Whole-genome sequence (WGS) data were used to tag the SVA-E with a surrogate SNP rs1035142[T] (r^2 = 0.999), which showed associations with both the splicing anomalies (P = 6.5 × 10^-32) and with protection against prostate cancer (OR = 0.91, P = 3.8 × 10^-7).
National Cancer Research Institute (NCRI)
G0500966/75466
Department of Health, Medical Research Council
Cancer Research UK
University of Cambridge
NIHR
Department of Health
Anniversary Fund of the Austrian National Bank
15079
Medical and Scientific Fund of the Mayor of the City of Vienna
10077
Common Fund of the Office of the Director of the National Institutes of Health
NCI
NHGRI
NHLBI
NIDA
NIMH
NINDS
NCI\SAIC-Frederick, Inc. (SAIC-F) 10XS170
Roswell Park Cancer Institute 10XS171
Science Care, Inc. X10S172
SAIC-F 10ST1035, HHSN261200800001E
deCODE genetics/AMGEN
HHSN268201000029C
DA006227
DA033684
N01MH000028
MH090941
MH101814
MH090951
MH090937
MH101820
MH101825
MH090936
MH101819
MH090948
MH101782
MH101810
MH10182