Interest-based RDF Update Propagation
Many LOD datasets, such as DBpedia and LinkedGeoData, are voluminous and
process large amounts of requests from diverse applications. Many data products
and services rely on full or partial local LOD replications to ensure faster
querying and processing. While such replicas enhance the flexibility of
information sharing and integration infrastructures, they also introduce data
duplication with all the associated undesirable consequences. Given the
evolving nature of the original and authoritative datasets, frequent
replacements are required, at great cost, to keep replicas consistent and
up to date. In this paper, we introduce an approach for interest-based RDF
update propagation, which propagates only interesting parts of updates from the
source to the target dataset. Effectively, this enables remote applications to
'subscribe' to relevant datasets and consistently reflect the necessary changes
locally without the need to frequently replace the entire dataset (or a
relevant subset). Our approach is based on a formal definition for
graph-pattern-based interest expressions that is used to filter interesting
parts of updates from the source. We implement the approach in the iRap
framework and perform a comprehensive evaluation based on DBpedia Live updates,
to confirm the validity and value of our approach.
Comment: 16 pages, Keywords: Change Propagation, Dataset Dynamics, Linked
Data, Replicatio
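The filtering idea described in this abstract can be illustrated with a minimal sketch. It is not the iRap implementation: the abstract describes graph-pattern-based interest expressions, whereas this sketch simplifies an interest to a list of single triple patterns in which None acts as a wildcard. The names Triple, TriplePattern and filter_update, and the sample triples, are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class Triple(NamedTuple):
    """One RDF statement, with terms kept as plain strings for simplicity."""
    subject: str
    predicate: str
    obj: str


class TriplePattern(NamedTuple):
    """Hypothetical, simplified interest: None in a position matches any term."""
    subject: Optional[str] = None
    predicate: Optional[str] = None
    obj: Optional[str] = None

    def matches(self, triple: Triple) -> bool:
        # Compare position by position; a None position is a wildcard.
        return all(p is None or p == v for p, v in zip(self, triple))


def filter_update(
    added: Iterable[Triple],
    removed: Iterable[Triple],
    interest: Iterable[TriplePattern],
) -> Tuple[List[Triple], List[Triple]]:
    """Keep only the added and removed triples that match some pattern."""
    patterns = list(interest)

    def interesting(triple: Triple) -> bool:
        return any(p.matches(triple) for p in patterns)

    return (
        [t for t in added if interesting(t)],
        [t for t in removed if interesting(t)],
    )


if __name__ == "__main__":
    # Made-up changeset: only the population triples match the interest.
    added = [
        Triple("dbr:Berlin", "dbo:populationTotal", "3500000"),
        Triple("dbr:Berlin", "rdfs:label", "Berlin"),
    ]
    removed = [Triple("dbr:Berlin", "dbo:populationTotal", "3400000")]
    interest = [TriplePattern(predicate="dbo:populationTotal")]
    print(filter_update(added, removed, interest))
```

A target replica would then apply only the filtered additions and removals instead of reloading the whole dataset. Handling multi-triple graph patterns, as the abstract describes, would need join logic that this sketch leaves out.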
Compression of Structured High-Throughput Sequencing Data
Large biological datasets are being produced at a rapid pace and create substantial storage challenges, particularly in the domain of high-throughput sequencing (HTS). Most approaches currently used to store HTS data are either unable to quickly adapt to the requirements of new sequencing or analysis methods (because they do not support schema evolution), or fail to provide state-of-the-art compression of the datasets. We have devised new approaches to store HTS data that support seamless data schema evolution and compress datasets substantially better than existing approaches. Building on these new approaches, we discuss and demonstrate how a multi-tier data organization can dramatically reduce the storage, computational and network burden of collecting, analyzing, and archiving large sequencing datasets. For instance, we show that spliced RNA-Seq alignments can be stored in less than 4% of the size of a BAM file with perfect data fidelity. Compared to the previous compression state of the art, these methods reduce dataset size by more than 40% when storing exome, gene expression or DNA methylation datasets. The approaches have been integrated in a comprehensive suite of software tools (http://goby.campagnelab.org) that support common analyses for a range of high-throughput sequencing assays. National Center for Research Resources (U.S.) (Grant UL1 RR024996); Leukemia & Lymphoma Society of America (Translational Research Program Grant LLS 6304-11); National Institute of Mental Health (U.S.) (R01 MH086883
mRNA stability and m(6)A are major determinants of subcellular mRNA localization in neurons
For cells to perform their biological functions, they need to adopt specific shapes and form functionally distinct subcellular compartments. This is achieved in part via an asymmetric distribution of mRNAs within cells. Currently, the main model of mRNA localization involves specific sequences called "zipcodes" that direct mRNAs to their proper locations. However, while thousands of mRNAs localize within cells, only a few zipcodes have been identified, suggesting that additional mechanisms contribute to localization. Here, we assess the role of mRNA stability in localization by combining the isolation of the soma and neurites of mouse primary cortical and mESC-derived neurons, SLAM-seq, m(6)A-RIP-seq, the perturbation of mRNA destabilization mechanisms, and the analysis of multiple mRNA localization datasets. We show that depletion of mRNA destabilization elements, such as m(6)A, AU-rich elements, and suboptimal codons, functions as a mechanism that mediates the localization of mRNAs associated with housekeeping functions to neurites in several types of neurons
Hijacking of transcriptional condensates by endogenous retroviruses
Most endogenous retroviruses (ERVs) in mammals are incapable of retrotransposition; therefore, why ERV derepression is associated with lethality during early development has been a mystery. Here, we report that rapid and selective degradation of the heterochromatin adapter protein TRIM28 triggers dissociation of transcriptional condensates from loci encoding super-enhancer (SE)-driven pluripotency genes and their association with transcribed ERV loci in murine embryonic stem cells. Knockdown of ERV RNAs or forced expression of SE-enriched transcription factors rescued condensate localization at SEs in TRIM28-degraded cells. In a biochemical reconstitution system, ERV RNA facilitated partitioning of RNA polymerase II and the Mediator coactivator into phase-separated droplets. In TRIM28 knockout mouse embryos, single-cell RNA-seq analysis revealed specific depletion of pluripotent lineages. We propose that coding and noncoding nascent RNAs, including those produced by retrotransposons, may facilitate ‘hijacking’ of transcriptional condensates in various developmental and disease contexts
Analysis of exome data for 4293 trios suggests GPI-anchor biogenesis defects are a rare cause of developmental disorders.
Over 150 different proteins attach to the plasma membrane using glycosylphosphatidylinositol (GPI) anchors. Mutations in 18 genes that encode components of GPI-anchor biogenesis result in a phenotypic spectrum that includes learning disability, epilepsy, microcephaly, congenital malformations and mild dysmorphic features. To determine the incidence of GPI-anchor defects, we analysed the exome data from 4293 parent-child trios recruited to the Deciphering Developmental Disorders (DDD) study. All probands recruited had a neurodevelopmental disorder. We searched for variants in 31 genes linked to GPI-anchor biogenesis and detected rare biallelic variants in PGAP3, PIGN, PIGT (n=2), PIGO and PIGL, providing a likely diagnosis for six families. In five families, the variants were in a compound heterozygous configuration while in a consanguineous Afghani kindred, a homozygous c.709G>C; p.(E237Q) variant in PIGT was identified within 10-12 Mb of autozygosity. Validation and segregation analysis was performed using Sanger sequencing. Across the six families, five siblings were available for testing and in all cases variants co-segregated consistent with them being causative. In four families, abnormal alkaline phosphatase results were observed in the direction expected. FACS analysis of knockout HEK293 cells that had been transfected with wild-type or mutant cDNA constructs demonstrated that the variants in PIGN, PIGT and PIGO all led to reduced activity. Splicing assays, performed using leucocyte RNA, showed that a c.336-2A>G variant in PIGL resulted in exon skipping and p.D113fs*2. Our results strengthen recently reported disease associations, suggest that defective GPI-anchor biogenesis may explain ~0.15% of individuals with developmental disorders and highlight the benefits of data sharing
Factors influencing success of clinical genome sequencing across a broad spectrum of disorders
To assess factors influencing the success of whole-genome sequencing for mainstream clinical diagnosis, we sequenced 217 individuals from 156 independent cases or families across a broad spectrum of disorders in whom previous screening had identified no pathogenic variants. We quantified the number of candidate variants identified using different strategies for variant calling, filtering, annotation and prioritization. We found that jointly calling variants across samples, filtering against both local and external databases, deploying multiple annotation tools and using familial transmission above biological plausibility contributed to accuracy. Overall, we identified disease-causing variants in 21% of cases, with the proportion increasing to 34% (23/68) for mendelian disorders and 57% (8/14) in family trios. We also discovered 32 potentially clinically actionable variants in 18 genes unrelated to the referral disorder, although only 4 were ultimately considered reportable. Our results demonstrate the value of genome sequencing for routine clinical diagnosis but also highlight many outstanding challenges
Structural and non-coding variants increase the diagnostic yield of clinical whole genome sequencing for rare diseases
BACKGROUND: Whole genome sequencing is increasingly being used for the diagnosis of patients with rare diseases. However, the diagnostic yields of many studies, particularly those conducted in a healthcare setting, are often disappointingly low, at 25–30%. This is in part because although entire genomes are sequenced, analysis is often confined to in silico gene panels or coding regions of the genome. METHODS: We undertook WGS on a cohort of 122 unrelated rare disease patients and their relatives (300 genomes) who had been pre-screened by gene panels or arrays. Patients were recruited from a broad spectrum of clinical specialties. We applied a bioinformatics pipeline that would allow comprehensive analysis of all variant types. We combined established bioinformatics tools for phenotypic and genomic analysis with our novel algorithms (SVRare, ALTSPLICE and GREEN-DB) to detect and annotate structural, splice site and non-coding variants. RESULTS: Our diagnostic yield was 43/122 cases (35%), although 47/122 cases (39%) were considered solved when novel candidate genes with supporting functional data were taken into account. Structural, splice site and deep intronic variants contributed to 20/47 (43%) of our solved cases. Five genes that are novel, or were novel at the time of discovery, were identified, whilst a further three genes are putative novel disease genes with evidence of causality. We identified variants of uncertain significance in a further fourteen candidate genes. The phenotypic spectrum associated with RMND1 was expanded to include polymicrogyria. Two patients with secondary findings in FBN1 and KCNQ1 were confirmed to have previously unidentified Marfan and long QT syndromes, respectively, and were referred for further clinical interventions. Clinical diagnoses were changed in six patients and treatment adjustments made for eight individuals, which for five patients was considered life-saving.
CONCLUSIONS: Genome sequencing is increasingly being considered as a first-line genetic test in routine clinical settings and can make a substantial contribution to rapidly identifying a causal aetiology for many patients, shortening their diagnostic odyssey. We have demonstrated that structural, splice site and intronic variants make a significant contribution to diagnostic yield and that comprehensive analysis of the entire genome is essential to maximise the value of clinical genome sequencing