56 research outputs found

A novel compression approach for mapped high-throughput sequencing data set

Author: Popitsch Niko
Publication date: 01/01/2012

A major challenge of current high-throughput sequencing (HTS) experiments is not only the generation of the sequencing data itself but also their processing, storage and transmission. The enormous size of these data motivates the development of data compression algorithms usable for the implementation of the various storage policies that are applied to the produced intermediate and final result files. This thesis gives a brief introduction to the field of high-throughput nucleic acid sequencing and to current approaches for the compression of the data resulting from such experiments. In the main part of the thesis, NGC, a tool for the compression of mapped read data stored in the SAM format (one kind of HTS data), is presented. NGC enables lossless and lossy compression and introduces two novel ideas: first, it contains a way to reduce the number of required code words by exploiting common features of the sequenced reads mapped to the same genomic positions; second, it contains a highly configurable method for the quantization of per-base quality values that takes their influence on downstream analyses into account. NGC, evaluated with several real-world data sets, saves 33-66% of disk space using lossless and up to 98% of disk space using lossy compression. By applying two popular variant and genotype prediction tools to the decompressed data, we show that the lossy compression modes preserve over 99% of all called variants while outperforming comparable methods in some configurations.
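The quality-value quantization mentioned in the abstract can be illustrated with a minimal sketch. It is not NGC's actual scheme: the bin edges, the representative values, and the function names are all hypothetical, chosen only to show how mapping per-base Phred scores onto a few representative values shrinks the alphabet that has to be encoded. The quality strings are assumed to use the Phred+33 encoding common in SAM files.

```python
# Illustrative lossy quantization of per-base quality values.
# Bin edges and representatives are assumptions, not NGC's configuration.

def decode_phred33(qual_string):
    """Convert a Phred+33 quality string into integer Phred scores."""
    return [ord(c) - 33 for c in qual_string]

def encode_phred33(quals):
    """Convert integer Phred scores back into a Phred+33 string."""
    return "".join(chr(q + 33) for q in quals)

def quantize_qualities(quals, bin_edges, representatives):
    """Replace each score with the representative value of its bin.

    bin_edges holds the inclusive upper bound of each bin in ascending order.
    """
    out = []
    for q in quals:
        for edge, rep in zip(bin_edges, representatives):
            if q <= edge:
                out.append(rep)
                break
        else:
            # Scores above the last edge fall into the top bin.
            out.append(representatives[-1])
    return out

# Hypothetical 4-bin scheme: 0-9, 10-19, 20-29, 30 and above.
BIN_EDGES = [9, 19, 29, 93]
REPRESENTATIVES = [5, 15, 25, 35]

original = "II5#@?+I"
quantized = encode_phred33(
    quantize_qualities(decode_phred33(original), BIN_EDGES, REPRESENTATIVES)
)
print(original, "->", quantized)
```

With at most four distinct symbols left in the quality string, a downstream entropy coder has far less to encode. The trade-off is lost precision, which is why the abstract stresses checking the effect on variant calling.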

OTHES

Building blocks for semantic data organization on the desktop

Author: Popitsch Niko
Publication date: 01/01/2011

The organization of (multimedia) data on current desktop systems is done to a large part by arranging files in hierarchical file systems, but also by specialized applications (e.g., music or photo organizing software) that make use of file-related metadata for this task. These metadata are predominantly stored in embedded file headers, using a multitude of mainly proprietary formats. Generally, metadata and links play the key roles in advanced data organization concepts. Their limited support in prevalent file system implementations, however, hinders the adoption of such concepts on the desktop: first, non-uniform access interfaces require metadata-consuming applications to understand both a file's format and its metadata scheme; second, separate data/metadata access is not possible; and third, metadata cannot be attached to multiple files or to file folders, although the latter are the primary constructs for file organization. As a consequence, current desktops suffer, inter alia, from (i) limited data organization possibilities, (ii) limited navigability, (iii) limited data findability, and (iv) metadata fragmentation. Although there were attempts to improve this situation, e.g., by introducing semantic file systems, most of these issues were successfully addressed and solved on the Web, and in particular in the Semantic Web. Reusing these solutions on the desktop, a central hub of data and metadata manipulation, is clearly desirable. In this thesis a novel, backwards-compatible metadata model that addresses the above-mentioned issues is introduced. This model is based on stable file identifiers and external, file-related, semantic metadata descriptions that are represented using the generic RDF graph model. Descriptions are accessible via a uniform Linked Data interface and can be linked with other descriptions and resources. In particular, this model enables semantic linking between local file system objects and remote resources on the Web or the emerging Web of Data, thereby enabling the integration of these data spaces. As the model crucially relies on the stability of these links, we contribute two algorithms that preserve their integrity in local and in remote environments. This means that links between file system objects, metadata descriptions and remote resources do not break even if their addresses change, e.g., when files are moved or Linked Data resources are re-published using different URIs. Finally, we contribute a prototypical implementation of the proposed metadata model that demonstrates how these building blocks combine into a metadata layer that may act as a foundation for semantic data organization on the desktop.

OTHES

Hijacking of transcriptional condensates by endogenous retroviruses

Author: Asimi Vahid
Du Manyu
Fasching Nina
Hetzel Sara
Kretzmer Helene
Kumar Abhishek Sampath
Naderi Julian
Niskanen Henri
Popitsch Niko
Riemenschneider Christina
Publication date: 01/01/2022

Most endogenous retroviruses (ERVs) in mammals are incapable of retrotransposition; therefore, why ERV derepression is associated with lethality during early development has been a mystery. Here, we report that rapid and selective degradation of the heterochromatin adapter protein TRIM28 triggers dissociation of transcriptional condensates from loci encoding super-enhancer (SE)-driven pluripotency genes and their association with transcribed ERV loci in murine embryonic stem cells. Knockdown of ERV RNAs or forced expression of SE-enriched transcription factors rescued condensate localization at SEs in TRIM28-degraded cells. In a biochemical reconstitution system, ERV RNA facilitated partitioning of RNA polymerase II and the Mediator coactivator into phase-separated droplets. In TRIM28 knockout mouse embryos, single-cell RNA-seq analysis revealed specific depletion of pluripotent lineages. We propose that coding and noncoding nascent RNAs, including those produced by retrotransposons, may facilitate ‘hijacking’ of transcriptional condensates in various developmental and disease contexts

Institutional Repository of the Freie Universität Berlin

The Proportionality Critique Still Stands

Author: Ivana Bilusic (3620147)
Meghan Lybecker (795981)
Niko Popitsch (795982)
Philipp Rescheneder (795983)
Renée Schroeder (3620159)
Publication date: 12/08/2015

Transcriptome data comparison. Comparison of genes reported as temperature-dependent in a DNA microarray [52] and asRNAs we identified opposite to them. (XLSX 11 kb)

<intR>²Dok

FigShare

The identification of protein and RNA interactors of the splicing factor Caper in the adult Drosophila nervous system

Author: Adeline W. Chang
Christopher C. Ebmeier
Eugenia C. Olesnicky
Jeremy M. Bono
M. Brandon Titus
Niko Popitsch
Publication venue: 'Frontiers Media SA'
Publication date: 01/06/2023

Post-transcriptional gene regulation is a fundamental mechanism that helps regulate the development and healthy aging of the nervous system. Mutations that disrupt the function of RNA-binding proteins (RBPs), which regulate post-transcriptional gene regulation, have increasingly been implicated in neurological disorders including amyotrophic lateral sclerosis, Fragile X Syndrome, and spinal muscular atrophy. Interestingly, although the majority of RBPs are expressed widely within diverse tissue types, the nervous system is often particularly sensitive to their dysfunction. It is therefore critical to elucidate how aberrant RNA regulation that results from the dysfunction of ubiquitously expressed RBPs leads to tissue specific pathologies that underlie neurological diseases. The highly conserved RBP and alternative splicing factor Caper is widely expressed throughout development and is required for the development of Drosophila sensory and motor neurons. Furthermore, caper dysfunction results in larval and adult locomotor deficits. Nonetheless, little is known about which proteins interact with Caper, and which RNAs are regulated by Caper. Here we identify proteins that interact with Caper in both neural and muscle tissue, along with neural specific Caper target RNAs. Furthermore, we show that a subset of these Caper-interacting proteins and RNAs genetically interact with caper to regulate Drosophila gravitaxis behavior

Directory of Open Access Journals

The complete costs of genome sequencing: a microcosting study in cancer and rare diseases from a single center in the United Kingdom

Author: Antoniou Pavlos
Buchanan James
Camps Carme
Dreau Helene
Fermont Jilles M.
Harris Steve
Knight Samantha J. L.
Kvikstad Erika M.
Pagnamenta Alistair T.
Pentony Melissa M.
Popitsch Niko
Schuh Anna
Schwarze Katharina
Taylor Jenny C.
Taylor John M.
Tilley Mark W.
Wordsworth Sarah
Publication venue: Genetics in Medicine
Publication date: 01/01/2020

Abstract: Purpose: The translation of genome sequencing into routine health care has been slow, partly because of concerns about affordability. The aspirational cost of sequencing a genome is $1000, but there is little evidence to support this estimate. We estimate the cost of using genome sequencing in routine clinical care in patients with cancer or rare diseases. Methods: We performed a microcosting study of Illumina-based genome sequencing in a UK National Health Service laboratory processing 399 samples/year. Cost data were collected for all steps in the sequencing pathway, including bioinformatics analysis and reporting of results. Sensitivity analysis identified key cost drivers. Results: Genome sequencing costs £6841 per cancer case (comprising matched tumor and germline samples) and £7050 per rare disease case (three samples). The consumables used during sequencing are the most expensive component of testing (68–72% of the total cost). Equipment costs are higher for rare disease cases, whereas consumable and staff costs are slightly higher for cancer cases. Conclusion: The cost of genome sequencing is underestimated if only sequencing costs are considered, and likely surpasses $1000/genome in a single laboratory. This aspirational sequencing cost will likely only be achieved if consumable costs are considerably reduced and sequencing is performed at scale.

Apollo (Cambridge)

Clinically actionable mutation profiles in patients with cancer identified by whole-genome sequencing

Author: Ahmed Ahmed
Antoniou Pavlos
Athanasou Nick
Church David
Colling Richard
Dreau Helene
Flanagan Adrienne M.
Hamblin Angela
Harris Adrian
Hassan Bass
Knight Samantha J. L.
Kvikstad Erika M.
Mizani Tuba
Orosz Zsolt
Parton Marina
Pentony Melissa M.
Popitsch Niko
Protheroe Andrew
Ridout Kate
Schuh Anna
Shah Ketan A.
Taylor Jenny C.
Tomlinson Ian
Vavoulis Dimitris
Winter Stuart
Publication venue: 'Cold Spring Harbor Laboratory'
Publication date: 01/01/2018

Next-generation sequencing (NGS) efforts have established catalogs of mutations relevant to cancer development. However, the clinical utility of this information remains largely unexplored. Here, we present the results of the first eight patients recruited into a clinical whole-genome sequencing (WGS) program in the United Kingdom. We performed PCR-free WGS of fresh frozen tumors and germline DNA at 75× and 30×, respectively, using the HiSeq2500 HTv4. Subtracted tumor VCFs and paired germlines were subjected to comprehensive analysis of coding and noncoding regions, integration of germline with somatically acquired variants, and global mutation signatures and pathway analyses. Results were classified into tiers and presented to a multidisciplinary tumor board. WGS results helped to clarify an uncertain histopathological diagnosis in one case, led to informed or supported prognosis in two cases, leading to de-escalation of therapy in one, and indicated potential treatments in all eight. Overall 26 different tier 1 potentially clinically actionable findings were identified using WGS compared with six SNVs/indels using routine targeted NGS. These initial results demonstrate the potential of WGS to inform future diagnosis, prognosis, and treatment choice in cancer and justify the systematic evaluation of the clinical utility of WGS in larger cohorts of patients with cancer

Crossref

UCL Discovery

Edinburgh Research Explorer

Oxford University Research Archive

Analysis of exome data for 4293 trios suggests GPI-anchor biogenesis defects are a rare cause of developmental disorders.

Over 150 different proteins attach to the plasma membrane using glycosylphosphatidylinositol (GPI) anchors. Mutations in 18 genes that encode components of GPI-anchor biogenesis result in a phenotypic spectrum that includes learning disability, epilepsy, microcephaly, congenital malformations and mild dysmorphic features. To determine the incidence of GPI-anchor defects, we analysed the exome data from 4293 parent-child trios recruited to the Deciphering Developmental Disorders (DDD) study. All probands recruited had a neurodevelopmental disorder. We searched for variants in 31 genes linked to GPI-anchor biogenesis and detected rare biallelic variants in PGAP3, PIGN, PIGT (n=2), PIGO and PIGL, providing a likely diagnosis for six families. In five families, the variants were in a compound heterozygous configuration while in a consanguineous Afghani kindred, a homozygous c.709G>C; p.(E237Q) variant in PIGT was identified within 10-12 Mb of autozygosity. Validation and segregation analysis was performed using Sanger sequencing. Across the six families, five siblings were available for testing and in all cases variants co-segregated consistent with them being causative. In four families, abnormal alkaline phosphatase results were observed in the direction expected. FACS analysis of knockout HEK293 cells that had been transfected with wild-type or mutant cDNA constructs demonstrated that the variants in PIGN, PIGT and PIGO all led to reduced activity. Splicing assays, performed using leucocyte RNA, showed that a c.336-2A>G variant in PIGL resulted in exon skipping and p.D113fs*2. Our results strengthen recently reported disease associations, suggest that defective GPI-anchor biogenesis may explain ~0.15% of individuals with developmental disorders and highlight the benefits of data sharing

Southampton (e-Prints Soton)

Crossref

Oxford University Research Archive

St George's Online Research Archive

Structural and non-coding variants increase the diagnostic yield of clinical whole genome sequencing for rare diseases

Author: Allroggen Holger
Ansorge Olaf
Babbs Christian
Banka Siddharth
Baños-Piñero Benito
Beeson David
Ben-Ami Tal
Bennett David L.
Bento Celeste
Blair Edward
Brasch-Andersen Charlotte
Bull Katherine R.
Calpena Eduardo
Camps Carme
Cario Holger
Cilliers Deirdre
Conti Valerio
Dacal Beatriz Diez
Davies E. Graham
Dhalla Fatima
Dong Yin
Dreau Helene
Dunford James E.
Ferla Matteo
Giacopuzzi Edoardo
Guerrini Renzo
Harris Adrian L.
Hartley Jane
Hashim Mona
Hashimoto Akiko
Hollander Georg
Hughes Jim R.
Javaid Kassim
Kaisaki Pamela J.
Kane Maureen
Kelly Deirdre
Kelly Dominic
Kesim Yesim
Kini Usha
Knight Samantha J. L.
Kreins Alexandra Y.
Kvikstad Erika M.
Lange Lukas
Langman Craig B.
Lester Tracy
Lines Kate E.
Lord Simon R.
Lu Xin
Lunter Gerton
Mansour Sahar
Manzur Adnan
Maroofian Reza
Marsden Brian
Mason Joanne
McGowan Simon J.
Mei Davide
Mlcochova Hana
Murakami Yoshiko
Németh Andrea H.
Okoli Steven
Ormondroyd Elizabeth
Ousager Lilian Bomme
Pagnamenta Alistair T.
Palace Jacqueline
Patel Smita Y.
Pentony Melissa M.
Popitsch Niko
Pugh Chris
Rad Aboulfazl
Ragoussis Vassilis
Ramesh Archana
Riva Simone G.
Roberts Irene
Roy Noémi
Salminen Outi
Sanders Edward
Schilling Kyleen D.
Schuh Anna H.
Schwessinger Ron
Scott Caroline
Sen Arjune
Smith Conrad
Stevenson Mark
Taylor Jenny C.
Taylor John M.
Thakker Rajesh V.
Twigg Stephen R. F.
Uhlig Holm H.
van Wijk Richard
Vavoulis Dimitrios V.
Vona Barbara
Wall Steven
Wang Jing
Watkins Hugh
Wilkie Andrew O. M.
Yu Jing
Zak Jaroslav
Publication date: 09/11/2023

BACKGROUND: Whole genome sequencing is increasingly being used for the diagnosis of patients with rare diseases. However, the diagnostic yields of many studies, particularly those conducted in a healthcare setting, are often disappointingly low, at 25-30%. This is in part because although entire genomes are sequenced, analysis is often confined to in silico gene panels or coding regions of the genome. METHODS: We undertook WGS on a cohort of 122 unrelated rare disease patients and their relatives (300 genomes) who had been pre-screened by gene panels or arrays. Patients were recruited from a broad spectrum of clinical specialties. We applied a bioinformatics pipeline that would allow comprehensive analysis of all variant types. We combined established bioinformatics tools for phenotypic and genomic analysis with our novel algorithms (SVRare, ALTSPLICE and GREEN-DB) to detect and annotate structural, splice site and non-coding variants. RESULTS: Our diagnostic yield was 43/122 cases (35%), although 47/122 cases (39%) were considered solved when taking novel candidate genes with supporting functional data into account. Structural, splice site and deep intronic variants contributed to 20/47 (43%) of our solved cases. Five genes that are novel, or were novel at the time of discovery, were identified, whilst a further three genes are putative novel disease genes with evidence of causality. We identified variants of uncertain significance in a further fourteen candidate genes. The phenotypic spectrum associated with RMND1 was expanded to include polymicrogyria. Two patients with secondary findings in FBN1 and KCNQ1 were confirmed to have previously unidentified Marfan and long QT syndromes, respectively, and were referred for further clinical interventions. Clinical diagnoses were changed in six patients and treatment adjustments made for eight individuals, which for five patients was considered life-saving. CONCLUSIONS: Genome sequencing is increasingly being considered as a first-line genetic test in routine clinical settings and can make a substantial contribution to rapidly identifying a causal aetiology for many patients, shortening their diagnostic odyssey. We have demonstrated that structural, splice site and intronic variants make a significant contribution to diagnostic yield and that comprehensive analysis of the entire genome is essential to maximise the value of clinical genome sequencing.
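The percentages in the RESULTS section follow directly from the case counts. As a small check, the arithmetic can be reproduced as below. The variable and function names are illustrative only, and rounding to whole percentages is an assumption about how the figures were reported.

```python
# Recompute the reported percentages from the case counts in the abstract.

def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)

total_cases = 122            # unrelated rare disease patients in the cohort
diagnosed = 43               # cases counted in the diagnostic yield
solved_with_candidates = 47  # solved when novel candidate genes are included
non_coding_or_sv = 20        # solved cases involving structural, splice site
                             # or deep intronic variants

diagnostic_yield = percent(diagnosed, total_cases)
extended_yield = percent(solved_with_candidates, total_cases)
sv_share = percent(non_coding_or_sv, solved_with_candidates)

print(diagnostic_yield, extended_yield, sv_share)
```

Note that the 43% figure is a share of the 47 solved cases, not of the full cohort of 122.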

University of Birmingham Research Portal

The University of Manchester - Institutional Repository