
17 research outputs found

Identification of microbial community in the urban environment: The concordance between conventional culture and nanopore 16S rRNA sequencing

Author: Annie Wing-Tung Lee
Cheuk-Yi Yip
Chloe Toi-Mei Chan
Gilman Kit-Hang Siu
Hiu-Yin Lao
Ivan Tak-Fai Wong
Jake Siu-Lun Leung
Kai-Chun Cheng
Lam-Kwong Lee
Lily Lok-Yee Wong
Timothy Ting-Leung Ng
Wing-Tung Lui
Publication venue: 'Frontiers Media SA'
Publication date: 01/04/2023

Introduction: Microbes in the built environment have been implicated as a source of infectious diseases. Bacterial culture is the standard method for assessing the risk of exposure to pathogens in urban environments, but this method only accounts for <1% of the diversity of bacteria. Recently, full-length 16S rRNA gene analysis using nanopore sequencing has been applied for microbial evaluations, resulting in a rise in the development of long-read taxonomic tools for species-level classification. However, information on their comparative performance is lacking. Methods: Here, we aim to analyze the concordance of the microbial community in the urban environment inferred by multiple taxonomic classifiers, including ARGpore2, Emu, Kraken2/Bracken and NanoCLUST, using our 16S-nanopore dataset generated by MegaBLAST, as well as assess their abilities to identify culturable species based on the conventional culture results. Results: According to our results, NanoCLUST was preferred for 16S microbial profiling because it had a high concordance of dominant species and a similar microbial profile to MegaBLAST, whereas Kraken2/Bracken, which had clustering results similar to those of NanoCLUST, was also desirable. Second, for culturable species identification, Emu is suggested, as it had the highest accuracy (81.2%) and F1 score (29%) for the detection of culturable species. Discussion: In addition to generating datasets in complex communities for future benchmarking studies, our comprehensive evaluation of the taxonomic classifiers offers recommendations for ongoing microbial community research, particularly for complex communities using nanopore 16S rRNA sequencing.
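
The abstract above reports a high accuracy (81.2%) alongside a low F1 score (29%) for the same classifier. The following is a minimal sketch of how the two metrics can diverge when true negatives dominate. The function name and all counts are hypothetical illustrations, not values from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts (assumption): most candidate species are true negatives,
# so accuracy stays high even though few positive calls are correct.
metrics = classification_metrics(tp=5, fp=20, fn=5, tn=170)
print(metrics)
```

Because F1 ignores true negatives, it is usually the more informative figure when the species of interest are rare among all candidates.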

Directory of Open Access Journals

Threats from the air: Damselfly predation on diverse prey taxa

Author: Forbes MR
Kaunisto KM
Lilley TM
Morrill A
Puisto AIE
Roslin T
Sääksjärvi IE
Vesterinen EJ
Publication venue: 'Wiley'
Publication date: 28/10/2022

To understand the diversity and strength of predation in natural communities, researchers must quantify the total amount of prey species in the diet of predators. Metabarcoding approaches have allowed widespread characterization of predator diets with high taxonomic resolution. To determine the wider impacts of predators, researchers should combine DNA techniques with estimates of population size of predators using mark-release-recapture (MRR) methods, and with accurate metrics of food consumption by individuals. Herein, we estimate the scale of predation exerted by four damselfly species on diverse prey taxa within a well-defined 12-ha study area, resolving the prey species of individual damselflies, to what extent the diets of predatory species overlap, and which fraction of the main prey populations are consumed. We identify the taxonomic composition of diets using DNA metabarcoding and quantify damselfly population sizes by MRR. We also use predator-specific estimates of consumption rates, and independent data on prey emergence rates to estimate the collective predation pressure summed over all prey taxa and specific to their main prey (non-biting midges or chironomids) of the four damselfly species. The four damselfly species collectively consumed a prey mass equivalent to roughly 870 (95% CL 410-1,800) g, over 2 months. Each individual consumed 29%-66% (95% CL 9.4-123) of its body weight during its relatively short life span (2.1-4.7 days; 95% CL 0.74-7.9) in the focal population. This predation pressure was widely distributed across the local invertebrate prey community, including 4 classes, 19 orders and c. 140 genera. 
Different predator species showed extensive overlap in diets, with an average of 30% of prey shared by at least two predator species. Of the available prey individuals in the widely consumed family Chironomidae, only a relatively small proportion (0.76%; 95% CL 0.35%-1.61%) were consumed. Our synthesis of population sizes, per-capita consumption rates and taxonomic distribution of diets identifies damselflies as a comparatively minor predator group of aerial insects. As the next step, we should add estimates of predation by larger odonate species, and experimental removal of odonates, thereby establishing the full impact of odonate predation on prey communities.
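
The abstract above combines mark-release-recapture population estimates with per-capita consumption to scale up predation. The sketch below illustrates that kind of calculation under strong simplifying assumptions: it uses the two-sample Chapman estimator for a closed population, which the abstract does not name and which may differ from the authors' model. All function names and numbers are hypothetical.

```python
def chapman_estimate(marked_first, caught_second, recaptured):
    """Chapman's bias-corrected two-sample estimate of population size.

    Assumes a closed population (no births, deaths or migration between
    visits), which is a simplification for short-lived damselflies.
    """
    return (marked_first + 1) * (caught_second + 1) / (recaptured + 1) - 1


def total_prey_mass(population, body_mass_g, fraction_of_body_mass):
    """Prey mass consumed if every individual eats a fixed share of its body mass."""
    return population * body_mass_g * fraction_of_body_mass


# Hypothetical survey (assumption): 99 marked, 49 caught later, 9 of them marked.
population = chapman_estimate(marked_first=99, caught_second=49, recaptured=9)

# Hypothetical 0.03 g body mass and 50% of body mass consumed per lifetime.
prey_mass_g = total_prey_mass(population, body_mass_g=0.03, fraction_of_body_mass=0.5)
print(population, prey_mass_g)
```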

UTUPub

Antimicrobial use and production system shape the fecal, environmental, and slurry resistomes of pig farms

Author: Argüello Rodríguez Héctor
Cabrera Rubio Raúl
Carvajal Urueña Ana María
Cobo Díaz José Francisco
Cotter Paul D.
Crispie Fiona
Gómez García Manuel
Mencía Ares Oscar
Puente Fernández Héctor
Rubio Nistal Pedro Miguel
Álvarez Ordóñez Avelino
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020

Background: The global threat of antimicrobial resistance (AMR) is a One Health problem impacted by antimicrobial use (AMU) for human and livestock applications. Extensive Iberian swine production is based on a more sustainable and eco-friendly management system, providing an excellent opportunity to evaluate how sustained differences in AMU impact the resistome, not only in the animals but also on the farm environment. Here, we evaluate the resistome footprint of an extensive pig farming system, maintained for decades, as compared to that of industrialized intensive pig farming by analyzing 105 fecal, environmental and slurry metagenomes from 38 farms. Results: Our results provide evidence of a significantly higher abundance of antimicrobial resistance genes (ARGs) on intensive farms and of a link between AMU and AMR for certain antimicrobial classes. We observed differences in the resistome across sample types, with a higher richness and dispersion of ARGs within environmental samples than within those from feces or slurry. Indeed, a deeper analysis revealed that differences among the three sample types were defined by taxa-ARG associations. Interestingly, mobilome analyses revealed that the observed AMR differences between intensive and extensive farms could be linked to differences in the abundance of mobile genetic elements (MGEs). Thus, while there were no differences in the abundance of chromosomal-associated ARGs between intensive and extensive herds, a significantly higher abundance of integrons in the environment and plasmids, regardless of the sample type, was detected on intensive farms. Conclusions: Overall, this study shows how AMU, production system, and sample type influence, mainly through MGEs, the profile and dispersion of ARGs in pig production.

T-Stór

Cork Open Research Archive

Leon University (Spain)

Polarella genomics: understanding the evolutionary transition to algal symbiosis and cold adaptation

Author: Stephens Timothy
Publication venue: 'University of Queensland Library'
Publication date: 08/11/2019

University of Queensland eSpace

Investigating relationships in the New Zealand alpine Ranunculus using RNA sequencing : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Plant Biology at Massey University, Manawatū, New Zealand

Author: Henry John Curteis
Publication venue: 'Massey University'
Publication date: 01/01/2021

Whether the alpine flora of New Zealand is resilient enough to withstand the effects of climate change is an important and unanswered question. Conspicuous amongst the alpine flora are the species of Ranunculus in section Pseudadonis. This monophyletic group of species is hypothesised to have rapidly diversified into distinct mountain habitats, with some species convergently evolving into similar habitats. Investigating how cryptic physiologies have convergently evolved in some Ranunculus species may provide insight into the adaptive potential of this group of plants. It has been argued that hybridisation is an important evolutionary process explaining the morphological and ecological variation of New Zealand alpine Ranunculus species. Hybridisation, and in particular introgression, has also been hypothesised elsewhere as an effective means for closely-related species to share genetic material and undergo rapid adaptation through selection of standing genetic variation. This research aimed to use RNA sequencing technology to address questions of physiology and phylogeny amongst four taxa of the alpine Ranunculus group. Habitat characterisation was carried out before plants were sampled and grown under standardised conditions in a common garden experiment. Bioinformatic approaches were used to analyse high-throughput sequencing data of RNA extracted from these laboratory-grown plants. This research illustrates the potential of RNA sequencing for studying non-model plant species. However, the conservative analytical approaches adopted, and noise within the data, limited inferences of physiological traits and evolutionary relationships. Analyses of heterozygosity and issues with de novo transcriptome assembly suggested greater numbers of gene variants than expected for these small, isolated populations of alpine plants. These gene variants likely occur because of the polyploid genomes of the New Zealand alpine Ranunculus. 
Further work is needed, however, to confirm this genetic diversity. Overall, this work reinforces the difficulties in studying non-model polyploid systems. Yet, it does hint at a genetic richness within the alpine Ranunculus that might aid survival of this clade during a rapidly changing future.

Massey Research Online


Design and Implementation of Environmental DNA Metabarcoding Methods for Monitoring the Southern California Marine Protected Area Network

Author: Gold Zachary Jacob
Publication venue: eScholarship, University of California
Publication date: 01/01/2020

Marine protected areas (MPAs) are important tools for maintaining biodiversity and abundance of marine species. However, key to the effectiveness of MPAs is monitoring of marine communities. Current monitoring methods rely heavily on SCUBA-based visual observations that are costly and time-consuming, limiting the scope of MPA monitoring. Environmental DNA (eDNA) metabarcoding is a promising cost-effective, rapid, and automatable alternative for marine ecosystem monitoring. However, as a developing tool, the utility of eDNA metabarcoding requires improved bioinformatic techniques and reference barcode databases. Furthermore, it is important to understand how eDNA metabarcoding performs relative to visual surveys to better understand the strengths and limitations of each approach. This thesis improves eDNA metabarcoding approaches to survey the nearshore rocky reef and kelp forest ecosystems within the Southern California MPA network. It then tests the effectiveness of eDNA metabarcoding against visual surveys conducted by the Channel Islands National Park Service Kelp Forest Monitoring Program and Reef Check California. In Chapter 1, I develop FishCARD, a 12S reference barcode database specific to fishes of the California Current ecosystem. FishCARD improves eDNA metabarcoding taxonomic assignments, resulting in the identification of a broader array of marine vertebrate diversity, including invasive, endangered, and mobile species frequently missed by visual surveys. In Chapter 2, I compare eDNA metabarcoding and visual underwater survey methods inside, on the edge of, and outside the Scorpion State Marine Reserve off Santa Cruz Island. We demonstrate that eDNA captures a broader range of fish taxa than visual surveys and detects fine-scale spatial differences in fish communities. 
In Chapter 3, I demonstrate that eDNA metabarcoding and visual underwater surveys capture similar biogeographic patterns of fish communities across 44 sites within the Southern California Bight. Importantly, eDNA methods distinguished fish communities inside and outside of Southern California MPAs, finding a greater abundance of target species inside MPAs, matching patterns observed through visual surveys. These results built on the collaborative development of the Anacapa Toolkit metabarcoding pipeline. Together, I demonstrate the utility of eDNA metabarcoding for monitoring MPAs, providing an important complementary tool to visual methods, helping expand MPA monitoring across space, time, and depth.

eScholarship - University of California

Sequence data mining and characterisation of unclassified microbial diversity

Author: Modha Sejal
Publication venue
Publication date: 01/01/2022

In the last two decades, sequencing has become increasingly affordable and a routine tool to study the microbial community of a given environment. Metagenomics has revolutionised the way microbes are identified and studied in this age of biological data science because it provides a relatively unbiased view of the composition of microbial communities we interact with every day, which are integral to our ecosystem. These technological advances have led to an exponential growth of raw data repositories that save, distribute and archive these metagenomic datasets. Since metagenomics presents the ultimate opportunity to capture, explore and identify uncultivated microbial genomic sequences, these metagenomic datasets harbour a large proportion of unknown sequences that do not bear any similarity to known sequences readily available in the standard sequence data repositories. The aim of this thesis was to systematically catalogue, quantify and potentially characterise the unknown sequences embedded within the metagenomic datasets. To this end, a comprehensive, portable, modular framework called UnXplore was developed to determine the proportion of unknown sequences included in human microbiome datasets. UnXplore was applied to a range of different human microbiomes and showed that on average 2% of assembled sequences were categorised as unknown, meaning that they did not bear any sequence similarity to known sequences. A third of the unknown sequences were shown to contain large open reading frames, indicating the coding potential and biological origin of the unknowns. Furthermore, a small proportion of these potentially coding sequences were shown to have functional similarities as they were deemed to contain known protein domain signatures. These results indicated that unknown sequences captured through the UnXplore framework were not artefacts and were indeed of biological origin. 
To test this formally, supervised k-mer-based machine learning models were devised, tested and validated. These models are currently distributed in a package called TetraPredX that can accurately predict whether a sequence originated from bacteria, archaea, virus or plasmid. TetraPredX models were applied to the unknown sequence dataset and revealed that the majority of unknown sequences are of biological origin. Furthermore, TetraPredX results demonstrated that >70% of all long unknown sequences (i.e. >1 kb) are likely to be of virus origin, indicating an unexplored diversity of viruses that is yet to be fully characterised and classified. In order to catalogue the diversity of virus sequences in human microbiome samples analysed here, an extensive virus discovery analysis was carried out on the contigs assembled through UnXplore. This helped to characterise a vast diversity of prokaryotic, eukaryotic and unclassified virus sequences captured in a range of human microbiomes. The results obtained here demonstrate the need to systematically interrogate metagenomic datasets to fully comprehend and compile the presence of both known and unknown uncultivated microbes within them. A comprehensive survey of metagenomic datasets carried out in this manner would provide a more complete picture of the known and unknown organisms that surround us.
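
The abstract above mentions supervised k-mer-based models for predicting sequence origin. The sketch below shows one common way to turn a nucleotide sequence into a fixed-length k-mer frequency vector that such a model could consume. The choice of k = 4 (tetranucleotides), the function name and the example sequence are assumptions for illustration; this is not the TetraPredX implementation.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return relative k-mer frequencies in a fixed lexicographic order.

    Windows containing characters other than A, C, G, T (for example N)
    are skipped. With k=4 the vector has 4**4 = 256 entries.
    """
    seq = seq.upper()
    alphabet = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= alphabet:
            counts[kmer] += 1
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[km] / total if total else 0.0 for km in all_kmers]


# Hypothetical toy sequence (assumption), far shorter than a real contig.
vector = kmer_frequencies("ACGTACGTNACGT")
print(len(vector), sum(vector))
```

Vectors like this can be passed to any standard supervised classifier; which model family the thesis used is not stated in the abstract.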

Glasgow Theses Service

Algorithms and Applications for non-coding RNAs in Aging

Author: Kern Fabian
Publication venue: Saarländische Universitäts- und Landesbibliothek
Publication date: 01/01/2021

Gene expression is a complex molecular process governing fate and function of most eukaryotic cells. The fundamental mechanism, namely that genetic material of a cell is compactly stored on chromosomal DNA and at times transcribed into messenger-RNA to facilitate on-demand protein biosynthesis, is widely known. However, the interplay of biochemical regulatory pathways underlying an individual’s disease phenotype development remains incompletely understood. Intriguingly, the ~20,000 protein-coding genes only account for 2% of the human genome, triggering profound questions on the purpose of remaining segments. In recent years it became apparent that non-coding RNAs essentially tune the observed gene expression circuits. In particular, small non-coding RNAs such as microRNAs turned out to be regulatory players by switching on and off protein translation of target messenger-RNAs. Several thousand mammalian microRNAs have been discovered so far but little is known about their impact on the transcriptome, which likely depends on contextual variables like cell type identity, cellular and tissue environment or phase of activation. Previous efforts demonstrated that gene expression programs in human and mouse undergo gradual changes along the life trajectory with amplification at higher ages. In parallel, age-related diseases are currently accumulating in our globally aging population, posing a serious challenge to our society and healthcare systems. Neurodegenerative disorders such as Alzheimer’s disease and Parkinson’s disease show steadily rising incidence rates with several million people already affected. Both are caused by pathological protein accumulation in selectively vulnerable neurons and brain regions. Notably, these neurological disorders do not appear all of a sudden in an individual but are believed to originate after long asymptomatic phases of subtle aberrant changes on the cellular level, turning early diagnosis into an intricate affair. 
Yet, no single comprehensive model to explain aging-associated changes in gene expression exists and certainly any such model must take into account the role of microRNAs and other important non-coding RNAs. With the advent of ultra-high-throughput sequencing techniques and unprecedented computational power, the screening of microRNAs and their targets from human biofluids and tissues became not only affordable but scalable. To deal with the increasing complexity of molecular studies, novel bioinformatics-driven approaches are needed to generate reproducible and comprehensive conclusions from large-scale data sets. Here, the role of small non-coding RNAs in governing gene expression changes observed in complex age-related diseases is explored with the aid of new methods and databases as well as several thousand RNA profiling samples. This cumulative doctoral thesis comprises eight peer-reviewed publications. Basic research covers a comprehensive review on most target prediction tools and a novel experimental and computational workflow for microRNA-target pathway identification. In addition, with miRPathDB 2.0 the largest database so far on enriched microRNA pathways for human and mouse is presented. Moreover, the new versatile web tool miEAA 2.0 allows rapid annotation of statistically enriched molecular properties and functions for large lists of microRNAs from ten species. The lessons learned from web-based tool development were condensed in an invited summary and survey article on scientific web server availability along with best practices for developers. The toolkit presented here was used in three applied research studies to investigate the association between microRNAs and their target pathways in the context of aging as well as in the largest Parkinson’s disease biomarker discovery framework to date. 
Circulating microRNAs obtained minimally invasively from whole-blood samples bear diagnostic and prognostic value in Alzheimer’s and Parkinson’s disease patients, which was discovered using machine learning models. Furthermore, selected microRNA families were found to systematically target entire signaling pathways so as to effectively silence gene expression. Indeed, these pathways are affected in prevalent neurodegenerative disorders. Taken together, the published candidate signatures and validated targets are pivotal for subsequent experimental perturbation in microRNA or gene knockout studies. In future efforts, large-scale single-cell studies will be required to further dissect disease and cell-type specificity of aging disease biomarker candidates and their long-term effect on gene expression, possibly indicating early neuropathological hallmarks.

Universaar


Interpretable Machine Learning Methods for Prediction and Analysis of Genome Regulation in 3D

Author: Nikumbh Sarvesh
Publication venue: 'Walter de Gruyter GmbH'
Publication date: 01/01/2019

With the development of chromosome conformation capture-based techniques, we now know that chromatin is packed in three-dimensional (3D) space inside the cell nucleus. Changes in the 3D chromatin architecture have already been implicated in diseases such as cancer. Thus, a better understanding of this 3D conformation is of interest to help enhance our comprehension of the complex, multipronged regulatory mechanisms of the genome. The work described in this dissertation largely focuses on development and application of interpretable machine learning methods for prediction and analysis of long-range genomic interactions output from chromatin interaction experiments. In the first part, we demonstrate that the genetic sequence information at the genomic loci is predictive of the long-range interactions of a particular locus of interest (LoI). For example, the genetic sequence information at and around enhancers can help predict whether it interacts with a promoter region of interest. This is achieved by building string kernel-based support vector classifiers together with two novel, intuitive visualization methods. These models suggest a potential general role of short tandem repeat motifs in the 3D genome organization. But, the insights gained out of these models are still coarse-grained. To this end, we devised a machine learning method, called CoMIK for Conformal Multi-Instance Kernels, capable of providing more fine-grained insights. When comparing sequences of variable length in the supervised learning setting, CoMIK can not only identify the features important for classification but also locate them within the sequence. Such precise identification of important segments of the whole sequence can help in gaining de novo insights into any role played by the intervening chromatin towards long-range interactions. 
Although CoMIK primarily uses only genetic sequence information, it can also simultaneously utilize other information modalities such as the numerous functional genomics data if available. The second part describes our pipeline, pHDee, for easy manipulation of large amounts of 3D genomics data. We used the pipeline for analyzing HiChIP experimental data for studying the 3D architectural changes in Ewing sarcoma (EWS), which is a rare cancer affecting adolescents. In particular, HiChIP data for two experimental conditions, doxycycline-treated and untreated, and for primary tumor samples is analyzed. We demonstrate that pHDee facilitates processing and easy integration of large amounts of 3D genomics data analysis together with other data-intensive bioinformatics analyses.
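
The abstract above refers to string kernel-based support vector classifiers for DNA sequences. As a rough illustration of what a string kernel computes, the sketch below implements the k-mer spectrum kernel, one of the simplest string kernels. This is a swapped-in example: the abstract does not say which string kernel the dissertation uses, and the function names and sequences here are hypothetical.

```python
from collections import Counter


def spectrum(seq, k):
    """Count every length-k substring (k-mer) of the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of two sequences.

    Sequences of different lengths are comparable because both are
    mapped into the same k-mer feature space.
    """
    spec_a, spec_b = spectrum(a, k), spectrum(b, k)
    return sum(count * spec_b[kmer] for kmer, count in spec_a.items())


# Hypothetical toy sequences (assumption).
print(spectrum_kernel("ACGTACGT", "ACGT", k=2))
print(spectrum_kernel("ACGT", "TTTT", k=2))
```

A matrix of such pairwise values can be supplied to a support vector classifier that accepts precomputed kernels.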

Universaar


MPG.PuRe