22 research outputs found

    Developing and evaluating tools to improve the quality of DNA methylation association studies

    There is increasing interest in studying DNA methylation in the context of health and disease. A number of technical and analytical considerations are important to take into account when designing and interpreting DNA methylation studies, such as the experimental parameters used when quantifying DNA methylation differences between individuals and how best to account for study confounders, such as cellular composition. This thesis aims to address these issues by first developing a method to assess study power in bisulfite sequencing (BS) studies, second establishing a method for the estimation of error across reference based cellular deconvolution models, and third generating a novel reference based DNA methylation deconvolution model for the brain incorporating data for three neural cell types. In Chapter 2 the impact of bisulfite sequencing depth and sample size on power is investigated. It is shown that study power is not dependent on one specific parameter, but reflects the combination of multiple study-specific variables. Data simulation is utilised to generate an interactive tool for use by the wider research community that can be used to estimate the power of BS studies based on user-defined input variables including sample size and read depth filtering. In Chapter 3 an error metric is established for reference based cellular deconvolution approaches using DNA methylation data, which is validated using datasets derived from both blood and brain tissue. In Chapter 4 the reference based deconvolution model utilised for the deconvolution of brain tissue is refined to include an additional cell type, resulting in a three cell type model. The model was applied to bulk brain DNA methylation samples, showing that the addition of a third cell type improved insight gained from data generated on bulk brain tissue. 
Overall, this thesis aims to generate tools which can be utilised to better design and interpret DNA methylation studies, all of which have been made publicly available. This thesis also encourages researchers to clearly communicate any DNA methylation quality control decisions made and examine their methodologies to improve the transparency and reproducibility of their findings. Funder: Biotechnology & Biological Sciences Research Council (BBSRC)

    Integrative bioinformatics applications for complex human disease contexts

    This thesis presents new methods for the analysis of high-throughput data from modern sources in the context of complex human diseases, using a bioinformatics analysis workflow as an example. New measurement techniques improve the resolution with which cellular and molecular processes can be monitored. While RNA sequencing (RNA-seq) measures mRNA expression, single-cell RNA-seq (scRNA-seq) resolves this on a per-cell basis. Long-read sequencing is increasingly used in genomics. With imaging mass spectrometry (IMS), protein levels in tissues are measured with spatial resolution. All these techniques pose specific challenges, which need to be addressed with new computational methods. Collecting knowledge with contextual annotations is important for integrative data analyses. Such knowledge is available through large literature repositories, from which information, such as miRNA-gene interactions, can be extracted using text mining methods. After aggregating this information in new databases, specific questions can be answered with traceable evidence. The combination of experimental data with these databases offers new possibilities for data integrative methods and for answering questions relevant for complex human diseases. Several data sources are made available, such as literature for text mining miRNA-gene interactions (Chapter 2), next- and third-generation sequencing data for genomics and transcriptomics (Chapters 4.1, 5), and IMS for spatially resolved proteomics (Chapter 4.4). For these data sources new methods for information extraction and pre-processing are developed. For instance, third-generation sequencing runs can be monitored and evaluated using the poreSTAT and sequ-into methods. The integrative (down-stream) analyses make use of these (heterogeneous) data sources. The cPred method (Chapter 4.2) for cell type prediction from scRNA-seq data was successfully applied in the context of the SARS-CoV-2 pandemic. 
The robust differential expression (DE) analysis pipeline RoDE (Chapter 6.1) contains a large set of methods for (differential) data analysis, reporting and visualization of RNA-seq data. Topics of accessibility of bioinformatics software are discussed alongside practical applications (Chapter 3). The developed miRNA-gene interaction database gives valuable insights into atherosclerosis-relevant processes and serves as a regulatory network for the prediction of active miRNA regulators in RoDE (Chapter 6.1). The cPred predictions, RoDE results, scRNA-seq and IMS data are unified as input for the 3D-index Aorta3D (Chapter 6.2), which makes atherosclerosis-related datasets browsable. Finally, the scRNA-seq analysis with subsequent cPred cell type prediction, and the robust analysis of bulk RNA-seq datasets, led to novel insights into COVID-19. Taken together, the discussed methods improve integrative analysis for complex human disease contexts at essential points.
German abstract (translated): The dissertation describes methods for processing current high-throughput data, as well as procedures for their further integrative analysis. This analysis is applied primarily in the context of complex human diseases. New measurement techniques allow a more detailed observation of biomedical processes. RNA sequencing (RNA-seq) measures mRNA expression, and modern single-cell RNA-seq (scRNA-seq) does so even for (very many) individual cells. Long-read sequencing is increasingly used to sequence whole genomes. Imaging mass spectrometry (IMS) allows proteins in tissues to be quantified with spatial resolution. These techniques bring specific challenges that must be addressed with new bioinformatics methods. Obtaining suitable contextual knowledge is also important for integrative data analysis. Scientific findings are published in articles that are accessible through large literature databases. Text mining can extract information from them, such as miRNA-gene interactions, which are aggregated in dedicated databases in order to answer specific questions with traceable evidence. Combined with experimental data, this opens up new possibilities for integrative methods. By extracting and pre-processing raw data, several data sources are made accessible, such as literature for text mining of miRNA-gene interactions (Chapter 2), long-read and RNA-seq data for genomics and transcriptomics (Chapters 4.2, 5), and IMS for protein measurements (Chapter 4.4). For example, the poreSTAT and sequ-into methods serve to pre-process and evaluate long-read sequencing runs. The integrative (down-stream) analysis uses these (heterogeneous) data sources. For determining cell types in scRNA-seq experiments, the cPred method (Chapter 4.2) was successfully applied in the context of the SARS-CoV-2 pandemic. The robust pipeline RoDE, which provides many methods for (differential) data analysis, reporting and visualization (Chapter 6.1), was also applied there. Topics concerning the usability of (bioinformatics) software are discussed using practical applications (Chapter 3). The developed miRNA-gene interaction database gives valuable insights into atherosclerosis-relevant processes and serves as a regulatory network for the prediction of active miRNA regulators in RoDE (Chapter 6.1). The cPred method, RoDE results, scRNA-seq and IMS data are brought together in the 3D-index Aorta3D (Chapter 6.2), which makes relevant datasets searchable. The discussed methods lead to considerable improvements for integrative data analysis in complex human disease contexts.

    Evaluating the role of social attention in the causal path to Autism Spectrum Disorder

    This thesis evaluated the evidence for the hypothesis that early disruptions in social attention are involved in the causal pathway to Autism Spectrum Disorder (ASD). The sample included infants at high and low familial risk for neurodevelopmental disorders participating in a prospective longitudinal study, and their family members. Five studies were conducted to test whether social attention atypicalities precede the onset of behavioural symptoms and whether they are related to familial, genetic and epigenetic burden for ASD. Chapter 2 examined neural correlates of attention measured with multi-channel electroencephalography in 8-month-old infants attending to faces and non-social stimuli, in relation to outcomes at age 3. Chapter 3 used structural equation modelling to investigate whether disruptions in neural response have cascading effects on learning from the environment via looking behaviour. Next, to further understand whether disruptions in social attention lie between genetic risk and ASD phenotype, Chapter 4 examined the association between ability to detect eye-gaze direction in a familial sample, severity of ASD symptoms and polygenic risk for ASD. Chapter 5 explored these patterns earlier in development, looking at the relationship between social attention at 14 months of age and familial burden, polygenic risk and parent-report traits of ASD and ADHD. Finally, Chapter 6, leveraging DNA methylation data, explored whether epigenetic signals were associated with early neural and behavioural correlates of social attention as well as developmental change leading to atypical outcome. Taken together, this work examined in depth the multifaceted nature of social attention, pointing to neural and behavioural atypicalities at critical time points as promising targets for cognitive and affective interventions. 
Furthermore, it pioneers future work integrating genetics, epigenetics and early neurocognitive measures of social attention in large prospective longitudinal studies of individuals at increased vulnerability for neurodevelopmental disorders, to shed light on the developmental mechanisms underlying the emergence of ASD

    Knowledge Discovery with Bayesian Networks

    Ph.D., Doctor of Philosophy

    A Genome-wide Association Study of Schizophrenia in the South African Xhosa and Generalizability of Polygenic Risk Score across African populations

    African populations are vastly underrepresented in genetic studies despite having the most genetic variation globally and facing wide-ranging environmental exposures. Most of these studies have been conducted in populations of European (EUR) ancestry using GWAS arrays that represent the genetic variation in these populations. Thus, polygenic risk scores (PRS) derived from EUR ancestry populations predict less accurately in populations of non-European ancestry, and least accurately in African (AFR) ancestry populations. The extent to which PRS prediction accuracy varies within AFR ancestry populations has not, however, been previously investigated. This study had two aims: the first was to investigate the contribution of common variants to the risk of schizophrenia in the South African Xhosa (SAX) population through genome-wide association study (GWAS) analysis, and to determine if PRS derived from EUR and East Asian (EAS) ancestry populations from the Psychiatric Genomics Consortium (PGC) Schizophrenia Working Group were generalizable to SAX. The second aim was to assess the generalizability of PRS for non-psychiatric phenotypes that were derived from EUR ancestry individuals from the UK Biobank (UKB, n = ~350,000) in the Uganda General Population Cohort (GPC, n = 4,778) and the South African Drakenstein Child Health Study (DCHS, n = 638). To address the first aim, a GWAS was conducted in 2,086 Xhosa individuals from South Africa with and without schizophrenia (1,038 cases; 1,048 controls) using a custom Affymetrix GWAS array designed to capture variation in the Xhosa population. The schizophrenia GWAS in SAX yielded one SNP (rs35172303; P = 4.74e-08, OR = 0.6004, 95% CI: [0.499, 0.721]) in ZFP3 that met genome-wide significance. 
The association of variants in ZFP3 from the schizophrenia GWAS is consistent with those from an earlier exome sequencing study in SAX undertaken by colleagues, but this gene has not previously been associated with schizophrenia in large-scale schizophrenia GWAS of predominantly EUR ancestry. After characterizing the genetic architecture of schizophrenia in SAX, it was found that the heritability was enriched across functional categories involved in the regulation of gene expression. Then, the accuracy of PRS derived from the PGC Schizophrenia Working Group for both EUR and EAS ancestries in predicting schizophrenia in SAX was quantified. There was low PRS prediction accuracy using PGC-derived summary statistics in SAX (PGC-EUR: max R² = 0.0057, P = 0.008; PGC-EAS: max R² = 0.0059, P = 0.007). These findings are consistent with previous findings that showed that PRS prediction accuracy is low when discovery and target cohorts come from different ancestral backgrounds. For the second aim, PRS prediction accuracy was quantified in simulations using data from the African Genome Variation Project (AGVP) to represent continental AFR diversity. Samples were categorised by geographical region into West, East and South Africa cohorts. Each cohort was divided into discovery and target datasets. The West and East African discovery data were used to predict the simulated phenotype in the three target cohorts. Using UKB EUR ancestry individuals, PRS prediction accuracy was assessed for 34 anthropometric and blood panel traits in the Uganda GPC; UKB was then meta-analysed with PAGE (Population Architecture using Genomics and Epidemiology, comprising about 50,000 Latino/Hispanic and African-American individuals) and BBJ (Biobank Japan, n = ~162,000) to assess how the inclusion of diverse samples impacts PRS prediction accuracy. 
Simulations were limited by sample size but showed that PRS prediction accuracy was highest when the discovery and target cohorts were matched by African region, and for phenotypes with the sparsest genetic architecture. Using empirical data from UKB and the Uganda GPC, a low prediction accuracy was observed across all 34 quantitative traits in GPC when using GWAS data from UKB. There was differential prediction accuracy across AFR ancestry groups within UKB, i.e. the prediction accuracy was highest for the Ethiopian and admixed populations, and lowest for southern African populations. When comparing PRS prediction accuracy of East African individuals from the UKB to that of individuals from GPC, the prediction accuracy was lowest in the Ugandan GPC population, indicating that the difference in environments between the two groups may be contributing to the difference in PRS accuracy. Moreover, the cross-ancestry meta-analyses showed that the inclusion of diverse samples in large scale studies improves PRS prediction accuracy, most especially for phenotypes with population-enriched variants. It was demonstrated for the first time in this thesis that EUR ancestry-derived PRS prediction accuracy varied within continental AFR ancestry groups, and tracks with population history and the evolution of humans. The higher prediction accuracy observed in Ethiopians can be explained by their genetic proximity to Europeans as a result of the back to Africa migration, whereas the southern African populations (including SAX) are more proximal to the ancestral populations that never left the continent. It is therefore imperative to not only include more African samples in future large-scale studies, but to have samples that adequately represent the genetic and environmental diversity on the African continent