20 research outputs found

    Imputation Aided Methylation Analysis

    Genome-wide DNA methylation analysis is of broad interest to medical research because of its central role in human development and disease. However, generating high-quality methylomes on a large scale is particularly expensive due to technical issues inherent to DNA treatment with bisulfite, requiring deeper than usual sequencing. In silico methodologies, such as imputation, can be used to address this limitation and improve the coverage and quality of data produced in these experiments. Imputation is a statistical technique where missing values are substituted with computed values. The process involves leveraging information from reference data to calculate probable values for missing data points. In this thesis, imputation is explored for its potential to increase the value of methylation datasets sequenced at different depths:
    1. First, a new R package, Methylation Analysis ToolkiT (MATT), was developed to deal with large numbers of WGBS datasets in a computationally and memory-efficient manner.
    2. Second, the performance of DNA methylation-specific and generic imputation tools was assessed by down-sampling high-quality (100x) WGBS datasets to determine the extent to which missing data can be recovered and the accuracy of imputed values.
    3. Third, to overcome shortfalls within existing tools, a novel imputation tool was developed, termed Global IMputation of cpg MEthylation (GIMMEcpg). GIMMEcpg's default implementation is based on Model Stacking and outperforms existing tools in accuracy and speed.
    4. Lastly, to demonstrate its potential, GIMMEcpg was used to impute ten shallow (17x) WGBS datasets from healthy volunteers of the Personal Genome Project UK with high accuracy.
    Moreover, the extent of missing and low-quality data, as well as the reproducibility and accuracy of methylation datasets, were explored for different data types (Microarrays, Reduced Representation Bisulfite Sequencing (RRBS), Whole Genome Bisulfite Sequencing (WGBS), EM-Seq and Nanopore sequencing).
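
The down-sampling evaluation described in this abstract can be illustrated with a minimal sketch. It is not MATT or GIMMEcpg: the function names `impute_from_neighbours` and `mask_and_score` are hypothetical, and the imputation rule (mean of the nearest observed neighbours) is a simple stand-in chosen only to show the mask, impute and score loop.

```python
import random


def impute_from_neighbours(values):
    """Fill None entries with the mean of the nearest observed value on each side.

    Hypothetical stand-in for a real imputation tool; positions with no
    observed neighbour on either side stay None.
    """
    imputed = list(values)
    for i, value in enumerate(values):
        if value is not None:
            continue
        left = next((values[j] for j in range(i - 1, -1, -1)
                     if values[j] is not None), None)
        right = next((values[j] for j in range(i + 1, len(values))
                      if values[j] is not None), None)
        observed = [x for x in (left, right) if x is not None]
        if observed:
            imputed[i] = sum(observed) / len(observed)
    return imputed


def mask_and_score(truth, fraction, seed=0):
    """Hide a fraction of known values, impute them, and score the result.

    Returns (share of hidden sites recovered, mean absolute error on them).
    """
    rng = random.Random(seed)
    hidden = set(rng.sample(range(len(truth)), int(len(truth) * fraction)))
    masked = [None if i in hidden else v for i, v in enumerate(truth)]
    imputed = impute_from_neighbours(masked)
    errors = [abs(imputed[i] - truth[i]) for i in hidden
              if imputed[i] is not None]
    recovered = len(errors) / len(hidden) if hidden else 0.0
    mae = sum(errors) / len(errors) if errors else float("nan")
    return recovered, mae
```

Calling `mask_and_score(levels, 0.3)` on a list of per-CpG methylation levels between 0 and 1 hides 30% of the sites and reports how many were filled in and how far the imputed values fall from the hidden truth.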

    GeneValidator: identify problems with protein-coding gene predictions

    This work was supported by the Sciruby community, NESCent Google Summer of Code, the NESCent “Building non-model species genome curation communities” working group, Biotechnology and Biological Sciences Research Council [BB/K004204/1], Natural Environment Research Council [NE/L00626X/1, EOS Cloud] and QMUL Apocrita Midplus (EP/K000128/1)

    Comparison and imputation-aided integration of five commercial platforms for targeted DNA methylome analysis

    Targeted bisulfite sequencing (TBS) has become the method of choice for the cost-effective, targeted analysis of the human methylome at base-pair resolution. In this study, we benchmarked five commercially available TBS platforms, three hybridization capture-based (Agilent, Roche and Illumina) and two reduced-representation-based (Diagenode and NuGen), across 11 samples. Two samples were also compared with whole-genome DNA methylation sequencing with the Illumina and Oxford Nanopore platforms. We assessed workflow complexity, on/off-target performance, coverage, accuracy and reproducibility. Although all platforms produced robust and reproducible data, major differences in the number and identity of the CpG sites covered make it difficult to compare datasets generated on different platforms. To overcome this limitation, we applied imputation and show that it improves interoperability from an average of 10.35% (0.8 million) to 97% (7.6 million) common CpG sites. Our study provides guidance on which TBS platform to use for different methylome features and offers an imputation-based harmonization solution that allows comparative, integrative analysis.
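
As a rough illustration of the interoperability measure mentioned above, the sketch below counts CpG sites shared by every platform. The function name `common_site_fraction` is hypothetical, and the choice of denominator (the union of all covered sites) is an assumption; the abstract does not state how the percentage was computed.

```python
def common_site_fraction(platform_sites):
    """Count CpG sites covered by every platform.

    platform_sites maps a platform name to an iterable of (chromosome, position)
    tuples. Returns (number of common sites, common sites / union of sites).
    The union denominator is an assumption made for this sketch.
    """
    site_sets = [set(sites) for sites in platform_sites.values()]
    if not site_sets:
        raise ValueError("at least one platform is required")
    common = set.intersection(*site_sets)
    union = set.union(*site_sets)
    fraction = len(common) / len(union) if union else 0.0
    return len(common), fraction
```

Running the same function before and after imputation would show how many additional sites become comparable across platforms.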

    The Personal Genome Project-UK, an open access resource of human multi-omics data

    Integrative analysis of multi-omics data is a powerful approach for gaining functional insights into biological and medical processes. Conducting these multifaceted analyses on human samples is often complicated by the fact that the raw sequencing output is rarely available under open access. The Personal Genome Project UK (PGP-UK) is one of few resources that recruits its participants under open consent and makes the resulting multi-omics data freely and openly available. As part of this resource, we describe the PGP-UK multi-omics reference panel consisting of ten genomic, methylomic and transcriptomic data. Specifically, we outline the data processing, quality control and validation procedures which were implemented to ensure data integrity and exclude sample mix-ups. In addition, we provide a REST API to facilitate the download of the entire PGP-UK dataset. The data are also available from two cloud-based environments, providing platforms for free integrated analysis. In conclusion, the genotype-validated PGP-UK multi-omics human reference panel described here provides a valuable new open access resource for integrated analyses in support of personal and medical genomics

    SARS-CoV-2 3D database: Understanding the Coronavirus Proteome and Evaluating Possible Drug Targets.

    The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a rapidly growing infectious disease, widely spread with high mortality rates. Since the release of the SARS-CoV-2 genome sequence in March 2020, there has been an international focus on developing target-based drug discovery, which also requires knowledge of the 3D structure of the proteome. Where there are no experimentally solved structures, our group has created 3D models with coverage of 97.5% and characterised them using state-of-the-art computational approaches. Models of protomers and oligomers, together with predictions of substrate and allosteric binding sites, protein-ligand docking, SARS-CoV-2 protein interactions with human proteins, impacts of mutations, and mapped solved experimental structures are freely available for download. These are implemented in SARS CoV-2 3D, a comprehensive and user-friendly database, available at https://sars3d.com/. This provides essential information for drug discovery, both to evaluate targets and design new potential therapeutics. This work is supported and funded by King Abdullah scholarship (Saudi Arabia research council), and American Leprosy Missions grants (G88726), SET is funded by the Cystic Fibrosis Trust (RG 70975) and Fondation Botnar (RG91317). A.R.J. is funded by the Biotechnology and Biological Sciences Research Council (BBSRC) DTP studentship (BB/M011194/1). B.B. is funded by the Cystic Fibrosis Trust and L.C. on a studentship from Ipsen. T.L.B. is funded by a Wellcome Trust Investigator Award, PHZJ/489 RG83114 (2016-2021).

    Donor whole blood DNA methylation is not a strong predictor of acute graft versus host disease in unrelated donor allogeneic haematopoietic cell transplantation

    Allogeneic hematopoietic cell transplantation (HCT) is used to treat many blood-based disorders and malignancies. While this is an effective treatment, it can result in serious adverse events, such as the development of acute graft-versus-host disease (aGVHD). This study aimed to develop a donor-specific epigenetic classifier that could be used in donor selection in HCT to reduce the incidence of aGVHD. The discovery cohort of the study consisted of 288 donors from a population receiving HLA-A, -B, -C and -DRB1 matched unrelated donor HCT with T cell replete peripheral blood stem cell grafts for treatment of acute leukaemia or myelodysplastic syndromes after myeloablative conditioning. Donors were selected based on recipient aGVHD outcome; this cohort consisted of 144 cases with aGVHD grades III-IV and 144 controls with no aGVHD that survived at least 100 days post-HCT matched for sex, age, disease and GVHD prophylaxis. Genome-wide DNA methylation was assessed using the Infinium Methylation EPIC BeadChip (Illumina), measuring CpG methylation at >850,000 sites across the genome. Following quality control, pre-processing and exploratory analyses, we applied a machine learning algorithm (Random Forest) to identify CpG sites predictive of aGVHD. Receiver operating characteristic (ROC) curve analysis of these sites resulted in a classifier with an encouraging area under the ROC curve (AUC) of 0.91. To test this classifier, we used an independent validation cohort (n=288) selected using the same criteria as the discovery cohort. Different attempts to validate the classifier using the independent validation cohort failed with the AUC falling to 0.51. These results indicate that donor DNA methylation may not be a suitable predictor of aGVHD in an HCT setting involving unrelated donors, despite the initial promising results in the discovery cohort. 
Our work highlights the importance of independent validation of machine learning classifiers, particularly when developing classifiers intended for clinical use
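
To make the AUC figures above concrete, here is a minimal sketch of how an area under the ROC curve can be computed from labels and classifier scores. It uses the rank-based (Mann-Whitney) formulation rather than the authors' pipeline, and the function name `roc_auc` is hypothetical. A value of 1.0 means every case scores above every control, while 0.5 is what random scores would give, which is why a validation AUC of 0.51 indicates no predictive value.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    labels: 1 for cases, 0 for controls. scores: higher means more case-like.
    Equals the probability that a random case outscores a random control,
    with ties counted as half.
    """
    cases = [s for label, s in zip(labels, scores) if label == 1]
    controls = [s for label, s in zip(labels, scores) if label == 0]
    if not cases or not controls:
        raise ValueError("both classes must be present")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))
```

Evaluating this on the data used to select the predictive CpG sites tends to give optimistic values; only scores from an independent cohort give a fair estimate.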

    SynthEye: Investigating the Impact of Synthetic Data on Artificial Intelligence-assisted Gene Diagnosis of Inherited Retinal Disease

    PURPOSE: Rare disease diagnosis is challenging in medical image-based artificial intelligence due to a natural class imbalance in datasets, leading to biased prediction models. Inherited retinal diseases (IRDs) are a research domain that particularly faces this issue. This study investigates the applicability of synthetic data in improving artificial intelligence-enabled diagnosis of IRDs using generative adversarial networks (GANs). DESIGN: Diagnostic study of gene-labeled fundus autofluorescence (FAF) IRD images using deep learning. PARTICIPANTS: Moorfields Eye Hospital (MEH) dataset of 15 692 FAF images obtained from 1800 patients with confirmed genetic diagnosis of 1 of 36 IRD genes. METHODS: A StyleGAN2 model is trained on the IRD dataset to generate 512 × 512 resolution images. Convolutional neural networks are trained for classification using different synthetically augmented datasets, including real IRD images plus 1800 and 3600 synthetic images, and a fully rebalanced dataset. We also perform an experiment with only synthetic data. All models are compared against a baseline convolutional neural network trained only on real data. MAIN OUTCOME MEASURES: We evaluated synthetic data quality using a Visual Turing Test conducted with 4 ophthalmologists from MEH. Synthetic and real images were compared using feature space visualization, similarity analysis to detect memorized images, and Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) score for no-reference-based quality evaluation. Convolutional neural network diagnostic performance was determined on a held-out test set using the area under the receiver operating characteristic curve (AUROC) and Cohen's Kappa (κ). RESULTS: An average true recognition rate of 63% and fake recognition rate of 47% were obtained from the Visual Turing Test. Thus, a considerable proportion of the synthetic images were classified as real by clinical experts.
Similarity analysis showed that the synthetic images were not copies of the real images, meaning the GAN was able to generalize. However, BRISQUE score analysis indicated that synthetic images were of significantly lower quality overall than real images (P < 0.05). Comparing the rebalanced model (RB) with the baseline (R), no significant change in the average AUROC and κ was found (R-AUROC = 0.86[0.85-88], RB-AUROC = 0.88[0.86-0.89], R-k = 0.51[0.49-0.53], and RB-k = 0.52[0.50-0.54]). The synthetic data trained model (S) achieved similar performance as the baseline (S-AUROC = 0.86[0.85-87], S-k = 0.48[0.46-0.50]). CONCLUSIONS: Synthetic generation of realistic IRD FAF images is feasible. Synthetic data augmentation does not deliver improvements in classification performance. However, synthetic data alone deliver a similar performance as real data, and hence may be useful as a proxy to real data. Financial Disclosure(s): Proprietary or commercial disclosure may be found after the references.
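
Cohen's Kappa, one of the two outcome measures named above, corrects raw agreement for the agreement expected by chance. The sketch below shows the standard formula for a multi-class problem such as gene prediction; the function name `cohens_kappa` is hypothetical and the code is not taken from the study.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement).

    Expected agreement is computed from the marginal label frequencies.
    Returns NaN in the degenerate case where chance agreement is already 1.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return float("nan")
    return (observed - expected) / (1.0 - expected)
```

Because the chance term depends on class frequencies, kappa is less flattering than plain accuracy on imbalanced datasets, which is the setting this abstract describes.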

    Can artificial intelligence accelerate the diagnosis of inherited retinal diseases? Protocol for a data-only retrospective cohort study (Eye2Gene)

    INTRODUCTION: Inherited retinal diseases (IRD) are a leading cause of visual impairment and blindness in the working age population. Mutations in over 300 genes have been found to be associated with IRDs and identifying the affected gene in patients by molecular genetic testing is the first step towards effective care and patient management. However, genetic diagnosis is currently slow, expensive and not widely accessible. The aim of the current project is to address the evidence gap in IRD diagnosis with an AI algorithm, Eye2Gene, to accelerate and democratise the IRD diagnosis service. METHODS AND ANALYSIS: The data-only retrospective cohort study involves a target sample size of 10 000 participants, which has been derived based on the number of participants with IRD at three leading UK eye hospitals: Moorfields Eye Hospital (MEH), Oxford University Hospital (OUH) and Liverpool University Hospital (LUH), as well as a Japanese hospital, the Tokyo Medical Centre (TMC). Eye2Gene aims to predict causative genes from retinal images of patients with a diagnosis of IRD. For this purpose, 36 most common causative IRD genes have been selected to develop a training dataset for the software to have enough examples for training and validation for detection of each gene. The Eye2Gene algorithm is composed of multiple deep convolutional neural networks, which will be trained on MEH IRD datasets, and externally validated on OUH, LUH and TMC. ETHICS AND DISSEMINATION: This research was approved by the IRB and the UK Health Research Authority (Research Ethics Committee reference 22/WA/0049) 'Eye2Gene: accelerating the diagnosis of IRDs' Integrated Research Application System (IRAS) project ID: 242050. All research adhered to the tenets of the Declaration of Helsinki. Findings will be reported in an open-access format

    Sequenceserver: A Modern Graphical User Interface for Custom BLAST Databases

    Comparing newly obtained and previously known nucleotide and amino-acid sequences underpins modern biological research. BLAST is a well-established tool for such comparisons but is challenging to use on new data sets. We combined a user-centric design philosophy with sustainable software development approaches to create Sequenceserver, a tool for running BLAST and visually inspecting BLAST results for biological interpretation. Sequenceserver uses simple algorithms to prevent potential analysis errors and provides flexible text-based and visual outputs to support researcher productivity. Our software can be rapidly installed for use by individuals or on shared servers

    Whole Exome Sequencing Reveals Novel and Recurrent Disease-Causing Variants in Lens Specific Gap Junctional Protein Encoding Genes Causing Congenital Cataract

    Pediatric cataract is clinically and genetically heterogeneous, and is the most common cause of childhood blindness worldwide. In this study, we aimed to identify disease-causing variants in three large British families and one isolated case with autosomal dominant congenital cataract, using whole exome sequencing. We identified four different heterozygous variants, three in the large families and one in the isolated case. Family A, with a novel missense variant (c.178G>C, p.Gly60Arg) in GJA8 with lamellar cataract; family B, with a recurrent variant in GJA8 (c.262C>T, p.Pro88Ser) associated with nuclear cataract; and family C, with a novel variant in GJA3 (c.771dupC, p.Ser258GlnfsTer68) causing a lamellar phenotype. Individual D had a novel variant in GJA3 (c.82G>T, p.Val28Leu) associated with congenital cataract. Each sequence variant was found to cosegregate with disease. Here, we report three novel and one recurrent disease-causing sequence variant in the gap junctional protein encoding genes causing autosomal dominant congenital cataract. Our study further extends the mutation spectrum of these genes and further facilitates clinical diagnosis. A recurrent p.P88S variant in GJA8 causing isolated nuclear cataract provides evidence of further phenotypic heterogeneity associated with this variant