20 research outputs found

Imputation Aided Methylation Analysis

Author: Moghul Muhammad Ismail
Publication venue: UCL (University College London)
Publication date: 28/08/2021

Genome-wide DNA methylation analysis is of broad interest to medical research because of its central role in human development and disease. However, generating high-quality methylomes on a large scale is particularly expensive due to technical issues inherent to DNA treatment with bisulfite, requiring deeper than usual sequencing. In silico methodologies, such as imputation, can be used to address this limitation and improve the coverage and quality of data produced in these experiments. Imputation is a statistical technique where missing values are substituted with computed values. The process involves leveraging information from reference data to calculate probable values for missing data points. In this thesis, imputation is explored for its potential to increase the value of methylation datasets sequenced at different depths: 1. First, a new R package, Methylation Analysis ToolkiT (MATT), was developed to deal with large numbers of WGBS datasets in a computationally- and memory-efficient manner. 2. Second, the performance of DNA methylation-specific and generic imputation tools was assessed by down-sampling high-quality (100x) WGBS datasets to determine the extent to which missing data can be recovered and the accuracy of imputed values. 3. Third, to overcome shortfalls within existing tools, a novel imputation tool was developed, termed Global IMputation of cpg MEthylation (GIMMEcpg). GIMMEcpg's default implementation is based on Model Stacking and outperforms existing tools in accuracy and speed. 4. Lastly, to demonstrate its potential, GIMMEcpg was used to impute ten shallow (17x) WGBS datasets from healthy volunteers of the Personal Genome Project UK with high accuracy.
Moreover, the extent of missing and low-quality data, as well as the reproducibility and accuracy of methylation datasets, were explored for different data types (Microarrays, Reduced Representation Bisulfite Sequencing (RRBS), Whole Genome Bisulfite Sequencing (WGBS), EM-Seq and Nanopore sequencing)
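
The general idea of imputation described in this abstract can be illustrated with a minimal sketch. This is not the method used by MATT or GIMMEcpg (which the abstract describes as based on Model Stacking); it is a deliberately simple, hypothetical stand-in that fills each missing methylation value with the mean of the nearest observed values on either side. The function name and the example values are assumptions made for illustration only.

```python
from typing import List, Optional


def impute_from_neighbours(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each missing value (None) with the mean of the nearest observed
    values to its left and right. At the edges, where only one neighbour
    exists, that neighbour's value is used. Hypothetical illustration only."""
    observed = [i for i, v in enumerate(values) if v is not None]
    result = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue  # observed values are kept unchanged
        left = max((j for j in observed if j < i), default=None)
        right = min((j for j in observed if j > i), default=None)
        neighbours = [values[j] for j in (left, right) if j is not None]
        if neighbours:  # stays None if nothing at all was observed
            result[i] = sum(neighbours) / len(neighbours)
    return result


# Made-up methylation fractions at consecutive CpG sites; None marks a gap.
methylation = [0.9, None, 0.7, None, None, 0.1]
print(impute_from_neighbours(methylation))
```

Real tools draw on much richer information (reference datasets, genomic distance, multiple samples); the sketch only shows what "substituting missing values with computed values" means in practice.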

UCL Discovery

GeneValidator: identify problems with protein-coding gene predictions

Author: Alioto
Anurag Priyam
Claudio Bustos
Hou
Ismail Moghul
Monica-Andreea Drăgan
Nygaard
Pray
Wurm
Yannick Wurm
Publication venue: 'Oxford University Press (OUP)'
Publication date: 18/01/2016

This work was supported by the Sciruby community, NESCent Google Summer of Code, the NESCent “Building non-model species genome curation communities” working group, Biotechnology and Biological Sciences Research Council [BB/K004204/1], Natural Environment Research Council [NE/L00626X/1, EOS Cloud] and QMUL Apocrita Midplus (EP/K000128/1)

Repository for Publications and Research Data

Crossref

PubMed Central

Queen Mary Research Online

Comparison and imputation-aided integration of five commercial platforms for targeted DNA methylome analysis

Author: Ambrose John
Barrett James
Beck Stephan
Dhami Pawan
Feber Andrew
Moghul Ismail
Rodney Simon
Tanić Miljana
Vaikkinen Heli
Publication venue: NATURE PORTFOLIO
Publication date: 01/10/2022

Targeted bisulfite sequencing (TBS) has become the method of choice for the cost-effective, targeted analysis of the human methylome at base-pair resolution. In this study, we benchmarked five commercially available TBS platforms, three hybridization capture-based (Agilent, Roche and Illumina) and two reduced-representation-based (Diagenode and NuGen), across 11 samples. Two samples were also compared with whole-genome DNA methylation sequencing with the Illumina and Oxford Nanopore platforms. We assessed workflow complexity, on/off-target performance, coverage, accuracy and reproducibility. Although all platforms produced robust and reproducible data, major differences in the number and identity of the CpG sites covered make it difficult to compare datasets generated on different platforms. To overcome this limitation, we applied imputation and show that it improves interoperability from an average of 10.35% (0.8 million) to 97% (7.6 million) common CpG sites. Our study provides guidance on which TBS platform to use for different methylome features and offers an imputation-based harmonization solution that allows comparative, integrative analysis

UCL Discovery

The Personal Genome Project-UK, an open access resource of human multi-omics data

Author: Beck Stephan
Berner Alison
Chervova Olga
Conde Lucia
Guerra-Assunção José Afonso
Hamoudi Rifat
Herrero Javier
Jesus Tiago F.
Larose Cadieux Elizabeth
Moghul Ismail
Tian Yuan
Voloshin Vitaly
Webster Amy P.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 31/10/2019

Integrative analysis of multi-omics data is a powerful approach for gaining functional insights into biological and medical processes. Conducting these multifaceted analyses on human samples is often complicated by the fact that the raw sequencing output is rarely available under open access. The Personal Genome Project UK (PGP-UK) is one of few resources that recruits its participants under open consent and makes the resulting multi-omics data freely and openly available. As part of this resource, we describe the PGP-UK multi-omics reference panel consisting of ten genomic, methylomic and transcriptomic data. Specifically, we outline the data processing, quality control and validation procedures which were implemented to ensure data integrity and exclude sample mix-ups. In addition, we provide a REST API to facilitate the download of the entire PGP-UK dataset. The data are also available from two cloud-based environments, providing platforms for free integrated analysis. In conclusion, the genotype-validated PGP-UK multi-omics human reference panel described here provides a valuable new open access resource for integrated analyses in support of personal and medical genomics

UCL Discovery

Warwick Research Archives Portal Repository

SARS-CoV-2 3D database: Understanding the Coronavirus Proteome and Evaluating Possible Drug Targets.

Author: Alsulami Ali
Bannerman Bridget
Beaudoin Christopher
Blundell Tom
Copoiu Liviu
Jamasb Arian
Moghul Ismail
Thomas Sherine
Torres Pedro
Vedithi Sundeep
Publication venue: Briefings in Bioinformatics
Publication date: 22/03/2021

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a rapidly growing infectious disease, widely spread with high mortality rates. Since the release of the SARS-CoV-2 genome sequence in March 2020, there has been an international focus on developing target-based drug discovery, which also requires knowledge of the 3D structure of the proteome. Where there are no experimentally solved structures, our group has created 3D models with coverage of 97.5% and characterised them using state-of-the-art computational approaches. Models of protomers and oligomers, together with predictions of substrate and allosteric binding sites, protein-ligand docking, SARS-CoV-2 protein interactions with human proteins, impacts of mutations, and mapped solved experimental structures are freely available for download. These are implemented in SARS CoV-2 3D, a comprehensive and user-friendly database, available at https://sars3d.com/. This provides essential information for drug discovery, both to evaluate targets and design new potential therapeutics. This work is supported and funded by King Abdullah scholarship (Saudi Arabia research council), and American Leprosy Missions grants (G88726), SET is funded by the Cystic Fibrosis Trust (RG 70975) and Fondation Botnar (RG91317). A.R.J is funded by the Biotechnology and Biological Sciences Research Council (BBSRC) DTP studentship (BB/M011194/1). B.B. is funded by the Cystic Fibrosis Trust and L.C. on a studentship from Ipsen. T.L.B. is funded by a Wellcome Trust Investigator Award, PHZJ/489 RG83114 (2016-2021)

UCL Discovery

Apollo (Cambridge)

Donor whole blood DNA methylation is not a strong predictor of acute graft versus host disease in unrelated donor allogeneic haematopoietic cell transplantation

Author: Beck Stephan
Dhami Pawan
Ecker Simone
Feber Andrew
Kuxhausen Michelle
Lee Stephanie J
Marzi Sarah
Moghul Ismail
Paul Dirk S
Peggs Karl S
Rakyan Vardhman
Spellman Stephen R
Wang Tao
Webster Amy P
Publication venue: 'Cold Spring Harbor Laboratory'
Publication date: 10/03/2022

Allogeneic hematopoietic cell transplantation (HCT) is used to treat many blood-based disorders and malignancies. While this is an effective treatment, it can result in serious adverse events, such as the development of acute graft-versus-host disease (aGVHD). This study aimed to develop a donor-specific epigenetic classifier that could be used in donor selection in HCT to reduce the incidence of aGVHD. The discovery cohort of the study consisted of 288 donors from a population receiving HLA-A, -B, -C and -DRB1 matched unrelated donor HCT with T cell replete peripheral blood stem cell grafts for treatment of acute leukaemia or myelodysplastic syndromes after myeloablative conditioning. Donors were selected based on recipient aGVHD outcome; this cohort consisted of 144 cases with aGVHD grades III-IV and 144 controls with no aGVHD that survived at least 100 days post-HCT matched for sex, age, disease and GVHD prophylaxis. Genome-wide DNA methylation was assessed using the Infinium Methylation EPIC BeadChip (Illumina), measuring CpG methylation at >850,000 sites across the genome. Following quality control, pre-processing and exploratory analyses, we applied a machine learning algorithm (Random Forest) to identify CpG sites predictive of aGVHD. Receiver operating characteristic (ROC) curve analysis of these sites resulted in a classifier with an encouraging area under the ROC curve (AUC) of 0.91. To test this classifier, we used an independent validation cohort (n=288) selected using the same criteria as the discovery cohort. Different attempts to validate the classifier using the independent validation cohort failed with the AUC falling to 0.51. These results indicate that donor DNA methylation may not be a suitable predictor of aGVHD in an HCT setting involving unrelated donors, despite the initial promising results in the discovery cohort. 
Our work highlights the importance of independent validation of machine learning classifiers, particularly when developing classifiers intended for clinical use
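
To make the AUC figures in this abstract concrete, here is a minimal sketch of how an area under the ROC curve can be computed from classifier scores. It uses the rank interpretation of AUC: the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as half. The function name, labels and scores are hypothetical and are not taken from the study, which used a Random Forest on EPIC array data.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of cases (label 1)
    against controls (label 0). Ties contribute 0.5. Illustrative only."""
    cases = [s for l, s in zip(labels, scores) if l == 1]
    controls = [s for l, s in zip(labels, scores) if l == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Made-up scores: three of the four case/control pairs are ranked correctly.
labels = [1, 1, 0, 0]
scores = [0.9, 0.2, 0.8, 0.1]
print(roc_auc(labels, scores))
```

An AUC near 0.5, as reported for the validation cohort, corresponds to cases and controls being ranked no better than chance, which is why evaluating on an independent cohort matters.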

UCL Discovery

SynthEye: Investigating the Impact of Synthetic Data on Artificial Intelligence-assisted Gene Diagnosis of Inherited Retinal Disease

Author: Balaskas Konstantinos
Beck Stephan
Cabral de Guimarães Thales Antonio
Daich Varela Malena
Keane Pearse A
Lazebnik Teddy
Liefers Bart
Mahroo Omar
Michaelides Michel
Moghul Ismail
Patel Praveen J
Pontikos Nikolas
Veturi Yoga Advaith
Wagner Siegfried K
Webster Andrew R
Woodward-Court Peter
Woof William
Publication venue: 'Elsevier BV'
Publication date: 01/06/2023

PURPOSE: Rare disease diagnosis is challenging in medical image-based artificial intelligence due to a natural class imbalance in datasets, leading to biased prediction models. Inherited retinal diseases (IRDs) are a research domain that particularly faces this issue. This study investigates the applicability of synthetic data in improving artificial intelligence-enabled diagnosis of IRDs using generative adversarial networks (GANs). DESIGN: Diagnostic study of gene-labeled fundus autofluorescence (FAF) IRD images using deep learning. PARTICIPANTS: Moorfields Eye Hospital (MEH) dataset of 15 692 FAF images obtained from 1800 patients with confirmed genetic diagnosis of 1 of 36 IRD genes. METHODS: A StyleGAN2 model is trained on the IRD dataset to generate 512 × 512 resolution images. Convolutional neural networks are trained for classification using different synthetically augmented datasets, including real IRD images plus 1800 and 3600 synthetic images, and a fully rebalanced dataset. We also perform an experiment with only synthetic data. All models are compared against a baseline convolutional neural network trained only on real data. MAIN OUTCOME MEASURES: We evaluated synthetic data quality using a Visual Turing Test conducted with 4 ophthalmologists from MEH. Synthetic and real images were compared using feature space visualization, similarity analysis to detect memorized images, and Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) score for no-reference-based quality evaluation. Convolutional neural network diagnostic performance was determined on a held-out test set using the area under the receiver operating characteristic curve (AUROC) and Cohen's Kappa (κ). RESULTS: An average true recognition rate of 63% and fake recognition rate of 47% was obtained from the Visual Turing Test. Thus, a considerable proportion of the synthetic images were classified as real by clinical experts. 
Similarity analysis showed that the synthetic images were not copies of the real images, meaning the GAN was able to generalize. However, BRISQUE score analysis indicated that synthetic images were of significantly lower quality overall than real images (P < 0.05). Comparing the rebalanced model (RB) with the baseline (R), no significant change in the average AUROC and κ was found (R-AUROC = 0.86[0.85-88], RB-AUROC = 0.88[0.86-0.89], R-k = 0.51[0.49-0.53], and RB-k = 0.52[0.50-0.54]). The synthetic data trained model (S) achieved similar performance as the baseline (S-AUROC = 0.86[0.85-87], S-k = 0.48[0.46-0.50]). CONCLUSIONS: Synthetic generation of realistic IRD FAF images is feasible. Synthetic data augmentation does not deliver improvements in classification performance. However, synthetic data alone deliver a similar performance as real data, and hence may be useful as a proxy to real data. Financial Disclosure(s): Proprietary or commercial disclosure may be found after the references

UCL Discovery

Can artificial intelligence accelerate the diagnosis of inherited retinal diseases? Protocol for a data-only retrospective cohort study (Eye2Gene)

INTRODUCTION: Inherited retinal diseases (IRD) are a leading cause of visual impairment and blindness in the working age population. Mutations in over 300 genes have been found to be associated with IRDs and identifying the affected gene in patients by molecular genetic testing is the first step towards effective care and patient management. However, genetic diagnosis is currently slow, expensive and not widely accessible. The aim of the current project is to address the evidence gap in IRD diagnosis with an AI algorithm, Eye2Gene, to accelerate and democratise the IRD diagnosis service. METHODS AND ANALYSIS: The data-only retrospective cohort study involves a target sample size of 10 000 participants, which has been derived based on the number of participants with IRD at three leading UK eye hospitals: Moorfields Eye Hospital (MEH), Oxford University Hospital (OUH) and Liverpool University Hospital (LUH), as well as a Japanese hospital, the Tokyo Medical Centre (TMC). Eye2Gene aims to predict causative genes from retinal images of patients with a diagnosis of IRD. For this purpose, the 36 most common causative IRD genes have been selected to develop a training dataset, so that the software has enough examples of each gene for training and validation. The Eye2Gene algorithm is composed of multiple deep convolutional neural networks, which will be trained on MEH IRD datasets, and externally validated on OUH, LUH and TMC. ETHICS AND DISSEMINATION: This research was approved by the IRB and the UK Health Research Authority (Research Ethics Committee reference 22/WA/0049) 'Eye2Gene: accelerating the diagnosis of IRDs' Integrated Research Application System (IRAS) project ID: 242050. All research adhered to the tenets of the Declaration of Helsinki. Findings will be reported in an open-access format

UCL Discovery

Sequenceserver: A Modern Graphical User Interface for Custom BLAST Databases

Author: Alekhya Munagala
Altschul
Anurag Priyam
Austin Davis-Richardson
Ben J Woodcroft
Blanchoud
Buchfink
Camacho
Cui
Emeline Favreau
Esteban A Gómez
Filip Ter
Garrett
Guy Leonard
Hiroyuki Nakamura
Hiten Chowdhary
HongKee Moon
Ismail Moghul
Iwo Pieniak
Lawrence J Maynard
Liew
Mahmut Uludag
Mark Anthony Gibbins
McCormick
Nathan S Watson-Haigh
Pracana
Reese
Reichler
Richard Challis
Seim
Shen
Tomás Pluskal
Vivek Rai
Winnenburg
Wintersinger
Wolfgang Rumpf
Yannick Wurm
Publication venue: 'Oxford University Press (OUP)'
Publication date: 14/08/2019

Comparing newly obtained and previously known nucleotide and amino-acid sequences underpins modern biological research. BLAST is a well-established tool for such comparisons but is challenging to use on new data sets. We combined a user-centric design philosophy with sustainable software development approaches to create Sequenceserver, a tool for running BLAST and visually inspecting BLAST results for biological interpretation. Sequenceserver uses simple algorithms to prevent potential analysis errors and provides flexible text-based and visual outputs to support researcher productivity. Our software can be rapidly installed for use by individuals or on shared servers

Crossref

Queensland University of Technology ePrints Archive

UCL Discovery

Edinburgh Research Explorer

Queen Mary Research Online

MPG.PuRe

University of Queensland eSpace

Whole Exome Sequencing Reveals Novel and Recurrent Disease-Causing Variants in Lens Specific Gap Junctional Protein Encoding Genes Causing Congenital Cataract

Author: Berry Vanita
Ionides Alex
Michaelides Michel
Moghul Ismail
Moore Anthony T.
Pontikos Nikolas
Quinlan Roy A.
Publication venue: MDPI
Publication date: 01/05/2020

Pediatric cataract is clinically and genetically heterogeneous, and is the most common cause of childhood blindness worldwide. In this study, we aimed to identify disease-causing variants in three large British families and one isolated case with autosomal dominant congenital cataract, using whole exome sequencing. We identified four different heterozygous variants, three in the large families and one in the isolated case. Family A, with a novel missense variant (c.178G>C, p.Gly60Arg) in GJA8 with lamellar cataract; family B, with a recurrent variant in GJA8 (c.262C>T, p.Pro88Ser) associated with nuclear cataract; and family C, with a novel variant in GJA3 (c.771dupC, p.Ser258GlnfsTer68) causing a lamellar phenotype. Individual D had a novel variant in GJA3 (c.82G>T, p.Val28Leu) associated with congenital cataract. Each sequence variant was found to cosegregate with disease. Here, we report three novel and one recurrent disease-causing sequence variant in the gap junctional protein encoding genes causing autosomal dominant congenital cataract. Our study further extends the mutation spectrum of these genes and further facilitates clinical diagnosis. A recurrent p.P88S variant in GJA8 causing isolated nuclear cataract provides evidence of further phenotypic heterogeneity associated with this variant

Durham Research Online

UCL Discovery

eScholarship - University of California