17 research outputs found

    Data Study Group Final Report: Roche

    Data Study Groups are week-long events at The Alan Turing Institute bringing together some of the country’s top talent from data science, artificial intelligence, and wider fields, to analyse real-world data science challenges. Roche: Personalised lung cancer treatment modelling using electronic health records and genomics. Cancer immunotherapy (CIT) is a promising new type of cancer treatment that uses the patient’s own immune system to fight cancer cells. CIT drugs work to stop the cancer cells from turning off the immune system’s T-cells by inhibiting the PD-L1 produced by the tumour cells (PD-L1 is a protein that binds to PD-1 receptors on T-cells and prevents the immune system from attacking the cancer cells). CIT is currently being used to treat patients with non-small cell lung cancer (NSCLC) for whom chemotherapy or other drugs have failed. CIT is also being used as part of the first-line treatment in patients with advanced NSCLC (aNSCLC, stage III and higher). Theoretically, patients with high PD-L1 expression levels are more likely to respond well to CIT; however, in practice, patient outcomes vary considerably. In this data study group, we investigated different approaches for predicting survival time for patients treated with CIT as the first line of treatment, using both electronic health records and tumour genomic data. We also investigated the causal effects of CIT versus other oncology treatments, and studied treatment heterogeneity. The results contribute to identifying patients who are most likely to benefit from CIT.

    Advanced modelling of genomic data in Inflammatory Bowel Disease

    Advances in next generation sequencing technologies allow the collection of enormous volumes of genomic data on large patient cohorts. Concurrently, machine learning algorithms are rapidly evolving and, together, these technologies represent the new frontier of research and clinical management on a path leading toward personalised medicine. This thesis has two aims. Firstly, to develop a mathematical framework for the analysis and integration of next generation sequencing data. Secondly, to model data from patients affected by inflammatory bowel disease (IBD), a common complex autoimmune condition with increasing incidence worldwide, by applying machine learning methodologies to clinical and transformed genomic data. The analyses presented in this thesis are largely based on a cohort of paediatric IBD patients for which clinical data, immunology and whole exome sequencing data were available. This research illustrates a supervised and unsupervised machine learning approach modelling histology and endoscopy data for assigning IBD patients to the correct CD/UC subtypes with superior accuracy. Stratification and classification of IBD patients can be improved by layering genomic data on top of clinical evidence. This thesis also describes the development of GenePy, a mathematical model for transforming patients' genomic data into a per-individual per-gene deleteriousness scoring system. GenePy is capable of modelling and implementing important biological information from whole exome sequencing data from patient DNA. GenePy eases the analysis and interpretation of genomic data on an individual basis and concomitantly allows the comparison of genetic profiles across patients. GenePy gene scores can be further combined according to molecular processes or pathways. This work describes eight novel immuno-genomic IBD subtypes observed in a small cohort for which immune cytokine signalling and response cascades have been specifically profiled and GenePy scores obtained. In addition, the GenePy algorithm is applied using both supervised and unsupervised approaches to classify IBD subtypes and to explore alternative disease classifications that discriminate molecular subtypes that are clinically relevant for treatment and prognosis. This thesis reports the current highest performance in discriminating IBD subtypes using exome sequencing data and five novel genomic patient strata defined by different mutational burden of adaptive immune system genes. This work demonstrates the power of integrating 21st century high throughput digital data in machine learning frameworks and the potential to obtain clinically relevant strata for bench to bedside improvements in patient quality of life.

    GenePy - a score for estimating gene pathogenicity in individuals using next-generation sequencing data

    Background: Next-generation sequencing is revolutionising diagnosis and treatment of rare diseases; however, its application to understanding common disease aetiology is limited. Rare disease applications binarily attribute genetic change(s) at a single locus to a specific phenotype. In common diseases, where multiple genetic variants within and across genes contribute to disease, binary modelling cannot capture the burden of pathogenicity harboured by an individual across a given gene/pathway. We present GenePy, a novel gene-level scoring system for integration and analysis of next-generation sequencing data on a per-individual basis that transforms NGS data interpretation from variant-level to gene-level. This simple and flexible scoring system is intuitive and amenable to integration for machine learning, network and topological approaches, facilitating the investigation of complex phenotypes. Results: Whole-exome sequencing data from 508 individuals were used to generate GenePy scores. For each variant a score is calculated incorporating: i) population allele frequency estimates; ii) individual zygosity, determined through standard variant calling pipelines; and iii) any user-defined deleteriousness metric to inform on functional impact. GenePy then combines scores generated for all variants observed into a single gene score for each individual. We generated a matrix of ~14,000 GenePy scores for all individuals for each of sixteen popular deleteriousness metrics. All per-gene scores are corrected for gene length. The majority of genes generate GenePy scores < 0.01, although individuals harbouring multiple rare highly deleterious mutations can accumulate extremely high GenePy scores. In the absence of a comparator metric, we examine GenePy performance in discriminating genes known to be associated with three common, complex diseases. A Mann-Whitney U test conducted on GenePy scores for this positive control gene in cases versus controls demonstrates markedly more significant results (p = 1.37 × 10⁻⁴) compared to the most commonly applied association tool that combines common and rare variation (p = 0.003). Conclusions: Per-gene per-individual GenePy scores are intuitive when assessing genetic variation in individual patients or comparing scores between groups. GenePy outperforms the currently accepted best practice tools for combining common and rare variation. GenePy scores are suitable for downstream data integration with transcriptomic and proteomic data that also report at the gene level.
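The scoring steps listed in this abstract (allele frequency, zygosity, a deleteriousness metric, summation per gene, and gene-length correction) can be illustrated with a minimal sketch. The abstract does not give the formula, so the weighting used here (deleteriousness multiplied by the negative log10 of the product of the two allele frequencies an individual carries) and the per-kilobase length correction are assumptions made for illustration only. The names `Variant`, `variant_score`, `gene_score` and `scale_bp` are hypothetical and are not taken from the GenePy software.

```python
import math
from dataclasses import dataclass


@dataclass
class Variant:
    alt_freq: float         # population frequency of the alternate allele, 0 < f < 1
    alt_copies: int         # zygosity: 1 = heterozygous, 2 = homozygous alternate
    deleteriousness: float  # user-chosen metric, assumed scaled to [0, 1]


def variant_score(v: Variant) -> float:
    """Assumed per-variant score: rarer and more deleterious means higher."""
    if v.alt_copies not in (1, 2):
        raise ValueError("alt_copies must be 1 or 2")
    if not 0.0 < v.alt_freq < 1.0:
        raise ValueError("alt_freq must lie strictly between 0 and 1")
    ref_freq = 1.0 - v.alt_freq
    # Frequencies of the two alleles this individual carries at the locus.
    first = v.alt_freq
    second = v.alt_freq if v.alt_copies == 2 else ref_freq
    return -v.deleteriousness * math.log10(first * second)


def gene_score(variants, gene_length_bp: int, scale_bp: int = 1000) -> float:
    """Sum variant scores for one gene in one individual, then correct
    for gene length (assumed here to be a simple per-kilobase scaling)."""
    raw = sum(variant_score(v) for v in variants)
    return raw * scale_bp / gene_length_bp


# Hypothetical individual carrying two variants in a 2 kb gene.
carried = [Variant(0.01, 2, 1.0), Variant(0.30, 1, 0.2)]
print(gene_score(carried, gene_length_bp=2000))
```

Under these assumptions a homozygous rare, highly deleterious variant dominates the gene score, which matches the abstract's remark that individuals with multiple rare deleterious mutations accumulate very high scores.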

    A systematic review of artificial intelligence and machine learning applications to inflammatory bowel disease, with practical guidelines for interpretation

    Background: Inflammatory bowel disease (IBD) is a gastrointestinal chronic disease with an unpredictable disease course. Computational methods such as machine learning (ML) have the potential to stratify IBD patients for the provision of individualised care. The use of ML methods for IBD was surveyed, with an additional focus on how the field has changed over time. Methods: A systematic review was conducted through a search of MEDLINE and Embase databases, with the search structure (“machine learning” OR “artificial intelligence”) AND (“Crohn* Disease” OR “Ulcerative Colitis” OR “Inflammatory Bowel Disease”), searched 6th May 2021. Exclusion criteria: studies not written in English, no human patient data, publication before 2001, studies that were not peer reviewed, non-autoimmune disease comorbidity research and record types that were not primary research. Results: 78 (of 409) records met the inclusion criteria. Random forest methods were most prevalent, and there was an increase in neural networks, mainly applied to imaging datasets. The main applications of ML to clinical tasks were diagnosis (18/78), disease course (22/78) and disease severity (16/78). The median sample size was 263. Clinical and microbiome-related datasets were most popular. 5% of studies used an external dataset after training and testing for additional model validation. Discussion: Availability of longitudinal and deep phenotyping data could lead to better modelling. ML pipelines that account for imbalanced data and perform feature selection only on training data will generate more generalisable models. ML models are increasingly being applied to more complex clinical tasks for specific phenotypes, indicating progress towards personalised medicine for IBD.

    Supervised machine learning classifies inflammatory bowel disease patients by subtype using whole exome sequencing data

    Background: inflammatory bowel disease (IBD) is a chronic inflammatory disorder with two main subtypes: Crohn's disease (CD) and Ulcerative Colitis (UC). Prompt subtype diagnosis enables the correct treatment to be administered. Using genomic data, we aimed to assess machine learning (ML) to classify patients according to IBD subtype. Methods: whole exome sequencing from paediatric/adult IBD patients was processed using an in-house bioinformatics pipeline. These data were condensed into the per-gene, per-individual genomic burden score, GenePy. Data were split into training and testing datasets (80/20). Feature selection with a linear support vector classifier, and hyperparameter tuning with Bayesian Optimisation, were performed on the training data. The supervised ML method random forest was utilised to classify patients as CD or UC using three panels: 1) all available genes, 2) autoimmune genes, 3) 'IBD' genes. ML results were assessed using AUROC, sensitivity and specificity on the testing dataset. Results: 906 patients were included in analysis (600 CD, 306 UC). Training data included 488 patients, balanced according to the minority class of UC. The autoimmune gene panel generated the best performing ML model (AUROC = 0.68), outperforming an IBD gene panel (AUROC = 0.61). NOD2 was the top gene for discriminating CD and UC, regardless of the gene panel used. Lack of variation in genes with high GenePy scores in CD patients was the best classifier of a diagnosis of UC. Discussion: we demonstrate promising classification of patients by subtype utilising random forest and WES data. Focussing on specific subgroups of patients, with larger datasets, may result in better classification.
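The three evaluation measures named in this abstract (AUROC, sensitivity and specificity) can be computed from predicted probabilities as in the sketch below. It is illustrative only: the labels and scores are invented toy values, coding CD as 1 and UC as 0 is an assumption, and the 0.5 decision threshold is a common default that the abstract does not state.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison: the fraction of
    (positive, negative) pairs in which the positive is ranked higher,
    counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy held-out set (hypothetical): 1 = CD, 0 = UC, scores = predicted P(CD).
test_labels = [1, 1, 1, 0, 0, 0]
test_scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]

print(auroc(test_labels, test_scores))
print(sensitivity_specificity(test_labels, test_scores))
```

AUROC is threshold-free, whereas sensitivity and specificity depend on the chosen cut-off, which is one reason studies such as this one report all three.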

    Data from: Commercial chicken breeds exhibit highly divergent patterns of linkage disequilibrium

    The analysis of linkage disequilibrium (LD) underpins the development of effective genotyping technologies, trait mapping and understanding of biological mechanisms such as those driving recombination and the impact of selection. We apply the Malécot-Morton model of LD to create additive LD maps which describe the high-resolution LD landscape of commercial chickens. We investigated LD in chickens (Gallus gallus) at the highest resolution to date for broiler, white egg and brown egg layer commercial lines. There is minimal concordance between breeds of fine scale LD patterns (correlation coefficient < 0.21), and even between discrete broiler lines. Regions of LD breakdown, which may align with recombination hotspots, are enriched near CpG islands and transcription start sites (p < 2.2 × 10⁻¹⁶), consistent with recent evidence described in finches, but concordance in hotspot locations between commercial breeds is only marginally greater than random. As in other birds, functional elements in the chicken genome are associated with recombination, but, unlike evidence from other bird species, the LD landscape is not stable in the populations studied. The development of optimal genotyping panels for genome-led selection programmes will depend on careful analysis of the LD structure of each line of interest. Further study is required to fully elucidate the mechanisms underlying highly divergent LD patterns found in commercial chickens. Genotypes from three chicken breeds: files contained in the archive Gallus_genotypes.zip hold genotype data from Pengelly et al., 2016, 'Commercial chicken breeds exhibit highly divergent patterns of linkage disequilibrium'. The three breed types are in separate files: WEL (white egg layers), BEL (brown egg layers) and BRO (broilers). Genotypes have been QCed, retaining SNPs with an allele frequency > 5%, HWE p > 0.001, and a 95% genotyping rate across the breed. Genotype data are provided in the PLINK format; see http://pngu.mgh.harvard.edu/~purcell/plink/index.shtml for further details, including file format specifications.
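The three QC thresholds quoted for this dataset can be expressed as a per-SNP filter, sketched below. Several points are assumptions rather than details from the dataset description: "allele frequency" is read as minor allele frequency, the Hardy-Weinberg check uses a 1-degree-of-freedom chi-square test (PLINK's own HWE test may differ), and genotypes are coded as 0, 1 or 2 copies of one allele with `None` for a missing call. The function names are hypothetical.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 d.f.) test of Hardy-Weinberg equilibrium from
    genotype counts. Returns the p-value."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(genotypes, min_maf=0.05, min_hwe_p=0.001, min_call_rate=0.95):
    """Apply call-rate, allele-frequency and HWE thresholds to one SNP."""
    called = [g for g in genotypes if g is not None]
    if not called or len(called) / len(genotypes) < min_call_rate:
        return False
    counts = [called.count(k) for k in (0, 1, 2)]
    freq = (counts[1] + 2 * counts[2]) / (2 * len(called))
    maf = min(freq, 1.0 - freq)
    if maf <= min_maf:
        return False
    return hwe_chi2_pvalue(*counts) > min_hwe_p


# Hypothetical SNP typed in 100 birds, in perfect Hardy-Weinberg proportions.
snp = [0] * 25 + [1] * 50 + [2] * 25
print(passes_qc(snp))
```

Checking the call rate first avoids estimating allele frequencies from SNPs with too many missing genotypes.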
