
    Lack of sufficiently strong informative features limits the potential of gene expression analysis as predictive tool for many clinical classification problems

    Background: Our goal was to examine how various aspects of a gene signature influence the success of developing multi-gene prediction models. We inserted gene signatures into three real data sets by altering the expression level of existing probe sets. We varied the number of probe sets perturbed (signature size), the fold increase of mean probe set expression in perturbed compared to unperturbed data (signature strength) and the number of samples perturbed. Prediction models were trained to identify which cases had been perturbed. Performance was estimated using Monte-Carlo cross validation. Results: Signature strength had the greatest influence on predictor performance. It was possible to develop almost perfect predictors with as few as 10 features if the fold difference in mean expression values was > 2, even when the spiked samples represented 10% of all samples. We also assessed the gene signature set size and strength for 9 real clinical prediction problems in six different breast cancer data sets. Conclusions: We found sufficiently large and strong predictive signatures only for distinguishing ER-positive from ER-negative cancers; there were no strong signatures for more subtle prediction problems. Current statistical methods efficiently identify highly informative features in gene expression data if such features exist, and accurate models can be built with as few as 10 highly informative features. Features can be considered highly informative if at least a 2-fold expression difference exists between comparison groups, but such features do not appear to be common for many clinically relevant prediction problems in human data sets.
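
    To make the spike-in design concrete, the sketch below simulates it end to end: a random signature of probe sets is multiplied by a fold change in a random subset of samples, and a classifier is then scored with Monte-Carlo cross validation (repeated random splits). The log-normal toy matrix, the logistic-regression model and all parameter values are illustrative assumptions, not the study's actual data or pipeline.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import ShuffleSplit, cross_val_score

        rng = np.random.default_rng(0)
        n_samples, n_genes = 200, 1000
        X = rng.lognormal(mean=2.0, sigma=0.5, size=(n_samples, n_genes))

        # Spike-in parameters: signature size, signature strength (fold change)
        # and the fraction of samples perturbed.
        signature_size, fold, frac_perturbed = 10, 2.0, 0.10
        perturbed = rng.choice(n_samples, size=int(frac_perturbed * n_samples), replace=False)
        signature = rng.choice(n_genes, size=signature_size, replace=False)
        X[np.ix_(perturbed, signature)] *= fold          # raise mean expression
        y = np.zeros(n_samples, dtype=int)
        y[perturbed] = 1                                 # task: detect spiked cases

        # Monte-Carlo cross validation: many random train/test splits.
        cv = ShuffleSplit(n_splits=50, test_size=0.3, random_state=0)
        scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                                 cv=cv, scoring="roc_auc")
        print(f"mean AUC over 50 random splits: {scores.mean():.3f}")

    With a 2-fold signature of 10 probe sets, such a toy setup typically reaches near-perfect AUC, mirroring the result reported above.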

    A Systems Biology-Based Classifier for Hepatocellular Carcinoma Diagnosis

    AIM: The diagnosis of hepatocellular carcinoma (HCC) at an early stage is crucial to the application of curative treatments, which are the only hope for increasing the life expectancy of patients. Recently, several large-scale studies have shed light on this problem through analysis of gene expression profiles to identify markers correlated with HCC progression. However, those marker sets shared few genes in common and were poorly validated using independent data. Therefore, we developed a systems biology-based classifier by combining differential gene expression with topological features of human protein interaction networks to enhance HCC diagnosis. METHODS AND RESULTS: In the Oncomine platform, genes differentially expressed in HCC tissues relative to their corresponding normal tissues were filtered by a corrected Q value cut-off and Concept filters. The identified genes that are common to different microarray datasets were chosen as candidate markers. Their networks were then analyzed by GeneGO MetaCore software and the hub genes were chosen. After that, an HCC diagnostic classifier was constructed by Partial Least Squares modeling based on the microarray gene expression data of the hub genes. Validations of diagnostic performance showed that this classifier had high predictive accuracy (85.88∼92.71%) and area under the ROC curve (approaching 1.0), and that the network topological features integrated into this classifier contributed greatly to improving the predictive performance. Furthermore, we demonstrated that this modeling strategy is applicable not only to HCC but also to other cancers. CONCLUSION: Our analysis suggests that a systems biology-based classifier combining differential gene expression and topological features of the human protein interaction network may enhance the diagnostic performance of an HCC classifier.
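
    The classification step can be illustrated with scikit-learn's Partial Least Squares implementation: fit PLS regression on a hub-gene expression matrix against binary tumor/normal labels and threshold the continuous output. The synthetic data, the number of components and the 0.5 cut-off are assumptions for illustration, not the authors' exact pipeline.

        import numpy as np
        from sklearn.cross_decomposition import PLSRegression
        from sklearn.metrics import accuracy_score, roc_auc_score
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(1)
        hub_expr = rng.normal(size=(120, 15))      # 120 samples x 15 hub genes
        y = rng.integers(0, 2, size=120)           # 1 = HCC, 0 = normal
        hub_expr[y == 1] += 0.8                    # inject a separable toy signal

        X_tr, X_te, y_tr, y_te = train_test_split(hub_expr, y, test_size=0.3,
                                                  random_state=1)
        pls = PLSRegression(n_components=3).fit(X_tr, y_tr)
        score = pls.predict(X_te).ravel()          # continuous PLS response
        print("accuracy:", accuracy_score(y_te, (score > 0.5).astype(int)))
        print("AUC:", roc_auc_score(y_te, score))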

    Gene Expression Profiling between Patient Groups with High and Low Ki67 Levels after Short-term Preoperative Aromatase Inhibitor Treatment for Breast Cancer

    According to a recent report, a low Ki67 level after short-term preoperative hormone therapy (post-Ki67) might suggest a more favorable prognosis compared with a high post-Ki67 level in patients with hormone receptor-positive/human epidermal growth factor receptor 2-negative (HR+/HER2−) breast cancer with high levels of Ki67. This study aimed to evaluate the pre-treatment genetic differences between these two patient groups. Forty-five luminal B-like patients were stratified into two groups, namely, a group with high (H→H) and one with low (H→L) Ki67 levels after short-term preoperative aromatase inhibitor (AI) treatment. We compared pre-treatment gene expression profiles between the two groups. In gene-level analysis, there was no significant difference between the two groups by the class comparison test. In pathway analysis, five metabolism-related gene sets were significantly upregulated in the H→L group (p≤0.05). In the search for novel targets, five genes (PARP, BRCA2, FLT4, CDK6, and PDCD1LG2) showed significantly higher expression in the H→H group (p≤0.05). Several metabolism-related pathways were associated with sensitivity to AI. In the future, it will be necessary to seek out new therapeutic strategies for the poor-prognosis group with high post-Ki67 levels.
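
    For readers unfamiliar with class comparison tests, the following sketch shows a generic gene-level version of such an analysis: per-gene t-tests between the H→H and H→L groups with Benjamini-Hochberg adjustment. The group sizes and random matrices are illustrative stand-ins for the pre-treatment profiles, not the study's data or its exact test.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(2)
        expr_HH = rng.normal(size=(25, 5000))   # 25 H->H patients x 5000 genes
        expr_HL = rng.normal(size=(20, 5000))   # 20 H->L patients

        t, p = stats.ttest_ind(expr_HH, expr_HL, axis=0)

        # Benjamini-Hochberg adjusted p-values (step-up procedure).
        m = len(p)
        order = np.argsort(p)
        ranks = np.arange(1, m + 1)
        q = np.minimum.accumulate((p[order] * m / ranks)[::-1])[::-1]
        print("genes significant at q <= 0.05:", int(np.sum(q <= 0.05)))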

    Opportunities and obstacles for deep learning in biology and medicine

    Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems in these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes, and treatment of patients) and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.
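
    As a minimal illustration of "layers of intermediate features", the sketch below trains a small feed-forward network on synthetic patient data; the dataset, architecture and hyperparameters are arbitrary choices for demonstration, not drawn from the review.

        import numpy as np
        from sklearn.model_selection import train_test_split
        from sklearn.neural_network import MLPClassifier

        rng = np.random.default_rng(3)
        X = rng.normal(size=(300, 50))                # 300 patients x 50 features
        y = (X[:, :5].sum(axis=1) > 0).astype(int)    # toy outcome from 5 features

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)
        # Two hidden layers learn intermediate representations of the raw inputs.
        clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000,
                            random_state=3).fit(X_tr, y_tr)
        print("held-out accuracy:", clf.score(X_te, y_te))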

    From Correlation to Causality: Does Network Information improve Cancer Outcome Prediction?

    Motivation: Disease progression in cancer can vary substantially between patients, yet patients often receive the same treatment. Recently, there has been much work on predicting disease progression and patient outcome variables from gene expression in order to personalize treatment options. A widely used approach is to derive predictive signature genes from high-throughput experiments in order to identify the clinical outcome of a disease. Microarray data analysis helps to reveal underlying biological mechanisms of tumor progression, metastasis, and drug resistance in cancer studies. Despite the first diagnostic kits on the market, open problems remain, such as essentially random gene signatures and noisy expression data. Experimental or computational noise in the data and the limited number of tissue samples collected from patients may furthermore reduce the predictive power and biological interpretability of such signature genes. Moreover, signature genes predicted by different studies generally show poor overlap, even for the same type of cancer. Integrating network information with gene expression data could provide more reliable signatures for outcome prediction in cancer studies. One approach to deal with these problems employs gene-gene relationships and ranks genes using the random surfer model of Google's PageRank algorithm. Unfortunately, the majority of published network-based approaches tested their methods on only a small number of datasets, calling into question the general applicability of network-based methods for outcome prediction.
    Methods: In this thesis, I provide a comprehensive and systematic evaluation of a network-based outcome prediction approach, NetRank (a PageRank derivative), applied to several types of gene expression cancer data and four different types of networks. The algorithm identifies a signature gene set for a specific cancer type by incorporating gene network information into the given expression data. To assess the performance of NetRank, I created a benchmark collection comprising 25 cancer outcome prediction datasets from the literature and one in-house dataset.
    Results: NetRank performs significantly better than classical methods such as fold change or the t-test, improving prediction performance by 7% on average. Moreover, this relatively unbiased but fully automated process for biomarker discovery approaches the accuracy level of the authors' hand-curated signatures. Despite an order of magnitude difference in network size, a regulatory network, a protein-protein interaction network and two predicted networks perform equally well. Signatures as published by the authors and signatures generated with classical methods do not overlap, not even for the same cancer type, whereas the network-based signatures overlap strongly. I analyze and discuss these overlapping genes in terms of the hallmarks of cancer, and in particular single out six transcription factors and seven proteins and discuss their specific roles in cancer progression. Furthermore, several tests were conducted to identify a Universal Cancer Signature. No Universal Cancer Signature could be identified so far, but a cancer-specific combination of general master regulators with specific cancer genes was discovered that achieves the best results for all cancer types. As NetRank offers great value for cancer outcome prediction, first steps towards the secure usage of NetRank in a public cloud are also described.
    Conclusion: Experimental evaluation of network-based methods on a gene expression benchmark collection suggests that these methods are especially suited for outcome prediction, as they overcome the problems of random gene signatures and noisy expression data. Through the combination of network information with gene expression data, network-based methods identify highly similar signatures across all cancer types, in contrast to classical methods that fail to identify common gene sets across the same cancer types. In general, integrating additional information into gene expression analysis allows the identification of more reliable, accurate and reproducible biomarkers and provides a deeper understanding of processes occurring in cancer development and progression.
    Outline: 1 Definition of Open Problems; 2 Introduction (2.1 Problems in cancer outcome prediction; 2.2 Network-based cancer outcome prediction; 2.3 Universal Cancer Signature); 3 Methods (3.1 NetRank algorithm; 3.2 Preprocessing and filtering of the microarray data; 3.3 Accuracy; 3.4 Signature similarity; 3.5 Classical approaches; 3.6 Random signatures; 3.7 Networks; 3.8 Direct neighbor method; 3.9 Dataset extraction); 4 Performance of NetRank (4.1 Benchmark dataset for evaluation; 4.2 The influence of NetRank parameters; 4.3 Evaluation of NetRank; 4.4 General findings; 4.5 Computational complexity of NetRank; 4.6 Discussion); 5 Universal Cancer Signature (5.1 Signature overlap, a sign for a Universal Cancer Signature; 5.2 NetRank genes are highly connected and confirmed in literature; 5.3 Hallmarks of Cancer; 5.4 Testing possible Universal Cancer Signatures; 5.5 Conclusion); 6 Cloud-based Biomarker Discovery (6.1 Introduction to secure Cloud computing; 6.2 Cancer outcome prediction; 6.3 Security analysis; 6.4 Conclusion); 7 Contributions and Conclusion
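
    The core NetRank idea, PageRank with a teleportation term driven by the expression data, can be sketched with networkx's personalized PageRank. The random network, the per-gene relevance scores and the damping factor below are illustrative assumptions; the thesis implementation is not reproduced here.

        import networkx as nx
        import numpy as np

        rng = np.random.default_rng(4)
        genes = [f"g{i}" for i in range(50)]
        G = nx.gnp_random_graph(50, 0.1, seed=4)           # toy gene network
        G = nx.relabel_nodes(G, dict(enumerate(genes)))

        # Per-gene relevance, e.g. |correlation| of expression with outcome.
        score = {g: float(rng.random()) for g in genes}
        total = sum(score.values())
        personalization = {g: s / total for g, s in score.items()}

        # alpha balances network structure against the expression signal.
        netrank = nx.pagerank(G, alpha=0.85, personalization=personalization)
        signature = sorted(netrank, key=netrank.get, reverse=True)[:10]
        print("top-10 network-ranked genes:", signature)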

    Distance-based methods for the analysis of Next-Generation sequencing data

    The analysis of next-generation sequencing (NGS) data is a central aspect of modern molecular genetics and oncology, yet extracting reliable data from the most commonly used sample sources poses a variety of problems. The first contribution of this thesis is a method that identifies whole-exome-sequenced cancer cell lines (CCLs) by quantifying a distance between their sets of small genomic variants. An unknown CCL is identified through pairwise comparison to reference datasets: a match is declared when the measured distance is smaller than would be expected between unrelated cultures. The method performed favorably in benchmarks but supported only the whole-exome-sequencing format. The second contribution is a published extension of the approach that additionally supports the widely used bulk mRNA-sequencing technology and the panel-sequencing format. This broader technology basis, however, amplified confounding effects that violate the mathematical conditions of a distance metric; these violations were therefore first quantified with statistical methods and then successfully compensated by dynamic threshold adaptation, yielding heterogeneity-robust benchmark performance at the trade-off of a slightly reduced sensitivity compared with the whole-exome method. The third contribution is a novel data-augmentation method that enables the training of machine-learning models in the absence of neoplastic training data. An abstract distance between neoplastic entities and entities of healthy origin is established via transcriptomic deconvolution, and machine-learning models trained on the deconvolution output effectively predict clinically relevant characteristics of rare but biologically diverse cancer types, with a predictive power comparable to that of models trained on the substituted neoplastic data and to the established gold-standard biomarker Ki-67. 
    Because no proliferation rate-indicative features are used for prediction, the method can complement the proliferation rate-oriented pathological assessment of biopsies. Overall, the thesis shows that quantifying an abstract distance can address several sources of erroneous NGS data analysis.
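
    The identification principle of the first contribution can be reduced to a few lines: compute a set distance between small-variant profiles and accept the closest reference if the distance is low enough. The Jaccard-style distance, toy variant encoding and threshold logic are illustrative assumptions, not the thesis implementation.

        def variant_distance(a: set, b: set) -> float:
            """1 - Jaccard similarity between two variant sets."""
            return 1.0 - len(a & b) / len(a | b)

        references = {
            "CCL_A": {("chr1", 12345, "A>G"), ("chr2", 555, "C>T"), ("chr7", 88, "G>A")},
            "CCL_B": {("chr1", 999, "T>C"), ("chr2", 555, "C>T"), ("chrX", 42, "A>T")},
        }
        query = {("chr1", 12345, "A>G"), ("chr7", 88, "G>A"), ("chr3", 7, "C>G")}

        distances = {name: variant_distance(query, ref) for name, ref in references.items()}
        best = min(distances, key=distances.get)
        # A match is only claimed if the distance is smaller than expected
        # between unrelated cultures (the threshold is calibrated empirically).
        print(distances, "->", best)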

    Investigating the prediction ability of survival models based on both clinical and omics data: two case studies

    In the biomedical literature, numerous prediction models for clinical outcomes have been developed based either on clinical data or, more recently, on high-throughput molecular data (omics data). Prediction models based on both types of data, however, are less common, although some recent studies suggest that a suitable combination of clinical and molecular information may lead to models with better predictive abilities. This is probably because it is not straightforward to combine data with different characteristics and dimensions (poorly characterized high-dimensional omics data, well-investigated low-dimensional clinical data). In this paper we analyze two publicly available datasets, related to breast cancer and neuroblastoma respectively, in order to show some possible ways to combine clinical and omics data into a prediction model of a time-to-event outcome. Different strategies and statistical methods are exploited. The results are compared and discussed according to different criteria, including the discriminative ability of the models, computed on a validation dataset.
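
    One straightforward strategy for such a combination is a single penalized Cox model over both feature blocks, sketched below with the lifelines library. The synthetic data frame, the penalty value and the choice to penalize all covariates equally are assumptions; the paper compares several alternative strategies.

        import numpy as np
        import pandas as pd
        from lifelines import CoxPHFitter

        rng = np.random.default_rng(5)
        n = 150
        df = pd.DataFrame(rng.normal(size=(n, 20)),
                          columns=[f"omics_{i}" for i in range(20)])
        df["age"] = rng.integers(30, 80, size=n)     # clinical covariates
        df["stage"] = rng.integers(1, 4, size=n)
        df["time"] = rng.exponential(24, size=n)     # follow-up (months)
        df["event"] = rng.integers(0, 2, size=n)     # 1 = event observed

        # The penalizer shrinks the many omics coefficients; a two-stage
        # strategy could instead leave the clinical block unpenalized.
        cph = CoxPHFitter(penalizer=0.1)
        cph.fit(df, duration_col="time", event_col="event")
        print("concordance index:", cph.concordance_index_)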

    Data- og ekspertdreven variabelseleksjon for prediktive modeller i helsevesenet: mot økt tolkbarhet i underbestemte maskinlæringsproblemer (Data- and expert-driven variable selection for predictive models in healthcare: towards increased interpretability in underdetermined machine-learning problems)

    Modern data acquisition techniques in healthcare generate large collections of data from multiple sources, such as novel diagnosis and treatment methodologies. Some concrete examples are electronic healthcare record systems, genomics, and medical images. This leads to situations with often unstructured, high-dimensional, heterogeneous patient cohort data where classical statistical methods may not be sufficient for optimal utilization of the data and informed decision-making. Instead, investigating such data structures with modern machine learning techniques promises to improve the understanding of patient health issues and may provide a better platform for informed decision-making by clinicians. Key requirements for this purpose include (a) sufficiently accurate predictions and (b) model interpretability. Achieving both aspects in parallel is difficult, particularly for datasets with few patients, which are common in the healthcare domain. In such cases, machine learning models encounter mathematically underdetermined systems and may overfit easily on the training data. An important approach to overcome this issue is feature selection, i.e., determining a subset of informative features from the original set of features with respect to the target variable. While potentially raising the predictive performance, feature selection fosters model interpretability by identifying a low number of relevant model parameters to better understand the underlying biological processes that lead to health issues. Interpretability requires that feature selection is stable, i.e., small changes in the dataset do not lead to changes in the selected feature set. A concept to address instability is ensemble feature selection, i.e., the process of repeating the feature selection multiple times on subsets of samples of the original dataset and aggregating the results in a meta-model. This thesis presents two approaches for ensemble feature selection which are tailored towards high-dimensional data in healthcare: the Repeated Elastic Net Technique for feature selection (RENT) and the User-Guided Bayesian Framework for feature selection (UBayFS). While RENT is purely data-driven and builds upon elastic net regularized models, UBayFS is a general framework for ensembles with the capability to include expert knowledge in the feature selection process via prior weights and side constraints. A case study modeling the overall survival of cancer patients compares these novel feature selectors and demonstrates their potential in clinical practice. Beyond the selection of single features, UBayFS also allows for selecting whole feature groups (feature blocks) that were acquired from multiple data sources, such as those mentioned above. Importance quantification of such feature blocks plays a key role in tracing information about the target variable back to the acquisition modalities. Such information on feature block importance may lead to more efficient use of human, technical, and financial resources if systematically integrated into the planning of patient treatment by excluding the acquisition of non-informative features. Since a generalization of feature importance measures to block importance is not trivial, this thesis also investigates and compares approaches for feature block importance rankings. This thesis demonstrates that high-dimensional datasets from multiple data sources in the medical domain can be successfully tackled by the presented approaches for feature selection. 
    Experimental evaluations demonstrate favorable properties in terms of predictive performance, stability, and interpretability of results, which carries high potential for better data-driven decision support in clinical practice.
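
    The ensemble idea behind RENT can be sketched in a few lines: fit elastic-net regularized models on repeated subsamples and keep the features that are selected in most repetitions. The data, the subsampling scheme and the 80% frequency threshold are illustrative assumptions, not the RENT defaults.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import ShuffleSplit

        rng = np.random.default_rng(6)
        X = rng.normal(size=(80, 500))               # few patients, many features
        y = (X[:, :10].sum(axis=1) + rng.normal(size=80) > 0).astype(int)

        counts = np.zeros(X.shape[1])
        splitter = ShuffleSplit(n_splits=100, train_size=0.8, random_state=6)
        for train_idx, _ in splitter.split(X):
            model = LogisticRegression(penalty="elasticnet", solver="saga",
                                       l1_ratio=0.5, C=0.1, max_iter=5000)
            model.fit(X[train_idx], y[train_idx])
            counts += (model.coef_.ravel() != 0)     # which features survived?

        # Stable features are those selected in at least 80% of repetitions.
        stable = np.where(counts / 100 >= 0.8)[0]
        print("stable features:", stable)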

    Computational Tools for the Untargeted Assignment of FT-MS Metabolomics Datasets

    Metabolomics is the study of metabolomes, the sets of metabolites observed in living systems. Metabolism interconverts these metabolites to provide the molecules and energy necessary for life processes. Many disease processes, including cancer, have a significant metabolic component that manifests as differences in which metabolites are present and in what quantities they are produced and utilized. Thus, using metabolomics, differences between metabolomes in disease and non-disease states can be detected, and these differences improve our understanding of disease processes at the molecular level. Despite the potential benefits of metabolomics, the comprehensive investigation of metabolomes remains difficult. A popular analytical technique for metabolomics is mass spectrometry. Advances in Fourier transform mass spectrometry (FT-MS) instrumentation have yielded simultaneous improvements in mass resolution, mass accuracy, and detection sensitivity. In the metabolomics field, these advantages permit more complicated but more informative experimental designs, such as the use of multiple isotope-labeled precursors in stable isotope-resolved metabolomics (SIRM) experiments. However, despite these potential applications, several outstanding problems hamper the use of FT-MS for metabolomics studies. First, artifacts and data quality problems in FT-MS spectra can confound downstream data analyses, confuse machine learning models, and complicate the robust detection and assignment of metabolite features. Second, the assignment of observed spectral features to metabolites remains difficult. Existing targeted approaches for assignment often employ databases of known metabolites; however, metabolite databases are incomplete, thus limiting or biasing assignment results. Additionally, FT-MS provides limited structural information for observed metabolites, which complicates the determination of metabolite class (e.g., lipid, sugar, etc.) for observed metabolite spectral features, a necessary step for many metabolomics experiments. To address these problems, a set of tools was developed. The first tool identifies artifacts with high peak density observed in many FT-MS spectra and removes them safely. Using this tool, two previously unreported types of high peak density artifact were identified in FT-MS spectra: fuzzy sites and partial ringing. Fuzzy sites were particularly problematic as they confused and reduced the accuracy of machine learning models trained on datasets containing these artifacts. Second, a tool called SMIRFE was developed to assign isotope-resolved molecular formulas to observed spectral features in an untargeted manner, without a database of expected metabolites. This new untargeted method was validated on a gold-standard dataset containing both unlabeled and 15N-labeled compounds and was able to identify 18 of 18 expected spectral features. Third, a collection of machine learning models was constructed to predict whether a molecular formula corresponds to one or more lipid categories. These models accurately predict the correct one of eight lipid categories on our training dataset of known lipid and non-lipid molecular formulas, with precisions and accuracies over 90% for most categories. These models were used to predict lipid categories for untargeted SMIRFE-derived assignments in a non-small cell lung cancer dataset. Subsequent differential abundance analysis revealed a sub-population of non-small cell lung cancer samples with a significantly increased abundance of sterol lipids. 
    This finding implies a possible therapeutic role of statins in the treatment and/or prevention of non-small cell lung cancer. Collectively, these tools represent a pipeline for FT-MS metabolomics datasets that is compatible with isotope labeling experiments. With these tools, more robust and untargeted metabolic analyses of disease will be possible.
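
    At its simplest, untargeted assignment means matching an observed mass against enumerated candidate formulas within a ppm tolerance, as in the sketch below. The tiny candidate table, the neutral-mass simplification and the 5 ppm tolerance are illustrative assumptions and stand in for SMIRFE's far larger isotope-resolved search.

        # Monoisotopic masses of a few elements (Da).
        MONO = {"C": 12.0, "H": 1.0078250319, "N": 14.0030740052, "O": 15.9949146221}

        def formula_mass(formula: dict) -> float:
            return sum(MONO[el] * n for el, n in formula.items())

        candidates = {
            "C6H12O6":   {"C": 6, "H": 12, "O": 6},
            "C5H10N2O3": {"C": 5, "H": 10, "N": 2, "O": 3},
            "C6H14N4O2": {"C": 6, "H": 14, "N": 4, "O": 2},
        }

        observed_mass, tol_ppm = 180.0634, 5.0       # observed neutral mass (Da)
        for name, comp in candidates.items():
            m = formula_mass(comp)
            ppm = abs(m - observed_mass) / observed_mass * 1e6
            if ppm <= tol_ppm:
                print(f"match: {name} at {m:.4f} Da ({ppm:.2f} ppm)")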

    Improving the Clinical Application of Genetic Testing for Patients with Inherited Heart Disease

    The difficulties associated with the clinical application of next generation sequencing (NGS) approaches can be substantial. However, the technology holds great potential to improve outcomes and risk management for inherited heart disease patients and their families. This PhD thesis focuses on three critical challenges associated with the clinical application of NGS technologies: (i) understanding correlations between genetics and the clinical phenotype; (ii) the impact of uncertainty created by NGS-based genetic testing on the patient; and (iii) developing evidence-based approaches for improving current and future methods for returning complex genetic results. Two studies focused on understanding correlations between genetics and the clinical phenotype. The first, reported in Chapter two, analysed genetic and phenotypic findings from a consecutive cohort of hypertrophic cardiomyopathy (HCM) families who had undergone comprehensive cardiac panel testing. We showed that increasing the number of genes included on HCM panels beyond the eight established sarcomere genes does not increase the diagnostic yield. However, identification of variants of uncertain significance increased dramatically, from 13% to 36%. This result has important clinical implications: increasing the number of genes screened does not necessarily improve the chance of identifying a family's cause of disease but is likely to increase identification of uncertain results, which can increase the complexity of discussions around inheritance and risk. We also showed that identification of multiple rare variants was associated with earlier disease onset, greater likelihood of family history of sudden cardiac death (SCD) and overall worse event-free survival (5/18 events versus 29/170 events, log-rank test p=0.008). Clinical heterogeneity is a hallmark feature of HCM, and here we show one factor that can contribute to worse outcomes. The study reported in Chapter three aimed to describe the diverse genetic and phenotypic features of an international cohort of patients with a truncating variant in the desmoplakin (DSP) gene, traditionally considered to cause arrhythmogenic right ventricular cardiomyopathy (ARVC). This is the first complete report of the diverse phenotype spectrum of DSP cardiomyopathy, suggesting that truncating variants in DSP should be considered their own disease entity. We show that the phenotype associated with truncations in DSP is severe, with a high rate of SCD, and is characterised by left ventricular dysfunction and structural left ventricular involvement when compared with 'classic' ARVC. In addition, we show that the functional domain in which a DSP truncating variant resides has clinical significance. Key results from Chapters two and three emphasise the importance of correlating genetic variants with phenotype information. Data from this work is a step towards precision medicine, whereby patients' risk will be assessed and managed based on their genotype. In Chapter four we designed a qualitative study to explore attitudes, preferences, recall and psychosocial consequences of uncertain genetic results returned to HCM probands. The major themes we identified were knowledge and recall of complex genetic information, individual experiences with HCM genetic testing and communication, and the value of information. In addition, those with uninformative results had a unique set of issues. 
    We found that HCM probands undergoing genetic testing require additional support and information beyond the current practice model employed in the multidisciplinary specialist clinic setting. Importantly, many of the probands interviewed who received an uninformative or uncertain genetic result showed poor recall and understanding of genetic information. The final aim was to determine whether a genetic counsellor-led intervention using a communication aid for the delivery of HCM genetic test results improves the ability and confidence of the proband to communicate genetic results to at-risk relatives. This work is presented in Chapter five. We developed a study protocol for a randomised controlled trial with the primary outcome being the ability and confidence of the proband to communicate genetic results to at-risk relatives, focusing on transforming findings from the previous chapters into improved clinical practice. The a priori primary outcome did not show statistically significant differences between the control and intervention groups, though the majority of probands in the intervention group achieved fair communication scores and had higher genetic knowledge scores than those in the control group. Importantly, we found that 29% of at-risk relatives were not informed of a genetic result in their family, highlighting a significant gap in our current approach to supporting family communication about genetics. The studies presented show that, whilst genetic testing has significant potential for benefit in inherited heart disease families in terms of diagnosis, management and family screening, there are issues to address in order to improve the clinical utility and application of a comprehensive approach to testing. Overall, the work contributes to understanding the genetic architecture of inherited heart diseases and the clinical impact of NGS results for patients, as well as highlighting that more work is needed to improve the clinical utility of an NGS approach to genetic testing.
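
    The event-free survival comparison reported in Chapter two (5/18 versus 29/170 events, log-rank p=0.008) follows the standard pattern sketched below with the lifelines library. Only the group sizes and event counts come from the text; the follow-up times are synthetic, so the printed p-value will not reproduce the published one.

        import numpy as np
        from lifelines.statistics import logrank_test

        rng = np.random.default_rng(7)
        # Probands with multiple rare variants: 18 patients, 5 events.
        t_multi = rng.exponential(10, size=18)
        e_multi = np.array([1] * 5 + [0] * 13)
        # Remaining probands: 170 patients, 29 events.
        t_single = rng.exponential(10, size=170)
        e_single = np.array([1] * 29 + [0] * 141)

        result = logrank_test(t_multi, t_single,
                              event_observed_A=e_multi, event_observed_B=e_single)
        print("log-rank p-value:", result.p_value)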