9 research outputs found

    A comparison of machine learning techniques for survival prediction in breast cancer

    Background: The ability to accurately classify cancer patients into risk classes, i.e. to predict the outcome of the pathology on an individual basis, is a key ingredient in making therapeutic decisions. In recent years gene expression data have been successfully used to complement the clinical and histological criteria traditionally used in such prediction. Many "gene expression signatures" have been developed, i.e. sets of genes whose expression values in a tumor can be used to predict the outcome of the pathology. Here we investigate the use of several machine learning techniques to classify breast cancer patients using one such signature, the well-established 70-gene signature. Results: We show that Genetic Programming performs significantly better than Support Vector Machines, Multilayered Perceptrons and Random Forests in classifying patients from the NKI breast cancer dataset, and comparably to the scoring-based method originally proposed by the authors of the 70-gene signature. Furthermore, Genetic Programming is able to perform an automatic feature selection. Conclusions: Since the performance of Genetic Programming can likely be improved beyond the out-of-the-box approach used here, and given the biological insight potentially provided by the Genetic Programming solutions, we conclude that Genetic Programming methods are worth further investigation as a tool for cancer patient classification based on gene expression data.

    Building Gene Expression Profile Classifiers with a Simple and Efficient Rejection Option in R

    Background: The collection of gene expression profiles from DNA microarrays and their analysis with pattern recognition algorithms is a powerful technology applied to several biological problems. Common pattern recognition systems classify samples by assigning them to a set of known classes. However, in a clinical diagnostics setup, novel and unknown classes (new pathologies) may appear, and one must be able to reject those samples that do not fit the trained model. The problem of implementing a rejection option in a multi-class classifier has not been widely addressed in the statistical literature. Gene expression profiles represent a critical case study since they suffer from the curse of dimensionality, which negatively affects the reliability of both traditional rejection models and more recent approaches such as one-class classifiers. Results: This paper presents a set of empirical decision rules that can be used to implement a rejection option in a set of multi-class classifiers widely used for the analysis of gene expression profiles. In particular, we focus on the classifiers implemented in the R Language and Environment for Statistical Computing (R for short in the remainder of this paper). The main contribution of the proposed rules is their simplicity, which enables an easy integration with available data analysis environments. Since tuning the parameters involved in the definition of a rejection model is often a complex and delicate task, in this paper we exploit an evolutionary strategy to automate this process. This allows the final user to maximize the rejection accuracy with minimum manual intervention. Conclusions: This paper shows how simple decision rules can support the use of complex machine learning algorithms in real experimental setups.
The proposed approach is almost completely automated and is therefore a good candidate for integration into data analysis flows in labs where the machine learning expertise required to tune traditional classifiers might not be available.
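
The abstract above does not spell out its decision rules, so the following is only a minimal Python sketch of one common way to add a rejection option to a multi-class classifier. The function name `classify_with_rejection` and the two thresholds `min_confidence` and `min_margin` are hypothetical and are not taken from the paper.

```python
def classify_with_rejection(probs, min_confidence=0.6, min_margin=0.2):
    """Return the predicted label, or None when the sample is rejected.

    probs maps each class label to the classifier's estimated probability.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    # Sort classes from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_p = ranked[0]
    second_p = ranked[1][1] if len(ranked) > 1 else 0.0

    # Reject when the winner is weak or barely ahead of the runner-up.
    if best_p < min_confidence or best_p - second_p < min_margin:
        return None
    return best_label


print(classify_with_rejection({"ALL": 0.90, "AML": 0.05, "CLL": 0.05}))
print(classify_with_rejection({"ALL": 0.40, "AML": 0.35, "CLL": 0.25}))
```

In a setup like the one the abstract describes, the thresholds would be tuned automatically (the paper uses an evolutionary strategy) rather than fixed by hand as they are here.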

    Meta-analysis of estrogen response in MCF-7 distinguishes early target genes involved in signaling and cell proliferation from later target genes involved in cell cycle and DNA repair

    ABSTRACT: BACKGROUND: Many studies have been published outlining the global effects of 17 beta-estradiol (E2) on gene expression in human epithelial breast cancer derived MCF-7 cells. These studies show large variation in results, reporting between ~100 and ~1500 genes regulated by E2, with poor overlap. RESULTS: We performed a meta-analysis of these expression studies, using the Rank product method to obtain a more accurate and stable list of the differentially expressed genes, and of pathways regulated by E2. We analyzed 9 time-series data sets, concentrating on response at 3-4 hrs (early) and at 24 hrs (late). We found >1000 statistically significant probe sets after correction for multiple testing at 3-4 hrs, and >2000 significant probe sets at 24 hrs. Differentially expressed genes were examined by pathway analysis. This revealed 15 early response pathways, mostly related to cell signaling and proliferation, and 20 late response pathways, mostly related to breast cancer, cell division, DNA repair and recombination. CONCLUSIONS: Our results show that meta-analysis identified more differentially expressed genes than the individual studies, and that these genes act together in networks. These results provide new insight into E2 regulated mechanisms, especially in the context of breast cancer
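
The rank product statistic mentioned in the abstract above can be illustrated with a short Python sketch. It ranks genes within each study by fold change (rank 1 is the most up-regulated) and takes the geometric mean of each gene's ranks across studies. The function name `rank_product` and the toy values are assumptions made for illustration, and the permutation-based significance testing and multiple-testing correction used in practice are omitted.

```python
import math


def rank_product(fold_changes):
    """Geometric mean of per-study ranks for each gene.

    fold_changes maps a gene name to a list of log fold changes,
    one value per study. A smaller result means the gene is more
    consistently up-regulated across studies.
    """
    genes = list(fold_changes)
    n_studies = len(next(iter(fold_changes.values())))
    log_rank_sum = {g: 0.0 for g in genes}

    for i in range(n_studies):
        # Rank 1 goes to the largest fold change in study i.
        # Ties are broken by input order in this simplified sketch.
        ordered = sorted(genes, key=lambda g: fold_changes[g][i], reverse=True)
        for rank, g in enumerate(ordered, start=1):
            log_rank_sum[g] += math.log(rank)

    # Summing logs and exponentiating avoids overflow for many studies.
    return {g: math.exp(s / n_studies) for g, s in log_rank_sum.items()}


toy = {"A": [2.0, 1.5, 1.8], "B": [0.5, 1.9, 0.2], "C": [-1.0, -0.5, 0.1]}
rp = rank_product(toy)
print(sorted(rp, key=rp.get))
```

Because only ranks are used, the statistic does not depend on the measurement scale of each study, which is one reason it suits meta-analysis of heterogeneous expression data sets.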

    Can machine learning methods contribute as a decision support system in sequential oligometastatic radioablation therapy?

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics. Cancer treatment is among the major medical challenges of this century. Sequential oligometastatic radio-ablation (SOMA) is a novel treatment method that aims at ablating recurring metastases in a single session with a targeted high dose of radiation. To know if SOMA is the best possible treatment method for a patient, the benefits of each available therapy need to be understood and evaluated. The ability to model complex systems, such as cancer treatment, is the strength of machine learning techniques. These techniques have already improved the understanding of numerous medical therapies. In some cases, they can serve as medical support systems if they deliver reliable results that doctors can trust and understand. The results obtained from applying numerous machine learning techniques to the data of SOMA-treated patients show that some techniques are preferable in certain cases. The Random Forest algorithm proved superior at several classification tasks. Regression problems, in contrast, posed a great challenge, as the amount of data is very limited. Finally, SHAP values, a novel machine learning interpretation technique, provided valuable insights into the rationale of each algorithm. They showed that the machine learning algorithms could learn patterns aligned with human intuition in the problems presented. SHAP values show great potential in bridging the gap between complex machine learning algorithms and their interpretability. They display how an algorithm learns from the data and derives results. This opens up exciting possibilities for applying machine learning algorithms in the real world

    Design and application of SuRFR: an R package to prioritise candidate functional DNA sequence variants

    Genetic analyses such as linkage and genome wide association studies (GWAS) have been extremely successful at identifying genomic regions that harbour genetic variants contributing to complex disorders. Over 90% of disease-associated variants from GWAS fall within non-coding regions (Maurano et al., 2012). However, pinpointing the causal variants has proven a major bottleneck to genetic research. To address this I have developed SuRFR, an R package for the ranked prioritisation of candidate causal variants by predicted function. SuRFR produces rank orderings of variants based upon functional genomic annotations, including DNase hypersensitivity signal, chromatin state, minor allele frequency, and conservation. The ranks for each annotation are combined into a final prioritisation rank using a weighting system that has been parametrised and tested through ten-fold cross-validation. SuRFR has been tested extensively upon a combination of synthetic and real datasets and has been shown to perform with high sensitivity and specificity. These analyses have provided insight into the extent to which different classes of functional annotation are most useful for the identification of known regulatory variants: the most important factor for identifying a true variant across all classes of regulatory variants is position relative to genes. I have also shown that SuRFR performs at least as well as its nearest competitors whilst benefiting from the advantages that come from being part of the R environment. I have applied SuRFR to several genomics projects, particularly the study of psychiatric illness, including genome sequencing of a large Scottish family with bipolar disorder. This has resulted in the prioritisation of such variants for future study
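
The weighted combination of per-annotation ranks described in the abstract above might look roughly like the Python sketch below. It is not SuRFR's implementation (SuRFR is an R package); the function name `combine_ranks`, the annotation names and the weights are all hypothetical, and a lower rank is assumed to mean stronger evidence of function.

```python
def combine_ranks(annotation_ranks, weights):
    """Order variants by a weighted sum of their per-annotation ranks.

    annotation_ranks maps variant -> {annotation name: rank}, where
    rank 1 is assumed to be the most likely functional variant.
    weights maps annotation name -> weight (illustrative values only).
    """
    scores = {}
    for variant, ranks in annotation_ranks.items():
        scores[variant] = sum(weights[name] * r for name, r in ranks.items())
    # The smallest weighted score is prioritised first.
    return sorted(scores, key=scores.get)


ranks = {
    "v1": {"dnase": 1, "conservation": 3},
    "v2": {"dnase": 2, "conservation": 1},
    "v3": {"dnase": 3, "conservation": 2},
}
print(combine_ranks(ranks, {"dnase": 0.7, "conservation": 0.3}))
print(combine_ranks(ranks, {"dnase": 0.3, "conservation": 0.7}))
```

Changing the weights changes the final ordering, which is why the abstract reports parametrising and testing the weighting system through ten-fold cross-validation.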

    Genetic predictors for epilepsy development, treatment response and dosing

    Antiepileptic drug (AED) treatment is the first line strategy for seizure control in the majority of individuals with epilepsy but remains challenging, not least because of interindividual variability in efficacy, tolerability and dosing. The studies presented in this thesis set out to explore that variability from a genomic perspective in patients with newly diagnosed epilepsy from across the UK. Single nucleotide polymorphisms (SNPs) in genes encoding drug metabolising enzymes (DMEs) may be associated with the dose of carbamazepine (CBZ) required for seizure control. A cohort of 159 individuals who were seizure-free for 12 months on a stable dose of CBZ monotherapy was genotyped for 51 SNPs across six DMEs. Haplotype analysis identified 8 haplotype blocks across the genes. No single SNPs or haplotype blocks were associated with CBZ dose. Thus, it is unlikely that genetic variability in DMEs accounts for the individual differences in CBZ dose requirement. A splice site SNP (rs3812718) in the SCN1A gene was previously shown to influence maximum doses of AEDs. This SNP was genotyped in 817 patients and tested for association with maximum and maintenance doses of several AEDs. An association was identified between rs3812718 and maximum AED dose, with an interaction analysis suggestive of a drug specific effect. These findings suggest that this SCN1A variant contributes to variability in the limit of tolerability to AEDs. Response to AED treatment is multifactorial and likely to be influenced by multiple genes. Five SNPs previously reported to predict treatment outcome in epilepsy were genotyped in 772 patients and the resulting data, together with data from an Australian cohort, incorporated into a predictive algorithm. The algorithm failed to predict treatment outcome in general but was partially successful in identifying responders to CBZ and valproate. These five SNPs may be relevant to the prognosis of epilepsy, particularly when treated with specific AEDs. 
Primary generalised epilepsies (PGEs) are highly heritable and believed to be polygenic in origin. Predictive algorithms were employed to explore genetic influences on seizure (absence vs. myoclonus) and epilepsy (PGE vs. focal) type using 1,840 SNP genotypes available from 436 patients with PGE. Although the algorithms failed to distinguish PGE patients on the basis of genetic variants, they showed improved association over univariate methods of analysis. Such an approach may be suitable for future investigations using large genomic datasets. A recent genome-wide association study identified multiple genetic variants that approached genome-wide significance for association with 12 month remission from seizures. Five of these SNPs were genotyped in an independent cohort of 424 patients and tested for association with remission and time to remission. No significant associations were found, questioning the validity of the original observation or the method of replication. Further work is required to understand this outcome. In conclusion, the genetic bases of epilepsy, AED response and AED dose requirement are multigenic and thus far undetectable using traditional association studies in modestly-sized patient cohorts. Further advances in genomic, bioinformatics and statistical methodologies are required before the genetic contribution to heterogeneity in epilepsy-related phenotypes can be translated into improved clinical care