625 research outputs found

    A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification

    Background: Cancer diagnosis and clinical outcome prediction are among the most important emerging applications of gene expression microarray technology with several molecular signatures on their way toward clinical deployment. Use of the most accurate classification algorithms available for microarray gene expression data is a critical ingredient in order to develop the best possible molecular signatures for patient care. As suggested by a large body of literature to date, support vector machines can be considered "best of class" algorithms for classification of such data. Recent work, however, suggests that random forest classifiers may outperform support vector machines in this domain. Results: In the present paper we identify methodological biases of prior work comparing random forests and support vector machines and conduct a new rigorous evaluation of the two algorithms that corrects these limitations. Our experiments use 22 diagnostic and prognostic datasets and show that support vector machines outperform random forests, often by a large margin. Our data also underlines the importance of sound research design in benchmarking and comparison of bioinformatics algorithms. Conclusion: We found that both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines both in the settings when no gene selection is performed and when several popular gene selection methods are used.

    Challenges in the Analysis of Mass-Throughput Data: A Technical Commentary from the Statistical Machine Learning Perspective

    Sound data analysis is critical to the success of modern molecular medicine research that involves collection and interpretation of mass-throughput data. The novel nature and high-dimensionality in such datasets pose a series of nontrivial data analysis problems. This technical commentary discusses the problems of over-fitting, error estimation, curse of dimensionality, causal versus predictive modeling, integration of heterogeneous types of data, and lack of standard protocols for data analysis. We attempt to shed light on the nature and causes of these problems and to outline viable methodological approaches to overcome them

    The FAST-AIMS Clinical Mass Spectrometry Analysis System

    Within clinical proteomics, mass spectrometry analysis of biological samples is emerging as an important high-throughput technology, capable of producing powerful diagnostic and prognostic models and identifying important disease biomarkers. As interest in this area grows, and the number of such proteomics datasets continues to increase, the need has developed for efficient, comprehensive, reproducible methods of mass spectrometry data analysis by both experts and nonexperts. We have designed and implemented a stand-alone software system, FAST-AIMS, which seeks to meet this need through automation of data preprocessing, feature selection, classification model generation, and performance estimation. FAST-AIMS is an efficient and user-friendly stand-alone software for predictive analysis of mass spectrometry data. The present resource review paper will describe the features and use of the FAST-AIMS system. The system is freely available for download for noncommercial use

    Effects of Environment, Genetics and Data Analysis Pitfalls in an Esophageal Cancer Genome-Wide Association Study

    The development of new high-throughput genotyping technologies has allowed fast evaluation of single nucleotide polymorphisms (SNPs) on a genome-wide scale. Several recent genome-wide association studies employing these technologies suggest that panels of SNPs can be a useful tool for predicting cancer susceptibility and discovery of potentially important new disease loci. In the present paper we undertake a careful examination of the relative significance of genetics, environmental factors, and biases of the data analysis protocol that was used in a previously published genome-wide association study. That prior study reported a nearly perfect discrimination of esophageal cancer patients and healthy controls on the basis of only genetic information. On the other hand, our results strongly suggest that SNPs in this dataset are not statistically linked to the phenotype, while several environmental factors and especially family history of esophageal cancer (a proxy to both environmental and genetic factors) have only a modest association with the disease. The main component of the previously claimed strong discriminatory signal is due to several data analysis pitfalls that in combination led to the strongly optimistic results. Such pitfalls are preventable and should be avoided in future studies since they create misleading conclusions and generate many false leads for subsequent research.

    Impact of managed clinical networks on neonatal care in England : a population-based study

    Objective: To assess the impact of reorganisation of neonatal specialist care services in England after a UK Department of Health report in 2003. Design: A population-wide observational comparison of outcomes over two epochs, before and after the establishment of managed clinical neonatal networks. Setting: Epoch one: 294 maternity and neonatal units in England, Wales, and Northern Ireland, 1 September 1998 to 31 August 2000, as reported by the Confidential Enquiry into Stillbirths and Sudden Deaths in Infancy Project 27/28. Epoch two: 146 neonatal units in England contributing data to the National Neonatal Research Database at the Neonatal Data Analysis Unit, 1 January 2009 to 31 December 2010. Participants: Babies born at a gestational age of 27+0-28+6 (weeks+days): 3522 live births in epoch one; 2919 babies admitted to a neonatal unit within 28 days of birth in epoch two. Intervention: The national reorganisation of neonatal services into managed clinical networks. Main outcome measures: The proportion of babies born at hospitals providing the highest volume of neonatal specialist care (≥2000 neonatal intensive care days annually), having an acute transfer (within the first 24 hours after birth) and/or a late transfer (between 24 hours and 28 days after birth) to another hospital, assessed by change in distribution of transfer category (“none,” “acute,” “late”), and babies from multiple births separated by transfer. For acute transfers in epoch two, the level of specialist neonatal care provided at the destination hospital (British Association of Perinatal Medicine criteria). 
Results: After reorganisation, there were increases in the proportions of babies born at 27-28 weeks’ gestation in hospitals providing the highest volume of neonatal specialist care (18% (631/3495) v 49% (1325/2724); odds ratio 4.30, 95% confidence interval 3.83 to 4.82; P<0.001) and in acute and late postnatal transfers (7% (235) v 12% (360) and 18% (579) v 22% (640), respectively; P<0.001). There was no significant change in the proportion of babies from multiple births separated by transfer (33% (39) v 29% (38); 0.86, 0.50 to 1.46; P=0.57). In epoch two, 32% of acute transfers were to a neonatal unit providing either an equivalent (n=87) or lower (n=26) level of specialist care. Conclusions: There is evidence of some improvement in the delivery of neonatal specialist care after reorganisation. The increase in acute transfers in epoch two, in conjunction with the high proportion transferred to a neonatal unit providing an equivalent or lower level of specialist care, and the continued separation of babies from multiple births, are indicative of poor coordination between maternity and neonatal services to facilitate in utero transfer before delivery, and continuing inadequacies in capacity of intensive care cots. Historical data representing epoch one are available only in aggregate form, preventing examination of temporal trends or confounding factors. This limits the extent to which differences between epochs can be attributed to reorganisation and highlights the importance of routine, prospective data collection for evaluation of future health service reorganisations
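
The headline odds ratio in the abstract above can be reproduced from the reported counts (631/3495 in epoch one, 1325/2724 in epoch two). The sketch below assumes a Woolf (log-scale) confidence interval with z = 1.96; the abstract does not state which method the authors used, and the function name `odds_ratio_ci` is hypothetical.

```python
import math

def odds_ratio_ci(events_new, total_new, events_old, total_old, z=1.96):
    """Odds ratio of 'new' vs 'old' with a Woolf (log-scale) interval.

    Assumption: the Woolf method is used here for illustration only;
    the source abstract does not name its interval method.
    """
    a = events_new                 # events in the later epoch
    b = total_new - events_new     # non-events in the later epoch
    c = events_old                 # events in the earlier epoch
    d = total_old - events_old     # non-events in the earlier epoch
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log)
    upper = math.exp(log_or + z * se_log)
    return odds_ratio, lower, upper

# Counts taken from the abstract: births in highest-volume hospitals
or_, lower, upper = odds_ratio_ci(1325, 2724, 631, 3495)
print(f"OR = {or_:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Under these assumptions the result agrees with the reported odds ratio of 4.30 (95% confidence interval 3.83 to 4.82). The multiple-births comparison cannot be checked the same way because its denominators are not given in the abstract.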

    Generative Invertible Networks (GIN): Pathophysiology-Interpretable Feature Mapping and Virtual Patient Generation

    Machine learning methods play increasingly important roles in pre-procedural planning for complex surgeries and interventions. Very often, however, researchers find the historical records of emerging surgical techniques, such as the transcatheter aortic valve replacement (TAVR), are highly scarce in quantity. In this paper, we address this challenge by proposing novel generative invertible networks (GIN) to select features and generate high-quality virtual patients that may potentially serve as an additional data source for machine learning. Combining a convolutional neural network (CNN) and generative adversarial networks (GAN), GIN discovers the pathophysiologic meaning of the feature space. Moreover, a test of predicting the surgical outcome directly using the selected features results in a high accuracy of 81.55%, which suggests little pathophysiologic information has been lost while conducting the feature selection. This demonstrates GIN can generate virtual patients not only visually authentic but also pathophysiologically interpretable

    Automated Discrimination of Pathological Regions in Tissue Images: Unsupervised Clustering vs Supervised SVM Classification

    Recognizing and isolating cancerous cells from non-pathological tissue areas (e.g. connective stroma) is crucial for fast and objective immunohistochemical analysis of tissue images. This operation allows the further application of fully-automated techniques for quantitative evaluation of protein activity, since it avoids the need for prior manual selection of the representative pathological areas in the image, as well as for taking pictures only in the purely cancerous portions of the tissue. In this paper we present a fully-automated method based on unsupervised clustering that performs tissue segmentations highly comparable with those provided by a skilled operator, achieving on average an accuracy of 90%. Experimental results on a heterogeneous dataset of immunohistochemical lung cancer tissue images demonstrate that our proposed unsupervised approach exceeds the accuracy of a theoretically superior supervised method such as Support Vector Machine (SVM) by 8%.