3,554 research outputs found

    Improving Breast Cancer Survival Analysis through Competition-Based Multidimensional Modeling

    Breast cancer is the most common malignancy in women and is responsible for hundreds of thousands of deaths annually. As with most cancers, it is a heterogeneous disease and different breast cancer subtypes are treated differently. Understanding the difference in prognosis for breast cancer based on its molecular and phenotypic features is one avenue for improving treatment by matching the proper treatment with molecular subtypes of the disease. In this work, we employed a competition-based approach to modeling breast cancer prognosis using large datasets containing genomic and clinical information and an online real-time leaderboard program used to speed feedback to the modeling team and to encourage each modeler to work towards achieving a higher ranked submission. We find that machine learning methods combined with molecular features selected based on expert prior knowledge can improve survival predictions compared to current best-in-class methodologies and that ensemble models trained across multiple user submissions systematically outperform individual models within the ensemble. We also find that model scores are highly consistent across multiple independent evaluations. This study serves as the pilot phase of a much larger competition open to the whole research community, with the goal of understanding general strategies for model optimization using clinical and molecular profiling data and providing an objective, transparent system for assessing prognostic models.

    Intra-tumour signalling entropy determines clinical outcome in breast and lung cancer.

    The cancer stem cell hypothesis, that a small population of tumour cells is responsible for tumorigenesis and cancer progression, is becoming widely accepted and recent evidence has suggested a prognostic and predictive role for such cells. Intra-tumour heterogeneity, the diversity of the cancer cell population within the tumour of an individual patient, is related to cancer stem cells and is also considered a potential prognostic indicator in oncology. The measurement of cancer stem cell abundance and intra-tumour heterogeneity in a clinically relevant manner, however, currently presents a challenge. Here we propose signalling entropy, a measure of signalling pathway promiscuity derived from a sample's genome-wide gene expression profile, as an estimate of the stemness of a tumour sample. By considering over 500 mixtures of diverse cellular expression profiles, we reveal that signalling entropy also associates with intra-tumour heterogeneity. By analysing 3668 breast cancer and 1692 lung adenocarcinoma samples, we further demonstrate that signalling entropy correlates negatively with survival, outperforming leading clinical gene expression based prognostic tools. Signalling entropy is found to be a general prognostic measure, valid in different breast cancer clinical subgroups, as well as within stage I lung adenocarcinoma. We find that its prognostic power is driven by genes involved in cancer stem cells and treatment resistance. In summary, by approximating both stemness and intra-tumour heterogeneity, signalling entropy provides a powerful prognostic measure across different epithelial cancers.

    Circulation

    Spurred by advances in processing power, memory, storage, and an unprecedented wealth of data, computers are being asked to tackle increasingly complex learning tasks, often with astonishing success. Computers have now mastered a popular variant of poker, learned the laws of physics from experimental data, and become experts in video games - tasks that would have been deemed impossible not too long ago. In parallel, the number of companies centered on applying complex data analysis to varying industries has exploded, and it is thus unsurprising that some analytic companies are turning attention to problems in health care. The purpose of this review is to explore what problems in medicine might benefit from such learning approaches and use examples from the literature to introduce basic concepts in machine learning. It is important to note that seemingly large enough medical data sets and adequate learning algorithms have been available for many decades, and yet, although there are thousands of papers applying machine learning algorithms to medical data, very few have contributed meaningfully to clinical care. This lack of impact stands in stark contrast to the enormous relevance of machine learning to many other industries. Thus, part of my effort will be to identify what obstacles there may be to changing the practice of medicine through statistical learning approaches, and discuss how these might be overcome.
    Funding (NIH HHS, United States): K08 HL098361, K08 HL093861, U01 HL107440, DP2 HL123228. PMID: 26572668; PMCID: PMC5831252.

    Data Mining in Hospital Information System


    Statistical Methods for Gene-Environment Interactions

    Although significant main effects of genetic and environmental risk factors have been found, the interactions between them can play critical roles and have important implications in medical genetics and epidemiology. Although many important gene-environment (G-E) interactions have been identified, the existing findings are still insufficient and there exists a strong need to develop statistical methods for analyzing G-E interactions. In this dissertation, we propose four statistical methodologies and computational algorithms for detecting G-E interactions and one application to imaging data. Extensive simulation studies are conducted in comparison with multiple advanced alternatives. In the analyses of The Cancer Genome Atlas datasets on multiple cancers, biologically meaningful findings are obtained. First, we develop two robust interaction analysis methods for prognostic outcomes. Compared to continuous and categorical outcomes, prognosis has been less investigated, with additional challenges brought by the unique characteristics of survival times. Most of the existing G-E interaction approaches for prognosis data share the limitation that they cannot accommodate long-tailed or contaminated outcomes. In the first method, we adopt censored quantile regression and partial correlation for survival outcomes. Under a marginal modeling framework, this proposed approach is robust to long-tailed prognosis and is computationally straightforward to apply. Furthermore, outliers and contamination among predictors are observed in real data. In the second method, we propose a joint model using penalized trimmed regression that is robust to leverage points and vertical outliers. The proposed method respects the hierarchical structure of main effects and interactions and has an effective computational algorithm based on coordinate descent optimization and stability selection.
    Second, we propose a penalized approach to incorporate additional information for identifying important hierarchical interactions. Because of the high dimensionality and low signal levels, analyzing interactions is challenging, so incorporating additional information is desirable. We adopt the minimax concave penalty for regularized estimation and the Laplacian quadratic penalty for additional information. Under a unified formulation, multiple types of additional information and genetic measurements can be effectively utilized and improved identification accuracy can be achieved. Third, we develop a three-step procedure using multidimensional molecular data to identify G-E interactions. Recent studies have shown that collectively analyzing multiple types of molecular changes is not only biologically sensible but also leads to improved estimation and prediction. In this proposed method, we first estimate the relationship between gene expressions and their regulators by a multivariate penalized regression, and then identify regulatory modules via sparse biclustering. Next, we establish integrative covariates by principal components extracted from the identified regulatory modules. Finally, we construct a joint model for disease outcomes and employ Lasso-based penalization to select important main effects and hierarchical interactions. The proposed method expands the scope of interaction analysis to multidimensional molecular data. Last, we present an application using both marginal and joint models to analyze histopathological imaging-environment interactions. In cancer diagnosis, histopathological imaging has been routinely conducted and can be processed to generate high-dimensional features. To explore potential interactions, we conduct marginal and joint analyses, which have been extensively examined in the context of G-E interactions. This application extends the practical applicability of interaction analysis to imaging data and provides an alternative avenue that combines histopathological imaging and environmental data in cancer modeling.
    Motivated by the important implications of G-E interactions and to overcome the limitations of the existing methods, the goal of this dissertation is to advance methodological development for G-E interaction analysis and to provide practically useful tools for identifying important interactions. The proposed methods emerge from practical issues observed in real data and have solid statistical properties. With a balance between theory, computation, and data analysis, this dissertation provides four novel approaches for analyzing interactions to achieve more robust and accurate identification of biologically meaningful interactions.

    Evolutionary Computation and QSAR Research

    The successful high throughput screening of molecule libraries for a specific biological property is one of the main improvements in drug discovery. The virtual molecular filtering and screening relies greatly on quantitative structure-activity relationship (QSAR) analysis, a mathematical model that correlates the activity of a molecule with molecular descriptors. QSAR models have the potential to reduce the costly failure of drug candidates in advanced (clinical) stages by filtering combinatorial libraries, eliminating candidates with a predicted toxic effect and poor pharmacokinetic profiles, and reducing the number of experiments. To obtain a predictive and reliable QSAR model, scientists use methods from various fields such as molecular modeling, pattern recognition, machine learning or artificial intelligence. QSAR modeling relies on three main steps: molecular structure codification into molecular descriptors, selection of relevant variables in the context of the analyzed activity, and search of the optimal mathematical model that correlates the molecular descriptors with a specific activity. Since a variety of techniques from statistics and artificial intelligence can aid variable selection and model building steps, this review focuses on the evolutionary computation methods supporting these tasks. Thus, this review explains the basics of genetic algorithms and genetic programming as evolutionary computation approaches, the selection methods for high-dimensional data in QSAR, the methods to build QSAR models, the current evolutionary feature selection methods and applications in QSAR, and future trends in joint or multi-task feature selection methods.
    Funding: Instituto de Salud Carlos III, PIO52048; Instituto de Salud Carlos III, RD07/0067/0005; Ministerio de Industria, Comercio y Turismo, TSI-020110-2009-53; Galicia. Consellería de Economía e Industria, 10SIN105004P.

    Data- and knowledge-based modeling of gene regulatory networks: an update

    Gene regulatory network inference is a systems biology approach which predicts interactions between genes with the help of high-throughput data. In this review, we present current and updated network inference methods focusing on novel techniques for data acquisition, network inference assessment, network inference for interacting species and the integration of prior knowledge. Following the advent of next-generation sequencing of cDNAs derived from RNA samples (RNA-Seq), we discuss in detail its application to network inference. Furthermore, we present progress for large-scale or even full-genomic network inference as well as for small-scale condensed network inference and review advances in the evaluation of network inference methods by crowdsourcing. Finally, we reflect on the current availability of data and prior knowledge sources and give an outlook for the inference of gene regulatory networks that reflect interacting species, in particular pathogen-host interactions.

    Identification of Prognostic Cancer Biomarkers through the Application of RNA-Seq Technologies and Bioinformatics

    MicroRNAs (miRNAs) are short single-stranded RNAs that function as the guide sequence of the post-transcriptional regulatory process known as the RNA-induced silencing complex (RISC), which targets mRNA sequences for degradation through complementary binding to the guide miRNA. Changes in miRNA expression have been reported as correlated with numerous biological processes, including embryonic development, cellular differentiation, and disease manifestation. In the latter case, dysregulation has been observed in response to infection by human papillomavirus (HPV), which has also been established as both oncogenic in cervical cancers and oropharyngeal cancers and favorable for overall patient survival after tumor formation. The identification of dysregulated miRNAs associated with both HPV infection and cancer survival requires large datasets of high-throughput sequencing data, which were obtained through The Cancer Genome Atlas. By analyzing this public data, we have identified a series of proposed mechanisms for cancer formation and survival that are mediated through the miRNA-RISC regulatory mechanism in response to HPV infection. We have also identified a diverse set of miRNA biomarkers that have been incorporated into linear expression-based risk signatures that are prognostic for overall patient survival after tumor diagnosis in HPV-related cancers. The tools that were used to identify both miRNA biomarkers and proposed targets in public datasets, such as The Cancer Genome Atlas, have since been incorporated into a web-accessible resource, OncomiR.org, to streamline the process of biomarker identification for the cancer research community.