695 research outputs found

    Machine learning and computational methods to identify molecular and clinical markers for complex diseases – case studies in cancer and obesity

    Get PDF
    In biomedical research, applied machine learning and bioinformatics are the essential disciplines heavily involved in translating data-driven findings into medical practice. This task is especially accomplished by developing computational tools and algorithms assisting in detection and clarification of underlying causes of the diseases. The continuous advancements in high-throughput technologies coupled with the recently promoted data sharing policies have contributed to presence of a massive wealth of data with remarkable potential to improve human health care. In concordance with this massive boost in data production, innovative data analysis tools and methods are required to meet the growing demand. The data analyzed by bioinformaticians and computational biology experts can be broadly divided into molecular and conventional clinical data categories. The aim of this thesis was to develop novel statistical and machine learning tools and to incorporate the existing state-of-the-art methods to analyze bio-clinical data with medical applications. The findings of the studies demonstrate the impact of computational approaches in clinical decision making by improving patients risk stratification and prediction of disease outcomes. This thesis is comprised of five studies explaining method development for 1) genomic data, 2) conventional clinical data and 3) integration of genomic and clinical data. With genomic data, the main focus is detection of differentially expressed genes as the most common task in transcriptome profiling projects. In addition to reviewing available differential expression tools, a data-adaptive statistical method called Reproducibility Optimized Test Statistic (ROTS) is proposed for detecting differential expression in RNA-sequencing studies. In order to prove the efficacy of ROTS in real biomedical applications, the method is used to identify prognostic markers in clear cell renal cell carcinoma (ccRCC). In addition to previously known markers, novel genes with potential prognostic and therapeutic role in ccRCC are detected. For conventional clinical data, ensemble based predictive models are developed to provide clinical decision support in treatment of patients with metastatic castration resistant prostate cancer (mCRPC). The proposed predictive models cover treatment and survival stratification tasks for both trial-based and realworld patient cohorts. Finally, genomic and conventional clinical data are integrated to demonstrate the importance of inclusion of genomic data in predictive ability of clinical models. Again, utilizing ensemble-based learners, a novel model is proposed to predict adulthood obesity using both genetic and social-environmental factors. Overall, the ultimate objective of this work is to demonstrate the importance of clinical bioinformatics and machine learning for bio-clinical marker discovery in complex disease with high heterogeneity. In case of cancer, the interpretability of clinical models strongly depends on predictive markers with high reproducibility supported by validation data. The discovery of these markers would increase chance of early detection and improve prognosis assessment and treatment choice

    Cross-validation and Peeling Strategies for Survival Bump Hunting using Recursive Peeling Methods

    Full text link
    We introduce a framework to build a survival/risk bump hunting model with a censored time-to-event response. Our Survival Bump Hunting (SBH) method is based on a recursive peeling procedure that uses a specific survival peeling criterion derived from non/semi-parametric statistics such as the hazards-ratio, the log-rank test or the Nelson-Aalen estimator. To optimize the tuning parameter of the model and validate it, we introduce an objective function based on survival or prediction-error statistics, such as the log-rank test and the concordance error rate. We also describe two alternative cross-validation techniques adapted to the joint task of decision-rule making by recursive peeling and survival estimation. Numerical analyses show the importance of replicated cross-validation and the differences between criteria and techniques in both low and high-dimensional settings. Although several non-parametric survival models exist, none addresses the problem of directly identifying local extrema. We show how SBH efficiently estimates extreme survival/risk subgroups unlike other models. This provides an insight into the behavior of commonly used models and suggests alternatives to be adopted in practice. Finally, our SBH framework was applied to a clinical dataset. In it, we identified subsets of patients characterized by clinical and demographic covariates with a distinct extreme survival outcome, for which tailored medical interventions could be made. An R package `PRIMsrc` is available on CRAN and GitHub.Comment: Keywords: Exploratory Survival/Risk Analysis, Survival/Risk Estimation & Prediction, Non-Parametric Method, Cross-Validation, Bump Hunting, Rule-Induction Metho

    Toward reliable biomarker signatures in the age of liquid biopsies - how to standardize the small RNA-Seq workflow

    Get PDF
    Small RNA-Seq has emerged as a powerful tool in transcriptomics, gene expression profiling and biomarker discovery. Sequencing cell-free nucleic acids, particularly microRNA (miRNA), from liquid biopsies additionally provides exciting possibilities for molecular diagnostics, and might help establish disease-specific biomarker signatures. The complexity of the small RNA-Seq workflow, however, bears challenges and biases that researchers need to be aware of in order to generate high-quality data. Rigorous standardization and extensive validation are required to guarantee reliability, reproducibility and comparability of research findings. Hypotheses based on flawed experimental conditions can be inconsistent and even misleading. Comparable to the well-established MIQE guidelines for qPCR experiments, this work aims at establishing guidelines for experimental design and pre-analytical sample processing, standardization of library preparation and sequencing reactions, as well as facilitating data analysis. We highlight bottlenecks in small RNA-Seq experiments, point out the importance of stringent quality control and validation, and provide a primer for differential expression analysis and biomarker discovery. Following our recommendations will en-courage better sequencing practice, increase experimental transparency and lead to more reproducible small RNA-Seq results. This will ultimately enhance the validity of biomarker signatures, and allow reliable and robust clinical predictions

    Multi-objective clustering of gene expression data with evolutionary algorithms: a query gene approach

    Get PDF

    Wrapper methods for multi-objective feature selection

    Get PDF
    The ongoing data boom has democratized the use of data for improved decision-making. Beyond gathering voluminous data, preprocessing the data is crucial to ensure that their most rele- vant aspects are considered during the analysis. Feature Selection (FS) is one integral step in data preprocessing for reducing data dimensionality and preserving the most relevant features of the data. FS can be done by inspecting inherent associations among the features in the data (filter methods) or using the model per- formance of a concrete learning algorithm (wrapper methods). In this work, we extensively evaluate a set of FS methods on 32 datasets and measure their effect on model performance, stability, scalability and memory usage. The results re-establish the superiority of wrapper methods over filter methods in model performance. We further investigate the unique role of wrapper methods in multi-objective FS with a focus on two traditional metrics - accuracy and Area Under the ROC Curve (AUC). On model performance, our experiments showed that optimizing for both metrics simultaneously, rather than using a single metric, led to improvements in the accuracy and AUC trade-off1 up to 5% and 10%, respectively.The project leading to this publication has received funding from the European Commission under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 955895). Besim Bilalli is partly supported by the Spanish Ministerio de Ciencia e Innovación, as well as the European Union-Next Generation EU, under the project FJC 2021-046606-I/AEI/10.13039/501100011033. Gianluca Bontempi was supported by Service Public de Wallonie Recherche undergrant n°2010235–ARIAC by DIGITALWALLONIA4.AI.Peer ReviewedPostprint (published version

    Matrix Reordering Methods for Table and Network Visualization

    Get PDF
    International audienceThis survey provides a description of algorithms to reorder visual matrices of tabular data and adjacency matrix of networks. The goal of this survey is to provide a comprehensive list of reordering algorithms published in different fields such as statistics, bioinformatics, or graph theory. While several of these algorithms are described in publications and others are available in software libraries and programs, there is little awareness of what is done across all fields. Our survey aims at describing these reordering algorithms in a unified manner to enable a wide audience to understand their differences and subtleties. We organize this corpus in a consistent manner, independently of the application or research field. We also provide practical guidance on how to select appropriate algorithms depending on the structure and size of the matrix to reorder, and point to implementations when available

    Clustering of longitudinal viral loads in the Western Cape

    Get PDF
    Introduction: Routine viral load (VL) monitoring is important for assessing the effectiveness of ART in South Africa. There is little information however, on how the longitudinal VL patterns change for subgroups of persons living with HIV (PLHIV) who have experienced at least one elevated VL. We investigated the possible longitudinal VL patterns that may exist among this unique population. Methods: This mini-dissertation offers three components; a research protocol (Section A), a literature review (Section B) and a journal ready manuscript (Section C). We examined HIV VL data for the Western Cape from 2008 to 2018, taken from the National Health Laboratory Services (NHLS). Using< 1000 copies/mL as a threshold for viral suppression, we identified 109092 individuals who had at least one instance of an elevated VL. A nonparametric (KML-Shape) and a model-based (LCMM) clustering technique were used to identify latent subgroups of longitudinal VL trajectories among these individuals. Results: Both the KML-Shape and LCMM clustering techniques identified five latent viral load trajectory subgroups. KML-Shape found majority of individuals' trajectories belonged to clusters that had a decreasing longitudinal VL trend (76.6% of individuals), while LCMM found a smaller proportion of individuals' trajectories belonged to clusters that had a decreasing longitudinal trend (52.5% of individuals). Most of the trajectory subgroups identified had long periods of low-level viremia. Conclusion: Although majority of individuals belonged to clusters that had downward trends, further research is needed to better understand factors contributing to membership of clusters that did not have a downward longitudinal trend. Understanding these factors may help in the development of targeted HIV prevention programs for these individuals

    Multipartite Graph Algorithms for the Analysis of Heterogeneous Data

    Get PDF
    The explosive growth in the rate of data generation in recent years threatens to outpace the growth in computer power, motivating the need for new, scalable algorithms and big data analytic techniques. No field may be more emblematic of this data deluge than the life sciences, where technologies such as high-throughput mRNA arrays and next generation genome sequencing are routinely used to generate datasets of extreme scale. Data from experiments in genomics, transcriptomics, metabolomics and proteomics are continuously being added to existing repositories. A goal of exploratory analysis of such omics data is to illuminate the functions and relationships of biomolecules within an organism. This dissertation describes the design, implementation and application of graph algorithms, with the goal of seeking dense structure in data derived from omics experiments in order to detect latent associations between often heterogeneous entities, such as genes, diseases and phenotypes. Exact combinatorial solutions are developed and implemented, rather than relying on approximations or heuristics, even when problems are exceedingly large and/or difficult. Datasets on which the algorithms are applied include time series transcriptomic data from an experiment on the developing mouse cerebellum, gene expression data measuring acute ethanol response in the prefrontal cortex, and the analysis of a predicted protein-protein interaction network. A bipartite graph model is used to integrate heterogeneous data types, such as genes with phenotypes and microbes with mouse strains. The techniques are then extended to a multipartite algorithm to enumerate dense substructure in multipartite graphs, constructed using data from three or more heterogeneous sources, with applications to functional genomics. Several new theoretical results are given regarding multipartite graphs and the multipartite enumeration algorithm. In all cases, practical implementations are demonstrated to expand the frontier of computational feasibility

    3rd EGEE User Forum

    Get PDF
    We have organized this book in a sequence of chapters, each chapter associated with an application or technical theme introduced by an overview of the contents, and a summary of the main conclusions coming from the Forum for the chapter topic. The first chapter gathers all the plenary session keynote addresses, and following this there is a sequence of chapters covering the application flavoured sessions. These are followed by chapters with the flavour of Computer Science and Grid Technology. The final chapter covers the important number of practical demonstrations and posters exhibited at the Forum. Much of the work presented has a direct link to specific areas of Science, and so we have created a Science Index, presented below. In addition, at the end of this book, we provide a complete list of the institutes and countries involved in the User Forum

    Mathematics for modern biology

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Mathematics, 2009.This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.Cataloged from student-submitted PDF version of thesis.Includes bibliographical references (p. 115-124).In recent years there has been a great deal of new activity at the interface of biology and computation. This has largely been driven by the massive in flux of data from new experimental technologies, particularly high-throughput sequencing and array-based data. These new data sources require both computational power and new mathematics to properly piece them apart. This thesis discusses two problems in this field, network reconstruction and multiple network alignment, and draws the beginnings of a connection between information theory and population genetics. The first section addresses cellular signaling network inference. A central challenge in systems biology is the reconstruction of biological networks from high-throughput data sets, We introduce a new method based on parameterized modeling to infer signaling networks from perturbation data. We use this on Microarray data from RNAi knockout experiments to reconstruct the Rho signaling network in Drosophila. The second section addresses information theory and population genetics. While much has been proven about population genetics, a connection with information theory has never been drawn. We show that genetic drift is naturally measured in terms of the entropy of the allele distribution. We further sketch a structural connection between the two fields. The final section addresses multiple network alignment. With the increasing availability of large protein-protein interaction networks, the question of protein network alignment is becoming central to systems biology.(cont.) We introduce a new algorithm, IsoRankN to compute a global alignment of multiple protein networks. We test this on the five known eukaryotic protein-protein interaction (PPI) networks and show that it outperforms existing techniques.by Michael Hartmann Baym.Ph.D
    • …
    corecore