
    Global proteomics profiling improves drug sensitivity prediction: results from a multi-omics, pan-cancer modeling approach

    Motivation: Proteomics profiling is increasingly being used for molecular stratification of cancer patients and cell-line panels. However, systematic assessment of the predictive power of large-scale proteomic technologies across various drug classes and cancer types is currently lacking. To that end, we carried out the first pan-cancer, multi-omics comparative analysis of the relative performance of two proteomic technologies, targeted reverse phase protein array (RPPA) and global mass spectrometry (MS), in terms of their accuracy for predicting the sensitivity of cancer cells to both cytotoxic chemotherapeutics and molecularly targeted anticancer compounds. Results: Our results in two cell-line panels demonstrate how MS profiling improves drug response predictions beyond that of the RPPA or the other omics profiles when used alone. However, frequent missing MS data values complicate its use in predictive modeling and require additional filtering, such as focusing on completely measured or known oncoproteins, to obtain maximal predictive performance. Rather strikingly, the two proteomics profiles provided complementary predictive signal both for the cytotoxic and targeted compounds. Further, information about the cellular abundance of primary target proteins was found critical for predicting the response of targeted compounds, although the non-target features also contributed significantly to the predictive power. The clinical relevance of the selected protein markers was confirmed in cancer patient data. These results provide novel insights into the relative performance and optimal use of the widely applied proteomic technologies, MS and RPPA, which should prove useful in translational applications, such as defining the best combination of omics technologies and marker panels for understanding and predicting drug sensitivities in cancer patients.
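
The filtering step mentioned in this abstract (restricting the MS data to completely measured proteins) can be illustrated with a minimal Python sketch. The function name `completely_measured`, the toy abundance matrix, and the protein identifiers are assumptions made for illustration; they are not taken from the study, whose actual preprocessing may differ.

```python
import math


def completely_measured(matrix, feature_names):
    """Keep only features (columns) with no missing value in any sample."""
    keep = [
        j for j in range(len(feature_names))
        if all(not math.isnan(row[j]) for row in matrix)
    ]
    filtered = [[row[j] for j in keep] for row in matrix]
    return filtered, [feature_names[j] for j in keep]


# Toy data: 3 cell lines x 3 proteins; NaN marks a missing MS measurement.
nan = float("nan")
abundance = [
    [1.2, nan, 0.7],
    [0.9, 2.1, 0.4],
    [1.5, 1.8, nan],
]
proteins = ["P1", "P2", "P3"]  # hypothetical protein identifiers

filtered, kept = completely_measured(abundance, proteins)
print(kept)      # ['P1']
print(filtered)  # [[1.2], [0.9], [1.5]]
```

Dropping every incompletely measured feature is the strictest possible filter; it trades coverage for a matrix that any standard learner can consume without imputation.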

    Development and evaluation of machine learning algorithms for biomedical applications

    Gene network inference and drug response prediction are two important problems in computational biomedicine. The former helps scientists better understand the functional elements and regulatory circuits of cells. The latter helps a physician gain full understanding of the effective treatment on patients. Both problems have been widely studied, though current solutions are far from perfect. More research is needed to improve the accuracy of existing approaches. This dissertation develops machine learning and data mining algorithms, and applies these algorithms to solve the two important biomedical problems. Specifically, to tackle the gene network inference problem, the dissertation proposes (i) new techniques for selecting topological features suitable for link prediction in gene networks; (ii) a graph sparsification method for network sampling; (iii) combined supervised and unsupervised methods to infer gene networks; and (iv) sampling and boosting techniques for reverse engineering gene networks. For the drug sensitivity prediction problem, the dissertation presents (i) an instance selection technique and hybrid method for drug sensitivity prediction; (ii) a link prediction approach to drug sensitivity prediction; (iii) a noise-filtering method for drug sensitivity prediction; and (iv) transfer learning approaches for enhancing the performance of drug sensitivity prediction. Substantial experiments are conducted to evaluate the effectiveness and efficiency of the proposed algorithms. Experimental results demonstrate the feasibility of the algorithms and their superiority over the existing approaches.
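
As a rough illustration of what "topological features for link prediction" can mean, the Python sketch below computes three widely used scores for a candidate gene pair: common neighbours, the Jaccard coefficient, and preferential attachment. The abstract does not say which features the dissertation selects, so the choice of scores, the function name `topological_features`, and the toy network are all assumptions.

```python
def topological_features(adjacency, u, v):
    """Score a candidate edge (u, v) from the neighbourhoods of u and v."""
    nu, nv = adjacency[u], adjacency[v]
    common = nu & nv
    union = nu | nv
    return {
        "common_neighbours": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "preferential_attachment": len(nu) * len(nv),
    }


# Toy undirected gene network stored as neighbour sets (hypothetical genes).
network = {
    "geneA": {"geneB", "geneC"},
    "geneB": {"geneA", "geneC", "geneD"},
    "geneC": {"geneA", "geneB"},
    "geneD": {"geneB"},
}

features = topological_features(network, "geneA", "geneD")
print(features)
# {'common_neighbours': 1, 'jaccard': 0.5, 'preferential_attachment': 2}
```

Feature vectors of this kind can then be passed to any supervised classifier that labels a gene pair as linked or not linked.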

    Deep Learning for Genomics: A Concise Overview

    Advancements in genomic research such as high-throughput sequencing techniques have driven modern genomic studies into "big data" disciplines. This data explosion is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in a variety of fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning since we are expecting from deep learning a superhuman intelligence that explores beyond our knowledge to interpret the genome. A powerful deep learning model should rely on insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with a proper deep architecture, and remark on practical considerations of developing modern deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research, as well as pointing out potential opportunities and obstacles for future genomics applications. Comment: Invited chapter for Springer Book: Handbook of Deep Learning Applications.

    A Bioinformatics Approach

    By regulating the timing of cellular processes, the circadian clock provides a way to adapt physiology and behaviour to the geophysical time. In mammals, a light-entrainable master clock located in the suprachiasmatic nucleus (SCN) controls peripheral clocks that are present in virtually every body cell. Defective circadian timing is associated with several pathologies such as cancer and metabolic and sleep disorders. To better understand the circadian regulation of cellular processes, we developed a bioinformatics pipeline encompassing the analysis of high-throughput data sets and the exploitation of published knowledge by text-mining. We identified 118 novel potential clock-regulated genes and integrated them into an existing high-quality circadian network, generating the most comprehensive network of circadian regulated genes (NCRG) to date. To validate particular elements in our network, we assessed publicly available ChIP-seq data for BMAL1, REV-ERBα/β and RORα/γ proteins and found strong evidence for circadian regulation of Elavl1, Nme1, Dhx6, Med1 and Rbbp7, all of which are involved in the regulation of tumourigenesis. Furthermore, we identified Ncl and Ddx6 as targets of RORγ and REV-ERBα, β, respectively. Most interestingly, these genes were also reported to be involved in miRNA regulation; in particular, NCL regulates several miRNAs, all involved in cancer aggressiveness. Thus, NCL represents a novel potential link via which the circadian clock, and specifically RORγ, regulates the expression of miRNAs, with particular consequences in breast cancer progression. Our findings bring us one step forward towards a mechanistic understanding of mammalian circadian regulation, and provide further evidence of the influence of circadian deregulation in cancer.

    Unsupervised multiple kernel learning approaches for integrating molecular cancer patient data

    Cancer is the second leading cause of death worldwide. A characteristic of this disease is its complexity leading to a wide variety of genetic and molecular aberrations in the tumors. This heterogeneity necessitates personalized therapies for the patients. However, currently defined cancer subtypes used in clinical practice for treatment decision-making are based on relatively few selected markers and thus provide only a coarse classification of tumors. The increased availability of multi-omics data measured for cancer patients now offers the possibility of defining more informed cancer subtypes. Such a more fine-grained characterization of cancer subtypes harbors the potential of substantially expanding treatment options in personalized cancer therapy. In this thesis, we identify comprehensive cancer subtypes using multidimensional data. For this purpose, we apply and extend unsupervised multiple kernel learning methods. Three challenges of unsupervised multiple kernel learning are addressed: robustness, applicability, and interpretability. First, we show that regularization of the multiple kernel graph embedding framework, which enables the implementation of dimensionality reduction techniques, can increase the stability of the resulting patient subgroups. This improvement is especially beneficial for data sets with a small number of samples. Second, we adapt the objective function of kernel principal component analysis to enable the application of multiple kernel learning in combination with this widely used dimensionality reduction technique. Third, we improve the interpretability of kernel learning procedures by performing feature clustering prior to integrating the data via multiple kernel learning. On the basis of these clusters, we derive a score indicating the impact of a feature cluster on a patient cluster, thereby facilitating further analysis of the cluster-specific biological properties. 
All three procedures are successfully tested on real-world cancer data. Comparing our newly derived methodologies to established methods provides evidence that our work offers novel and beneficial ways of identifying patient subgroups and gaining insights into medically relevant characteristics of cancer subtypes.
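
The core idea behind multiple kernel learning for data integration, one kernel matrix per data type merged into a single weighted sum, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weights are fixed by hand here, whereas the thesis learns them as part of its adapted objective, and the function names and toy matrices are hypothetical. The centering step is the standard preprocessing for kernel principal component analysis.

```python
def linear_kernel(X):
    """Gram matrix of pairwise dot products between samples (rows of X)."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices defined on the same samples."""
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for K, w in zip(kernels, weights)) for j in range(n)]
        for i in range(n)
    ]


def center_kernel(K):
    """Center a symmetric kernel matrix in feature space (as in kernel PCA)."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    total_mean = sum(row_mean) / n
    return [
        [K[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


# Two toy data types measured on the same three patients (assumed values).
omics_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
omics_b = [[2.0], [0.0], [1.0]]

kernels = [linear_kernel(omics_a), linear_kernel(omics_b)]
combined = combine_kernels(kernels, weights=[0.5, 0.5])  # hand-picked weights
centered = center_kernel(combined)
print(combined[0])  # [2.5, 0.0, 1.5]
```

The leading eigenvectors of `centered` would give the low-dimensional patient embedding on which subgroups are then clustered; the eigendecomposition is omitted to keep the sketch dependency-free.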

    Pathway-Based Multi-Omics Data Integration for Breast Cancer Diagnosis and Prognosis.

    Ph.D. Thesis. University of Hawaiʻi at Mānoa 2017

    Development of New Bioinformatic Approaches for Human Genetic Studies

    The development of bioinformatics methods for human genetic studies utilizes the vast amount of data to generate new valuable information. Machine learning and statistical coupling analysis can be used in the study of human diseases. These diseases include intellectual disabilities (ID), prevalent in 1-3% of the population and caused primarily by genetics. Although many cases of ID are caused by mutations in protein-coding genes, the possible involvement of long non-coding RNAs (lncRNAs) in ID, owing to their role in gene expression regulation, has been explored. In this study, we used machine learning to develop a new expression-based model trained using ID genes encoded with the developing brain transcriptome. The model was fine-tuned using the class-balancing approach of synthetic over-sampling of the minority class, resulting in improved performance. We used the model to predict candidate ID-associated lncRNAs. Our model identified several candidates that overlapped with previously reported ID-associated lncRNAs, were enriched with neurodevelopmental functions, and were highly expressed in brain tissues. Machine learning was also used to predict protein stability changes caused by missense mutations, which can lead to disease conditions including ID. We tested Random Forests, Support Vector Machines (SVM) and Naïve Bayes to find the best-performing algorithm to develop a multi-class classifier. We developed an SVM model using relevant physico-chemical features after feature selection. Our work identified new features for predicting the effect of amino acid substitutions on protein stability and a well-performing multi-class classifier solely based on sequence information. Statistical approaches were used to analyze the association between mutations and phenotypes. In this study, we used statistical coupling analysis (SCA) to cluster disease-causing mutations and ID phenotypes. 
Using SCA, we identified groups of co-evolving residues, known as protein sectors, in ID protein families. Within each distinct sector, we identified mutations linked to different phenotypic manifestations of a syndromic ID. Our results suggest that protein sector analysis can be used to associate mutations with phenotypic manifestations in human diseases. The bioinformatic methods developed in this dissertation can be used in human genetic research to understand the role of new genes and proteins in human disease.
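
The class-balancing step described in this abstract, synthetic over-sampling of the minority class, can be sketched as follows. This is a simplified SMOTE-style interpolation written for illustration; the abstract does not specify the exact algorithm or library used, and the function name `smote_like`, the parameter `k`, and the toy data are assumptions.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class: three samples with two expression features each.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like(minority, n_new=4)
print(len(new_points))  # 4
```

Because each synthetic point lies on a segment between two real minority samples, the new points stay inside the region already occupied by the minority class instead of duplicating existing samples.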

    Using Machine Learning On Diverse Datasets To Predict Drug-Induced Liver Injury

    A major challenge in drug development is safety and toxicity concerns due to drug side effects. One such side effect, drug-induced liver injury (DILI), is considered a primary factor in regulatory clearance. The Critical Assessment of Massive Data Analysis (CAMDA) 2020 CMap Drug Safety Challenge was established with the ultimate goal of developing DILI prediction models based on gene perturbation of six preselected cell lines (CMap L1000), extended structural information (MOLD2), toxicity data (TOX21), and FDA reporting of adverse events (FAERS). Four types of DILI classes were targeted, including two clinically relevant scores and two control classifications, designed by the CAMDA organizers. The L1000 gene expression data had variable drug coverage across cell lines, with only 247 out of 617 drugs in the study measured in all six cell types. We addressed this coverage issue by using Kru-Bor ranked merging to generate a singular drug expression signature across all six cell lines. These merged signatures were then narrowed down to the top and bottom 100, 250, 500, or 1,000 genes most perturbed by drug treatment. These signatures were subject to feature selection using Fisher’s exact test to identify genes predictive of DILI status. Models based solely on expression signatures had varying results for clinical DILI subtypes, with accuracy ranging from 0.49 to 0.67 and Matthews Correlation Coefficient (MCC) values ranging from -0.03 to 0.1. Models built using FAERS, MOLD2 and TOX21 had similar results in predicting clinical DILI scores, with accuracy ranging from 0.56 to 0.67 and MCC scores ranging from 0.12 to 0.36. To incorporate these various data types with expression-based models, we utilized soft, hard, and weighted ensemble voting methods using the top three performing models for each DILI classification. These voting models achieved a balanced accuracy of up to 0.54 and 0.60 for the clinically relevant DILI subtypes. 
Overall, our experiments suggest that traditional machine learning approaches may not be optimal as a classification method for the current data.
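
The abstract reports accuracy, balanced accuracy, and MCC side by side. The short Python sketch below shows why the three can disagree on imbalanced data. The confusion-matrix counts are a made-up toy case, not results from the study, and the function name `classification_metrics` is an assumption.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, balanced accuracy, MCC) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    balanced_accuracy = (sensitivity + specificity) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, balanced_accuracy, mcc


# Toy case: 33 positives, 67 negatives, and a model that always says "negative".
acc, bal_acc, mcc = classification_metrics(tp=0, fp=0, tn=67, fn=33)
print(acc, bal_acc, mcc)  # 0.67 0.5 0.0
```

In this toy case the accuracy of 0.67 only reflects the class ratio, while the balanced accuracy of 0.5 and the MCC of 0 show that the predictions carry no information about the positive class.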

    Evaluation of blood-based microRNAs toward clinical use as biomarkers in common and rare diseases

    According to the GLOBOCAN project of the International Agency for Research on Cancer, the top three common cancer diseases worldwide in the year 2020 were breast, lung and colorectal cancer. These are usually diagnosed via imaging methods (e.g. computer tomography) or invasive methods (e.g. biopsy). However, these techniques are potentially risky and expensive and thus not accessible to all patients, resulting in most cancers being detected in an advanced stage. Since the discovery of small non-coding RNAs and specifically microRNAs and their role as gene regulators, many researchers have investigated their association with disease development. In particular, researchers examine body fluid based microRNAs which could present potential cost-effective and minimally- or non-invasive alternatives to the previously described established diagnosis methods. This dissertation focuses on microRNAs and investigates their suitability as minimally-invasive blood-borne biomarkers for potential diagnostic purposes. More specifically, the goals of this work are (1) to implement a new method to predict novel microRNAs, (2) to understand the stability and characteristics of these small non-coding RNAs, possibly relevant for the last goal, and (3) to discover potential diagnostic biomarkers in common and rare diseases. The first goal was addressed by developing miRMaster, a web service to predict new microRNAs. The tool uses machine learning and high-throughput sequencing data to find microRNA candidates that follow the known biogenesis pathways. The second goal was pursued in four publications. First, we performed a large scale evaluation of miRMaster by generating a high-resolution map of the human small non-coding RNA transcriptome for which we analyzed and validated potential microRNA candidates. Next, we examined the influence of seasonal effects on microRNA expression profiles and observed the largest difference between spring and the other seasons. 
Additionally, we evaluated the evolutionary conservation of small non-coding RNAs in zoo animals and showed that the distribution of sncRNA classes varies across species, while common microRNA families are present in more diverse organisms than assumed so far. Furthermore, we analyzed if microRNAs are technically stable, and whether biological variation is preserved when using capillary dried blood spots as an alternative sample collection device to venous blood specimens. Finally, we investigated the suitability of microRNAs as biomarkers for two diseases: lung cancer and Marfan disease. We identified blood-borne biomarker candidates for lung cancer detection in a large-scale multi-center study via machine learning. For the rare Marfan disease we analyzed the paired messenger RNA and microRNA expression levels in whole-blood samples. This highlighted several significantly deregulated microRNAs and messenger RNAs, which we subsequently validated in an independent cohort. In summary, this thesis provides valuable results toward potential clinical use of microRNAs, and the herein described projects represent comprehensive analyses of them from different perspectives: starting with microRNA discovery, addressing various technical and biological questions and ending with the potential use as biomarkers.

    Machine learning and feature selection for drug response prediction in precision oncology applications

    In-depth modeling of the complex interplay among multiple omics data measured from cancer cell lines or patient tumors is providing new opportunities toward identification of tailored therapies for individual cancer patients. Supervised machine learning algorithms are increasingly being applied to the omics profiles as they enable integrative analyses among the high-dimensional data sets, as well as personalized predictions of therapy responses using multi-omics panels of response-predictive biomarkers identified through feature selection and cross-validation. However, technical variability and frequent missingness in input "big data" require the application of dedicated data preprocessing pipelines that often lead to some loss of information and a compressed view of the biological signal. We describe here the state-of-the-art machine learning methods for anti-cancer drug response modeling and prediction and give our perspective on further opportunities to make better use of high-dimensional multi-omics profiles along with knowledge about cancer pathways targeted by anti-cancer compounds when predicting their phenotypic responses.