
    Model Population Analysis for Statistical Model Comparison


    Gene Expression Analysis Methods on Microarray Data: A Review

    In recent years, a new type of experiment has been changing the way biologists and other specialists analyze many problems. These high-throughput experiments differ from those performed some years ago mainly in the quantity of data they produce. Thanks to the technology known generically as microarrays, it is now possible to study, in a single experiment, the behavior of all the genes of an organism under different conditions. The data generated by these experiments may comprise from thousands to millions of variables, and they pose many challenges to the scientists who have to analyze them. Many of these challenges are statistical in nature and are the focus of this review. Many types of microarrays have been developed to answer different biological questions, and some of them are explained later. For the sake of simplicity, we start with the best-known ones: expression microarrays.

    Feature Selection for DNA Microarrays Using a Cuckoo Search

    In this article, a hybrid method for the selection and classification of DNA microarray data is proposed. First, the method combines the subsets of relevant genes obtained from five filter methods; then, an algorithm based on a cuckoo search combined with an SVM classifier is applied. The hybrid algorithm explores the subset obtained in the previous stage and selects the genes that yield high performance when training the classifier. In the experimental results, the algorithm achieves a high classification rate while selecting a small number of genes, and the results are compared with other methods reported in the literature.
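    The selection loop described above can be pictured as a wrapper around an SVM: candidate gene subsets are proposed by a cuckoo-style search and scored by cross-validated classification accuracy. The sketch below is only an illustration of that idea, with a simplified random-walk step standing in for Lévy flights and with illustrative population sizes and rates; it is not the authors' implementation.

```python
# Minimal wrapper gene selection with a simplified cuckoo search and an SVM.
# Illustrative sketch only: parameters and the random-walk step are assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Cross-validated SVM accuracy on the selected genes; empty masks score 0."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(SVC(kernel="linear"), X[:, mask], y, cv=3).mean()

def cuckoo_select(X, y, n_nests=10, n_iter=30, p_abandon=0.25):
    n_genes = X.shape[1]
    nests = rng.random((n_nests, n_genes)) < 0.1           # binary masks: True = gene kept
    scores = np.array([fitness(m, X, y) for m in nests])
    for _ in range(n_iter):
        i = rng.integers(n_nests)
        new = nests[i].copy()
        flips = rng.random(n_genes) < 0.05                  # random-walk "flight": flip a few genes
        new[flips] = ~new[flips]
        s = fitness(new, X, y)
        j = rng.integers(n_nests)                           # compare against a randomly chosen nest
        if s > scores[j]:
            nests[j], scores[j] = new, s
        worst = np.argsort(scores)[: int(p_abandon * n_nests)]   # abandon the worst nests
        nests[worst] = rng.random((len(worst), n_genes)) < 0.1
        scores[worst] = [fitness(m, X, y) for m in nests[worst]]
    best = np.argmax(scores)
    return nests[best], scores[best]
```
    A typical use would pass the union of genes retained by the five filter methods as `X` and the class labels as `y`, so the search only explores the pre-filtered candidate set.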

    Exploring the Intersection of Multi-Omics and Machine Learning in Cancer Research

    Cancer biology and machine learning represent two seemingly disparate yet intrinsically linked fields of study. Cancer biology, with its complexities at the cellular and molecular levels, presents a myriad of challenges. Of particular concern are the deviations in cell behaviour and rearrangements of genetic material that fuel the transformation, growth, and spread of cancerous cells. Contemporary studies of cancer biology often utilise wide arrays of genomic data to pinpoint and exploit these abnormalities, with the end goal of translating them into functional therapies. Machine learning allows machines to make predictions based on learnt data without explicit programming. It leverages patterns and inferences from large datasets, making it an invaluable tool in the modern era of large-scale genomics. To this end, this doctoral thesis is underpinned by three themes: the application of machine learning, multi-omics, and cancer biology. It focuses on applying machine learning algorithms to cell annotation in single-cell RNA-seq datasets and to drug response prediction in pre-clinical cancer models. In the first study, the author and colleagues developed a pipeline named Ikarus to differentiate between neoplastic and healthy cells within single-cell datasets, a task crucial for understanding the cellular landscape of tumours. Ikarus is designed to construct cancer cell-specific gene signatures from expert-annotated scRNA-seq datasets, score these genes, and distribute the scores to neighbouring cells via network propagation. This method successfully circumvents two common challenges in single-cell annotation: batch effects and unstable clustering. Furthermore, Ikarus utilises a multi-omic approach by incorporating CNVs inferred from scRNA-seq to enhance classification accuracy. The second study investigated how multi-omic analysis could enhance drug response prediction in pre-clinical cancer models. The research suggests that the typical practice of panel sequencing, i.e. deep profiling of a select set of validated genomic features, is limited in its predictive power. However, incorporating transcriptomic features into the model significantly improves predictive ability across a variety of cancer models and is especially effective for drugs with collateral effects. This implies that the combined use of genomic and transcriptomic data has potential advantages in the pharmacogenomic arena. This dissertation recapitulates the findings of the two aforementioned studies, which were published in Genome Biology and Cancers, respectively. The two studies illustrate the application of machine learning techniques and multi-omic approaches to address conceptually distinct problems within the realm of cancer biology.
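    The scoring-and-propagation step described for Ikarus can be made concrete: each cell receives an initial signature score (here simply the mean expression of the signature genes), and scores are then smoothed over a cell-cell nearest-neighbour graph so that labels spread to transcriptionally similar cells. This is a generic network-propagation sketch under assumed inputs, not the published Ikarus code; `k`, `alpha`, and the scoring rule are illustrative choices.

```python
# Generic gene-signature scoring followed by network propagation over a kNN cell graph.
# Not the Ikarus implementation; inputs, scoring rule, and hyperparameters are assumptions.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def signature_scores(expr, gene_idx):
    """Initial per-cell score: mean expression of the signature genes (illustrative choice)."""
    return expr[:, gene_idx].mean(axis=1)

def propagate(expr, scores0, k=15, alpha=0.8, n_iter=50):
    A = kneighbors_graph(expr, n_neighbors=k, mode="connectivity")
    A = 0.5 * (A + A.T)                         # symmetrise the kNN graph
    deg = np.asarray(A.sum(axis=1)).ravel()
    A = A.multiply(1.0 / deg[:, None])          # row-normalise -> random-walk matrix
    s = scores0.copy()
    for _ in range(n_iter):
        s = alpha * (A @ s) + (1 - alpha) * scores0   # propagate while keeping a pull to the prior
    return s                                     # smoothed signature score per cell
```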

    Geometric Inference in Bayesian Hierarchical Models with Applications to Topic Modeling

    Unstructured data is available in abundance with the rapidly growing size of digital information. Labeling such data is expensive and impractical, making unsupervised learning an increasingly important field. Big data collections often have rich latent structure that the statistical modeler is challenged to uncover. Bayesian hierarchical modeling is a particularly suitable approach for complex latent patterns. The graphical-model formalism has been prominent in developing various procedures for inference in Bayesian models; however, the corresponding computational limits often fall behind the demands of modern data sizes. In this thesis we develop new approaches for scalable approximate Bayesian inference. In particular, our approaches are driven by the analysis of latent geometric structures induced by the models. Our specific contributions include the following. We develop a full geometric recipe of the Latent Dirichlet Allocation topic model. Next, we study several approaches for exploiting the latent geometry, first arriving at a fast weighted clustering procedure augmented with geometric corrections for topic inference, and then at a nonparametric approach based on the analysis of the concentration of mass and angular geometry of the topic simplex, a convex polytope constructed by taking the convex hull of vertices representing the latent topics. Estimates produced by our methods are shown to be statistically consistent under some conditions. Finally, we develop a series of models for temporal dynamics of the latent geometric structures where inference can be performed in an online and distributed fashion. All our algorithms are evaluated with extensive experiments on simulated and real datasets, culminating in a method several orders of magnitude faster than existing state-of-the-art topic modeling approaches, as demonstrated by experiments processing several million documents in a dozen minutes.
    PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/146051/1/moonfolk_1.pd
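    The geometric picture behind this thesis can be illustrated briefly: each document's empirical word distribution lies (approximately) in the convex hull of K topic vertices, so topic recovery amounts to locating the vertices of that simplex. The extreme-point heuristic below is only a toy illustration of this view; it is not one of the thesis's estimators.

```python
# Toy illustration of the topic-simplex view: documents as points on the word simplex,
# topics as vertices of their convex hull. The vertex-hunting heuristic is illustrative only.
import numpy as np

def doc_word_simplex(counts):
    """Rows of `counts` are per-document word counts; return points on the word simplex."""
    return counts / counts.sum(axis=1, keepdims=True)

def crude_vertex_hunt(P, K):
    """Pick K candidate topic vertices as successively extreme documents."""
    center = P.mean(axis=0)
    residual = P - center
    vertices = []
    for _ in range(K):
        i = np.argmax(np.linalg.norm(residual, axis=1))    # most extreme remaining document
        vertices.append(P[i])
        d = residual[i] / (np.linalg.norm(residual[i]) + 1e-12)
        residual = residual - np.outer(residual @ d, d)     # deflate along that direction
    return np.array(vertices)
```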

    Learning by Fusing Heterogeneous Data

    It has become increasingly common in science and technology to gather data about systems at different levels of granularity or from different perspectives. This often gives rise to data that are represented in entirely different input spaces. A basic premise behind the study of learning from heterogeneous data is that, in many such cases, there exists some correspondence among certain input dimensions of the different input spaces. In our work we found that a key bottleneck preventing us from better understanding and truly fusing heterogeneous data at large scales is identifying the kind of knowledge that can be transferred between related data views, entities, and tasks. We develop accurate data fusion methods for predictive modeling which reduce or entirely eliminate some of the basic feature engineering steps that were previously needed when inferring prediction models from disparate data. In addition, our work has a wide range of applications, of which we focus on those from molecular and systems biology: it can help us predict gene functions, forecast pharmacological actions of small chemicals, prioritize genes for further studies, mine disease associations, detect drug toxicity, and model cancer patient survival. Another important aspect of our research is the study of latent factor models. We aim to design latent models with factorized parameters that simultaneously tackle multiple types of data heterogeneity, where data diversity spans heterogeneous input spaces, multiple types of features, and a variety of related prediction tasks. Our algorithms are capable of retaining the relational structure of a data system during model inference, which turns out to be vital for good performance of data fusion in certain applications. Our recent work includes the study of network inference from many potentially nonidentical data distributions and its application to cancer genomic data. We also model epistasis, an important concept from genetics, and propose algorithms to efficiently find the ordering of genes in cellular pathways. A central topic of this thesis is also the analysis of large data compendia, since predictions about certain phenomena, such as associations between diseases or the involvement of genes in a certain phenotype, are only possible when large amounts of data are considered. Among others, we analyze 30 heterogeneous data sets to assess drug toxicity and over 40 human gene association data collections, the largest number of data sets considered by a collective latent factor model to date. We also make observations about deciding which data should be considered for fusion and develop a generic approach that can estimate the sensitivities between different data sets.
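    The shared-latent-factor idea behind such data fusion can be sketched briefly: two relation matrices describing the same set of entities (say, gene-disease and gene-drug associations) are factorized with a common factor for the shared entity type, so evidence flows between the two views. The alternating-least-squares toy below is a minimal stand-in for the collective latent factor models discussed above, not the tri-factorization used in the thesis; the matrix names and hyperparameters are assumptions.

```python
# Toy collective factorization: R1 (genes x diseases) and R2 (genes x drugs) share the
# latent factor U for the common entity type (genes). Illustrative sketch, not the thesis method.
import numpy as np

def fuse_two_relations(R1, R2, k=10, n_iter=100, lam=0.1):
    n = R1.shape[0]
    rng = np.random.default_rng(0)
    U = rng.standard_normal((n, k))
    V1 = rng.standard_normal((R1.shape[1], k))
    V2 = rng.standard_normal((R2.shape[1], k))
    reg = lam * np.eye(k)
    for _ in range(n_iter):
        # U sees both relations, which is where the fusion happens
        A = V1.T @ V1 + V2.T @ V2 + reg
        B = R1 @ V1 + R2 @ V2
        U = np.linalg.solve(A, B.T).T
        V1 = np.linalg.solve(U.T @ U + reg, (R1.T @ U).T).T
        V2 = np.linalg.solve(U.T @ U + reg, (R2.T @ U).T).T
    return U, V1, V2   # U @ V1.T and U @ V2.T reconstruct / complete the two relations
```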

    Omics in a Digital World: The Role of Bioinformatics in Providing New Insights Into Human Aging

    Background: Aging is a complex phenotype influenced by a combination of genetic and environmental factors. Although many studies have addressed its cellular and physiological age-related changes, the molecular causes of aging remain undetermined. Considering the biological complexity and heterogeneity of the aging process, it is now clear that a full understanding of the mechanisms underlying aging can only be achieved through the integration of different data types and sources, and with new computational methods capable of achieving such integration. Recent Advances: In this review, we show that an omics vision of the age-dependent changes occurring as the individual ages can provide researchers with new opportunities to understand the mechanisms of aging. Combining results from single-cell analysis with systems biology tools would allow researchers to build interaction networks and to investigate how these networks are perturbed during aging and disease. The development of high-throughput technologies such as next-generation sequencing, proteomics, and metabolomics, which can investigate different biological markers and monitor them simultaneously during the aging process with high accuracy and specificity, represents a unique opportunity for biogerontologists today. Critical Issues: Although the capacity to produce big data has increased drastically over the years, the integration, interpretation, and sharing of high-throughput data remain major challenges. In this paper we present a survey of the emerging omics approaches in aging research and provide a large collection of datasets and databases as a useful resource for the scientific community to identify causes of aging. We discuss their peculiarities, emphasizing the need for the development of methods focused on the integration of different data types. Future Directions: We critically review the contribution of bioinformatics to omics-based aging research and propose a few recommendations to boost collaborations and produce new insights. We believe that significant advances can be achieved by following major developments in bioinformatics, investing in diversity, data sharing, and community-driven portable bioinformatics methods. We also argue in favor of more engagement and participation, and we highlight the benefits of new collaborations along these lines. This review aims to be a useful resource for many researchers in the field and a call for new partnerships in aging research.

    Interpretable Machine Learning Methods for Prediction and Analysis of Genome Regulation in 3D

    With the development of chromosome conformation capture-based techniques, we now know that chromatin is packed in three-dimensional (3D) space inside the cell nucleus. Changes in the 3D chromatin architecture have already been implicated in diseases such as cancer. Thus, a better understanding of this 3D conformation is of interest to help enhance our comprehension of the complex, multipronged regulatory mechanisms of the genome. The work described in this dissertation largely focuses on the development and application of interpretable machine learning methods for the prediction and analysis of long-range genomic interactions output from chromatin interaction experiments. In the first part, we demonstrate that the genetic sequence information at genomic loci is predictive of the long-range interactions of a particular locus of interest (LoI). For example, the genetic sequence information at and around enhancers can help predict whether they interact with a promoter region of interest. This is achieved by building string kernel-based support vector classifiers together with two novel, intuitive visualization methods. These models suggest a potential general role of short tandem repeat motifs in 3D genome organization. But the insights gained from these models are still coarse-grained. To this end, we devised a machine learning method, called CoMIK (Conformal Multi-Instance Kernels), capable of providing more fine-grained insights. When comparing sequences of variable length in the supervised learning setting, CoMIK can not only identify the features important for classification but also locate them within the sequence. Such precise identification of important segments of the whole sequence can help in gaining de novo insights into any role played by the intervening chromatin in long-range interactions. Although CoMIK primarily uses only genetic sequence information, it can also simultaneously utilize other information modalities, such as functional genomics data, if available. The second part describes our pipeline, pHDee, for easy manipulation of large amounts of 3D genomics data. We used the pipeline to analyze HiChIP experimental data for studying the 3D architectural changes in Ewing sarcoma (EWS), a rare cancer affecting adolescents. In particular, HiChIP data for two experimental conditions, doxycycline-treated and untreated, and for primary tumor samples are analyzed. We demonstrate that pHDee facilitates the processing and easy integration of large-scale 3D genomics data analysis with other data-intensive bioinformatics analyses.
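    The sequence-based prediction in the first part can be illustrated with a minimal spectrum-kernel stand-in: each candidate locus is represented by its k-mer counts, and a linear SVM on those counts (equivalent to a spectrum string kernel) is trained to predict interaction with the locus of interest. The thesis uses more elaborate string kernels and visualization methods; the inputs and the choice of k below are assumptions.

```python
# Spectrum-kernel stand-in for sequence-based interaction prediction. Illustrative only.
from itertools import product
import numpy as np
from sklearn.svm import SVC

K = 4
KMERS = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

def spectrum(seq):
    """k-mer count vector of one DNA sequence (the spectrum-kernel feature map)."""
    v = np.zeros(len(KMERS))
    for i in range(len(seq) - K + 1):
        idx = KMERS.get(seq[i:i + K].upper())
        if idx is not None:            # skip k-mers containing N or other symbols
            v[idx] += 1
    return v

def train_interaction_classifier(sequences, labels):
    """labels: 1 if the locus interacts with the locus of interest, else 0."""
    X = np.vstack([spectrum(s) for s in sequences])
    clf = SVC(kernel="linear")         # linear kernel on spectra == spectrum string kernel
    clf.fit(X, labels)
    return clf
```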
