3,489 research outputs found

    Exploring Patterns of Epigenetic Information With Data Mining Techniques

    Get PDF
    [Abstract] Data mining, a part of the Knowledge Discovery in Databases process (KDD), is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. Analyses of epigenetic data have evolved towards genome-wide and high-throughput approaches, thus generating great amounts of data for which data mining is essential. Part of these data may contain patterns of epigenetic information which are mitotically and/or meiotically heritable determining gene expression and cellular differentiation, as well as cellular fate. Epigenetic lesions and genetic mutations are acquired by individuals during their life and accumulate with ageing. Both defects, either together or individually, can result in losing control over cell growth and, thus, causing cancer development. Data mining techniques could be then used to extract the previous patterns. This work reviews some of the most important applications of data mining to epigenetics.Programa Iberoamericano de Ciencia y Tecnología para el Desarrollo; 209RT-0366Galicia. Consellería de Economía e Industria; 10SIN105004PRInstituto de Salud Carlos III; RD07/0067/000

    Machine Learning Models for Deciphering Regulatory Mechanisms and Morphological Variations in Cancer

    Get PDF
    The exponential growth of multi-omics biological datasets is resulting in an emerging paradigm shift in fundamental biological research. In recent years, imaging and transcriptomics datasets are increasingly incorporated into biological studies, pushing biology further into the domain of data-intensive-sciences. New approaches and tools from statistics, computer science, and data engineering are profoundly influencing biological research. Harnessing this ever-growing deluge of multi-omics biological data requires the development of novel and creative computational approaches. In parallel, fundamental research in data sciences and Artificial Intelligence (AI) has advanced tremendously, allowing the scientific community to generate a massive amount of knowledge from data. Advances in Deep Learning (DL), in particular, are transforming many branches of engineering, science, and technology. Several of these methodologies have already been adapted for harnessing biological datasets; however, there is still a need to further adapt and tailor these techniques to new and emerging technologies. In this dissertation, we present computational algorithms and tools that we have developed to study gene-regulation and cellular morphology in cancer. The models and platforms that we have developed are general and widely applicable to several problems relating to dysregulation of gene expression in diseases. Our pipelines and software packages are disseminated in public repositories for larger scientific community use. This dissertation is organized in three main projects. In the first project, we present Causal Inference Engine (CIE), an integrated platform for the identification and interpretation of active regulators of transcriptional response. The platform offers visualization tools and pathway enrichment analysis to map predicted regulators to Reactome pathways. We provide a parallelized R-package for fast and flexible directional enrichment analysis to run the inference on custom regulatory networks. Next, we designed and developed MODEX, a fully automated text-mining system to extract and annotate causal regulatory interaction between Transcription Factors (TFs) and genes from the biomedical literature. MODEX uses putative TF-gene interactions derived from high-throughput ChIP-Seq or other experiments and seeks to collect evidence and meta-data in the biomedical literature to validate and annotate the interactions. MODEX is a complementary platform to CIE that provides auxiliary information on CIE inferred interactions by mining the literature. In the second project, we present a Convolutional Neural Network (CNN) classifier to perform a pan-cancer analysis of tumor morphology, and predict mutations in key genes. The main challenges were to determine morphological features underlying a genetic status and assess whether these features were common in other cancer types. We trained an Inception-v3 based model to predict TP53 mutation in five cancer types with the highest rate of TP53 mutations. We also performed a cross-classification analysis to assess shared morphological features across multiple cancer types. Further, we applied a similar methodology to classify HER2 status in breast cancer and predict response to treatment in HER2 positive samples. For this study, our training slides were manually annotated by expert pathologists to highlight Regions of Interest (ROIs) associated with HER2+/- tumor microenvironment. Our results indicated that there are strong morphological features associated with each tumor type. Moreover, our predictions highly agree with manual annotations in the test set, indicating the feasibility of our approach in devising an image-based diagnostic tool for HER2 status and treatment response prediction. We have validated our model using samples from an independent cohort, which demonstrates the generalizability of our approach. Finally, in the third project, we present an approach to use spatial transcriptomics data to predict spatially-resolved active gene regulatory mechanisms in tissues. Using spatial transcriptomics, we identified tissue regions with differentially expressed genes and applied our CIE methodology to predict active TFs that can potentially regulate the marker genes in the region. This project bridged the gap between inference of active regulators using molecular data and morphological studies using images. The results demonstrate a significant local pattern in TF activity across the tissue, indicating differential spatial-regulation in tissues. The results suggest that the integrative analysis of spatial transcriptomics data with CIE can capture discriminant features and identify localized TF-target links in the tissue

    hSAGEing: An Improved SAGE-Based Software for Identification of Human Tissue-Specific or Common Tumor Markers and Suppressors

    Get PDF
    SAGE (serial analysis of gene expression) is a powerful method of analyzing gene expression for the entire transcriptome. There are currently many well-developed SAGE tools. However, the cross-comparison of different tissues is seldom addressed, thus limiting the identification of common- and tissue-specific tumor markers.To improve the SAGE mining methods, we propose a novel function for cross-tissue comparison of SAGE data by combining the mathematical set theory and logic with a unique “multi-pool method” that analyzes multiple pools of pair-wise case controls individually. When all the settings are in “inclusion”, the common SAGE tag sequences are mined. When one tissue type is in “inclusion” and the other types of tissues are not in “inclusion”, the selected tissue-specific SAGE tag sequences are generated. They are displayed in tags-per-million (TPM) and fold values, as well as visually displayed in four kinds of scales in a color gradient pattern. In the fold visualization display, the top scores of the SAGE tag sequences are provided, along with cluster plots. A user-defined matrix file is designed for cross-tissue comparison by selecting libraries from publically available databases or user-defined libraries

    Joint and individual analysis of breast cancer histologic images and genomic covariates

    Get PDF
    A key challenge in modern data analysis is understanding connections between complex and differing modalities of data. For example, two of the main approaches to the study of breast cancer are histopathology (analyzing visual characteristics of tumors) and genetics. While histopathology is the gold standard for diagnostics and there have been many recent breakthroughs in genetics, there is little overlap between these two fields. We aim to bridge this gap by developing methods based on Angle-based Joint and Individual Variation Explained (AJIVE) to directly explore similarities and differences between these two modalities. Our approach exploits Convolutional Neural Networks (CNNs) as a powerful, automatic method for image feature extraction to address some of the challenges presented by statistical analysis of histopathology image data. CNNs raise issues of interpretability that we address by developing novel methods to explore visual modes of variation captured by statistical algorithms (e.g. PCA or AJIVE) applied to CNN features. Our results provide many interpretable connections and contrasts between histopathology and genetics

    Classification of microarray gene expression cancer data by using artificial intelligence methods

    Get PDF
    Günümüzde bilgisayar teknolojilerinin gelişmesi ile birçok alanda yapılan çalışmaları etkilemiştir. Moleküler biyoloji ve bilgisayar teknolojilerinde meydana gelen gelişmeler biyoinformatik adlı bilimi ortaya çıkarmıştır. Biyoinformatik alanında meydana gelen hızlı gelişmeler, bu alanda çözülmeyi bekleyen birçok probleme çözüm olma yolunda büyük katkılar sağlamıştır. DNA mikroarray gen ekspresyonlarının sınıflandırılması da bu problemlerden birisidir. DNA mikroarray çalışmaları, biyoinformatik alanında kullanılan bir teknolojidir. DNA mikroarray veri analizi, kanser gibi genlerle alakalı hastalıkların teşhisinde çok etkin bir rol oynamaktadır. Hastalık türüne bağlı gen ifadeleri belirlenerek, herhangi bir bireyin hastalıklı gene sahip olup olmadığı büyük bir başarı oranı ile tespit edilebilir. Bireyin sağlıklı olup olmadığının tespiti için, mikroarray gen ekspresyonları üzerinde yüksek performanslı sınıflandırma tekniklerinin kullanılması büyük öneme sahiptir. DNA mikroarray’lerini sınıflandırmak için birçok yöntem bulunmaktadır. Destek Vektör Makinaları, Naive Bayes, k-En yakın Komşu, Karar Ağaçları gibi birçok istatistiksel yöntemler yaygın olarak kullanlmaktadır. Fakat bu yöntemler tek başına kullanıldığında, mikroarray verilerini sınıflandırmada her zaman yüksek başarı oranları vermemektedir. Bu yüzden mikroarray verilerini sınıflandırmada yüksek başarı oranları elde etmek için yapay zekâ tabanlı yöntemlerin de kullanılması yapılan çalışmalarda görülmektedir. Bu çalışmada, bu istatistiksel yöntemlere ek olarak yapay zekâ tabanlı ANFIS gibi bir yöntemi kullanarak daha yüksek başarı oranları elde etmek amaçlanmıştır. İstatistiksel sınıflandırma yöntemleri olarak K-En Yakın Komşuluk, Naive Bayes ve Destek Vektör Makineleri kullanılmıştır. Burada Göğüs ve Merkezi Sinir Sistemi kanseri olmak üzere iki farklı kanser veri seti üzerinde çalışmalar yapılmıştır. Sonuçlardan elde edilen bilgilere göre, genel olarak yapay zekâ tabanlı ANFIS tekniğinin, istatistiksel yöntemlere göre daha başarılı olduğu tespit edilmiştir

    Formal Concept Analysis Applications in Bioinformatics

    Get PDF
    Bioinformatics is an important field that seeks to solve biological problems with the help of computation. One specific field in bioinformatics is that of genomics, the study of genes and their functions. Genomics can provide valuable analysis as to the interaction between how genes interact with their environment. One such way to measure the interaction is through gene expression data, which determines whether (and how much) a certain gene activates in a situation. Analyzing this data can be critical for predicting diseases or other biological reactions. One method used for analysis is Formal Concept Analysis (FCA), a computing technique based in partial orders that allows the user to examine the structural properties of binary data based on which subsets of the data set depend on each other. This thesis surveys, in breadth and depth, the current literature related to the use of FCA for bioinformatics, with particular focus on gene expression data. This includes descriptions of current data management techniques specific to FCA, such as lattice reduction, discretization, and variations of FCA to account for different data types. Advantages and shortcomings of using FCA for genomic investigations, as well as the feasibility of using FCA for this application are addressed. Finally, several areas for future doctoral research are proposed. Adviser: Jitender S. Deogu
    corecore