338 research outputs found

    Infinite feature selection: a graph-based feature filtering approach

    We propose a filter-based feature selection framework that treats a subset of features as a path in a graph, where a node is a feature and an edge encodes a pairwise (customizable) relation between features, capturing the principles of relevance and redundancy. Through two different interpretations (one exploiting properties of power series of matrices, the other relying on Markov chain fundamentals) we can evaluate the values of paths (i.e., feature subsets) of arbitrary length, eventually letting the length go to infinity, hence the name Infinite Feature Selection (Inf-FS). Letting the path length go to infinity makes it possible to bound the computational complexity of the selection process and to rank the features in an elegant way, that is, by considering the value of every path (subset) that contains a particular feature. We also propose a simple unsupervised strategy to cut the ranking, thereby providing the subset of features to keep. In the experiments, we analyze diverse setups with heterogeneous features, for a total of 11 benchmarks, comparing against 18 widely known and effective approaches. The results show that Inf-FS performs better in almost every situation, that is, both when the number of features to keep is fixed a priori and when choosing the subset cardinality is part of the process.
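
The power-series interpretation mentioned in this abstract can be illustrated numerically. The snippet below is a minimal sketch, not the authors' implementation: it assumes a nonnegative feature-graph weight matrix `A` is already available (the abstract leaves the pairwise relation customizable, so how `A` is built is not shown), and the function name `inf_fs_scores` and the damping parameter `alpha` are hypothetical. It relies only on the standard matrix geometric-series identity: the sum of `(r*A)**l` over all path lengths `l >= 1` equals `inv(I - r*A) - I` whenever `r` times the spectral radius of `A` is below 1.

```python
import numpy as np

def inf_fs_scores(A, alpha=0.9):
    """Score each feature by the total value of all paths starting at it.

    A     : (n, n) nonnegative edge-weight matrix (assumed given).
    alpha : hypothetical damping factor in (0, 1); r = alpha / rho(A)
            keeps the spectral radius of r*A below 1 so the series converges.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    rho = np.max(np.abs(np.linalg.eigvals(A)))  # spectral radius of A
    r = alpha / rho
    # Closed form of sum_{l=1..inf} (r*A)^l
    S = np.linalg.inv(np.eye(n) - r * A) - np.eye(n)
    # Row sums aggregate the value of paths of every length from each node
    return S.sum(axis=1)

# Toy graph: feature 0 has the strongest links to the other two features.
A = np.array([[0.0, 0.8, 0.6],
              [0.8, 0.0, 0.1],
              [0.6, 0.1, 0.0]])
scores = inf_fs_scores(A)
ranking = np.argsort(-scores)  # feature indices, best first
print(ranking, scores)
```

The closed form costs a single matrix inversion regardless of how long the paths are, which is one way to read the abstract's claim that going to infinity constrains the computational cost.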

    On-Chip Living-Cell Microarrays for Network Biology

    Machine Learning Models for Deciphering Regulatory Mechanisms and Morphological Variations in Cancer

    The exponential growth of multi-omics biological datasets is driving a paradigm shift in fundamental biological research. In recent years, imaging and transcriptomics datasets have been increasingly incorporated into biological studies, pushing biology further into the domain of data-intensive sciences. New approaches and tools from statistics, computer science, and data engineering are profoundly influencing biological research. Harnessing this ever-growing deluge of multi-omics biological data requires the development of novel and creative computational approaches. In parallel, fundamental research in data science and Artificial Intelligence (AI) has advanced tremendously, allowing the scientific community to generate a massive amount of knowledge from data. Advances in Deep Learning (DL), in particular, are transforming many branches of engineering, science, and technology. Several of these methodologies have already been adapted for harnessing biological datasets; however, there is still a need to further adapt and tailor these techniques to new and emerging technologies. In this dissertation, we present computational algorithms and tools that we have developed to study gene regulation and cellular morphology in cancer. The models and platforms that we have developed are general and widely applicable to several problems relating to dysregulation of gene expression in diseases. Our pipelines and software packages are available in public repositories for use by the wider scientific community. This dissertation is organized into three main projects. In the first project, we present the Causal Inference Engine (CIE), an integrated platform for the identification and interpretation of active regulators of transcriptional response. The platform offers visualization tools and pathway enrichment analysis to map predicted regulators to Reactome pathways. 
We provide a parallelized R package for fast and flexible directional enrichment analysis to run the inference on custom regulatory networks. Next, we designed and developed MODEX, a fully automated text-mining system to extract and annotate causal regulatory interactions between Transcription Factors (TFs) and genes from the biomedical literature. MODEX uses putative TF-gene interactions derived from high-throughput ChIP-Seq or other experiments and seeks to collect evidence and metadata in the biomedical literature to validate and annotate the interactions. MODEX is a complementary platform to CIE that provides auxiliary information on CIE-inferred interactions by mining the literature. In the second project, we present a Convolutional Neural Network (CNN) classifier to perform a pan-cancer analysis of tumor morphology and predict mutations in key genes. The main challenges were to determine the morphological features underlying a genetic status and to assess whether these features were common to other cancer types. We trained an Inception-v3 based model to predict TP53 mutation in the five cancer types with the highest rate of TP53 mutations. We also performed a cross-classification analysis to assess shared morphological features across multiple cancer types. Further, we applied a similar methodology to classify HER2 status in breast cancer and predict response to treatment in HER2-positive samples. For this study, our training slides were manually annotated by expert pathologists to highlight Regions of Interest (ROIs) associated with HER2+/- tumor microenvironment. Our results indicated that there are strong morphological features associated with each tumor type. Moreover, our predictions agree closely with manual annotations in the test set, indicating the feasibility of our approach for devising an image-based diagnostic tool for HER2 status and treatment response prediction. 
We have validated our model using samples from an independent cohort, which demonstrates the generalizability of our approach. Finally, in the third project, we present an approach that uses spatial transcriptomics data to predict spatially resolved active gene regulatory mechanisms in tissues. Using spatial transcriptomics, we identified tissue regions with differentially expressed genes and applied our CIE methodology to predict active TFs that can potentially regulate the marker genes in each region. This project bridged the gap between inference of active regulators using molecular data and morphological studies using images. The results demonstrate a significant local pattern in TF activity across the tissue, indicating differential spatial regulation in tissues. The results suggest that the integrative analysis of spatial transcriptomics data with CIE can capture discriminant features and identify localized TF-target links in the tissue.

    Microarray image processing : a novel neural network framework

    Due to the vast success of bioengineering techniques, a series of large-scale analysis tools has been developed to discover the functional organization of cells. Among them, cDNA microarray has emerged as a powerful technology that enables biologists to study thousands of genes simultaneously within an entire organism, and thus obtain a better understanding of the gene interaction and regulation mechanisms involved. Although microarray technology has been developed so as to offer high tolerances, there is high signal irregularity across the surface of the microarray image. Imperfections in the microarray image generation process cause noise of many types, which contaminates the resulting image. These errors and noise propagate through, and can significantly affect, all subsequent processing and analysis. Therefore, to realize the potential of such technology it is crucial to obtain high-quality image data that truly reflects the underlying biology in the samples. One of the key steps in extracting information from a microarray image is segmentation: identifying which pixels within an image represent which gene. This area of spotted microarray image analysis has received relatively little attention compared with the advances in subsequent analysis stages. However, the lack of advanced image analysis, including segmentation, results in sub-optimal data being used in all downstream analysis methods. Although much recent research has addressed microarray image analysis and many methods have been proposed, some methods produce better results than others. In general, the most effective approaches require considerable processing power to process an entire image. 
Furthermore, there has been little progress on developing sufficiently fast yet efficient and effective algorithms for the segmentation of microarray images using a highly sophisticated framework such as Cellular Neural Networks (CNNs). It is, therefore, the aim of this thesis to investigate and develop novel methods for processing microarray images. The goal is to produce results that outperform the currently available approaches in terms of PSNR, k-means and ICC measurements. (EThOS - Electronic Theses Online Service; Aleppo University, Syria; GB; United Kingdom)

    Study of digital signal processing tools to infer gene regulatory networks from microarrays

    Since the mid-1990s, the field of genomic signal processing has exploded due to the development of DNA microarray technology, which made it possible to measure the mRNA expression of thousands of genes in parallel. Researchers have developed a vast body of knowledge in classification methods. The scientific community has also developed a broad knowledge of the individual parts involved in the operation of a cell, but we still do not understand how these individual parts interact. For this reason, a new type of analysis of microarray data, called pathway analysis, has been developed. This approach considers that genes work together in cascades and do not act on their own in a biological system. The activity of the genes in a cell is controlled by gene regulatory networks, which consist of the union and interconnection of the various pathways. This thesis is placed in the field of computer systems and signal processing applied to biology and aims to study and develop methods to infer the relationships among genes in a large-scale gene network topology where the regulation is not known and must be inferred from experimental data. First, we present a review and a comparison of the different state-of-the-art methods that have tried to solve this challenge with different approaches: gene networks based on co-expression, information-theoretic approaches, Bayesian networks, and finally approaches based on differential equations. Second, we present an exhaustive study of two selected techniques, the Z-score and Zavlanos algorithms, in order to analyze their strengths and drawbacks. The chosen methods have been tested on two public datasets: the SOS pathway and a synthetic dataset simulated by computer. The proposed approach obtains good identification results, confirming its soundness. Finally, we present an analysis of the ability of the inferred network to predict the behavior of the system in response to an external perturbation. A new approach to boost the identification performance, based on an ensemble decision paradigm, is also presented. Although the idea is preliminary, we found some promising results that demonstrate the potential of the approach.
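
Of the inference families this abstract lists, co-expression networks are the simplest to sketch. The snippet below is an illustrative assumption rather than the thesis's method: it links two genes when the absolute Pearson correlation of their expression profiles reaches a cutoff. The function name `coexpression_network`, the `threshold` value, and the toy data are all hypothetical.

```python
import numpy as np

def coexpression_network(expr, threshold=0.8):
    """Build an undirected co-expression graph from an expression matrix.

    expr      : (genes, samples) array of expression measurements.
    threshold : hypothetical cutoff on |Pearson correlation|.
    Returns a boolean (genes, genes) adjacency matrix without self-loops.
    Note: a gene with constant expression yields NaN correlations,
    which compare as False and therefore produce no edges.
    """
    corr = np.corrcoef(np.asarray(expr, dtype=float))  # rows are variables
    adj = np.abs(corr) >= threshold
    np.fill_diagonal(adj, False)  # drop trivial self-correlation
    return adj

# Toy data: genes 0 and 1 rise together; gene 2 follows a different pattern.
expr = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
                 [2.0, 4.0, 6.0, 8.0, 10.5],
                 [3.0, 1.0, 4.0, 1.0, 5.0]])
print(coexpression_network(expr))
```

Correlation is symmetric, so a graph built this way carries no direction of regulation; the other families named in the abstract would be needed to recover direction or causal structure.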

    Methods towards precision bioinformatics in single cell era

    Single-cell technology offers unprecedented insight into the molecular landscape of individual cells and is transforming precision medicine. Key to the effective use of single-cell data for disease understanding is the analysis of such information through bioinformatics methods. In this thesis, we examine and address several challenges in single-cell bioinformatics methods for precision medicine. While most current single-cell analytical tools employ statistical and machine learning methods, deep learning technology has achieved tremendous success in computer science. Combined with ensemble learning, it can further improve model performance. Through a review article (Cao et al., 2020), we share recent key developments in this area and their contribution to bioinformatics research. Bioinformatics tools often use simulated data to assess proposed methodologies, but evaluation of the quality of single-cell RNA-sequencing (scRNA-seq) data simulation tools is lacking. We develop a comprehensive framework, SimBench (Cao et al., 2021), that examines a range of aspects, from data properties to the ability to maintain biological signals, scalability, and applicability. While understanding individual patients is the key to precision medicine, there is little consensus on the best ways to compress complex single-cell data into summary statistics that represent each individual. We present scFeatures (Cao et al., 2022b), an approach that creates interpretable molecular representations for individuals. Finally, in a case study using multiple COVID-19 scRNA-seq datasets, we utilise scFeatures to generate molecular characterisations of individuals and illustrate the impact of ensemble learning and deep learning on improving disease outcome prediction. Overall, this thesis addresses several gaps in precision bioinformatics in the single-cell field by highlighting research advances, developing methodologies, and illustrating practical uses through experimental datasets and case studies.