16 research outputs found

    Compositional Mining of Multi-Relational Biological Datasets

    High-throughput biological screens are yielding ever-growing streams of information about multiple aspects of cellular activity. As more and more categories of datasets come online, there is a corresponding multitude of ways in which inferences can be chained across them, motivating the need for compositional data mining algorithms. In this paper, we argue that such compositional data mining can be effectively realized by functionally cascading redescription mining and biclustering algorithms as primitives. Both of these primitives mirror shifts of vocabulary that can be composed in arbitrary ways to create rich chains of inferences. Given a relational database and its schema, we show how the schema can be automatically compiled into a compositional data mining program, and how different domains in the schema can be related through logical sequences of biclustering and redescription invocations. This feature allows us to rapidly prototype new data mining applications, yielding greater understanding of scientific datasets. We describe two applications of compositional data mining: (i) matching terms across categories of the Gene Ontology and (ii) understanding the molecular mechanisms underlying stress response in human cells.

    Bit-Table Based Biclustering and Frequent Closed Itemset Mining in High-Dimensional Binary Data

    During the last decade, various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields in this research, proposed independently, are frequent itemset mining (developed for market basket data) and biclustering (applied to gene expression data analysis). The common limitation of both methodologies is their limited applicability to very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered quickly. The proposed algorithm has been implemented in the commonly used MATLAB environment and is freely available to researchers.
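
    The link between a bit-table, matrix multiplication, closed itemsets and biclusters can be illustrated with a minimal sketch. This is not the authors' MATLAB implementation, whose details the abstract does not give; the function names `pair_supports` and `closed_itemset` and the use of NumPy are assumptions made only for illustration.

```python
import numpy as np


def pair_supports(bit_table):
    """Count co-occurrences of every item pair in a binary rows x items table.

    Entry (i, j) of the result is the number of rows that contain both
    item i and item j; the diagonal holds the single-item supports.
    """
    b = np.asarray(bit_table, dtype=np.int64)
    return b.T @ b


def closed_itemset(bit_table, itemset):
    """Return (support, closure) of an itemset given as column indices.

    The closure is the set of all items shared by every row that supports
    the itemset. Those rows together with the closure form an all-ones
    submatrix, i.e. a bicluster of the binary table.
    """
    b = np.asarray(bit_table, dtype=bool)
    rows = b[:, sorted(itemset)].all(axis=1)  # rows containing every item
    support = int(rows.sum())
    if support == 0:
        return 0, []  # no supporting rows: report an empty closure
    closure = np.flatnonzero(b[rows].all(axis=0))
    return support, [int(i) for i in closure]


if __name__ == "__main__":
    table = [
        [1, 1, 0],
        [1, 1, 1],
        [0, 1, 1],
        [1, 1, 0],
    ]
    print(pair_supports(table))
    print(closed_itemset(table, {0}))
```

    In this toy table every row that contains item 0 also contains item 1, so the closure of {0} is {0, 1}, and the three supporting rows with those two columns form an all-ones bicluster.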

    Plsi: A Computational Software Pipeline For Pathway Level Disease Subtype Identification

    It is accepted that many complex diseases, like cancer, consist of collections of distinct genetic diseases. Clinical advances in treatments are attributed to molecular treatments aimed at specific genes, resulting in greater efficacy and fewer debilitating side effects. This shows that it is important to identify and appropriately treat each individual disease subtype. Our current understanding of subtypes is limited: despite targeted treatment advances, targeted therapies often fail for some patients. The main limitation of current methods for subtype identification is that they focus on gene expression and are subject to its intrinsic noise. Signaling pathways describe biological processes that are carried out by networks of genes interacting with each other. We developed PLSI, a software tool that identifies the specific pathways impacted in individual patients, subgroups of patients, or a given subtype of disease. The expected impact includes a better understanding of disease and resistance to treatment.

    Biclustering sobre datos de expresión génica basado en búsqueda dispersa

    Gene expression data, with their particular nature and importance, motivate not only the development of new techniques but also the formulation of new problems such as biclustering. Biclustering is an unsupervised learning technique that groups both genes and conditions. This two-way grouping distinguishes it from traditional clustering on this type of data, which groups either genes or conditions but not both. This thesis presents a new biclustering algorithm that allows different search criteria to be studied. The algorithm uses a scatter search scheme, which decouples the search mechanism from the criterion employed. Three different search criteria have been studied, and they motivate the three main contributions of the thesis. First, the linear correlation between genes is studied and integrated into the objective function used by the biclustering algorithm. Linear correlation makes it possible to find biclusters with shifting and scaling patterns, which improves on earlier proposals. Second, motivated by the biological significance of activation-inhibition patterns between genes, the linear correlation is modified so that these patterns are taken into account. Finally, the information available about genes in public repositories, such as the Gene Ontology (GO), is incorporated into the search criterion. An extra term is added that reflects, for each bicluster evaluated, the quality of that group of genes according to the information stored about them in GO. Two options for this biological-information term are studied and compared with each other, and the results are shown to be better when biological information is used in the biclustering algorithm.
The three contributions described, together with a series of intermediate steps, have led to results published in journals and at national and international conferences.
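
    Why linear correlation captures shifting and scaling patterns, and why taking its absolute value also admits activation-inhibition patterns, can be seen in a small sketch. The thesis's actual objective function is not given in the abstract, so the score below (mean pairwise Pearson correlation over the genes of a bicluster) and the name `mean_pairwise_correlation` are illustrative assumptions, not the author's formulation.

```python
import numpy as np


def mean_pairwise_correlation(data, rows, cols, absolute=False):
    """Average Pearson correlation between the gene rows of a bicluster.

    data     : genes x conditions expression matrix
    rows     : indices of the genes in the bicluster (at least two)
    cols     : indices of the conditions in the bicluster
    absolute : if True, use |r| so that inversely related genes
               (activation-inhibition) also count as coherent
    """
    sub = np.asarray(data, dtype=float)[np.ix_(rows, cols)]
    corr = np.corrcoef(sub)  # each row is treated as one variable
    upper = np.triu_indices(len(rows), k=1)  # each gene pair once
    values = corr[upper]
    if absolute:
        values = np.abs(values)
    return float(values.mean())


if __name__ == "__main__":
    base = np.array([1.0, 2.0, 4.0, 3.0])
    expression = np.vstack([
        base,            # reference gene
        2.0 * base + 5,  # scaled and shifted copy of the reference
        -base,           # inversely regulated gene
    ])
    conditions = [0, 1, 2, 3]
    print(mean_pairwise_correlation(expression, [0, 1], conditions))
    print(mean_pairwise_correlation(expression, [0, 1, 2], conditions))
    print(mean_pairwise_correlation(expression, [0, 1, 2], conditions, absolute=True))
```

    A scaled and shifted gene is perfectly correlated with the reference, so the first score is maximal. Adding the inversely regulated gene lowers the plain score, while the absolute-value variant still treats all three genes as one coherent group.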

    Comparative analysis of gene duplications and their impact on expression levels in nematode genomes

    Gene duplication is a major mechanism that plays a vital role in different evolutionary innovations, ranging from generating novel traits to phenotypic plasticity. The evolutionary impact of gene duplication and the fate of duplicated genes have been studied in detail. However, little is known about the impact of gene duplication on gene expression with respect to different evolutionary time scales. Here, we study genome-wide patterns of gene duplications in nematodes and assess their effect on expression levels. This study encompasses various macroevolutionary comparisons at different time scales and microevolutionary comparisons within the species Pristionchus pacificus. At the macroevolutionary level, by comparing species separated more than 280 million years ago, we found various lineage-specific expansions in multiple gene families along the Pristionchus lineage. Moreover, we found that duplicated genes are highly enriched among developmentally regulated genes. Interestingly, the results also show evidence for selection on duplication to increase gene expression levels in a developmental stage-specific manner. To gain insights into the microevolution of gene expression levels after gene duplication, we compared different strains of P. pacificus and found that an additional gene copy usually does not increase gene expression levels in the different strains. Furthermore, we found a strong depletion of duplicated genes in large parts of the P. pacificus genome, pointing to negative selection against gene duplication. This shows that the impact of gene duplication on gene expression levels differs dramatically: selection for increased gene dosage dominates macroevolution, whereas negative selection on gene duplication dominates at the within-species level. This led us to ask what happens at the intermediate time scale. We compared recent duplicates of P. pacificus with their single-copy orthologs in two closely related species and found a pattern similar to the microevolutionary trend. Additionally, a comparison of closely related species of the Strongyloides genus and its developmental transcriptome also shows an overall strong depletion of duplicated genes, similar to the observation at the microevolutionary level. At the same time, a strong enrichment of duplicated genes was found at a developmental stage associated with the parasitic activity of the nematodes. Similar to the macroevolutionary picture of P. pacificus, we also found selection for higher gene dosage in parasitism-associated gene families of S. papillosus, indicating the adaptive potential of duplicated genes. Even though these studies show widespread selection against both duplication and changes in gene expression, duplications are favoured in some conditions, leading to adaptive changes in the organism. Overall, this indicates that the regulation of expression levels of duplicated genes was subject to different selection processes at different time scales, which represent a complex interplay between different evolutionary processes such as natural selection, population dynamics, and genetic drift.

    Similitud funcional de genes basada en conocimiento biológico

    Programa de Doctorado en Tecnología e Ingeniería del Software
    Over the last few years, our knowledge about biological processes in living organisms has greatly expanded both in quantity and resolution, mostly thanks to the introduction of high-throughput sequencing technology. Making sense of this vast amount of biological data through methods such as automated learning is therefore critical to gain further insights into the molecular mechanisms behind fundamental biological processes. This work aims at establishing the quality of new genetic models based on actual biological data. First, a tool is developed for analyzing the coherence of a group of genes according to their common role in metabolic processes. This tool allows the evaluation and validation of different gene sets obtained through any clustering technique. Additionally, a novel measure of functional similarity of a group of genes is introduced. This measure, called GFD, is based on the Gene Ontology, and it assigns a numerical value to a gene set for each of the three GO ontologies. Specifically, GFD computes the similarity based only on the most common and specific functionality of the genes. GFD compares favorably against the most relevant measures. Our approach is especially relevant in the study of genes that are involved in several functions.
    Universidad Pablo de Olavide de Sevilla. Departamento de Deporte e Informática. Postprin

    Gene regulatory network modelling with evolutionary algorithms -an integrative approach

    Building models for gene regulation has been an important aim of Systems Biology in recent years, driven by the large amount of gene expression data that has become available. Models represent regulatory interactions between genes and transcription factors and can provide a better understanding of biological processes, as well as a means of simulating both natural and perturbed systems (e.g. those associated with disease). Quantitative modelling of gene regulatory networks (GRNs) is still limited, however, by data issues such as noise and the restricted length of the time series typically used for GRN reverse engineering. These issues create an under-determination problem, with many models possibly fitting the data. However, large amounts of other types of biological data and knowledge are available, such as cross-platform measurements, knockout experiments, annotations, binding site affinities for transcription factors and so on. It has been postulated that integrating these can improve the quality of the models obtained by facilitating further filtering of possible models. However, integration is not straightforward, as the different types of data can provide contradictory information and are intrinsically noisy; hence large-scale integration has not been fully explored to date. Here, we present an integrative parallel framework for GRN modelling, which employs evolutionary computation and different types of data to enhance model inference. Integration is performed at different levels. (i) An analysis of cross-platform integration of time series microarray data is presented, discussing the effects on the resulting models and exploring cross-platform normalisation techniques. This shows that time-course data integration is possible, and results in models that are more robust to noise and parameter perturbation, as well as in reduced over-fitting to noise. 
(ii) Other types of measurements and knowledge, such as knockout experiments, annotated transcription factors, binding site affinities and promoter sequences, are integrated within the evolutionary framework to obtain more plausible GRN models. This is performed by customising the initialisation, mutation and evaluation of candidate model solutions. The different data types are investigated, and both qualitative and quantitative improvements are obtained. Results suggest that caution is needed in order to obtain improved models from combined data, and the case study presented here provides an example of how this can be achieved. Furthermore, (iii) RNA-seq data is studied in comparison to microarray experiments, to identify overlapping features and possibilities of integration within the framework. The extension of the framework to this data type is straightforward, and qualitative improvements are obtained when combining predicted interactions from single-channel and RNA-seq datasets.