Compositional Mining of Multi-Relational Biological Datasets
High-throughput biological screens are yielding ever-growing streams of
information about multiple aspects of cellular activity. As more and more
categories of datasets come online, there is a corresponding multitude of ways
in which inferences can be chained across them, motivating the need for
compositional data mining algorithms. In this paper, we argue that such
compositional data mining can be effectively realized by functionally cascading
redescription mining and biclustering algorithms as primitives. Both these
primitives mirror shifts of vocabulary that can be composed in arbitrary ways
to create rich chains of inferences. Given a relational database and its
schema, we show how the schema can be automatically compiled into a
compositional data mining program, and how different domains in the schema can
be related through logical sequences of biclustering and redescription
invocations. This feature allows us to rapidly prototype new data mining
applications, yielding greater understanding of scientific datasets. We
describe two applications of compositional data mining: (i) matching terms
across categories of the Gene Ontology and (ii) understanding the molecular
mechanisms underlying stress response in human cells.
Bit-Table Based Biclustering and Frequent Closed Itemset Mining in High-Dimensional Binary Data
During the last decade, various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields in this research, proposed independently, are frequent itemset mining (developed for market basket data) and biclustering (applied to gene expression data analysis). A common limitation of both methodologies is their limited applicability to very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered quickly. The proposed algorithm has been implemented in the commonly used MATLAB environment and is freely available to researchers.
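To make the matrix and vector multiplication idea concrete, here is a minimal sketch in Python with NumPy (the abstract's implementation is in MATLAB and is not reproduced here). It assumes a 0/1 bit-table whose rows are transactions or genes and whose columns are items or conditions. The function name `closure` and the example data are hypothetical. The sketch only computes the closure of one seed itemset, which is a single building block and not the full enumeration algorithm of the paper.

```python
import numpy as np


def closure(bit_table, itemset):
    """Return (supporting_rows, closed_itemset) for a seed itemset.

    bit_table: 2-D 0/1 array, rows = transactions/genes, columns = items/conditions.
    itemset:   list of column indices forming the seed pattern.
    """
    bit_table = np.asarray(bit_table, dtype=int)
    indicator = np.zeros(bit_table.shape[1], dtype=int)
    indicator[list(itemset)] = 1
    # Matrix-vector product counts, per row, how many seed columns are set.
    support_mask = (bit_table @ indicator) == indicator.sum()
    rows = np.flatnonzero(support_mask)
    if rows.size == 0:
        # No row supports the seed; this sketch simply returns empty results.
        return rows, np.array([], dtype=int)
    # Vector-matrix product counts, per column, how many supporting rows are set.
    column_counts = support_mask.astype(int) @ bit_table
    closed = np.flatnonzero(column_counts == rows.size)
    return rows, closed


# Hypothetical 4 x 4 bit-table used only for illustration.
bit_table = np.array([
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
])
rows, closed = closure(bit_table, [0])
```

In this example the seed column 0 is supported by rows 0, 1, and 2, and its closure is columns 0 and 1, so those rows and columns form an all-ones bicluster, which is also a closed itemset with support 3.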
Plsi: A Computational Software Pipeline For Pathway Level Disease Subtype Identification
It is accepted that many complex diseases, like cancer, consist of collections of distinct genetic diseases. Clinical advances in treatments are attributed to molecular treatments aimed at specific genes, resulting in greater efficacy and fewer debilitating side effects. This shows that it is important to identify and appropriately treat each individual disease subtype. Our current understanding of subtypes is limited: despite advances in targeted treatment, targeted therapies often fail for some patients. The main limitation of current methods for subtype identification is that they focus on gene expression and are therefore subject to its intrinsic noise. Signaling pathways describe biological processes that are carried out by networks of genes interacting with each other. We developed PLSI, software that identifies the specific pathways impacted in individual patients, subgroups of patients, or a given subtype of disease. The expected impact includes a better understanding of disease and resistance to treatment.
Biclustering sobre datos de expresión génica basado en búsqueda dispersa (Biclustering of gene expression data based on scatter search)
Gene expression data, with their particular nature and importance, motivate
not only the development of new techniques but also the formulation of new
problems, such as the biclustering problem. Biclustering is an unsupervised
learning technique that groups both genes and conditions. This two-way
grouping distinguishes it from traditional clustering on this type of data,
which groups either genes or conditions but not both.
This thesis presents a new biclustering algorithm that allows different
search criteria to be studied. The algorithm uses a scatter search scheme,
which decouples the search mechanism from the criterion employed.
Three different search criteria have been studied, and they motivate the
three main contributions of the thesis. First, the linear correlation between
genes is studied and integrated into the objective function used by the
biclustering algorithm. Linear correlation makes it possible to find
biclusters with shifting and scaling patterns, which improves on earlier
proposals. Second, motivated by the biological significance of
activation-inhibition patterns between genes, the linear correlation is
modified so that these patterns are taken into account. Finally, the
information about genes available in public repositories, such as the Gene
Ontology (GO), is incorporated into the search criterion. An extra term is
added that reflects, for each bicluster evaluated, the quality of that group
of genes according to the information stored about them in GO. Two
alternatives for this biological information integration term are studied and
compared with each other, and the results are shown to be better when
biological information is used in the biclustering algorithm.
The three contributions described, together with a series of intermediate
steps, have led to results published in journals and in national and
international conferences.
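As an illustration of why linear correlation captures shifting and scaling patterns, the following Python sketch scores a bicluster by the mean pairwise Pearson correlation of its gene rows. It is not the thesis's objective function: the function name `correlation_score`, the `use_absolute` switch, and the example matrix are assumptions made for this sketch. In particular, taking the absolute value is only one plausible way to accommodate activation-inhibition patterns; the abstract does not say how the correlation is modified.

```python
import numpy as np


def correlation_score(data, genes, conditions, use_absolute=False):
    """Mean pairwise Pearson correlation between the gene rows of a bicluster.

    data:         2-D array, rows = genes, columns = conditions.
    genes:        row indices of the bicluster (at least two, non-constant rows).
    conditions:   column indices of the bicluster.
    use_absolute: assumed variant that also rewards negatively correlated genes.
    """
    sub = np.asarray(data, dtype=float)[np.ix_(genes, conditions)]
    corr = np.corrcoef(sub)  # each row of `sub` is treated as one variable
    pairs = corr[np.triu_indices(len(genes), k=1)]  # each gene pair once
    if use_absolute:
        pairs = np.abs(pairs)
    return float(pairs.mean())


# Hypothetical expression matrix: a base profile, a shifted copy,
# a scaled copy, and an inverted (inhibition-like) copy.
base = np.array([1.0, 3.0, 2.0, 5.0])
expression = np.vstack([base, base + 10.0, 2.0 * base, 7.0 - base])
all_conditions = [0, 1, 2, 3]

shift_scale = correlation_score(expression, [0, 1, 2], all_conditions)
signed_all = correlation_score(expression, [0, 1, 2, 3], all_conditions)
absolute_all = correlation_score(expression, [0, 1, 2, 3], all_conditions,
                                 use_absolute=True)
```

The shifted and scaled rows correlate perfectly with the base profile, so `shift_scale` is close to 1. Adding the inverted row pulls the signed score down to about 0, while the absolute-value variant stays close to 1, which is the behaviour one would want if inhibition patterns are to count as coherent.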
Comparative analysis of gene duplications and their impact on expression levels in nematode genomes
Gene duplication is a major mechanism that plays a vital role in different evolutionary innovations, ranging from generating novel traits to phenotypic plasticity. The evolutionary impact of gene duplication and the fate of duplicated genes have been studied in detail. However, little is known about the impact of gene duplication on gene expression with respect to different evolutionary time scales. Here, we study genome-wide patterns of gene duplications in nematodes and assess their effect on expression levels. This study encompasses various macroevolutionary comparisons at different time scales and microevolutionary comparisons within the species Pristionchus pacificus.
At the macroevolutionary level, by comparing species separated more than 280 million years ago, we found various lineage-specific expansions in multiple gene families along the Pristionchus lineage. Moreover, we found that duplicated genes are highly enriched among developmentally regulated genes. Interestingly, the results also show evidence for selection on duplication to increase gene expression levels in a developmental stage-specific manner.
To gain insights into the microevolution of gene expression levels after gene duplication, we compared different strains of P. pacificus and found that an additional gene copy usually does not increase gene expression levels in the different strains. Furthermore, we found a strong depletion of duplicated genes in large parts of the P. pacificus genome, indicating negative selection against gene duplication. This shows that the impact of gene duplication on gene expression levels differs dramatically between time scales: selection for increased gene dosage dominates macroevolution, whereas negative selection on gene duplication dominates at the within-species level.
This led us to ask what happens at the intermediate time scale. We compared recent duplicates of P. pacificus with their single-copy orthologs in two closely related species and found a pattern similar to the microevolutionary trend. Additionally, a comparison of closely related species of the Strongyloides genus and their developmental transcriptomes also shows an overall strong depletion of duplicated genes, similar to the observation at the microevolutionary level. At the same time, a strong enrichment of duplicated genes was found at a developmental stage associated with the parasitic activity of the nematodes. Similar to the macroevolutionary picture of P. pacificus, we also found selection for higher gene dosage in parasitism-associated gene families of S. papillosus, indicating the adaptive potential of duplicated genes. Even though these studies show widespread selection against both duplication and changes in gene expression, duplications are favoured in some conditions, leading to adaptive changes in the organism. Overall, this indicates that the regulation of expression levels of duplicated genes was subjected to different selection processes at different time scales, which represent a complex interplay between different evolutionary processes like natural selection, population dynamics, and genetic drift.
Similitud funcional de genes basada en conocimiento biológico (Functional similarity of genes based on biological knowledge)
Programa de Doctorado en Tecnología e Ingeniería del Software
Over the last few years, our knowledge about biological processes in living organisms has greatly expanded both in quantity and resolution, mostly thanks to the introduction of high-throughput sequencing technology. Making sense of this vast amount of biological data through methods such as automated learning is therefore critical to gain further insights into the molecular mechanisms behind fundamental biological processes.
This work aims at establishing the quality of new genetic models based on actual biological data. First, a tool for analyzing the coherence of a group of genes according to their common role in metabolic processes is developed. This tool allows the evaluation and validation of different gene sets obtained through any clustering technique.
Additionally, a novel measure of functional similarity of a group of genes has been introduced. This measure, called GFD, is based on the Gene Ontology, and it assigns a numerical value to a gene set for each of the three GO ontologies. Concretely, GFD computes the similarity based only on the most common and specific functionality of the genes. GFD compares favorably against the most relevant measures. Our approach is especially relevant in the study of genes that are involved in several functions.
Universidad Pablo de Olavide de Sevilla. Departamento de Deporte e Informática
Gene regulatory network modelling with evolutionary algorithms -an integrative approach
Building models for gene regulation has been an important aim of Systems Biology over the past years, driven by the large amount of gene expression data that has become available. Models represent regulatory interactions between genes and transcription factors and can provide better understanding of biological processes, and means of simulating both natural and perturbed systems (e.g. those associated with disease). Gene regulatory network (GRN) quantitative modelling is still limited, however, due to data issues such as noise and restricted length of time series, typically used for GRN reverse engineering. These issues create an under-determination problem, with many models possibly fitting the data. However, large amounts of other types of biological data and knowledge are available, such as cross-platform measurements, knockout experiments, annotations, binding site affinities for transcription factors and so on. It has been postulated that integration of these can improve model quality obtained, by facilitating further filtering of possible models. However, integration is not straightforward, as the different types of data can provide contradictory information, and are intrinsically noisy, hence large scale integration has not been fully explored, to date. Here, we present an integrative parallel framework for GRN modelling, which employs evolutionary computation and different types of data to enhance model inference. Integration is performed at different levels. (i) An analysis of cross-platform integration of time series microarray data, discussing the effects on the resulting models and exploring cross-platform normalisation techniques, is presented. This shows that time-course data integration is possible, and results in models more robust to noise and parameter perturbation, as well as reduced noise over-fitting. (ii) Other types of measurements and knowledge, such as knock-out experiments, annotated transcription factors, binding site affinities and promoter sequences are integrated within the evolutionary framework to obtain more plausible GRN models. This is performed by customising initialisation, mutation and evaluation of candidate model solutions. The different data types are investigated and both qualitative and quantitative improvements are obtained. Results suggest that caution is needed in order to obtain improved models from combined data, and the case study presented here provides an example of how this can be achieved. Furthermore, (iii), RNA-seq data is studied in comparison to microarray experiments, to identify overlapping features and possibilities of integration within the framework. The extension of the framework to this data type is straightforward and qualitative improvements are obtained when combining predicted interactions from single-channel and RNA-seq datasets.
Topics in Signal Processing: applications in genomics and genetics
The information in genomic or genetic data is influenced by various complex processes, and appropriate mathematical modeling is required for studying the underlying processes and the data. This dissertation focuses on the formulation of mathematical models for certain problems in genomics and genetics studies and on the development of algorithms that provide efficient solutions. A Bayesian approach to transcription factor (TF) motif discovery is examined, and extensions are proposed to deal with the many interdependent parameters of TF-DNA binding. The problem is described in statistical terms, and a sequential Monte Carlo sampling method is employed for the estimation of unknown parameters. In particular, a class-based resampling approach is applied for the accurate estimation of a set of intrinsic properties of the DNA binding sites. Through statistical analysis of gene expression, a motif-based computational approach is developed for the inference of novel regulatory networks in a given bacterial genome. To deal with high false-discovery rates in genome-wide TF binding predictions, discriminative learning approaches are examined in the context of sequence classification, and a novel mathematical model is introduced to the family of kernel-based Support Vector Machine classifiers. Furthermore, the problem of haplotype phasing is examined based on genetic data obtained from cost-effective genotyping technologies. Based on the identification and augmentation of a small and relatively more informative genotype set, a sparse dictionary selection algorithm is developed to infer the haplotype pairs for the sampled population. In a related context, to detect redundant information in single nucleotide polymorphism (SNP) sites, the problem of representative (tag) SNP selection is introduced. An information-theoretic heuristic is designed for the accurate selection of tag SNPs that capture the genetic diversity in a large sample set from multiple populations. The method is based on a multi-locus mutual information measure, reflecting a biological principle in population genetics, namely linkage disequilibrium.
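To illustrate the quantity underlying the tag SNP heuristic, the following Python sketch estimates the mutual information between two SNP sites from observed genotypes. It is a pairwise simplification and not the multi-locus measure of the dissertation. The function name `mutual_information`, the 0/1/2 genotype coding, and the example vectors are assumptions made for this sketch.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Empirical mutual information (in bits) between two genotype vectors.

    x, y: equal-length sequences of genotype codes, one entry per sample.
    """
    n = len(x)
    joint = Counter(zip(x, y))  # counts of co-occurring genotype pairs
    count_x = Counter(x)        # marginal counts at the first site
    count_y = Counter(y)        # marginal counts at the second site
    return sum(
        (c / n) * log2((c / n) / ((count_x[a] / n) * (count_y[b] / n)))
        for (a, b), c in joint.items()
    )


# Hypothetical genotypes (0, 1, 2 = number of minor alleles) for six samples.
snp_a = [0, 0, 1, 1, 2, 2]
snp_b = [0, 0, 1, 1, 2, 2]  # perfectly linked with snp_a
snp_c = [0, 1, 0, 1, 0, 1]  # independent of snp_a in this sample

linked = mutual_information(snp_a, snp_b)
unlinked = mutual_information(snp_a, snp_c)
```

Here `linked` equals the entropy of `snp_a`, that is log2(3) bits, so either site could serve as a tag for the other, while `unlinked` is 0 because knowing one genotype tells nothing about the other. High mutual information between sites is the information-theoretic counterpart of strong linkage disequilibrium.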