16 research outputs found

Compositional Mining of Multi-Relational Biological Datasets

Author: Jin Ying
Murali T.M.
Ramakrishnan Naren
Publication venue
Publication date: 01/01/2007
Field of study

High-throughput biological screens are yielding ever-growing streams of information about multiple aspects of cellular activity. As more and more categories of datasets come online, there is a corresponding multitude of ways in which inferences can be chained across them, motivating the need for compositional data mining algorithms. In this paper, we argue that such compositional data mining can be effectively realized by functionally cascading redescription mining and biclustering algorithms as primitives. Both these primitives mirror shifts of vocabulary that can be composed in arbitrary ways to create rich chains of inferences. Given a relational database and its schema, we show how the schema can be automatically compiled into a compositional data mining program, and how different domains in the schema can be related through logical sequences of biclustering and redescription invocations. This feature allows us to rapidly prototype new data mining applications, yielding greater understanding of scientific datasets. We describe two applications of compositional data mining: (i) matching terms across categories of the Gene Ontology and (ii) understanding the molecular mechanisms underlying stress response in human cells

Computer Science Technical Reports @Virginia Tech

CiteSeerX

Bit-Table Based Biclustering and Frequent Closed Itemset Mining in High-Dimensional Binary Data

Author: András Király
Attila Gyenesei
János Abonyi
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2014
Field of study

Over the last decade, various algorithms have been developed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields in this research, proposed independently, are frequent itemset mining (developed for market basket data) and biclustering (applied to gene expression data analysis). The common limitation of both methodologies is their limited applicability to very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered quickly. The proposed algorithm has been implemented in the commonly used MATLAB environment and is freely available to researchers.
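The matrix-multiplication idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' MATLAB implementation, and it does not mine closed itemsets or biclusters; it only shows the underlying observation that, for a binary transaction-by-item table B, the product of B transposed with B holds the support of every item pair. The function names and the toy table are hypothetical.

```python
def cooccurrence_counts(bit_table):
    """Compute B^T * B for a binary table (rows = transactions, columns = items).

    Entry [i][j] is the number of rows containing both item i and item j;
    the diagonal holds the support of each single item.
    """
    n_items = len(bit_table[0])
    counts = [[0] * n_items for _ in range(n_items)]
    for row in bit_table:
        for i in range(n_items):
            if row[i]:
                for j in range(n_items):
                    counts[i][j] += row[j]
    return counts


def frequent_pairs(bit_table, min_support):
    """Return {(i, j): support} for item pairs meeting the support threshold."""
    counts = cooccurrence_counts(bit_table)
    n_items = len(counts)
    return {
        (i, j): counts[i][j]
        for i in range(n_items)
        for j in range(i + 1, n_items)
        if counts[i][j] >= min_support
    }


# Hypothetical toy data: 4 transactions over 4 items.
bit_table = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
]

print(frequent_pairs(bit_table, min_support=2))
```

In a numerical environment the same counts would come from a single matrix product, which is presumably where the speed of a bit-table approach comes from; larger itemsets need further steps that this sketch leaves out.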

Crossref

Directory of Open Access Journals

PubMed Central

Plsi: A Computational Software Pipeline For Pathway Level Disease Subtype Identification

Author: Donato Michele
Publication venue: DigitalCommons@WayneState
Publication date: 01/01/2015
Field of study

It is accepted that many complex diseases, like cancer, consist of collections of distinct genetic diseases. Clinical advances in treatments are attributed to molecular treatments aimed at specific genes, resulting in greater efficacy and fewer debilitating side effects. This proves that it is important to identify and appropriately treat each individual disease subtype. Our current understanding of subtypes is limited: despite advances in targeted treatment, targeted therapies often fail for some patients. The main limitation of current methods for subtype identification is that they focus on gene expression and are subject to its intrinsic noise. Signaling pathways describe biological processes that are carried out by networks of genes interacting with each other. We developed PLSI, a software tool that identifies the specific pathways impacted in individual patients, subgroups of patients, or a given subtype of disease. The expected impact includes a better understanding of disease and resistance to treatment.

Digital Commons@Wayne State University

Compositional mining of multirelational biological datasets

Author: Agrawal R.
Ball C.
Bayardo R.
Benjamini Y.
Matzke M.
Michalski R.
Murali T.
Naren Ramakrishnan
Parida L.
T. M. Murali
Ying Jin
Zaki M.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date
Field of study

Crossref

Biclustering sobre datos de expresión génica basado en búsqueda dispersa

Author: Nepomuceno Chamorro Juan Antonio
Publication venue
Publication date: 21/07/2015
Field of study

Gene expression data, with their particular nature and importance, motivate not only the development of new techniques but also the formulation of new problems such as the biclustering problem. Biclustering is an unsupervised learning technique that groups both genes and conditions. This two-way grouping distinguishes it from traditional clustering on this type of data, which groups either genes or conditions only. This thesis presents a new biclustering algorithm that allows different search criteria to be studied. The algorithm uses a scatter search scheme, which makes the search mechanism independent of the criterion employed. Three different search criteria have been studied, and they motivate the three main contributions of the thesis. First, the linear correlation between genes is studied and integrated into the objective function used by the biclustering algorithm. Linear correlation makes it possible to find biclusters with shifting and scaling patterns, which improves on earlier proposals. Second, motivated by the biological significance of activation-inhibition patterns between genes, the linear correlation is modified so that these patterns are taken into account. Finally, the information available about genes in public repositories, such as the Gene Ontology (GO), has been taken into account and incorporated into the search criterion. An extra term is added that reflects, for each bicluster evaluated, the quality of that group of genes according to the information stored about them in GO. Two options for this biological-information term are studied and compared with each other, and the results are shown to be better when biological information is used in the biclustering algorithm.
The three contributions described, together with a series of intermediate steps, have led to results published in journals and at national and international conferences.
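A minimal sketch of a correlation-based bicluster score may help make the first two search criteria concrete. It is not the objective function of the thesis, whose exact form the abstract does not give; it only assumes that bicluster quality can be summarised as the mean pairwise Pearson correlation between the selected genes over the selected conditions, with an optional absolute value to count activation-inhibition (negatively correlated) patterns. All names and the toy matrix are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs assumed)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_correlation(matrix, genes, conditions, use_abs=False):
    """Average pairwise correlation of the chosen gene rows over the chosen columns.

    With use_abs=True, strongly negative correlations count as much as
    strongly positive ones.
    """
    rows = [[matrix[g][c] for c in conditions] for g in genes]
    pairs = list(combinations(rows, 2))
    total = 0.0
    for a, b in pairs:
        r = pearson(a, b)
        total += abs(r) if use_abs else r
    return total / len(pairs)


# Hypothetical expression matrix: rows are genes, columns are conditions.
expression = [
    [1.0, 2.0, 4.0, 3.0],   # base pattern
    [6.0, 7.0, 9.0, 8.0],   # base shifted by +5
    [2.0, 4.0, 8.0, 6.0],   # base scaled by 2
    [9.0, 8.0, 6.0, 7.0],   # inverted pattern (10 - base)
]

print(mean_correlation(expression, [0, 1, 2], [0, 1, 2, 3]))
print(mean_correlation(expression, [0, 1, 2, 3], [0, 1, 2, 3]))
print(mean_correlation(expression, [0, 1, 2, 3], [0, 1, 2, 3], use_abs=True))
```

Shifted and scaled copies of a pattern are perfectly correlated, so the first score is close to 1; adding the inverted gene pulls the signed score down to about 0, while the absolute variant still treats it as part of the same bicluster.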

idUS. Depósito de Investigación Universidad de Sevilla

Comparative analysis of gene duplications and their impact on expression levels in nematode genomes

Author: Baskaran Praveen
Publication venue: Universität Tübingen
Publication date: 01/01/2017
Field of study

Gene duplication is a major mechanism that plays a vital role in different evolutionary innovations, ranging from generating novel traits to phenotypic plasticity. The evolutionary impact of gene duplication and the fate of duplicated genes have been studied in detail. However, little is known about the impact of gene duplication on gene expression with respect to different evolutionary time scales. Here, we study genome-wide patterns of gene duplications in nematodes and assess their effect on expression levels. This study encompasses various macroevolutionary comparisons at different time scales and microevolutionary comparisons within the species Pristionchus pacificus. At the macroevolutionary level, by comparing species that separated more than 280 million years ago, we found various lineage-specific expansions in multiple gene families along the Pristionchus lineage. Moreover, we found that duplicated genes are highly enriched among developmentally regulated genes. Interestingly, the results also show evidence for selection on duplication to increase gene expression levels in a developmental stage-specific manner. To gain insights into the microevolution of gene expression levels after gene duplication, we compared different strains of P. pacificus and found that an additional gene copy usually does not increase gene expression levels in the different strains. Furthermore, we found a strong depletion of duplicated genes in large parts of the P. pacificus genome, pointing towards negative selection against gene duplication. This shows that the impact on gene expression levels following gene duplication differs dramatically: selection for increased gene dosage dominates macroevolution, while negative selection on gene duplication dominates at the within-species level. This led us to wonder what happens at the intermediate time scale. We compared recent duplicates of P. pacificus with their single-copy orthologs in two closely related species and found a pattern similar to the microevolutionary trend. Additionally, a comparison of closely related species of the Strongyloides genus and its developmental transcriptome also shows an overall strong depletion of duplicated genes, similar to the observation at the microevolutionary level. At the same time, a strong enrichment of duplicated genes was found at a developmental stage associated with the parasitic activity of the nematodes. Similar to the macroevolutionary picture of P. pacificus, we also found selection for higher gene dosage in parasitism-associated gene families of S. papillosus, indicating the adaptive potential of duplicated genes. Even though these studies show widespread selection against both duplication and changes in gene expression, duplications are favoured in some conditions, leading to adaptive changes in the organism. Overall, this indicates that the regulation of expression levels of duplicated genes has been subject to different selection processes at different time scales, which represents a complex interplay between different evolutionary processes like natural selection, population dynamics, and genetic drift.

Publikationsserver der Universität Tübingen

Similitud funcional de genes basada en conocimiento biológico

Author: Diaz-Diaz Norberto
Publication venue
Publication date: 01/01/2012
Field of study

Programa de Doctorado en Tecnología e Ingeniería del Software

Over the last few years, our knowledge about biological processes in living organisms has greatly expanded both in quantity and resolution, mostly thanks to the introduction of high-throughput sequencing technology. Making sense of this vast amount of biological data through methods such as automated learning is therefore critical to gain further insights into the molecular mechanisms behind fundamental biological processes. This work aims at establishing the quality of new genetic models based on actual biological data. First, a tool for analyzing the coherence of a group of genes according to their common role in metabolic processes is developed. This tool allows the evaluation and validation of different gene sets obtained through any clustering technique. Additionally, a novel measure of the functional similarity of a group of genes is introduced. This measure, called GFD, is based on the Gene Ontology, and it assigns a numerical value to a gene set for each of the three GO ontologies. Concretely, GFD computes the similarity based only on the most common and specific functionality of the genes. GFD compares favorably against the most relevant measures. Our approach is especially relevant in the study of genes that are involved in several functions.

Universidad Pablo de Olavide de Sevilla. Departamento de Deporte e Informática

Repositorio Institucional Olavide

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Gene regulatory network modelling with evolutionary algorithms -an integrative approach

Author: Sîrbu Alina
Publication venue: Dublin City University. School of Computing
Publication date: 01/01/2011
Field of study

Building models for gene regulation has been an important aim of Systems Biology over the past years, driven by the large amount of gene expression data that has become available. Models represent regulatory interactions between genes and transcription factors and can provide a better understanding of biological processes, as well as a means of simulating both natural and perturbed systems (e.g. those associated with disease). Gene regulatory network (GRN) quantitative modelling is still limited, however, due to data issues such as noise and the restricted length of the time series typically used for GRN reverse engineering. These issues create an under-determination problem, with many models possibly fitting the data. However, large amounts of other types of biological data and knowledge are available, such as cross-platform measurements, knockout experiments, annotations, binding site affinities for transcription factors and so on. It has been postulated that integrating these can improve the quality of the models obtained, by facilitating further filtering of possible models. However, integration is not straightforward, as the different types of data can provide contradictory information and are intrinsically noisy; hence large-scale integration has not been fully explored to date. Here, we present an integrative parallel framework for GRN modelling, which employs evolutionary computation and different types of data to enhance model inference. Integration is performed at different levels. (i) An analysis of cross-platform integration of time series microarray data, discussing the effects on the resulting models and exploring cross-platform normalisation techniques, is presented. This shows that time-course data integration is possible, and results in models more robust to noise and parameter perturbation, as well as reduced noise over-fitting.
(ii) Other types of measurements and knowledge, such as knockout experiments, annotated transcription factors, binding site affinities and promoter sequences, are integrated within the evolutionary framework to obtain more plausible GRN models. This is performed by customising the initialisation, mutation and evaluation of candidate model solutions. The different data types are investigated, and both qualitative and quantitative improvements are obtained. Results suggest that caution is needed in order to obtain improved models from combined data, and the case study presented here provides an example of how this can be achieved. Furthermore, (iii) RNA-seq data is studied in comparison to microarray experiments, to identify overlapping features and possibilities of integration within the framework. The extension of the framework to this data type is straightforward, and qualitative improvements are obtained when combining predicted interactions from single-channel and RNA-seq datasets.

Irish Universities

Archivio della Ricerca - Università di Pisa

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

DCU Online Research Access Service

Topics in Signal Processing: applications in genomics and genetics

Author: Elmas Abdulkadir
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2016
Field of study

The information in genomic or genetic data is influenced by various complex processes, and appropriate mathematical modeling is required for studying the underlying processes and the data. This dissertation focuses on the formulation of mathematical models for certain problems in genomics and genetics studies and on the development of algorithms that provide efficient solutions. A Bayesian approach to transcription factor (TF) motif discovery is examined, and extensions are proposed to deal with the many interdependent parameters of TF-DNA binding. The problem is described in statistical terms, and a sequential Monte Carlo sampling method is employed for the estimation of unknown parameters. In particular, a class-based resampling approach is applied for the accurate estimation of a set of intrinsic properties of the DNA binding sites. Through statistical analysis of gene expression, a motif-based computational approach is developed for the inference of novel regulatory networks in a given bacterial genome. To deal with high false-discovery rates in genome-wide TF binding predictions, discriminative learning approaches are examined in the context of sequence classification, and a novel mathematical model is introduced to the family of kernel-based Support Vector Machine classifiers. Furthermore, the problem of haplotype phasing is examined based on genetic data obtained from cost-effective genotyping technologies. Based on the identification and augmentation of a small and relatively more informative genotype set, a sparse dictionary selection algorithm is developed to infer the haplotype pairs for the sampled population. In a related context, to detect redundant information in single nucleotide polymorphism (SNP) sites, the problem of representative (tag) SNP selection is introduced. An information-theoretic heuristic is designed for the accurate selection of tag SNPs that capture the genetic diversity in a large sample set from multiple populations. The method is based on a multi-locus mutual information measure, reflecting linkage disequilibrium, a biological principle in population genetics.
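The mutual information idea behind tag SNP selection can be sketched in a few lines. This is not the dissertation's multi-locus measure or its selection heuristic, which the abstract does not specify; it is a simplified pairwise version, assuming genotypes are coded as small integers per sample, and it only shows why a SNP in strong linkage disequilibrium with another carries redundant information. The function name and the toy genotype vectors are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Empirical mutual information (in bits) between two genotype vectors."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        p_joint = c / n
        p_a = count_x[a] / n
        p_b = count_y[b] / n
        mi += p_joint * log2(p_joint / (p_a * p_b))
    return mi


# Hypothetical genotypes for four samples at three SNP sites.
snp_a = [0, 0, 1, 1]
snp_b = [0, 0, 1, 1]   # identical to snp_a: fully redundant
snp_c = [0, 1, 0, 1]   # statistically independent of snp_a in this sample

print(mutual_information(snp_a, snp_b))
print(mutual_information(snp_a, snp_c))
```

A site that shares high mutual information with an already chosen tag SNP adds little new information, so a selection heuristic of this kind would tend to skip it in favour of a less redundant site.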

Columbia University Academic Commons