2,749 research outputs found

    Semantic distillation: a method for clustering objects by their contextual specificity

    Techniques for data mining, latent semantic analysis, contextual search of databases, etc. were developed long ago by computer scientists working on information retrieval (IR). Experimental scientists from all disciplines, who have to analyse large collections of raw experimental data (astronomical, physical, biological, etc.), have developed powerful methods for their statistical analysis and for clustering, categorising, and classifying objects. Finally, physicists have developed a theory of quantum measurement, unifying the logical, algebraic, and probabilistic aspects of queries into a single formalism. The purpose of this paper is twofold: first, to show that, when formulated at an abstract level, problems from IR, from statistical data analysis, and from physical measurement theories are very similar and hence can profitably be cross-fertilised; and second, to propose a novel method of fuzzy hierarchical clustering, termed semantic distillation and strongly inspired by the theory of quantum measurement, that we developed to analyse raw data coming from various types of experiments on DNA arrays. We illustrate the method by analysing DNA array experiments and clustering the genes of the array according to their specificity. Comment: Accepted for publication in Studies in Computational Intelligence, Springer-Verlag

    Integration and mining of malaria molecular, functional and pharmacological data: how far are we from a chemogenomic knowledge space?

    The organization and mining of malaria genomic and post-genomic data is driven by the need to predict and characterize new biological targets and new drugs. Biological targets are sought in a biological space built from the genomic data of Plasmodium falciparum, but also using the millions of genomic data entries from other species. Drug candidates are sought in a chemical space containing the millions of small molecules stored in public and private chemolibraries. Data management should therefore be as reliable and versatile as possible. In this context, we examined five aspects of the organization and mining of malaria genomic and post-genomic data: 1) the comparison of protein sequences, including compositionally atypical malaria sequences; 2) the high-throughput reconstruction of molecular phylogenies; 3) the representation of biological processes, particularly metabolic pathways; 4) versatile methods to integrate genomic data, biological representations, and functional profiling obtained from X-omic experiments after drug treatments; and 5) the determination and prediction of protein structures and their molecular docking with drug candidate structures. Progress toward a grid-enabled chemogenomic knowledge space is discussed. Comment: 43 pages, 4 figures, to appear in Malaria Journal

    Eliciting the Functional Taxonomy from protein annotations and taxa

    Advances in omics technologies have triggered the production of an enormous volume of data coming from thousands of species. Meanwhile, joint international efforts like the Gene Ontology (GO) consortium have worked to provide functional information for a vast number of proteins. With these data available, we have developed FunTaxIS, a tool that is the first attempt to infer functional taxonomy (i.e. how functions are distributed over taxa) by combining functional and taxonomic information. FunTaxIS defines a taxon-specific functional space by exploiting annotation frequencies to establish whether a function can or cannot be used to annotate a certain species. The tool generates constraints between GO terms and taxa and then propagates these relations over the taxonomic tree and the GO graph. Since these constraints cover nearly the whole taxonomy, it is possible to obtain the mapping of a function over the taxonomy. FunTaxIS can be used to make functional comparative analyses among taxa, to detect improper associations between taxa and functions, and to discover how functional knowledge is either distributed or missing. A benchmark test set based on six different model species has been devised to gain useful insights into the generated taxonomic rules.

    Propagating semantic information in biochemical network models

    Background: To enable automatic searches, alignments, and model combination, the elements of systems biology models need to be compared and matched across models. Elements can be identified by machine-readable biological annotations, but assigning such annotations and matching non-annotated elements is tedious work and calls for automation. Results: A new method called "semantic propagation" allows the comparison of model elements based not only on their own annotations, but also on annotations of surrounding elements in the network. One may propagate either feature vectors, describing the annotations of individual elements, or quantitative similarities between elements from different models. Based on semantic propagation, we align partially annotated models and find annotations for non-annotated model elements. Conclusions: Semantic propagation and model alignment are included in the open-source library semanticSBML, available on SourceForge. Online services for model alignment and for annotation prediction can be used at http://www.semanticsbml.org.

    Functional coherence and annotation agreement metrics for enzyme families

    Doctoral thesis, Informatics (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2015. A range of methodologies is used to create sequence annotations, from manual curation by specialized curators to several automatic procedures. The multitude of existing annotation methods consequently generates annotation heterogeneity in terms of coverage and specificity across the biological sequence space. When comparing groups of similar sequences (such as protein families), this heterogeneity can introduce issues regarding the interpretation of the actual functional similarity and the overall functional coherence. A direct path to mitigating these issues is annotation extension within the protein families under analysis. This thesis postulates that protein families can be used as knowledge bases for their own annotation extension with the assistance of a proper functional coherence analysis. Therefore, a modular framework for functional coherence analysis and annotation extension in protein families was proposed. The framework includes a proposed module for functional coherence analysis that relies on graph visualization, term enrichment, and other statistics. In this work the module was implemented and made available as a publicly accessible web application, GRYFUN, which can be accessed at http://xldb.di.fc.ul.pt/gryfun/. In addition, four metrics were developed to assess distinct aspects of annotation coherence and completeness in protein families, in conjunction with additional existing metrics. The use of the complete proposed framework by curators can therefore be regarded as a semi-automatic approach to annotation, able to assist with protein annotation extension. Fundação para a Ciência e a Tecnologia (FCT), SFRH/BD/48035/200

    Bioinformatics protocols for analysis of functional genomics data applied to neuropathy microarray datasets

    Microarray technology allows the simultaneous measurement of the abundance of thousands of transcripts in living cells. The high-throughput nature of microarray technology means that automatic analytical procedures are required to handle the sheer amount of data typically generated in a single microarray experiment. Along these lines, this work contributes to the automatic analysis of microarray data by attempting to construct protocols for the validation of publicly available methods for microarray analysis. At the experimental level, an evaluation of amplification of RNA targets prior to hybridisation with the physical array was undertaken. This had the important consequence of revealing the extent to which the significance of intensity ratios between varying biological conditions may be compromised following amplification, as well as identifying the underlying cause of this effect. On the basis of these findings, recommendations regarding the usability of RNA amplification protocols with microarray screening were drawn in the context of varying microarray experimental conditions. On the data analysis side, this work has had the important outcome of developing an automatic framework for the validation of functional analysis methods for microarray data. This is based on using a GO semantic similarity scoring metric to assess the similarity between functional terms found enriched by functional analysis of a model dataset and those anticipated from prior knowledge of the biological phenomenon under study. Using this validation system, this work has shown, for the first time, that ‘Catmap’, an early functional analysis method, performs better than the more recent and most popular methods of its kind. Crucially, the effectiveness of this validation system implies that such a system may be reliably adopted for the validation of newly developed functional analysis methods for microarray data.

    Using Neural Networks for Relation Extraction from Biomedical Literature

    Using different sources of information to support the automated extraction of relations between biomedical concepts contributes to the development of our understanding of biological systems. The primary comprehensive source of these relations is the biomedical literature. Several relation extraction approaches have been proposed to identify relations between concepts in biomedical literature, notably those using neural network algorithms. The use of multichannel architectures composed of multiple data representations, as in deep neural networks, is leading to state-of-the-art results. The right combination of data representations can eventually lead to even higher evaluation scores in relation extraction tasks. In this respect, biomedical ontologies play a fundamental role by providing semantic and ancestry information about an entity. The incorporation of biomedical ontologies has already been shown to enhance previous state-of-the-art results. Comment: Artificial Neural Networks book (Springer) - Chapter 1