Semantic distillation: a method for clustering objects by their contextual specificity
Techniques for data mining, latent semantic analysis, contextual search of
databases, etc. were developed long ago by computer scientists working on
information retrieval (IR). Experimental scientists, from all disciplines,
having to analyse large collections of raw experimental data (astronomical,
physical, biological, etc.) have developed powerful methods for their
statistical analysis and for clustering, categorising, and classifying objects.
Finally, physicists have developed a theory of quantum measurement, unifying
the logical, algebraic, and probabilistic aspects of queries into a single
formalism. The purpose of this paper is twofold: first to show that when
formulated at an abstract level, problems from IR, from statistical data
analysis, and from physical measurement theories are very similar and hence can
profitably be cross-fertilised, and, secondly, to propose a novel method of
fuzzy hierarchical clustering, termed semantic distillation and strongly
inspired by the theory of quantum measurement, which we developed to
analyse raw data coming from various types of experiments on DNA arrays. We
illustrate the method by analysing DNA array experiments and clustering the
genes of the array according to their specificity.
Comment: Accepted for publication in Studies in Computational Intelligence,
Springer-Verla
Integration and mining of malaria molecular, functional and pharmacological data: how far are we from a chemogenomic knowledge space?
The organization and mining of malaria genomic and post-genomic data is
highly motivated by the necessity to predict and characterize new biological
targets and new drugs. Biological targets are sought in a biological space
designed from the genomic data from Plasmodium falciparum, but also using the
millions of genomic data from other species. Drug candidates are sought in a
chemical space containing the millions of small molecules stored in public and
private chemolibraries. Data management should therefore be as reliable and
versatile as possible. In this context, we examined five aspects of the
organization and mining of malaria genomic and post-genomic data: 1) the
comparison of protein sequences including compositionally atypical malaria
sequences, 2) the high throughput reconstruction of molecular phylogenies, 3)
the representation of biological processes, particularly metabolic pathways, 4)
versatile methods to integrate genomic data, biological representations and
functional profiling obtained from X-omic experiments after drug treatments, and
5) the determination and prediction of protein structures and their molecular
docking with drug candidate structures. Progress toward a grid-enabled
chemogenomic knowledge space is discussed.
Comment: 43 pages, 4 figures, to appear in Malaria Journa
Eliciting the Functional Taxonomy from protein annotations and taxa
Advances in omics technologies have triggered the production of an enormous volume of data coming from thousands of species. Meanwhile, joint international efforts like the Gene Ontology (GO) consortium have worked to provide functional information for a vast number of proteins. With these data available, we have developed FunTaxIS, a tool that is the first attempt to infer functional taxonomy (i.e. how functions are distributed over taxa) by combining functional and taxonomic information. FunTaxIS defines a taxon-specific functional space by exploiting annotation frequencies to establish whether a function can or cannot be used to annotate a certain species. The tool generates constraints between GO terms and taxa and then propagates these relations over the taxonomic tree and the GO graph. Since these constraints cover nearly the whole taxonomy, it is possible to obtain the mapping of a function over the taxonomy. FunTaxIS can be used to make functional comparative analyses among taxa, to detect improper associations between taxa and functions, and to discover how functional knowledge is either distributed or missing. A benchmark test set based on six different model species has been devised to gain insight into the generated taxonomic rules.
Propagating semantic information in biochemical network models
Background: To enable automatic searches, alignments, and model combination, the elements of systems biology models need to be compared and matched across models. Elements can be identified by machine-readable biological annotations, but assigning such annotations and matching non-annotated elements is tedious work and calls for automation.
Results: A new method called "semantic propagation" allows the comparison of model elements based not only on their own annotations, but also on annotations of surrounding elements in the network. One may either propagate feature vectors, describing the annotations of individual elements, or quantitative similarities between elements from different models. Based on semantic propagation, we align partially annotated models and find annotations for non-annotated model elements.
Conclusions: Semantic propagation and model alignment are included in the open-source library semanticSBML, available on SourceForge. Online services for model alignment and for annotation prediction can be used at http://www.semanticsbml.org.
Functional coherence and annotation agreement metrics for enzyme families
Doctoral thesis, Informatics (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2015.
A range of methodologies is used to create sequence annotations, from manual curation by specialized curators to several automatic procedures. The multitude of existing annotation methods consequently generates annotation heterogeneity in terms of coverage and specificity across the biological sequence space. When comparing groups of similar sequences (such as protein families), this heterogeneity can introduce issues regarding the interpretation of the actual functional similarity and the overall functional coherence. A direct path to mitigate these issues is annotation extension within the protein families under analysis. This thesis postulates that protein families can be used as knowledge bases for their own annotation extension with the assistance of a proper functional coherence analysis. Therefore, a modular framework for functional coherence analysis and annotation extension in protein families was proposed. The framework includes a proposed module for functional coherence analysis that relies on graph visualization, term enrichment and other statistics. In this work the module was implemented and made available as a publicly accessible web application, GRYFUN, which can be accessed at http://xldb.di.fc.ul.pt/gryfun/. In addition, four metrics were developed to assess distinct aspects of coherence and completeness in protein families in conjunction with additional existing metrics. Therefore, the use of the complete proposed framework by curators can be regarded as a semi-automatic approach to annotation able to assist with protein annotation extension.
Fundação para a Ciência e a Tecnologia (FCT), SFRH/BD/48035/200
Bioinformatics protocols for analysis of functional genomics data applied to neuropathy microarray datasets
Microarray technology allows the simultaneous measurement of the
abundance of thousands of transcripts in living cells. The high-throughput
nature of microarray technology means that automatic analytical procedures
are required to handle the sheer amount of data typically generated in a single
microarray experiment. Along these lines, this work presents a contribution to
the automatic analysis of microarray data by attempting to construct protocols
for the validation of publicly available methods for microarray analysis.
At the experimental level, an evaluation of amplification of RNA targets prior
to hybridisation with the physical array was undertaken. This had the
important consequence of revealing the extent to which the significance of
intensity ratios between varying biological conditions may be compromised
following amplification as well as identifying the underlying cause of this
effect. On the basis of these findings, recommendations regarding the usability
of RNA amplification protocols with microarray screening were drawn in the
context of varying microarray experimental conditions.
On the data analysis side, this work has had the important outcome of
developing an automatic framework for the validation of functional analysis
methods for microarray data. This is based on using a GO semantic similarity
scoring metric to assess the similarity between functional terms found
enriched by functional analysis of a model dataset and those anticipated from
prior knowledge of the biological phenomenon under study. Using this
validation system, this work has shown, for the first time, that ‘Catmap’, an
early functional analysis method, performs better than the more recent and
most popular methods of its kind. Crucially, the effectiveness of this
validation system implies that such a system may be reliably adopted for
validation of newly developed functional analysis methods for microarray data.
Using Neural Networks for Relation Extraction from Biomedical Literature
Using different sources of information to support the automated extraction of
relations between biomedical concepts contributes to the development of our
understanding of biological systems. The primary comprehensive source of these
relations is biomedical literature. Several relation extraction approaches have
been proposed to identify relations between concepts in biomedical literature,
notably using neural network algorithms. The use of multichannel architectures
composed of multiple data representations, as in deep neural networks, is
leading to state-of-the-art results. The right combination of data
representations can eventually lead us to even higher evaluation scores in
relation extraction tasks. Thus, biomedical ontologies play a fundamental role
by providing semantic and ancestry information about an entity. The
incorporation of biomedical ontologies has already been shown to enhance
previous state-of-the-art results.
Comment: Artificial Neural Networks book (Springer) - Chapter 1