7 research outputs found

    Metrics for GO based protein semantic similarity: a systematic evaluation

    Background: Several semantic similarity measures have been applied to gene products annotated with Gene Ontology terms, providing a basis for their functional comparison. However, it is still unclear which is the best approach to semantic similarity in this context, since there is no conclusive evaluation of the various measures. Another issue is whether electronic annotations should be used in semantic similarity calculations. Results: We conducted a systematic evaluation of GO-based semantic similarity measures using the relationship with sequence similarity as a means to quantify their performance, and assessed the influence of electronic annotations by testing the measures in the presence and absence of these annotations. We verified that the relationship between semantic and sequence similarity is not linear, but can be well approximated by a rescaled Normal cumulative distribution function. Given that the majority of the semantic similarity measures capture an identical behaviour, but differ in resolution, we used the latter as the main criterion of evaluation. Conclusions: This work has provided a basis for the comparison of several semantic similarity measures, and can aid researchers in choosing the most adequate measure for their work. We have found that the hybrid simGIC was the measure with the best overall performance, followed by Resnik's measure using a best-match average combination approach. We have also found that the average and maximum combination approaches are problematic since both are inherently influenced by the number of terms being combined. We suspect that there may be a direct influence of data circularity in the behaviour of the results including electronic annotations, as a result of functional inference from sequence similarity.
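    As a rough illustration of the simGIC measure named in this abstract, the sketch below computes it as a weighted Jaccard index: the summed information content (IC) of the GO terms two proteins share, divided by the summed IC of all terms annotated to either. It assumes each annotation set has already been extended with ancestor terms, and the term names, counts, and function names are hypothetical toy values, not data or code from the paper.

```python
import math


def information_content(term_counts, total):
    """Return IC(t) = -ln(p(t)) for each term, where p(t) is the
    fraction of annotated gene products that carry term t."""
    return {term: -math.log(count / total) for term, count in term_counts.items()}


def sim_gic(terms_a, terms_b, ic):
    """Weighted Jaccard index: summed IC of the shared terms divided by
    the summed IC of the union of terms. Returns 0.0 if the union carries
    no information (for example, only the root term is present)."""
    union = terms_a | terms_b
    denominator = sum(ic[t] for t in union)
    if denominator == 0:
        return 0.0
    return sum(ic[t] for t in terms_a & terms_b) / denominator


# Hypothetical toy corpus: how many of 100 gene products carry each term.
toy_counts = {"root": 100, "binding": 50, "catalysis": 25, "kinase": 10}
toy_ic = information_content(toy_counts, total=100)

# Annotation sets, assumed to already include ancestor terms.
protein_a = {"root", "binding", "kinase"}
protein_b = {"root", "binding", "catalysis"}

similarity = sim_gic(protein_a, protein_b, toy_ic)
print(f"simGIC(a, b) = {similarity:.4f}")
```

    Because the score is a ratio over the union of terms rather than an average or maximum over term pairs, it is normalised by the information in both annotation sets; the pairwise average and maximum combinations are the approaches the abstract reports as sensitive to the number of terms being combined.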

    Gene function prediction in five model eukaryotes exclusively based on gene relative location through machine learning

    The function of most genes is unknown. The best results in automated function prediction are obtained with machine learning-based methods that combine multiple data sources, typically sequence-derived features, protein structure and interaction data. Even though there is ample evidence showing that a gene's function is not independent of its location, the few available examples of gene function prediction based on gene location rely on sequence identity between genes of different organisms and are thus subject to the limitations of the relationship between sequence and function. Here we predict thousands of gene functions in five model eukaryotes (Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus and Homo sapiens) using machine learning models trained exclusively with features derived from the location of genes in the genomes to which they belong. Our aim was not to obtain the best-performing method for automated function prediction but to explore the extent to which a gene's location can predict its function in eukaryotes. We found that our models outperform BLAST when predicting terms from the Biological Process and Cellular Component ontologies, showing that, at least in some cases, gene location alone can be more useful than sequence to infer gene function. ANII: FSDA_1_2017_1_1424

    Patterns of protein expression in tissues of the killifish, Fundulus heteroclitus and Fundulus grandis

    Fundulus is a diverse and widespread genus of small teleost fish of North America. Due to its high tolerance for physicochemical variation (e.g. temperature, oxygen, salinity), Fundulus is a model organism for studying physiological and molecular adaptations to environmental stress. The thesis focuses on patterns of protein expression in Fundulus heteroclitus and F. grandis. The patterns of protein expression were investigated using traditional methods of enzyme activity measurement and recent proteomic approaches. The findings of the study can be used to guide future studies on the proteomic responses of vertebrates to environmental stress. Chapter 2 focuses on measurement of the temporal effects of oxygen treatments on the maximal specific activities of nine glycolytic enzymes in liver and skeletal muscle during chronic exposure (28 d) of Fundulus heteroclitus. The fish were exposed to four different oxygen treatments: hyperoxia, normoxia, moderate hypoxia, and severe hypoxia. The time course of changes in maximal glycolytic enzyme specific activities was assessed at 0, 8, 14 and 28 d. The results demonstrate that chronic hypoxia alters the capacity for carbohydrate metabolism in F. heteroclitus, with the important observation that the responses are both tissue- and enzyme-specific. Chapter 3 studies the effect of tissue storage on the protein profiles of tissues of F. grandis. One-dimensional gel electrophoresis (1D-SDS-PAGE) was used to compare two tissue preservation methods, flash freezing in liquid nitrogen versus immersion of fresh tissue in RNAlater, for five tissues (liver, skeletal muscle, brain, gill, and heart), followed by LC-MS/MS to identify protein bands that were differentially stabilized in gill and liver. The study shows that, in F. grandis, the preferred method of preservation was tissue specific. Chapter 4 focuses on the use of advanced 2DE-MS/MS to characterize the proteome of multiple tissues in F. grandis. Database searching resulted in the identification of 253 non-redundant proteins in five tissues: liver, muscle, brain, gill, and heart. Identifications include enzymes of energy metabolism, heat shock proteins, and structural proteins. The protein identification rate was approximately 50% of the protein spots analyzed. This identification rate for a species without a sequenced genome demonstrates the utility of F. grandis as a model organism for environmental proteomic studies in vertebrates.

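    Implementación de clasificadores jerárquicos multiclase para la predicción de función de genes a partir de su ubicación en el genoma [Implementation of multiclass hierarchical classifiers for predicting gene function from genomic location]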

    Recent technological development is generating genomic data much faster than our ability to analyze it. In this context, it is essential to implement tools that reduce the time and cost needed to determine gene functions experimentally, given that the function of most genes is still unknown. To alleviate this problem, various gene function prediction methods have been developed in recent decades. Some are based on sequence alignments with proteins whose function has been established experimentally [Clark and Radivojac, 2011; Martin et al., 2004; Engelhardt et al., 2005], and others exploit other types of data: protein structures [Pal and Eisenberg, 2005; Pazos and Sternberg, 2004], gene expression levels [Huttenhower et al., 2006], temporal transcription profiles [Pazos Obregón et al., 2015], macromolecular interactions [Letovsky and Kasif, 2003; Nabieva et al., 2005], or a combination of several of these data types. Although genes with the same function are known to cluster in different ways in the genome, and a gene's position in the genome is not independent of its biological function, the potential of that position as a predictor of function remains largely unexplored in eukaryotic organisms. In this work, a model is implemented to predict gene functions in five model organisms, using data generated from gene positions in the genome and from known functions. The results indicate that, for some organisms and ontologies, the position of a gene is a better predictor of its function than its sequence.

    The relationship between protein sequences and their gene ontology functions

    Background: One main research challenge in the post-genomic era is to understand the relationship between protein sequences and their biological functions. In recent years, several automated annotation systems have been developed for the functional assignment of uncharacterized proteins. The underlying assumption of these systems is that similar sequences imply similar biological functions. However, it has been noted that matching sequences do not always imply similar functions. Results: In this paper, we present the correlation between protein sequences and protein functions for the yeast proteome in the context of gene ontology. A novel measure is introduced to define the overall similarity between two protein sequences. The effects of the level as well as the size of a gene ontology group on the degree of similarity were studied. The similarity distributions at different levels of gene ontology trees are presented. To evaluate the theoretical prediction power of similar sequences, we computed the posterior probability of correct predictions. Conclusion: The results indicate that protein pairs of similar biological functions tend to have higher sequence similarity, although the similarity distribution in each functional group is heterogeneous and varies from group to group. We conclude that sequence similarity can serve as a key measure in protein function prediction. However, the resulting annotations must be verified through other means. A method that combines a broader range of measures is more likely to provide more accurate prediction. Our study indicates that the posterior probability of a correct prediction could serve as one of the key measures.