10 research outputs found

    Quantitative assessment of relationship between sequence similarity and function similarity

    Background: Comparative sequence analysis is considered the first step towards annotating new proteins in genome annotation. However, sequence comparison may lead to the creation and propagation of function assignment errors. Thus, it is important to analyse the quality of sequence-based function assignment thoroughly and systematically using large-scale data. Results: We present an analysis of the relationship between sequence similarity and function similarity for the proteins in four model organisms, i.e., Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Drosophila melanogaster. Using a measure of functional similarity based on the three categories of Gene Ontology (GO) classifications (biological process, molecular function, and cellular component), we quantified the correlation between functional similarity and sequence similarity, measured by sequence identity or by the statistical significance of the alignment, and compared this correlation against that of randomly chosen protein pairs. Conclusion: Various sequence-function relationships were identified from BLAST versus PSI-BLAST, sequence identity versus Expectation Value, GO indices versus semantic similarity approaches, and within-genome versus between-genome comparisons, for the three GO categories. Our study provides a benchmark to estimate the confidence in assignment of functions purely based on sequence similarity.

    An account of conserved functions and how biologists use them to integrate cell and evolutionary biology

    In this paper, we characterize a type of functional explanation that addresses why a homologous trait that originated deep in the evolutionary history of a clade is observed to have remained widespread and largely unchanged across many lineages in the clade. We argue this type of explanation is provided when evolutionary biologists attribute conserved functions to traits, both phenotypic and genetic. The concept of conserved function applies broadly to many biological domains, but we illustrate its importance in particular using examples at the intersection of evolution and cell biology. We also show how the study of conserved functions serves to integrate knowledge of both a trait’s evolutionary history of natural selection and its causal effects on fitness, but in an overlooked way that does not rely on positive selection. Moreover, we show how conserved function provides a novel basis for addressing several objections against evolutionary functions raised by Robert Cummins.

    Mining a Chinese hyperthermophilic metagenome

    Philosophiae Doctor - PhD. Metagenomic sequencing of environmental samples provides direct access to the genomic information of organisms within the respective environments. This sequence information represents a significant resource for the identification and subsequent characterization of potentially novel genes, or known genes with acquired novel characteristics. Within this context, thermophilic environments are of particular interest due to their potential for yielding novel thermostable enzymes with biotechnological and industrial applications. In this work, metagenomic library construction, random sequencing and sequence analysis strategies were employed to enhance the identification and characterisation of potentially novel genes from a thermophilic soil sample. High molecular weight metagenomic DNA was extracted from two Chinese hydrothermal soil samples. This was used as source material for the construction of four genomic DNA libraries. The combined libraries were estimated to contain in the order of 1.3 million genes, which provides a rich resource for gene identification. Approximately 70 kbp of sequence data was generated from one of the libraries as a resource for sequence-based analysis. Initial BLAST analysis predicted the presence of 53 ORFs/partial ORFs. The BLAST similarity scores for the investigated ORFs were sufficiently high (>40%) to infer homology with database proteins while also being indicative of novel sequence variants of these database matches. In an attempt to recover more full-length ORFs, a novel strategy based on WGA technology was employed. This resulted in the recovery of the near-complete sequence of partial ORF5 directly from the WGA DNA of the environmental sample. While the full-length ORF5 could not be recovered, the feasibility of this novel approach for enhanced metagenomic sequence recovery was proved in principle.
The implementation of multiple in silico strategies resulted in the identification of two ORFs, classified as homologs of the DUF29 and Usp protein families respectively. The functional inference obtained from the integrated in silico predictions was furthermore highly suggestive of a putative nucleotide binding/interaction role for both ORFs. A putative novel DNA polymerase gene (denoted TC11pol) was identified from the sequence data. However, expression and characterization of the full-length TC11pol did not result in detectable polymerase activity. A homology modeling approach proved successful for deriving a structural model of the polymerase, which was used for: (i) deriving functional inferences about the potential activities of the polymerase and (ii) designing a 5’ exonuclease deletion mutant for functional analysis. Expression and subsequent functional characterization of the putative 5’exo- TC11pol mutant resulted in detectable polymerase and 3’-5’ exonuclease activity at 37 and 45 °C, following a heat denaturation step at 55 °C for 1 hour. It was therefore concluded that the putative 5’exo- TC11pol mutant was functionally equivalent to the Klenow fragment of E. coli, while exhibiting increased thermostability.

    Discovery and Analysis of Aligned Pattern Clusters from Protein Family Sequences

    Protein sequences are essential for encoding molecular structures and functions. Consequently, biologists invest substantial resources and time discovering functional patterns in proteins. Using high-throughput technologies, biologists are generating an increasing amount of data. Thus, the major challenge in biosequencing today is the ability to conduct data analysis in an efficient and productive manner. Conserved amino acids in proteins reveal important functional domains within protein families. Conversely, less conserved amino acid variations within these protein sequence patterns reveal areas of evolutionary and functional divergence. Exploring protein families using existing methods such as multiple sequence alignment is computationally expensive, thus pattern search is used. However, at present, combinatorial methods of pattern search generate a large set of solutions, and probabilistic methods require richer representations. The latter require biological ground truth of the input sequences, such as gene name or taxonomic species, as class labels based on traditional classification practice to train a model for predicting unknown sequences. However, these algorithms are inherently biased by mislabelling and may not be able to reveal class characteristics in a detailed and succinct manner. A novel pattern representation called an Aligned Pattern Cluster (AP Cluster), developed in this dissertation, is compact yet rich. It captures conservations and variations of amino acids, covers more sequences with lower entropy, and greatly reduces the number of patterns. AP Clusters contain statistically significant patterns with variations; their importance has been confirmed by the following biological evidence: 1) Most of the discovered AP Clusters correspond to binding segments while their aligned columns correspond to binding sites as verified by Pfam, PROSITE, and the three-dimensional structure.
2) By compacting strongly correlated functional information together, AP Clusters are able to reveal class characteristics for taxonomical classes, gene classes and other functional classes, or incorrect class labelling. 3) AP Clusters that co-occur on the same homologous protein sequences are spatially close in the protein's three-dimensional structure. These results demonstrate the power and usefulness of AP Clusters. They bring similar statistically significant patterns with variation together and align them to reveal protein regional functionality, class characteristics, and binding and interacting sites for the study of protein-protein and protein-drug interactions, for differentiation of cancer tumour types, for targeted gene therapy, and for drug target discovery.

    Knowledge derivation and data mining strategies for probabilistic functional integrated networks

    PhD. One of the fundamental goals of systems biology is the experimental verification of the interactome: the entire complement of molecular interactions occurring in the cell. Vast amounts of high-throughput data have been produced to aid this effort. However, these data are incomplete and contain high levels of both false positives and false negatives. In order to combat these limitations in data quality, computational techniques have been developed to evaluate the datasets and integrate them in a systematic fashion using graph theory. The result is an integrated network which can be analysed using a variety of network analysis techniques to draw new inferences about biological questions and to guide laboratory experiments. Individual research groups are interested in specific biological problems and, consequently, network analyses are normally performed with regard to a specific question. However, the majority of existing data integration techniques are global and do not focus on specific areas of biology. Currently this issue is addressed by using known annotation data (such as that from the Gene Ontology) to produce process-specific subnetworks. However, this approach discards useful information and is of limited use in poorly annotated areas of the interactome. Therefore, there is a need for network integration techniques that produce process-specific networks without loss of data. The work described here addresses this requirement by extending one of the most powerful integration techniques, probabilistic functional integrated networks (PFINs), to incorporate a concept of biological relevance. Initially, the available functional data for the baker’s yeast Saccharomyces cerevisiae were evaluated to identify areas of bias and specificity which could be exploited during network integration. This information was used to develop an integration technique which emphasises interactions relevant to specific biological questions, using yeast ageing as an exemplar.
The integration method improves performance during network-based protein functional prediction in relation to this process. Further, the process-relevant networks complement classical network integration techniques and significantly improve network analysis in a wide range of biological processes. The method developed has been used to produce novel predictions for 505 Gene Ontology biological processes. Of these predictions, 41,610 are consistent with existing computational annotations, and 906 are consistent with known expert-curated annotations. The approach significantly reduces the hypothesis space for experimental validation of genes hypothesised to be involved in the oxidative stress response. Therefore, incorporation of biological relevance into network integration can significantly improve network analysis with regard to individual biological questions.

    Issues in predicting protein function from sequence.

    Identifying homologues, defined as genes that arose from a common evolutionary ancestor, is often a relatively straightforward task, thanks to recent advances made in estimating the statistical significance of sequence similarities found from database searches. The extent to which homologues possess similarities in function, however, is less amenable to statistical analysis. Consequently, predicting function by homology is a qualitative, rather than quantitative, process and requires particular care to be taken. This review focuses on the various approaches that have been developed to predict function from the scale of the atom to that of the organism. Similarities in homologues' functions differ considerably at each of these different scales and also vary for different domain families. It is argued that due attention should be paid to all available clues to function, including orthologue identification, conservation of particular residue types, and the co-occurrence of domains in proteins. Pitfalls in database searching methods arising from amino acid compositional bias and database size effects are also discussed.