62 research outputs found

    Toward community standards in the quest for orthologs

    The identification of orthologs (gene pairs descended from a common ancestor through speciation, rather than duplication) has emerged as an essential component of many bioinformatics applications, ranging from the annotation of new genomes to experimental target prioritization. Yet, the development and application of orthology inference methods is hampered by the lack of consensus on source proteomes, file formats and benchmarks. The second 'Quest for Orthologs' meeting brought together stakeholders from various communities to address these challenges. We report on achievements and outcomes of this meeting, focusing on topics of particular relevance to the research community at large. The Quest for Orthologs consortium is an open community that welcomes contributions from all researchers interested in orthology research and applications.

    Sequence Similarity Network Reveals Common Ancestry of Multidomain Proteins

    We address the problem of homology identification in complex multidomain families with varied domain architectures. The challenge is to distinguish sequence pairs that share common ancestry from pairs that share an inserted domain but are otherwise unrelated. This distinction is essential for accuracy in gene annotation, function prediction, and comparative genomics. There are two major obstacles to multidomain homology identification: lack of a formal definition and lack of curated benchmarks for evaluating the performance of new methods. We offer preliminary solutions to both problems: 1) an extension of the traditional model of homology to include domain insertions; and 2) a manually curated benchmark of well-studied families in mouse and human. We further present Neighborhood Correlation, a novel method that exploits the local structure of the sequence similarity network to identify homologs with great accuracy based on the observation that gene duplication and domain shuffling leave distinct patterns in the sequence similarity network. In a rigorous, empirical comparison using our curated data, Neighborhood Correlation outperforms sequence similarity, alignment length, and domain architecture comparison. Neighborhood Correlation is well suited for automated, genome-scale analyses. It is easy to compute, does not require explicit knowledge of domain architecture, and classifies both single and multidomain homologs with high accuracy. Homolog predictions obtained with our method, as well as our manually curated benchmark and a web-based visualization tool for exploratory analysis of the network neighborhood structure, are available at http://www.neighborhoodcorrelation.org. Our work represents a departure from the prevailing view that the concept of homology cannot be applied to genes that have undergone domain shuffling. 
    In contrast to current approaches that either focus on the homology of individual domains or consider only families with identical domain architectures, we show that homology can be rationally defined for multidomain families with diverse architectures by considering the genomic context of the genes that encode them. Our study demonstrates the utility of mining network structure for evolutionary information, suggesting this is a fertile approach for investigating evolutionary processes in the post-genomic era.
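The abstract above does not give a formula for Neighborhood Correlation. As a minimal sketch only, assuming the score for two sequences is the Pearson correlation between their profiles of similarity scores against a shared reference set, the idea can be illustrated as follows. The function name, the profile representation, and the choice to return 0.0 for a constant profile are all assumptions made for illustration, not the authors' implementation.

```python
from math import sqrt


def neighborhood_correlation(scores_x, scores_y):
    """Pearson correlation between two similarity-score profiles.

    scores_x[i] and scores_y[i] are hypothetical similarity scores of
    sequences x and y against the same i-th reference sequence.
    """
    n = len(scores_x)
    if n == 0 or n != len(scores_y):
        raise ValueError("profiles must be non-empty and of equal length")

    mean_x = sum(scores_x) / n
    mean_y = sum(scores_y) / n

    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(scores_x, scores_y))
    var_x = sum((a - mean_x) ** 2 for a in scores_x)
    var_y = sum((b - mean_y) ** 2 for b in scores_y)

    # Assumption: a constant profile carries no neighborhood signal.
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)
```

Under this reading, two sequences that score highly against the same neighbors get a value near 1, while a pair that shares only one inserted domain would tend to have largely different neighborhoods and a lower value.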

    Vertebrate evolution: doubling and shuffling with a full deck

    The number and role of whole-genome duplications in vertebrate evolution have intrigued evolutionary biologists since Ohno first proposed genome duplication as the force driving the 'big leap' in vertebrate morphological innovation. Attempts to resolve these issues have been thwarted by small and noisy datasets, and by lack of computational accuracy and statistical rigor. Recently, Ken Wolfe and colleagues presented a genome-scale, statistically rigorous analysis of evidence based on the spatial organization of duplicated genes, as well as estimates of duplication times. Their results provide the strongest evidence to date of large-scale duplication throughout the vertebrate genome, consistent with at least one whole-genome duplication.

    Tests for Gene Clustering

    Comparing chromosomal gene order in two or more related species is an important approach to studying the forces that guide genome organization and evolution. Linked clusters of similar genes found in related genomes are often used to support arguments of evolutionary relatedness or functional selection. However, as the gene order and the gene complement of sister genomes diverge progressively due to large-scale rearrangements, horizontal gene transfer, gene duplication and gene loss, it becomes increasingly difficult to determine whether observed similarities in local genomic structure are indeed remnants of common ancestral gene order, or are merely coincidences.

    The incompatible desiderata of gene cluster properties

    There is widespread interest in comparative genomics in determining if historically and/or functionally related genes are spatially clustered in the genome, and whether the same sets of genes reappear in clusters in two or more genomes. We formalize and analyze the desirable properties of gene clusters and cluster definitions. Through detailed analysis of two commonly applied types of cluster, r-windows and max-gap, we investigate the extent to which a single definition can embody all of these properties simultaneously. We show that many of the most important properties are difficult to satisfy within the same definition. We also examine whether one commonly assumed property, which we call nestedness, is satisfied by the structures present in real genomic data.
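The abstract names r-windows and max-gap clusters without defining them. The sketch below uses one common single-genome formulation of each, which is an assumption here: a max-gap cluster is a maximal run of marked genes in which consecutive marked genes are separated by at most g unmarked genes, and an r-window is a window of r consecutive genes containing at least k marked genes. Gene positions are integer indices in chromosomal order; the function names and parameters are hypothetical.

```python
def max_gap_clusters(positions, g):
    """Group marked gene positions into max-gap clusters.

    Consecutive marked genes stay in the same cluster when at most
    g unmarked genes lie between them.
    """
    clusters = []
    for p in sorted(set(positions)):
        if clusters and p - clusters[-1][-1] - 1 <= g:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters


def r_windows(positions, genome_length, r, k):
    """Return start indices of windows of r consecutive genes
    that contain at least k marked genes."""
    marked = set(positions)
    return [
        start
        for start in range(genome_length - r + 1)
        if sum(1 for i in range(start, start + r) if i in marked) >= k
    ]


marked_genes = [2, 3, 5, 10, 11]
print(max_gap_clusters(marked_genes, g=1))
print(r_windows(marked_genes, genome_length=12, r=4, k=3))
```

On this toy input the two definitions disagree: max-gap reports the pair at positions 10 and 11 as a cluster, while no 4-gene window with at least 3 marked genes covers it. This is a small instance of the tension between cluster definitions that the abstract describes.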

    On the Design of Optimization Criteria for Multiple Sequence Alignment

    Multiple sequence alignment (MSA) is important in functional, structural and evolutionary studies of sequence data. Much research has focussed on posing MSA as an optimization problem, and several optimization criteria have been explored. In this paper, we discuss biological and mathematical problems that arise in cost function design for the multiple sequence alignment problem. In particular, we focus on tree alignment, which is often viewed as the most "biological" of the rigorous approaches to MSA. We point out several important pitfalls in current optimization approaches to MSA and identify characteristics for good cost function design. We address some extra design issues specific to approximation algorithms. We hope these ideas will lead to future research on a biologically realistic and mathematically rigorous approach to MSA. 1 Introduction One of the basic ways to extract shared information from a set of biopolymers is to compute their multiple sequence alignment (MSA). MSAs..