263 research outputs found

    Positional orthology: putting genomic evolutionary relationships into context

    Orthology is a powerful refinement of homology that allows us to describe more precisely the evolution of genomes and understand the function of the genes they contain. However, because orthology is not concerned with genomic position, it is limited in its ability to describe genes that are likely to have equivalent roles in different genomes. Because of this limitation, the concept of ‘positional orthology’ has emerged, which describes the relation between orthologous genes that retain their ancestral genomic positions. In this review, we formally define this concept, for which we introduce the shorter term ‘toporthology’, with respect to the evolutionary events experienced by a gene’s ancestors. Through a discussion of recent studies on the role of genomic context in gene evolution, we show that the distinction between orthology and toporthology is biologically significant. We then review a number of orthology prediction methods that take genomic context into account and thus may be used to infer the important relation of toporthology.

    Calculating Orthologs in Bacteria and Archaea: A Divide and Conquer Approach

    Among proteins, orthologs are defined as those that are derived by vertical descent from a single progenitor in the last common ancestor of their host organisms. Our goal is to compute a complete set of protein orthologs derived from all currently available complete bacterial and archaeal genomes. Traditional approaches typically rely on all-against-all BLAST searching which is prohibitively expensive in terms of hardware requirements or computational time (requiring an estimated 18 months or more on a typical server). Here, we present xBASE-Orth, a system for ongoing ortholog annotation, which applies a “divide and conquer” approach and adopts a pragmatic scheme that trades accuracy for speed. Starting at species level, xBASE-Orth carefully constructs and uses pan-genomes as proxies for the full collections of coding sequences at each level as it progressively climbs the taxonomic tree using the previously computed data. This leads to a significant decrease in the number of alignments that need to be performed, which translates into faster computation, making ortholog computation possible on a global scale. Using xBASE-Orth, we analyzed an NCBI collection of 1,288 bacterial and 94 archaeal complete genomes with more than 4 million coding sequences in 5 weeks and predicted more than 700 million ortholog pairs, clustered in 175,531 orthologous groups. We have also identified sets of highly conserved bacterial and archaeal orthologs and in so doing have highlighted anomalies in genome annotation and in the proposed composition of the minimal bacterial genome. In summary, our approach allows for scalable and efficient computation of the bacterial and archaeal ortholog annotations. In addition, due to its hierarchical nature, it is suitable for incorporating novel complete genomes and alternative genome annotations. 
The computed ortholog data and a continuously evolving set of applications based on it are integrated in the xBASE database, available at http://www.xbase.ac.uk/

    Reversal Distances for Strings with Few Blocks or Small Alphabets

    We study the String Reversal Distance problem, an extension of the well-known Sorting by Reversals problem. String Reversal Distance takes two strings S and T as input, and asks for a minimum number of reversals to obtain T from S. We consider four variants: String Reversal Distance, String Prefix Reversal Distance (in which any reversal must include the first letter of the string), and the signed variants of these problems, namely Signed String Reversal Distance and Signed String Prefix Reversal Distance. We study algorithmic properties of these four problems, in connection with two parameters of the input strings: the number of blocks they contain (a block being a maximal substring such that all letters in the substring are equal), and the alphabet size Σ. For instance, we show that Signed String Reversal Distance and Signed String Prefix Reversal Distance are NP-hard even if the input strings have only one letter.
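To make the definitions in this abstract concrete, here is a minimal sketch of the unsigned problem and the block parameter. It is an exhaustive breadth-first search, practical only for very short strings, and is not the algorithm studied in the paper. The function names `count_blocks` and `reversal_distance` and the `prefix_only` flag are illustrative assumptions.

```python
from collections import deque
from itertools import groupby


def count_blocks(s: str) -> int:
    """Count blocks: maximal substrings in which all letters are equal."""
    return sum(1 for _ in groupby(s))


def reversal_distance(s: str, t: str, prefix_only: bool = False):
    """Minimum number of substring reversals turning s into t (unsigned).

    Brute-force BFS over all strings reachable from s; exponential in
    general, so only suitable for tiny inputs. With prefix_only=True,
    every reversal must include the first letter of the string.
    Returns None when s and t do not contain the same letters.
    """
    if sorted(s) != sorted(t):
        return None
    if s == t:
        return 0
    n = len(s)
    seen = {s}
    queue = deque([(s, 0)])
    while queue:
        cur, dist = queue.popleft()
        starts = [0] if prefix_only else range(n)
        for i in starts:
            # Reversing a single letter changes nothing, so j - i >= 2.
            for j in range(i + 2, n + 1):
                nxt = cur[:i] + cur[i:j][::-1] + cur[j:]
                if nxt == t:
                    return dist + 1
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None
```

For example, `reversal_distance("abc", "bca")` needs two reversals (`abc` to `bac` to `bca`), while the prefix variant from `abc` to `acb` needs three.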

    Orthology prediction methods: a quality assessment using curated protein families

    The increasing number of sequenced genomes has prompted the development of several automated orthology prediction methods. Tests to evaluate the accuracy of predictions and to explore biases caused by biological and technical factors are therefore required. We used 70 manually curated families to analyze the performance of five public methods in Metazoa. We analyzed the strengths and weaknesses of the methods and quantified the impact of biological and technical challenges. From the latter part of the analysis, genome annotation emerged as the largest single influencer, affecting up to 30% of the performance. Generally, most methods did well in assigning orthologous groups but they failed to assign the exact number of genes for half of the groups. The publicly available benchmark set (http://eggnog.embl.de/orthobench/) should facilitate the improvement of current orthology assignment protocols, which is of utmost importance for many fields of biology and should be tackled by a broad scientific community.

    3D reconstruction identifies loci linked to variation in angle of individual sorghum leaves

    Selection for yield at high planting density has reshaped the leaf canopy of maize, improving photosynthetic productivity in high density settings. Further optimization of canopy architecture may be possible. However, measuring leaf angles, the widely studied component trait of leaf canopy architecture, by hand is a labor- and time-intensive process. Here, we use multiple, calibrated, 2D images to reconstruct the 3D geometry of individual sorghum plants using a voxel carving based algorithm. Automatic skeletonization and segmentation of these 3D geometries enable quantification of the angle of each leaf for each plant. The resulting measurements are both heritable and correlated with manually collected leaf angles. This automated and scalable reconstruction approach was employed to measure leaf-by-leaf angles for a population of 366 sorghum plants at multiple time points, resulting in 971 successful reconstructions and 3,376 leaf angle measurements from individual leaves. A genome wide association study conducted using aggregated leaf angle data identified a known large effect leaf angle gene, several previously identified leaf angle QTL from a sorghum NAM population, and novel signals. Genome wide association studies conducted separately for three individual sorghum leaves identified a number of the same signals, a previously unreported signal shared across multiple leaves, and signals near the sorghum orthologs of two maize genes known to influence leaf angle. Automated measurement of individual leaves and mapping variants associated with leaf angle reduce the barriers to engineering ideal canopy architectures in sorghum and other grain crops.

    An Effective Big Data Supervised Imbalanced Classification Approach for Ortholog Detection in Related Yeast Species

    Orthology detection requires more effective scaling algorithms. In this paper, a set of gene pair features based on similarity measures (alignment scores, sequence length, gene membership to conserved regions, and physicochemical profiles) are combined in a supervised pairwise ortholog detection approach to improve effectiveness considering low ortholog ratios in relation to the possible pairwise comparison between two genomes. In this scenario, big data supervised classifiers managing imbalance between ortholog and nonortholog pair classes allow for an effective scaling solution built from two genomes and extended to other genome pairs. The supervised approach was compared with RBH, RSD, and OMA algorithms by using the following yeast genome pairs: Saccharomyces cerevisiae-Kluyveromyces lactis, Saccharomyces cerevisiae-Candida glabrata, and Saccharomyces cerevisiae-Schizosaccharomyces pombe as benchmark datasets. Because of the large amount of imbalanced data, the building and testing of the supervised model were only possible by using big data supervised classifiers managing imbalance. Evaluation metrics taking low ortholog ratios into account were applied. From the effectiveness perspective, MapReduce Random Oversampling combined with Spark SVM outperformed RBH, RSD, and OMA, probably because of the consideration of gene pair features beyond alignment similarities combined with the advances in big data supervised classification.
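The class imbalance mentioned in this abstract (few ortholog pairs among many candidate gene pairs) is commonly handled by random oversampling. The sketch below is a plain single-machine illustration of that general idea, not the MapReduce implementation the paper evaluates. The function name `random_oversample`, its parameters, and the toy feature vectors are assumptions made for illustration.

```python
import random


def random_oversample(samples, labels, seed=0):
    """Balance classes by resampling smaller classes with replacement.

    Every class is padded with randomly chosen duplicates of its own
    members until it matches the size of the largest class.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(members) for members in by_class.values())

    out_samples, out_labels = [], []
    for y, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for x in members + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


# Toy gene-pair feature vectors (made-up values): one ortholog pair
# (label 1) among four non-ortholog pairs (label 0).
features = [(0.9, 0.8), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.4)]
classes = [1, 0, 0, 0, 0]
balanced_x, balanced_y = random_oversample(features, classes)
```

After balancing, both classes contain four examples. Because the duplicates carry no new information, oversampling would normally be applied to the training split only.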

    METHODS FOR HIGH-THROUGHPUT COMPARATIVE GENOMICS AND DISTRIBUTED SEQUENCE ANALYSIS

    High-throughput sequencing has accelerated applications of genomics throughout the world. The increased production and decentralization of sequencing has also created bottlenecks in computational analysis. In this dissertation, I provide novel computational methods to improve analysis throughput in three areas: whole genome multiple alignment, pan-genome annotation, and bioinformatics workflows. To aid in the study of populations, tools are needed that can quickly compare multiple genome sequences, millions of nucleotides in length. I present a new multiple alignment tool for whole genomes, named Mugsy, that implements a novel method for identifying syntenic regions. Mugsy is computationally efficient, does not require a reference genome, and is robust in identifying a rich complement of genetic variation including duplications, rearrangements, and large-scale gain and loss of sequence in mixtures of draft and completed genome data. Mugsy is evaluated on the alignment of several dozen bacterial chromosomes on a single computer and was the fastest program evaluated for the alignment of assembled human chromosome sequences from four individuals. A distributed version of the algorithm is also described and provides increased processing throughput using multiple CPUs. Numerous individual genomes are sequenced to study diversity, evolution and classify pan-genomes. Pan-genome annotations contain inconsistencies and errors that hinder comparative analysis, even within a single species. I introduce a new tool, Mugsy-Annotator, that identifies orthologs and anomalous gene structure across a pan-genome using whole genome multiple alignments. Identified anomalies include inconsistently located translation initiation sites and disrupted genes due to draft genome sequencing or pseudogenes. An evaluation of pan-genomes indicates that such anomalies are common and alternative annotations suggested by the tool can improve annotation consistency and quality. 
Finally, I describe the Cloud Virtual Resource, CloVR, a desktop application for automated sequence analysis that improves usability and accessibility of bioinformatics software and cloud computing resources. CloVR is installed on a personal computer as a virtual machine and requires minimal installation, addressing challenges in deploying bioinformatics workflows. CloVR also seamlessly accesses remote cloud computing resources for improved processing throughput. In a case study, I demonstrate the portability and scalability of CloVR and evaluate the costs and resources for microbial sequence analysis

    A whole genome duplication drives the genome evolution of Phytophthora betacei, a closely related species to Phytophthora infestans

    BACKGROUND: Pathogens of the genus Phytophthora are the etiological agents of many devastating diseases in several high-value crops and forestry species such as potato, tomato, cocoa, and oak, among many others. Phytophthora betacei is a recently described species that causes late blight almost exclusively in tree tomatoes, and it is closely related to Phytophthora infestans that causes the disease in potato crops and other Solanaceae. This study reports the assembly and annotation of the genomes of P. betacei P8084, the first of its species, and P. infestans RC1-10, a Colombian strain from the EC-1 lineage, using long-read SMRT sequencing technology. RESULTS: Our results show that P. betacei has the largest sequenced genome size of the Phytophthora genus so far with 270 Mb. A moderate transposable element invasion and a whole genome duplication likely explain its genome size expansion when compared to P. infestans, whereas P. infestans RC1-10 has expanded its genome under the activity of transposable elements. The high diversity and abundance (in terms of copy number) of classified and unclassified transposable elements in P. infestans RC1-10 relative to P. betacei bears testimony to the power of long-read technologies to discover novel repetitive elements in the genomes of organisms. Our data also provide support for the phylogenetic placement of P. betacei as a standalone species and as a sister group of P. infestans. Finally, we found no evidence to support the idea that the genome of P. betacei P8084 follows the same gene-dense/gene-sparse architecture proposed for P. infestans and other filamentous plant pathogens. CONCLUSIONS: This study provides the first genome-wide picture of P. betacei and expands the genomic resources available for P. infestans. This is a contribution towards the understanding of the genome biology and evolutionary history of Phytophthora species belonging to the subclade 1c.