    Bidirectional best hit r-window gene clusters

    Background: Conserved gene clusters are groups of genes that are located close to one another in the genomes of several species. They tend to code for proteins that interact functionally. The identification of conserved gene clusters is an important step towards understanding genome evolution and predicting gene function.
    Results: In this paper, we propose a novel pairwise gene cluster model that combines the notion of bidirectional best hits with the r-window model introduced in 2003 by Durand and Sankoff. The bidirectional best hit (BBH) constraint removes the need to specify the minimum number of shared genes in the r-window model and improves the relevance of the results. We design a subquadratic time algorithm to compute the set of BBH r-window gene clusters efficiently.
    Conclusion: We apply our cluster model to the comparative analysis of E. coli K-12 and B. subtilis and perform an extensive comparison between our new model and the gene teams model developed by Bergeron et al. Compared with the gene teams model, our new cluster model has a slightly lower recall but a higher precision at all levels of recall when the results are ranked using statistical tests. An analysis of the most significant BBH r-window gene clusters shows that they correspond to known operons.
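
The abstract above does not spell out how the BBH constraint is applied to r-windows, so the following is only a minimal sketch of the general bidirectional-best-hit idea on individual gene pairs, not the authors' algorithm. It assumes a hypothetical dictionary `scores` that maps a (gene in genome A, gene in genome B) pair to a similarity score; the function name and the gene labels are illustrative. A pair is kept only when each gene is the other's top-scoring hit.

```python
from typing import Dict, List, Tuple


def bidirectional_best_hits(
    scores: Dict[Tuple[str, str], float]
) -> List[Tuple[str, str]]:
    """Return sorted pairs (a, b) where b is a's best hit and a is b's best hit.

    `scores` is a hypothetical input: similarity of gene a (genome A)
    to gene b (genome B). Ties are broken by first occurrence.
    """
    best_for_a: Dict[str, Tuple[str, float]] = {}
    best_for_b: Dict[str, Tuple[str, float]] = {}
    for (a, b), s in scores.items():
        # Track the highest-scoring partner seen so far in each direction.
        if a not in best_for_a or s > best_for_a[a][1]:
            best_for_a[a] = (b, s)
        if b not in best_for_b or s > best_for_b[b][1]:
            best_for_b[b] = (a, s)
    # Keep a pair only if the best-hit relation holds in both directions.
    return sorted(
        (a, b) for a, (b, _) in best_for_a.items() if best_for_b[b][0] == a
    )


example_scores = {
    ("a1", "b1"): 90.0,
    ("a1", "b2"): 50.0,
    ("a2", "b1"): 70.0,
    ("a2", "b2"): 60.0,
    ("a3", "b2"): 80.0,
}
bbh_pairs = bidirectional_best_hits(example_scores)
```

In this toy input, `a2` prefers `b1`, but `b1` prefers `a1`, so `a2` takes part in no bidirectional pair. This reciprocity requirement is what lets a BBH-style filter work without a hand-chosen threshold.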

    Identification of conserved gene clusters in multiple genomes based on synteny and homology

    Background: Uncovering the relationship between conserved chromosomal segments and the functional relatedness of elements within these segments is an important question in computational genomics. We build upon the series of works on gene teams and homology teams.
    Results: Our primary contribution is a local sliding-window SYNS (SYNtenic teamS) algorithm that refines an existing family structure into orthologous sub-families by analyzing the neighborhoods around the members of a given family with a locally sliding window. The neighborhood analysis is done by computing conserved gene clusters. We evaluate our algorithm on the existing homologous families from the Genolevures database over five genomes of the Hemiascomycete phylum.
    Conclusions: The result is an efficient algorithm that works on multiple genomes, considers paralogous copies of genes and is able to uncover orthologous clusters even in distant genomes. The resulting orthologous clusters are comparable to those obtained by manual curation.

    Finding conserved patterns in biological sequences, networks and genomes

    Biological patterns are widely used for identifying biologically interesting regions within macromolecules, classifying biological objects, predicting functions and studying evolution. Good pattern finding algorithms will help biologists to formulate and validate hypotheses in an attempt to obtain important insights into the complex mechanisms of living things. In this dissertation, we aim to improve and develop algorithms for five biological pattern finding problems. For the multiple sequence alignment problem, we propose an alternative formulation in which a final alignment is obtained by preserving pairwise alignments specified by edges of a given tree. In contrast with traditional NP-hard formulations, our preserving alignment formulation can be solved in polynomial time without using a heuristic, while having very good accuracy. For the path matching problem, we take advantage of the linearity of the query path to reduce the problem to finding a longest weighted path in a directed acyclic graph. We can find the k top-scoring paths in a network for a given query path in polynomial time. As many biological pathways are not linear, our graph matching approach allows a non-linear graph query to be given. Our graph matching formulation overcomes a common weakness of previous approaches, namely that they offer no guarantee on the quality of the results. For the gene cluster finding problem, we investigate a formulation based on constraining the overall size of a cluster and develop statistical significance estimates that allow direct comparisons of clusters of different sizes. We explore both a restricted version, which requires that orthologous genes are strictly ordered within each cluster, and the unrestricted problem, which allows paralogous genes within a genome and clusters that may not appear in every genome. We solve the first problem in polynomial time and develop practical exact algorithms for the second one.
For the gene cluster querying problem, we propose an efficient query-based approach for investigating the clustering of related genes across multiple genomes for a given gene cluster. By analyzing gene clustering in 400 bacterial genomes, we show that our algorithm is efficient enough to study gene clusters across hundreds of genomes.
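
The path matching reduction mentioned in the abstract above ends in a longest weighted path problem on a directed acyclic graph, which can be solved in linear time by dynamic programming over a topological order. The sketch below shows only that standard final step, under assumptions of my own: nodes are integers `0..num_nodes-1`, edges are `(source, target, weight)` triples, a path may start at any node, and `num_nodes` is at least 1. How the dissertation builds the graph from a query path and a network is not described in the abstract and is not modeled here.

```python
from collections import defaultdict, deque
from typing import List, Optional, Tuple


def longest_weighted_path(
    num_nodes: int, edges: List[Tuple[int, int, float]]
) -> Tuple[float, List[int]]:
    """Return (weight, node list) of a maximum-weight path in a DAG."""
    adj = defaultdict(list)
    indegree = [0] * num_nodes
    for u, v, w in edges:
        adj[u].append((v, w))
        indegree[v] += 1

    # Kahn's algorithm: repeatedly emit nodes with no remaining predecessors.
    queue = deque(n for n in range(num_nodes) if indegree[n] == 0)
    order: List[int] = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in adj[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != num_nodes:
        raise ValueError("graph contains a cycle")

    # best[v] = weight of the heaviest path ending at v (0 for a lone node).
    best = [0.0] * num_nodes
    pred: List[Optional[int]] = [None] * num_nodes
    for u in order:
        for v, w in adj[u]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                pred[v] = u

    # Walk predecessor links back from the best end node.
    end = max(range(num_nodes), key=best.__getitem__)
    path: List[int] = []
    node: Optional[int] = end
    while node is not None:
        path.append(node)
        node = pred[node]
    path.reverse()
    return best[end], path


toy_edges = [(0, 1, 2), (0, 2, 1), (1, 3, 1), (2, 3, 5), (3, 4, 1)]
weight, path = longest_weighted_path(5, toy_edges)
```

On this toy graph the heaviest route goes through node 2 rather than node 1, because the edge from 2 to 3 carries most of the weight. Extending the same table to keep the k best values per node is one common way to obtain k top-scoring paths, though the abstract does not say which method the dissertation uses.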

    Algorithms for Gene Clustering Analysis on Genomes

    The increased availability of data in biological databases provides many opportunities for understanding biological processes through these data. As recent attention has shifted from sequence analysis to higher-level analysis of genes across multiple genomes, there is a need to develop efficient algorithms for these large-scale applications that can help us understand the functions of genes. The overall objective of my research was to develop improved methods that can automatically assign groups of functionally related genes in large-scale data sets by applying new gene clustering algorithms. The proposed gene clustering algorithms, which can help us understand gene function and genome evolution, include new algorithms for protein family classification, a window-based strategy for gene clustering on chromosomes, and an exhaustive strategy that allows all clusters of small size to be enumerated. I investigate the problems of gene clustering in multiple genomes, define them mathematically, and solve them by developing efficient and effective algorithms. For protein family classification, I developed two supervised classification algorithms that can assign proteins to existing protein families in public databases and, by taking into account similarities between the unclassified proteins, allow for progressive construction of new families from proteins that cannot be assigned. This approach is useful for rapid assignment of protein sequences from genome sequencing projects to protein families. A comparison of the method with other previously developed methods shows that the algorithm has a higher accuracy rate and a lower misclassification rate than algorithms based on multiple sequence alignments and hidden Markov models. The proposed algorithm performs well even on families with very few proteins and on families with low sequence similarity.
Apart from the analysis of individual sequences, identifying genomic regions that descended from a common ancestor helps us study gene function and genome evolution. In distantly related genomes, clusters of homologous gene pairs serve as evidence used in function prediction, operon detection, etc. Thus, reliable identification of gene clusters is critical to functional annotation and analysis of genes. I developed an efficient gene clustering algorithm that can be applied to hundreds of genomes at the same time. This approach allows for large-scale study of the evolutionary relationships of gene clusters and of operon formation and destruction. By placing a stricter limit on the maximum cluster size, I developed another algorithm that uses a different formulation based on constraining the overall size of a cluster and on statistical estimates that allow direct comparisons of clusters of different sizes. A comparative analysis of the proposed algorithms shows that more biological insight can be obtained by analyzing gene clusters across hundreds of genomes, which can help us understand operon occurrences, gene orientations and gene rearrangements.