
    A Bayesian phylogenetic hidden Markov model for B cell receptor sequence analysis.

    The human body generates a diverse set of high affinity antibodies, the soluble form of B cell receptors (BCRs), that bind to and neutralize invading pathogens. The natural development of BCRs must be understood in order to design vaccines for highly mutable pathogens such as influenza and HIV. BCR diversity is induced by naturally occurring combinatorial "V(D)J" rearrangement, mutation, and selection processes. Most current methods for BCR sequence analysis focus on separately modeling the above processes. Statistical phylogenetic methods are often used to model the mutational dynamics of BCR sequence data, but these techniques do not consider all the complexities associated with B cell diversification such as the V(D)J rearrangement process. In particular, standard phylogenetic approaches assume the DNA bases of the progenitor (or "naive") sequence arise independently and according to the same distribution, ignoring the complexities of V(D)J rearrangement. In this paper, we introduce a novel approach to Bayesian phylogenetic inference for BCR sequences that is based on a phylogenetic hidden Markov model (phylo-HMM). This technique not only integrates a naive rearrangement model with a phylogenetic model for BCR sequence evolution but also naturally accounts for uncertainty in all unobserved variables, including the phylogenetic tree, via posterior distribution sampling

    Inferring genome-scale rearrangement phylogeny and ancestral gene order: a Drosophila case study

    A simple, fast, and biologically-inspired computational approach to infer genome-scale rearrangement phylogeny and ancestral gene order has been developed and applied to eight Drosophila genomes, providing insights into evolutionary chromosomal dynamics

    Strategies for Reliable Exploitation of Evolutionary Concepts in High Throughput Biology

    The recent availability of the complete genome sequences of a large number of model organisms, together with the immense amount of data being produced by the new high-throughput technologies, means that we can now begin comparative analyses to understand the mechanisms involved in the evolution of the genome and their consequences in the study of biological systems. Phylogenetic approaches provide a unique conceptual framework for performing comparative analyses of all this data, for propagating information between different systems and for predicting or inferring new knowledge. As a result, phylogeny-based inference systems are now playing an increasingly important role in most areas of high throughput genomics, including studies of promoters (phylogenetic footprinting), interactomes (based on the presence and degree of conservation of interacting proteins), and in comparisons of transcriptomes or proteomes (phylogenetic proximity and co-regulation/co-expression). Here we review the recent developments aimed at making automatic, reliable phylogeny-based inference feasible in large-scale projects. We also discuss how evolutionary concepts and phylogeny-based inference strategies are now being exploited in order to understand the evolution and function of biological systems. Such advances will be fundamental for the success of the emerging disciplines of systems biology and synthetic biology, and will have wide-reaching effects in applied fields such as biotechnology, medicine and pharmacology

    Assigning and visualizing germline genes in antibody repertoires.

    Identifying the germline genes involved in immunoglobulin rearrangements is an essential first step in the analysis of antibody repertoires. Based on our prior work in analysing diverse recombinant viruses, we present IgSCUEAL (Immunoglobulin Subtype Classification Using Evolutionary ALgorithms), a phylogenetic approach to assign V and J regions of immunoglobulin sequences to their corresponding germline alleles, with D regions assigned using a simple pairwise alignment algorithm. We also develop an interactive web application for viewing the results, allowing the user to explore the frequency distribution of sequence assignments and CDR3 region length statistics, which is useful for summarizing repertoires, as well as a detailed viewer of rearrangements and region alignments for individual query sequences. We demonstrate the accuracy and utility of our method compared with sequence similarity-based approaches and other non-phylogenetic model-based approaches, using both simulated data and a set of evaluation datasets of human immunoglobulin heavy chain sequences. IgSCUEAL demonstrates the highest accuracy of V and J assignment amongst existing approaches, even when the reassorted sequence is highly mutated, and can successfully cluster sequences on the basis of shared V/J germline alleles. S.K.L.P. and B.M. were supported in part by the U.S. National Institutes of Health (AI110181, AI90970, AI100665, DA34978, GM93939, HL108460, GM110749, LM7092, MH97520, MH83552), the UCSD Center for AIDS Research (Developmental Grant, AI36214, Bioinformatics and Information Technologies Core), the International AIDS Vaccine Initiative (through AI90970), and the UC Laboratory Fees Research Program (grant no. 12-LR-236617). G.J.S. was supported in part by the U.S. National Institutes of Health (AI90118, AI68063, AI40305, and NIAID HHS N272201400019C), and a grant from the Lupus Research Institute. A.S.M.M.H. was supported by an Islamic Development Bank Scholarship, and S.D.W.F. was supported in part by the UK MRC Methodology Research Programme (grant no. MR/J013862/1). This is the final published version. It first appeared at http://rstb.royalsocietypublishing.org/content/370/1676/20140240

    Ortholog identification in the presence of domain architecture rearrangement

    Ortholog identification is used in gene functional annotation, species phylogeny estimation, phylogenetic profile construction and many other analyses. Bioinformatics methods for ortholog identification are commonly based on pairwise protein sequence comparisons between whole genomes. Phylogenetic methods of ortholog identification have also been developed; these methods can be applied to protein data sets sharing a common domain architecture or which share a single functional domain but differ outside this region of homology. While promiscuous domains represent a challenge to all orthology prediction methods, overall structural similarity is highly correlated with proximity in a phylogenetic tree, conferring a degree of robustness to phylogenetic methods. In this article, we review the issues involved in orthology prediction when data sets include sequences with structurally heterogeneous domain architectures, with particular attention to automated methods designed for high-throughput application, and present a case study to illustrate the challenges in this area

    Integration of Alignment and Phylogeny in the Whole-Genome Era

    With the development of new sequencing techniques, whole genomes of many species have become available. This huge amount of data gives rise to new opportunities and challenges. These new sequences provide valuable information on relationships among species, e.g. genome recombination and conservation. One of the principal ways to investigate such information is multiple sequence alignment (MSA). Currently, there is a large amount of MSA data on the internet, such as the UCSC genome database, but how to use this information effectively to solve classical and new problems remains underexplored. In this thesis, we explored how to use this information in four problems: the sequence orthology search problem, the multiple alignment improvement problem, the short read mapping problem, and the genome rearrangement inference problem. For the first problem, we developed an EM algorithm to iteratively align a query with a multiple alignment database using information from a phylogeny relating the query species and the species in the multiple alignment. We also infer the query's location in the phylogeny. We showed that by doing alignment and phylogeny inference together, we can improve the accuracy of both. For the second problem, we developed an optimization algorithm to iteratively refine the multiple alignment quality. Experimental results showed that our algorithm is very stable in terms of the resulting alignments. The results also showed that our method is more accurate than existing methods, i.e. MAFFT, Clustal-O, and MAVID, on test data from three sets of species from the UCSC genome database. For the third problem, we developed a model, PhyMap, to align a read to a multiple alignment allowing mismatches and indels. PhyMap computes local alignments of a query sequence against a fixed multiple-genome alignment of closely related species. PhyMap uses a known phylogenetic tree on the species in the multiple alignment to improve the quality of its computed alignments while also estimating the placement of the query on this tree. Both theoretical computation and experimental results show that our model can differentiate between orthologous and paralogous alignments better than other popular short read mapping tools (BWA, Bowtie and BLAST). For the fourth problem, we gave a simple genome recombination model which can express insertions, deletions, inversions, translocations and inverted translocations on aligned genome segments. We also developed an MCMC algorithm to infer the order of the query segments. We proved that using any Euclidean metric to measure the distance between two sequence orders in the tree optimization objective function leads to a degenerate solution in which the inferred order is the order of one of the leaf nodes. We also gave a graph-based formulation of the problem which can represent the probability distribution of the order of the query sequences