
    Prediction of Indel Flanking Regions and Its Application in the Alignment of Multiple Protein Sequences

    Proteins are the most important molecules in living organisms, and they are involved in every function of the cell, such as signal transmission, metabolic regulation, transportation of molecules, and defense mechanisms. As new protein sequences are discovered every day and protein databases continue to grow exponentially with time, analysis of protein families, understanding their evolutionary trends and detection of remote homologues have become extremely important. The traditional laboratory techniques for studying these proteins are slow and time-consuming. Therefore, biologists have turned to automated methods that are fast and capable of analyzing large amounts of data and determining relationships between proteins that would be difficult, if not impossible, for humans to identify through the traditional techniques. Insertion/deletion (indel) and substitution of an amino acid are two common events that lead to the evolution of and variations in protein sequences. Further, many human diseases and much of the functional divergence between homologous proteins are related more to indel mutations than to substitution mutations, even though the former occur less often than the latter. Reliable detection of indels and their flanking regions is a major challenge in research related to protein evolution, structures and functions. The first and most important step in studying a newly discovered protein sequence is to search protein databases for proteins that are similar or closely related to the new protein, and then to align the new protein sequence to these proteins. Thus, the alignment of multiple protein sequences is one of the most commonly performed tasks in bioinformatics analyses, and has been used in many applications, including sequence annotation, phylogenetic tree estimation, evolutionary analysis, secondary structure prediction and protein database search.
In spite of the considerable research effort recently devoted to improving the performance of multiple sequence alignment (MSA) algorithms, finding a highly accurate alignment between multiple protein sequences remains a challenging problem. The objectives of this thesis are to develop a novel scheme to predict indel flanking regions (IndelFRs) in a protein sequence and to develop an efficient algorithm for the alignment of multiple protein sequences incorporating the information on the predicted IndelFRs. In the first part of the thesis, a variable-order Markov model-based scheme to predict indel flanking regions in a protein sequence for a given protein fold is proposed. In this scheme, two predictors, referred to as the PPM IndelFR and PST IndelFR predictors, are designed based on prediction by partial match and probabilistic suffix tree, respectively. The performance of the proposed IndelFR predictors is evaluated in terms of the commonly used metrics, namely, accuracy of prediction and F1-measure. It is shown through extensive performance evaluation that the proposed predictors are able to predict IndelFRs in the protein sequences with high values of accuracy and F1-measure. It is also shown that if one is interested only in predicting IndelFRs in protein sequences, it would be preferable to use the proposed predictors instead of HMMER 3.0 in view of the substantially superior performance of the former. In the second part of the thesis, a novel and efficient algorithm incorporating the information on the predicted IndelFRs for the alignment of multiple protein sequences is proposed. A new variable gap penalty function is introduced, which makes the gap placement in protein sequences more accurate for the protein alignment. The performance of the proposed alignment algorithm, named the MSAIndelFR algorithm, is evaluated in terms of the sum-of-pairs (SP) and total column (TC) metrics.
It is shown through extensive performance evaluation using four popular benchmarks, BAliBASE 3.0, OXBENCH, PREFAB 4.0, and SABmark 1.65, that the performance of MSAIndelFR is superior to that of the six most widely used alignment algorithms, namely, Clustal W2, Clustal Omega, MSAProbs, Kalign2, MAFFT and MUSCLE. Through the study undertaken in this thesis it is shown that a reliable detection of indels and their flanking regions can be achieved by using the proposed IndelFR predictors, and a substantial improvement in the protein alignment accuracy can be achieved by using the proposed variable gap penalty function. Thus, it is anticipated that this investigation will not only facilitate future studies on the modeling of indel mutations and protein sequence alignment, but will also open up new avenues for research concerning protein evolution, structures, and functions as well as for research concerning protein sequence alignment.
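To make the idea of a variable-order Markov predictor concrete, here is a minimal sketch. It is not the PPM or PST predictor of the thesis: it swaps in a simple "longest seen context" back-off with add-one smoothing, and the function names, the smoothing rule and the toy two-letter alphabet are all assumptions made for illustration. It computes the average log-loss (in bits per symbol) of a sequence under a model trained on example sequences; a lower value means the sequence looks more like the training data.

```python
import math
from collections import defaultdict


def train_counts(sequences, max_order):
    """Count next-symbol occurrences for every context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for k in range(0, min(max_order, i) + 1):
                counts[seq[i - k:i]][symbol] += 1
    return counts


def symbol_probability(counts, context, symbol, alphabet_size):
    """Back off to the longest suffix of `context` seen in training,
    then apply add-one smoothing (an assumption, not PPM escape logic)."""
    while context and context not in counts:
        context = context[1:]
    seen = counts.get(context, {})
    total = sum(seen.values())
    return (seen.get(symbol, 0) + 1) / (total + alphabet_size)


def average_log_loss(counts, seq, max_order, alphabet_size=20):
    """Average negative log2-probability per symbol of `seq` under the model."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    loss = 0.0
    for i, symbol in enumerate(seq):
        context = seq[max(0, i - max_order):i]
        loss -= math.log2(symbol_probability(counts, context, symbol, alphabet_size))
    return loss / len(seq)


# Toy usage with a hypothetical two-letter alphabet.
counts = train_counts(["ABABABAB"], max_order=2)
familiar = average_log_loss(counts, "ABAB", max_order=2, alphabet_size=2)
unfamiliar = average_log_loss(counts, "AABB", max_order=2, alphabet_size=2)
```

In this toy run the sequence that follows the training pattern receives the lower average log-loss, which is the kind of signal a fold-specific predictor could use to flag regions that fit or do not fit its model.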

    MSAIndelFR: a scheme for multiple protein sequence alignment using information on indel flanking regions

    Background: The alignment of multiple protein sequences is one of the most commonly performed tasks in bioinformatics. In spite of the considerable research effort recently devoted to improving the performance of multiple sequence alignment (MSA) algorithms, finding a highly accurate alignment between multiple protein sequences is still a challenging problem. Results: We propose a novel and efficient algorithm, called MSAIndelFR, for multiple sequence alignment using the information on the predicted locations of IndelFRs and the computed average log-loss values obtained from IndelFR predictors, each of which is designed for a different protein fold. We demonstrate that the introduction of a new variable gap penalty function based on the predicted locations of the IndelFRs and the computed average log-loss values into the proposed algorithm substantially improves the protein alignment accuracy. This is illustrated by evaluating the performance of the algorithm in aligning sequences belonging to the protein folds for which the IndelFR predictors already exist and by using the reference alignments of the four popular benchmarks, BAliBASE 3.0, OXBENCH, PREFAB 4.0, and SABRE (SABmark 1.65). Conclusions: We have proposed a novel and efficient algorithm, the MSAIndelFR algorithm, for multiple protein sequence alignment incorporating a new variable gap penalty function. It is shown that the performance of the proposed algorithm is superior to that of the most widely used alignment algorithms, Clustal W2, Clustal Omega, Kalign2, MSAProbs, MAFFT, MUSCLE, ProbCons and Probalign, in terms of both the sum-of-pairs and total column metrics.
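The sum-of-pairs (SP) and total column (TC) metrics mentioned above can be sketched as follows. This is a simplified illustration, not the scoring program shipped with any of the benchmarks: it assumes both alignments list the same sequences in the same order, uses "-" as the gap character, and scores every reference column rather than only annotated core regions. The function names are hypothetical.

```python
from itertools import combinations


def columns(alignment, gap="-"):
    """Turn gapped rows into columns of residue indices (None marks a gap)."""
    next_index = [0] * len(alignment)
    cols = []
    for j in range(len(alignment[0])):
        col = []
        for s, row in enumerate(alignment):
            if row[j] == gap:
                col.append(None)
            else:
                col.append(next_index[s])
                next_index[s] += 1
        cols.append(tuple(col))
    return cols


def aligned_pairs(cols):
    """Set of residue pairs ((seq_a, idx_a), (seq_b, idx_b)) placed in the same column."""
    pairs = set()
    for col in cols:
        residues = [(s, idx) for s, idx in enumerate(col) if idx is not None]
        pairs.update(combinations(residues, 2))
    return pairs


def sp_score(test, reference):
    """Fraction of reference residue pairs that the test alignment also aligns."""
    ref_pairs = aligned_pairs(columns(reference))
    test_pairs = aligned_pairs(columns(test))
    return len(ref_pairs & test_pairs) / len(ref_pairs)


def tc_score(test, reference):
    """Fraction of reference columns reproduced exactly in the test alignment."""
    ref_cols = columns(reference)
    test_cols = set(columns(test))
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)


# Toy usage: the test alignment misplaces one gap.
reference = ["AC-G", "ACTG"]
test = ["ACG-", "ACTG"]
sp = sp_score(test, reference)  # 2 of 3 reference pairs recovered
tc = tc_score(test, reference)  # 2 of 4 reference columns recovered
```

SP rewards partially correct alignments, whereas TC only credits columns that match the reference in every sequence, so TC is typically the stricter of the two.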

    Progressive Mauve: Multiple alignment of genomes with gene flux and rearrangement

    Multiple genome alignment remains a challenging problem. Effects of recombination including rearrangement, segmental duplication, gain, and loss can create a mosaic pattern of homology even among closely related organisms. We describe a method to align two or more genomes that have undergone large-scale recombination, particularly genomes that have undergone substantial amounts of gene gain and loss (gene flux). The method utilizes a novel alignment objective score, referred to as a sum-of-pairs breakpoint score. We also apply a probabilistic alignment filtering method to remove erroneous alignments of unrelated sequences, which are commonly observed in other genome alignment methods. We describe new metrics for quantifying genome alignment accuracy which measure the quality of rearrangement breakpoint predictions and indel predictions. The progressive genome alignment algorithm demonstrates markedly improved accuracy over previous approaches in situations where genomes have undergone realistic amounts of genome rearrangement, gene gain, loss, and duplication. We apply the progressive genome alignment algorithm to a set of 23 completely sequenced genomes from the genera Escherichia, Shigella, and Salmonella. The 23 enterobacteria have an estimated 2.46 Mbp of genomic content conserved among all taxa and total unique content of 15.2 Mbp. We document substantial population-level variability among these organisms driven by homologous recombination, gene gain, and gene loss. Free, open-source software implementing the described genome alignment approach is available from http://gel.ahabs.wisc.edu/mauve. Comment: Revision dated June 19, 200

    Sequence context affects the rate of short insertions and deletions in flies and primates

    Analysis of a large collection of short insertions and deletions in primates and flies shows that the rate of insertions or deletions of specific lengths can vary more than 100-fold, depending on the surrounding sequence.

    Evolution of Regulatory Sequences in 12 Drosophila Species

    Characterization of the evolutionary constraints acting on cis-regulatory sequences is crucial to comparative genomics and provides key insights into the evolution of organismal diversity. We study the relationships among orthologous cis-regulatory modules (CRMs) in 12 Drosophila species, especially with respect to the evolution of transcription factor binding sites, and report statistical evidence in favor of key evolutionary hypotheses. Binding sites are found to have position-specific substitution rates. However, the selective forces at different positions of a site do not act independently, and the evidence suggests that constraints on sites are often based on their exact binding affinities. Binding site loss is seen to conform to a molecular clock hypothesis. The rate of site loss is transcription factor-specific and depends on the strength of binding and, in some cases, the presence of other binding sites in close proximity. Our analysis is based on a novel computational method for aligning orthologous CRMs on a tree, which rigorously accounts for alignment uncertainties and exploits binding site predictions through a unified probabilistic framework. Finally, we report weak purifying selection on short deletions, providing important clues about overall spatial constraints on CRMs. Our results present a complex picture of regulatory sequence evolution, with substantial plasticity that depends on a number of factors. The insights gained in this study will help us to understand the combinatorial control of gene regulation and how it evolves. They will pave the way for theoretical models that are cognizant of the important determinants of regulatory sequence evolution and will be critical in genome-wide identification of non-coding sequences under purifying or positive selection.

    Bovine polledness

    Persistent horns are an important speciation trait of the family Bovidae, with complex morphogenesis taking place shortly after birth. Polledness is highly favourable in modern cattle breeding systems, and serious animal welfare issues call for a way of producing hornless cattle other than dehorning. Although the dominant inhibition of horn morphogenesis was discovered more than 70 years ago, and the causative mutation was mapped almost 20 years ago, its molecular nature remained unknown. Here, we report allelic heterogeneity of the POLLED locus. First, we mapped the POLLED locus to a ∼381-kb interval in a multi-breed case-control design. Targeted re-sequencing of an enlarged candidate interval (547 kb) in 16 sires with known POLLED genotype did not detect a common allele associated with polled status. In eight sires of Alpine and Scottish origin (four polled versus four horned), we identified a single candidate mutation, a complex 202 bp insertion-deletion event that showed perfect association with the polled phenotype in various European cattle breeds, except Holstein-Friesian. The analysis of the same candidate interval in eight Holsteins identified five candidate variants which segregate as a 260 kb haplotype also perfectly associated with the POLLED gene, without recombination or interference with the 202 bp insertion-deletion. We further identified bulls which are progeny tested as homozygous polled but bear both the 202 bp insertion-deletion and the Friesian haplotype. The distribution of genotypes of the two putative POLLED alleles in a large semi-random sample (1,261 animals) supports the hypothesis of two independent mutations.

    Using Expressed Sequence Tags to Improve Gene Structure Annotation

    Finding all gene structures is a crucial step in obtaining valuable information from genomic sequences. It is still a challenging problem, especially for vertebrate genomes, such as the human genome. Expressed Sequence Tags (ESTs) provide a tremendous resource for determining intron-exon structures. However, they are short and error-prone, which prevents existing methods from exploiting EST information efficiently. This dissertation addresses three aspects of using ESTs for gene structure annotation. The first aspect is using ESTs to improve de novo gene prediction. Probability models are introduced for EST alignments to genomic sequence in exons, introns, intergenic regions, splice sites and UTRs, representing the EST alignment patterns in these regions. New gene prediction systems were developed by combining the EST alignments with comparative genomics gene prediction systems, such as TWINSCAN and N-SCAN, so that they can predict gene structures more accurately where EST alignments exist without compromising their ability to predict gene structures where no EST exists. The accuracy of TWINSCAN_EST and NSCAN_EST is shown to be substantially better than that of any existing method that does not use full-length cDNA or protein similarity information. The second aspect is using ESTs and de novo gene prediction to guide biological experiments, such as finding full ORF-containing cDNA clones, which provide the most direct experimental evidence for gene structures. A probability model was introduced to guide experiments by summing over gene structure models consistent with EST alignments. The last aspect is a novel EST-to-genome alignment program called QPAIRAGON, which improves alignment accuracy by using EST sequencing quality values. Gene prediction accuracy can be improved by using this new EST-to-genome alignment program. It can also be used for many other bioinformatics applications, such as SNP finding and alternative splicing site prediction.

    Using signal processing, evolutionary computation, and machine learning to identify transposable elements in genomes

    About half of the human genome consists of transposable elements (TE's), sequences that have many copies of themselves distributed throughout the genome. All genomes, from bacterial to human, contain TE's. TE's affect genome function by either creating proteins directly or affecting genome regulation. They serve as molecular fossils, giving clues to the evolutionary history of the organism. TE's are often challenging to identify because they are fragmentary or heavily mutated. In this thesis, novel features for the detection and study of TE's are developed. These features are of two types. The first type are statistical features based on the Fourier transform used to assess reading frame use. These features measure how different the reading frame use is from that of a random sequence, which reading frames the sequence is using, and the proportion of use of the active reading frames. The second type of feature, called side effect machine (SEM) features, are generated by finite state machines augmented with counters that track the number of times the state is visited. These counters then become features of the sequence. The number of possible SEM features is super-exponential in the number of states. New methods for selecting useful feature subsets that incorporate a genetic algorithm and a novel clustering method are introduced. The features produced reveal structural characteristics of the sequences of potential interest to biologists. A detailed analysis of the genetic algorithm, its fitness functions, and its fitness landscapes is performed. The features are used, together with features used in existing exon finding algorithms, to build classifiers that distinguish TE's from other genomic sequences in humans, fruit flies, and ciliates. The classifiers achieve high accuracy (> 85%) on a variety of TE classification problems. The classifiers are used to scan large genomes for TE's. 
In addition, the features are used to describe the TE's in the newly sequenced ciliate Tetrahymena thermophila, to give biologists information useful in forming experimentally testable hypotheses about the role of these TE's and the mechanisms that govern them.
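The side effect machine (SEM) features described in this abstract, state-visit counters attached to a finite state machine, can be illustrated with a small sketch. The two-state machine below is a hypothetical toy, not one of the machines evolved in the thesis, and the choice to count a state each time it is entered after reading a symbol (without counting the start state itself) is an assumption.

```python
def sem_features(sequence, transitions, start_state, num_states):
    """Run a finite state machine over `sequence` and return, for each state,
    the number of times it was entered after reading a symbol."""
    counts = [0] * num_states
    state = start_state
    for symbol in sequence:
        state = transitions[state][symbol]
        counts[state] += 1
    return counts


# Hypothetical two-state machine: state 1 is entered on C or G, state 0 on A or T.
toy_transitions = {
    0: {"A": 0, "T": 0, "C": 1, "G": 1},
    1: {"A": 0, "T": 0, "C": 1, "G": 1},
}

features = sem_features("GATTACA", toy_transitions, start_state=0, num_states=2)
# features == [5, 2]: five symbols led to state 0, two led to state 1
```

The resulting count vector can be fed to a classifier as a fixed-length description of a variable-length sequence; different transition tables yield different feature sets, which is why feature selection over machines becomes the hard part.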

    Pan-parastagonospora comparative genome analysis-effector prediction and genome evolution

    We report a fungal pan-genome study involving Parastagonospora spp., including 21 isolates of the wheat (Triticum aestivum) pathogen Parastagonospora nodorum, 10 of the grass-infecting Parastagonospora avenae, and 2 of a closely related undefined sister species. We observed substantial variation in the distribution of polymorphisms across the pan-genome, including repeat-induced point mutations, diversifying selection and gene gains and losses. We also discovered chromosome-scale inter- and intraspecific presence/absence variation of some sequences, suggesting the occurrence of one or more accessory chromosomes or regions that may play a role in host-pathogen interactions. The presence of known pathogenicity effector loci SnToxA, SnTox1, and SnTox3 varied substantially among isolates. Three P. nodorum isolates lacked functional versions for all three loci, whereas three P. avenae isolates carried one or both of the SnTox1 and SnTox3 genes, indicating previously unrecognized potential for discovering additional effectors in the P. nodorum-wheat pathosystem. We utilized the pan-genomic comparative analysis to improve the prediction of pathogenicity effector candidates, recovering the three confirmed effectors among our top-ranked candidates. We propose applying this pan-genomic approach to identify the effector repertoire involved in other host-microbe interactions involving necrotrophic pathogens in the Pezizomycotina.