21 research outputs found

    Probabilistic approaches to alignment with tandem repeats

    Pairwise sequence alignment with block and character edit operations

    Pairwise sequence comparison is one of the most fundamental problems in string processing. The most common metric for quantifying the similarity between sequences S and T is edit distance, d(S,T), which corresponds to the number of characters that need to be substituted, deleted from, or inserted into S to generate T. However, fewer edit operations may suffice to transform one string into the other if larger rearrangements are permitted. Block edit distance accounts for such changes at the substring (i.e., block) level and penalizes entire block removals, insertions, copies, and reversals with the same cost as single-character edits (Lopresti & Tomkins, 1997). Most studies of block edit distance to date have aimed only to characterize the distance itself, for applications in sequence nearest-neighbor search, without reporting the full alignment details. Although a few tools, such as GR-Aligner, try to solve block edit distance for genomic sequences, they have limited functionality and are no longer maintained. Here, we present SABER, an algorithm to solve block edit distance that supports block deletions, block moves, and block reversals in addition to the classical single-character edit operations. Our algorithm runs in O(m^2 · n · l_range) time for |S| = m, |T| = n, and a permitted block size range of l_range, and can report all breakpoints for the block operations. We also provide an implementation of SABER currently optimized for genomic sequences (i.e., generated by the DNA alphabet), although the algorithm can theoretically be used for any alphabet. SABER is available at http://github.com/BilkentCompGen/sabe
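
    As a point of reference for the classical metric d(S,T) that this abstract starts from, the following is a minimal sketch of unit-cost edit distance computed by dynamic programming. It covers only single-character substitutions, deletions, and insertions. It is not SABER and does not handle block operations. The function name `edit_distance` is an illustrative assumption, not something taken from the paper.

```python
def edit_distance(s: str, t: str) -> int:
    """Classical unit-cost edit distance d(S, T); no block operations."""
    m, n = len(s), len(t)
    # prev[j] holds the distance between the first i-1 characters of s
    # and the first j characters of t.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            substitution_cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,                      # delete s[i-1]
                curr[j - 1] + 1,                  # insert t[j-1]
                prev[j - 1] + substitution_cost,  # match or substitute
            )
        prev = curr
    return prev[n]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

    The table has (m+1) x (n+1) cells, so this runs in O(m · n) time. A single block reversal or move can change many characters at once, which is why character-level distance may overstate the number of events for such string pairs.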

    A Monte Carlo Method for Assessing the Quality of Duplication-Aware Alignment Algorithms

    The increasing availability of high-throughput sequencing technologies poses several challenges for the analysis of genomic data. Within this context, duplication-aware sequence alignment that takes complex mutation events into account is regarded as an important problem, particularly in light of recent evolutionary bioinformatics research that has highlighted tandem duplications as one of the most important mutation events. Traditional sequence comparison algorithms do not take these events into account, resulting in alignments of poor biological significance, mainly because they assume statistical independence among contiguous residues. Several duplication-aware algorithms have been proposed in recent years; they differ either in the type of duplications they consider or in the methods adopted to identify and compare them. However, no solution clearly outperforms the others, and no methods exist for assessing the reliability of the resulting alignments. This paper proposes a Monte Carlo method for assessing the quality of duplication-aware alignment algorithms and for guiding the choice of the most appropriate alignment technique in a specific context.

    A Lossy Compression Technique Enabling Duplication-Aware Sequence Alignment

    In spite of the recognized importance of tandem duplications in genome evolution, commonly adopted sequence comparison algorithms do not take into account complex mutation events involving more than one residue at a time, since such events are not compliant with the underlying assumption of statistical independence of adjacent residues. As a consequence, the presence of tandem repeats in the sequences under comparison may impair the biological significance of the resulting alignment. Although solutions have been proposed, repeat-aware sequence alignment is still considered an open problem, and new efficient and effective methods have been advocated. The present paper describes an alternative lossy compression scheme for genomic sequences which iteratively collapses repeats of increasing length. The resulting approximate representations do not contain tandem duplications, while retaining enough information to make their comparison even more significant than the edit distance between the original sequences. This allows us to apply traditional alignment algorithms directly to the compressed sequences. Results confirm the validity of the proposed approach for the problem of duplication-aware sequence alignment.
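
    The idea of iteratively collapsing repeats of increasing length can be illustrated with a small sketch. The abstract does not give the actual compression scheme, so everything below is an assumption made for illustration: exact (non-degenerate) repeats only, one left-to-right pass per period, and the hypothetical names `collapse_period`, `collapse_tandem_repeats`, and `max_period`.

```python
def collapse_period(seq: str, period: int) -> str:
    """One left-to-right pass that drops a copy of any unit of length
    `period` that is immediately followed by an identical copy."""
    out = []
    i = 0
    while i < len(seq):
        unit = seq[i:i + period]
        following = seq[i + period:i + 2 * period]
        if unit == following:
            # Skip this copy; the identical next copy is examined in turn.
            i += period
        else:
            out.append(seq[i])
            i += 1
    return "".join(out)


def collapse_tandem_repeats(seq: str, max_period: int) -> str:
    """Collapse exact tandem repeats for periods 1, 2, ..., max_period."""
    for period in range(1, max_period + 1):
        seq = collapse_period(seq, period)
    return seq


if __name__ == "__main__":
    print(collapse_tandem_repeats("ACACACGTTT", max_period=3))
```

    The compression is lossy because the copy number of each repeat is discarded. A conventional aligner could then be run on the two collapsed strings, which is the general strategy the abstract describes.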

    The Molecular Epidemiology of the Highly Virulent ST93 Australian Community Staphylococcus aureus Strain

    In Australia, the PVL-positive ST93-IV [2B] clone, colloquially known as "Queensland CA-MRSA", has become the dominant CA-MRSA clone. First described in the early 2000s, ST93-IV [2B] is associated with skin and severe invasive infections, including necrotizing pneumonia. A singleton by multilocus sequence typing (MLST) eBURST analysis, ST93 is distinct from other S. aureus clones. To determine whether the increased prevalence of ST93-IV [2B] is due to the widespread transmission of a single strain, the genetic relatedness of 58 S. aureus ST93 isolates collected throughout Australia over an extended period was studied in detail using a variety of molecular methods, including pulsed-field gel electrophoresis, spa typing, MLST, DNA microarray, SCCmec typing and dru typing. Identification of the phage harbouring the lukS-PV/lukF-PV Panton-Valentine leucocidin genes, detection of allelic variations in lukS-PV/lukF-PV, and quantification of LukF-PV expression were also performed. Although ST93-IV [2B] is known to have an apparent enhanced clinical virulence, the isolates harboured few known virulence determinants. All PVL-positive isolates carried the PVL-encoding phage WSa2USA, and the lukS-PV/lukF-PV genes had the same R variant SNP profile. The isolates produced similar expression levels of LukF-PV. Although multiple rearrangements of the spa sequence have occurred, the core genome in ST93 is very stable. The emergence of ST93-MRSA is due to independent acquisitions of different dru-defined type IV and type V SCCmec elements in several spa-defined ST93-MSSA backgrounds. Rearrangement of the spa sequence in ST93-MRSA has subsequently occurred in some of these strains. Although multiple ST93-MRSA strains were characterised, little genetic diversity was identified for most isolates, with PVL-positive ST93-IVa [2B]-t202-dt10 predominant across Australia. Whether ST93-IVa [2B]-t202-dt10 arose from one PVL-positive ST93-MSSA-t202, or by independent acquisitions of SCCmec-IVa [2B]-dt10 into multiple PVL-positive ST93-MSSA-t202 strains, is not known.

    Typing Clostridium difficile strains based on tandem repeat sequences

    Background: Genotyping of epidemic Clostridium difficile strains is necessary to track their emergence and spread. Portability of genotyping data is desirable to facilitate inter-laboratory comparisons and epidemiological studies. Results: This report presents results from a systematic screen for variation in repetitive DNA in the genome of C. difficile. We describe two tandem repeat loci, designated 'TR6' and 'TR10', which display extensive sequence variation that may be useful for sequence-based strain typing. Based on an investigation of 154 C. difficile isolates comprising 75 ribotypes, tandem repeat sequencing demonstrated excellent concordance with widely used PCR ribotyping and equal discriminatory power. Moreover, tandem repeat sequences enabled the reconstruction of the isolates' largely clonal population structure and evolutionary history. Conclusion: We conclude that sequence analysis of the two repetitive loci introduced here may be highly useful for routine typing of C. difficile. Tandem repeat sequence typing resolves phylogenetic diversity to a level equivalent to PCR ribotypes. DNA sequences may be stored in databases accessible over the internet, obviating the need for the exchange of reference strains.

    Multiple sequence alignment with user-defined constraints at GOBICS

    Most multi-alignment methods are fully automated, i.e. they are based on a fixed set of mathematical rules. For various reasons, such methods may fail to produce biologically meaningful alignments. Herein, we describe a semi-automatic approach to multiple sequence alignment where biological expert knowledge can be used to influence the alignment procedure. The user can specify parts of the sequences that are biologically related to each other; our software program uses these sites as anchor points and creates a multiple alignment respecting these user-defined constraints. By using known functionally, structurally or evolutionarily related positions of the input sequences as anchor points, our method can produce alignments that reflect the true biological relationships among the input sequences more accurately than fully automated procedures can.

    XSTREAM: A practical algorithm for identification and architecture modeling of tandem repeats in protein sequences

    Background: Biological sequence repeats arranged in tandem patterns are widespread in DNA and proteins. While many software tools have been designed to detect DNA tandem repeats (TRs), useful algorithms for identifying protein TRs with varied levels of degeneracy are still needed. Results: To address limitations of current repeat identification methods, and to provide an efficient and flexible algorithm for the detection and analysis of TRs in protein sequences, we designed and implemented a new computational method called XSTREAM. Running time tests confirm the practicality of XSTREAM for analyses of multi-genome datasets. Each of the key capabilities of XSTREAM (e.g., merging, nesting, long-period detection, and TR architecture modeling) is demonstrated using anecdotal examples, and the utility of XSTREAM for identifying TR proteins was validated using data from a recently published paper. Conclusion: We show that XSTREAM is a practical and valuable tool for TR detection in protein and nucleotide sequences at the multi-genome scale, and an effective tool for modeling TR domains with diverse architectures and varied levels of degeneracy. Because of these useful features, XSTREAM has significant potential for the discovery of naturally-evolved modular proteins with applications for engineering novel biostructural and biomimetic materials, and identifying new vaccine and diagnostic targets.

    Integrated multiple sequence alignment

    Sammeth M. Integrated multiple sequence alignment. Bielefeld (Germany): Bielefeld University; 2005. The thesis presents enhancements for automated and manual multiple sequence alignment: existing alignment algorithms are made more easily accessible, and new algorithms are designed for difficult cases. Firstly, we introduce the QAlign framework, a graphical user interface for multiple sequence alignment. It comprises several state-of-the-art algorithms and exposes their parameters through convenient dialogs. An alignment viewer with guided editing functionality can also highlight or print regions of the alignment. Phylogenetic features are also provided, e.g., distance-based tree reconstruction methods, corrections for multiple substitutions and a tree viewer. The modular concept and the platform-independent implementation make the framework easy to extend. Further, we develop a constrained version of the divide-and-conquer alignment such that it can be restricted by anchors found earlier with local alignments. It can be shown that this method shares attributes of both local and global aligners, in the quality of results as well as in the computation time. We further modify the local alignment step to work on bipartite (or even multipartite) sets for sequences where repeats overshadow valuable sequence information. In the end, a technique is established that can accurately align sequences containing possibly repeated motifs. Finally, another algorithm is presented that allows tandem repeat sequences to be compared by aligning them with respect to their possible repeat histories. We describe an evolutionary model including tandem duplications and excisions, and give an exact algorithm to compare two sequences under this model.

    fbl-Typing of Staphylococcus lugdunensis: A Frontline Tool for Epidemiological Studies, but Not Predictive of Fibrinogen Binding Ability

    Staphylococcus lugdunensis is increasingly recognized as a potent pathogen, responsible for severe infections with an outcome resembling that of Staphylococcus aureus. Here, we developed and evaluated a tool for S. lugdunensis typing, using DNA sequence analysis of the repeat-encoding region (R-domain) in the gene encoding the fibrinogen (Fg)-binding protein Fbl (fbl-typing). We typed 240 S. lugdunensis isolates from various clinical and geographical origins. The length of the R-domain ranged from 9 to 52 repeats. fbl-typing identified 54 unique 18-bp repeat sequences and 92 distinct fbl-types. The discriminatory power of fbl-typing was higher than that of multilocus sequence typing (MLST) and equivalent to that of tandem repeat sequence typing. fbl-types could assign isolates to MLST clonal complexes with excellent predictive power. The ability to promote adherence to immobilized human Fg was evaluated for 55 isolates chosen to reflect the genetic diversity of the fbl gene. We observed no direct correlation between Fg binding ability and fbl-types. However, the lowest percentage of Fg binding was observed for isolates carrying a 5′-end frameshift mutation of the fbl gene and for those harboring fewer than 43 repeats in the R-domain. qRT-PCR assays for some isolates revealed no correlation between fbl gene expression and Fg binding capacity. In conclusion, this study shows that fbl-typing is a useful tool in S. lugdunensis epidemiology, especially because it is an easy, cost-effective, rapid and portable method (http://fbl-typing.univ-rouen.fr/). The impact of fbl polymorphism on the structure of the protein, its expression on the cell surface and in virulence remains to be determined