
    PILER-CR: Fast and accurate identification of CRISPR repeats

    BACKGROUND: Sequencing of prokaryotic genomes has recently revealed the presence of CRISPR elements: short, highly conserved repeats separated by unique sequences of similar length. The distinctive sequence signature of CRISPR repeats can be found using general-purpose repeat- or pattern-finding software tools. However, the output of such tools is not always ideal for studying these repeats, and significant effort is sometimes needed to build additional tools and perform manual analysis of the output. RESULTS: We present PILER-CR, a program specifically designed for the identification and analysis of CRISPR repeats. The program executes rapidly, completing a 5 Mb genome in around 5 seconds on a current desktop computer. We validate the algorithm by manual curation and by comparison with published surveys of these repeats, finding that PILER-CR has both high sensitivity and high specificity. We also present a catalogue of putative CRISPR repeats identified in a comprehensive analysis of 346 prokaryotic genomes. CONCLUSION: PILER-CR is a useful tool for rapid identification and classification of CRISPR repeats. The software is donated to the public domain. Source code and a Linux binary are freely available at
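The signature described above (short conserved repeats separated by unique spacers of similar length) can be illustrated with a toy detector. This is a minimal sketch under stated assumptions, not PILER-CR's algorithm: it only considers exact k-mer repeats, and the function name `find_repeat_arrays`, its parameters, and the example sequence are hypothetical.

```python
from collections import defaultdict


def find_repeat_arrays(genome, k=8, min_copies=3, min_spacer=4, max_spacer=12):
    """Toy detector: exact k-mers that recur with spacers of bounded length.

    Hypothetical illustration only; real CRISPR finders tolerate mismatches
    and do not fix the repeat length in advance.
    """
    # Index every start position of every k-mer in the sequence.
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)

    arrays = []
    for kmer, starts in positions.items():
        if len(starts) < min_copies:
            continue
        # Spacer length = distance from the end of one copy to the next start.
        spacers = [b - (a + k) for a, b in zip(starts, starts[1:])]
        if all(min_spacer <= s <= max_spacer for s in spacers):
            arrays.append((kmer, starts, spacers))
    return arrays


# Made-up sequence: four copies of one repeat separated by three unique spacers.
repeat = "GTTTCAGA"
unique_spacers = ["ACGTAC", "TTGACA", "CCATGG"]
genome = ("AAAA" + repeat + unique_spacers[0] + repeat + unique_spacers[1]
          + repeat + unique_spacers[2] + repeat + "TTTT")

arrays = find_repeat_arrays(genome)
for kmer, starts, spacers in arrays:
    print(kmer, starts, spacers)
```

Bounding the spacer lengths is what separates an array-like signature from a k-mer that merely happens to recur at scattered positions.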

    Interpreting 16S metagenomic data without clustering to achieve sub-OTU resolution

    The standard approach to analyzing 16S tag sequence data, which relies on clustering reads by sequence similarity into Operational Taxonomic Units (OTUs), underexploits the accuracy of modern sequencing technology. We present a clustering-free approach to multi-sample Illumina datasets that can identify independent bacterial subpopulations regardless of the similarity of their 16S tag sequences. Using published data from a longitudinal time-series study of human tongue microbiota, we are able to resolve within standard 97% similarity OTUs up to 20 distinct subpopulations, all ecologically distinct but with 16S tags differing by as little as 1 nucleotide (99.2% similarity). A comparative analysis of oral communities of two cohabiting individuals reveals that most such subpopulations are shared between the two communities at 100% sequence identity, and that dynamical similarity between subpopulations in one host is strongly predictive of dynamical similarity between the same subpopulations in the other host. Our method can also be applied to samples collected in cross-sectional studies and can be used with the 454 sequencing platform. We discuss how the sub-OTU resolution of our approach can provide new insight into factors shaping community assembly. Comment: Updated to match the published version. 12 pages, 5 figures + supplement. Significantly revised for clarity, references added, results not change
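The similarity figure quoted above can be checked with a small identity calculation. This is a sketch for illustration only: the abstract does not state the tag length, so the 125-nucleotide tag below is an assumption chosen because a single mismatch in 125 positions gives 99.2%, and `percent_identity` is a hypothetical helper, not part of the authors' method.

```python
def percent_identity(a, b):
    """Percentage of matching positions between two equal-length tags."""
    if len(a) != len(b):
        raise ValueError("tags must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


tag_a = "ACGTA" * 25        # hypothetical 125-nt tag (length is an assumption)
tag_b = "T" + tag_a[1:]     # the same tag with one substitution at position 1

identity = percent_identity(tag_a, tag_b)
print(f"{identity:.1f}% identity")
```

Two such tags fall well inside a single 97% OTU, which is why clustering at that threshold cannot separate them.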

    Expedited batch processing and analysis of transposon insertions

    Background: With advances in sequencing technology, greater and greater amounts of eukaryotic genome data are becoming available. Often, large portions of these genomes consist of transposable elements, frequently accounting for 50% or more in vertebrates. Each transposable element family may have thousands or tens of thousands of individual copies within a given genome, and therefore it can take an exorbitant amount of time and effort to process data in a meaningful fashion. Findings: In order to combat this problem, we developed a set of bioinformatics techniques and programs to streamline the analysis. This includes a unique Perl script which automates the process of taking BLAST, Repeatmasker and similar data to extract and manipulate the hit sequences from the genome. This script, called Process_hits, uses an object-oriented methodology to compile all hit locations from a given file for processing, organize this data into useable categories, and output it in multiple formats. Conclusions: The program proved capable of handling large amounts of transposon data in an efficient fashion. It is equipped with a number of useful sub-functions, each of which is contained within its own sub-module to allow for greater expandability and as a foundation for future program design.
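The general workflow described above (compile hit locations from a search report, then pull the hit sequences out of the genome) can be sketched as follows. This is not the Process_hits script, which is written in Perl and whose interface is not shown here. The sketch assumes BLAST tabular output (the 12-column `-outfmt 6` layout), and the names `extract_hits` and `reverse_complement` and the toy data are hypothetical.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def extract_hits(blast_lines, genome):
    """Collect hit locations and sequences from BLAST tabular lines.

    Assumes columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore (1-based, inclusive coords).
    `genome` maps sequence IDs to sequence strings.
    """
    hits = []
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        sstart, send = int(fields[8]), int(fields[9])
        lo, hi = min(sstart, send), max(sstart, send)
        seq = genome[subject][lo - 1:hi]
        # A minus-strand hit is reported with sstart greater than send.
        strand = "+" if sstart <= send else "-"
        if strand == "-":
            seq = reverse_complement(seq)
        hits.append((query, subject, lo, hi, strand, seq))
    return hits


# Toy data: one plus-strand and one minus-strand hit on a made-up contig.
genome = {"chr1": "AAACCCGGGTTT"}
blast_lines = [
    "te1\tchr1\t100.0\t3\t0\t0\t1\t3\t4\t6\t1e-5\t10.0",
    "te1\tchr1\t100.0\t3\t0\t0\t1\t3\t12\t10\t1e-5\t10.0",
]
hits = extract_hits(blast_lines, genome)
for hit in hits:
    print(hit)
```

Normalising coordinates to `lo <= hi` plus an explicit strand keeps downstream steps, such as merging or exporting to other formats, independent of how the search tool reports orientation.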

    Improving the Alignment Quality of Consistency Based Aligners with an Evaluation Function Using Synonymous Protein Words

    Most sequence alignment tools can successfully align protein sequences with higher levels of sequence identity. The accuracy of corresponding structure alignment, however, decreases rapidly when considering distantly related sequences (<20% identity). In this range of identity, alignments optimized so as to maximize sequence similarity are often inaccurate from a structural point of view. Over the last two decades, most multiple protein aligners have been optimized for their capacity to reproduce structure-based alignments while using sequence information. Methods currently available differ essentially in the similarity measurement between aligned residues using substitution matrices, Fourier transform, sophisticated profile-profile functions, or consistency-based approaches, more recently

    The Minang Kaluak Paku Kacang Balimbiang Motif in Casual Wear

    The Minangkabau, one of the ethnic groups that contribute to the distinctiveness of Indonesian culture, have a cultural heritage spread across many aspects of their lives. One part of that heritage is the art of carving. Carving, developed by taking ideas from nature, carries philosophical meanings for the life of the Minangkabau people. All the types of carving on the Rumah Gadang show that the important elements shaping Minangkabau culture reflect what exists in nature. One of the carvings on the rumah gadang is kaluak paku. Kaluak paku is the name of a carving motif in Minangkabau custom. It derives from the coiled (kelukan/kaluak) tip of a young fern (paku). The kaluak paku carving on the rumah gadang symbolizes a man's responsibility in Minangkabau custom to the next generation, as father to his children and as mamak to his kemenakan (nieces and nephews). This kaluak paku carving of the Minangkabau rumah gadang is the source of ideas for the garments created in this final project. The creation of this work uses several methods: aesthetic and ergonomic approaches, data collection through literature study, and a creation method based on the theory of Gustami Sp, with 3 stages and 6 steps. Making the work required several kinds of data; reference data were collected from literature sources, namely books, journals on social media, and smartphone applications such as pinterest. The most important data collected were images of the visual form of the Minangkabau kaluak paku plant carving and of casual wear. The resulting work consists of 8 casual garments. All of the pieces have an A silhouette that flares at the bottom. The main material used in this work is primisima. The color combination uses the characteristic Minangkabau colors taken from the traditional flag, the “marawa”: red, black, and yellow. The works produced with these colors fit well with the theme, which is based on the kaluak paku carving of the Minangkabau rumah gadang. Keywords: Minang, Kaluak Paku Kacang Balimbiang, Kasua

    Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment

    Multiple sequence alignment (MSA) is a fundamental and ubiquitous technique in bioinformatics used to infer related residues among biological sequences. Thus alignment accuracy is crucial to a vast range of analyses, often in ways difficult to assess in those analyses. To compare the performance of different aligners and help detect systematic errors in alignments, a number of benchmarking strategies have been pursued. Here we present an overview of the main strategies--based on simulation, consistency, protein structure, and phylogeny--and discuss their different advantages and associated risks. We outline a set of desirable characteristics for effective benchmarking, and evaluate each strategy in light of them. We conclude that there is currently no universally applicable means of benchmarking MSA, and that developers and users of alignment tools should choose a benchmark according to the context of application--with a keen awareness of the assumptions underlying each benchmarking strategy. Comment: Revie

    Optimizing substitution matrix choice and gap parameters for sequence alignment

    Background: While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is also not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than homolog recognition. Here a new parameter optimization procedure, POP, is described and applied to the problems of optimizing gap penalties and selecting substitution matrices for pair-wise global protein alignments. Results: POP is compared to a recent method due to Kim and Kececioglu and found to achieve from 0.2% to 1.3% higher accuracies on pair-wise benchmarks extracted from BALIBASE. The VTML matrix series is shown to be the most accurate on several global pair-wise alignment benchmarks, with VTML200 giving best or close to the best performance in all tests. BLOSUM matrices are found to be slightly inferior, even with the marginal improvements in the bug-fixed RBLOSUM series. The PAM series is significantly worse, giving accuracies typically 2% less than VTML. Integer rounding is found to cause slight degradations in accuracy. No evidence is found that selecting a matrix based on sequence divergence improves accuracy, suggesting that the use of this heuristic in CLUSTALW may be ineffective. Using VTML200 is found to improve the accuracy of CLUSTALW by 8% on BALIBASE and 5% on PREFAB. Conclusion: The hypothesis that more accurate alignments of distantly related sequences may be achieved using low-identity matrices is shown to be false for commonly used matrix types. Source code and test data are freely available from the author's web site at http://www.drive5.com/pop.
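To make the roles of the two parameter types concrete, the following sketch scores a pair-wise global alignment with a pluggable substitution function and a gap penalty. It is an illustration under stated assumptions, not POP: it uses a linear gap penalty rather than the affine (open/extend) penalties common in practice, and `toy_score` is a made-up match/mismatch scheme standing in for a real matrix such as VTML200.

```python
def global_alignment_score(a, b, score, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    `score(x, y)` plays the role of the substitution matrix; `gap` is the
    per-residue gap penalty. Only two rows of the DP table are kept.
    """
    m = len(b)
    prev = [j * gap for j in range(m + 1)]   # row 0: b aligned against gaps
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * m            # column 0: a aligned against gaps
        for j in range(1, m + 1):
            cur[j] = max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                prev[j] + gap,                            # gap in b
                cur[j - 1] + gap,                         # gap in a
            )
        prev = cur
    return prev[m]


def toy_score(x, y):
    """Hypothetical substitution scores: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


identical = global_alignment_score("GATTACA", "GATTACA", toy_score)
one_gap = global_alignment_score("GATTACA", "GATACA", toy_score)
harsher = global_alignment_score("GATTACA", "GATACA", toy_score, gap=-5)
print(identical, one_gap, harsher)
```

Changing `gap` or swapping `toy_score` for another scoring function changes which alignment is optimal, which is the parameter space that a procedure like the one described above has to search.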

    Grammar-based distance in progressive multiple sequence alignment

    Background: We propose a multiple sequence alignment (MSA) algorithm and compare the alignment quality and execution time of the proposed algorithm with those of existing algorithms. The proposed progressive alignment algorithm uses a grammar-based distance metric to determine the order in which biological sequences are to be pairwise aligned. The progressive alignment occurs via pairwise aligning new sequences with an ensemble of the sequences previously aligned. Results: The performance of the proposed algorithm is validated via comparison to popular progressive multiple alignment approaches, ClustalW and T-Coffee, and to the more recently developed algorithms MAFFT, MUSCLE, Kalign, and PSAlign using the BAliBASE 3.0 database of amino acid alignment files and a set of longer sequences generated by Rose software. The proposed algorithm has successfully built multiple alignments comparable to other programs with significant improvements in running time. The results are especially striking for large datasets. Conclusion: We introduce a computationally efficient progressive alignment algorithm using a grammar-based sequence distance particularly useful in aligning large datasets
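The idea of ordering pairwise alignments by a grammar-style distance can be sketched as below. The abstract does not give the metric, so everything here is an assumption for illustration: the complexity measure is an LZ78-style phrase count, the distance formula is borrowed from the normalized-compression-distance family, and the names `lz78_phrase_count` and `grammar_distance` are hypothetical.

```python
from itertools import combinations


def lz78_phrase_count(s):
    """Number of phrases in an LZ78-style parse of s (a complexity proxy)."""
    seen = set()
    phrase = ""
    count = 0
    for ch in s:
        phrase += ch
        if phrase not in seen:   # a new phrase ends here
            seen.add(phrase)
            count += 1
            phrase = ""
    if phrase:                   # trailing, already-seen phrase
        count += 1
    return count


def grammar_distance(a, b):
    """Assumed distance: extra phrases needed to describe a+b, normalised.

    Not symmetric in general, because parsing a+b can differ from b+a.
    """
    ca, cb = lz78_phrase_count(a), lz78_phrase_count(b)
    return (lz78_phrase_count(a + b) - min(ca, cb)) / max(ca, cb)


# Toy sequences: s1 and s2 share most of their structure, s3 does not.
seqs = {"s1": "ACACACAC", "s2": "ACACACAT", "s3": "GTGTGTGT"}

# A progressive aligner would start with the closest pair.
closest = min(combinations(seqs, 2),
              key=lambda pair: grammar_distance(seqs[pair[0]], seqs[pair[1]]))
print("align first:", closest)
```

Because a distance of this kind needs no alignment to compute, the guide order can be built without the all-against-all pairwise alignment step, which is one plausible source of the running-time gains reported for large datasets.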