89,558 research outputs found

    RasBhari: optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison

    Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don't-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de
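    To make the idea of match and don't-care positions concrete, the following is a minimal Python sketch of pattern-based word matching. It is not rasbhari's implementation and does not optimize pattern sets; the function names `spaced_words` and `count_matches`, the encoding of patterns as strings of '1' (match) and '0' (don't care), and the example sequences are all assumptions made for illustration.

```python
from collections import Counter


def spaced_words(seq, pattern):
    """Extract the spaced words of seq under a binary pattern.

    Assumed encoding (illustrative): '1' marks a match position,
    '0' marks a don't-care position.
    """
    if not pattern or set(pattern) - {"0", "1"}:
        raise ValueError("pattern must be a non-empty string of '0' and '1'")
    match_pos = [i for i, c in enumerate(pattern) if c == "1"]
    # Slide the pattern along the sequence and keep only match positions.
    return [
        "".join(seq[start + i] for i in match_pos)
        for start in range(len(seq) - len(pattern) + 1)
    ]


def count_matches(seq_a, seq_b, patterns):
    """Count pairs of positions whose spaced words agree, summed over a pattern set."""
    total = 0
    for pattern in patterns:
        counts_a = Counter(spaced_words(seq_a, pattern))
        counts_b = Counter(spaced_words(seq_b, pattern))
        total += sum(n * counts_b[word] for word, n in counts_a.items())
    return total


if __name__ == "__main__":
    # The two sequences differ only at their second position.
    print(spaced_words("ACGT", "101"))
    print(count_matches("ACGT", "ATGT", ["101"]))  # spaced pattern
    print(count_matches("ACGT", "ATGT", ["111"]))  # contiguous word
```

    In this toy case the spaced pattern still finds a match because the substitution falls on a don't-care position, whereas the contiguous word of the same length does not. How many matches a pattern set yields, and how much that number varies, is the kind of quantity the abstract says rasbhari optimizes.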

    RNA secondary structure prediction from multi-aligned sequences

    It has been well accepted that the RNA secondary structures of most functional non-coding RNAs (ncRNAs) are closely related to their functions and are conserved during evolution. Hence, prediction of conserved secondary structures from evolutionarily related sequences is an important task in RNA bioinformatics; the methods are useful not only to further functional analyses of ncRNAs but also to improve the accuracy of secondary structure predictions and to find novel functional RNAs from the genome. In this review, I focus on common secondary structure prediction from a given aligned RNA sequence, in which one secondary structure whose length is equal to that of the input alignment is predicted. I systematically review and classify existing tools and algorithms for the problem, by utilizing the information employed in the tools and by adopting a unified viewpoint based on maximum expected gain (MEG) estimators. I believe that this classification will allow a deeper understanding of each tool and provide users with useful information for selecting tools for common secondary structure predictions. Comment: A preprint of an invited review manuscript that will be published in a chapter of the book 'Methods in Molecular Biology'. Note that this version of the manuscript may differ from the published version.

    Machine learning-guided directed evolution for protein engineering

    Machine learning (ML)-guided directed evolution is a new paradigm for biological design that enables optimization of complex functions. ML methods use data to predict how sequence maps to function without requiring a detailed model of the underlying physics or biological pathways. To demonstrate ML-guided directed evolution, we introduce the steps required to build ML sequence-function models and use them to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to using ML for protein engineering as well as the current literature and applications of this new engineering paradigm. ML methods accelerate directed evolution by learning from information contained in all measured variants and using that information to select sequences that are likely to be improved. We then provide two case studies that demonstrate the ML-guided directed evolution process. We also look to future opportunities where ML will enable discovery of new protein functions and uncover the relationship between protein sequence and function. Comment: Made significant revisions to focus on aspects most relevant to applying machine learning to speed up directed evolution.

    Spaced seeds improve k-mer-based metagenomic classification

    Metagenomics is a powerful approach to study the genetic content of environmental samples that has been strongly promoted by NGS technologies. To cope with the massive data involved in modern metagenomic projects, recent tools [4, 39] rely on the analysis of k-mers shared between the read to be classified and sampled reference genomes. Within this general framework, we show in this work that spaced seeds provide a significant improvement of classification accuracy as opposed to traditional contiguous k-mers. We support this thesis through a series of different computational experiments, including simulations of large-scale metagenomic projects. Scripts and programs used in this study, as well as supplementary material, are available from http://github.com/gregorykucherov/spaced-seeds-for-metagenomics. Comment: 23 pages.

    Traffic monitoring using image processing : a thesis presented in partial fulfillment of the requirements for the degree of Master of Engineering in Information and Telecommunications Engineering at Massey University, Palmerston North, New Zealand

    Traffic monitoring involves the collection of data describing the characteristics of vehicles and their movements. Such data may be used for automatic tolls, congestion and incident detection, law enforcement, and road capacity planning. With the recent advances in computer vision technology, videos can be analysed automatically and relevant information can be extracted for particular applications. Automatic surveillance using video cameras with image processing techniques is becoming a powerful and useful technology for traffic monitoring. In this research project, a video image processing system for traffic monitoring, including vehicle tracking, counting, and classification, is developed; it has the potential to be extended to real-time application. A heuristic approach is applied in developing this system. The system is divided into several parts, and several different functional components have been built and tested using some traffic video sequences. Evaluations are carried out to show that this system is robust and can be developed towards real-time applications.

    MRI-only based radiotherapy treatment planning for the rat brain on a Small Animal Radiation Research Platform (SARRP)

    Computed tomography (CT) is the standard imaging modality in radiation therapy treatment planning (RTP). However, magnetic resonance (MR) imaging provides superior soft tissue contrast, increasing the precision of target volume selection. We present MR-only based RTP for a rat brain on a small animal radiation research platform (SARRP) using probabilistic voxel classification with multiple MR sequences. Six rat heads were imaged, each with one CT and five MR sequences. The MR sequences were: T1-weighted, T2-weighted, zero-echo time (ZTE), and two ultra-short echo time sequences with 20 μs (UTE1) and 2 ms (UTE2) echo times. CT data were manually segmented into air, soft tissue, and bone to obtain the RTP reference. Bias field corrected MR images were automatically segmented into the same tissue classes using a fuzzy c-means segmentation algorithm with multiple images as input. Similarities between segmented CT and automatic segmented MR (ASMR) images were evaluated using Dice coefficient. Three ASMR images with high similarity index were used for further RTP. Three beam arrangements were investigated. Dose distributions were compared by analysing dose volume histograms. The highest Dice coefficients were obtained for the ZTE-UTE2 combination and for the T1-UTE1-T2 combination when ZTE was unavailable. Both combinations, along with UTE1-UTE2, often used to generate ASMR images, were used for further RTP. Using 1 beam, MR based RTP underestimated the dose to be delivered to the target (range: 1.4%-7.6%). When more complex beam configurations were used, the calculated dose using the ZTE-UTE2 combination was the most accurate, with 0.7% deviation from CT, compared to 0.8% for T1-UTE1-T2 and 1.7% for UTE1-UTE2. The presented MR-only based workflow for RTP on a SARRP enables both accurate organ delineation and dose calculations using multiple MR sequences. This method can be useful in longitudinal studies where CT's cumulative radiation dose might contribute to the total dose.
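    The Dice coefficient used above to compare segmentations is, for one tissue class, twice the number of voxels that both segmentations assign to that class, divided by the total number of voxels each assigns to it. The sketch below only illustrates that formula and is not the study's pipeline: the function name `dice_coefficient`, the flat label lists, and the toy voxel labels are assumptions made for the example and are not data from the paper.

```python
def dice_coefficient(labels_a, labels_b, tissue_class):
    """Dice overlap of one tissue class between two voxel labelings.

    labels_a and labels_b are equal-length sequences of per-voxel class
    labels (an assumed, simplified representation of a segmentation).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("segmentations must cover the same voxels")
    in_a = sum(1 for a in labels_a if a == tissue_class)
    in_b = sum(1 for b in labels_b if b == tissue_class)
    in_both = sum(
        1 for a, b in zip(labels_a, labels_b) if a == tissue_class and b == tissue_class
    )
    if in_a + in_b == 0:
        # Convention chosen here: a class absent from both counts as perfect agreement.
        return 1.0
    return 2.0 * in_both / (in_a + in_b)


if __name__ == "__main__":
    # Toy labels only; not data from the study.
    ct_labels = ["air", "bone", "bone", "soft", "soft", "soft"]
    mr_labels = ["air", "bone", "soft", "soft", "soft", "bone"]
    for tissue in ("air", "soft", "bone"):
        print(tissue, dice_coefficient(ct_labels, mr_labels, tissue))
```

    A value of 1 means the two segmentations agree on every voxel of that class, and 0 means they share none.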