166 research outputs found

    Splitting the BLOSUM Score into Numbers of Biological Significance

    Get PDF
    Mathematical tools developed in the context of Shannon information theory were used to analyze the meaning of the BLOSUM score, which was split into three components termed as the BLOSUM spectrum (or BLOSpectrum). These relate respectively to the sequence convergence (the stochastic similarity of the two protein sequences), to the background frequency divergence (typicality of the amino acid probability distribution in each sequence), and to the target frequency divergence (compliance of the amino acid variations between the two sequences to the protein model implicit in the BLOCKS database). This treatment sharpens the protein sequence comparison, providing a rationale for the biological significance of the obtained score, and helps to identify weakly related sequences. Moreover, the BLOSpectrum can guide the choice of the most appropriate scoring matrix, tailoring it to the evolutionary divergence associated with the two sequences, or indicate if a compositionally adjusted matrix could perform better

    Improvements on Seeding Based Protein Sequence Similarity Search

    Get PDF
    The primary goal of bioinformatics is to increase an understanding in the biology of organisms. Computational, statistical, and mathematical theories and techniques have been developed on formal and practical problems that assist to achieve this primary goal. For the past three decades, the primary application of bioinformatics has been biological data analysis. The DNA or protein sequence similarity search is perhaps the most common, yet vitally important task for analyzing biological data. The sequence similarity search is a process of finding optimal sequence alignments. On the theoretical level, the problem of sequence similarity search is complex. On the applicational level, the sequences similarity search onto a biological database has been one of the most basic tasks today. Using traditional quadratic time complexity solutions becomes a challenge due to the size of the database. Seeding (or filtration) based approaches, which trade sensitivity for speed, are a popular choice among those available. Two main phases usually exist in a seeding based approach. The first phase is referred to as the hit generation, and the second phase is referred to as the hit extension. In this thesis, two improvements on the seeding based protein sequence similarity search are presented. First, for the hit generation, a new seeding idea, namely spaced k-mer neighbors, is presented. We present our effective algorithms to find a good set of spaced k-mer neighbors. Secondly, for the hit generation, a new method, namely HexFilter, is proposed to reduce the number of hit extensions while achieving better selectivity. We show our HexFilters with optimized configurations

    Protein Tertiary Model Assessment Using Granular Machine Learning Techniques

    Get PDF
    The automatic prediction of protein three dimensional structures from its amino acid sequence has become one of the most important and researched fields in bioinformatics. As models are not experimental structures determined with known accuracy but rather with prediction it’s vital to determine estimates of models quality. We attempt to solve this problem using machine learning techniques and information from both the sequence and structure of the protein. The goal is to generate a machine that understands structures from PDB and when given a new model, predicts whether it belongs to the same class as the PDB structures (correct or incorrect protein models). Different subsets of PDB (protein data bank) are considered for evaluating the prediction potential of the machine learning methods. Here we show two such machines, one using SVM (support vector machines) and another using fuzzy decision trees (FDT). First using a preliminary encoding style SVM could get around 70% in protein model quality assessment accuracy, and improved Fuzzy Decision Tree (IFDT) could reach above 80% accuracy. For the purpose of reducing computational overhead multiprocessor environment and basic feature selection method is used in machine learning algorithm using SVM. Next an enhanced scheme is introduced using new encoding style. In the new style, information like amino acid substitution matrix, polarity, secondary structure information and relative distance between alpha carbon atoms etc is collected through spatial traversing of the 3D structure to form training vectors. This guarantees that the properties of alpha carbon atoms that are close together in 3D space and thus interacting are used in vector formation. With the use of fuzzy decision tree, we obtained a training accuracy around 90%. There is significant improvement compared to previous encoding technique in prediction accuracy and execution time. This outcome motivates to continue to explore effective machine learning algorithms for accurate protein model quality assessment. Finally these machines are tested using CASP8 and CASP9 templates and compared with other CASP competitors, with promising results. We further discuss the importance of model quality assessment and other information from proteins that could be considered for the same

    Multiple Biolgical Sequence Alignment: Scoring Functions, Algorithms, and Evaluations

    Get PDF
    Aligning multiple biological sequences such as protein sequences or DNA/RNA sequences is a fundamental task in bioinformatics and sequence analysis. These alignments may contain invaluable information that scientists need to predict the sequences\u27 structures, determine the evolutionary relationships between them, or discover drug-like compounds that can bind to the sequences. Unfortunately, multiple sequence alignment (MSA) is NP-Complete. In addition, the lack of a reliable scoring method makes it very hard to align the sequences reliably and to evaluate the alignment outcomes. In this dissertation, we have designed a new scoring method for use in multiple sequence alignment. Our scoring method encapsulates stereo-chemical properties of sequence residues and their substitution probabilities into a tree-structure scoring scheme. This new technique provides a reliable scoring scheme with low computational complexity. In addition to the new scoring scheme, we have designed an overlapping sequence clustering algorithm to use in our new three multiple sequence alignment algorithms. One of our alignment algorithms uses a dynamic weighted guidance tree to perform multiple sequence alignment in progressive fashion. The use of dynamic weighted tree allows errors in the early alignment stages to be corrected in the subsequence stages. Other two algorithms utilize sequence knowledge-bases and sequence consistency to produce biological meaningful sequence alignments. To improve the speed of the multiple sequence alignment, we have developed a parallel algorithm that can be deployed on reconfigurable computer models. Analytically, our parallel algorithm is the fastest progressive multiple sequence alignment algorithm
    corecore