3 research outputs found

    Clustering of protein families into functional subtypes using Relative Complexity Measure with reduced amino acid alphabets

    Background: Phylogenetic analysis can be used to divide a protein family into subfamilies in the absence of experimental information. Most phylogenetic analysis methods utilize multiple alignment of sequences and are based on an evolutionary model. However, multiple alignment is not an automated procedure and requires human intervention to maintain alignment integrity and to produce phylogenies consistent with the functional splits in underlying sequences. To address this problem, we propose to use the alignment-free Relative Complexity Measure (RCM) combined with reduced amino acid alphabets to cluster protein families into functional subtypes purely on sequence criteria. Comparison with an alignment-based approach was also carried out to test the quality of the clustering. Results: We demonstrate the robustness of RCM with reduced alphabets in clustering of protein sequences into families in a simulated dataset and seven well-characterized protein datasets. On protein datasets, crotonases, mandelate racemases, nucleotidyl cyclases and glycoside hydrolase family 2 were clustered into subfamilies with 100% accuracy, whereas acyl transferase domains, haloacid dehalogenases, and vicinal oxygen chelates could be assigned to subfamilies with 97.2%, 96.9% and 92.2% accuracies, respectively. Conclusions: The overall combination of methods in this paper is useful for clustering protein families into subtypes based solely on protein sequence information. The method is also flexible and computationally fast because it does not require multiple alignment of sequences.
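    As a rough illustration of the idea in this abstract, the sketch below maps a protein sequence onto a reduced alphabet and then compares sequences with a Lempel-Ziv-based distance. The six-group alphabet, the function names, and the particular normalized distance are assumptions chosen for illustration; the abstract does not state which reduced alphabets or which exact RCM formula the authors used.

```python
# Hypothetical six-group reduced alphabet (illustrative only, not the
# grouping used in the paper). Each residue maps to its group's first letter.
GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_alphabet(seq):
    """Rewrite a protein sequence over the reduced alphabet; unknowns -> 'X'."""
    return "".join(REDUCE.get(aa, "X") for aa in seq.upper())


def lz_complexity(s):
    """Count phrases in an LZ76-style parse of s.

    Each phrase is the shortest substring starting at the current position
    that cannot be copied from earlier text (overlapping copies allowed).
    This is a simple quadratic-or-worse version meant for short sequences.
    """
    i, n, phrases = 0, len(s), 0
    while i < n:
        k = 1
        # Grow the phrase while it still occurs starting before position i.
        while i + k <= n and s[i:i + k] in s[:i + k - 1]:
            k += 1
        phrases += 1
        i += k
    return phrases


def lz_distance(s, q):
    """One normalized LZ-based distance (assumed form, see lead-in):
    max(c(sq) - c(s), c(qs) - c(q)) / max(c(s), c(q))."""
    cs, cq = lz_complexity(s), lz_complexity(q)
    return max(lz_complexity(s + q) - cs, lz_complexity(q + s) - cq) / max(cs, cq)


if __name__ == "__main__":
    a = reduce_alphabet("MKVLDEGPSTNQ")  # toy sequences, not real proteins
    b = reduce_alphabet("MKILDEGPSTNQ")
    print(a, b, lz_distance(a, b))
```

    A pairwise matrix of such distances could then be passed to any standard clustering or tree-building routine. That step is omitted here because the abstract does not describe it.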

    Characterization and use of structure and complexity of DNA sequences

    In this dissertation we analyze biological sequences using two proposed methods of characterization. The first method uses the Average Mutual Information (AMI) profile of the sequences. This captures the statistical properties of the strings and provides a concise representation. The second method utilizes the notion of “complexity.” Using the Lempel-Ziv (LZ) complexity measure we define a distance metric for sequences. We use AMI profiles to solve the fragment assembly problem, which is to reconstruct a target DNA sequence from randomly sampled fragments. Most existing fragment assembly techniques follow the overlap-layout-consensus approach, which requires extensive computation in each phase and becomes inefficient with increasing numbers of fragments. We propose a new algorithm which jointly solves the overlap, layout, and consensus problems. The fragments are clustered with respect to their AMI profiles using the k-means algorithm. This removes the unnecessary requirement that the collection of fragments be considered as a whole. Instead, orientation and overlap detection are solved efficiently within the clusters. We apply the second method of characterization to phylogeny construction. Most existing approaches for phylogenetic inference use multiple alignment of sequences and assume some form of evolutionary model. The multiple alignment strategy does not work for all types of data, e.g. whole genome phylogeny, and the evolutionary models may not always be correct. We propose a new sequence distance measure based on the relative information between the sequences using LZ complexity. The distance matrix thus obtained can be used to construct phylogenetic trees. The proposed approach does not require sequence alignment and is fully automatic. The proposed methods are not limited to the applications studied in this dissertation. They capture universal properties of the sequences and can be used to tackle other problems posed by computational biology.
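    To make the AMI profile mentioned in this abstract more concrete, the sketch below computes the mutual information between symbols separated by a lag k, for k = 1 up to a chosen maximum. The function name, the use of base-2 logarithms, and the choice to estimate the marginals from the lagged pairs themselves are assumptions made for illustration; the dissertation's exact estimator is not given in the abstract.

```python
from collections import Counter
from math import log2


def ami_profile(seq, max_lag):
    """Return [I(1), ..., I(max_lag)], where I(k) is the mutual information
    (in bits) between the symbol at position i and the symbol at i + k."""
    n = len(seq)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    profile = []
    for k in range(1, max_lag + 1):
        total = n - k                        # number of (i, i + k) pairs
        pairs = Counter(zip(seq, seq[k:]))   # joint counts at lag k
        left = Counter(seq[:-k])             # marginal counts, first symbol
        right = Counter(seq[k:])             # marginal counts, second symbol
        mi = 0.0
        for (x, y), count in pairs.items():
            p_xy = count / total
            mi += p_xy * log2(p_xy / ((left[x] / total) * (right[y] / total)))
        profile.append(mi)
    return profile


if __name__ == "__main__":
    # Toy DNA string, not real data.
    print(ami_profile("ACGTACGTTTGACCAGTACG", max_lag=5))
```

    Each fragment then yields a fixed-length vector regardless of its own length, which is what makes a vector-based method such as k-means applicable to fragments of differing sizes.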

    Prediction of peptides binding to MHC class I and II alleles by temporal motif mining

    Background: MHC (Major Histocompatibility Complex) is a key player in the immune response of most vertebrates. The computational prediction of whether a given antigenic peptide will bind to a specific MHC allele is important in the development of vaccines for emerging pathogens, in the control of immune response, and in applications of immunotherapy. One of the problems that make this computational prediction difficult is the detection of the binding core region in peptides, coupled with the presence of bulges and loops causing variations in the total sequence length. Most machine learning methods require the sequences to be of the same length to successfully discover the binding motifs, ignoring the length variance in both motif mining and prediction steps. In order to overcome this limitation, we propose the use of time-based motif mining methods that work position-independently. Results: The prediction method was tested on a benchmark set of 28 different alleles for MHC class I and 27 different alleles for MHC class II. The obtained results are comparable to the state-of-the-art methods for both MHC classes, surpassing the published results for some alleles. The average prediction AUC values are 0.897 for class I and 0.858 for class II. Conclusions: Temporal motif mining using partial periodic patterns can capture information about the sequences well enough to predict the binding of the peptides and is comparable to state-of-the-art methods in the literature. Unlike neural networks or matrix-based predictors, our proposed method does not depend on peptide length and can work with both short and long fragments. This advantage allows better use of the available training data and the prediction of peptides of uncommon lengths.
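    For readers unfamiliar with the AUC figures quoted in this abstract, the sketch below shows one standard way such a value can be computed: as the fraction of (binder, non-binder) pairs in which the binder receives the higher score, with ties counted as half. The function name and the scores are hypothetical toy values, not data or code from the paper.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison:
    P(score of a positive > score of a negative), ties counted as 0.5."""
    if not scores_pos or not scores_neg:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in scores_pos:
        for q in scores_neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


if __name__ == "__main__":
    binders = [0.9, 0.4]       # hypothetical predictor scores for binders
    non_binders = [0.5, 0.1]   # hypothetical scores for non-binders
    print(auc(binders, non_binders))
```

    A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of binders from non-binders.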