Using Mutual Information Content of Protein Sequences for Classification

Abstract

Protein sequence classification is a challenging problem. We are attempting to use the Mutual Information Content of protein sequences to provide an alternative method of classification. Since the 3D structure, and ultimately the classification, of a protein depends on the relationships between amino acids within its sequence, measuring the dependency between amino acids at different distances in a sequence may prove a valuable classification tool. This document contains some preliminary results, but is primarily a description of the methods used to this point.

2 Mutual Information Content

The goal of analyzing Information Content is to provide an alternative method of measuring the relatedness of protein sequences. We hope to use the protein sequence to generate numerical data that is useful for comparing sequences. Converting a character sequence to numerical data opens up many possibilities for applying existing areas of study, including machine learning techniques, to biological sequences.

The method being tested for scoring sequences is to traverse the protein sequence from beginning to end and, at each position, record the amino acid composition of a certain number of blocks, each separated by a set distance, termed the gap length. The simplest version of this involves two blocks, each of a single amino acid. Using these parameters, the protein sequenc
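
The two-block, single-residue case described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the gap length is assumed to count the residues lying between the two blocks (so a gap of 0 pairs adjacent residues), and the score is assumed to be the standard mutual information of the paired residues, estimated from observed frequencies.

```python
from collections import Counter
from math import log2


def gapped_pairs(sequence, gap):
    """Collect residue pairs with `gap` residues between them.

    Assumption: a gap of 0 pairs each residue with its immediate neighbour.
    """
    offset = gap + 1
    return [(sequence[i], sequence[i + offset])
            for i in range(len(sequence) - offset)]


def mutual_information(sequence, gap):
    """Estimate mutual information (in bits) between paired residues.

    Uses the textbook definition sum p(a,b) * log2(p(a,b) / (p(a) * p(b))),
    with probabilities taken as observed frequencies. This scoring is an
    assumption made for illustration.
    """
    pairs = gapped_pairs(sequence, gap)
    if not pairs:
        return 0.0  # sequence too short for this gap
    total = len(pairs)
    joint = Counter(pairs)                  # counts of (first, second)
    first = Counter(a for a, _ in pairs)    # marginal counts, first block
    second = Counter(b for _, b in pairs)   # marginal counts, second block
    score = 0.0
    for (a, b), n in joint.items():
        # p(a,b) / (p(a) * p(b)) simplifies to n * total / (count_a * count_b)
        score += (n / total) * log2(n * total / (first[a] * second[b]))
    return score


def mi_profile(sequence, max_gap):
    """Return one mutual information value per gap length 0..max_gap."""
    return [mutual_information(sequence, gap) for gap in range(max_gap + 1)]


if __name__ == "__main__":
    # Made-up sequence, used only to show the call pattern.
    example = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    print(mi_profile(example, max_gap=5))
```

A profile of this kind turns a character sequence into a fixed-length numerical vector, one value per gap length, which is the sort of representation that could then be passed to a machine learning method. Frequency estimates from a single short sequence are noisy, so the values for short inputs should be read with caution.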
