Training Set Construction.

Abstract

<p>Fig 1A illustrates the simplest method for training set construction. Each genome (gray circles) is treated as a “bag of genes”; distance relationships between genes are ignored. One hidden Markov model (HMM) identifies target family proteins (orange squares) in the corresponding proteome. A second HMM finds proteins from a second family (yellow stars), whose presence or absence in the proteome is the attribute that controls how target family proteins are sorted. If an attribute family protein is found, members of the target family are sorted to the YES set (green container). If not, target family proteins go to the NO set (red container). The training set builder (TSB) always works on one target protein family at a time, but more complicated rules may require multiple attributes to be jointly present for the YES set, and multiple attributes to be jointly absent for the NO set. Fig 1B shows training set construction using a distance rule. The S-shaped curve represents a long segment of genomic DNA. A target protein is sorted to the YES set if and only if its gene lies within a user-specified distance of the attribute protein’s gene. Target proteins from genomes that lack the attribute completely go to the NO set. A target protein goes to the FAR set if and only if its gene lies sufficiently far from the nearest attribute gene and the genome has already sent a target protein to the YES set. If a genome encodes an attribute family protein, but no target family protein qualifies for the YES set, then target family proteins are not sorted to any bin.</p>
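The two sorting rules described in the caption can be sketched in a few lines of Python. This is an illustrative sketch only, not the TSB's actual implementation. The `Gene` record, the function names, and the `near` and `far` parameters are all hypothetical. It also assumes that HMM hits have already been reduced to lists of target and attribute genes for one genome, that all genes lie on a single linear segment, and that "sufficiently far" means beyond a `far` threshold that defaults to the same value as `near`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical record for one HMM hit."""
    protein_id: str
    position: int  # assumed coordinate on a single linear segment, in bp


def sort_bag_of_genes(targets, attributes):
    """Fig 1A rule for one genome: gene positions are ignored.

    Returns the bin name and the target proteins sent to it.
    """
    return ("YES" if attributes else "NO", list(targets))


def sort_by_distance(targets, attributes, near, far=None):
    """Fig 1B rule for one genome, returning YES, NO and FAR bins.

    `near` stands in for the user-specified distance; `far` is an
    assumed second threshold that defaults to `near`.
    """
    far = near if far is None else far
    if far < near:
        raise ValueError("far must be at least as large as near")

    bins = {"YES": [], "NO": [], "FAR": []}

    # Genomes that lack the attribute send every target to NO.
    if not attributes:
        bins["NO"] = list(targets)
        return bins

    def nearest(target):
        # Distance from a target gene to its closest attribute gene.
        return min(abs(target.position - a.position) for a in attributes)

    yes = [t for t in targets if nearest(t) <= near]

    # Attribute present but no target qualifies: nothing is sorted.
    if not yes:
        return bins

    bins["YES"] = yes
    # FAR is filled only once the genome has contributed to YES.
    bins["FAR"] = [t for t in targets if nearest(t) > far]
    return bins
```

Calling `sort_by_distance` once per genome and concatenating the bins across genomes would give the pooled YES, NO and FAR sets. With `far` larger than `near`, targets that fall between the two thresholds stay unsorted, which is one possible reading of "sufficiently far".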
