A novel scheme is introduced to capture the spatial correlations of
consecutive amino acids in naturally occurring proteins. This knowledge-based
strategy optimally carries out automated subdivisions of protein
fragments into classes of similarity. The goal is to provide the minimal set of
protein oligomers (termed ``oligons'' for brevity) that is able to represent
any other fragment. Unlike previous studies, which classified recurrent local
motifs, our concern is to provide simplified protein
representations that have been optimised for use in automated folding and/or
design attempts. In such contexts it is paramount to limit the number of
degrees of freedom per amino acid without incurring a loss of accuracy in the
structural representations. The suggested method finds, by construction, the
optimal compromise between these needs. Several possible oligon lengths are
considered. It is shown that meaningful classifications cannot be obtained for
lengths greater than 6 or smaller than 4. Different contexts are considered
in which oligons of length 5 or 6 are recommended. With only a few dozen
oligons of such length, virtually any protein can be reproduced within typical
experimental uncertainties. Structural data for the oligons is made publicly
available.

Comment: 19 pages, 13 PostScript figures