The explosion of whole genome sequence and environmental sequence data
affords us the opportunity to explore protein diversity and protein function. This is
particularly exciting given the nascent field of synthetic biology. A comprehensive
computational analysis of extant proteins is needed in order to define the limitations on
protein structure and diversity from a bioengineering perspective. This paper focuses on
defining an upper limit for protein diversity using computational approaches derived
from linguistic analyses. These methods are used to predict an upper limit on the number
of unique proteins and on the number of highly conserved motifs. Motifs deemed highly
conserved will more than likely represent important structural components of basic
proteins. Results were gathered from two large data sets: all microbial genome sequences
currently available from NCBI and the Global Ocean Survey data set.
There were 6.6 million unique proteins at 95% amino acid identity. The majority of
unique motifs in these data sets were found only once. The motifs deemed highly
conserved in lifestyle groupings of organisms and individual organisms were analyzed
for function based on a conserved domain search. A relationship between pathogenicity
and genes and proteins related to cell motility and secretion was observed. These motifs
represent potential new drug targets or areas for future experimentation.
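
The observation that most unique motifs occur only once can be illustrated with a minimal sketch. It assumes, purely for illustration, that a motif is a fixed-length window of k residues; the abstract does not state the motif definition actually used, and the function names, the value of k, and the toy sequences below are all hypothetical.

```python
from collections import Counter


def count_motifs(sequences, k=3):
    """Count every overlapping window of k residues across all sequences.

    Treating a motif as a fixed-length k-mer is an assumption made for
    this sketch, not the definition used in the paper.
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def singleton_fraction(counts):
    """Return the fraction of distinct motifs that occur exactly once."""
    if not counts:
        return 0.0
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


# Toy amino acid sequences, invented for illustration only.
proteins = ["MKVLAAGI", "MKVLSAGI", "GGHKVLAA"]
counts = count_motifs(proteins, k=3)

print(len(counts), "distinct motifs")
print(f"{singleton_fraction(counts):.2f} of them occur once")
print("most frequent:", counts.most_common(1))
```

In this toy input, 7 of the 12 distinct motifs occur once, while a motif such as KVL that recurs in every sequence is the kind of candidate a conservation analysis would flag.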