The development of new-generation sequencers has made genome sequencing of any organism feasible in an ordinary laboratory. A typical data flow of de novo sequencing includes (1) assembly of sequence reads, (2) estimation of open reading frames, (3) annotation of proteins, and (4) finding RNA genes. The annotation is normally performed by BLASTP searches against several different databases. However, it is usually hard to arrive at a plausible annotation just by looking at the results of BLASTP searches.

Here I propose a potentially automatic method of annotation that exploits automatic protein clustering with the software GCLUST, which estimates a proper similarity threshold for each list of homologs using the ‘entropy-optimized organism count’ method (Sato 2009). The software has been used to construct a homolog database including both prokaryotic and eukaryotic proteins (http://gclust.c.u-tokyo.ac.jp/). For use in genome annotation, we need de novo clustering that includes many genomes of related organisms as well as genomes of representative organisms. Application of protein clustering to the annotation of Arthrospira platensis was the first successful case (Fujisawa et al. 2010). I present here the results of clustering the total predicted proteins of two draft cyanobacterial genomes together with the total predicted proteins of 41 cyanobacteria available at NCBI. For each of the resulting protein clusters, an alignment and a phylogenetic tree were also prepared to assist in functional annotation. The quality of the alignments was evaluated by counting ill-aligned proteins (missing N- or C-terminus, or insertion/deletion), which accounted for 4-13% of total predicted proteins in most cyanobacterial genomes. Annotation may be automated by extracting significant keywords already assigned to member proteins of clusters or by comparison with reference protein clusters.
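
As an illustration of the keyword-extraction idea, the following is a minimal sketch in Python and not the procedure actually used with GCLUST. It assumes that each cluster is available as a list of description strings already assigned to its member proteins, and it treats a keyword as "significant" when it occurs in at least a given fraction of the members. The function names, the stop-word list, and the 0.5 threshold are all hypothetical choices made for this example.

```python
import re
from collections import Counter

# Hypothetical list of uninformative words; a real list would need tuning.
STOPWORDS = {
    "protein", "hypothetical", "putative", "predicted", "conserved",
    "unknown", "function", "family", "domain", "like", "containing",
    "of", "and", "the",
}


def keyword_counts(descriptions):
    """Count, for each keyword, how many member descriptions contain it."""
    counts = Counter()
    for desc in descriptions:
        # Each description contributes a keyword at most once.
        words = set(re.findall(r"[a-z0-9]+", desc.lower())) - STOPWORDS
        counts.update(words)
    return counts


def suggest_keywords(descriptions, min_fraction=0.5):
    """Return (keyword, count) pairs found in at least min_fraction of members.

    The threshold is an assumption of this sketch, not a value from the text.
    """
    n = len(descriptions)
    if n == 0:
        return []
    counts = keyword_counts(descriptions)
    hits = [(w, c) for w, c in counts.items() if c / n >= min_fraction]
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(hits, key=lambda wc: (-wc[1], wc[0]))


if __name__ == "__main__":
    # Made-up member descriptions for a single cluster.
    cluster = [
        "photosystem II D1 protein",
        "photosystem II protein D1",
        "hypothetical protein",
        "photosystem q(b) protein",
    ]
    print(suggest_keywords(cluster))
```

Counting each keyword once per member, rather than once per occurrence, keeps a single verbose description from dominating the cluster-level suggestion.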