Rapid annotation of nifH gene sequences using classification and regression trees facilitates environmental functional gene analysis.

Abstract

The nifH gene is a widely used molecular proxy for studying nitrogen fixation. Phylogenetic classification of nifH gene sequences is an essential step in diazotroph community analysis, and it requires a fast, automated solution because of the increasing size of environmental sequence libraries and the increasing yield of nifH sequences from high-throughput technologies. We present a novel approach that rapidly classifies nifH amino acid sequences into well-defined phylogenetic clusters and provides a common platform for comparative analysis across studies. Phylogenetic group membership can be accurately predicted with decision tree-type statistical models that identify and use signature residues in the amino acid sequences. Our classification models were trained and evaluated with a publicly available, manually curated nifH gene database containing cluster annotations. Model-independent sequence sets from diverse ecosystems were used to further assess the models' prediction accuracy. The utility of this sequence binning approach was demonstrated in a comparative study in which joint treatment of diazotroph assemblages from a wide range of habitats identified habitat-specific and widely distributed diazotrophs and revealed a marine-terrestrial distinction in community composition. Our rapid and automated phylogenetic cluster assignment circumvents extensive phylogenetic analysis of nifH sequences; hence, it saves substantial time and resources in nitrogen fixation studies.
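
To illustrate the general idea of predicting cluster membership from signature residues, the following is a minimal sketch of a classification tree grown on aligned amino acid strings with Gini impurity. It is not the authors' model: the six-residue fragments, the cluster labels "A", "B", and "C", and all function names are invented for illustration, and practical details such as pruning, cross-validation, and handling of alignment gaps are left out.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(seqs, labels):
    """Find the (position, residue) test that most reduces weighted Gini.

    Returns None when no test improves on the unsplit node.
    """
    best, best_score = None, gini(labels)
    n = len(seqs)
    for pos in range(len(seqs[0])):
        for res in sorted({s[pos] for s in seqs}):
            left = [l for s, l in zip(seqs, labels) if s[pos] == res]
            right = [l for s, l in zip(seqs, labels) if s[pos] != res]
            if not left or not right:
                continue  # test does not separate anything
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score - 1e-12:
                best, best_score = (pos, res), score
    return best

def build_tree(seqs, labels):
    """Recursively grow a tree; leaves are cluster labels,
    internal nodes are (position, residue, yes_branch, no_branch)."""
    split = best_split(seqs, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    pos, res = split
    yes = [(s, l) for s, l in zip(seqs, labels) if s[pos] == res]
    no = [(s, l) for s, l in zip(seqs, labels) if s[pos] != res]
    return (pos, res,
            build_tree(*map(list, zip(*yes))),
            build_tree(*map(list, zip(*no))))

def predict(tree, seq):
    """Walk the tree, testing one signature residue per internal node."""
    while isinstance(tree, tuple):
        pos, res, yes, no = tree
        tree = yes if seq[pos] == res else no
    return tree

# Hypothetical toy alignment (NOT real nifH data): position 0 carries the
# signature residue, position 3 varies without being informative.
TRAIN = [
    ("ACDKGE", "A"), ("ACDRGE", "A"),
    ("SCDKGQ", "B"), ("SCDRGQ", "B"),
    ("TCEKGE", "C"), ("TCERGE", "C"),
]
seqs, labels = map(list, zip(*TRAIN))
tree = build_tree(seqs, labels)
print(tree)
print(predict(tree, "SCEKGE"))
```

On this toy set the tree only needs the residue at position 0 to separate the three made-up clusters, which mirrors the notion of a signature residue. A sequence carrying a residue never seen at a tested position simply follows the "no" branch, so a real application would need to check how such cases are assigned.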
