The most widely used system for classifying enzyme-catalysed reactions
is the Enzyme Commission (EC) numbering scheme. Understanding enzyme
function is important for both fundamental scientific and pharmaceutical
reasons. The EC classification is essentially unrelated to the reaction mechanism.
In this work we address two important questions related to enzyme
function diversity. First, we investigate the relationship between the reaction
mechanisms as described in the MACiE (Mechanism, Annotation,
and Classification in Enzymes) database and the top-level classes of the
EC classification. Second, we ask how well the biocatalysis of these enzymes
is adapted in nature.
In this thesis, we have retrieved 335 enzyme reactions from the MACiE
database. We consider two ways of encoding the reaction mechanism in
descriptors, and three approaches that encode only the overall chemical
reaction.
We first develop a basic model to cluster
the enzymatic reactions. A global study of enzyme reaction mechanisms
may provide important insights into the diversity of
the chemical reactions catalysed by enzymes. Cluster analysis is common
practice in such research. However, clustering algorithms suffer from various
issues, such as requiring the determination of input parameters and stopping
criteria, and very often needing the number of clusters to be specified in advance.
Using several well-known metrics, we tried to optimize the clustering
outputs for each of the algorithms, with equivocal results that suggested the
existence of between two and over a hundred clusters. This motivated us to
design and implement our own algorithm, PFClust (Parameter-Free Clustering),
which requires no prior information to determine the number of clusters.
The analysis highlights the structure of both the overall and the mechanistic
enzyme reactions. This suggests that mechanistic similarity can influence approaches
for function prediction and automatic annotation of newly discovered protein
and gene sequences.
We then develop and evaluate a method for enzyme function prediction
using machine learning methods. Our results suggest that pairs of similar
enzyme reactions tend to proceed by different mechanisms. The machine
learning method needs only chemoinformatics descriptors as input and
is applicable to regression analysis.
The last phase of this work is to study the evolution of chemical mechanisms
mapped onto ancestral enzymes. Domain occurrence and abundance
in modern proteins have shown that the α/β architecture is probably
the oldest fold design. These observations have important implications for
the origins of biochemistry and for exploring structure-function relationships.
Over half of the known mechanisms were introduced before architectural
diversification in evolutionary time. The remaining mechanisms
were invented gradually over the evolutionary timeline, just after organismal
diversification. Moreover, many common mechanisms, including fundamental
building blocks of enzyme chemistry, were found to be associated
with the ancestral fold.