Protein–protein
interactions (PPIs) play a central role
in nearly all cellular processes. The strength of binding in a
PPI, characterized by the binding affinity (BA), is a key factor
in controlling protein–protein complex formation and defining
the structure–function relationship. Despite advancements in
understanding protein–protein binding, much remains unknown
about the interfacial region and its association with BA. New models
are needed to predict BA with improved accuracy for therapeutic design.
Here, we use machine learning approaches to examine how well different
types of interfacial contacts can be used to predict experimentally
determined BA and to reveal the impact of the specific amino acids
at the binding interface on BA. We create a series of multivariate
linear regression models incorporating different contact features
at both residue and atomic levels and examine how different methods
of identifying and characterizing these properties impact the performance
of these models. In particular, we introduce a new and simple approach
to predict BA based on the quantities of specific amino acids at the
protein–protein interface. We find that the numbers of specific
amino acids at the protein–protein interface are correlated
with BA and that these interfacial amino acid counts can be
used to produce models with consistently good performance across different
data sets, indicating the importance of the identities of interfacial
amino acids in determining BA. When trained on a diverse set of complexes
from two benchmark data sets, the best-performing BA model was
an explicit linear equation involving six amino acids. Tyrosine,
in particular, was identified as the key amino acid in controlling
BA, as it had the strongest correlation with BA and was consistently
identified as the most important amino acid in feature importance
studies. Glycine and serine were identified as the next two most important
amino acids in predicting BA. The results of this study further
our understanding of PPIs and can be used to improve predictions
of BA, with implications for drug design and screening in the
pharmaceutical industry.
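
As a rough illustration of the count-based approach described above, the following Python sketch tallies the amino acids at a protein–protein interface and evaluates an explicit linear equation on those counts. It is a minimal sketch under stated assumptions, not the model from this study: the 5.0 Å distance cutoff used to define the interface, the residue record layout, the function names, and every coefficient value are hypothetical placeholders chosen only to make the example run.

```python
from collections import Counter
from math import dist


def interface_counts(chain_a, chain_b, cutoff=5.0):
    """Count residue types that have any atom within `cutoff` of the partner chain.

    Each chain is a list of (residue_name, atom_coordinates) pairs.
    The cutoff value is an assumption, not a setting taken from the study.
    """
    counts = Counter()
    for own, partner in ((chain_a, chain_b), (chain_b, chain_a)):
        for name, atoms in own:
            in_contact = any(
                dist(p, q) <= cutoff
                for _, partner_atoms in partner
                for p in atoms
                for q in partner_atoms
            )
            if in_contact:
                counts[name] += 1
    return counts


def predict_ba(counts, coefficients, intercept):
    """Evaluate a linear equation: intercept + sum of (weight * residue count)."""
    return intercept + sum(w * counts.get(aa, 0) for aa, w in coefficients.items())


# Toy two-chain complex with one atom per residue (made-up coordinates).
chain_a = [("TYR", [(0.0, 0.0, 0.0)]), ("GLY", [(20.0, 0.0, 0.0)])]
chain_b = [("SER", [(3.0, 0.0, 0.0)]), ("TYR", [(40.0, 0.0, 0.0)])]

# Placeholder weights and intercept; these are NOT fitted values.
coefficients = {"TYR": -0.5, "GLY": 0.25, "SER": 0.125}
intercept = -8.0

counts = interface_counts(chain_a, chain_b)
prediction = predict_ba(counts, coefficients, intercept)
print(dict(counts), prediction)
```

In practice the weights and intercept would come from fitting a multivariate linear regression to experimentally determined BA values, and the predicted value carries whatever units the training data used.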