Proteins are composed of twenty different types of amino acids, small organic molecules
with different chemical and physical properties resulting from different groups of atoms.
Protein interactions are mediated by the affinity between groups of atoms belonging to
amino acid residues at the surface of each protein, in the interface region. However, it
is not clear at what level these contacts are best evaluated, whether by grouping similar
amino acids together, considering parts of each amino acid or even individual atoms.
The number of databanks and extracted features continue to increase, this means very
rich data, but that also brings the problem of the sheer amount of different features and
what do they really represent in the big picture of protein interactions.Since the data itself
is collected by scientific communities all around the globe, there is a vast amount
of information but with that there is also a great diversity of the measured or calculated
attributes. This creates a need to learn at which level these contacts occur and what is the
best way to combine the information in the literature to learn a valuable representation.
With the rise of machine learning algorithms making possible to work with data in various
ways that were not previously possible due to practical limitations, various areas are
using these algorithms to capture information about the data that was inaccessible before,
bioinformatics being one of them. The goal of this work is to use unsupervised deep learning
techniques that transform the data in a way that is intended to be informative and
non-redundant, facilitating the subsequent learning for other algorithms of classification
or regression that will perform better on processed data like this. The transformation
involves finding encodings for the collected features that best capture which are the ones
that are actually relevant to construct these encodings. These encondings can be latent in
relation to the already known information in the area, meaning that they most likely will
not be human friendly, in the sense that they will lack interpretability for humans, but
can increase the performance of machine learning algorithms