Applications of Machine Learning for Predicting Selection Outcomes in Antibody Phage Display

Abstract

Antibodies form an essential component of the adaptive immune system, but they also have important scientific and clinical applications. These applications exploit the proven ability of antibodies to bind strongly and specifically to nearly any biomolecular target of interest (e.g. a protein). To produce antibodies for scientific and clinical applications, researchers can use a wet-lab technique called antibody phage display. Antibody phage display starts with a library of diverse antibody fragments, then selects and amplifies those fragments that bind to the target. Combined with next-generation sequencing (NGS) technology, antibody phage display has the potential to yield greater insight into the selection process. Machine learning is an area of artificial intelligence well suited to recognizing patterns in large datasets, like those produced by NGS. The research goals of this thesis were to (1) train machine learning models to predict the selection of antibody fragments in antibody phage display using only the sequence of the fragment; (2) validate the ability of the trained models to generalize to different experiments; and (3) reverse engineer the trained models to gain greater insight into the learned patterns and the selection process. Antibody phage display data produced by the Geyer lab (University of Saskatchewan, SK) using two libraries, called F and S, was used to train a set of machine learning models: a naive Bayes network (NB), a linear model (LM), an artificial neural network (ANN), a support vector machine (SVM) with a radial basis function kernel (RBF-SVM), an SVM with a string kernel (SSK-SVM), and a random forest (RF). In addition, key parameters of the RBF- and SSK-SVM were tuned using a grid search. The trained models were then used to predict which antibody-displaying phage would be observed after the fifth round of panning, and their prediction accuracy on this data was used to help select models for subsequent analyses.
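
The abstract does not state how fragment sequences were turned into model inputs. As a minimal sketch of one common option for vector-based classifiers such as an RBF-SVM, the Python below one-hot encodes an amino acid sequence into a fixed-length vector. The function name `one_hot_encode`, the `max_len` padding scheme, and the example sequence are assumptions made for illustration, not details taken from the thesis.

```python
# Illustrative sketch only: one-hot encoding is an assumed input
# representation, not the encoding reported in the thesis.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(seq, max_len):
    """Return a flat 0/1 list of length max_len * 20.

    Position p of the sequence owns the slice [p * 20, (p + 1) * 20).
    Sequences shorter than max_len are padded with all-zero slices.
    """
    if len(seq) > max_len:
        raise ValueError("sequence is longer than max_len")
    vec = [0] * (max_len * len(AMINO_ACIDS))
    for pos, aa in enumerate(seq):
        if aa not in INDEX:
            raise ValueError(f"unknown residue: {aa!r}")
        vec[pos * len(AMINO_ACIDS) + INDEX[aa]] = 1
    return vec


# Hypothetical CDRH3-like sequence, padded to 12 positions.
features = one_hot_encode("ARDYYGSS", max_len=12)
print(len(features), sum(features))
```

A fixed-length encoding of this kind suits the LM, ANN, RBF-SVM, and RF, whereas a string kernel operates on the raw sequences directly and would not need it.
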
The models selected were the RBF- and SSK-SVM. To achieve the second research goal, data originating from library F was used to train the two SVMs, while library S data was used to test them. Finally, the two SVM models trained on library F were deconstructed to understand which features of the input correspond to negative predictions and which correspond to positive predictions. The ANN, SVM, and RF models had the best average classification accuracy (81.5%), but within this group no single classifier performed significantly better than the others. These classifiers could be used to help non-experts select clones from either library F or S for further wet-lab analyses. The SVMs trained on library F and tested on library S achieved an average classification accuracy of 66.7%, significantly better than chance. These two SVMs could be used to help non-experts select clones for further wet-lab analyses, provided the library being used is not too different from library S. Finally, deconstructing the SVMs trained on library F yielded insight into the basis for their predictions. The predictions of the RBF-SVM were found to be highly dependent on the molecular weight of the relevant binding region (i.e. CDRH3).
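
To make the molecular weight feature concrete, the sketch below estimates the average molecular weight of a peptide such as a CDRH3 region by summing standard average residue masses and adding one water molecule for the free termini. How the thesis computed molecular weight is not stated in the abstract, so the function name `molecular_weight` and the example sequences are hypothetical, and the mass table holds the commonly tabulated average values in daltons.

```python
# Illustrative sketch only: the thesis may have computed molecular
# weight differently. Masses are standard average residue masses (Da).

RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one H2O accounts for the free N- and C-termini


def molecular_weight(seq):
    """Approximate average molecular weight (Da) of a linear peptide."""
    if not seq:
        raise ValueError("empty sequence")
    try:
        return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS
    except KeyError as exc:
        raise ValueError(f"unknown residue: {exc.args[0]!r}") from None


# Hypothetical CDRH3-like sequences of equal length but different mass.
for cdrh3 in ("ARGGSGSA", "ARWYWYFW"):
    print(cdrh3, round(molecular_weight(cdrh3), 2))
```

Because residues such as tryptophan and tyrosine are much heavier than glycine or serine, two regions of the same length can differ substantially in molecular weight, which is one way a sequence-only model could come to track this property.
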
