Robust learning to rank models and their biomedical applications

Abstract

There exist many real-world applications, such as recommendation systems, document retrieval, and computational biology, where the correct ordering of instances is of equal or greater importance than predicting the exact value of some discrete or continuous outcome. Learning-to-Rank (LTR) refers to a group of algorithms that apply machine learning techniques to tackle these ranking problems. Despite their empirical success, most existing LTR models are not built to be robust to errors in labeling or annotation, distributional data shift, or adversarial data perturbations. To fill this gap, we develop four LTR frameworks that are robust to various types of perturbations. First, Pairwise Elastic Net Regression Ranking (PENRR) is an elastic-net-based regression method for drug sensitivity prediction. PENRR infers robust predictors of drug responses from patient genomic information. The special design of this model (comparing each drug with other drugs in the same cell line and comparing that drug with itself in other cell lines) significantly enhances the accuracy of the drug prediction model under limited data. This approach also addresses the problem of fitting to insensitive drugs that is commonly encountered in regression-based models. Second, Regression-based Ranking by Pairwise Cluster Comparisons (RRPCC) is a ridge-regression-based method for ranking clusters of similar protein complex conformations generated by an underlying docking program (i.e., ClusPro). Rather than using regression to predict scores, which would equally penalize deviations for both low-quality and high-quality clusters, we seek to predict the difference of scores for any pair of clusters corresponding to the same complex. RRPCC combines these pairwise assessments to form a ranked list of clusters, from higher to lower quality. We apply RRPCC to clusters produced by the automated docking server ClusPro and, depending on the training/validation strategy, we show improvement by 24%–100% in ranking acceptable or better quality clusters first, and by 15%–100% in ranking medium or better quality clusters first. Third, Distributionally Robust Multi-Output Regression Ranking (DRMRR) is a listwise LTR model that induces robustness into LTR problems using the Distributionally Robust Optimization (DRO) framework. In contrast to existing methods, the scoring function of DRMRR is designed as a multivariate mapping from a feature vector to a vector of deviation scores, which captures local context information and cross-document interactions. DRMRR employs ranking metrics (i.e., NDCG) in its output. In particular, we use the notion of position deviation to define a vector of relevance scores instead of a scalar one. We then adopt the DRO framework to minimize a worst-case expected multi-output loss function over a probabilistic ambiguity set that is defined by the Wasserstein metric. We also present an equivalent convex reformulation of the DRO problem, which is shown to be tighter than those proposed in previous studies. Fourth, Inversion Transformer-based Neural Ranking (ITNR) is a Transformer-based model to predict drug responses using RNAseq gene expression profiles, drug descriptors, and drug fingerprints. It utilizes a Context-Aware-Transformer architecture as its scoring function that ensures the modeling of inter-item dependencies. We also introduce a new loss function using the concept of Inversion and approximate permutation matrices. The accuracy and robustness of these LTR models are verified through three medical applications, namely cluster ranking in protein-protein docking, medical document retrieval, and drug response prediction.
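The pairwise idea behind RRPCC (regress on score differences within a complex, then combine the pairwise predictions into a ranking) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the feature matrix `X`, the quality scores `y`, the `groups` array marking which complex each cluster belongs to, the closed-form ridge solver, and the sum-of-differences aggregation rule are all assumptions made for illustration.

```python
import numpy as np


def fit_pairwise_ridge(X, y, groups, lam=1.0):
    """Fit w so that (x_i - x_j) @ w approximates y_i - y_j.

    Only pairs of items in the same group (hypothetically, clusters of
    the same complex) are compared. Illustrative sketch only.
    """
    diffs, targets = [], []
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        for a in idx:
            for b in idx:
                if a != b:
                    diffs.append(X[a] - X[b])
                    targets.append(y[a] - y[b])
    D = np.asarray(diffs)
    t = np.asarray(targets)
    # Closed-form ridge solution: (D^T D + lam * I)^{-1} D^T t
    return np.linalg.solve(D.T @ D + lam * np.eye(X.shape[1]), D.T @ t)


def rank_by_pairwise_votes(X, w):
    """Rank the items of one group, best first.

    Each item is scored by the sum of its predicted differences against
    every other item in the group (an assumed aggregation rule).
    """
    s = X @ w
    pairwise = s[:, None] - s[None, :]  # predicted y_i - y_j
    return np.argsort(-pairwise.sum(axis=1))
```

Because the model in this sketch is linear, summing predicted differences orders items the same way as sorting by `X @ w`; the explicit pairwise matrix is kept only to mirror the "combine pairwise assessments" step described above. Note also that any per-group offset in `y` cancels in the differences, which is one reason a difference-based target can be convenient when scores are not comparable across groups.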
