Stochastic Optimization of Areas Under Precision-Recall Curves with Provable Convergence
Areas under ROC (AUROC) and precision-recall curves (AUPRC) are common
metrics for evaluating classification performance for imbalanced problems.
Compared with AUROC, AUPRC is a more appropriate metric for highly imbalanced
datasets. While stochastic optimization of AUROC has been studied extensively,
principled stochastic optimization of AUPRC has been rarely explored. In this
work, we propose a principled technical method to optimize AUPRC for deep
learning. Our approach is based on maximizing the averaged precision (AP),
which is an unbiased point estimator of AUPRC. We cast the objective into a sum
of {\it dependent compositional functions} with inner functions dependent on
random variables of the outer level. We propose efficient adaptive and
non-adaptive stochastic algorithms named SOAP with {\it provable convergence
guarantee under mild conditions} by leveraging recent advances in stochastic
compositional optimization. Extensive experimental results on image and graph
datasets demonstrate that our proposed method outperforms prior methods on
imbalanced problems in terms of AUPRC. To the best of our knowledge, our work
represents the first attempt to optimize AUPRC with provable convergence. The
SOAP has been implemented in the libAUC library at~\url{https://libauc.org/}.
Comment: 24 pages, 10 figures
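The AP objective described above can be made concrete with a minimal NumPy sketch of the empirical average precision estimator (the estimator only, not the SOAP algorithm itself). Note how each per-positive term is a ratio of two inner sums that both depend on that positive's score, which is the "dependent compositional" structure the abstract refers to.

```python
import numpy as np

def average_precision(scores, labels):
    """Empirical average precision (AP): for each positive example i,
    the precision at threshold s_i, averaged over all positives.
    Each term is (# positives scored >= s_i) / (# examples scored >= s_i),
    a ratio of two inner sums that depend on the outer index i."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    ap_terms = []
    for i in pos:
        above = scores >= scores[i]          # examples ranked at or above i
        ap_terms.append((labels[above] == 1).sum() / above.sum())
    return float(np.mean(ap_terms))

# A perfect ranking attains AP = 1.0
print(average_precision([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
```

Because the inner sums change with every model update, an unbiased stochastic gradient of this objective is not available from a single mini-batch, which is the difficulty the compositional-optimization machinery addresses.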
Dynamic Machine Learning with Least Square Objectives
As of the writing of this thesis, machine learning has become one of the most active research fields. The interest comes from a variety of disciplines, including computer science, statistics, engineering, and medicine. The main idea behind learning from data is that, when an analytical model explaining the observations is hard to find (often in contrast to the models in physics, such as Newton's laws), a statistical approach can be taken where one or more candidate models are tuned using data.
Since the early 2000s this challenge has grown in two ways: (i) the amount of collected data has seen massive growth due to the proliferation of digital media, and (ii) the data has become more complex. One example of the latter is high-dimensional datasets, which can correspond, for example, to dyadic interactions between two large groups (such as the customer and product information a retailer collects), or to high-resolution image/video recordings.
Another important issue is the study of dynamic data, which exhibits dependence on time. Virtually all datasets fall into this category, as all data collection is performed over time; however, I use the term dynamic to hint at a system with an explicit temporal dependence. A traditional example is target tracking from the signal processing literature. Here the position of a target is modeled using Newton's laws of motion, which relate it to time via the target's velocity and acceleration.
Dynamic data, as I defined above, poses two important challenges. Firstly, the setup differs from the standard theoretical learning setup, also known as Probably Approximately Correct (PAC) learning. To derive PAC learning bounds, one assumes a collection of data points sampled independently and identically from a distribution which generates the data. Dynamic systems, on the other hand, produce correlated outputs, and the learning systems we use should take this difference into consideration. Secondly, as the system is dynamic, it might be necessary to perform the learning online, in which case the learning has to be done in a single pass. Typical applications include target tracking and electricity usage forecasting.
In this thesis I investigate several important dynamic and online learning problems, developing novel tools to address the shortcomings of previous solutions in the literature. The work is divided into three parts for convenience. The first part is about matrix factorization for time series analysis and is further divided into two chapters. In the first chapter, matrix factorization is used within a Bayesian framework to model time-varying dyadic interactions, with examples in predicting user-movie ratings and stock prices. In the next chapter, a matrix factorization which uses autoregressive models to forecast future values of multivariate time series is proposed, with applications in predicting electricity usage and traffic conditions. Inspired by the machinery used in the first part, the second part is about nonlinear Kalman filtering, where a hidden state is estimated over time given observations. The nonlinearity of the system generating the observations is the main challenge here; a divergence minimization approach is used to unify seemingly unrelated methods in the literature and to propose new ones. This has applications in target tracking and options pricing. The third and last part is about cost-sensitive learning, where a novel method for maximizing the area under the receiver operating characteristic (ROC) curve is proposed. Our method has theoretical guarantees and favorable sample complexity. The method is tested on a variety of benchmark datasets, and also has applications in online advertising.
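The nonlinear filtering problem in the second part can be illustrated with a standard extended Kalman filter (EKF), one of the classical methods a divergence-minimization view would cover. The scalar model below (linear dynamics observed through a sine nonlinearity) is a made-up example for illustration, not a model from the thesis.

```python
import numpy as np

def ekf(ys, a=0.95, q=0.01, r=0.1, m0=0.0, p0=1.0):
    """Scalar extended Kalman filter for
        x_t = a * x_{t-1} + w_t,  w_t ~ N(0, q)   (linear dynamics)
        y_t = sin(x_t) + v_t,     v_t ~ N(0, r)   (nonlinear observation)
    The observation model is linearized at the predicted mean."""
    m, p = m0, p0
    means = []
    for y in ys:
        # predict step
        m, p = a * m, a * a * p + q
        # update step: linearize h(x) = sin(x) around the predicted mean m
        h = np.cos(m)                 # Jacobian of sin at the predicted mean
        s = h * p * h + r             # innovation variance
        k = p * h / s                 # Kalman gain
        m = m + k * (y - np.sin(m))
        p = (1.0 - k * h) * p
        means.append(m)
    return np.array(means)
```

The linearization point is the only place the nonlinearity enters; replacing it with other approximations (e.g., sigma points or moment matching) yields the other filters that a divergence-minimization framework relates.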
Provable Multi-instance Deep AUC Maximization with Stochastic Pooling
This paper considers a novel application of deep AUC maximization (DAM) for
multi-instance learning (MIL), in which a single class label is assigned to a
bag of instances (e.g., multiple 2D slices of a CT scan for a patient). We
address a neglected yet non-negligible computational challenge of MIL in the
context of DAM, i.e., the bag size is too large for a whole bag to be loaded
into GPU memory for backpropagation, which is required by the standard pooling
methods of MIL. To
tackle this challenge, we propose variance-reduced stochastic pooling methods
in the spirit of stochastic optimization by formulating the loss function over
the pooled prediction as a multi-level compositional function. By synthesizing
techniques from stochastic compositional optimization and non-convex min-max
optimization, we propose a unified and provable multi-instance DAM (MIDAM)
algorithm with stochastic smoothed-max pooling or stochastic attention-based
pooling, which only samples a few instances for each bag to compute a
stochastic gradient estimator and to update the model parameters. We establish
for the proposed MIDAM algorithm a convergence rate similar to that of
state-of-the-art DAM algorithms. Our extensive experiments on conventional MIL
datasets and medical datasets demonstrate the superiority of our MIDAM
algorithm.
Comment: 22 pages
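The stochastic smoothed-max pooling idea can be sketched as follows. This is a hedged illustration of the pooling operator only, not the MIDAM algorithm: the bag-level prediction is a softmax-smoothed maximum of instance scores, and sampling a mini-batch of instances per bag gives a cheap estimate that fits in memory (the sampled estimate is biased for finite batches, which is what the variance-reduction machinery in the paper corrects for).

```python
import numpy as np

def smoothed_max(scores, tau=0.1):
    """Smoothed-max pooling: tau * log(mean(exp(s / tau))).
    Approaches max(s) as tau -> 0 and mean(s) as tau -> inf."""
    scores = np.asarray(scores, dtype=float)
    m = scores.max()                  # subtract the max for numerical stability
    return m + tau * np.log(np.mean(np.exp((scores - m) / tau)))

def stochastic_smoothed_max(scores, batch=4, tau=0.1, rng=None):
    """Estimate the pooled prediction from a random mini-batch of instances,
    so a huge bag never has to be loaded into memory at once."""
    rng = np.random.default_rng(rng)
    sample = rng.choice(scores, size=min(batch, len(scores)), replace=False)
    return smoothed_max(sample, tau)
```

In a deep MIL model the `scores` would be per-instance network outputs, so only the sampled instances need a forward/backward pass.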
Differentially Private SGDA for Minimax Problems
Stochastic gradient descent ascent (SGDA) and its variants have been the
workhorse for solving minimax problems. However, in contrast to the
well-studied stochastic gradient descent (SGD) with differential privacy (DP)
constraints, there is little work on understanding the generalization (utility)
of SGDA with DP constraints. In this paper, we use the algorithmic stability
approach to establish the generalization (utility) of DP-SGDA in different
settings. In particular, for the convex-concave setting, we prove that the
DP-SGDA can achieve an optimal utility rate in terms of the weak primal-dual
population risk in both smooth and non-smooth cases. To the best of our
knowledge, this is the first known result for DP-SGDA in the non-smooth case.
We further provide a utility analysis in the nonconvex-strongly-concave
setting, which is the first known result in terms of the primal population risk.
The convergence and generalization results for this nonconvex setting are new
even in the non-private setting. Finally, numerical experiments are conducted
to demonstrate the effectiveness of DP-SGDA for both convex and nonconvex
cases.
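A single DP-SGDA iteration, i.e., gradient clipping plus Gaussian noise on both the descent (primal) and ascent (dual) updates, can be sketched on a toy strongly-convex-strongly-concave objective. The objective and all constants below are illustrative choices, not taken from the paper.

```python
import numpy as np

def dp_sgda(x, y, lr=0.1, steps=100, clip=1.0, sigma=0.0, rng=None):
    """DP-SGDA sketch on f(x, y) = 0.5*x**2 + x*y - 0.5*y**2
    (saddle point at the origin). Each gradient is clipped to norm
    `clip` and perturbed with Gaussian noise of scale sigma * clip;
    sigma = 0 recovers plain (non-private) SGDA."""
    rng = np.random.default_rng(rng)
    for _ in range(steps):
        gx = x + y          # df/dx: descent direction for the primal variable
        gy = x - y          # df/dy: ascent direction for the dual variable
        gx = gx / max(1.0, abs(gx) / clip) + sigma * clip * rng.standard_normal()
        gy = gy / max(1.0, abs(gy) / clip) + sigma * clip * rng.standard_normal()
        x, y = x - lr * gx, y + lr * gy
    return x, y
```

With `sigma = 0` and a loose clipping threshold the iterates spiral into the saddle point; the noise scale needed for a target privacy level is set by the usual Gaussian-mechanism accounting, which the sketch omits.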
Scalable large margin pairwise learning algorithms
2019 Summer. Includes bibliographical references.
Classification is a major task in machine learning and data mining applications. Many of these applications involve building a classification model using a large volume of imbalanced data. In such an imbalanced learning scenario, the area under the ROC curve (AUC) has proven to be a reliable performance measure to evaluate a classifier. Therefore, it is desirable to develop scalable learning algorithms that maximize the AUC metric directly. Kernelized AUC maximization machines have established a superior generalization ability compared to linear AUC machines, but their computational cost hinders their scalability. To address this problem, we propose a large-scale nonlinear AUC maximization algorithm that learns a batch linear classifier on an approximate feature space computed via the k-means Nyström method. The proposed algorithm is shown empirically to achieve AUC classification performance comparable to, or even better than, the kernel AUC machines, while its training time is faster by several orders of magnitude. However, the computational complexity of the linear batch model compromises its scalability when training sizable datasets. Hence, we develop second-order online AUC maximization algorithms based on a confidence-weighted model. The proposed algorithms exploit second-order information to improve the convergence rate and implement a fixed-size buffer to address the multivariate nature of the AUC objective function. We also extend our online linear algorithms to consider an approximate feature map constructed using random Fourier features in an online setting. The results show that our proposed algorithms outperform, or are at least comparable to, the competing online AUC maximization methods. Despite their scalability, we notice that first- and second-order online AUC maximization methods are prone to suboptimal convergence.
This can be attributed to the limitation of the hypothesis space. A potential improvement can be attained by learning stochastic online variants. However, vanilla stochastic methods also suffer from slow convergence because of the high variance introduced by the stochastic process. We address the problem of slow convergence by developing a fast-converging stochastic AUC maximization algorithm, accelerated using a unique combination of scheduled regularization updates and scheduled averaging. The experimental results show that the proposed algorithm performs better than the state-of-the-art online and stochastic AUC maximization methods in terms of AUC classification accuracy. Moreover, we develop a proximal variant of our accelerated stochastic AUC maximization algorithm. The proposed method applies the proximal operator to the hinge loss function and therefore evaluates the gradient of the loss function at the approximated weight vector. Experiments on several benchmark datasets show that our proximal algorithm converges to the optimal solution faster than the previous AUC maximization algorithms.
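The pairwise objective these methods optimize can be sketched directly: empirical AUC is the fraction of correctly ordered positive-negative pairs, and a pairwise hinge surrogate makes it differentiable for a linear scorer. This is a minimal sketch of the objective, not any of the buffered, Nyström-accelerated, or proximal algorithms above.

```python
import numpy as np

def auc(scores, labels):
    """Empirical AUC: fraction of (positive, negative) pairs ranked
    correctly, counting ties as one half."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels)
    diff = s[y == 1][:, None] - s[y == 0][None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def pairwise_hinge_grad(w, X, y, margin=1.0):
    """Gradient of the pairwise hinge surrogate
        mean over (i in pos, j in neg) of max(0, margin - w.(x_i - x_j))
    for a linear scorer s(x) = w.x."""
    Xp, Xn = X[y == 1], X[y == 0]
    d = Xp[:, None, :] - Xn[None, :, :]       # all pairwise feature differences
    active = (margin - d @ w) > 0             # pairs that violate the margin
    return -(d * active[:, :, None]).sum(axis=(0, 1)) / active.size
```

The double sum over pairs is what makes the objective "multivariate" and motivates the fixed-size buffers and stochastic reformulations discussed in the thesis.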
Benchmarking Deep AUROC Optimization: Loss Functions and Algorithmic Choices
The area under the ROC curve (AUROC) has been widely applied to imbalanced
classification, and moreover combined with deep learning techniques.
However, there is no existing work that provides sound information for peers to
choose appropriate deep AUROC maximization techniques. In this work, we fill
this gap from three aspects. (i) We benchmark a variety of loss functions with
different algorithmic choices for the deep AUROC optimization problem. We study
the loss functions in two categories, pairwise loss and composite loss,
comprising a total of 10 loss functions. Interestingly, we find that composite loss,
as an innovative loss function class, shows more competitive performance than
pairwise loss from both training convergence and testing generalization
perspectives. Nevertheless, data with more corrupted labels favors a pairwise
symmetric loss. (ii) Moreover, we benchmark and highlight the essential
algorithmic choices such as positive sampling rate, regularization,
normalization/activation, and optimizers. Key findings include: a higher
positive sampling rate is likely to be beneficial for deep AUROC maximization;
different datasets favor different regularization weights; and appropriate
normalization techniques, such as sigmoid and score normalization, can improve
model performance. (iii) On the optimization side, we benchmark SGD-type,
Momentum-type, and Adam-type optimizers for both pairwise and composite loss.
Our findings show that although Adam-type optimizers are more competitive from
the training perspective, they do not outperform others from the testing
perspective.
Comment: 32 pages
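The "composite loss" category can be illustrated with the well-known min-max reformulation of the pairwise squared loss for AUC, which replaces the double sum over positive-negative pairs with per-example terms plus auxiliary scalars a, b and a dual variable alpha. This is a sketch of that loss form, not the exact implementations benchmarked in the paper.

```python
import numpy as np

def composite_auc_loss(scores, labels, a, b, alpha, margin=1.0):
    """Composite (min-max) surrogate for the pairwise squared AUC loss:
    per-example squared deviations from the class centers a and b, plus a
    dual term coupling the class means -- no explicit pair enumeration.
    Minimized over (scores, a, b), maximized over alpha."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels)
    pos, neg = s[y == 1], s[y == 0]
    return (np.mean((pos - a) ** 2)
            + np.mean((neg - b) ** 2)
            + 2 * alpha * (margin - pos.mean() + neg.mean())
            - alpha ** 2)
```

At the optimal auxiliary variables (a and b equal to the class mean scores, alpha equal to margin minus the mean-score gap), this expression equals the pairwise squared loss averaged over all positive-negative pairs, which is why it decomposes over examples and suits mini-batch training.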