    A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization

    We analyze stochastic gradient algorithms for optimizing nonconvex, nonsmooth finite-sum problems. In particular, the objective function is the sum of a differentiable (possibly nonconvex) component and a possibly non-differentiable but convex component. We propose a proximal stochastic gradient algorithm based on variance reduction, called ProxSVRG+. Our main contribution lies in the analysis of ProxSVRG+. It recovers several existing convergence results and improves or generalizes them (in terms of the number of stochastic gradient oracle calls and proximal oracle calls). In particular, ProxSVRG+ generalizes the best results given by the SCSG algorithm, recently proposed by [Lei et al., 2017] for the smooth nonconvex case. ProxSVRG+ is also more straightforward than SCSG and yields a simpler analysis. Moreover, ProxSVRG+ outperforms deterministic proximal gradient descent (ProxGD) for a wide range of minibatch sizes, which partially solves an open problem proposed in [Reddi et al., 2016b]. ProxSVRG+ also uses far fewer proximal oracle calls than ProxSVRG [Reddi et al., 2016b]. Furthermore, for nonconvex functions satisfying the Polyak-Łojasiewicz (PL) condition, we prove that ProxSVRG+ achieves a global linear convergence rate without restarts, unlike ProxSVRG. Thus, it can automatically switch to the faster linear convergence in any region where the objective function satisfies the PL condition locally. In this case, ProxSVRG+ also improves on ProxGD and ProxSVRG/SAGA, and generalizes the results of SCSG. Finally, we conduct several experiments, and the experimental results are consistent with the theoretical results. Comment: 32nd Conference on Neural Information Processing Systems (NeurIPS 2018).
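
    The abstract does not spell out the update rule, so the following is only a minimal sketch of a variance-reduced proximal stochastic gradient loop in the spirit of ProxSVRG+, not the authors' implementation. It assumes a least-squares smooth part and an L1 penalty as the convex nonsmooth part; the names `prox_svrg_sketch`, `B` (snapshot batch size), `b` (minibatch size), and all parameter values are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (the convex, nonsmooth part)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def objective(A, y, lam, x):
    """Smooth least-squares term plus L1 penalty (assumed example objective)."""
    return 0.5 * np.mean((A @ x - y) ** 2) + lam * np.sum(np.abs(x))

def minibatch_grad(A, y, x, idx):
    """Gradient of the smooth term averaged over the samples in idx."""
    residual = A[idx] @ x - y[idx]
    return A[idx].T @ residual / len(idx)

def prox_svrg_sketch(A, y, lam, step, epochs, inner_steps, B, b, seed=0):
    """Hypothetical variance-reduced proximal SGD loop (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        snapshot = x.copy()
        # Reference gradient at the snapshot, estimated from B samples.
        big_batch = rng.choice(n, size=B, replace=False)
        g_snapshot = minibatch_grad(A, y, snapshot, big_batch)
        for _ in range(inner_steps):
            small_batch = rng.choice(n, size=b, replace=False)
            # Variance-reduced gradient estimate.
            v = (minibatch_grad(A, y, x, small_batch)
                 - minibatch_grad(A, y, snapshot, small_batch)
                 + g_snapshot)
            # One proximal oracle call per inner step.
            x = soft_threshold(x - step * v, step * lam)
    return x
```

    In this sketch, setting `B` equal to the number of samples makes the snapshot gradient exact, while a smaller `B` trades accuracy of the reference gradient for fewer gradient oracle calls per epoch.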

    Convergence of Nonconvex PnP-ADMM with MMSE Denoisers

    Plug-and-Play Alternating Direction Method of Multipliers (PnP-ADMM) is a widely used algorithm for solving inverse problems by integrating physical measurement models and convolutional neural network (CNN) priors. PnP-ADMM has been theoretically proven to converge for convex data-fidelity terms and nonexpansive CNNs. However, it has been observed that PnP-ADMM often converges empirically even for expansive CNNs. This paper presents a theoretical explanation for the observed stability of PnP-ADMM based on the interpretation of the CNN prior as a minimum mean-squared error (MMSE) denoiser. Our explanation parallels a similar argument recently made for the iterative shrinkage/thresholding algorithm variant of PnP (PnP-ISTA) and relies on the connection between MMSE denoisers and proximal operators. We also numerically evaluate the performance gap between PnP-ADMM using a nonexpansive DnCNN denoiser and an expansive DRUNet denoiser, thus motivating the use of expansive CNNs.
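
    As a rough illustration of the iteration structure only, a PnP-ADMM loop can be sketched as below. This assumes a quadratic data-fidelity term with a known linear measurement matrix `A`, and it swaps in a simple soft-thresholding function as a stand-in for the CNN denoiser (DnCNN or DRUNet are not used here); the function names and the values of `rho`, `tau`, and `iters` are hypothetical.

```python
import numpy as np

def soft_threshold_denoiser(v, tau=0.01):
    """Stand-in denoiser (a proximal operator), NOT a CNN prior."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def pnp_admm(A, y, denoiser, rho=1.0, iters=100):
    """Sketch of PnP-ADMM for the data-fidelity term 0.5 * ||A x - y||^2."""
    n = A.shape[1]
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)  # scaled dual variable
    lhs = A.T @ A + rho * np.eye(n)
    Aty = A.T @ y
    for _ in range(iters):
        # Data-fidelity step: proximal operator of the quadratic term.
        x = np.linalg.solve(lhs, Aty + rho * (z - u))
        # Prior step: the denoiser takes the place of a proximal operator.
        z = denoiser(x + u)
        # Dual update.
        u = u + x - z
    return x
```

    Any callable that maps an array to an array of the same shape can be passed as `denoiser`, which is the point of the plug-and-play construction.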

    Comparative Analysis of Parallel Brain Activity Mapping Algorithms for High Resolution Brain Models

    This paper presents a comparative analysis of regular and parallel versions of the FISTA and Tikhonov-like optimization methods for solving the EEG brain mapping problem. The comparison is made in terms of the computational time reduction and the estimation error achieved by the parallelized methods. Two head models (high- and low-resolution) are used to compare the algorithms. The results show that, as the number of parallel processes increases, computational time decreases significantly for all the head models used in this work, without compromising reconstruction quality. In addition, using a high-resolution head model improves the spatial resolution of any source reconstruction method.
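
    For orientation, the two families of solvers compared above can be sketched serially as follows. This is an assumption-laden illustration, not the paper's parallel implementation: `G` is a hypothetical lead-field matrix mapping sources to EEG channels, the Tikhonov estimate minimizes `||G x - y||^2 + lam * ||x||^2`, and FISTA is shown for an L1-regularized least-squares objective.

```python
import numpy as np

def tikhonov(G, y, lam):
    """Closed-form minimizer of ||G x - y||^2 + lam * ||x||^2."""
    n = G.shape[1]
    return np.linalg.solve(G.T @ G + lam * np.eye(n), G.T @ y)

def l1_objective(G, y, lam, x):
    """0.5 * ||G x - y||^2 + lam * ||x||_1."""
    return 0.5 * np.sum((G @ x - y) ** 2) + lam * np.sum(np.abs(x))

def fista(G, y, lam, iters=200):
    """FISTA for the L1-regularized least-squares objective above."""
    # Step size 1/L, with L the squared spectral norm of G.
    step = 1.0 / np.linalg.norm(G, 2) ** 2
    x = np.zeros(G.shape[1])
    z = x.copy()
    t = 1.0
    for _ in range(iters):
        grad = G.T @ (G @ z - y)
        w = z - step * grad
        # Soft-thresholding: proximal operator of the L1 term.
        x_new = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)
        # Momentum (extrapolation) update.
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x
```

    In both routines the cost is dominated by products with `G` and its transpose, which is the kind of work that lends itself to parallelization as the head-model resolution grows.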

    Asynchronous Optimization Methods for Efficient Training of Deep Neural Networks with Guarantees

    Asynchronous distributed algorithms are a popular way to reduce synchronization costs in large-scale optimization, particularly in neural network training. However, for nonsmooth and nonconvex objectives, few convergence guarantees exist beyond cases where closed-form proximal operator solutions are available. As most popular contemporary deep neural networks lead to nonsmooth and nonconvex objectives, there is now a pressing need for such convergence guarantees. In this paper, we analyze for the first time the convergence of stochastic asynchronous optimization for this general class of objectives. In particular, we focus on stochastic subgradient methods allowing for block variable partitioning, where the shared-memory-based model is asynchronously updated by concurrent processes. To this end, we first introduce a probabilistic model that captures key features of real asynchronous scheduling between concurrent processes; under this model, we establish convergence with probability one to an invariant set for stochastic subgradient methods with momentum. From a practical perspective, one issue with the family of methods we consider is that it is not efficiently supported by machine learning frameworks, as they mostly focus on distributed data-parallel strategies. To address this, we propose a new implementation strategy for shared-memory-based training of deep neural networks, whereby concurrent parameter servers are used to train a partitioned but shared model in single- and multi-GPU settings. Based on this implementation, we achieve on average a 1.2x speed-up in comparison to state-of-the-art training methods for popular image classification tasks without compromising accuracy.