
    Visualising Basins of Attraction for the Cross-Entropy and the Squared Error Neural Network Loss Functions

    Quantification of the stationary points and the associated basins of attraction of neural network loss surfaces is an important step towards a better understanding of neural network loss surfaces at large. This work proposes a novel method to visualise basins of attraction together with the associated stationary points via gradient-based random sampling. The proposed technique is used to perform an empirical study of the loss surfaces generated by two different error metrics: quadratic loss and entropic loss. The empirical observations confirm the theoretical hypothesis regarding the nature of neural network attraction basins. Entropic loss is shown to exhibit stronger gradients and fewer stationary points than quadratic loss, indicating that entropic loss has a more searchable landscape. Quadratic loss is shown to be more resilient to overfitting than entropic loss. Both losses are shown to exhibit local minima, but the number of local minima is shown to decrease with an increase in dimensionality. Thus, the proposed visualisation technique successfully captures the local minima properties exhibited by neural network loss surfaces, and can be used for fitness landscape analysis of neural networks. Comment: Preprint submitted to the Neural Networks journal.
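
    To make the sampling idea concrete, here is a minimal sketch of gradient-based random sampling on a two-parameter toy surface. The toy loss, step size, sample count, and the gradient-norm threshold for declaring a point stationary are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Toy two-parameter surface standing in for a network loss; the paper applies
# the same walk in the full weight space of a network.
def loss(w):
    return np.sin(3 * w[0]) * np.cos(3 * w[1]) + 0.1 * np.dot(w, w)

def grad(w, eps=1e-5):
    # Central-difference gradient; a real network would use backprop instead.
    g = np.zeros_like(w)
    for i in range(len(w)):
        d = np.zeros_like(w)
        d[i] = eps
        g[i] = (loss(w + d) - loss(w - d)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
endpoints = []
for _ in range(200):                       # random starting points
    w = rng.uniform(-2.0, 2.0, size=2)
    for _ in range(500):                   # gradient-descent walk
        w = w - 0.05 * grad(w)
    if np.linalg.norm(grad(w)) < 1e-2:     # near-zero gradient: stationary point
        endpoints.append(np.round(w, 1))

# Distinct endpoints approximate the attractors; how many walks reach each one
# estimates the relative size of its basin of attraction.
points, counts = np.unique(np.array(endpoints), axis=0, return_counts=True)
print(len(points), "stationary points; basin sizes:", counts)
```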

    Mish: A Self Regularized Non-Monotonic Activation Function

    We propose Mish, a novel self-regularized non-monotonic activation function defined mathematically as $f(x) = x\tanh(\mathrm{softplus}(x))$. As activation functions play a crucial role in the performance and training dynamics of neural networks, we validated Mish experimentally on several well-known benchmarks against the best combinations of architectures and activation functions. We also observe that data augmentation techniques have a favorable effect on benchmarks like ImageNet-1k and MS-COCO across multiple architectures. For example, Mish outperformed Leaky ReLU on YOLOv4 with a CSP-DarkNet-53 backbone by 2.1% average precision ($AP_{50}^{val}$) in MS-COCO object detection, and ReLU on ResNet-50 in ImageNet-1k Top-1 accuracy by approximately 1%, while keeping all other network parameters and hyperparameters constant. Furthermore, we explore the mathematical formulation of Mish in relation to the Swish family of functions and propose an intuitive understanding of how the first-derivative behavior may act as a regularizer helping the optimization of deep neural networks. Code is publicly available at https://github.com/digantamisra98/Mish. Comment: Accepted to BMVC 2020.
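
    As a quick reference, here is a NumPy sketch of the definition above; computing softplus via logaddexp for numerical stability is my choice, not something the paper prescribes.

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)), computed via logaddexp.
    return np.logaddexp(0.0, x)

def mish(x):
    # Mish as defined above: f(x) = x * tanh(softplus(x)).
    return x * np.tanh(softplus(x))

x = np.linspace(-5.0, 5.0, 11)
print(np.round(mish(x), 3))  # non-monotonic: dips slightly below 0 for x < 0
```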

    Deep Built-Structure Counting in Satellite Imagery Using Attention Based Re-Weighting

    In this paper, we attempt to address the challenging problem of counting built structures in satellite imagery. Building density is a more accurate estimate of population density, urban expansion, and environmental impact than built-up area segmentation. However, variance in building shapes, overlapping boundaries, and varying densities make this a complex task. To tackle this difficult problem, we propose a deep learning based regression technique for counting built structures in satellite imagery. Our proposed framework intelligently combines features from different regions of a satellite image using attention based re-weighting techniques. Multiple parallel convolutional networks are designed to capture information at different granularities. These features are combined in the FusionNet, which is trained to weigh features from different granularities differently, allowing us to predict a precise building count. To train and evaluate the proposed method, we put forward a new large-scale and challenging built-structure-count dataset. Our dataset is constructed by collecting satellite imagery from diverse geographical areas (plains, urban centers, deserts, etc.) across the globe (Asia, Europe, North America, and Africa) and captures the wide density of built structures. Detailed experimental results and analysis validate the proposed technique. FusionNet has a Mean Absolute Error of 3.65 and an R-squared measure of 88% over the testing data. Finally, we test on an unseen region of 274.3 × 10³ m², with an error of 19 buildings out of the 656 buildings in that area.
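
    The fusion step can be pictured with a small sketch. The branch count, feature shapes, the mean-pooled attention scores, and the linear counting head below are all placeholders of my own; FusionNet's actual layers are learned and described in the paper.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Three parallel branches produce feature maps at different granularities,
# faked here as random arrays of shape (channels, H, W).
rng = np.random.default_rng(0)
branches = [rng.standard_normal((8, 16, 16)) for _ in range(3)]

# A per-branch, per-location attention score (a learned projection in the real
# model; channel-mean pooling here) is normalised across branches...
scores = np.stack([b.mean(axis=0) for b in branches])        # (3, H, W)
weights = softmax(scores, axis=0)                            # attention over branches

# ...and used to re-weight and fuse the branch features before regression.
fused = sum(w[None] * b for w, b in zip(weights, branches))  # (8, H, W)
predicted_count = fused.sum() * 0.01                         # stand-in linear head
print(round(float(predicted_count), 2))
```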

    Denoising Autoencoders for fast Combinatorial Black Box Optimization

    Estimation of Distribution Algorithms (EDAs) require flexible probability models that can be efficiently learned and sampled. Autoencoders (AE) are generative stochastic networks with these desired properties. We integrate a special type of AE, the Denoising Autoencoder (DAE), into an EDA and evaluate the performance of DAE-EDA on several combinatorial optimization problems with a single objective. We assess the number of fitness evaluations as well as the required CPU times. We compare the results to the Bayesian Optimization Algorithm (BOA) and to RBM-EDA, another EDA based on a generative neural network that has proven competitive with BOA. For the considered problem instances, DAE-EDA is considerably faster than BOA and RBM-EDA, sometimes by orders of magnitude. The number of fitness evaluations is higher than for BOA, but competitive with RBM-EDA. These results show that DAEs can be useful tools for problems with low but non-negligible fitness evaluation costs. Comment: corrected typos and small inconsistencies.
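
    A minimal sketch of the DAE-EDA loop on OneMax: a single-hidden-layer DAE with masking noise is trained on the selected half of the population, then new candidates are sampled by corrupting elite solutions and reconstructing them. Problem size, noise rate, learning rate, and network size are assumptions; the paper's DAE architecture and benchmark suite differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, pop = 32, 100
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
fitness = lambda X: X.sum(axis=1)                  # OneMax: number of ones

We = rng.normal(0, 0.1, (n, 16))                   # encoder weights
Wd = rng.normal(0, 0.1, (16, n))                   # decoder weights
X = rng.integers(0, 2, (pop, n)).astype(float)

for gen in range(30):
    elite = X[np.argsort(fitness(X))[-pop // 2:]]  # truncation selection
    for _ in range(50):                            # a few SGD steps on the DAE
        noisy = elite * (rng.random(elite.shape) > 0.1)   # masking corruption
        H = sigmoid(noisy @ We)                    # encode corrupted input
        R = sigmoid(H @ Wd)                        # reconstruct the clean elite
        d2 = (R - elite) * R * (1 - R)             # squared-error backprop
        d1 = (d2 @ Wd.T) * H * (1 - H)
        Wd -= 0.5 * H.T @ d2 / len(elite)
        We -= 0.5 * noisy.T @ d1 / len(elite)
    # Sample a new population: corrupt elites, reconstruct, threshold stochastically.
    noisy = np.tile(elite, (2, 1)) * (rng.random((pop, n)) > 0.1)
    probs = sigmoid(sigmoid(noisy @ We) @ Wd)
    X = (rng.random((pop, n)) < probs).astype(float)

print("best fitness:", int(fitness(X).max()), "out of", n)
```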

    An Approximate Backpropagation Learning Rule for Memristor Based Neural Networks Using Synaptic Plasticity

    We describe an approximation to the backpropagation algorithm for training deep neural networks, designed to work with synapses implemented with memristors. The key idea is to represent the values of both the input signal and the backpropagated delta value as series of pulses that trigger multiple positive or negative updates of the synaptic weight, and to use the min operation instead of the product of the two signals. In computational simulations, we show that the proposed approximation to backpropagation converges well and may be suitable for memristor implementations of multilayer neural networks. Comment: 21 pages, 6 figures, 1 table; title changed, manuscript thoroughly rewritten.
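
    The min-for-product substitution is easy to state in code. In this sketch the pulse quantisation depth and learning rate are arbitrary choices; the paper's pulse scheme for a physical memristor crossbar is more involved.

```python
import numpy as np

def pulses(v, levels=8, v_max=1.0):
    # Quantise a signal magnitude into a pulse count, as a pulse driver would.
    return int(round(min(abs(v), v_max) / v_max * levels))

def min_rule_update(x, delta, lr=0.01, levels=8):
    # Exact backprop uses the product x * delta; the memristor-friendly rule
    # fires min(pulses(x), pulses(delta)) update pulses with the product's sign.
    k = min(pulses(x, levels), pulses(delta, levels))
    return lr * np.sign(x) * np.sign(delta) * k / levels

x, delta = 0.7, -0.3
print("product rule:", round(0.01 * x * delta, 4))            # -0.0021
print("min rule    :", round(min_rule_update(x, delta), 4))   # -0.0025
```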

    Application of Deep Learning on Predicting Prognosis of Acute Myeloid Leukemia with Cytogenetics, Age, and Mutations

    We explore how Deep Learning (DL) can be utilized to predict the prognosis of acute myeloid leukemia (AML). From the TCGA (The Cancer Genome Atlas) database, 94 AML cases are used in this study. Input data include age, 10 common cytogenetic results, and the 23 most common mutation results; the output is the prognosis (diagnosis to death, DTD). In our DL network, autoencoders are stacked to form a hierarchical DL model from which raw data are compressed and organized and high-level features are extracted. The network is written in the R language and is designed to predict the prognosis of AML for a given case (DTD of more than or less than 730 days). The DL network achieves an excellent accuracy of 83% in predicting prognosis. As a proof-of-concept study, our preliminary results demonstrate a practical application of DL in the future practice of prognostic prediction using next-gen sequencing (NGS) data. Comment: 11 pages, 1 table, 1 figure. arXiv admin note: substantial text overlap with arXiv:1801.0101
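
    The pipeline can be sketched as greedy layer-wise autoencoder pretraining over the 34 inputs (age plus 10 cytogenetic and 23 mutation features) followed by a supervised head. The paper's network is written in R; this Python sketch with synthetic stand-in data, guessed layer sizes, and a least-squares probe only illustrates the structure.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 34 inputs for 94 cases as in the study; the data here are synthetic placeholders.
X = rng.random((94, 34))
y = (rng.random(94) > 0.5).astype(float)   # placeholder labels: DTD > 730 days

def train_autoencoder(data, hidden, lr=0.1, epochs=300):
    We = rng.normal(0, 0.1, (data.shape[1], hidden))   # encoder weights
    Wd = rng.normal(0, 0.1, (hidden, data.shape[1]))   # decoder weights
    for _ in range(epochs):
        H = sigmoid(data @ We)                 # compress
        R = sigmoid(H @ Wd)                    # reconstruct
        d2 = (R - data) * R * (1 - R)          # squared-error backprop
        d1 = (d2 @ Wd.T) * H * (1 - H)
        Wd -= lr * H.T @ d2 / len(data)
        We -= lr * data.T @ d1 / len(data)
    return We

# Greedy layer-wise stacking: each autoencoder compresses the previous code.
W1 = train_autoencoder(X, 16)
H1 = sigmoid(X @ W1)
W2 = train_autoencoder(H1, 8)
H2 = sigmoid(H1 @ W2)

# A least-squares linear probe stands in for the supervised prediction stage.
w = np.linalg.lstsq(H2, y, rcond=None)[0]
print("train accuracy:", ((H2 @ w > 0.5) == y).mean())
```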

    Learning, Memory, and the Role of Neural Network Architecture

    The performance of information processing systems, from artificial neural networks to natural neuronal ensembles, depends heavily on the underlying system architecture. In this study, we compare the performance of parallel and layered network architectures during sequential tasks that require both acquisition and retention of information, thereby identifying tradeoffs between learning and memory processes. During the task of supervised, sequential function approximation, networks produce and adapt representations of external information. Performance is evaluated by statistically analyzing the error in these representations while varying the initial network state, the structure of the external information, and the time given to learn the information. We link performance to complexity in network architecture by characterizing local error landscape curvature. We find that variations in error landscape structure give rise to tradeoffs in performance; these include the ability of the network to maximize accuracy versus minimize inaccuracy and produce specific versus generalizable representations of information. Parallel networks generate smooth error landscapes with deep, narrow minima, enabling them to find highly specific representations given sufficient time. While accurate, however, these representations are difficult to generalize. In contrast, layered networks generate rough error landscapes with a variety of local minima, allowing them to quickly find coarse representations. Although less accurate, these representations are easily adaptable. The presence of measurable performance tradeoffs in both layered and parallel networks has implications for understanding the behavior of a wide variety of natural and artificial learning systems.
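
    The learning-versus-retention measurement can be illustrated with a toy sequential-approximation experiment. The wide single-hidden-layer net standing in for a "parallel" architecture and the deep narrow chain standing in for a "layered" one are my simplifications; the paper defines these architectures, and the statistics it reports, in more detail.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-1, 1, 64).unsqueeze(1)
task_a, task_b = torch.sin(3 * x), torch.cos(2 * x)   # two sequential tasks

def make_net(parallel):
    # "Parallel": one wide hidden layer; "layered": a deep, narrow chain.
    if parallel:
        return nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
    return nn.Sequential(nn.Linear(1, 8), nn.Tanh(), nn.Linear(8, 8), nn.Tanh(),
                         nn.Linear(8, 8), nn.Tanh(), nn.Linear(8, 1))

def fit(net, y, steps=500):
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        ((net(x) - y) ** 2).mean().backward()
        opt.step()

for name, flag in [("parallel", True), ("layered", False)]:
    net = make_net(flag)
    fit(net, task_a)          # acquire task A...
    fit(net, task_b)          # ...then task B on the same weights
    err_new = ((net(x) - task_b) ** 2).mean().item()   # learning
    err_old = ((net(x) - task_a) ** 2).mean().item()   # retention (memory)
    print(f"{name}: new-task error {err_new:.4f}, old-task error {err_old:.4f}")
```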

    On the energy landscape of deep networks

    We introduce "AnnealSGD", a regularized stochastic gradient descent algorithm motivated by an analysis of the energy landscape of a particular class of deep networks with sparse random weights. The loss function of such networks can be approximated by the Hamiltonian of a spherical spin glass with Gaussian coupling. While different from currently popular architectures such as convolutional ones, spin glasses are amenable to analysis, which provides insights into the topology of the loss function and motivates algorithms to minimize it. Specifically, we show that a regularization term akin to a magnetic field can be modulated with a single scalar parameter to transition the loss function from a complex, non-convex landscape with exponentially many local minima, to a phase with a polynomial number of minima, all the way down to a trivial landscape with a unique minimum. AnnealSGD starts training in the relaxed polynomial regime and gradually tightens the regularization parameter to steer the energy towards the original exponential regime. Even for convolutional neural networks, which are quite unlike sparse random networks, we empirically show that AnnealSGD improves the generalization error over competitive baselines on MNIST and CIFAR-10.
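
    The annealing mechanism reduces to a one-line schedule on the field strength. The sketch below bolts a linear "magnetic field" term onto a convex logistic-regression toy purely to show the schedule's mechanics; the paper's exact regularizer form, schedule, and deep-network setting differ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = (X @ rng.standard_normal(10) + 0.3 * rng.standard_normal(200) > 0).astype(float)

def grad(w, h):
    # Logistic-loss gradient plus the gradient of a linear field term
    # -h * sum(w), the "magnetic field" that smooths the landscape when h is large.
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y) - h * np.ones_like(w)

w, h0, steps, lr = np.zeros(10), 0.5, 2000, 0.1
for t in range(steps):
    h = h0 * (1.0 - t / steps)   # anneal the field strength down to zero,
    w -= lr * grad(w, h)         # finishing on the original objective
print("train accuracy:", ((1.0 / (1.0 + np.exp(-X @ w)) > 0.5) == y).mean())
```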

    Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior

    We describe an approach to understanding the peculiar and counterintuitive generalization properties of deep neural networks. The approach involves going beyond the worst-case theoretical capacity control frameworks that have been popular in machine learning in recent years to revisit old ideas in the statistical mechanics of neural networks. Within this approach, we present a prototypical Very Simple Deep Learning (VSDL) model, whose behavior is governed by two control parameters: one describing the effective amount of data, or load, on the network (which decreases when noise is added to the input), and one with an effective temperature interpretation (which increases when algorithms are early stopped). Using this model, we describe how a very simple application of ideas from the statistical mechanics theory of generalization provides a strong qualitative description of recently observed empirical results: the inability of deep neural networks not to overfit training data, discontinuous learning and sharp transitions in the generalization properties of learning algorithms, and so on. Comment: 31 pages; added brief discussion of recent papers that use/extend these ideas.

    High Dimensional Spaces, Deep Learning and Adversarial Examples

    In this paper, we analyze deep learning from a mathematical point of view and derive several novel results. The results are based on intriguing mathematical properties of high dimensional spaces. We first look at perturbation based adversarial examples and show how they can be understood using topological and geometrical arguments in high dimensions. We point out a mistake in an argument presented in prior published literature, and we present a more rigorous, general, and correct mathematical result to explain adversarial examples in terms of the topology of image manifolds. Second, we look at the optimization landscapes of deep neural networks and examine the number of saddle points relative to that of local minima. Third, we show how the multiresolution nature of images explains perturbation based adversarial examples in the form of a stronger result. Our results state that the expectation of the $L_2$-norm of adversarial perturbations is $O\left(\frac{1}{\sqrt{n}}\right)$ and therefore shrinks to 0 as the image resolution $n$ becomes arbitrarily large. Finally, by incorporating the parts-whole manifold learning hypothesis for natural images, we investigate the workings of deep neural networks and the root causes of adversarial examples, and discuss how future improvements can be made and how adversarial examples can be eliminated. Comment: 29 pages, 15 figures.