On Extending Neural Networks with Loss Ensembles for Text Classification
Ensemble techniques are powerful approaches that combine several weak
learners to build a stronger one. As a meta-learning framework, they can
easily be applied to many machine learning algorithms. In this
paper we propose a neural network extended with an ensemble loss function for
text classification. The weight of each weak loss function is tuned during
the training phase through the network's gradient-based optimization. The
approach is evaluated on several text classification
datasets. We also evaluate its performance in various environments with several
degrees of label noise. Experimental results show improved accuracy and
strong resilience to label noise in comparison with other methods.
Comment: 5 pages, 5 tables, 1 figure. Camera-ready submitted to The 2017
Australasian Language Technology Association Workshop (ALTA 2017).
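A minimal sketch of the idea, with hypothetical weak losses (the abstract does not name them, so cross-entropy, hinge, and squared error are assumptions here) and softmax-normalized weights whose logits would be trained jointly with the network:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical weak losses over predicted probabilities p and binary labels y.
def cross_entropy(p, y):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

def hinge(p, y):
    return np.maximum(0.0, 1.0 - (2 * y - 1) * (2 * p - 1)).mean()

def squared_error(p, y):
    return ((p - y) ** 2).mean()

WEAK_LOSSES = [cross_entropy, hinge, squared_error]

def ensemble_loss(p, y, logits):
    """Weighted combination of weak losses; `logits` are learnable parameters
    that gradient descent would update alongside the network weights."""
    w = softmax(logits)
    return sum(wi * L(p, y) for wi, L in zip(w, WEAK_LOSSES))
```

With zero logits the weights are uniform and the ensemble loss is just the average of the weak losses; during training, the gradient with respect to `logits` shifts weight toward the more useful losses.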
Towards a theory of machine learning
We define a neural network as a septuple consisting of (1) a state vector,
(2) an input projection, (3) an output projection, (4) a weight matrix, (5) a
bias vector, (6) an activation map and (7) a loss function. We argue that the
loss function can be imposed either on the boundary (i.e. input and/or output
neurons) or in the bulk (i.e. hidden neurons) for both supervised and
unsupervised systems. We apply the principle of maximum entropy to derive a
canonical ensemble of the state vectors subject to a constraint imposed on the
bulk loss function by a Lagrange multiplier (or an inverse temperature
parameter). We show that in an equilibrium the canonical partition function
must be a product of two factors: a function of the temperature and a function
of the bias vector and weight matrix. Consequently, the total Shannon entropy
consists of two terms which represent respectively a thermodynamic entropy and
a complexity of the neural network. We derive the first and second laws of
learning: during learning the total entropy must decrease until the system
reaches an equilibrium (i.e. the second law), and the increment in the loss
function must be proportional to the increment in the thermodynamic entropy
plus the increment in the complexity (i.e. the first law). We calculate the
entropy destruction to show that the efficiency of learning is given by the
Laplacian of the total free energy which is to be maximized in an optimal
neural architecture, and explain why the optimization condition is better
satisfied in a deep network with a large number of hidden layers. The key
properties of the model are verified numerically by training a supervised
feedforward neural network using the method of stochastic gradient descent. We
also discuss the possibility that the entire universe, at its most fundamental
level, is a neural network.
Comment: 32 pages, 6 figures; accepted for publication in Machine Learning:
Science and Technology.
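The statistical-mechanics setup described above can be summarized in two lines; the symbols below (H for the bulk loss, beta for the inverse temperature) are a reconstruction from the abstract, not necessarily the paper's own notation:

```latex
p(\mathbf{x}) \;=\; \frac{e^{-\beta H(\mathbf{x};\,\mathbf{w},\mathbf{b})}}{Z},
\qquad
Z(\beta,\mathbf{w},\mathbf{b}) \;=\; \sum_{\mathbf{x}} e^{-\beta H(\mathbf{x};\,\mathbf{w},\mathbf{b})}
\;\stackrel{\text{equilibrium}}{=}\; f(\beta)\, g(\mathbf{w},\mathbf{b}),
```

so that the total Shannon entropy splits additively into a beta-dependent thermodynamic term and a weight-and-bias-dependent complexity term, matching the two-factor structure of the partition function claimed in the abstract.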
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning
Random Matrix Theory (RMT) is applied to analyze weight matrices of Deep
Neural Networks (DNNs), including both production-quality, pre-trained models
such as AlexNet and Inception, and smaller models trained from scratch, such as
LeNet5 and a miniature-AlexNet. Empirical and theoretical results clearly
indicate that the DNN training process itself implicitly implements a form of
Self-Regularization. The empirical spectral density (ESD) of DNN layer matrices
displays signatures of traditionally-regularized statistical models, even in
the absence of exogenously specifying traditional forms of explicit
regularization. Building on relatively recent results in RMT, most notably its
extension to Universality classes of Heavy-Tailed matrices, we develop a theory
to identify 5+1 Phases of Training, corresponding to increasing amounts of
Implicit Self-Regularization. These phases can be observed during the training
process as well as in the final learned DNNs. For smaller and/or older DNNs,
this Implicit Self-Regularization is like traditional Tikhonov regularization,
in that there is a "size scale" separating signal from noise. For
state-of-the-art DNNs, however, we identify a novel form of Heavy-Tailed
Self-Regularization, similar to the self-organization seen in the statistical
physics of disordered systems. This results from correlations arising at all
size scales, which emerge implicitly from the training process itself. This
implicit Self-Regularization can depend strongly on the many knobs of the
training process. By exploiting the generalization gap phenomenon, we
demonstrate that we can cause a small model to exhibit all 5+1 phases of
training simply by changing the batch size. This demonstrates that, all else
being equal, DNN optimization with larger batch sizes leads to less well
implicitly regularized models, and it provides an explanation for the
generalization gap phenomenon.
Comment: 59 pages, 31 figures.
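The basic diagnostic behind this analysis is easy to sketch. Below, an i.i.d. Gaussian matrix stands in for a layer weight matrix (a trained layer would instead be loaded from a real model); its empirical spectral density stays inside the Marchenko-Pastur bulk, whereas the heavy-tailed phases described above show eigenvalue mass well past the bulk edge:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 300                      # layer dimensions (stand-in values)
W = rng.normal(size=(N, M))           # untrained stand-in for a layer matrix

# Empirical spectral density: eigenvalues of the correlation matrix X = W^T W / N
X = W.T @ W / N
eigs = np.linalg.eigvalsh(X)

# Marchenko-Pastur bulk edges for an i.i.d. matrix with aspect ratio Q = N/M
Q = N / M
lam_plus = (1 + (1 / Q) ** 0.5) ** 2
lam_minus = (1 - (1 / Q) ** 0.5) ** 2

# Fraction of eigenvalues past the bulk edge: near zero for a random matrix,
# substantial for the Heavy-Tailed Self-Regularization phases
frac_outliers = (eigs > lam_plus).mean()
```

Counting (or fitting a power law to) the eigenvalues beyond `lam_plus` is the kind of simple, post-hoc diagnostic that requires no access to training data, only the trained weights.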
Interpretable Convolutional Neural Networks via Feedforward Design
The model parameters of convolutional neural networks (CNNs) are determined
by backpropagation (BP). In this work, we propose an interpretable feedforward
(FF) design without any BP as a reference. The FF design adopts a data-centric
approach. It derives network parameters of the current layer based on data
statistics from the output of the previous layer in a one-pass manner. To
construct convolutional layers, we develop a new signal transform, called the
Saab (Subspace Approximation with Adjusted Bias) transform. It is a variant of
the principal component analysis (PCA) with an added bias vector chosen to
annihilate the nonlinearity of the activation. Multiple Saab transforms in
cascade yield multiple
convolutional layers. As to fully-connected (FC) layers, we construct them
using a cascade of multi-stage linear least-squares regressors (LSRs). The
classification accuracy and robustness against adversarial attacks of BP- and
FF-designed CNNs are compared on the MNIST and CIFAR-10 datasets. Finally, we
comment on the relationship between BP and FF designs.
Comment: 32 pages.
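A rough illustration of the data-centric, one-pass idea (not the paper's exact Saab construction, which operates on convolutional patches with a careful DC/AC kernel split): fit PCA-style kernels from data statistics, then add a bias large enough that every response is nonnegative, so a subsequent ReLU would act as the identity:

```python
import numpy as np

def saab_like_fit(patches, n_kernels):
    """One-pass, PCA-based kernel fit with a nonnegativity bias (illustrative)."""
    d = patches.shape[1]
    dc = np.ones(d) / np.sqrt(d)                    # DC kernel: constant direction
    ac = patches - patches.mean(axis=1, keepdims=True)
    ac = ac - ac.mean(axis=0)                       # center over samples
    _, _, Vt = np.linalg.svd(ac, full_matrices=False)
    kernels = np.vstack([dc, Vt[: n_kernels - 1]])  # DC + leading AC components
    resp = patches @ kernels.T
    bias = max(0.0, -resp.min())                    # shift responses to be >= 0
    return kernels, bias

def saab_like_transform(patches, kernels, bias):
    return patches @ kernels.T + bias
```

Because the bias annihilates the ReLU nonlinearity on the fitting data, stacking such transforms remains a sequence of linear-algebraic operations and needs no backpropagation.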
Calibrated Top-1 Uncertainty Estimates for Classification by Score-Based Models
While the accuracy of modern deep learning models has significantly improved
in recent years, the ability of these models to generate uncertainty estimates
has not progressed to the same degree. Uncertainty methods are designed to
provide an estimate of class probabilities when predicting class assignment.
While there are a number of proposed methods for estimating uncertainty, they
all suffer from a lack of calibration: predicted probabilities can be off from
empirical ones by a few percent or more. By restricting the scope of our
predictions to only the probability of Top-1 error, we can decrease the
calibration error of existing methods to less than one percent. As a result,
the scores of the methods also improve significantly over benchmarks.
Comment: 12 pages, 5 figures, 6 tables (major revision; new benchmark allows
us to show model calibration is better).
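The Top-1 restriction is simple to state in code. Below is a sketch of a binned calibration error computed only on the argmax probability (a standard ECE-style estimator; the equal-width binning is a choice made here, not necessarily the paper's):

```python
import numpy as np

def top1_calibration_error(probs, labels, n_bins=10):
    """ECE-style calibration error restricted to the Top-1 (argmax) probability.
    probs: (n, k) predicted class probabilities; labels: (n,) true classes."""
    conf = probs.max(axis=1)                         # Top-1 confidence
    correct = (probs.argmax(axis=1) == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # weighted gap between mean confidence and empirical accuracy
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece
```

A perfectly calibrated predictor scores 0; "less than one percent" calibration error, as claimed above, corresponds to this quantity staying below 0.01.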
Evidential Deep Learning to Quantify Classification Uncertainty
Deterministic neural nets have been shown to learn effective predictors on a
wide range of machine learning problems. However, as the standard approach is
to train the network to minimize a prediction loss, the resultant model remains
ignorant of its prediction confidence. Unlike Bayesian neural nets, which
indirectly infer prediction uncertainty through weight uncertainties, we
propose to model it explicitly using the theory of subjective logic. By
placing a Dirichlet distribution on the class probabilities, we treat
predictions of a neural net as subjective opinions and learn the function that
collects the evidence leading to these opinions by a deterministic neural net
from data. The resultant predictor for a multi-class classification problem is
another Dirichlet distribution whose parameters are set by the continuous
output of a neural net. We provide a preliminary analysis on how the
peculiarities of our new loss function drive improved uncertainty estimation.
We observe that our method achieves unprecedented success on detection of
out-of-distribution queries and endurance against adversarial perturbations.
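The Dirichlet reading of the network's output is compact enough to sketch. Assuming, as in subjective logic, that the network emits a nonnegative evidence value per class:

```python
import numpy as np

def dirichlet_opinion(evidence):
    """Map nonnegative per-class evidence to Dirichlet parameters,
    expected class probabilities, and a subjective-logic uncertainty mass."""
    alpha = evidence + 1.0          # Dirichlet parameters
    S = alpha.sum()                 # Dirichlet strength
    prob = alpha / S                # expected class probabilities
    uncertainty = len(alpha) / S    # uncertainty mass: 1 with no evidence
    return prob, uncertainty
```

With zero evidence the predictive distribution is uniform and the uncertainty mass is 1; as evidence for a class accumulates, the uncertainty mass shrinks, which is what makes out-of-distribution queries (little evidence for any class) detectable.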
Off-Policy Actor-Critic in an Ensemble: Achieving Maximum General Entropy and Effective Environment Exploration in Deep Reinforcement Learning
We propose a new policy iteration theory as an important extension of soft
policy iteration and Soft Actor-Critic (SAC), one of the most efficient
model-free algorithms for deep reinforcement learning. Supported by the new theory,
arbitrary entropy measures that generalize Shannon entropy, such as Tsallis
entropy and Renyi entropy, can be utilized to properly randomize action
selection while fulfilling the goal of maximizing expected long-term rewards.
Our theory gives rise to two new algorithms, i.e., Tsallis entropy
Actor-Critic (TAC) and Renyi entropy Actor-Critic (RAC). Theoretical analysis
shows that these algorithms can be more effective than SAC. Moreover, they pave
the way for us to develop a new Ensemble Actor-Critic (EAC) algorithm in this
paper that features the use of a bootstrap mechanism for deep environment
exploration as well as a new value-function based mechanism for high-level
action selection. Empirically we show that TAC, RAC and EAC can achieve
state-of-the-art performance on a range of benchmark control tasks,
outperforming SAC and several cutting-edge learning algorithms in terms of both
sample efficiency and effectiveness.
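The generalized entropies in question are one-liners over a discrete action distribution, and both recover Shannon entropy as their order parameter tends to 1:

```python
import numpy as np

def shannon_entropy(p):
    return -(p * np.log(p)).sum()

def tsallis_entropy(p, q):
    """Tsallis entropy of order q (q != 1); tends to Shannon entropy as q -> 1."""
    return (1.0 - (p ** q).sum()) / (q - 1.0)

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (alpha != 1); tends to Shannon as alpha -> 1."""
    return np.log((p ** alpha).sum()) / (1.0 - alpha)
```

In TAC and RAC these would replace the Shannon entropy bonus on the policy's action distribution; the order parameter tunes how aggressively low-probability actions are encouraged during exploration.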
Evolution and Analysis of Embodied Spiking Neural Networks Reveals Task-Specific Clusters of Effective Networks
Elucidating principles that underlie computation in neural networks is
currently a major research topic of interest in neuroscience. Transfer Entropy
(TE) is increasingly used as a tool to bridge the gap between network
structure, function, and behavior in fMRI studies. Computational models allow
us to bridge the gap even further by directly associating individual neuron
activity with behavior. However, most computational models that have analyzed
embodied behaviors have employed non-spiking neurons. On the other hand,
computational models that employ spiking neural networks tend to be restricted
to disembodied tasks. We show for the first time the artificial evolution and
TE-analysis of embodied spiking neural networks to perform a
cognitively-interesting behavior. Specifically, we evolved an agent controlled
by an Izhikevich neural network to perform a visual categorization task. The
smallest networks capable of performing the task were found by repeating
evolutionary runs with different network sizes. Informational analysis of the
best solution revealed task-specific TE-network clusters, suggesting that
within-task homogeneity and across-task heterogeneity were key to behavioral
success. Moreover, analysis of the ensemble of solutions revealed that
task-specificity of TE-network clusters correlated with fitness. This provides
an empirically testable hypothesis that links network structure to behavior.
Comment: Camera-ready version of the paper accepted for GECCO'1
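For reference, the Izhikevich neuron used as the agent's controller is a two-variable model; below is a minimal forward-Euler simulation of a single unit with the standard regular-spiking parameters (the evolved networks above couple many such units, which this sketch omits):

```python
def izhikevich_run(I, a=0.02, b=0.2, c=-65.0, d=8.0, t_max=1000.0, dt=0.5):
    """Simulate one Izhikevich neuron for t_max ms under constant input I.
    Returns the list of spike times (ms). Default parameters: regular spiking."""
    v, u = c, b * c                 # membrane potential and recovery variable
    spike_times = []
    t = 0.0
    while t < t_max:
        v += dt * (0.04 * v * v + 5.0 * v + 140.0 - u + I)
        u += dt * a * (b * v - u)
        if v >= 30.0:               # spike: record time, then reset
            spike_times.append(t)
            v = c
            u += d
        t += dt
    return spike_times
```

With no input the neuron settles to rest and stays silent; with a sustained input it fires tonically, and it is spike trains like these that the Transfer Entropy analysis in the paper operates on.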
Lightweight Adaptive Mixture of Neural and N-gram Language Models
It is often the case that the best performing language model is an ensemble
of a neural language model with n-grams. In this work, we propose a method to
improve how these two models are combined. By using a small network which
predicts the mixture weight between the two models, we adapt their relative
importance at each time step. Because the gating network is small, it trains
quickly on small amounts of held-out data and does not add overhead at scoring
time. Our experiments on the One Billion Word benchmark show a significant
improvement over the state-of-the-art ensemble without retraining the basic
modules.
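The combination rule is a per-time-step scalar gate. A sketch, with a single linear layer standing in for the small gating network (the features fed to the gate here are a placeholder, not the paper's specification):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mixed_next_word_prob(p_neural, p_ngram, features, W, b):
    """Interpolate two language-model distributions over the vocabulary with a
    learned, context-dependent mixture weight. `features` is whatever context
    the small gating network sees at this time step (placeholder here)."""
    g = sigmoid(features @ W + b)        # scalar mixture weight in (0, 1)
    return g * p_neural + (1.0 - g) * p_ngram
```

With W = 0 and b = 0 the gate is 0.5 and the mixture is a plain average; training only the tiny (W, b) on held-out data adapts the relative importance of the two models per time step, without retraining either base model.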
A Brain-like Cognitive Process with Shared Methods
This paper describes a new entropy-style of equation that may be useful in a
general sense, but can be applied to a cognitive model with related processes.
The model is based on the human brain, with automatic and distributed pattern
activity. Methods for carrying out the different processes are suggested. The
main purpose of this paper is to reaffirm earlier research on different
knowledge-based and experience-based clustering techniques. The overall
architecture has stayed essentially the same and so it is the localised
processes or smaller details that have been updated. For example, a counting
mechanism is now used slightly differently: to measure a level of 'cohesion'
over pattern instances rather than a 'correct' classification. The introduction
of features has further enhanced the architecture and the new entropy-style
equation is proposed. While an earlier paper defined three levels of functional
requirement, this paper re-defines the levels in a more human vernacular, with
higher-level goals described in terms of action-result pairs.