Deep Roto-Translation Scattering for Object Classification
Dictionary learning algorithms or supervised deep convolution networks have
considerably improved the efficiency of predefined feature representations such
as SIFT. We introduce a deep scattering convolution network, with predefined
wavelet filters over spatial and angular variables. This representation brings
an important improvement to results previously obtained with predefined
features over object image databases such as Caltech and CIFAR. The resulting
accuracy is comparable to results obtained with unsupervised deep learning and
dictionary-based representations. This shows that refining image
representations by using geometric priors is a promising direction to improve
image classification and its understanding. Comment: 9 pages, 3 figures, CVPR 2015 paper.
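As a rough illustration of using such predefined features (not the authors' code; the kymatio library implements the standard 2D translation scattering, while the paper additionally scatters along the angular variable):

```python
# A minimal sketch: a fixed wavelet scattering transform as a predefined
# feature extractor, using the kymatio library (translation scattering only;
# the paper's roto-translation variant also scatters over angles).
import torch
from kymatio.torch import Scattering2D

# J sets the spatial scale 2^J of the invariance, L the number of wavelet angles.
scattering = Scattering2D(J=2, shape=(32, 32), L=8)

x = torch.randn(4, 3, 32, 32)       # a batch of CIFAR-sized RGB images
Sx = scattering(x)                  # shape (4, 3, 81, 8, 8): fixed, not learned
features = Sx.flatten(start_dim=1)  # flatten to feed a standard classifier
print(features.shape)
```

A supervised classifier is then trained on these fixed coefficients, rather than learning the filters themselves.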
Cyclic Data Parallelism for Efficient Parallelism of Deep Neural Networks
Training large deep learning models requires parallelization techniques to
scale. In existing methods such as Data Parallelism or ZeRO-DP, micro-batches
of data are processed in parallel, which creates two drawbacks: the total
memory required to store the model's activations peaks at the end of the
forward pass, and gradients must be simultaneously averaged at the end of the
backpropagation step. We propose Cyclic Data Parallelism, a novel paradigm
shifting the execution of the micro-batches from simultaneous to sequential,
with a uniform delay. At the cost of a slight gradient delay, the total memory
taken by activations is constant, and the gradient communications are balanced
during the training step. With Model Parallelism, our technique reduces the
number of GPUs needed by sharing GPUs across micro-batches. Within the ZeRO-DP
framework, our technique allows communication of the model states with
point-to-point operations rather than a collective broadcast operation. We
illustrate the strength of our approach on the CIFAR-10 and ImageNet datasets.
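To make the scheduling shift concrete, here is a toy model (a sketch under simplified assumptions about activation memory, not the paper's implementation): with N workers, lockstep Data Parallelism makes every worker reach the end of its forward pass, and hence its activation-memory peak, at the same tick, while a uniform one-stage delay per worker spreads those peaks over the cycle:

```python
# Toy model: N workers, each cycling through 2N stages (N forward, N backward).
# Activation memory per worker grows during the forward pass and shrinks
# during the backward pass. 'cyclic' delays worker w by w stages.
N = 4

def stage(worker, t, cyclic):
    shift = worker if cyclic else 0
    return (t + shift) % (2 * N)

def total_activations(t, cyclic):
    total = 0
    for w in range(N):
        s = stage(w, t, cyclic)
        total += s + 1 if s < N else 2 * N - s - 1  # grow, then shrink
    return total

for name, cyclic in [("simultaneous", False), ("cyclic", True)]:
    profile = [total_activations(t, cyclic) for t in range(2 * N)]
    print(f"{name:12s} activations per tick: {profile}  peak: {max(profile)}")
# The cyclic schedule lowers the peak (12 vs 16 here) by desynchronizing
# the forward-pass memory maxima across micro-batches.
```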
DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization
This work introduces DADAO: the first decentralized, accelerated,
asynchronous, primal, first-order algorithm to minimize a sum of $L$-smooth and
$\mu$-strongly convex functions distributed over a given network of size $n$.
Our key insight is based on modeling the local gradient updates and gossip
communication procedures with separate independent Poisson Point Processes.
This allows us to decouple the computation and communication steps, which can
be run in parallel, while making the whole approach completely asynchronous,
leading to communication acceleration compared to synchronous approaches. Our
new method employs primal gradients and does not use a multi-consensus inner
loop nor other ad-hoc mechanisms such as Error Feedback, Gradient Tracking, or
a Proximal operator. By relating the inverse of the smallest positive
eigenvalue of the Laplacian matrix $\chi_1$ and the maximal resistance
$\chi_2 \le \chi_1$ of the graph to a sufficient minimal communication rate
between the nodes of the network, we show that our algorithm requires
$\mathcal{O}(n\sqrt{\frac{L}{\mu}}\log(\frac{1}{\epsilon}))$ local gradients
and only $\mathcal{O}(n\sqrt{\chi_1\chi_2}\sqrt{\frac{L}{\mu}}\log(\frac{1}{\epsilon}))$
communications to reach a precision $\epsilon$, up to logarithmic terms. Thus,
we simultaneously obtain an accelerated rate for both computations and
communications, leading to an improvement over state-of-the-art works; our
simulations further validate the strength of our relatively unconstrained
method. We also propose an SDP relaxation to find the optimal gossip rate of
each edge minimizing the total number of communications for a given graph,
resulting in faster convergence compared to standard approaches relying on
uniform communication weights. Our source code is released on a public
repository.
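A minimal sketch of the Poisson-clock modeling idea (plain, non-accelerated updates on a hypothetical ring network; not the released DADAO code): superposing two independent Poisson point processes means each event is, independently, either a local gradient step at a random node or a gossip averaging on a random edge, so computation and communication never need to synchronize:

```python
# Decoupled Poisson clocks: gradient events (intensity lam_grad) and gossip
# events (intensity lam_comm) fire independently; here node i minimizes the
# toy local objective 0.5 * (x_i - targets[i])^2.
import numpy as np

rng = np.random.default_rng(0)
n, T = 8, 20000
targets = rng.normal(size=n)
x = np.zeros(n)
edges = [(i, (i + 1) % n) for i in range(n)]   # assumed ring topology
lam_grad, lam_comm = 1.0, 2.0                  # the two Poisson intensities
step = 0.1

for _ in range(T):
    # By superposition, an event is a gradient step with probability
    # lam_grad / (lam_grad + lam_comm), otherwise a gossip step.
    if rng.random() < lam_grad / (lam_grad + lam_comm):
        i = rng.integers(n)                    # node i's clock ticks
        x[i] -= step * (x[i] - targets[i])     # local primal gradient step
    else:
        i, j = edges[rng.integers(len(edges))] # edge (i, j)'s clock ticks
        x[i] = x[j] = 0.5 * (x[i] + x[j])      # pairwise averaging (gossip)

print("consensus value:", x.mean(), " target mean:", targets.mean())
print("max disagreement:", np.abs(x - x.mean()).max())
```

The SDP relaxation mentioned above would tune one intensity per edge instead of the uniform gossip rate used in this sketch.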
Why do tree-based models still outperform deep learning on tabular data?
While deep learning has enabled tremendous progress on text and image
datasets, its superiority on tabular data is not clear. We contribute extensive
benchmarks of standard and novel deep learning methods as well as tree-based
models such as XGBoost and Random Forests, across a large number of datasets
and hyperparameter combinations. We define a standard set of 45 datasets from
varied domains with clear characteristics of tabular data and a benchmarking
methodology accounting for both fitting models and finding good
hyperparameters. Results show that tree-based models remain state-of-the-art on
medium-sized data (~10K samples) even without accounting for their
superior speed. To understand this gap, we conduct an empirical investigation
into the differing inductive biases of tree-based models and Neural Networks
(NNs). This leads to a series of challenges which should guide researchers
aiming to build tabular-specific NNs: 1. be robust to uninformative features,
2. preserve the orientation of the data, and 3. be able to easily learn
irregular functions. To stimulate research on tabular architectures, we
contribute a standard benchmark and raw data for baselines: every point of a
20,000 compute-hour hyperparameter search for each learner.
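In the spirit of this comparison (a single small task, nowhere near the paper's 45-dataset benchmark), a minimal scikit-learn sketch contrasting a tree ensemble with a plain multilayer perceptron on tabular data:

```python
# Tree ensemble vs. plain MLP on one tabular regression task (illustrative only).
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Tree-based model: robust to uninformative features, axis-aligned by design.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# NN baseline: needs feature scaling and tends to smooth irregular functions.
mlp = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(256, 256),
                                 max_iter=300, random_state=0)).fit(X_tr, y_tr)

print("RandomForest R^2:", round(rf.score(X_te, y_te), 3))
print("MLP          R^2:", round(mlp.score(X_te, y_te), 3))
```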
Low-Rank Projections of GCNs Laplacian
In this work, we study the behavior of standard models for community
detection under spectral manipulations. Through various ablation experiments,
we evaluate the impact of bandpass filtering on the performance of a GCN: we
empirically show that most of the information necessary for node
classification is contained in the low-frequency domain; thus, contrary to
images, high frequencies are less crucial for community detection. In
particular, it is sometimes possible to obtain accuracies at a state-of-the-art
level with simple classifiers that rely only on a few low frequencies.
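A small sketch of this observation on a toy graph (the Zachary karate club, chosen here for illustration; not one of the paper's benchmarks): keep only the k lowest-frequency eigenvectors of the normalized Laplacian and fit a simple classifier on them:

```python
# Low-frequency spectral features for node classification on a toy graph.
import numpy as np
import networkx as nx
from scipy.linalg import eigh
from sklearn.linear_model import LogisticRegression

G = nx.karate_club_graph()                  # 34 nodes, two communities
y = np.array([G.nodes[i]["club"] == "Officer" for i in G])
L = nx.normalized_laplacian_matrix(G).toarray()

vals, vecs = eigh(L)                        # eigenvalues in ascending order
k = 4
U_low = vecs[:, :k]                         # the k lowest graph frequencies

clf = LogisticRegression().fit(U_low, y)    # simple classifier, few frequencies
print(f"train accuracy with {k} low frequencies: {clf.score(U_low, y):.2f}")
```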