Image zooming based on sampling theorems
In this paper we introduce two digital zoom methods based on sampling theory and we study their mathematical foundation. The first one (usually known by the names of "sinc interpolation", "zero-padding" and "Fourier zoom") is commonly used by the image processing community
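A minimal 1-D NumPy sketch of the zero-padding ("Fourier zoom") idea, under our own simplifying assumptions (even signal length, integer zoom factor, no splitting of the Nyquist bin); the 2-D image case pads both frequency axes in the same way:

```python
import numpy as np

def fourier_zoom_1d(x, factor):
    """Upsample a real signal by an integer factor via spectral zero-padding.
    Assumes len(x) is even; the Nyquist bin is not split, a common
    simplification of the exact sinc interpolant."""
    n = len(x)
    spectrum = np.fft.fftshift(np.fft.fft(x))        # centred spectrum
    pad = (factor * n - n) // 2
    padded = np.pad(spectrum, (pad, pad))            # zeros at the high frequencies
    zoomed = np.fft.ifft(np.fft.ifftshift(padded))   # back to the sample domain
    return factor * zoomed.real                      # rescale to preserve the original samples

# the original samples reappear at every `factor`-th position of the zoomed signal
x = np.sin(2 * np.pi * np.arange(16) / 16)
assert np.allclose(fourier_zoom_1d(x, 4)[::4], x, atol=1e-9)
```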
SDSS-IV MaNGA IFS Galaxy Survey—Survey Design, Execution, and Initial Data Quality
The MaNGA Survey (Mapping Nearby Galaxies at Apache Point Observatory) is one of three core programs in the Sloan Digital Sky Survey IV. It is obtaining integral field spectroscopy for 10,000 nearby galaxies at a spectral resolution of R ~ 2000 from 3622 to 10354 Å. The design of the survey is driven by a set of science requirements on the precision of estimates of the following properties: star formation rate surface density, gas metallicity, stellar population age, metallicity, and abundance ratio, and their gradients; stellar and gas kinematics; and enclosed gravitational mass as a function of radius. We describe how these science requirements set the depth of the observations and dictate sample selection. The majority of targeted galaxies are selected to ensure uniform spatial coverage in units of effective radius (Re) while maximizing spatial resolution. About two-thirds of the sample is covered out to 1.5Re (Primary sample), and one-third of the sample is covered to 2.5Re (Secondary sample). We describe the survey execution with details that would be useful in the design of similar future surveys. We also present statistics on the achieved data quality, specifically the point-spread function, sampling uniformity, spectral resolution, sky subtraction, and flux calibration. For our Primary sample, the median r-band signal-to-noise ratio is ~70 per 1.4 Å pixel for spectra stacked between 1Re and 1.5Re. Measurements of various galaxy properties from the first-year data show that we are meeting or exceeding the defined requirements for the majority of our science goals
Auto-Encoders, Distributed Training and Information Representation in Deep Neural Networks
The goal of this thesis is to present a body of work that serves as my modest
contribution to humanity's
quest to understand intelligence and to implement intelligent systems.
This is a thesis by articles, containing five articles, not all of equal impact,
but all representing a very meaningful personal endeavor.
The articles are presented in chronological order, and they cluster
around two general topics: representation learning and optimization. Articles from chapters 3, 5, and 9
are in the former category, whereas articles from chapters 7 and 11 are in the latter.
In the first article, we start with the idea of manifold learning through training
a denoising auto-encoder to locally reconstruct data after perturbations.
We establish a connection between contractive auto-encoders and denoising auto-encoders.
More importantly, we prove mathematically an interesting property of the
optimal solution to denoising auto-encoders with additive Gaussian noise:
namely, that they learn exactly the score of the probability density function
of the training distribution. We present a collection of ways in which this
allows us to turn an auto-encoder into a generative model.
We provide experiments all related to the goal of local manifold learning.
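In symbols (with $\sigma$ the standard deviation of the additive Gaussian corruption, $p$ the data density, and $r^\ast$ the optimal reconstruction function), the property can be stated as

\[
r^\ast(x) \;=\; x + \sigma^2 \, \nabla_x \log p(x) + o(\sigma^2)
\qquad\text{as } \sigma \to 0,
\]

so the displacement $\bigl(r^\ast(x) - x\bigr)/\sigma^2$ is an estimator of the score $\nabla_x \log p(x)$, which is what makes the generative constructions mentioned above possible.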
In the second article, we continue with that idea of building a generative model
by learning conditional distributions. We do that in a more general setting
and we focus more on the properties of the Markov chain obtained by Gibbs sampling.
With a small modification in the construction of the Markov chain,
we obtain the more general "Generative Stochastic Networks",
which we can then stack together into a structure that can represent more
accurately the different levels of abstraction of the data modeled.
We present experiments involving the generation of MNIST digits and image inpainting.
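The sampling procedure itself is easy to sketch. The loop below (with hypothetical `corrupt` and `denoise_sample` callables standing in for the corruption process and the learned conditional) shows the alternating Markov chain in question:

```python
def gsn_sample(x0, corrupt, denoise_sample, n_steps, rng):
    """Alternate between a corruption distribution C(h | x) and a learned
    reconstruction distribution P(x | h); the sequence of x's forms a Markov
    chain whose stationary distribution approximates the data distribution.
    `corrupt` and `denoise_sample` are placeholders, not a specific model."""
    x, samples = x0, []
    for _ in range(n_steps):
        h = corrupt(x, rng)          # sample a corrupted / latent state
        x = denoise_sample(h, rng)   # sample a reconstruction
        samples.append(x)
    return samples
```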
In the third article, we present a novel idea for distributed optimization.
Our proposal uses
a collection of worker nodes to compute the importance weights to be used
by one master node to perform Importance Sampling.
This paradigm has a lot in common with the idea of curriculum learning,
whereby the order of training examples is taken to have a significant impact on the
training performance.
We present results comparing the anticipated reduction in the variance of the
gradient estimates with the reduction in variance observed in practice.
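As a rough sketch of the master-node step (assuming the workers have already returned a positive score per training example, e.g. an estimate of its gradient norm), the batch is drawn proportionally to the scores and reweighted so that the mini-batch gradient stays unbiased; the function name is ours:

```python
import numpy as np

def sample_importance_batch(scores, batch_size, rng):
    """Draw a batch with probability proportional to the worker-provided
    scores, and return correction weights 1 / (N * p_i) so that the weighted
    average of the per-example gradients remains an unbiased estimate of the
    full-dataset gradient."""
    p = scores / scores.sum()
    idx = rng.choice(len(scores), size=batch_size, p=p)
    return idx, 1.0 / (len(scores) * p[idx])

rng = np.random.default_rng(0)
idx, weights = sample_importance_batch(np.array([0.1, 2.0, 0.5, 1.4]), batch_size=2, rng=rng)
```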
In the fourth article, we go back to the concept of representation learning
by asking whether there is any measurable quantity in a neural network layer
that corresponds intuitively to its "information content".
This is particularly interesting because of a kind of paradox in
deterministic neural networks: deeper layers encode better representations of the
input signal, yet in terms of entropy they cannot carry more information than the raw inputs.
By training a linear classifier on every layer in a neural network (with frozen parameters),
we are able to measure the linear separability of the representations at every layer.
We call these "linear classifier probes", and we show how they
can be used to better understand the dynamics of training a neural network.
We present experiments with large models (Inception v3 and ResNet-50) and uncover
a surprising property: linear separability increases strictly monotonically
with layer depth.
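Operationally, a probe is nothing more than a linear classifier fit on the frozen activations of one layer; a minimal sketch (using scikit-learn's logistic regression rather than the probe-training setup of the article itself):

```python
from sklearn.linear_model import LogisticRegression

def probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear classifier on the frozen activations of one layer and
    report its test accuracy, a proxy for how linearly separable the classes
    are at that depth."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_feats, train_labels)
    return probe.score(test_feats, test_labels)
```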
In the fifth article, we return to optimization, this time studying the
curvature of the loss function. We look at the dominant eigenvalues and
eigenvectors of the Hessian matrix, and we explore the gains to be made
by moving the model parameters along those directions with an optimal step size.
We are mainly interested in the potential gains for directions of negative curvature,
because those are ignored by the very popular convex optimization
methods used by the deep learning community.
Due to the large computational costs of anything dealing with
the Hessian matrix, we run a small model on MNIST. We find that large gains
can be made in directions of negative curvature, and that the optimal step sizes
involved are larger than the current literature would recommend
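One way to make the measured quantity precise is the local quadratic model of the loss along a unit eigenvector $v$ of the Hessian with eigenvalue $\lambda$ (here $g$ denotes the gradient at the current parameters $\theta$):

\[
L(\theta + \alpha v) \;\approx\; L(\theta) + \alpha\, g^\top v + \tfrac{1}{2}\,\alpha^2 \lambda,
\qquad
\alpha^\ast = -\frac{g^\top v}{\lambda}, \quad
\Delta L^\ast = -\frac{(g^\top v)^2}{2\lambda} \quad (\lambda > 0).
\]

For $\lambda < 0$ the quadratic model alone predicts an unbounded decrease, so the best step along a negative-curvature direction has to be determined from the true loss rather than from the model, which is consistent with the observation above that the useful step sizes turn out to be larger than standard prescriptions suggest.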
Random ergodic theorems with universally representative sequences
When elements of a measure-preserving action of $\mathbb{R}^d$ or $\mathbb{Z}^d$ are selected in a random way, according to a stationary stochastic process, a.e. convergence of the averages of an $L^p$ function along the resulting orbits may almost surely hold, in every system; in such a case we call the sampling scheme universally representative. We show that i.i.d. integer-valued sampling schemes are universally representative (with $p > 1$) if and only if they have nonzero mean, and we discuss a variety of other sampling schemes which have or lack this property
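Schematically, in the one-dimensional integer case, the orbit points come from the random walk $S_n = \xi_1 + \cdots + \xi_n$ generated by the i.i.d. scheme $(\xi_k)$, and the averages in question are

\[
A_N f(x) \;=\; \frac{1}{N}\sum_{n=1}^{N} f\bigl(T^{S_n} x\bigr),
\]

with "universally representative" meaning that, almost surely in the choice of $(\xi_k)$, these averages converge almost everywhere for every $f \in L^p$ in every measure-preserving system.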
Lifting Weak Supervision To Structured Prediction
Weak supervision (WS) is a rich set of techniques that produce pseudolabels
by aggregating easily obtained but potentially noisy label estimates from a
variety of sources. WS is theoretically well understood for binary
classification, where simple approaches enable consistent estimation of
pseudolabel noise rates. Using this result, it has been shown that downstream
models trained on the pseudolabels have generalization guarantees nearly
identical to those trained on clean labels. While this is exciting, users often
wish to use WS for structured prediction, where the output space consists of
more than a binary or multi-class label set: e.g. rankings, graphs, manifolds,
and more. Do the favorable theoretical properties of WS for binary
classification lift to this setting? We answer this question in the affirmative
for a wide range of scenarios. For labels taking values in a finite metric
space, we introduce techniques new to weak supervision based on
pseudo-Euclidean embeddings and tensor decompositions, providing a
nearly-consistent noise rate estimator. For labels in constant-curvature
Riemannian manifolds, we introduce new invariants that also yield consistent
noise rate estimation. In both cases, when using the resulting pseudolabels in
concert with a flexible downstream model, we obtain generalization guarantees
nearly identical to those for models trained on clean data. Several of our
results, which can be viewed as robustness guarantees in structured prediction
with noisy labels, may be of independent interest. Empirical evaluation
validates our claims and shows the merits of the proposed method
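For intuition about what a structured pseudolabel looks like when the label space is a finite metric space, here is an illustrative aggregation baseline (a generalized median of the weak votes, not the estimator proposed above; the names and the `dist` argument are ours):

```python
def metric_median(votes, label_space, dist):
    """Return the candidate label minimizing the total distance to the
    weak-supervision votes, for any metric `dist` on the finite label space
    (e.g. Kendall tau distance when the labels are rankings)."""
    return min(label_space, key=lambda y: sum(dist(y, v) for v in votes))
```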
Toeplitz Low-Rank Approximation with Sublinear Query Complexity
We present a sublinear query algorithm for outputting a near-optimal low-rank
approximation to any positive semidefinite Toeplitz matrix $T$. In particular,
for any integer rank $k$ and accuracy parameters $\epsilon, \delta > 0$, our
algorithm makes a number of queries to the entries of $T$ that is sublinear in
the matrix dimension and outputs a low-rank matrix $\tilde{T}$ whose error
$\|T - \tilde{T}\|_F$ is within a $(1+\epsilon)$ factor of $\|T - T_k\|_F$, up
to an additive term controlled by $\delta$. Here, $\|\cdot\|_F$ is the
Frobenius norm and $T_k$ is the optimal rank-$k$ approximation to $T$, given by
projection onto its top $k$ eigenvectors. In the precise bounds, $\tilde{O}(\cdot)$
notation hides polylogarithmic factors. Our algorithm is
\emph{structure-preserving}, in that the approximation $\tilde{T}$ is also
Toeplitz. A key technical contribution is a proof that any positive
semidefinite Toeplitz matrix in fact has a near-optimal low-rank approximation
which is itself Toeplitz. Surprisingly, this basic existence result was not
previously known. Building on this result, along with the well-established
off-grid Fourier structure of Toeplitz matrices [Cybenko'82], we show that a
Toeplitz $\tilde{T}$ with near-optimal error can be recovered with a small
number of random queries via a leverage-score-based off-grid sparse Fourier
sampling scheme.
Comment: Accepted in SODA 202
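For reference, the dense baseline the guarantee is measured against, the optimal rank-$k$ approximation $T_k$ of a PSD matrix obtained from its top eigenvectors, can be computed as follows (this reads every entry of $T$, which is exactly what the sublinear-query algorithm avoids):

```python
import numpy as np

def best_rank_k_psd(T, k):
    """Optimal rank-k approximation of a symmetric PSD matrix: project onto
    the eigenvectors of its k largest eigenvalues."""
    vals, vecs = np.linalg.eigh(T)       # eigenvalues in ascending order
    top = np.argsort(vals)[-k:]          # indices of the k largest eigenvalues
    return (vecs[:, top] * vals[top]) @ vecs[:, top].T
```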
Federated Hypergradient Descent
In this work, we explore combining automatic hyperparameter tuning and
optimization for federated learning (FL) in an online, one-shot procedure. We
apply a principled approach to adapting the client learning rate, the
number of local steps, and the batch size. In our federated learning applications,
our primary motivations are minimizing communication budget as well as local
computational resources in the training pipeline. Conventionally,
hyperparameter tuning methods involve at least some degree of trial-and-error,
which is known to be sample inefficient. In order to address our motivations,
we propose FATHOM (Federated AuTomatic Hyperparameter OptiMization) as a
one-shot online procedure. We investigate the challenges and solutions of
deriving analytical gradients with respect to the hyperparameters of interest.
Our approach is inspired by the fact that, with the exception of local data, we
have full knowledge of all components involved in our training process, and
this fact can be exploited to great effect in our algorithm. We show that FATHOM is
more communication efficient than Federated Averaging (FedAvg) with optimized,
static valued hyperparameters, and is also more computationally efficient
overall. As a communication efficient, one-shot online procedure, FATHOM solves
the bottleneck of costly communication and limited local computation, by
eliminating a potentially wasteful tuning process, and by optimizing the
hyperparameters adaptively throughout the training procedure without
trial-and-error. We present numerical results from extensive empirical
experiments with the Federated EMNIST-62 (FEMNIST) and Federated Stack Overflow
(FSO) datasets, using FedJAX as our baseline framework
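The underlying idea of adapting a hyperparameter by gradient descent on the training objective can be illustrated with the classic single-machine learning-rate update below; FATHOM's analytical, federated treatment of the client learning rate, local steps, and batch size is considerably more involved, and the names here are ours:

```python
import numpy as np

def sgd_step_with_hypergradient_lr(theta, lr, grad_fn, prev_grad, hyper_lr=1e-4):
    """One SGD step in which the learning rate is itself updated by gradient
    descent: since d(loss_t)/d(lr) = -g_t . g_{t-1}, the learning rate moves
    by +hyper_lr * (g_t . g_{t-1}) before the parameter update."""
    g = grad_fn(theta)
    lr = lr + hyper_lr * float(np.dot(g, prev_grad))
    return theta - lr * g, lr, g
```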
Identifying Spurious Biases Early in Training through the Lens of Simplicity Bias
Neural networks trained with (stochastic) gradient descent have an inductive
bias towards learning simpler solutions. This makes them prone to
learning simple spurious features that are highly correlated with a label,
rather than the predictive but more complex core features. In this work, we show
that, interestingly, the simplicity bias of gradient descent can be leveraged
to identify spurious correlations early in training. First, we prove, for a
two-layer neural network, that groups of examples with high spurious
correlation are separable based on the model's output in the initial training
iterations. We further show that if spurious features have a small enough
noise-to-signal ratio, the network's output on the majority of examples in a
class will be almost exclusively determined by the spurious features and will
be nearly invariant to the core feature. Finally, we propose SPARE, which
separates large groups with spurious correlations early in training, and
utilizes importance sampling to alleviate the spurious correlation, by
balancing the group sizes. We show that SPARE achieves up to 5.6% higher
worst-group accuracy than state-of-the-art methods, while being up to 12x
faster. We also show the applicability of SPARE to discover and mitigate
spurious correlations in Restricted ImageNet
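A rough sketch of the recipe described above: cluster the examples of a class by the model's outputs early in training (groups with strong spurious correlation tend to separate), then weight each example inversely to its group's size so that importance sampling balances the inferred groups. The helper below is illustrative only and does not reproduce the method's details:

```python
import numpy as np
from sklearn.cluster import KMeans

def group_balanced_sampling_weights(early_outputs, n_groups=2):
    """Cluster examples by the network's early-training outputs, then assign
    each example a sampling weight inversely proportional to its inferred
    group's size (weights are normalized to sum to one)."""
    groups = KMeans(n_clusters=n_groups, n_init=10).fit_predict(early_outputs)
    sizes = np.bincount(groups, minlength=n_groups)
    weights = 1.0 / sizes[groups]
    return groups, weights / weights.sum()
```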
Checking Trustworthiness of Probabilistic Computations in a Typed Natural Deduction System
In this paper we present the probabilistic typed natural deduction calculus
TPTND, designed to reason about and derive trustworthiness properties of
probabilistic computational processes, like those underlying current AI
applications. Derivability in TPTND is interpreted as the process of extracting
samples of possibly complex outputs with a certain frequency from a given
categorical distribution. We formalize trust for such outputs as a form of
hypothesis testing on the distance between such frequency and the intended
probability. The main advantage of the calculus is that it renders this notion of
trustworthiness checkable. We present a computational semantics for the terms
over which we reason and then the semantics of TPTND, where logical operators
as well as a Trust operator are defined through introduction and elimination
rules. We illustrate structural and metatheoretical properties, with particular
focus on the ability to establish under which term evolutions and applications of
logical rules the notion of trustworthiness can be preserved
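The numerical core of the trust check (observed frequency close enough to the intended probability) is simple to state in code; the toy function below only illustrates that comparison, not the calculus itself, and its name is ours:

```python
def within_trust_tolerance(successes, trials, intended_p, tolerance):
    """Accept an output as trustworthy when the observed frequency of the
    expected result is within `tolerance` of the intended probability."""
    return abs(successes / trials - intended_p) <= tolerance
```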
Algorithmic Fairness in Business Analytics: Directions for Research and Practice
The extensive adoption of business analytics (BA) has brought financial gains
and increased efficiencies. However, these advances have simultaneously drawn
attention to rising legal and ethical challenges when BA inform decisions with
fairness implications. As a response to these concerns, the emerging study of
algorithmic fairness deals with algorithmic outputs that may result in
disparate outcomes or other forms of injustices for subgroups of the
population, especially those who have been historically marginalized. Fairness
is relevant on the basis of legal compliance, social responsibility, and
utility; if not adequately and systematically addressed, unfair BA systems may
lead to societal harms and may also threaten an organization's own survival,
its competitiveness, and overall performance. This paper offers a
forward-looking, BA-focused review of algorithmic fairness. We first review the
state-of-the-art research on sources and measures of bias, as well as bias
mitigation algorithms. We then provide a detailed discussion of the
utility-fairness relationship, emphasizing that the frequent assumption of a
trade-off between these two constructs is often mistaken or short-sighted.
Finally, we chart a path forward by identifying opportunities for business
scholars to address impactful, open challenges that are key to the effective
and responsible deployment of BA