Juggling Functions Inside a Database
We define and study the Functional Aggregate Query (FAQ) problem, which
captures common computational tasks across a very wide range of domains
including relational databases, logic, matrix and tensor computation,
probabilistic graphical models, constraint satisfaction, and signal processing.
Simply put, an FAQ is a declarative way of defining a new function from a
database of input functions.
We present "InsideOut", a dynamic programming algorithm, to evaluate an FAQ.
The algorithm rewrites the input query into a set of easier-to-compute FAQ
sub-queries. Each sub-query is then evaluated using a worst-case optimal
relational join algorithm. The topic of designing algorithms to optimally
evaluate the classic multiway join problem has seen exciting developments in
the past few years. Our framework tightly connects these new ideas in database
theory with a vast number of application areas in a coherent manner, showing
potentially that a good database engine can be a general-purpose constraint
solver, relational data store, graphical model inference engine, and
matrix/tensor computation processor all at once.
The InsideOut algorithm is very simple, as shall be described in this paper.
Yet, in spite of solving an extremely general problem, its runtime either is as
good as or improves upon the best known algorithm for the applications that FAQ
specializes to. These corollaries include computational tasks in graphical
model inference, matrix/tensor operations, relational joins, and logic. Better
yet, InsideOut can be used within any database engine, because it is basically
a principled way of rewriting queries. Indeed, it is already part of the
LogicBlox database engine, where it helps to efficiently answer traditional database
queries and graphical model inference queries, and to train a large class of machine
learning models inside the database itself.
Comment: arXiv admin note: text overlap with arXiv:1504.0404
FAQ: Questions Asked Frequently
We define and study the Functional Aggregate Query (FAQ) problem, which
encompasses many frequently asked questions in constraint satisfaction,
databases, matrix operations, probabilistic graphical models and logic. This is
our main conceptual contribution.
We then present a simple algorithm called "InsideOut" to solve this general
problem. InsideOut is a variation of the traditional dynamic programming
approach for constraint programming based on variable elimination. Our
variation adds a couple of simple twists to basic variable elimination in order
to deal with the generality of FAQ, to take full advantage of Grohe and Marx's
fractional edge cover framework, and of the analysis of recent worst-case
optimal relational join algorithms.
As is the case with constraint programming and graphical model inference, to
make InsideOut run efficiently we need to solve an optimization problem to
compute an appropriate 'variable ordering'. The main technical contribution of
this work is a precise characterization of when a variable ordering is
'semantically equivalent' to the variable ordering given by the input FAQ
expression. Then, we design an approximation algorithm to find an equivalent
variable ordering that has the best 'fractional FAQ-width'. Our results imply a
host of known and a few new results in graphical model inference, matrix
operations, relational joins, and logic.
We also briefly explain how recent algorithms on beyond worst-case analysis
for joins and those for solving SAT and #SAT can be viewed as variable
elimination to solve FAQ over compactly represented input functions.
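As a rough illustration of the variable-elimination idea this abstract builds on, the sketch below evaluates a sum-product query by summing out one variable at a time. It is plain textbook variable elimination with brute-force enumeration over a shared domain, not the InsideOut algorithm itself, which additionally uses worst-case optimal joins and handles more general aggregates. The names `eliminate`, `sum_product`, and the factor layout (a tuple of variable names plus a dict from value tuples to numbers) are assumptions made for this example.

```python
from itertools import product

def eliminate(factors, var, domain):
    """Sum out `var`: multiply the factors that mention it, then sum over its values."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    # Scope of the new factor: every other variable that appears alongside `var`.
    scope = tuple(sorted({v for vs, _ in touching for v in vs if v != var}))
    table = {}
    for values in product(domain, repeat=len(scope)):
        assignment = dict(zip(scope, values))
        total = 0
        for x in domain:
            assignment[var] = x
            term = 1
            for vs, t in touching:
                term *= t.get(tuple(assignment[v] for v in vs), 0)
            total += term
        if total:
            table[values] = total
    return rest + [(scope, table)]

def sum_product(factors, order, domain):
    """Eliminate every variable in `order`; return the resulting scalar."""
    for var in order:
        factors = eliminate(factors, var, domain)
    result = 1
    for _, t in factors:  # every remaining factor has an empty scope
        result *= t.get((), 0)
    return result

# Hypothetical input: a symmetric edge relation (triangle 0-1-2 plus edge 2-3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
E = {}
for u, v in edges:
    E[(u, v)] = 1
    E[(v, u)] = 1

# Sum over a, b, c of E(a,b) * E(b,c) * E(a,c): counts ordered triangles.
factors = [(("a", "b"), E), (("b", "c"), E), (("a", "c"), E)]
ordered_triangles = sum_product(factors, ["a", "b", "c"], range(4))
```

The choice of `order` does not change the answer for this query, but it does change the size of the intermediate factors, which is the optimization problem the abstract refers to as finding a good variable ordering.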
Worst-Case Optimal Join Algorithms: Techniques, Results, and Open Problems
Worst-case optimal join algorithms are the class of join algorithms whose
runtime matches the worst-case output size of a given join query. While the first
provably worst-case optimal join algorithm was discovered relatively recently,
the techniques and results surrounding these algorithms grow out of decades of
research from a wide range of areas, intimately connecting graph theory,
algorithms, information theory, constraint satisfaction, database theory, and
geometric inequalities. These ideas are not just paperware: in addition to
academic project implementations, two variations of such algorithms are the
work-horse join algorithms of commercial database and data analytics engines.
This paper aims to be a brief introduction to the design and analysis of
worst-case optimal join algorithms. We discuss the key techniques for proving
runtime and output size bounds. We particularly focus on the fascinating
connection between join algorithms and information theoretic inequalities, and
the idea of how one can turn a proof into an algorithm. Finally, we conclude
with a representative list of fundamental open problems in this area.
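To make the idea concrete, here is a minimal sketch of an attribute-at-a-time join for the triangle query, the standard small example in this line of work. It binds one attribute at a time and intersects the candidate values allowed by every relation that mentions that attribute. The helper names `index` and `triangle_join` and the sample data are assumptions for illustration, and the sketch shows only the structure of such an algorithm; it makes no claim about matching the runtime guarantees discussed in the paper.

```python
import math
from collections import defaultdict

def index(pairs):
    """Map each first component to the set of second components paired with it."""
    idx = defaultdict(set)
    for x, y in pairs:
        idx[x].add(y)
    return idx

def triangle_join(R, S, T):
    """Enumerate (a, b, c) with (a, b) in R, (b, c) in S, and (a, c) in T."""
    R_ab, S_bc, T_ac = index(R), index(S), index(T)
    out = []
    # Bind a: it must appear as a first component in both R and T.
    for a in R_ab.keys() & T_ac.keys():
        # Bind b: it must follow a in R and start some pair in S.
        for b in S_bc.keys() & R_ab[a]:
            # Bind c: it must follow b in S and follow a in T.
            for c in S_bc[b] & T_ac[a]:
                out.append((a, b, c))
    return out

# Hypothetical input: the same edge list plays the role of all three relations.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
triangles = triangle_join(edges, edges, edges)

# For the triangle query the worst-case output size is sqrt(|R| * |S| * |T|).
output_bound = math.sqrt(len(edges) ** 3)
```

A pairwise join plan would first materialize all two-edge paths, which can be much larger than the final output; intersecting per attribute avoids building that intermediate result.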
Average size of 2-Selmer groups of elliptic curves over function fields
Employing a geometric setting inspired by the proof of the Fundamental Lemma,
we study some counting problems related to the average size of 2-Selmer groups
and hence obtain an estimate for it.
Comment: Thoroughly revised to improve the exposition
Interpreting Chest X-rays via CNNs that Exploit Hierarchical Disease Dependencies and Uncertainty Labels
Chest X-rays (CXRs) are among the views most commonly ordered by
radiologists (NHS) and are critical for the diagnosis of many different thoracic
diseases. Accurately detecting the presence of multiple diseases from CXRs is
still a challenging task. We present a multi-label classification framework
based on deep convolutional neural networks (CNNs) for diagnosing the presence
of 14 common thoracic diseases and observations. Specifically, we trained a
strong set of CNNs that exploit dependencies among abnormality labels and used
label smoothing regularization (LSR) to better handle uncertain
samples. Our deep networks were trained on over 200,000 CXRs of the recently
released CheXpert dataset (Irvin et al., 2019), and the final model, an
ensemble of the best-performing networks, achieved a mean area under the curve
(AUC) of 0.940 in predicting 5 selected pathologies from the validation set. To
the best of our knowledge, this is the highest AUC score reported to date.
More importantly, the proposed method was also evaluated on an independent
test set of the CheXpert competition, containing 500 CXR studies annotated by a
panel of 5 experienced radiologists. The reported performance was on average
better than 2.6 out of 3 other individual radiologists, with a mean AUC of 0.930,
which led to the current state-of-the-art performance on the CheXpert test
set.
Comment: MIDL 2020 Accepted Short Paper. arXiv admin note: substantial text
overlap with arXiv:1911.0647
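One way to picture label smoothing for uncertain samples is sketched below: instead of forcing an uncertain label to a hard 0 or 1, it is replaced by a soft target drawn from an interval. The abstract does not give the details, so everything specific here is an assumption for illustration: the encoding of an uncertain label as -1, the interval bounds `low` and `high`, and the function name `smooth_labels`.

```python
import random

UNCERTAIN = -1  # assumed encoding of an uncertain label (not stated in the abstract)

def smooth_labels(labels, low=0.55, high=0.85, rng=None):
    """Replace uncertain labels with a soft target in [low, high].

    The interval bounds are illustrative placeholders, not values taken
    from the abstract. Certain labels (0 or 1) pass through unchanged.
    """
    rng = rng or random.Random(0)
    return [rng.uniform(low, high) if y == UNCERTAIN else float(y) for y in labels]

# Hypothetical label vector for one study: positive, negative, two uncertain.
targets = smooth_labels([1, 0, UNCERTAIN, UNCERTAIN])
```

The soft targets can then be used with an ordinary binary cross-entropy loss, so uncertain samples still contribute to training without being treated as confident positives or negatives.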
Sparse Approximation, List Decoding, and Uncertainty Principles
We consider list versions of sparse approximation problems, where unlike the
existing results in sparse approximation that consider situations with unique
solutions, we are interested in multiple solutions. We introduce these problems
and present the first combinatorial results on the output list size. These
generalize and enhance some of the existing results on threshold phenomenon and
uncertainty principles in sparse approximations. Our definitions and results
are inspired by similar results in list decoding. We also present lower bound
examples that bolster our results and show they are of the appropriate size.
Analyzing Nonblocking Switching Networks using Linear Programming (Duality)
The main task in analyzing a switching network design (including circuit-,
multirate-, and photonic-switching) is to determine the minimum number of some
switching components so that the design is non-blocking in some sense (e.g.,
strict- or wide-sense). We show that, in many cases, this task can be
accomplished with a simple two-step strategy: (1) formulate a linear program
whose optimum value is a bound for the minimum number we are seeking, and (2)
specify a solution to the dual program, whose objective value by weak duality
immediately yields a sufficient condition for the design to be non-blocking.
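The role of weak duality in step (2) can be written out as follows. This is a sketch under the assumption that the linear program from step (1) is a maximization in standard form; the symbols $A$, $b$, $c$, $x$, and $y$ are introduced here for illustration and do not come from the abstract.

```latex
\begin{align*}
\text{(P)}\quad & \max\ c^{\top}x \quad \text{s.t. } Ax \le b,\ x \ge 0,\\
\text{(D)}\quad & \min\ b^{\top}y \quad \text{s.t. } A^{\top}y \ge c,\ y \ge 0,\\
& c^{\top}x \;\le\; (A^{\top}y)^{\top}x \;=\; y^{\top}(Ax) \;\le\; y^{\top}b
  \quad \text{for all feasible } x \text{ and } y .
\end{align*}
```

The first inequality uses $A^{\top}y \ge c$ with $x \ge 0$, and the second uses $Ax \le b$ with $y \ge 0$. Under this assumption, any single feasible dual solution $y$ certifies that the primal optimum is at most $b^{\top}y$, without solving either program to optimality.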
We illustrate this technique through a variety of examples, ranging from
circuit to multirate to photonic switching, from unicast to -cast and
multicast, and from strict- to wide-sense non-blocking. The switching
architectures in the examples are of Clos-type and Banyan-type, which are the
two most popular architectural choices for designing non-blocking switching
networks.
To prove the result in the multirate Clos network case, we formulate a new
problem called "dynamic weighted edge coloring", which generalizes the
"dynamic bin packing" problem. We then design an algorithm with competitive
ratio 5.6355 for the problem. The algorithm is analyzed using the linear
programming technique. A new upper bound for multirate wide-sense non-blocking
Clos networks follows, improving upon a decade-old bound on the same problem.
How to Scale Up the Spectral Efficiency of Multi-way Massive MIMO Relaying?
This paper considers a decode-and-forward (DF) multi-way massive
multiple-input multiple-output (MIMO) relay system where many users exchange
their data with the aid of a relay station equipped with a massive antenna
array. We propose a new transmission protocol which leverages successive
cancelation decoding and zero-forcing (ZF) at the users. By using properties of
massive MIMO, a tight analytical approximation of the spectral efficiency is
derived. We show that our proposed scheme uses only half of the time-slots
required in the conventional scheme (in which the number of time-slots is equal
to the number of users [1]) to exchange data across different users. As a
result, the sum spectral efficiency of our proposed scheme is nearly double
that of the conventional scheme, thereby boosting the performance of multi-way
massive MIMO to unprecedented levels.
On Optimality Conditions for Auto-Encoder Signal Recovery
Auto-Encoders are unsupervised models that aim to learn patterns from
observed data by minimizing a reconstruction cost. The useful representations
learned are often found to be sparse and distributed. On the other hand,
compressed sensing and sparse coding assume a data generating process, where
the observed data is generated from some true latent signal source, and try to
recover the corresponding signal from measurements. Looking at auto-encoders
from this signal recovery perspective enables us to have a more
coherent view of these techniques. In this paper, in particular, we show that
the true hidden representation can be approximately recovered if the
weight matrices are highly incoherent with unit row length and the
bias vectors take values (approximately) equal to the negative of the data
mean. The recovery also becomes more and more accurate as the sparsity in
hidden signals increases. Additionally, we empirically demonstrate that
auto-encoders are capable of recovering the data generating dictionary when
only data samples are given.
AC/DC: In-Database Learning Thunderstruck
We report on the design and implementation of the AC/DC gradient descent
solver for a class of optimization problems over normalized databases. AC/DC
decomposes an optimization problem into a set of aggregates over the join of
the database relations. It then uses the answers to these aggregates to
iteratively improve the solution to the problem until it converges.
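The split between computing aggregates once and iterating on them can be sketched for the simplest case, least-squares linear regression over a two-relation join. This is an illustrative toy rather than AC/DC itself: it materializes the join row by row, whereas the abstract describes factorized computation, and the relation layout, the function names `join_rows`, `aggregates`, and `gradient_descent`, and the step size are all assumptions.

```python
def join_rows(sales, stores):
    """Natural join on the store key; yields (features, target) pairs."""
    size_of = dict(stores)
    for store, units in sales:
        if store in size_of:
            yield [1.0, size_of[store]], units  # features: intercept, store size

def aggregates(rows):
    """One pass over the join: sigma = sum of x x^T, c = sum of x * y, n = row count."""
    sigma = [[0.0, 0.0], [0.0, 0.0]]
    c = [0.0, 0.0]
    n = 0
    for x, y in rows:
        n += 1
        for i in range(2):
            c[i] += x[i] * y
            for j in range(2):
                sigma[i][j] += x[i] * x[j]
    return sigma, c, n

def gradient_descent(sigma, c, n, step=0.1, iters=2000):
    """Minimize mean squared error using only the aggregates, never the data."""
    theta = [0.0, 0.0]
    for _ in range(iters):
        grad = [
            (sum(sigma[i][j] * theta[j] for j in range(2)) - c[i]) / n
            for i in range(2)
        ]
        theta = [theta[i] - step * grad[i] for i in range(2)]
    return theta

# Hypothetical relations: units sold equal 1 + 2 * store size exactly.
stores = [("s1", 1.0), ("s2", 2.0), ("s3", 3.0)]
sales = [("s1", 3.0), ("s2", 5.0), ("s3", 7.0)]

sigma, c, n = aggregates(join_rows(sales, stores))
theta = gradient_descent(sigma, c, n)
```

Each iteration costs time that depends only on the number of features, not on the number of tuples in the join, which is why the aggregates are worth computing up front.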
The challenges faced by AC/DC are the large database size, the mixture of
continuous and categorical features, and the large number of aggregates to
compute. AC/DC addresses these challenges by employing a sparse data
representation, factorized computation, problem reparameterization under
functional dependencies, and a data structure that supports shared computation
of aggregates.
To train polynomial regression models and factorization machines of up to
154K features over the natural join of all relations from a real-world dataset
of up to 86M tuples, AC/DC needs up to 30 minutes on one core of a commodity
machine. This is up to three orders of magnitude faster than its competitors R,
MadLib, libFM, and TensorFlow whenever they finish, that is, whenever they do not
exceed the memory limit, the 24-hour timeout, or internal design limitations.
Comment: 10 pages, 3 figures