1,340 research outputs found

    Juggling Functions Inside a Database

    We define and study the Functional Aggregate Query (FAQ) problem, which captures common computational tasks across a very wide range of domains including relational databases, logic, matrix and tensor computation, probabilistic graphical models, constraint satisfaction, and signal processing. Simply put, an FAQ is a declarative way of defining a new function from a database of input functions. We present "InsideOut", a dynamic programming algorithm, to evaluate an FAQ. The algorithm rewrites the input query into a set of easier-to-compute FAQ sub-queries. Each sub-query is then evaluated using a worst-case optimal relational join algorithm. The topic of designing algorithms to optimally evaluate the classic multiway join problem has seen exciting developments in the past few years. Our framework tightly connects these new ideas in database theory with a vast number of application areas in a coherent manner, showing potentially that a good database engine can be a general-purpose constraint solver, relational data store, graphical model inference engine, and matrix/tensor computation processor all at once. The InsideOut algorithm is very simple, as shall be described in this paper. Yet, in spite of solving an extremely general problem, its runtime either is as good as or improves upon the best known algorithm for the applications that FAQ specializes to. These corollaries include computational tasks in graphical model inference, matrix/tensor operations, relational joins, and logic. Better yet, InsideOut can be used within any database engine, because it is basically a principled way of rewriting queries. Indeed, it is already part of the LogicBlox database engine, helping to efficiently answer traditional database queries and graphical model inference queries, and to train a large class of machine learning models inside the database itself. Comment: arXiv admin note: text overlap with arXiv:1504.0404

    FAQ: Questions Asked Frequently

    We define and study the Functional Aggregate Query (FAQ) problem, which encompasses many frequently asked questions in constraint satisfaction, databases, matrix operations, probabilistic graphical models and logic. This is our main conceptual contribution. We then present a simple algorithm called "InsideOut" to solve this general problem. InsideOut is a variation of the traditional dynamic programming approach for constraint programming based on variable elimination. Our variation adds a couple of simple twists to basic variable elimination in order to deal with the generality of FAQ, to take full advantage of Grohe and Marx's fractional edge cover framework, and of the analysis of recent worst-case optimal relational join algorithms. As is the case with constraint programming and graphical model inference, to make InsideOut run efficiently we need to solve an optimization problem to compute an appropriate 'variable ordering'. The main technical contribution of this work is a precise characterization of when a variable ordering is 'semantically equivalent' to the variable ordering given by the input FAQ expression. Then, we design an approximation algorithm to find an equivalent variable ordering that has the best 'fractional FAQ-width'. Our results imply a host of known and a few new results in graphical model inference, matrix operations, relational joins, and logic. We also briefly explain how recent algorithms on beyond worst-case analysis for joins and those for solving SAT and #SAT can be viewed as variable elimination to solve FAQ over compactly represented input functions
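
The variable elimination idea that the abstract above builds on can be illustrated with a minimal sketch. This is not the InsideOut algorithm itself; it is plain variable elimination on a toy sum-product query, the sum over a, b, c of f(a, b) * g(b, c). The function names and the toy factors are assumptions made for illustration only.

```python
from itertools import product


def brute_force(f, g, dom):
    """Evaluate sum_{a,b,c} f(a,b) * g(b,c) by enumerating every assignment."""
    return sum(f[a, b] * g[b, c] for a, b, c in product(dom, repeat=3))


def eliminate_variables(f, g, dom):
    """Evaluate the same sum by eliminating one variable at a time."""
    # Eliminate c: h(b) = sum_c g(b, c)
    h = {b: sum(g[b, c] for c in dom) for b in dom}
    # Eliminate a: k(b) = sum_a f(a, b)
    k = {b: sum(f[a, b] for a in dom) for b in dom}
    # Eliminate b: combine the two intermediate one-variable factors
    return sum(k[b] * h[b] for b in dom)


# Hypothetical toy factors over a three-value domain.
dom = range(3)
f = {(a, b): a + 2 * b + 1 for a in dom for b in dom}
g = {(b, c): b * c + 1 for b in dom for c in dom}

total_brute = brute_force(f, g, dom)
total_elim = eliminate_variables(f, g, dom)
```

On a domain of size n, the brute-force version touches n^3 assignments while the elimination version does work proportional to n^2, which is the kind of saving a good variable ordering is meant to secure.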

    Worst-Case Optimal Join Algorithms: Techniques, Results, and Open Problems

    Worst-case optimal join algorithms are the class of join algorithms whose runtime matches the worst-case output size of a given join query. While the first provably worst-case optimal join algorithm was discovered relatively recently, the techniques and results surrounding these algorithms grow out of decades of research from a wide range of areas, intimately connecting graph theory, algorithms, information theory, constraint satisfaction, database theory, and geometric inequalities. These ideas are not just paperware: in addition to academic project implementations, two variations of such algorithms are the work-horse join algorithms of commercial database and data analytics engines. This paper aims to be a brief introduction to the design and analysis of worst-case optimal join algorithms. We discuss the key techniques for proving runtime and output size bounds. We particularly focus on the fascinating connection between join algorithms and information theoretic inequalities, and the idea of how one can turn a proof into an algorithm. Finally, we conclude with a representative list of fundamental open problems in this area
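
As a rough illustration of the attribute-at-a-time style that worst-case optimal join algorithms commonly use, the sketch below evaluates the triangle query R(a, b), S(b, c), T(a, c) by intersecting candidate sets one variable at a time instead of joining two relations first. It is a simplified sketch under stated assumptions, not an algorithm taken from the paper: the function name `triangle_join`, the hash-based indexes, and the sample relations are all hypothetical, and no runtime guarantee is claimed for this toy version.

```python
from collections import defaultdict


def triangle_join(R, S, T):
    """Return all (a, b, c) with (a, b) in R, (b, c) in S and (a, c) in T."""
    # Build one hash index per relation, keyed on its first attribute.
    R_by_a = defaultdict(set)
    for a, b in R:
        R_by_a[a].add(b)
    S_by_b = defaultdict(set)
    for b, c in S:
        S_by_b[b].add(c)
    T_by_a = defaultdict(set)
    for a, c in T:
        T_by_a[a].add(c)

    out = []
    # Bind a, then b, then c; each step intersects the candidates
    # allowed by every relation that mentions the variable.
    for a in R_by_a.keys() & T_by_a.keys():
        for b in S_by_b.keys() & R_by_a[a]:
            for c in S_by_b[b] & T_by_a[a]:
                out.append((a, b, c))
    return sorted(out)


# Hypothetical sample relations.
R = [(1, 2), (1, 3), (2, 3)]
S = [(2, 3), (3, 1)]
T = [(1, 3), (2, 1)]
triangles = triangle_join(R, S, T)
```

The design choice to highlight is that no pairwise intermediate result (such as the full join of R and S) is ever materialized; every partial binding is checked against all relevant relations before it is extended.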

    Average size of 2-Selmer groups of elliptic curves over function fields

    Employing a geometric setting inspired by the proof of the Fundamental Lemma, we study some counting problems related to the average size of 2-Selmer groups and hence obtain an estimate for it. Comment: Thoroughly revised to improve the exposition

    Interpreting Chest X-rays via CNNs that Exploit Hierarchical Disease Dependencies and Uncertainty Labels

    The chest X-ray (CXR) is one of the views most commonly ordered by radiologists (NHS) and is critical for the diagnosis of many different thoracic diseases. Accurately detecting the presence of multiple diseases from CXRs is still a challenging task. We present a multi-label classification framework based on deep convolutional neural networks (CNNs) for diagnosing the presence of 14 common thoracic diseases and observations. Specifically, we trained a strong set of CNNs that exploit dependencies among abnormality labels and used label smoothing regularization (LSR) for better handling of uncertain samples. Our deep networks were trained on over 200,000 CXRs of the recently released CheXpert dataset (Irvin et al., 2019), and the final model, which was an ensemble of the best performing networks, achieved a mean area under the curve (AUC) of 0.940 in predicting 5 selected pathologies from the validation set. To the best of our knowledge, this is the highest AUC score reported to date. More importantly, the proposed method was also evaluated on an independent test set of the CheXpert competition, containing 500 CXR studies annotated by a panel of 5 experienced radiologists. The reported performance was on average better than 2.6 out of 3 other individual radiologists, with a mean AUC of 0.930, which led to the current state-of-the-art performance on the CheXpert test set. Comment: MIDL 2020 Accepted Short Paper. arXiv admin note: substantial text overlap with arXiv:1911.0647

    Sparse Approximation, List Decoding, and Uncertainty Principles

    We consider list versions of sparse approximation problems, where unlike the existing results in sparse approximation that consider situations with unique solutions, we are interested in multiple solutions. We introduce these problems and present the first combinatorial results on the output list size. These generalize and enhance some of the existing results on threshold phenomenon and uncertainty principles in sparse approximations. Our definitions and results are inspired by similar results in list decoding. We also present lower bound examples that bolster our results and show they are of the appropriate size

    Analyzing Nonblocking Switching Networks using Linear Programming (Duality)

    The main task in analyzing a switching network design (including circuit-, multirate-, and photonic-switching) is to determine the minimum number of some switching components so that the design is non-blocking in some sense (e.g., strict- or wide-sense). We show that, in many cases, this task can be accomplished with a simple two-step strategy: (1) formulate a linear program whose optimum value is a bound for the minimum number we are seeking, and (2) specify a solution to the dual program, whose objective value by weak duality immediately yields a sufficient condition for the design to be non-blocking. We illustrate this technique through a variety of examples, ranging from circuit to multirate to photonic switching, from unicast to f-cast and multicast, and from strict- to wide-sense non-blocking. The switching architectures in the examples are of Clos-type and Banyan-type, which are the two most popular architectural choices for designing non-blocking switching networks. To prove the result in the multirate Clos network case, we formulate a new problem called {\sc dynamic weighted edge coloring} which generalizes the {\sc dynamic bin packing} problem. We then design an algorithm with competitive ratio 5.6355 for the problem. The algorithm is analyzed using the linear programming technique. A new upper bound for multirate wide-sense non-blocking Clos networks follows, improving upon a decade-old bound on the same problem

    How to Scale Up the Spectral Efficiency of Multi-way Massive MIMO Relaying?

    This paper considers a decode-and-forward (DF) multi-way massive multiple-input multiple-output (MIMO) relay system where many users exchange their data with the aid of a relay station equipped with a massive antenna array. We propose a new transmission protocol which leverages successive cancelation decoding and zero-forcing (ZF) at the users. By using properties of massive MIMO, a tight analytical approximation of the spectral efficiency is derived. We show that our proposed scheme uses only half of the time-slots required in the conventional scheme (in which the number of time-slots is equal to the number of users [1]) to exchange data across different users. As a result, the sum spectral efficiency of our proposed scheme is nearly double that of the conventional scheme, thereby boosting the performance of multi-way massive MIMO to unprecedented levels

    On Optimality Conditions for Auto-Encoder Signal Recovery

    Auto-Encoders are unsupervised models that aim to learn patterns from observed data by minimizing a reconstruction cost. The useful representations learned are often found to be sparse and distributed. On the other hand, compressed sensing and sparse coding assume a data generating process, where the observed data is generated from some true latent signal source, and try to recover the corresponding signal from measurements. Looking at auto-encoders from this \textit{signal recovery perspective} enables us to have a more coherent view of these techniques. In this paper, in particular, we show that the \textit{true} hidden representation can be approximately recovered if the weight matrices are highly incoherent with unit $\ell^{2}$ row length and the bias vectors take a value (approximately) equal to the negative of the data mean. The recovery also becomes more and more accurate as the sparsity in hidden signals increases. Additionally, we empirically demonstrate that auto-encoders are capable of recovering the data generating dictionary when only data samples are given

    AC/DC: In-Database Learning Thunderstruck

    We report on the design and implementation of the AC/DC gradient descent solver for a class of optimization problems over normalized databases. AC/DC decomposes an optimization problem into a set of aggregates over the join of the database relations. It then uses the answers to these aggregates to iteratively improve the solution to the problem until it converges. The challenges faced by AC/DC are the large database size, the mixture of continuous and categorical features, and the large number of aggregates to compute. AC/DC addresses these challenges by employing a sparse data representation, factorized computation, problem reparameterization under functional dependencies, and a data structure that supports shared computation of aggregates. To train polynomial regression models and factorization machines of up to 154K features over the natural join of all relations from a real-world dataset of up to 86M tuples, AC/DC needs up to 30 minutes on one core of a commodity machine. This is up to three orders of magnitude faster than its competitors R, MadLib, libFM, and TensorFlow whenever they finish and thus do not exceed memory limitation, 24-hour timeout, or internal design limitations. Comment: 10 pages, 3 figures
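
The idea of computing aggregates once and then iterating on their answers can be sketched for plain least-squares regression, where the gradient depends on the data only through the sums of x_i * x_j and x_i * y. This is a minimal sketch under stated assumptions, not the AC/DC implementation: it scans already materialized rows rather than computing factorized aggregates over a join, and the function names, step size, and toy data are hypothetical.

```python
def precompute_aggregates(rows):
    """One pass over the data: sigma[i][j] = sum x_i * x_j, c[i] = sum x_i * y."""
    d = len(rows[0][0])
    sigma = [[0.0] * d for _ in range(d)]
    c = [0.0] * d
    for x, y in rows:
        for i in range(d):
            c[i] += x[i] * y
            for j in range(d):
                sigma[i][j] += x[i] * x[j]
    return sigma, c, len(rows)


def gradient_descent(sigma, c, n, step=0.1, iters=2000):
    """Minimize mean squared error using only the aggregates, never the rows."""
    d = len(c)
    theta = [0.0] * d
    for _ in range(iters):
        # Gradient of (1 / 2n) * sum (theta . x - y)^2 is (sigma @ theta - c) / n.
        grad = [
            (sum(sigma[i][j] * theta[j] for j in range(d)) - c[i]) / n
            for i in range(d)
        ]
        theta = [t - step * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical toy data: y = 2 + 3 * x1, with a constant feature for the intercept.
rows = [((1.0, float(x1)), 2.0 + 3.0 * x1) for x1 in range(4)]
sigma, c, n = precompute_aggregates(rows)
theta = gradient_descent(sigma, c, n)
```

Each iteration costs time proportional to the square of the number of features, independent of the number of rows, which is why the cost of computing the aggregates dominates in this style of solver.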