6 research outputs found

    Finding Subcube Heavy Hitters in Analytics Data Streams

    Data streams typically have items with a large number of dimensions. We study the fundamental heavy-hitters problem in this setting. Formally, the data stream consists of $d$-dimensional items $x_1,\ldots,x_m \in [n]^d$. A $k$-dimensional subcube $T$ is a set of $k$ distinct coordinates $\{T_1,\ldots,T_k\} \subseteq [d]$. A subcube heavy hitter query ${\rm Query}(T,v)$, $v \in [n]^k$, outputs YES if $f_T(v) \geq \gamma$ and NO if $f_T(v) < \gamma/4$, where $f_T(v)$ is the fraction of stream items whose coordinates $T$ take the joint values $v$. The all-subcube-heavy-hitters query ${\rm AllQuery}(T)$ outputs all joint values $v$ that return YES to ${\rm Query}(T,v)$. The one-dimensional version of this problem, $d=1$, has been heavily studied in data stream theory, databases, networking, and signal processing, and the subcube heavy hitters problem is applicable in all of these settings. We present a simple one-pass streaming algorithm based on reservoir sampling that solves the subcube heavy hitters problem in $\tilde{O}(kd/\gamma)$ space, which is optimal up to poly-logarithmic factors given the established lower bound. In the worst case this is $\Theta(d^2/\gamma)$, which is prohibitive for large $d$; our goal is to circumvent this quadratic bottleneck. Our main contribution is a model-based approach to the subcube heavy hitters problem. In particular, we assume that the dimensions are related to each other via the Naive Bayes model, with or without a latent dimension. Under this assumption, we present a new two-pass, $\tilde{O}(d/\gamma)$-space algorithm for our problem, and a fast algorithm for answering ${\rm AllQuery}(T)$ in $O(k/\gamma^2)$ time. Our work develops the direction of model-based data stream analysis, with much that remains to be explored.
    Comment: To appear in WWW 2018.
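    To make the sampling idea concrete, here is a minimal Python sketch of the one-pass reservoir approach: keep a uniform sample of the stream and answer ${\rm Query}(T,v)$ from the empirical fraction of sampled items matching $v$ on coordinates $T$. The class name, reservoir size, and the $\gamma/2$ decision threshold are illustrative assumptions, not the paper's exact construction or bounds.

```python
import random

class SubcubeHeavyHitters:
    """Sketch of a reservoir-sampling estimator for subcube heavy hitters.

    The reservoir size and decision threshold are illustrative; the paper's
    algorithm achieves O~(kd/gamma) space with a specific sample size.
    """

    def __init__(self, reservoir_size):
        self.capacity = reservoir_size
        self.reservoir = []
        self.seen = 0

    def insert(self, item):
        # Classic reservoir sampling: the i-th item survives with
        # probability capacity / i, so the sample stays uniform.
        self.seen += 1
        if len(self.reservoir) < self.capacity:
            self.reservoir.append(item)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.reservoir[j] = item

    def estimate(self, T, v):
        # Empirical estimate of f_T(v): fraction of sampled items whose
        # coordinates T jointly equal v.
        hits = sum(1 for x in self.reservoir
                   if all(x[t] == val for t, val in zip(T, v)))
        return hits / max(1, len(self.reservoir))

    def query(self, T, v, gamma):
        # Thresholding at gamma/2 separates f_T(v) >= gamma from
        # f_T(v) < gamma/4 with high probability for a large enough sample.
        return self.estimate(T, v) >= gamma / 2

# Toy usage: 4-dimensional items; coordinates 2 and 3 are constant, so
# (T, v) = ((2, 3), (1, 2)) is a heavy hitter for any gamma <= 1.
rng = random.Random(0)
stream = [(rng.randrange(4), rng.randrange(4), 1, 2) for _ in range(10000)]
hh = SubcubeHeavyHitters(reservoir_size=2000)
for x in stream:
    hh.insert(x)
print(hh.query(T=(2, 3), v=(1, 2), gamma=0.5))  # True
```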

    Graphically Structured Diffusion Models

    We introduce a framework for automatically defining and learning deep generative models with problem-specific structure. We tackle problem domains that are more traditionally solved by algorithms, such as sorting, constraint satisfaction for Sudoku, and matrix factorization. Concretely, we train diffusion models with an architecture tailored to the problem specification. This specification contains a graphical model describing the relationships between variables, and often benefits from explicit representations of subcomputations; permutation invariances can also be exploited. Across a diverse set of experiments, we improve the scaling relationship between problem dimension and our model's performance, in terms of both training time and final accuracy. Our code can be found at https://github.com/plai-group/gsdm
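    As a rough illustration of what "an architecture tailored to the problem specification" can mean, the Python sketch below turns a graphical model's edge list into an attention mask, so each variable attends only to itself and its neighbours. The helper name and masking scheme are assumptions for exposition, not the actual GSDM architecture; see the linked repository for the real implementation.

```python
import torch

def attention_mask_from_graph(num_vars, edges):
    # Boolean mask derived from the problem's graphical model: variable i
    # may attend to variable j only if i == j or (i, j) is an edge.
    # Illustrative helper, not the actual GSDM code.
    mask = torch.eye(num_vars, dtype=torch.bool)
    for i, j in edges:
        mask[i, j] = True
        mask[j, i] = True
    return mask

# Example: a chain-structured specification a - b - c.
mask = attention_mask_from_graph(3, edges=[(0, 1), (1, 2)])
scores = torch.randn(3, 3)                          # raw attention logits
scores = scores.masked_fill(~mask, float("-inf"))   # block non-neighbours
weights = torch.softmax(scores, dim=-1)             # structure-respecting attention
```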

    Holographic Generative Memory: Neurally Inspired One-Shot Learning with Memory Augmented Neural Networks

    Humans quickly parse and categorize stimuli by combining perceptual information with previously learned knowledge. We are capable of learning new information quickly from only a few observations, sometimes even a single one. This one-shot learning (OSL) capability is still very difficult to realize in machine learning models. Novelty is commonly thought to be the primary driver for OSL. However, the neuroscience literature shows that biological OSL mechanisms are guided by uncertainty rather than novelty, motivating us to explore this idea for machine learning. In this work, we investigate OSL for neural networks using more robust compositional knowledge representations and a biologically inspired uncertainty mechanism to modulate the rate of learning. We introduce several new neural network models that combine Holographic Reduced Representations (HRRs) and Variational Autoencoders. Extending these new models culminates in the Holographic Generative Memory (HGMEM) model. HGMEM is a novel unsupervised memory-augmented neural network. It offers solutions to many of the practical drawbacks associated with HRRs while also providing storage, recall, and generation of latent compositional knowledge representations. Uncertainty is measured as a native part of HGMEM operation by applying trained probabilistic dropout to fully-connected layers. During training, the learning rate is modulated using these uncertainty measurements in a manner inspired by our motivating neuroscience mechanism for OSL. Model performance is demonstrated on several image datasets with experiments that reflect our theoretical approach.
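    The HRR primitive these models build on is circular convolution: binding a role vector to a filler vector via an FFT-based convolution, and approximately unbinding by convolving with the role's involution. The NumPy sketch below shows this standard HRR operation on random vectors; it is the building block only, not the HGMEM model itself.

```python
import numpy as np

def bind(a, b):
    # HRR binding: circular convolution, computed in O(n log n) via FFT.
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(trace, role):
    # Approximate unbinding: bind with the involution of the role,
    # role*[i] = role[-i mod n], which approximately inverts the binding.
    role_inv = np.concatenate(([role[0]], role[:0:-1]))
    return bind(trace, role_inv)

rng = np.random.default_rng(0)
n = 1024
role, filler = rng.normal(0.0, 1.0 / np.sqrt(n), (2, n))
trace = bind(role, filler)
recovered = unbind(trace, role)
# Cosine similarity is well above chance (around 0.7 for this n), so the
# filler can be recovered from a stored trace via a clean-up memory.
cos = recovered @ filler / (np.linalg.norm(recovered) * np.linalg.norm(filler))
print(round(float(cos), 3))
```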

    A Survey of Machine Learning for Big Code and Naturalness

    Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit code's abundance of patterns. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges and opportunities.
    Comment: A website accompanying this survey can be found at https://ml4code.github.io
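    As a toy instance of the "naturalness" premise behind many of the surveyed models, the Python sketch below fits a Laplace-smoothed bigram language model over source-code tokens. The whitespace tokenization and helper names are simplifications for illustration, not a specific model from the survey's taxonomy.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    # Count token bigrams: the simplest probabilistic model of the
    # statistical regularities ("naturalness") found in source code.
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    return counts

def prob(counts, prev, cur, vocab_size, alpha=1.0):
    # Laplace-smoothed estimate of P(cur | prev).
    c = counts[prev]
    return (c[cur] + alpha) / (sum(c.values()) + alpha * vocab_size)

tokens = "for i in range ( n ) : total += i".split()
model = train_bigram(tokens)
print(prob(model, "in", "range", vocab_size=len(set(tokens))))
```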