10,486 research outputs found

    Characterizing the impact of geometric properties of word embeddings on task performance

    Get PDF
    Analysis of word embedding properties to inform their use in downstream NLP tasks has largely been studied by assessing nearest neighbors. However, geometric properties of the continuous feature space contribute directly to the use of embedding features in downstream models, and are largely unexplored. We consider four properties of word embedding geometry, namely: position relative to the origin, distribution of features in the vector space, global pairwise distances, and local pairwise distances. We define a sequence of transformations to generate new embeddings that expose subsets of these properties to downstream models and evaluate change in task performance to understand the contribution of each property to NLP models. We transform publicly available pretrained embeddings from three popular toolkits (word2vec, GloVe, and FastText) and evaluate on a variety of intrinsic tasks, which model linguistic information in the vector space, and extrinsic tasks, which use vectors as input to machine learning models. We find that intrinsic evaluations are highly sensitive to absolute position, while extrinsic tasks rely primarily on local similarity. Our findings suggest that future embedding models and post-processing techniques should focus primarily on similarity to nearby points in vector space.Comment: Appearing in the Third Workshop on Evaluating Vector Space Representations for NLP (RepEval 2019). 7 pages + reference

    Defining Equitable Geographic Districts in Road Networks via Stable Matching

    Full text link
    We introduce a novel method for defining geographic districts in road networks using stable matching. In this approach, each geographic district is defined in terms of a center, which identifies a location of interest, such as a post office or polling place, and all other network vertices must be labeled with the center to which they are associated. We focus on defining geographic districts that are equitable, in that every district has the same number of vertices and the assignment is stable in terms of geographic distance. That is, there is no unassigned vertex-center pair such that both would prefer each other over their current assignments. We solve this problem using a version of the classic stable matching problem, called symmetric stable matching, in which the preferences of the elements in both sets obey a certain symmetry. In our case, we study a graph-based version of stable matching in which nodes are stably matched to a subset of nodes denoted as centers, prioritized by their shortest-path distances, so that each center is apportioned a certain number of nodes. We show that, for a planar graph or road network with nn nodes and kk centers, the problem can be solved in O(nnlogn)O(n\sqrt{n}\log n) time, which improves upon the O(nk)O(nk) runtime of using the classic Gale-Shapley stable matching algorithm when kk is large. Finally, we provide experimental results on road networks for these algorithms and a heuristic algorithm that performs better than the Gale-Shapley algorithm for any range of values of kk.Comment: 9 pages, 4 figures, to appear in 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL 2017) November 7-10, 2017, Redondo Beach, California, US

    Low-shot learning with large-scale diffusion

    Full text link
    This paper considers the problem of inferring image labels from images when only a few annotated examples are available at training time. This setup is often referred to as low-shot learning, where a standard approach is to re-train the last few layers of a convolutional neural network learned on separate classes for which training examples are abundant. We consider a semi-supervised setting based on a large collection of images to support label propagation. This is possible by leveraging the recent advances on large-scale similarity graph construction. We show that despite its conceptual simplicity, scaling label propagation up to hundred millions of images leads to state of the art accuracy in the low-shot learning regime

    The Branched Polymer Growth Model Revisited

    Full text link
    The Branched Polymer Growth Model (BPGM) has been employed to study the kinetic growth of ramified polymers in the presence of impurities. In this article, the BPGM is revisited on the square lattice and a subtle modification in its dynamics is proposed in order to adapt it to a scenario closer to reality and experimentation. This new version of the model is denominated the Adapted Branched Polymer Growth Model (ABPGM). It is shown that the ABPGM preserves the functionalities of the monomers and so recovers the branching probability b as an input parameter which effectively controls the relative incidence of bifurcations. The critical locus separating infinite from finite growth regimes of the ABPGM is obtained in the (b,c) space (where c is the impurity concentration). Unlike the original model, the phase diagram of the ABPGM exhibits a peculiar reentrance.Comment: 8 pages, 10 figures. To be published in PHYSICA

    Net and Prune: A Linear Time Algorithm for Euclidean Distance Problems

    Full text link
    We provide a general framework for getting expected linear time constant factor approximations (and in many cases FPTAS's) to several well known problems in Computational Geometry, such as kk-center clustering and farthest nearest neighbor. The new approach is robust to variations in the input problem, and yet it is simple, elegant and practical. In particular, many of these well studied problems which fit easily into our framework, either previously had no linear time approximation algorithm, or required rather involved algorithms and analysis. A short list of the problems we consider include farthest nearest neighbor, kk-center clustering, smallest disk enclosing kk points, kkth largest distance, kkth smallest mm-nearest neighbor distance, kkth heaviest edge in the MST and other spanning forest type problems, problems involving upward closed set systems, and more. Finally, we show how to extend our framework such that the linear running time bound holds with high probability

    Fast Construction of Nets in Low Dimensional Metrics, and Their Applications

    Full text link
    We present a near linear time algorithm for constructing hierarchical nets in finite metric spaces with constant doubling dimension. This data-structure is then applied to obtain improved algorithms for the following problems: Approximate nearest neighbor search, well-separated pair decomposition, compact representation scheme, doubling measure, and computation of the (approximate) Lipschitz constant of a function. In all cases, the running (preprocessing) time is near-linear and the space being used is linear.Comment: 41 pages. Extensive clean-up of minor English error
    corecore