
    APPROXIMATION ALGORITHMS FOR POINT PATTERN MATCHING AND SEARCHING

    Point pattern matching is a fundamental problem in computational geometry. Given a reference set and a pattern set, the problem is to find a geometric transformation applied to the pattern set that minimizes some given distance measure with respect to the reference set. This problem has been heavily researched under various distance measures and error models. Point set similarity searching is a variation of this problem in which a large database of point sets is given, and the task is to preprocess this database into a data structure so that, given a query point set, it is possible to rapidly find the nearest point set among the elements of the database. Here, the term nearest is understood in the above sense of pattern matching, where the elements of the database may be transformed to match the given query set. The approach presented here is to compute a low-distortion embedding of the pattern matching problem into an (ideally) low-dimensional metric space and then apply any standard algorithm for nearest neighbor searching over this metric space. The main focus of this dissertation is on two problems in the area of point pattern matching and searching algorithms: (i) improving the accuracy of alignment-based point pattern matching and (ii) computing low-distortion embeddings of point sets into vector spaces. For the first problem, new methods are presented for matching point sets based on alignments of small subsets of points. It is shown that these methods lead to better approximation bounds for alignment-based planar point pattern matching algorithms under the Hausdorff distance. Furthermore, it is shown that these approximation bounds are nearly the best achievable by alignment-based methods. For the second problem, results are presented for two different distance measures. First, point pattern similarity search under translation is considered for point sets in multidimensional integer space, where the distance function is the symmetric difference. A randomized embedding into real space under the L1 metric is given. The algorithm achieves an expected distortion of O(log^2 n). Second, an algorithm is given for embedding R^d under the Earth Mover's Distance (EMD) into multidimensional integer space under the symmetric difference distance. This embedding achieves a distortion of O(log D), where D is the diameter of the point set. Combining this with the above result implies that point pattern similarity search with translation under the EMD can be embedded into real space under the L1 metric with an expected distortion of O(log^2 n log D).
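
    As a concrete illustration of the alignment-based matching idea described above, the sketch below tries every translation that maps a single pattern point onto a reference point and keeps the one with the smallest Hausdorff distance. This single-point alignment is only the simplest instance of aligning small subsets of points; the function names and the brute-force search are illustrative assumptions, not the dissertation's algorithms.

    import numpy as np

    def hausdorff(A, B):
        """Symmetric Hausdorff distance between two point sets (n x d and m x d arrays)."""
        d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise Euclidean distances
        return max(d.min(axis=1).max(), d.min(axis=0).max())

    def align_and_match(pattern, reference):
        """Try every translation mapping one pattern point onto one reference point and
        keep the translation giving the smallest Hausdorff distance (single-point
        alignment, the simplest form of matching by aligning small subsets of points)."""
        best_t, best_d = None, np.inf
        for p in pattern:
            for r in reference:
                t = r - p                               # candidate translation
                d = hausdorff(pattern + t, reference)
                if d < best_d:
                    best_t, best_d = t, d
        return best_t, best_d

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        ref = rng.random((20, 2))
        pat = ref + np.array([0.3, -0.2])               # the reference shifted by a known offset
        t, d = align_and_match(pat, ref)
        print("recovered translation:", t, "Hausdorff distance:", d)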

    An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification

    End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, the internal representations learned by the network serve as a new semantic representation of words (or sentences) which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present work is two-fold. First, we systematically study the NMT context vectors, i.e. the output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as semantically related and semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining an F1 of 98.2% on data from a shared task when using only NMT context vectors. Using context vectors jointly with similarity measures, F1 reaches 98.9%. Comment: 11 pages, 4 figures
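
    A minimal sketch of the scoring step behind the second contribution, under the assumption that each sentence is represented by mean-pooling its encoder context vectors and pairs are compared with cosine similarity; the pooling choice and the 0.8 threshold are illustrative, not taken from the paper.

    import numpy as np

    def pool(context_vectors):
        """Collapse a sentence's sequence of encoder states (len x dim) into one vector by mean pooling."""
        return np.asarray(context_vectors).mean(axis=0)

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def is_parallel(src_ctx, tgt_ctx, threshold=0.8):
        """Flag a candidate source/target pair as parallel when the cosine similarity
        of their pooled context vectors exceeds a (hypothetical) threshold."""
        return cosine(pool(src_ctx), pool(tgt_ctx)) >= threshold

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        shared = rng.normal(size=512)                       # stand-in for shared sentence meaning
        src = shared + 0.05 * rng.normal(size=(12, 512))    # 12 source-side encoder states
        tgt = shared + 0.05 * rng.normal(size=(9, 512))     # 9 target-side encoder states
        print("parallel?", is_parallel(src, tgt))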

    Density Matching for Bilingual Word Embedding

    Recent approaches to cross-lingual word embedding have generally been based on linear transformations between the sets of embedding vectors in the two languages. In this paper, we propose an approach that instead expresses the two monolingual embedding spaces as probability densities defined by a Gaussian mixture model, and matches the two densities using a method called normalizing flow. The method requires no explicit supervision, and can be learned with only a seed dictionary of words that have identical strings. We argue that this formulation has several intuitively attractive properties, particularly with respect to improving robustness and generalization to mappings between difficult language pairs or word pairs. On a benchmark data set of bilingual lexicon induction and cross-lingual word similarity, our approach can achieve competitive or superior performance compared to state-of-the-art published results, with particularly strong results being found on etymologically distant and/or morphologically rich languages. Comment: Accepted by NAACL-HLT 2019
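
    The following toy sketch, assuming torch and scikit-learn, conveys the density-matching idea on synthetic data: the target embedding space is modelled as a diagonal-covariance Gaussian mixture, and a linear map W (a stand-in for the paper's normalizing flow) is trained so that mapped source vectors are likely under that density. The objective below is a crude unsupervised proxy for the paper's method, not a reproduction of it.

    import numpy as np
    import torch
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    src = rng.normal(size=(1000, 50)).astype(np.float32)             # toy "source" embeddings
    true_map = np.linalg.qr(rng.normal(size=(50, 50)))[0].astype(np.float32)
    tgt = src @ true_map.T + 0.01 * rng.normal(size=(1000, 50)).astype(np.float32)

    # 1) Model the target embedding space as a diagonal-covariance Gaussian mixture.
    gmm = GaussianMixture(n_components=5, covariance_type="diag", random_state=0).fit(tgt)
    mix = torch.distributions.Categorical(probs=torch.tensor(gmm.weights_, dtype=torch.float32))
    comp = torch.distributions.Independent(
        torch.distributions.Normal(
            torch.tensor(gmm.means_, dtype=torch.float32),
            torch.tensor(np.sqrt(gmm.covariances_), dtype=torch.float32)), 1)
    target_density = torch.distributions.MixtureSameFamily(mix, comp)

    # 2) Learn a linear map W so that mapped source vectors are likely under the
    #    target density (no seed dictionary is used in this toy objective).
    W = torch.nn.Parameter(torch.eye(50))
    opt = torch.optim.Adam([W], lr=1e-2)
    x = torch.tensor(src)
    for step in range(200):
        opt.zero_grad()
        loss = -target_density.log_prob(x @ W.T).mean()
        loss.backward()
        opt.step()
    print("final negative log-likelihood:", float(loss))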

    Characterizing the impact of geometric properties of word embeddings on task performance

    Analysis of word embedding properties to inform their use in downstream NLP tasks has largely been studied by assessing nearest neighbors. However, geometric properties of the continuous feature space contribute directly to the use of embedding features in downstream models, and are largely unexplored. We consider four properties of word embedding geometry, namely: position relative to the origin, distribution of features in the vector space, global pairwise distances, and local pairwise distances. We define a sequence of transformations to generate new embeddings that expose subsets of these properties to downstream models, and evaluate the change in task performance to understand the contribution of each property to NLP models. We transform publicly available pretrained embeddings from three popular toolkits (word2vec, GloVe, and FastText) and evaluate on a variety of intrinsic tasks, which model linguistic information in the vector space, and extrinsic tasks, which use vectors as input to machine learning models. We find that intrinsic evaluations are highly sensitive to absolute position, while extrinsic tasks rely primarily on local similarity. Our findings suggest that future embedding models and post-processing techniques should focus primarily on similarity to nearby points in vector space. Comment: Appearing in the Third Workshop on Evaluating Vector Space Representations for NLP (RepEval 2019). 7 pages + references
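
    To make the notion of exposing or hiding geometric properties concrete, the sketch below applies two illustrative transformations: mean-centering, which changes position relative to the origin while preserving all pairwise distances, and a random orthogonal rotation, which changes coordinates while preserving both norms and distances. The specific sequence of transformations in the paper differs; these are assumptions chosen for clarity.

    import numpy as np

    def center(emb):
        """Remove absolute position: translate the space so its mean is the origin.
        Global and local pairwise distances are unchanged."""
        return emb - emb.mean(axis=0, keepdims=True)

    def random_rotation(emb, seed=0):
        """Apply a random orthogonal rotation: absolute coordinates change, but all
        pairwise distances and the distribution of norms are preserved."""
        rng = np.random.default_rng(seed)
        q, _ = np.linalg.qr(rng.normal(size=(emb.shape[1], emb.shape[1])))
        return emb @ q

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        emb = rng.normal(loc=3.0, size=(100, 20))        # toy "pretrained" embeddings
        for name, e in [("original", emb), ("centered", center(emb)),
                        ("rotated", random_rotation(emb))]:
            dists = np.linalg.norm(e[:, None] - e[None, :], axis=-1)
            print(f"{name:9s} mean norm {np.linalg.norm(e, axis=1).mean():6.2f} "
                  f"mean pairwise dist {dists.mean():6.2f}")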