    What a Nerd! Beating Students and Vector Cosine in the ESL and TOEFL Datasets

    In this paper, we claim that Vector Cosine, which is generally considered one of the most efficient unsupervised measures for identifying word similarity in Vector Space Models, can be outperformed by a completely unsupervised measure that evaluates the extent of the intersection among the most associated contexts of two target words, weighting that intersection according to the rank of the shared contexts in the dependency ranked lists. This claim comes from the hypothesis that similar words do not simply occur in similar contexts, but share a larger portion of their most relevant contexts than other related words do. To prove it, we describe and evaluate APSyn, a variant of Average Precision that, independently of the adopted parameters, outperforms Vector Cosine and co-occurrence on the ESL and TOEFL test sets. In the best setting, APSyn reaches 0.73 accuracy on the ESL dataset and 0.70 accuracy on the TOEFL dataset, thereby beating the non-English US college applicants (whose average, as reported in the literature, is 64.50%) and several state-of-the-art approaches. Comment: in LREC 201
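
The rank-weighted intersection described in this abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so the weighting used here (each shared context contributes the inverse of its average rank in the two lists) is an assumption, and the function name `apsyn`, the `top_n` parameter, and the toy context lists are hypothetical.

```python
def apsyn(ranked_a, ranked_b, top_n=1000):
    """Rank-weighted overlap of the top_n contexts of two target words.

    ranked_a, ranked_b: lists of contexts, most associated first,
    assumed to contain no duplicates.
    The inverse-average-rank weighting is an assumption of this sketch.
    """
    # Map each context to its 1-based rank within the truncated list.
    rank_a = {ctx: i for i, ctx in enumerate(ranked_a[:top_n], start=1)}
    rank_b = {ctx: i for i, ctx in enumerate(ranked_b[:top_n], start=1)}

    # Only contexts present in both top_n lists contribute to the score.
    shared = rank_a.keys() & rank_b.keys()

    # 1 / average rank == 2 / (rank_a + rank_b); highly ranked shared
    # contexts therefore weigh more than low-ranked ones.
    return sum(2.0 / (rank_a[c] + rank_b[c]) for c in shared)


# Toy, made-up context lists for two words.
dog = ["bark", "tail", "leash", "bone"]
cat = ["tail", "bark", "purr", "bone"]
score = apsyn(dog, cat)
```

Under this weighting, two words that share their top-ranked contexts score higher than two words that share the same number of contexts further down their lists, which is the behaviour the hypothesis in the abstract calls for.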

    Principal Component Analysis and Higher Correlations for Distributed Data

    We consider algorithmic problems in the setting in which the input data has been partitioned arbitrarily on many servers. The goal is to compute a function of all the data, and the bottleneck is the communication used by the algorithm. We present algorithms for two illustrative problems on massive data sets: (1) computing a low-rank approximation of a matrix $A = A^1 + A^2 + \ldots + A^s$, with matrix $A^t$ stored on server $t$ and (2) computing a function of a vector $a_1 + a_2 + \ldots + a_s$, where server $t$ has the vector $a_t$; this includes the well-studied special case of computing frequency moments and separable functions, as well as higher-order correlations such as the number of subgraphs of a specified type occurring in a graph. For both problems we give algorithms with nearly optimal communication, and in particular the only dependence on $n$, the size of the data, is in the number of bits needed to represent indices and words ($O(\log n)$). Comment: rewritten with focus on two main results (distributed PCA, higher-order moments and correlations) in the arbitrary partition mode

    Retrieval with gene queries

    BACKGROUND: Accuracy of document retrieval from MEDLINE for gene queries is crucially important for many applications in bioinformatics. We explore five information retrieval-based methods to rank documents retrieved by PubMed gene queries for the human genome. The aim is to rank relevant documents higher in the retrieved list. We address the special challenges faced due to ambiguity in gene nomenclature: gene terms that refer to multiple genes, gene terms that are also English words, and gene terms that have other biological meanings. RESULTS: Our two baseline ranking strategies are quite similar in performance. Two of our three LocusLink-based strategies offer significant improvements. These methods work very well even when there is ambiguity in the gene terms. Our best ranking strategy offers significant improvements on three different kinds of ambiguity over our two baseline strategies (improvements range from 15.9% to 17.7% and 11.7% to 13.3% depending on the baseline). For most genes the best ranking query is one that is built from the LocusLink (now Entrez Gene) summary and product information along with the gene names and aliases. For others, the gene names and aliases suffice. We also present an approach that successfully predicts, for a given gene, which of these two ranking queries is more appropriate. CONCLUSION: We explore the effect of different post-retrieval strategies on the ranking of documents returned by PubMed for human gene queries. We have successfully applied some of these strategies to improve the ranking of relevant documents in the retrieved sets. This holds true even when various kinds of ambiguity are encountered. We feel that it would be very useful to apply strategies like ours to PubMed search results, as these are not ordered by relevance in any way. This is especially so for queries that retrieve a large number of documents.

    Borel Ranks and Wadge Degrees of Context Free Omega Languages

    We show that, from a topological point of view, considering the Borel and the Wadge hierarchies, 1-counter Büchi automata have the same accepting power as Turing machines equipped with a Büchi acceptance condition. In particular, for every non-null recursive ordinal $\alpha$, there exist some $\Sigma^0_\alpha$-complete and some $\Pi^0_\alpha$-complete omega context-free languages accepted by 1-counter Büchi automata, and the supremum of the set of Borel ranks of context-free omega languages is the ordinal $\gamma^1_2$, which is strictly greater than the first non-recursive ordinal. This very surprising result gives answers to questions of H. Lescow and W. Thomas [Logical Specifications of Infinite Computations, In: "A Decade of Concurrency", LNCS 803, Springer, 1994, p. 583-621].

    On Nonnegative Integer Matrices and Short Killing Words

    Let $n$ be a natural number and $\mathcal{M}$ a set of $n \times n$ matrices over the nonnegative integers such that the joint spectral radius of $\mathcal{M}$ is at most one. We show that if the zero matrix $0$ is a product of matrices in $\mathcal{M}$, then there are $M_1, \ldots, M_{n^5} \in \mathcal{M}$ with $M_1 \cdots M_{n^5} = 0$. This result has applications in automata theory and the theory of codes. Specifically, if $X \subset \Sigma^*$ is a finite incomplete code, then there exists a word $w \in \Sigma^*$ of length polynomial in $\sum_{x \in X} |x|$ such that $w$ is not a factor of any word in $X^*$. This proves a weak version of Restivo's conjecture. Comment: This version is a journal submission based on a STACS'19 paper. It extends the conference version as follows. (1) The main result has been generalized to apply to monoids generated by finite sets whose joint spectral radius is at most 1. (2) The use of Carpi's theorem is avoided to make the paper more self-contained. (3) A more precise result is offered on Restivo's conjecture for finite code
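
To make the notion of a zero product in this abstract concrete, here is a small brute-force sketch that searches for a shortest product of matrices equal to the zero matrix. It is an illustration only and not the paper's argument. It relies on the fact that, for nonnegative entries, whether a product is zero depends only on which entries are nonzero, so the search runs over 0/1 support patterns. The function names and the example matrices are hypothetical, and the joint spectral radius hypothesis is not checked.

```python
from collections import deque


def support(m):
    """0/1 pattern of the nonzero entries of a nonnegative matrix."""
    return tuple(tuple(1 if x else 0 for x in row) for row in m)


def bool_mul(a, b):
    """Boolean matrix product of two square 0/1 patterns."""
    n = len(a)
    return tuple(
        tuple(1 if any(a[i][k] and b[k][j] for k in range(n)) else 0
              for j in range(n))
        for i in range(n)
    )


def shortest_zero_product(matrices, max_len):
    """Indices of a shortest product equal to zero, or None if there is
    none of length at most max_len. Breadth-first search over supports."""
    gens = [support(m) for m in matrices]
    n = len(gens[0])
    zero = tuple(tuple(0 for _ in range(n)) for _ in range(n))

    seen = {}          # support pattern -> shortest word reaching it
    queue = deque()
    for i, g in enumerate(gens):
        if g not in seen:
            seen[g] = [i]
            queue.append(g)

    while queue:
        cur = queue.popleft()
        word = seen[cur]
        if cur == zero:
            return word
        if len(word) >= max_len:
            continue
        for i, g in enumerate(gens):
            nxt = bool_mul(cur, g)
            if nxt not in seen:
                seen[nxt] = word + [i]
                queue.append(nxt)
    return None


# Made-up 2 x 2 example; the length cap n**5 mirrors the bound in the abstract.
A = [[0, 1], [0, 0]]
B = [[0, 0], [1, 0]]
word = shortest_zero_product([A, B], max_len=2 ** 5)
```

In this toy case the search returns the word `[0, 0]`, since the product of `A` with itself is already the zero matrix.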

    Multi-copy and stochastic transformation of multipartite pure states

    Characterizing the transformation and classification of multipartite entangled states is a basic problem in quantum information. We study the problem under the two most common environments, local operations and classical communications (LOCC) and stochastic LOCC (SLOCC), and under two more general environments, multi-copy LOCC (MCLOCC) and multi-copy SLOCC (MCSLOCC). We show that two multipartite states that are transformable under LOCC or SLOCC are also transformable under MCLOCC and MCSLOCC. Moreover, the latter two environments are equivalent in the sense that two states that are transformable under MCLOCC are also transformable under MCSLOCC, and vice versa. Based on these environments we classify the multipartite pure states into a few inequivalent sets and orbits, between which we build a partial order to decide their transformation. In particular, we investigate the structure of SLOCC-equivalent states in terms of tensor rank, which is known as the generalized Schmidt rank. Given the tensor rank, we show that GHZ states can be used to generate all states with a smaller or equal tensor rank under SLOCC, and all reduced separable states with a cardinality smaller than or equal to the tensor rank under LOCC. Using these concepts, we extend the concept of "maximally entangled state" to the multipartite system. Comment: 8 pages, 1 figure, revised version according to colleagues' comment