What a Nerd! Beating Students and Vector Cosine in the ESL and TOEFL Datasets
In this paper, we claim that Vector Cosine, which is generally considered one
of the most efficient unsupervised measures for identifying word similarity in
Vector Space Models, can be outperformed by a completely unsupervised measure
that evaluates the extent of the intersection among the most associated
contexts of two target words, weighting such intersection according to the rank
of the shared contexts in the dependency ranked lists. This claim comes from
the hypothesis that similar words do not simply occur in similar contexts, but
also share a larger portion of their most relevant contexts than other
related words. To prove it, we describe and evaluate APSyn, a variant of
Average Precision that, independently of the adopted parameters, outperforms
the Vector Cosine and the co-occurrence on the ESL and TOEFL test sets. In the
best setting, APSyn reaches 0.73 accuracy on the ESL dataset and 0.70 accuracy
on the TOEFL dataset, thereby beating the non-English US college applicants
(whose average, as reported in the literature, is 64.50%) and several
state-of-the-art approaches. Comment: in LREC 201
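The rank-weighted intersection described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that each shared context is weighted by the inverse of its average rank in the two words' top-N lists, and the function names (`top_contexts`, `apsyn`, `cosine`, `pick_synonym`) and the toy association scores are hypothetical.

```python
from math import sqrt

def top_contexts(vector, n):
    """Map each of the n highest-scored contexts to its rank (1 = strongest)."""
    ranked = sorted(vector.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return {ctx: rank for rank, (ctx, _) in enumerate(ranked, start=1)}

def apsyn(vec_a, vec_b, n=1000):
    """Sum, over contexts shared by both top-n lists, of 1 / (average rank).

    Assumption: inverse-average-rank weighting; the abstract only says the
    intersection is weighted by the rank of the shared contexts.
    """
    ranks_a = top_contexts(vec_a, n)
    ranks_b = top_contexts(vec_b, n)
    shared = ranks_a.keys() & ranks_b.keys()
    return sum(1.0 / ((ranks_a[c] + ranks_b[c]) / 2.0) for c in shared)

def cosine(vec_a, vec_b):
    """Plain vector cosine over sparse context vectors, for comparison."""
    dot = sum(w * vec_b.get(c, 0.0) for c, w in vec_a.items())
    norm_a = sqrt(sum(w * w for w in vec_a.values()))
    norm_b = sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def pick_synonym(target, candidates, vectors, n=1000):
    """Answer a multiple-choice synonym question by taking the best APSyn score."""
    return max(candidates, key=lambda c: apsyn(vectors[target], vectors[c], n))

# Toy context vectors (context -> association score); purely illustrative.
vectors = {
    "car": {"drive": 5.0, "road": 4.0, "wheel": 3.0, "eat": 0.1},
    "automobile": {"drive": 4.5, "wheel": 4.0, "road": 2.0},
    "banana": {"eat": 6.0, "peel": 4.0, "road": 0.2},
}
score = apsyn(vectors["car"], vectors["automobile"], n=3)
choice = pick_synonym("car", ["automobile", "banana"], vectors, n=3)
```

With N = 3, "car" and "automobile" share all three top contexts, while "car" and "banana" share only one low-ranked context, so the sketch selects "automobile".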
Principal Component Analysis and Higher Correlations for Distributed Data
We consider algorithmic problems in the setting in which the input data has
been partitioned arbitrarily on many servers. The goal is to compute a function
of all the data, and the bottleneck is the communication used by the algorithm.
We present algorithms for two illustrative problems on massive data sets: (1)
computing a low-rank approximation of a matrix $A = A^1 + A^2 + \ldots + A^s$,
with matrix $A^t$ stored on server $t$ and (2) computing a function of a vector
$a_1 + a_2 + \ldots + a_s$, where server $t$ has the vector $a_t$; this
includes the well-studied special case of computing frequency moments and
separable functions, as well as higher-order correlations such as the number of
subgraphs of a specified type occurring in a graph. For both problems we give
algorithms with nearly optimal communication, and in particular the only
dependence on $n$, the size of the data, is in the number of bits needed to
represent indices and words ($O(\log n)$). Comment: rewritten with focus on two main results (distributed PCA,
higher-order moments and correlations) in the arbitrary partition mode
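To make the arbitrary-partition setting concrete, the sketch below estimates the second frequency moment of a vector that exists only as a sum of per-server vectors. It is not the algorithm from the paper: it uses a standard random-sign linear sketch and relies only on linearity, so the coordinator can add the servers' sketches instead of their full vectors. The helper names, the shared seed, and the toy data are assumptions made for illustration.

```python
import random

def sign_matrix(rows, cols, seed):
    """Random +/-1 matrix; every server derives the same one from a shared seed."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(cols)] for _ in range(rows)]

def sketch(signs, vector):
    """Linear sketch: one signed sum of the vector's entries per row."""
    return [sum(s * x for s, x in zip(row, vector)) for row in signs]

def combine(sketches):
    """Coordinator step: add the servers' sketches coordinate-wise."""
    return [sum(column) for column in zip(*sketches)]

def estimate_f2(combined_sketch):
    """Average of squared sketch entries estimates the sum of squared entries."""
    return sum(v * v for v in combined_sketch) / len(combined_sketch)

# Each server holds an arbitrary share; only the sum of the shares is meaningful.
local_vectors = [[3, 0, -1, 2], [1, 4, 1, -2], [-2, 1, 0, 5]]
signs = sign_matrix(rows=8, cols=4, seed=0)

combined = combine([sketch(signs, v) for v in local_vectors])
total = [sum(column) for column in zip(*local_vectors)]
exact_f2 = sum(x * x for x in total)
approx_f2 = estimate_f2(combined)
```

Because the sketch is linear, `combined` equals the sketch of the summed vector exactly. Only the quality of `approx_f2` depends on the number of rows. On real data the number of rows would be much smaller than the vector length, which is where the communication saving comes from.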
Retrieval with gene queries
BACKGROUND: Accuracy of document retrieval from MEDLINE for gene queries is crucially important for many applications in bioinformatics. We explore five information retrieval-based methods to rank documents retrieved by PubMed gene queries for the human genome. The aim is to rank relevant documents higher in the retrieved list. We address the special challenges faced due to ambiguity in gene nomenclature: gene terms that refer to multiple genes, gene terms that are also English words, and gene terms that have other biological meanings.
RESULTS: Our two baseline ranking strategies are quite similar in performance. Two of our three LocusLink-based strategies offer significant improvements. These methods work very well even when there is ambiguity in the gene terms. Our best ranking strategy offers significant improvements on three different kinds of ambiguities over our two baseline strategies (improvements range from 15.9% to 17.7% and 11.7% to 13.3% depending on the baseline). For most genes the best ranking query is one that is built from the LocusLink (now Entrez Gene) summary and product information along with the gene names and aliases. For others, the gene names and aliases suffice. We also present an approach that successfully predicts, for a given gene, which of these two ranking queries is more appropriate.
CONCLUSION: We explore the effect of different post-retrieval strategies on the ranking of documents returned by PubMed for human gene queries. We have successfully applied some of these strategies to improve the ranking of relevant documents in the retrieved sets. This holds true even when various kinds of ambiguity are encountered. We feel that it would be very useful to apply strategies like ours to PubMed search results as these are not ordered by relevance in any way. This is especially so for queries that retrieve a large number of documents.
Borel Ranks and Wadge Degrees of Context Free Omega Languages
We show that, from a topological point of view, considering the Borel and the
Wadge hierarchies, 1-counter Büchi automata have the same accepting power
as Turing machines equipped with a Büchi acceptance condition. In
particular, for every non-null recursive ordinal alpha, there exist some
Sigma^0_alpha-complete and some Pi^0_alpha-complete omega context-free
languages accepted by 1-counter Büchi automata, and the supremum of the set
of Borel ranks of context-free omega languages is the ordinal gamma^1_2, which
is strictly greater than the first non-recursive ordinal. This very surprising
result answers questions of H. Lescow and W. Thomas [Logical
Specifications of Infinite Computations, In: "A Decade of Concurrency", LNCS
803, Springer, 1994, p. 583-621].
On Nonnegative Integer Matrices and Short Killing Words
Let $n$ be a natural number and $\mathcal{M}$ a set of $n \times n$-matrices
over the nonnegative integers such that the joint spectral radius of
$\mathcal{M}$ is at most one. We show that if the zero matrix $0$ is a product
of matrices in $\mathcal{M}$, then there are $M_1, \ldots, M_{n^5} \in \mathcal{M}$ with $M_1 \cdots M_{n^5} = 0$. This result has applications in
automata theory and the theory of codes. Specifically, if $X \subset \Sigma^*$
is a finite incomplete code, then there exists a word $w \in \Sigma^*$ of
length polynomial in $\sum_{x \in X} |x|$ such that $w$ is not a factor of any
word in $X^*$. This proves a weak version of Restivo's conjecture. Comment: This version is a journal submission based on a STACS'19 paper. It
extends the conference version as follows. (1) The main result has been
generalized to apply to monoids generated by finite sets whose joint spectral
radius is at most 1. (2) The use of Carpi's theorem is avoided to make the
paper more self-contained. (3) A more precise result is offered on Restivo's
conjecture for finite code
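The notion of a product of matrices that equals the zero matrix (a "killing word" over the matrix set) can be illustrated with a brute-force search. The sketch below is not the paper's construction and says nothing about the length bound it proves. It is a breadth-first search over distinct products, with a hypothetical `max_length` cap, applied to a toy pair of 2x2 matrices.

```python
from collections import deque

def matmul(a, b):
    """Multiply two square matrices stored as tuples of tuples."""
    n = len(a)
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n))
        for i in range(n)
    )

def shortest_zero_product(matrices, max_length=10):
    """Return the index sequence of a shortest product equal to the zero
    matrix, or None if no such product has length <= max_length."""
    mats = [tuple(map(tuple, m)) for m in matrices]
    n = len(mats[0])
    zero = tuple(tuple(0 for _ in range(n)) for _ in range(n))
    queue = deque((m, (i,)) for i, m in enumerate(mats))
    seen = set(mats)
    while queue:
        product, word = queue.popleft()
        if product == zero:
            return word
        if len(word) >= max_length:
            continue
        for i, m in enumerate(mats):
            nxt = matmul(product, m)
            if nxt not in seen:  # each distinct product is expanded once
                seen.add(nxt)
                queue.append((nxt, word + (i,)))
    return None

# Toy example: A is nilpotent (A * A = 0), B is a projection.
A = [[0, 1], [0, 0]]
B = [[1, 0], [0, 0]]
word = shortest_zero_product([A, B])
```

Here `word` is a sequence of indices into the input list. For this pair the search stops at length 2, since multiplying A by itself already gives the zero matrix.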
Multi-copy and stochastic transformation of multipartite pure states
Characterizing the transformation and classification of multipartite
entangled states is a basic problem in quantum information. We study the
problem under the two most common environments, local operations and classical
communications (LOCC) and stochastic LOCC (SLOCC), and two more general environments,
multi-copy LOCC (MCLOCC) and multi-copy SLOCC (MCSLOCC). We show that two
multipartite states that are transformable under LOCC or SLOCC are also transformable
under MCLOCC and MCSLOCC. Moreover, these two environments are equivalent in
the sense that two states transformable under MCLOCC are also transformable
under MCSLOCC, and vice versa. Based on these environments we classify the
multipartite pure states into a few inequivalent sets and orbits, between which
we build the partial order to decide their transformation. In particular, we
investigate the structure of SLOCC-equivalent states in terms of tensor rank,
which is known as the generalized Schmidt rank. Given the tensor rank, we show
that GHZ states can be used to generate all states with a smaller or equal
tensor rank under SLOCC, and all reduced separable states with a cardinality
smaller than or equal to the tensor rank under LOCC. Using these concepts, we
extend the concept of "maximally entangled state" to the multipartite
system. Comment: 8 pages, 1 figure, revised version according to colleagues' comment