What a Nerd! Beating Students and Vector Cosine in the ESL and TOEFL Datasets

Author: Chiu Tin-Shing
Huang Chu-Ren
Lenci Alessandro
Lu Qin
Santus Enrico
Publication date: 01/01/2016

In this paper, we claim that Vector Cosine, which is generally considered one of the most efficient unsupervised measures for identifying word similarity in Vector Space Models, can be outperformed by a completely unsupervised measure that evaluates the extent of the intersection among the most associated contexts of two target words, weighting this intersection according to the rank of the shared contexts in the dependency ranked lists. This claim comes from the hypothesis that similar words do not simply occur in similar contexts, but share a larger portion of their most relevant contexts than other related words do. To prove it, we describe and evaluate APSyn, a variant of Average Precision that, independently of the adopted parameters, outperforms the Vector Cosine and the co-occurrence on the ESL and TOEFL test sets. In the best setting, APSyn reaches 0.73 accuracy on the ESL dataset and 0.70 accuracy on the TOEFL dataset, thereby beating the non-English US college applicants (whose average, as reported in the literature, is 64.50%) and several state-of-the-art approaches. Comment: in LREC 201
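The measure described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract only says that shared contexts are weighted by rank, so the specific weighting used here (the inverse of the average rank of each shared context) is an assumption, as are the function names `top_contexts` and `apsyn`, the cutoff `n`, and the toy association scores.

```python
def top_contexts(weights, n):
    """Map each of the n highest-weighted contexts to its rank (1 = strongest)."""
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(weights, key=lambda c: (-weights[c], c))[:n]
    return {context: rank for rank, context in enumerate(ranked, start=1)}


def apsyn(weights_a, weights_b, n=1000):
    """Score two words by the rank-weighted overlap of their top-n contexts."""
    ranks_a = top_contexts(weights_a, n)
    ranks_b = top_contexts(weights_b, n)
    shared = ranks_a.keys() & ranks_b.keys()
    # Assumed weighting: each shared context contributes 1 / (average rank).
    return sum(2.0 / (ranks_a[c] + ranks_b[c]) for c in shared)


# Hypothetical context-association scores (not taken from the paper).
cat = {"purr": 5.0, "pet": 4.0, "fur": 3.0, "bark": 0.5}
dog = {"bark": 5.0, "pet": 4.0, "fur": 3.0, "purr": 0.2}
car = {"engine": 5.0, "wheel": 4.0, "pet": 0.1}

print(apsyn(cat, dog, n=3))  # two shared top contexts: "pet" and "fur"
print(apsyn(cat, car, n=3))  # one shared top context: "pet"
```

With these toy scores, "cat" and "dog" share more highly ranked contexts than "cat" and "car", so the first score is the larger of the two.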

arXiv.org e-Print Archive

Principal Component Analysis and Higher Correlations for Distributed Data

Author: Kannan Ravindran
Vempala Santosh
Woodruff David
Publication date: 29/06/2014

We consider algorithmic problems in the setting in which the input data has been partitioned arbitrarily on many servers. The goal is to compute a function of all the data, and the bottleneck is the communication used by the algorithm. We present algorithms for two illustrative problems on massive data sets: (1) computing a low-rank approximation of a matrix $A = A^1 + A^2 + \ldots + A^s$, with matrix $A^t$ stored on server $t$, and (2) computing a function of a vector $a_1 + a_2 + \ldots + a_s$, where server $t$ has the vector $a_t$; this includes the well-studied special case of computing frequency moments and separable functions, as well as higher-order correlations such as the number of subgraphs of a specified type occurring in a graph. For both problems we give algorithms with nearly optimal communication, and in particular the only dependence on $n$, the size of the data, is in the number of bits needed to represent indices and words ($O(\log n)$). Comment: rewritten with focus on two main results (distributed PCA, higher-order moments and correlations) in the arbitrary partition mode
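To make the arbitrary-partition setting of problem (2) concrete, here is a small sketch of the naive baseline, in which every server ships its whole vector to a coordinator. It is not the paper's algorithm (which is designed to avoid exactly this communication cost); the helper names `combine` and `frequency_moment` and the toy data are hypothetical. The point it illustrates is that the target is a function of the summed vector, which in general differs from the sum of the per-server values.

```python
from collections import Counter


def combine(local_vectors):
    """Sum sparse vectors (dicts index -> value) held by different servers."""
    total = Counter()
    for vector in local_vectors:
        for index, value in vector.items():
            total[index] += value
    return dict(total)


def frequency_moment(vector, k):
    """k-th frequency moment: sum of |value|^k over all coordinates."""
    return sum(abs(value) ** k for value in vector.values())


# Hypothetical data: three servers, each holding one sparse vector a_t.
servers = [{"x": 1, "y": 2}, {"x": 3}, {"y": -2, "z": 1}]

global_f2 = frequency_moment(combine(servers), 2)
sum_of_local_f2 = sum(frequency_moment(v, 2) for v in servers)

print(global_f2)        # moment of a_1 + a_2 + a_3
print(sum_of_local_f2)  # sum of per-server moments, a different quantity
```

Because coordinates can reinforce or cancel across servers (here "y" cancels entirely), the servers cannot simply add up their local answers, which is why communication becomes the bottleneck.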

arXiv.org e-Print Archive

CiteSeerX

Retrieval with gene queries

Author: Sehgal Aditya K
Srinivasan Padmini
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Accuracy of document retrieval from MEDLINE for gene queries is crucially important for many applications in bioinformatics. We explore five information retrieval-based methods to rank documents retrieved by PubMed gene queries for the human genome. The aim is to rank relevant documents higher in the retrieved list. We address the special challenges faced due to ambiguity in gene nomenclature: gene terms that refer to multiple genes, gene terms that are also English words, and gene terms that have other biological meanings.

RESULTS: Our two baseline ranking strategies are quite similar in performance. Two of our three LocusLink-based strategies offer significant improvements. These methods work very well even when there is ambiguity in the gene terms. Our best ranking strategy offers significant improvements on three different kinds of ambiguities over our two baseline strategies (improvements range from 15.9% to 17.7% and 11.7% to 13.3% depending on the baseline). For most genes the best ranking query is one that is built from the LocusLink (now Entrez Gene) summary and product information along with the gene names and aliases. For others, the gene names and aliases suffice. We also present an approach that successfully predicts, for a given gene, which of these two ranking queries is more appropriate.

CONCLUSION: We explore the effect of different post-retrieval strategies on the ranking of documents returned by PubMed for human gene queries. We have successfully applied some of these strategies to improve the ranking of relevant documents in the retrieved sets. This holds true even when various kinds of ambiguity are encountered. We feel that it would be very useful to apply strategies like ours on PubMed search results as these are not ordered by relevance in any way. This is especially so for queries that retrieve a large number of documents.
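As a rough illustration of post-retrieval ranking with an expanded gene query, the sketch below re-orders an already retrieved document set by TF-IDF cosine similarity to a query built from gene names plus summary text. The abstract does not specify the five ranking methods, so the scoring function, the smoothing of the IDF term, the function names `tokenize` and `rerank`, and the two toy documents are all assumptions made for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on non-alphanumeric characters."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def rerank(query_text, documents):
    """Return document indices sorted by TF-IDF cosine similarity to the query."""
    doc_counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for counts in doc_counts for term in counts)

    def vectorize(counts):
        # Smoothed IDF (an assumed choice) keeps unseen query terms finite.
        return {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, tf in counts.items()
        }

    def cosine(u, v):
        dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    query_vec = vectorize(Counter(tokenize(query_text)))
    scores = [cosine(query_vec, vectorize(counts)) for counts in doc_counts]
    return sorted(range(n_docs), key=lambda i: -scores[i])


# Hypothetical retrieved set for an ambiguous gene symbol that is also an English word.
docs = [
    "The cat sat on the mat",
    "CAT encodes catalase, an enzyme that decomposes hydrogen peroxide",
]
names_and_aliases = "CAT"
summary = "catalase enzyme hydrogen peroxide"  # stand-in for summary/product text

print(rerank(names_and_aliases + " " + summary, docs))
```

In this toy case the summary terms pull the document about the gene ahead of the one that merely contains the English word, which is the kind of ambiguity the abstract describes.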

Springer - Publisher Connector

Directory of Open Access Journals

Borel Ranks and Wadge Degrees of Context Free Omega Languages

Author: Finkel Olivier
Publication date: 01/01/2006

We show that, from a topological point of view, considering the Borel and the Wadge hierarchies, 1-counter Büchi automata have the same accepting power as Turing machines equipped with a Büchi acceptance condition. In particular, for every non-null recursive ordinal $\alpha$, there exist some $\Sigma^0_\alpha$-complete and some $\Pi^0_\alpha$-complete omega context free languages accepted by 1-counter Büchi automata, and the supremum of the set of Borel ranks of context free omega languages is the ordinal $\gamma^1_2$, which is strictly greater than the first non-recursive ordinal. This very surprising result gives answers to questions of H. Lescow and W. Thomas [Logical Specifications of Infinite Computations, In: "A Decade of Concurrency", LNCS 803, Springer, 1994, p. 583-621]

arXiv.org e-Print Archive

Hal-Diderot

On Nonnegative Integer Matrices and Short Killing Words

Author: Kiefer Stefan
Mascle Corto
Publication date: 18/03/2019

Let $n$ be a natural number and $\mathcal{M}$ a set of $n \times n$-matrices over the nonnegative integers such that the joint spectral radius of $\mathcal{M}$ is at most one. We show that if the zero matrix $0$ is a product of matrices in $\mathcal{M}$, then there are $M_1, \ldots, M_{n^5} \in \mathcal{M}$ with $M_1 \cdots M_{n^5} = 0$. This result has applications in automata theory and the theory of codes. Specifically, if $X \subset \Sigma^*$ is a finite incomplete code, then there exists a word $w \in \Sigma^*$ of length polynomial in $\sum_{x \in X} |x|$ such that $w$ is not a factor of any word in $X^*$. This proves a weak version of Restivo's conjecture. Comment: This version is a journal submission based on a STACS'19 paper. It extends the conference version as follows. (1) The main result has been generalized to apply to monoids generated by finite sets whose joint spectral radius is at most 1. (2) The use of Carpi's theorem is avoided to make the paper more self-contained. (3) A more precise result is offered on Restivo's conjecture for finite code
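For small instances, a shortest product equal to the zero matrix (a shortest "killing word") can be found by brute force. The sketch below is not the paper's construction; it is a plain breadth-first search, and the function names and example matrices are hypothetical. It relies on one fact that holds for nonnegative entries: with no cancellation possible, a product is zero exactly when the Boolean product of the matrices' zero/nonzero patterns is zero, so the search only needs to track those patterns. It does not check the joint spectral radius condition, under which the abstract's $n^5$ length bound applies.

```python
from collections import deque


def support(matrix):
    """Zero/nonzero pattern of a matrix, as a hashable tuple of tuples."""
    return tuple(tuple(entry != 0 for entry in row) for row in matrix)


def bool_product(p, q):
    """Boolean matrix product of two square patterns of equal size."""
    n = len(p)
    return tuple(
        tuple(any(p[i][k] and q[k][j] for k in range(n)) for j in range(n))
        for i in range(n)
    )


def shortest_killing_word(matrices):
    """Return indices of a shortest product equal to zero, or None if none exists."""
    n = len(matrices[0])
    patterns = [support(m) for m in matrices]
    identity = tuple(tuple(i == j for j in range(n)) for i in range(n))
    zero = tuple(tuple(False for _ in range(n)) for _ in range(n))
    queue = deque([(identity, [])])
    seen = {identity}
    while queue:
        pattern, word = queue.popleft()
        for index, factor in enumerate(patterns):
            nxt = bool_product(pattern, factor)
            if nxt == zero:
                return word + [index]
            if nxt not in seen:  # each pattern is expanded once, so the search ends
                seen.add(nxt)
                queue.append((nxt, word + [index]))
    return None


# Hypothetical 2x2 example: every product of A and B is 0, A, or B.
A = [[0, 2], [0, 0]]
B = [[1, 0], [0, 0]]
print(shortest_killing_word([A, B]))
```

The search space has at most $2^{n^2}$ patterns, so this is only practical for very small $n$; the interest of the result in the abstract is that a killing word of polynomial length must exist whenever any killing word does.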

arXiv.org e-Print Archive

Oxford University Research Archive

Multi-copy and stochastic transformation of multipartite pure states

Author: D. M. Greenberger
J. Ja’ Ja’
Lin Chen
Masahito Hayashi
P. Bürgisser
Publication venue: 'American Physical Society (APS)'
Publication date: 20/12/2010

Characterizing the transformation and classification of multipartite entangled states is a basic problem in quantum information. We study the problem under the two most common environments, local operations and classical communications (LOCC) and stochastic LOCC (SLOCC), and under two more general environments, multi-copy LOCC (MCLOCC) and multi-copy SLOCC (MCSLOCC). We show that two transformable multipartite states under LOCC or SLOCC are also transformable under MCLOCC and MCSLOCC. What's more, these two environments are equivalent in the sense that two transformable states under MCLOCC are also transformable under MCSLOCC, and vice versa. Based on these environments we classify the multipartite pure states into a few inequivalent sets and orbits, between which we build the partial order to decide their transformation. In particular, we investigate the structure of SLOCC-equivalent states in terms of tensor rank, which is known as the generalized Schmidt rank. Given the tensor rank, we show that GHZ states can be used to generate all states with a smaller or equal tensor rank under SLOCC, and all reduced separable states with a cardinality smaller than or equal to the tensor rank under LOCC. Using these concepts, we extend the concept of "maximally entangled state" to the multipartite system. Comment: 8 pages, 1 figure, revised version according to colleagues' comment

arXiv.org e-Print Archive