72 research outputs found

    A Study of Metrics of Distance and Correlation Between Ranked Lists for Compositionality Detection

    Compositionality in language refers to how much the meaning of some phrase can be decomposed into the meaning of its constituents and the way these constituents are combined. Based on the premise that substitution by synonyms is meaning-preserving, compositionality can be approximated as the semantic similarity between a phrase and a version of that phrase where words have been replaced by their synonyms. Different ways of representing such phrases exist (e.g., vectors [1] or language models [2]), and the choice of representation affects the measurement of semantic similarity. We propose a new compositionality detection method that represents phrases as ranked lists of term weights. Our method approximates the semantic similarity between two ranked list representations using a range of well-known distance and correlation metrics. In contrast to most state-of-the-art approaches in compositionality detection, our method is completely unsupervised. Experiments with a publicly available dataset of 1048 human-annotated phrases show that, compared to strong supervised baselines, our approach provides superior measurement of compositionality using any of the distance and correlation metrics considered.

    How Many Topics? Stability Analysis for Topic Models

    Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the "over-clustering" of a corpus into many small, highly-similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora show that this strategy can successfully guide the model selection process. Comment: Improve readability of plots. Add minor clarification

    Degree-degree correlations in random graphs with heavy-tailed degrees

    We investigate degree-degree correlations for scale-free graph sequences. The main conclusion of this paper is that the assortativity coefficient is not the appropriate way to describe degree-dependences in scale-free random graphs. Indeed, we study the infinite volume limit of the assortativity coefficient, and show that this limit is always non-negative when the degrees have finite first but infinite third moment, i.e., when the degree exponent $\gamma + 1$ of the density satisfies $\gamma \in (1,3)$. More generally, our results show that the correlation coefficient is inappropriate to describe dependencies between random variables having infinite variance. We start with a simple model of the sample correlation of random variables $X$ and $Y$, which are linear combinations with non-negative coefficients of the same infinite variance random variables. In this case, the correlation coefficient of $X$ and $Y$ is not defined, and the sample covariance converges to a proper random variable with support that is a subinterval of $(-1,1)$. Further, for any joint distribution $(X,Y)$ with equal marginals being non-negative power-law distributions with infinite variance (as in the case of degree-degree correlations), we show that the limit is non-negative. We next adapt these results to the assortativity in networks as described by the degree-degree correlation coefficient, and show that it is non-negative in the large graph limit when the degree distribution has an infinite third moment. We illustrate these results with several examples where the assortativity behaves in a non-sensible way. We further discuss alternatives for describing assortativity in networks based on rank correlations that are appropriate for infinite variance variables. We support these mathematical results by simulations.
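
As a rough illustration of why a rank-based measure can behave more sensibly than the sample correlation coefficient when a few values are extreme, the sketch below compares Pearson's coefficient with Spearman's rank correlation (Pearson computed on average ranks) on a tiny hand-made sample. The function names and the toy data are assumptions made for this example; it is not the construction or the simulation used in the paper.

```python
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (assumed): the two sequences are perfectly monotonically related,
# but one value in x is far larger than the rest.
x = [1, 2, 3, 1000]
y = [1, 2, 3, 4]
print(pearson(x, y), spearman(x, y))
```

On this sample the single large value pulls Pearson's coefficient well below 1, while the rank correlation depends only on the ordering.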

    A Weighted Correlation Index for Rankings with Ties

    Understanding the correlation between two different scores for the same set of items is a common problem in information retrieval, and the most commonly used statistic that quantifies this correlation is Kendall's $\tau$. However, the standard definition fails to capture that discordances between items with high rank are more important than those between items with low rank. Recently, a new measure of correlation based on average precision has been proposed to solve this problem, but like many alternative proposals in the literature it assumes that there are no ties in the scores. This is a major deficiency in a number of contexts, and in particular while comparing centrality scores on large graphs, as the obvious baseline, indegree, has a very large number of ties in web and social graphs. We propose to extend Kendall's definition in a natural way to take into account weights in the presence of ties. We prove a number of interesting mathematical properties of our generalization and describe an $O(n \log n)$ algorithm for its computation. We also validate the usefulness of our weighted measure of correlation using experimental data.
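
To make the idea of a tie-aware, weighted Kendall-style correlation concrete, here is a minimal quadratic-time sketch. With the default unit weights it reduces to the standard tie-corrected coefficient (tau-b). The `pair_weight` parameter and the `top_heavy` weighting are hypothetical choices made for illustration, and the sketch assumes items are indexed by rank position (index 0 is the top item). It is not the paper's definition, nor its $O(n \log n)$ algorithm.

```python
from itertools import combinations
from math import sqrt


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def weighted_tau(x, y, pair_weight=lambda i, j: 1.0):
    """Kendall-style correlation of two score lists, allowing ties.

    Each pair (i, j) contributes sign(x_i - x_j) * sign(y_i - y_j) times
    its weight; a pair tied in one list adds nothing to that list's norm.
    """
    if len(x) != len(y):
        raise ValueError("score lists must have the same length")
    inner = norm_x = norm_y = 0.0
    for i, j in combinations(range(len(x)), 2):
        w = pair_weight(i, j)
        sx = sign(x[i] - x[j])
        sy = sign(y[i] - y[j])
        inner += sx * sy * w
        norm_x += sx * sx * w
        norm_y += sy * sy * w
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined when a list is constant")
    return inner / sqrt(norm_x * norm_y)


def top_heavy(i, j):
    """Hypothetical weighting: pairs involving top positions count more."""
    return 1.0 / (i + 1) + 1.0 / (j + 1)


reference = [4, 3, 2, 1]    # scores, item 0 ranked first
swap_top = [3, 4, 2, 1]     # disagreement between the two top items
swap_bottom = [4, 3, 1, 2]  # disagreement between the two bottom items

# Unweighted, both disagreements are penalised equally ...
print(weighted_tau(reference, swap_top), weighted_tau(reference, swap_bottom))
# ... while the top-heavy weighting penalises the swap at the top more.
print(weighted_tau(reference, swap_top, top_heavy),
      weighted_tau(reference, swap_bottom, top_heavy))
```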

    How Many Pairwise Preferences Do We Need to Rank A Graph Consistently?

    We consider the problem of optimal recovery of true ranking of $n$ items from a randomly chosen subset of their pairwise preferences. It is well known that without any further assumption, one requires a sample size of $\Omega(n^2)$ for the purpose. We analyze the problem with an additional structure of relational graph $G([n],E)$ over the $n$ items added with an assumption of \emph{locality}: Neighboring items are similar in their rankings. Noting the preferential nature of the data, we choose to embed not the graph, but, its \emph{strong product} to capture the pairwise node relationships. Furthermore, unlike existing literature that uses Laplacian embedding for graph based learning problems, we use a richer class of graph embeddings---\emph{orthonormal representations}---that includes (normalized) Laplacian as its special case. Our proposed algorithm, {\it Pref-Rank}, predicts the underlying ranking using an SVM based approach over the chosen embedding of the product graph, and is the first to provide \emph{statistical consistency} on two ranking losses: \emph{Kendall's tau} and \emph{Spearman's footrule}, with a required sample complexity of $O(n^2 \chi(\bar{G}))^{\frac{2}{3}}$ pairs, $\chi(\bar{G})$ being the \emph{chromatic number} of the complement graph $\bar{G}$. Clearly, our sample complexity is smaller for dense graphs, with $\chi(\bar{G})$ characterizing the degree of node connectivity, which is also intuitive due to the locality assumption e.g. $O(n^{\frac{4}{3}})$ for union of $k$-cliques, or $O(n^{\frac{5}{3}})$ for random and power law graphs etc.---a quantity much smaller than the fundamental limit of $\Omega(n^2)$ for large $n$. This, for the first time, relates ranking complexity to structural properties of the graph.
We also report experimental evaluations on different synthetic and real datasets, where our algorithm is shown to outperform the state-of-the-art methods. Comment: In Thirty-Third AAAI Conference on Artificial Intelligence, 201
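
For readers unfamiliar with the two ranking losses named in the abstract, the sketch below computes them for a pair of rankings in their common unnormalised form. It assumes each ranking is given as a list whose entry `i` is the rank position of item `i`, with no ties. The function names are made up for this example, and the paper may normalise the losses differently.

```python
from itertools import combinations


def spearman_footrule(true_rank, pred_rank):
    """Sum over items of the absolute difference in rank position."""
    if len(true_rank) != len(pred_rank):
        raise ValueError("rankings must cover the same items")
    return sum(abs(a - b) for a, b in zip(true_rank, pred_rank))


def kendall_tau_distance(true_rank, pred_rank):
    """Number of item pairs that the two rankings order differently."""
    if len(true_rank) != len(pred_rank):
        raise ValueError("rankings must cover the same items")
    disagreements = 0
    for i, j in combinations(range(len(true_rank)), 2):
        # The product is negative exactly when the pair is ordered oppositely.
        if (true_rank[i] - true_rank[j]) * (pred_rank[i] - pred_rank[j]) < 0:
            disagreements += 1
    return disagreements


true_rank = [0, 1, 2, 3]   # item i sits at position i
one_swap = [1, 0, 2, 3]    # the first two items exchanged
reversed_rank = [3, 2, 1, 0]

print(spearman_footrule(true_rank, one_swap),
      kendall_tau_distance(true_rank, one_swap))
print(spearman_footrule(true_rank, reversed_rank),
      kendall_tau_distance(true_rank, reversed_rank))
```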