Experimental Support for a Categorical Compositional Distributional Model of Meaning
Modelling compositional meaning for sentences using empirical distributional
methods has been a challenge for computational linguists. We implement the
abstract categorical model of Coecke et al. (arXiv:1003.4394v1 [cs.CL]) using
data from the BNC and evaluate it. The implementation is based on unsupervised
learning of matrices for relational words and applying them to the vectors of
their arguments. The evaluation is based on the word disambiguation task
developed by Mitchell and Lapata (2008) for intransitive sentences, and on a
similar new experiment designed for transitive sentences. Our model matches the
results of its competitors in the first experiment, and betters them in the
second. The general improvement in results with increasing syntactic
complexity showcases the compositional power of our model.
Comment: 11 pages, to be presented at EMNLP 2011, to be published in Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing
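A minimal sketch of the composition step described in this abstract, in NumPy. The toy vectors, the single training pair, and the element-wise combination of the verb matrix with the subject-object outer product are illustrative assumptions, not the paper's exact setup or data.

```python
import numpy as np

# Toy 4-dimensional noun vectors; in the paper these come from BNC
# co-occurrence statistics, not hand-picked numbers.
subject = np.array([0.2, 0.9, 0.1, 0.4])   # e.g. "dog"
obj     = np.array([0.7, 0.1, 0.5, 0.3])   # e.g. "ball"

# A relational word (here a transitive verb) is learned unsupervised
# as a matrix, e.g. by summing outer products of the subject/object
# vector pairs it occurs with in the corpus.
training_pairs = [(subject, obj)]           # stand-in for corpus data
verb = sum(np.outer(s, o) for s, o in training_pairs)

# Composition: apply the verb matrix to its arguments. One simple
# instantiation combines it element-wise with the subject-object
# outer product, yielding a sentence meaning in a fixed space.
sentence = verb * np.outer(subject, obj)

# Because all sentence meanings live in one space, similarity between
# arbitrary sentences is just a normalized inner product.
def similarity(a, b):
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(similarity(sentence, sentence))       # 1.0
```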
Multiple factor analysis of distributional data
In the framework of Symbolic Data Analysis (SDA), distribution-variables are
a particular case of multi-valued variables: each unit is represented by a set
of distributions (e.g. histograms, density functions or quantile functions),
one for each variable. Factor analysis (FA) methods are primary exploratory
tools for dimension reduction and visualization. In the present work, we use
the Multiple Factor Analysis (MFA) approach for the analysis of data described by
distributional variables. Each distributional variable induces a set of new
numeric variables related to the quantiles of each distribution. We call these
new variables \textit{quantile variables}, and the set of quantile variables
related to a distributional one forms a block in the MFA approach. Thus, MFA is
performed on juxtaposed tables of quantile variables. We show that the
criterion decomposed in the analysis is an approximation of the variability
based on a suitable metric between distributions: the squared
Wasserstein distance. Applications to simulated and real distributional data
corroborate the method. The interpretation of the results on the factorial
planes is supported by new interpretative tools related to several
characteristics of the distributions (location, scale and shape).
Comment: Accepted by STATISTICA APPLICATA: Italian Journal of Applied Statistics on 12/201
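A sketch of the quantile-variable construction and of the metric the MFA criterion approximates, assuming each distributional cell is given as raw samples and using equally spaced quantile levels; function names are illustrative. MFA itself would then be run on the juxtaposed quantile blocks.

```python
import numpy as np

def quantile_block(samples, levels=np.linspace(0.05, 0.95, 10)):
    """One distributional cell (raw samples) becomes a block of
    numeric quantile variables, one column per quantile level."""
    return np.quantile(samples, levels)

def squared_wasserstein(samples_a, samples_b, n=1000):
    """Squared 2-Wasserstein distance between univariate distributions,
    approximated as the mean squared difference of quantile functions:
    W_2^2 = integral over t in [0,1] of (Qa(t) - Qb(t))^2 dt."""
    t = np.linspace(0, 1, n)
    return float(np.mean((np.quantile(samples_a, t)
                          - np.quantile(samples_b, t)) ** 2))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 10_000)
b = rng.normal(2.0, 1.0, 10_000)

print(quantile_block(a))           # one MFA block for one variable
print(squared_wasserstein(a, b))   # ~4.0: (2-0)^2 for two shifted normals
```

Working through quantile functions is what makes the L2-type criterion of a standard factor analysis line up with the Wasserstein geometry between distributions.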
Mathematical Foundations for a Compositional Distributional Model of Meaning
We propose a mathematical framework for a unification of the distributional
theory of meaning in terms of vector space models, and a compositional theory
for grammatical types, for which we rely on the algebra of Pregroups,
introduced by Lambek. This mathematical framework enables us to compute the
meaning of a well-typed sentence from the meanings of its constituents.
Concretely, the type reductions of Pregroups are `lifted' to morphisms in a
category, a procedure that transforms meanings of constituents into a meaning
of the (well-typed) whole. Importantly, meanings of whole sentences live in a
single space, independent of the grammatical structure of the sentence. Hence
the inner product can be used to compare the meanings of arbitrary sentences, as it
is for comparing the meanings of words in the distributional model. The
mathematical structure we employ admits a purely diagrammatic calculus which
exposes how the information flows between the words in a sentence in order to
make up the meaning of the whole sentence. A variation of our `categorical
model', which constrains the scalars of the vector spaces to the
semiring of Booleans, results in a Montague-style Boolean-valued semantics.
Comment: to appear
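A minimal sketch of the pregroup type reductions that the framework lifts to morphisms, encoding each basic type with an integer adjoint order; this encoding is an illustration of the reduction mechanics, not the paper's formalism.

```python
# Pregroup types as lists of (basic_type, adjoint_order) pairs:
# order 0 is the type itself, +1 its right adjoint, -1 its left adjoint.
NOUN = [("n", 0)]
TVERB = [("n", 1), ("s", 0), ("n", -1)]   # n^r · s · n^l

def reduce_type(seq):
    """Apply the contractions a·a^r -> 1 and a^l·a -> 1, i.e. delete any
    adjacent pair (a, z)(a, z+1), until no contraction applies."""
    seq = list(seq)
    changed = True
    while changed:
        changed = False
        for i in range(len(seq) - 1):
            (a, z1), (b, z2) = seq[i], seq[i + 1]
            if a == b and z2 == z1 + 1:
                del seq[i:i + 2]
                changed = True
                break
    return seq

# "subject verb object": n · (n^r s n^l) · n reduces to the sentence
# type s, witnessing that the sentence is well-typed.
assert reduce_type(NOUN + TVERB + NOUN) == [("s", 0)]
```

In the categorical model these deletions are not mere bookkeeping: each contraction is lifted to a morphism (an inner-product-like map) that routes information between word vectors.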
Incremental dimension reduction of tensors with random index
We present an incremental, scalable and efficient dimension reduction
technique for tensors that is based on sparse random linear coding. Data is
stored in a compactified representation with fixed size, which makes memory
requirements low and predictable. Component encoding and decoding are performed
on-line without computationally expensive re-analysis of the data set. The
range of tensor indices can be extended dynamically without modifying the
component representation. This idea originates from a mathematical model of
semantic memory and a method known as random indexing in natural language
processing. We generalize the random-indexing algorithm to tensors and present
signal-to-noise-ratio simulations for representations of vectors and matrices.
We also present a mathematical analysis of the approximate orthogonality of
high-dimensional ternary vectors, which is a property that underpins this and
other similar random-coding approaches to dimension reduction. To further
demonstrate the properties of random indexing we present results of a synonym
identification task. The method presented here has some similarities with
random projection and Tucker decomposition, but it performs well only at high
dimensionality (n > 10^3). Random indexing is useful for a range of complex
practical problems, e.g., in natural language processing, data mining, pattern
recognition, event detection, graph searching and search engines. Prototype
software is provided. It supports encoding and decoding of tensors of order >=
1 in a unified framework, i.e., vectors, matrices and higher-order tensors.
Comment: 36 pages, 9 figures
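A sketch of the random-indexing idea generalized to an order-2 tensor (a matrix), assuming sparse ternary index vectors and outer-product binding; the `encode`/`decode` names and parameters are illustrative, not the paper's prototype API.

```python
import numpy as np

rng = np.random.default_rng(0)

def ternary_index_vector(dim, nnz):
    """Sparse random ternary vector with nnz/2 entries +1 and nnz/2
    entries -1; such high-dimensional vectors are nearly orthogonal."""
    v = np.zeros(dim)
    idx = rng.choice(dim, size=nnz, replace=False)
    v[idx[: nnz // 2]] = 1.0
    v[idx[nnz // 2 :]] = -1.0
    return v

DIM, NNZ = 1000, 10           # compressed size, fixed in advance
row_codes, col_codes = {}, {}

def code(codes, i):
    # Index vectors are drawn on demand, so the range of tensor indices
    # can be extended dynamically without touching the stored state.
    if i not in codes:
        codes[i] = ternary_index_vector(DIM, NNZ)
    return codes[i]

state = np.zeros((DIM, DIM))  # fixed-size representation of the matrix

def encode(i, j, value):
    # Superpose value * outer(row code, column code) onto the state;
    # memory use stays constant no matter how many entries are stored.
    state[:] += value * np.outer(code(row_codes, i), code(col_codes, j))

def decode(i, j):
    # Project onto the same rank-one code; approximate orthogonality
    # of the codes makes crosstalk from other entries small.
    return code(row_codes, i) @ state @ code(col_codes, j) / NNZ**2

encode(42, 4999, 3.0)         # logical indices may far exceed DIM
encode(7, 123, -1.5)
print(round(decode(42, 4999), 2))   # ~3.0, plus small crosstalk noise
```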