INRIASAC: Simple Hypernym Extraction Methods
Given a set of terms from a given domain, how can we structure them into a
taxonomy without manual intervention? This is Task 17 of SemEval 2015. Here
we present our simple taxonomy structuring techniques which, despite their
simplicity, ranked first in this 2015 benchmark. We use large quantities of
text (English Wikipedia) and simple heuristics such as term overlap and
document and sentence co-occurrence to produce hypernym lists. We describe
these techniques and present an initial evaluation of results.
Comment: SemEval 2015, Jun 2015, Denver, United States
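The term-overlap heuristic mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact rule, so the one used here is an assumption: a multiword term is taken to be a hyponym of any other domain term that forms its word suffix. The function name `overlap_hypernyms` and the sample terms are hypothetical.

```python
def overlap_hypernyms(terms):
    """Map each multiword term to the domain terms that form its word suffix.

    Assumed rule (not taken from the paper): "apple juice" ends with the
    domain term "juice", so "juice" is proposed as a hypernym.
    """
    vocab = {t.lower() for t in terms}
    result = {}
    for term in sorted(vocab):
        words = term.split()
        # Proper suffixes, longest first:
        # "green apple juice" -> "apple juice", "juice"
        suffixes = [" ".join(words[i:]) for i in range(1, len(words))]
        found = [s for s in suffixes if s in vocab]
        if found:
            result[term] = found
    return result


pairs = overlap_hypernyms(["juice", "apple juice", "green apple juice", "apple"])
for hyponym, hypernyms in pairs.items():
    print(hyponym, "->", hypernyms)
```

A suffix rule of this kind only covers terms that share surface words; the document and sentence co-occurrence heuristics named in the abstract would be needed for pairs with no lexical overlap.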
Experimenting with Transitive Verbs in a DisCoCat
Formal and distributional semantic models offer complementary benefits in
modeling meaning. The categorical compositional distributional (DisCoCat) model
of meaning of Coecke et al. (arXiv:1003.4394v1 [cs.CL]) combines aspects of
both to provide a general framework in which meanings of words, obtained
distributionally, are composed using methods from the logical setting to form
sentence meaning. Concrete consequences of this general abstract setting and
applications to empirical data are under active study (Grefenstette et al.,
arXiv:1101.0309; Grefenstette and Sadrzadeh, arXiv:1106.4058v1 [cs.CL]). In
this paper, we extend this study by examining transitive verbs, represented as
matrices in a DisCoCat. We discuss three ways of constructing such matrices,
and evaluate each method in a disambiguation task developed by Grefenstette and
Sadrzadeh (arXiv:1106.4058v1 [cs.CL]).
Comment: 5 pages, to be presented at GEMS 2011, as part of EMNLP'11 workshop
Experimental Support for a Categorical Compositional Distributional Model of Meaning
Modelling compositional meaning for sentences using empirical distributional
methods has been a challenge for computational linguists. We implement the
abstract categorical model of Coecke et al. (arXiv:1003.4394v1 [cs.CL]) using
data from the BNC and evaluate it. The implementation is based on unsupervised
learning of matrices for relational words and applying them to the vectors of
their arguments. The evaluation is based on the word disambiguation task
developed by Mitchell and Lapata (2008) for intransitive sentences, and on a
similar new experiment designed for transitive sentences. Our model matches the
results of its competitors in the first experiment, and betters them in the
second. The general improvement in results with increase in syntactic
complexity showcases the compositional power of our model.
Comment: 11 pages, to be presented at EMNLP 2011, to be published in
Proceedings of the 2011 Conference on Empirical Methods in Natural Language
Processing
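The idea of learning a matrix for a relational word and applying it to the vectors of its arguments can be sketched as follows. The abstract does not spell out the construction, so both steps here are assumptions: the verb matrix is built as a sum of outer products of the (subject, object) vectors observed with the verb, and it is applied by elementwise multiplication with the outer product of the new arguments. All names and the toy two-dimensional vectors are hypothetical.

```python
def outer(u, v):
    """Outer product of two vectors, as a list of rows."""
    return [[a * b for b in v] for a in u]


def verb_matrix(arg_pairs):
    """Sum the outer products of (subject, object) vectors seen with a verb.

    Assumed construction; the abstract only says the matrices are learned
    without supervision.
    """
    rows, cols = len(arg_pairs[0][0]), len(arg_pairs[0][1])
    total = [[0.0] * cols for _ in range(rows)]
    for subj, obj in arg_pairs:
        for i, row in enumerate(outer(subj, obj)):
            for j, x in enumerate(row):
                total[i][j] += x
    return total


def apply_verb(matrix, subj, obj):
    """Weight the verb matrix elementwise by the outer product of its arguments."""
    so = outer(subj, obj)
    return [[m * x for m, x in zip(mrow, xrow)] for mrow, xrow in zip(matrix, so)]


# Toy corpus: two (subject, object) pairs observed with one verb.
observed = [([1.0, 0.0], [0.0, 1.0]), ([1.0, 1.0], [1.0, 0.0])]
matrix = verb_matrix(observed)
sentence = apply_verb(matrix, [1.0, 0.0], [1.0, 1.0])
print(matrix)
print(sentence)
```

Two sentence representations built this way live in the same space, so they can be compared directly, for example by cosine similarity over the flattened matrices.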
Determining the Characteristic Vocabulary for a Specialized Dictionary using Word2vec and a Directed Crawler
Specialized dictionaries are used to understand concepts in specific domains,
especially where those concepts are not part of the general vocabulary, or
have meanings that differ from ordinary language. The first step in creating
a specialized dictionary involves detecting the characteristic vocabulary of
the domain in question. Classical methods for detecting this vocabulary involve
gathering a domain corpus, calculating statistics on the terms found there, and
then comparing these statistics to a background or general language corpus.
Terms which are found significantly more often in the specialized corpus than
in the background corpus are candidates for the characteristic vocabulary of
the domain. Here we present two tools, a directed crawler and a distributional
semantics package, that can be used together, circumventing the need for a
background corpus. Both tools are available on the web.
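The classical approach described in this abstract, comparing term statistics in a domain corpus against a background corpus, can be sketched as below. This illustrates the baseline only, not the crawler and distributional tools the abstract presents. The smoothed frequency ratio, the `min_ratio` threshold, and the toy token lists are assumptions; a real system would normally use a significance test rather than a fixed ratio.

```python
from collections import Counter


def characteristic_terms(domain_tokens, background_tokens, min_ratio=2.0):
    """Rank terms by how much more frequent they are in the domain corpus.

    Uses add-one smoothed relative frequencies (an assumption made here so
    that terms absent from the background corpus do not divide by zero).
    """
    dom = Counter(domain_tokens)
    bg = Counter(background_tokens)
    n_dom, n_bg = len(domain_tokens), len(background_tokens)
    vocab_size = len(set(dom) | set(bg))
    scored = []
    for term, count in dom.items():
        p_dom = (count + 1) / (n_dom + vocab_size)
        p_bg = (bg[term] + 1) / (n_bg + vocab_size)
        ratio = p_dom / p_bg
        if ratio >= min_ratio:
            scored.append((term, ratio))
    # Highest ratio first.
    return sorted(scored, key=lambda pair: -pair[1])


domain = "the functor maps the category to the functor category".split()
background = "the cat sat on the mat and the dog went to the park".split()
candidates = characteristic_terms(domain, background)
for term, ratio in candidates:
    print(f"{term}\t{ratio:.2f}")
```

On this toy input, function words such as "the" and "to" fall below the threshold because they are about as frequent in the background text as in the domain text.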
On the Place of Text Data in Lifelogs, and Text Analysis via Semantic Facets
Current research in lifelog data has not paid enough attention to analysis of
cognitive activities in comparison to physical activities. We argue that as we
look into the future, wearable devices are going to be cheaper and more
prevalent and textual data will play a more significant role. Data captured by
lifelogging devices will increasingly include speech and text, potentially
useful in analysis of intellectual activities. Analyzing what a person hears,
reads, and sees, we should be able to measure the extent of cognitive activity
devoted to a certain topic or subject by a learner. Text-based lifelog records
can benefit from semantic analysis tools developed for natural language
processing. We show how semantic analysis of such text data can be achieved
through the use of taxonomic subject facets and how these facets might be
useful in quantifying cognitive activity devoted to various topics in a
person's day. We are currently developing a method to automatically create
taxonomic topic vocabularies that can be applied to this detection of
intellectual activity.
Lambek vs. Lambek: Functorial Vector Space Semantics and String Diagrams for Lambek Calculus
The Distributional Compositional Categorical (DisCoCat) model is a
mathematical framework that provides compositional semantics for meanings of
natural language sentences. It consists of a computational procedure for
constructing meanings of sentences, given their grammatical structure in terms
of compositional type-logic, and given the empirically derived meanings of
their words. For the particular case that the meaning of words is modelled
within a distributional vector space model, its experimental predictions,
derived from real large scale data, have outperformed other empirically
validated methods that could build vectors for a full sentence. This success
can be attributed to a conceptually motivated mathematical underpinning, by
integrating qualitative compositional type-logic and quantitative modelling of
meaning within a category-theoretic mathematical framework.
The type-logic used in the DisCoCat model is Lambek's pregroup grammar.
Pregroup types form a posetal compact closed category, which can be passed, in
a functorial manner, on to the compact closed structure of vector spaces,
linear maps and tensor product. The diagrammatic versions of the equational
reasoning in compact closed categories can be interpreted as the flow of word
meanings within sentences. Pregroups simplify Lambek's previous type-logic, the
Lambek calculus, which has been extensively used to formalise and reason about
various linguistic phenomena. The apparent reliance of the DisCoCat on
pregroups has been seen as a shortcoming. This paper addresses this concern, by
pointing out that one may as well realise a functorial passage from the
original type-logic of Lambek, a monoidal bi-closed category, to vector spaces,
or to any other model of meaning organised within a monoidal bi-closed
category. The corresponding string diagram calculus, due to Baez and Stay, now
depicts the flow of word meanings.
Comment: 29 pages, pending publication in Annals of Pure and Applied Logic
A Convolutional Neural Network for Modelling Sentences
The ability to accurately represent sentences is central to language
understanding. We describe a convolutional architecture dubbed the Dynamic
Convolutional Neural Network (DCNN) that we adopt for the semantic modelling of
sentences. The network uses Dynamic k-Max Pooling, a global pooling operation
over linear sequences. The network handles input sentences of varying length
and induces a feature graph over the sentence that is capable of explicitly
capturing short and long-range relations. The network does not rely on a parse
tree and is easily applicable to any language. We test the DCNN in four
experiments: small scale binary and multi-class sentiment prediction, six-way
question classification and Twitter sentiment prediction by distant
supervision. The network achieves excellent performance in the first three
tasks and a greater than 25% error reduction in the last task with respect to
the strongest baseline.
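The pooling operation named in this abstract builds on k-max pooling, which keeps the k largest activations of a sequence while preserving their original order. The sketch below shows that basic operation on a single feature row. The abstract does not say how the dynamic variant chooses k, so here k is simply passed in; the function name `k_max_pool` and the sample values are hypothetical.

```python
def k_max_pool(values, k):
    """Keep the k largest values of a sequence, in their original order."""
    if k >= len(values):
        return list(values)
    # Indices of the k largest values; ties go to the earlier position.
    top = sorted(range(len(values)), key=lambda i: (-values[i], i))[:k]
    # Restore sequence order so relative positions are preserved.
    return [values[i] for i in sorted(top)]


pooled = k_max_pool([0.1, 0.9, 0.3, 0.7, 0.2], k=3)
print(pooled)  # [0.9, 0.3, 0.7]
```

Because the output length depends only on k and not on the input length, a pooling step of this kind is one way a network can accept sentences of varying length.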
"Not not bad" is not "bad": A distributional account of negation
With the increasing empirical success of distributional models of
compositional semantics, it is timely to consider the types of textual logic
that such models are capable of capturing. In this paper, we address
shortcomings in the ability of current models to capture logical operations
such as negation. As a solution we propose a tripartite formulation for a
continuous vector space representation of semantics and subsequently use this
representation to develop a formal compositional notion of negation within such
models.
Comment: 9 pages, to appear in Proceedings of the 2013 Workshop on Continuous
Vector Space Models and their Compositionality