Geometric Methods for Context Sensitive Distributional Semantics
PhD. This thesis describes a novel methodology, grounded in the distributional semantic paradigm,
for building context sensitive models of word meaning, affording an empirical exploration
of the relationship between words and concepts. Anchored in theoretical linguistic insight
regarding the contextually specified nature of lexical semantics, the work presented here
explores a range of techniques for the selection of subspaces of word co-occurrence dimensions
based on a statistical analysis of input terms as observed within large-scale textual
corpora. The relationships between word-vectors that emerge in the projected subspaces
can be analysed in terms of a mapping between their geometric features and their semantic
properties. The power of this modelling technique is its ability to generate ad hoc
semantic relationships in response to an extemporaneous linguistic or conceptual situation.
The product of this approach is a generalisable computational linguistic methodology,
capable of taking input in various forms, including word groupings and sentential context,
and dynamically generating output from a broad base model of word co-occurrence
data. To demonstrate the versatility of the method, this thesis will present competitive
empirical results on a range of established natural language tasks including word similarity
and relatedness rating, metaphor and metonymy detection, and analogy completion.
A range of techniques will be applied in order to explore the ways in which different
aspects of projected geometries can be mapped to different semantic relationships, allowing
for the discovery of a range of lexical and conceptual properties for any given input
and providing a basis for an empirical exploration of distinctions between the semantic
phenomena under analysis. The case made here is that the flexibility of these models
and their ability to extend output to evaluations of unattested linguistic relationships
constitutes the groundwork for a method for the extrapolation of dynamic conceptual
relationships from large-scale textual corpora.
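The idea of selecting a context-specific subspace of co-occurrence dimensions can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy matrix, the function names, and in particular the selection rule (keep the dimensions with the highest mean score across the input words), which stands in for the range of selection techniques the thesis explores.

```python
import numpy as np

# Hypothetical toy model: rows are words, columns are co-occurrence
# dimensions (context words). The values are invented association scores.
vocab = {"cat": 0, "dog": 1, "car": 2, "bus": 3}
contexts = ["fur", "pet", "road", "wheel"]
matrix = np.array([
    [3.0, 2.0, 0.0, 0.1],   # cat
    [2.5, 2.2, 0.2, 0.0],   # dog
    [0.0, 0.1, 2.8, 3.0],   # car
    [0.1, 0.0, 3.1, 2.5],   # bus
])

def select_subspace(matrix, vocab, input_words, k=2):
    """Return indices of the k dimensions with the highest mean score
    across the input words (an assumed selection rule)."""
    rows = [vocab[w] for w in input_words]
    scores = matrix[rows].mean(axis=0)
    return np.argsort(scores)[::-1][:k]

def distance_in_subspace(matrix, vocab, dims, w1, w2):
    """Euclidean distance between two word-vectors projected onto dims.
    The vectors are deliberately left non-normalised."""
    a = matrix[vocab[w1], dims]
    b = matrix[vocab[w2], dims]
    return float(np.linalg.norm(a - b))

# Subspace selected for an "animal" context.
dims = select_subspace(matrix, vocab, ["cat", "dog"], k=2)
selected = [contexts[int(d)] for d in dims]
d_cat_dog = distance_in_subspace(matrix, vocab, dims, "cat", "dog")
d_cat_car = distance_in_subspace(matrix, vocab, dims, "cat", "car")
```

Because each retained dimension is still a named context word, the projected geometry remains interpretable: here the selected dimensions are the two animal-related contexts, and "cat" sits closer to "dog" than to "car" within them.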
This method is presented as a complement and a counterpoint to established distributional
methods for generating lexically productive word-vectors. Where contemporary
vector space models of distributional semantics have almost universally involved either
the factorisation of co-occurrence matrices or the incremental learning of abstract representations
using neural networks, the approach described in this thesis preserves the
connection between the individual dimensions of word-vectors and statistics pertaining
to observations in a textual corpus. The hypothesis tested here is that the maintenance
of actual, interpretable information about underlying linguistic data allows for the contextual
selection of non-normalised subspaces with more nuanced geometric features. In
addition to presenting competitive results for various computational linguistic targets,
the thesis will suggest that the transparency of its representations indicates scope for
the application of this model to various real-world problems where an interpretable relationship
between data and output is highly desirable. This, finally, demonstrates a way
towards the productive application of the theory and philosophy of language to computational
linguistic practice.
Engineering and Physical Sciences Research Council of the UK, EP/L50483X/1
Cluster damage robustness analysis and space independent community detection in complex networks
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.
This thesis investigates the evolution of two very different complex systems using network theory. This multi-disciplinary technique is widely used to model and analyse vastly diverse systems of multiple interacting components, and therefore, it is applied in this thesis to study the complexity of the systems. This complexity is rooted in the components’ interactions such that the whole system is more than the sum of all the individual parts. The first novelty in this research is the proposal of a new type of structural perturbation, cluster damage, for measuring another dimension of network robustness. The second novelty is the first application of a community detection method, which uncovers space-independent communities in spatial networks, to airport and linguistic networks.
A critical property of complex systems – robustness – is explored within a partial model of the Internet, by demonstrating a novel perturbation strategy based on the iterative removal of clusters. The main contribution of this theoretical case study is the methodology for cluster damage, which has not previously been investigated in the literature on the robustness of complex networks. The model, part of the Internet at the Autonomous System level, serves only as a domain in which the novel methodology is demonstrated; it was chosen because the Internet is known to be robust owing to its distributed (non-centralised) nature, even though it is often subjected to large perturbations and failures.
The first applied case study is in the field of air transportation. Specifically, it explores the topology and passenger flows of the United States Airport Network (USAN) over two decades. The network model consists of a time-series of six network snapshots for the years 1990, 2000 and 2010, which capture bi-monthly passenger flows among US airports. Since the network is embedded in space, the volume of these flows is naturally affected by spatial proximity; therefore, a model (recently proposed in the literature) accounting for this phenomenon is used to identify communities of airports that have particularly high flows among them, given their spatial separation.
The second applied case study, in the field of language acquisition, investigates the word co-occurrence networks of children as they develop their linguistic abilities at an early age. Similarly to the previous case study, the network model covers six children at three discrete developmental stages. These networks are not embedded in physical space, but they are mapped to an artificial semantic space that defines the semantic distance between pairs of words. This novel approach adds a dimension of network information, resulting in a more complete dataset. Community detection then identifies groups of words that have particularly high co-occurrence frequency, given their semantic distance.
This research highlights the fact that some general techniques from network theory, such as network modelling and analysis, can be successfully applied to the study of diverse systems, while others, such as community detection, need to be tailored to the specific system. However, methods originally developed for one domain may be applied somewhere completely new, as illustrated by the application of spatial community detection to a non-spatial network. This underlines the importance of inter-disciplinary research.
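The perturbation strategy based on the iterative removal of clusters, mentioned in the abstract above, can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy graph, the hand-picked clusters, the function names, and the choice of largest connected component size as the robustness measure are all hypothetical, and the thesis may define clusters and damage differently.

```python
from collections import deque

def build_graph(edges):
    """Build an undirected graph as a dict mapping node -> set of neighbours."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def largest_component_size(adj):
    """Size of the largest connected component, found by breadth-first search."""
    seen = set()
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        best = max(best, size)
    return best

def remove_nodes(adj, nodes):
    """Return a copy of the graph with the given nodes and their edges removed."""
    nodes = set(nodes)
    return {n: nbrs - nodes for n, nbrs in adj.items() if n not in nodes}

def cluster_damage(adj, clusters):
    """Remove clusters one at a time, recording the largest component
    size before any removal and after each one (assumed robustness measure)."""
    sizes = [largest_component_size(adj)]
    for cluster in clusters:
        adj = remove_nodes(adj, cluster)
        sizes.append(largest_component_size(adj))
    return sizes

# Toy network: two triangles joined through a bridge node 3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
graph = build_graph(edges)
sizes = cluster_damage(graph, [{3}, {4, 5, 6}])
```

Tracking the sequence of component sizes shows how quickly the network fragments as whole groups of nodes, and not just single nodes, are taken out.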
Measuring Societal Biases in Text Corpora via First-Order Co-occurrence
Text corpora are used to study societal biases, typically through statistical
models such as word embeddings. The bias of a word towards a concept is
typically estimated using vector similarity, measuring whether the word and
concept words share other words in their contexts. We argue that this
second-order relationship introduces unrelated concepts into the measure, which
causes an imprecise measurement of the bias. We propose instead to measure bias
using the direct normalized co-occurrence associations between the word and the
representative concept words, a first-order measure, by reconstructing the
co-occurrence estimates inherent in the word embedding models. To study our
novel corpus bias measurement method, we calculate the correlation of the
gender bias values estimated from the text to the actual gender bias statistics
of the U.S. job market, provided by two recent collections. The results show a
consistently higher correlation when using the proposed first-order measure
with a variety of word embedding models, as well as a more severe degree of
bias, especially towards females in a few specific occupations.
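A first-order measure of this kind can be sketched as the difference between a word's direct association with female concept words and with male concept words. The sketch below is an assumption-laden stand-in: it computes positive pointwise mutual information (PPMI) from raw counts on an invented toy corpus, whereas the paper reconstructs co-occurrence estimates from word embedding models. All names and data are hypothetical.

```python
import math
from collections import Counter

def cooccurrence_counts(sentences, window=5):
    """Count ordered (word, context) pairs within a symmetric window."""
    pair_counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(w, tokens[j])] += 1
    return pair_counts

def ppmi(pair_counts, w, c):
    """Positive PMI of word w with context c, from pair counts."""
    joint = pair_counts.get((w, c), 0)
    if joint == 0:
        return 0.0
    total = sum(pair_counts.values())
    w_total = sum(n for (a, _), n in pair_counts.items() if a == w)
    c_total = sum(n for (_, b), n in pair_counts.items() if b == c)
    return max(0.0, math.log(joint * total / (w_total * c_total)))

def first_order_bias(pair_counts, word, female_words, male_words):
    """Mean direct association with female words minus that with male words.
    Positive values indicate a female-leaning association."""
    f = sum(ppmi(pair_counts, word, c) for c in female_words) / len(female_words)
    m = sum(ppmi(pair_counts, word, c) for c in male_words) / len(male_words)
    return f - m

# Invented toy corpus, for illustration only.
corpus = [
    ["she", "is", "a", "nurse"],
    ["she", "works", "as", "a", "nurse"],
    ["he", "is", "an", "engineer"],
    ["he", "works", "as", "an", "engineer"],
]
counts = cooccurrence_counts(corpus)
bias_nurse = first_order_bias(counts, "nurse", ["she"], ["he"])
bias_engineer = first_order_bias(counts, "engineer", ["she"], ["he"])
```

Unlike a cosine similarity between embedding vectors, which is a second-order comparison of shared contexts, this score depends only on how often the word appears directly alongside the concept words.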