On link predictions in complex networks with an application to ontologies and semantics
It is assumed that ontologies can be represented and treated as networks and that these networks show properties of so-called complex networks. Just like ontologies, "our current pictures of many networks are substantially incomplete" (Clauset et al., 2008, p. 3ff.). For this reason, networks have been analyzed and methods for identifying missing edges have been proposed. The goal of this thesis is to show how treating and understanding an ontology as a network can be used to extend and improve existing ontologies, and how measures from graph theory and techniques developed for social network analysis and other complex networks in recent years can be applied to semantic networks in the form of ontologies. Given a large enough amount of data, here data organized according to an ontology, and the relations defined in that ontology, the goal is to find patterns that help reveal implicitly given information. Unlike reasoning and methods of inference, the approach does not rely on predefined patterns of relations; instead, it identifies patterns of relations and other structural information taken from the ontology graph to calculate probabilities of as-yet-unknown relations between entities.
The methods adopted from network theory and social sciences presented in this thesis are expected to reduce the work and time necessary to build an ontology considerably by automating it. They are believed to be applicable to any ontology and can be used in either supervised or unsupervised fashion to automatically identify missing relations, add new information, and thereby enlarge the data set and increase the information explicitly available in an ontology. As seen in the IBM Watson example, different knowledge bases are applied in NLP tasks. An ontology like WordNet contains lexical and semantic knowledge on lexemes, while general knowledge ontologies like Freebase and DBpedia contain information on entities of the non-linguistic world. In this thesis, examples from both kinds of ontologies are used: WordNet and DBpedia.
WordNet is a manually crafted resource that establishes a network of representations of word senses, connects these to the word forms used to express them, and links these senses and forms with lexical and semantic relations in a machine-readable form. As will be shown, although a lot of work has been put into WordNet, it can still be improved.
While it already contains many lexical and semantic relations, it is not possible to distinguish between polysemous and homonymous words. As will be explained later, this distinction can be useful for NLP problems regarding word sense disambiguation and hence QA.
Using graph- and network-based centrality and path measures, the goal is to train a machine learning model that is able to identify new, missing relations in the ontology and assign these new relations to the whole data set (i.e., WordNet). The approach presented here is based on a deep analysis of the ontology and the network structure it exposes. Using different measures from graph theory as features and a set of manually created examples, a so-called training set, a supervised machine learning approach will be presented and evaluated that shows the benefit of interpreting an ontology as a network compared to approaches that do not take the network structure into account.
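To illustrate the kind of graph features involved, the following toy sketch (hypothetical graph data, not taken from WordNet) scores non-adjacent node pairs with two standard link-prediction measures, common neighbors and the Jaccard coefficient; in the supervised setting described above, such scores would serve as features for a trained classifier:

```python
from itertools import combinations

# Toy undirected graph as an adjacency map (hypothetical example data).
graph = {
    "dog": {"canine", "animal", "pet"},
    "cat": {"feline", "animal", "pet"},
    "canine": {"dog", "animal"},
    "feline": {"cat", "animal"},
    "animal": {"dog", "cat", "canine", "feline"},
    "pet": {"dog", "cat"},
}

def common_neighbors(u, v):
    return len(graph[u] & graph[v])

def jaccard(u, v):
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0

# Score every non-adjacent pair; high scores suggest candidate missing edges.
candidates = [(u, v) for u, v in combinations(graph, 2) if v not in graph[u]]
scored = sorted(
    ((jaccard(u, v), common_neighbors(u, v), u, v) for u, v in candidates),
    reverse=True,
)
for jac, cn, u, v in scored[:3]:
    print(f"{u}-{v}: jaccard={jac:.2f}, common={cn}")
```

Here the structurally similar "dog" and "cat" nodes surface at the top, suggesting a missing edge (e.g., a sibling relation) between them.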
DBpedia is an ontology derived from Wikipedia. The structured information given in Wikipedia infoboxes is parsed, and relations are extracted according to an underlying ontology. Unlike Wikipedia, it contains only the small amount of structured information (e.g., the infoboxes of each page), not the large amount of unstructured information (i.e., the free text) of Wikipedia pages. Hence DBpedia is missing a large number of possible relations that are described in Wikipedia. Compared to Freebase, an ontology used and maintained by Google, DBpedia is also quite incomplete. This, and the fact that Wikipedia can be used to check possible results against, makes DBpedia a good subject of investigation.
The approach used to extend DBpedia presented in this thesis will be based on a thorough analysis of the network structure and the assumed evolution of the network, which will point to the locations in the network where information is most likely to be missing. Since the structure of the ontology and the resulting network is assumed to reveal patterns that are connected to certain relations defined in the ontology, these patterns can be used to identify what kind of relation is missing between two entities of the ontology. This will be done using unsupervised methods from the field of data mining and machine learning.
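A minimal sketch of this unsupervised idea, on hypothetical DBpedia-style triples: when two relation types habitually co-occur between the same entity pairs, a pair exhibiting only one of them suggests a missing edge. The triples, relation names, and decision rule below are illustrative assumptions, not the thesis's actual method:

```python
from collections import Counter

# Hypothetical DBpedia-style triples (subject, relation, object).
triples = [
    ("Berlin", "capitalOf", "Germany"),
    ("Berlin", "locatedIn", "Germany"),
    ("Paris", "capitalOf", "France"),
    ("Munich", "locatedIn", "Germany"),
    ("Lyon", "locatedIn", "France"),
]

# Group relations by entity pair, then count which relations co-occur.
pair_rels = {}
for s, r, o in triples:
    pair_rels.setdefault((s, o), set()).add(r)

cooc = Counter()
for rels in pair_rels.values():
    for a in rels:
        for b in rels:
            if a != b:
                cooc[(a, b)] += 1

# If "capitalOf" tends to co-occur with "locatedIn", a pair carrying only
# the former is a candidate location for a missing edge.
for (s, o), rels in pair_rels.items():
    if "capitalOf" in rels and "locatedIn" not in rels:
        print(f"candidate missing edge: ({s}, locatedIn, {o})")
```

On this toy data the pattern learned from Berlin/Germany points to the missing (Paris, locatedIn, France) triple.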
Resolving Regular Polysemy in Named Entities
Word sense disambiguation primarily addresses the lexical ambiguity of common
words based on a predefined sense inventory. Conversely, proper names are
usually considered to denote an ad-hoc real-world referent. Once the reference
is decided, the ambiguity is purportedly resolved. However, proper names also
exhibit ambiguities through appellativization, i.e., they act like common words
and may denote different aspects of their referents. We propose to address the
ambiguities of proper names in the light of regular polysemy, which we
formalize as dot objects. This paper introduces a combined word sense
disambiguation (WSD) model for disambiguating common words against Chinese
Wordnet (CWN) and proper names as dot objects. The model leverages the
flexibility of a gloss-based model architecture, which takes advantage of the
glosses and example sentences of CWN. We show that the model achieves
competitive results on both common and proper nouns, even on a relatively
sparse sense dataset. Aside from being a performant WSD tool, the model further
facilitates the future development of the lexical resource.
Unsupervised Machine Learning Approach for Tigrigna Word Sense Disambiguation
All human languages have words that can mean different things in different contexts. Word sense disambiguation (WSD) is an open problem of natural language processing, which governs the process of identifying which sense of a word (i.e., which meaning) is used in a sentence when the word has multiple meanings (polysemy). We use unsupervised machine learning techniques to address the problem of automatically deciding the correct sense of an ambiguous word in Tigrigna texts based on its surrounding context. Owing to the lack of sufficient training data, we report experiments on four selected Tigrigna ambiguous words: "medeb", which has three different meanings (program, traditional bed, and grouping); "halefe", which has four dissimilar meanings (pass, promote, boss, and pass away); "hademe", which has two different meanings (running and building a house); and "kebere", which has two different meanings (respecting and expensive). Finally, we tested five clustering algorithms (simple k-means; hierarchical agglomerative with single, average, and complete link; and Expectation Maximization) in the existing implementation of the Weka 3.8.1 package. The "use training set" evaluation mode was selected to train the selected algorithms on the preprocessed dataset. We evaluated the algorithms for the four ambiguous words and achieved the best accuracy, in the range of 67% to 83.3%, with EM, which is an encouraging result. Keywords: Attribute-Relation File Format, Cross Validation, Consonant Vowel, Machine Readable Dictionary, Natural Language Processing, System for Ethiopic Representation in ASCII, Word Sense Disambiguation
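The clustering step behind such an approach can be sketched as follows; the sentences, bag-of-words vectors, and fixed seed centroids are illustrative stand-ins (English instead of Tigrigna, one assignment step instead of the full Weka algorithms):

```python
from collections import Counter

# Toy English stand-ins for contexts of an ambiguous word; the actual
# study uses Tigrigna sentences and Weka's clustering implementations.
contexts = [
    "the program schedule was announced on radio",
    "the broadcast program aired in the evening",
    "he slept on the traditional bed made of wood",
    "a wooden bed stood in the corner of the room",
]

def vectorize(text):
    # Bag-of-words context vector.
    return Counter(text.split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# One k-means-style assignment step, with one sentence per sense fixed
# as a seed centroid so the example is deterministic.
vecs = [vectorize(c) for c in contexts]
seeds = [vecs[0], vecs[2]]
clusters = [max(range(2), key=lambda k: cosine(v, seeds[k])) for v in vecs]
print(clusters)
```

Contexts sharing sense-indicative vocabulary land in the same cluster, which is the signal the unsupervised WSD setup exploits.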
Morphologically Aware Word-Level Translation
We propose a novel morphologically aware probability model for bilingual
lexicon induction, which jointly models lexeme translation and inflectional
morphology in a structured way. Our model exploits the basic linguistic
intuition that the lexeme is the key lexical unit of meaning, while
inflectional morphology provides additional syntactic information. This
approach leads to substantial performance improvements: 19% average
improvement in accuracy across 6 language pairs over the state of the art in
the supervised setting and 16% in the weakly supervised setting. As another
contribution, we highlight issues associated with modern BLI that stem from
ignoring inflectional morphology, and propose three suggestions for improving
the task.
Comment: COLING 202
Adjusting Sense Representations for Word Sense Disambiguation and Automatic Pun Interpretation
Word sense disambiguation (WSD), the task of determining which meaning a word carries in a particular context, is a core research problem in computational linguistics. Though it has long been recognized that supervised (machine learning-based) approaches to WSD can yield impressive results, they require an amount of manually annotated training data that is often too expensive or impractical to obtain. This is a particular problem for under-resourced languages and domains, and is also a hurdle in well-resourced languages when processing the sort of lexical-semantic anomalies employed for deliberate effect in humour and wordplay. In contrast to supervised systems are knowledge-based techniques, which rely only on pre-existing lexical-semantic resources (LSRs). These techniques are of more general applicability but tend to suffer from lower performance due to the informational gap between the target word's context and the sense descriptions provided by the LSR.
This dissertation is concerned with extending the efficacy and applicability of knowledge-based word sense disambiguation. First, we investigate two approaches for bridging the information gap and thereby improving the performance of knowledge-based WSD. In the first approach we supplement the word's context and the LSR's sense descriptions with entries from a distributional thesaurus. The second approach enriches an LSR's sense information by aligning it to other, complementary LSRs.
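The first approach can be sketched with a simplified Lesk-style overlap: the sense whose gloss shares the most words with the context wins, and both sides are expanded with distributional neighbours to bridge the information gap. The glosses and thesaurus entries below are toy assumptions, not the dissertation's actual resources:

```python
# Simplified Lesk with thesaurus expansion (hypothetical toy entries).
glosses = {
    "bank/finance": "institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a body of water",
}
thesaurus = {
    "money": {"cash", "funds"},
    "water": {"stream", "current"},
}

def expand(words):
    # Add distributional neighbours of each word to the word set.
    out = set(words)
    for w in words:
        out |= thesaurus.get(w, set())
    return out

def lesk(context, senses):
    ctx = expand(context.lower().split())
    def overlap(sense):
        return len(ctx & expand(glosses[sense].split()))
    return max(senses, key=overlap)

print(lesk("she deposited cash at the bank", glosses))
```

Without the thesaurus, "cash" and the gloss word "money" would never match; the expansion is what makes the overlap succeed, which mirrors the role of the distributional thesaurus described above.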
Our next main contribution is to adapt techniques from word sense disambiguation to a novel task: the interpretation of puns. Traditional NLP applications, including WSD, usually treat the source text as carrying a single meaning, and therefore cannot cope with the intentionally ambiguous constructions found in humour and wordplay. We describe how algorithms and evaluation methodologies from traditional word sense disambiguation can be adapted for the "disambiguation" of puns, or rather for the identification of their double meanings.
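The adaptation can be sketched in miniature: instead of returning the single best-scoring sense, a pun "disambiguator" returns the two best, treating the pair as the pun's double meaning. The gloss entries and the plain overlap scoring are hypothetical stand-ins for the adapted algorithms:

```python
def pun_senses(context, pun_glosses):
    # Score senses by bag-of-words overlap with the context, then keep
    # the TWO best: a pun is "disambiguated" to a pair of senses.
    ctx = set(context.lower().split())
    scores = {
        sense: len(ctx & set(gloss.split()))
        for sense, gloss in pun_glosses.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:2]

# Hypothetical glosses for the pun word "interest".
pun_glosses = {
    "interest/money": "a fee paid by a bank for the use of money",
    "interest/curiosity": "a feeling of wanting to know about something",
    "interest/stake": "a legal share or right in something",
}
print(pun_senses("the bank pays a fee but the joke holds no feeling", pun_glosses))
```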
Finally, we cover the design and construction of technological and linguistic resources aimed at supporting the research and application of word sense disambiguation. Development and comparison of WSD systems has long been hampered by a lack of standardized data formats, language resources, software components, and workflows. To address this issue, we designed and implemented a modular, extensible framework for WSD. It implements, encapsulates, and aggregates reusable, interoperable components using UIMA, an industry-standard information processing architecture. We have also produced two large sense-annotated data sets for under-resourced languages or domains: one of these targets German-language text, and the other English-language puns.
Acquiring and Harnessing Verb Knowledge for Multilingual Natural Language Processing
Advances in representation learning have enabled natural language processing models to derive non-negligible linguistic information directly from text corpora in an unsupervised fashion. However, this signal is underused in downstream tasks, where they tend to fall back on superficial cues and heuristics to solve the problem at hand. Further progress relies on identifying and filling the gaps in linguistic knowledge captured in their parameters. The objective of this thesis is to address these challenges focusing on the issues of resource scarcity, interpretability, and lexical knowledge injection, with an emphasis on the category of verbs.
To this end, I propose a novel paradigm for efficient acquisition of lexical knowledge leveraging native speakers' intuitions about verb meaning to support development and downstream performance of NLP models across languages. First, I investigate the potential of acquiring semantic verb classes from non-experts through manual clustering. This subsequently informs the development of a two-phase semantic dataset creation methodology, which combines semantic clustering with fine-grained semantic similarity judgments collected through spatial arrangements of lexical stimuli. The method is tested on English and then applied to a typologically diverse sample of languages to produce the first large-scale multilingual verb dataset of this kind. I demonstrate its utility as a diagnostic tool by carrying out a comprehensive evaluation of state-of-the-art NLP models, probing representation quality across languages and domains of verb meaning, and shedding light on their deficiencies. Subsequently, I directly address these shortcomings by injecting lexical knowledge into large pretrained language models. I demonstrate that external manually curated information about verbs' lexical properties can support data-driven models in tasks where accurate verb processing is key. Moreover, I examine the potential of extending these benefits from resource-rich to resource-poor languages through translation-based transfer. The results emphasise the usefulness of human-generated lexical knowledge in supporting NLP models and suggest that time-efficient construction of lexicons similar to those developed in this work, especially in under-resourced languages, can play an important role in boosting their linguistic capacity.
ESRC Doctoral Fellowship [ES/J500033/1], ERC Consolidator Grant LEXICAL [648909]
Semantics-based approach for generating partial views from linked life-cycle highway project data
The purpose of this dissertation is to develop methods that can assist data integration and extraction from heterogeneous sources generated throughout the life-cycle of a highway project. In the era of computerized technologies, project data is largely available in digital format. Due to the fragmented nature of the civil infrastructure sector, digital data are created and managed separately by different project actors in proprietary data warehouses. The differences in data structure and semantics greatly hinder the exchange and full reuse of digital project data. In order to address those issues, this dissertation carries out the following three individual studies.
The first study aims to develop a framework for interconnecting heterogeneous life-cycle project data into a unified and linked data space. This is an ontology-based framework that consists of two phases: (1) translating proprietary datasets into homogeneous RDF data graphs; and (2) connecting separate data networks to each other. Three domain ontologies for the design, construction, and asset condition survey phases are developed to support data transformation. A merged ontology that integrates the domain ontologies is constructed to provide guidance on how to connect data nodes from the domain graphs.
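The two phases can be sketched on hypothetical records and field names; the framework itself works with RDF serializations and full domain ontologies, so this is only a minimal illustration of the triple translation and subject-node linking:

```python
# Phase (1): translate proprietary records into RDF-style triples.
# Record layouts, field names, and prefixes are hypothetical.
design_record = {"id": "Bridge42", "spanLength": 120, "material": "steel"}
survey_record = {"asset": "Bridge42", "conditionRating": 7}

def to_triples(record, subject_key, prefix):
    subj = record[subject_key]
    return [
        (subj, f"{prefix}:{field}", value)
        for field, value in record.items()
        if field != subject_key
    ]

design_graph = to_triples(design_record, "id", "design")
survey_graph = to_triples(survey_record, "asset", "survey")

# Phase (2): triples sharing a subject node join into one linked graph.
merged = design_graph + survey_graph
subjects = {s for s, _, _ in merged}
print(sorted(merged))
```

Because both records describe the same subject node, merging the domain graphs links the design-phase and survey-phase data around "Bridge42".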
The second study is to deal with the terminology inconsistency between data sources. An automated method is developed that employs Natural Language Processing (NLP) and machine learning techniques to support constructing a domain specific lexicon from design manuals. The method utilizes pattern rules to extract technical terms from texts and learns their representation vectors using a neural network based word embedding approach. The study also includes the development of an integrated method of minimal-supervised machine learning, clustering analysis, and word vectors, for computing the term semantics and classifying the relations between terms in the target lexicon.
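The pattern-rule idea can be sketched with a single toy regex rule; the rule, text, and head-noun list below are illustrative assumptions, not the hand-crafted rules or the embedding components developed in the study:

```python
import re

# Toy pattern rule: one or two lower-case modifier tokens followed by a
# known head noun; a stand-in for the study's extraction rules.
text = (
    "The asphalt concrete layer shall be placed over the aggregate base "
    "course. The pavement surface must be inspected."
)
pattern = re.compile(r"\b(?:(?!the\b)[a-z]+ ){1,2}(?:layer|course|surface)\b")
terms = [m.group() for m in pattern.finditer(text.lower())]
print(terms)
```

In the actual method these extracted terms would then be embedded with a neural word-embedding model and clustered to infer relations between lexicon entries.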
In the last study, a data retrieval technique for extracting subsets of an XML civil data schema is designed and tested. The algorithm takes a keyword input of the end user and returns a ranked list of the most relevant XML branches. This study utilizes a lexicon of the highway domain generated from the second study to analyze the semantics of the end user keywords. A context-based similarity measure is introduced to evaluate the relevance between a certain branch in the source schema and the user query.
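The branch-ranking idea can be sketched on a miniature schema; the tags are hypothetical, and plain keyword overlap stands in for the lexicon-driven, context-based similarity measure described above:

```python
import xml.etree.ElementTree as ET

# A miniature stand-in for an XML civil data schema (hypothetical tags).
schema = ET.fromstring("""
<schema>
  <Bridge><Span/><Deck/><Girder/></Bridge>
  <Pavement><Surface/><BaseCourse/></Pavement>
  <Drainage><Culvert/><Inlet/></Drainage>
</schema>
""")

def rank_branches(root, keywords):
    # Score each top-level branch by overlap between the user keywords
    # and the (lower-cased) tag names inside that branch.
    kws = {k.lower() for k in keywords}
    scored = []
    for branch in root:
        tags = {el.tag.lower() for el in branch.iter()}
        scored.append((len(kws & tags), branch.tag))
    return [tag for score, tag in sorted(scored, reverse=True) if score > 0]

print(rank_branches(schema, ["surface", "basecourse"]))
```

A user query about pavement surfacing thus retrieves only the relevant sub-model of the schema rather than the whole data structure.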
The methods and algorithms resulting from this research were tested using case studies and empirical experiments.
The results indicate that the study successfully addresses the heterogeneity in the structure and terminology of the data and enables fast extraction of data sub-models. The study is expected to enhance the efficiency of reusing digital data generated throughout the project life-cycle, and to contribute to the success of transitioning from paper-based to digital project delivery for civil infrastructure projects.
Affect Lexicon Induction For the Github Subculture Using Distributed Word Representations
Sentiments and emotions play essential roles in small group interactions, especially in self-organized collaborative groups. Many people view sentiments as universal constructs; however, cultural differences exist in some aspects of sentiments. Understanding the features of the sentiment space in small group cultures provides essential insights into the dynamics of self-organized collaboration. However, due to the limited amount of carefully human-annotated data, it is hard to describe sentiment divergences across cultures.
In this thesis, we present a new approach to inspect cultural differences at the level of sentiments and compare a subculture with the general social environment. We use Github, a collaborative software development network, as an example of a self-organized subculture. First, we train word embeddings on large corpora and align the embeddings using a linear transformation method. Then we model finer-grained human sentiment in the Evaluation-Potency-Activity (EPA) space and extend the subculture EPA lexicon with two-dense-layer neural networks. Finally, we apply a Long Short-Term Memory (LSTM) network to analyze the identities' sentiments triggered by event-based sentences. We evaluate the predicted EPA lexicon for the Github community using a recently collected dataset, and the results show that our approach can capture subtle changes in affective dimensions. Moreover, our induced sentiment lexicon shows that individuals from the two environments have different understandings of sentiment-related words and phrases but agree on nouns and adjectives. The sentiment features of "Github culture" could explain why people in self-organized groups tend to reduce personal sentiment to improve group collaboration.
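The linear alignment step can be sketched in two dimensions: given anchor words embedded in both spaces, solve for the map W minimizing ||XW - Y|| via the normal equations. The vectors are toy values (real alignments use hundreds of dimensions and often an orthogonality constraint):

```python
# Anchor-word embeddings in a "general" space (X) and a "subculture"
# space (Y); here Y is simply X rotated 90 degrees (toy data).
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
Y = [(0.0, 1.0), (-1.0, 0.0), (-1.0, 1.0)]

def mat_mul(A, B):
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inv2(M):
    # Closed-form inverse of a 2x2 matrix.
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

Xm = [list(r) for r in X]
Ym = [list(r) for r in Y]
Xt = transpose(Xm)
# Least squares: W = (X^T X)^-1 X^T Y
W = mat_mul(inv2(mat_mul(Xt, Xm)), mat_mul(Xt, Ym))
print([[round(v, 6) for v in row] for row in W])
```

The recovered W is the 90-degree rotation, so applying it maps each general-space vector onto its subculture-space counterpart, which is what makes the two lexicons comparable.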
Geometric Representation Learning
Vector embedding models are a cornerstone of modern machine learning methods for knowledge representation and reasoning. These methods aim to turn semantic questions into geometric questions by learning representations of concepts and other domain objects in a lower-dimensional vector space. In that spirit, this work advocates for density- and region-based representation learning. Embedding domain elements as geometric objects beyond a single point enables us to naturally represent breadth and polysemy, make asymmetric comparisons, and answer complex queries, and it provides a strong inductive bias when labeled data is scarce. We present a model for word representation using Gaussian densities, enabling asymmetric entailment judgments between concepts, and a probabilistic model for weighted transitive relations and multivariate discrete data based on a lattice of axis-aligned hyperrectangle representations (boxes). We explore the suitability of these embedding methods in different regimes of sparsity, edge weight, correlation, and independence structure, as well as extensions of the representation and different optimization strategies. We make a theoretical investigation of the representational power of the box lattice, and propose extensions to address shortcomings in modeling difficult distributions and graphs.
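The box idea can be sketched with hypothetical coordinates: representing each concept as an axis-aligned hyperrectangle makes entailment an asymmetric volume ratio rather than a symmetric distance:

```python
# Minimal sketch of box (hyperrectangle) embeddings; the concepts and
# coordinates are hypothetical toy values, not learned representations.
boxes = {
    "animal": ((0.0, 0.0), (10.0, 10.0)),   # (min corner, max corner)
    "dog":    ((2.0, 2.0), (5.0, 5.0)),
    "car":    ((20.0, 0.0), (25.0, 5.0)),
}

def volume(box):
    lo, hi = box
    v = 1.0
    for a, b in zip(lo, hi):
        v *= max(0.0, b - a)
    return v

def intersection(b1, b2):
    lo = tuple(max(a, b) for a, b in zip(b1[0], b2[0]))
    hi = tuple(min(a, b) for a, b in zip(b1[1], b2[1]))
    return (lo, hi)

def entailment_score(premise, hypothesis):
    # vol(premise ∩ hypothesis) / vol(premise): asymmetric by design.
    inter = volume(intersection(boxes[premise], boxes[hypothesis]))
    return inter / volume(boxes[premise])

print(entailment_score("dog", "animal"))   # dog's box lies inside animal's
print(entailment_score("animal", "dog"))   # only a fraction the other way
print(entailment_score("dog", "car"))      # disjoint boxes
```

The score of 1.0 for "dog entails animal" versus the small score in the reverse direction is exactly the asymmetric comparison that point embeddings cannot express.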