
    Analysis of equivalence mapping for terminology services

    This paper assesses the range of equivalence or mapping types required to facilitate interoperability in the context of a distributed terminology server. A detailed set of mapping types was examined with a view to determining their validity for characterizing relationships between mappings from selected terminologies (AAT, LCSH, MeSH, and UNESCO) to the Dewey Decimal Classification (DDC) scheme. It was hypothesized that the detailed set of 19 match types proposed by Chaplan in 1995 is unnecessary in this context and could be reduced to a less detailed, conceptually based set. Results from an extensive mapping exercise support the main hypothesis, and a generic suite of match types is proposed, although doubt remains over the current adequacy of the developing Simple Knowledge Organization System (SKOS) Core Mapping Vocabulary Specification (MVS) for inter-terminology mapping.
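The reduction the abstract describes can be pictured as a collapse table from a detailed set of match types onto a small, SKOS-like conceptual suite. The sketch below is purely illustrative: the detailed type names are invented placeholders, not Chaplan's actual 1995 list, and the generic names only loosely echo the SKOS mapping vocabulary.

```python
# Illustrative sketch: collapsing a detailed set of match types into a
# smaller, conceptually based suite. The detailed type names below are
# hypothetical placeholders, NOT Chaplan's actual list.

# Reduced suite, loosely modelled on SKOS-style mapping relations.
GENERIC_TYPES = {"exactMatch", "broadMatch", "narrowMatch", "relatedMatch", "noMatch"}

# Hypothetical collapse table: detailed type -> generic type.
COLLAPSE = {
    "identical": "exactMatch",
    "synonym": "exactMatch",
    "spelling-variant": "exactMatch",
    "broader": "broadMatch",
    "whole-part": "broadMatch",
    "narrower": "narrowMatch",
    "associative": "relatedMatch",
    "no-equivalent": "noMatch",
}

def generalize(detailed_type: str) -> str:
    """Map a detailed match type onto the reduced suite."""
    if detailed_type not in COLLAPSE:
        raise ValueError(f"unknown detailed type: {detailed_type}")
    return COLLAPSE[detailed_type]

# Example: a mapping from an LCSH heading to a DDC class (invented pair).
mapping = {"source": "LCSH: Watermills", "target": "DDC 621.21", "type": "broader"}
print(generalize(mapping["type"]))  # broadMatch
```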

    On link predictions in complex networks with an application to ontologies and semantics

    It is assumed that ontologies can be represented and treated as networks and that these networks show properties of so-called complex networks. Just like ontologies, “our current pictures of many networks are substantially incomplete” (Clauset et al., 2008, p. 3ff.). For this reason, networks have been analyzed and methods for identifying missing edges have been proposed. The goal of this thesis is to show how treating and understanding an ontology as a network can be used to extend and improve existing ontologies, and how measures from graph theory and techniques developed in recent years in social network analysis and for other complex networks can be applied to semantic networks in the form of ontologies. Given a sufficiently large amount of data organized according to an ontology, together with the relations the ontology defines, the goal is to find patterns that help reveal information given only implicitly in the ontology. Unlike reasoning and inference methods, the approach does not rely on predefined relation patterns; instead, it identifies patterns of relations, or of other structural information taken from the ontology graph, in order to calculate probabilities of as-yet-unknown relations between entities. The methods adopted from network theory and the social sciences presented in this thesis are expected to reduce the work and time necessary to build an ontology considerably by automating it. They are believed to be applicable to any ontology and can be used in either supervised or unsupervised fashion to automatically identify missing relations, add new information, and thereby enlarge the data set and increase the information explicitly available in an ontology. As seen in the IBM Watson example, different knowledge bases are applied in NLP tasks. An ontology like WordNet contains lexical and semantic knowledge on lexemes, while general knowledge ontologies like Freebase and DBpedia contain information on entities of the non-linguistic world.
In this thesis, examples from both kinds of ontologies are used: WordNet and DBpedia. WordNet is a manually crafted resource that establishes a network of representations of word senses, connects them to the word forms used to express them, and links these senses and forms with lexical and semantic relations in a machine-readable form. As will be shown, although a lot of work has been put into WordNet, it can still be improved. While it already contains many lexical and semantic relations, it is not possible to distinguish between polysemous and homonymous words. As will be explained later, this distinction can be useful for NLP problems such as word sense disambiguation and hence QA. Using graph- and network-based centrality and path measures, the goal is to train a machine learning model that is able to identify new, missing relations in the ontology and to assign these new relations to the whole data set (i.e., WordNet). The approach presented here is based on a deep analysis of the ontology and the network structure it exposes. Using different measures from graph theory as features, together with a set of manually created examples (a training set), a supervised machine learning approach will be presented and evaluated that shows the benefit of interpreting an ontology as a network compared to approaches that do not take the network structure into account. DBpedia is an ontology derived from Wikipedia: the structured information given in Wikipedia infoboxes is parsed, and relations according to an underlying ontology are extracted. Unlike Wikipedia, DBpedia contains only the small amount of structured information (e.g., the infoboxes of each page) and not the large amount of unstructured free text of Wikipedia pages. Hence DBpedia is missing a large number of possible relations that are described in Wikipedia. Compared to Freebase, an ontology used and maintained by Google, DBpedia is also quite incomplete.
This, and the fact that Wikipedia can be used to check candidate results, makes DBpedia a good subject of investigation. The approach used here to extend DBpedia is based on a thorough analysis of the network structure and of the assumed evolution of the network, which points to the locations in the network where information is most likely to be missing. Since the structure of the ontology, and of the resulting network, is assumed to reveal patterns that are connected to certain relations defined in the ontology, these patterns can be used to identify what kind of relation is missing between two entities. This will be done using unsupervised methods from the fields of data mining and machine learning.
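A minimal sketch of this network view, assuming the ontology is flattened to an undirected graph and using the Jaccard coefficient of node neighbourhoods as one standard link-prediction score. The nodes and edges below are invented toy data, not drawn from WordNet or DBpedia.

```python
# Toy link prediction over an "ontology as network": score unconnected
# node pairs by neighbourhood overlap; high scores suggest missing edges.
from itertools import combinations

# Invented toy ontology edges (relation labels dropped for simplicity).
edges = [
    ("dog", "canine"), ("canine", "mammal"), ("cat", "feline"),
    ("feline", "mammal"), ("dog", "pet"), ("cat", "pet"),
]

# Build adjacency sets for the undirected graph.
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def jaccard(u, v):
    """Jaccard coefficient of two nodes' neighbourhoods."""
    inter = adj[u] & adj[v]
    union = adj[u] | adj[v]
    return len(inter) / len(union) if union else 0.0

# Score every currently unconnected pair, best candidates first.
candidates = sorted(
    ((u, v, jaccard(u, v))
     for u, v in combinations(sorted(adj), 2)
     if v not in adj[u]),
    key=lambda t: -t[2],
)
print(candidates[0])
```

In the full approach such scores (alongside centrality and path measures) would serve as features for a supervised classifier; here the raw score alone already ranks plausible missing links to the top.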

    A distributional model of semantic context effects in lexical processing

    One of the most robust findings of experimental psycholinguistics is that the context in which a word is presented influences the effort involved in processing that word. We present a novel model of contextual facilitation based on word co-occurrence probability distributions, and empirically validate the model through simulation of three representative types of context manipulation: single word priming, multiple priming and contextual constraint. In our simulations the effects of semantic context are modeled using general-purpose techniques and representations from multivariate statistics, augmented with simple assumptions reflecting the inherently incremental nature of speech understanding. The contribution of our study is to show that special-purpose mechanisms are not necessary in order to capture the general pattern of the experimental results, and that a range of semantic context effects can be subsumed under the same principled account.
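The core distributional idea can be sketched in a few lines: facilitation from a prime is approximated by the similarity of co-occurrence probability vectors, so a related prime yields a higher score than an unrelated one. The vectors below are invented for illustration, and cosine similarity stands in for whatever comparison the actual model uses.

```python
# Toy distributional priming sketch: compare co-occurrence probability
# vectors of prime and target; higher similarity ~ more facilitation.
import math

# Invented co-occurrence probability distributions over four context words.
vectors = {
    "doctor": [0.40, 0.40, 0.10, 0.10],
    "nurse":  [0.35, 0.45, 0.10, 0.10],
    "carrot": [0.05, 0.05, 0.50, 0.40],
}

def cosine(p, q):
    """Cosine similarity between two distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm

related = cosine(vectors["doctor"], vectors["nurse"])      # related prime
unrelated = cosine(vectors["doctor"], vectors["carrot"])   # unrelated prime
print(related > unrelated)  # True: the related prime facilitates more
```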

    Viewing morphology as an inference process

    Morphology is the area of linguistics concerned with the internal structure of words. Information retrieval has generally not paid much attention to word structure, other than to account for some of the variability in word forms via the use of stemmers. We report on our experiments to determine the importance of morphology, and the effect that it has on performance. We found that grouping morphological variants makes a significant improvement in retrieval performance. Improvements are seen by grouping inflectional as well as derivational variants. We also found that performance was enhanced by recognizing lexical phrases. We describe the interaction between morphology and lexical ambiguity, and how resolving that ambiguity will lead to further improvements in performance.
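The grouping of variants that drives the retrieval improvement can be sketched with a crude suffix stripper. The suffix list below is an invented illustration, far simpler than the stemming the paper evaluates; the point is only that inflectional and derivational variants conflate to one index key.

```python
# Crude illustration of conflating morphological variants for retrieval.
# The suffix list is invented and much weaker than a real stemmer.
SUFFIXES = ["ations", "ation", "ings", "ing", "ers", "er", "es", "s", "ed"]

def stem(word):
    """Strip the longest matching suffix, keeping a stem of >= 3 letters."""
    for suf in SUFFIXES:  # list is ordered longest-first
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

# Inflectional and derivational variants collapse to one index key, so a
# query for "computing" also matches documents containing the others.
words = ["computing", "computers", "computes", "computed"]
groups = {stem(w) for w in words}
print(groups)  # {'comput'}
```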

    Label Imputation for Homograph Disambiguation: Theoretical and Practical Approaches

    This dissertation presents the first implementation of label imputation for the task of homograph disambiguation using 1) transcribed audio, and 2) parallel, or translated, corpora. For label imputation from parallel corpora, a hypothesis of interlingual alignment between homograph pronunciations and text word forms is developed and formalized. Both audio-based and parallel-corpus-based label imputation techniques are tested empirically in experiments that compare homograph disambiguation model performance using: 1) hand-labeled training data, and 2) hand-labeled training data augmented with label-imputed data. Regularized multinomial logistic regression and pre-trained ALBERT, BERT, and XLNet language models fine-tuned as token classifiers are developed for homograph disambiguation. Model performance after training on parallel-corpus-based, label-imputed augmented data shows improvement over training on hand-labeled data alone in low-prevalence classes. Four homograph disambiguation data sets generated during the work on the dissertation are made available to the research community. In addition, this dissertation offers a novel typology of homographs with practical implications for both the label imputation process and homograph disambiguation.
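The parallel-corpus idea rests on the observation that a translation often disambiguates a homograph's pronunciation, so aligned text can supply labels for otherwise unlabelled sentences. The sketch below is an assumed, simplified illustration: the sentence pairs, the translation-to-pronunciation table, and the word-level lookup (rather than a real alignment model) are all invented.

```python
# Hedged sketch of label imputation from a parallel corpus for the
# English homograph "bass". The translation->label table is invented.
TRANSLATION_LABELS = {"lubina": "/bæs/", "bajo": "/beɪs/"}

# Invented English/Spanish sentence pairs containing the homograph.
parallel = [
    ("He caught a bass in the lake", "Pescó una lubina en el lago"),
    ("She plays bass in a band", "Ella toca el bajo en una banda"),
]

def impute_label(translation):
    """Impute a pronunciation label from words in the aligned translation."""
    for word in translation.lower().split():
        if word in TRANSLATION_LABELS:
            return TRANSLATION_LABELS[word]
    return None  # translation does not disambiguate

labels = [impute_label(es) for _, es in parallel]
print(labels)  # ['/bæs/', '/beɪs/']
```

Sentences labelled this way would then augment the hand-labeled training data for the downstream classifiers.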

    Towards the ontology-based approach for factual information matching

    Factual information is information based on facts or relating to facts. The reliability of automatically extracted facts is the main problem of processing factual information. Fact retrieval systems remain among the most effective tools for identifying information for decision-making. In this work, we explore how natural language processing methods and a problem-domain ontology can help to check facts automatically for contradictions and mismatches.
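One way an ontology can flag contradictions is through constraints on its relations, e.g. declaring a relation functional (at most one object per subject). The sketch below assumes such a constraint set; the relation names, constraint mechanism, and facts are all invented for illustration and are not the paper's actual method.

```python
# Sketch of ontology-guided contradiction checking: relations declared
# functional may have at most one object per subject. All names invented.
FUNCTIONAL = {"born_in", "capital_of"}  # hypothetical ontology constraints

facts = [
    ("Ada Lovelace", "born_in", "London"),
    ("Ada Lovelace", "born_in", "Paris"),   # contradicts the fact above
    ("Ada Lovelace", "field", "mathematics"),
]

def find_contradictions(facts):
    """Return (subject, relation) conflicts for functional relations."""
    seen = {}       # (subject, relation) -> first object seen
    conflicts = []
    for subj, rel, obj in facts:
        if rel not in FUNCTIONAL:
            continue  # unconstrained relation, nothing to check
        key = (subj, rel)
        if key in seen and seen[key] != obj:
            conflicts.append((key, seen[key], obj))
        seen.setdefault(key, obj)
    return conflicts

print(find_contradictions(facts))
```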

    The effect of printed word attributes on Arabic reading

    Printed Arabic texts usually contain no short vowels, and therefore a single letter string can often be associated with two or more distinct pronunciations and meanings. This high level of homography is believed to present difficulties for the skilled reader. However, this is the first study to gather empirical evidence on what readers know about the different words that can be associated with each homograph. There are few studies of the effects of psycholinguistic variables on Arabic word naming and lexical decision. The present work therefore involved the creation of a database of 1,474 unvowelised letter strings, which was used to undertake four studies. The first study presented lists of unvowelised letter strings and asked participants to produce the one or more word forms (with short vowels) evoked by each target. Responses to 1,474 items were recorded from 445 adult speakers of Arabic. The number of different vowelised forms associated with each letter string and the percentage agreement were calculated. The second study collected subjective Age-of-Acquisition ratings from 89 different participants for the agreed vowelised form of each letter string. The third study asked 38 participants to produce pronunciation responses to 1,474 letter strings. Finally, 40 different participants were asked to produce lexical decisions to 1,352 letter strings and 1,352 matched non-word letter strings. Mixed-effects models showed that orthographic frequency, Age-of-Acquisition and name agreement influenced word naming, while lexical decision was not affected by name agreement. Findings indicate that lexical decision in Arabic requires recognition of a basic shared morphemic structure, whereas word naming requires identification of a unique phonological representation. It takes longer to name a word when there are more possible pronunciations. The Age-of-Acquisition effect is consistent with a developmental theory of reading.
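The name-agreement measures from the first study reduce to a simple tally per letter string: the number of distinct vowelised responses, and the percentage of participants producing the modal form. The responses below are invented stand-ins, not items from the actual database.

```python
# Sketch of the name-agreement computation: distinct vowelised forms
# and percentage agreement on the modal form. Responses are invented.
from collections import Counter

# Hypothetical vowelised responses from five participants for one
# unvowelised letter string.
responses = ["kataba", "kutiba", "kataba", "kataba", "kutub"]

counts = Counter(responses)
n_forms = len(counts)                                     # distinct forms
agreement = 100 * max(counts.values()) / len(responses)   # % modal agreement

print(n_forms, agreement)  # 3 60.0
```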

    The role of short vowels and context in the reading of Arabic: comprehension and word recognition of highly skilled readers

    The purpose of this study was to investigate the role of short vowels in the reading of Arabic by skilled adult Arab readers. Previous studies claimed that the presence of short vowels (and diacritics) has a facilitative role in the reading of Arabic: adding short vowels to the consonants facilitates the reading comprehension and reading accuracy of both children and skilled adult Arab readers. Further, those studies claimed that the absence of short vowels (and diacritics) and of context makes reading Arabic impossible. But these studies did not manipulate the short vowels and diacritics to the degree that would isolate the short-vowel effect, nor did they take into account the level of reading involved: text, sentence, and word. That is, on the text level, assessing the role of short vowels should take word frequency into account; on the sentence level, the structure of the sentence (garden-path versus non-garden-path); and on the word level, the type of word (homographic versus non-homographic). Thus, the study described in the following pages was designed with three tasks to assess the role of short vowels in relation to each level: text frequency, garden-path structure, and the homography of the word. In general, the results showed that the presence or absence of short vowels and diacritics in combination does not affect the reading process, comprehension, or accuracy of skilled adult Arab readers. However, only in a word-naming task did the absence of short vowels and context prevent the skilled adult Arab reader from choosing the right form of a heterophonic homographic word. Further, according to the findings, in the absence of short vowels and diacritics in combination, the role of context in Arabic is still limited to heterophonic homographic words. In sum, the results demonstrated that the only variable that affects the reading process of skilled adult Arab readers is word frequency. Justification for these effects and recommendations for pedagogical purposes and future research are suggested.