3,414 research outputs found

    Redefining part-of-speech classes with distributional semantic models

    This paper studies how word embeddings trained on the British National Corpus interact with part of speech boundaries. Our work targets the Universal PoS tag set, which is actively used to annotate a range of languages. We experiment with training classifiers to predict PoS tags for words from their embeddings. The results show that the information about PoS affiliation contained in the distributional vectors allows us to discover groups of words with distributional patterns that differ from other words of the same part of speech. This data often reveals hidden inconsistencies in the annotation process or guidelines. At the same time, it supports the notion of 'soft' or 'graded' part of speech affiliations. Finally, we show that information about PoS is distributed among dozens of vector components, not limited to only one or two features.

    SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation

    We present SimLex-999, a gold standard resource for evaluating distributional semantic models that improves on existing resources in several important ways. First, in contrast to gold standards such as WordSim-353 and MEN, it explicitly quantifies similarity rather than association or relatedness, so that pairs of entities that are associated but not actually similar [Freud, psychology] have a low rating. We show that, via this focus on similarity, SimLex-999 incentivizes the development of models with a different, and arguably wider, range of applications than those which reflect conceptual association. Second, SimLex-999 contains a range of concrete and abstract adjective, noun and verb pairs, together with an independent rating of concreteness and (free) association strength for each pair. This diversity enables fine-grained analyses of the performance of models on concepts of different types, and consequently greater insight into how architectures can be improved. Further, unlike existing gold standard evaluations, for which automatic approaches have reached or surpassed the inter-annotator agreement ceiling, state-of-the-art models perform well below this ceiling on SimLex-999. There is therefore plenty of scope for SimLex-999 to quantify future improvements to distributional semantic models, guiding the development of the next generation of representation-learning architectures.

    Using distributional similarity to organise biomedical terminology

    We investigate an application of distributional similarity techniques to the problem of structural organisation of biomedical terminology. Our application domain is the relatively small GENIA corpus. Using terms that have been accurately marked up by hand within the corpus, we consider the problem of automatically determining semantic proximity. Terminological units are defined for our purposes as normalised classes of individual terms. Syntactic analysis of the corpus data is carried out using the Pro3Gres parser and provides the data required to calculate distributional similarity using a variety of different measures. Evaluation is performed against a hand-crafted gold standard for this domain in the form of the GENIA ontology. We show that distributional similarity can be used to predict semantic type with a good degree of accuracy.

    The company that words keep: comparing the statistical structure of child- versus adult-directed language

    Does child-directed language differ from adult-directed language in ways that might facilitate word learning? Associative structure (the probability that a word appears with its free associates), contextual diversity, word repetitions and frequency were compared longitudinally across six language corpora, with four corpora of language directed at children aged 1;0 to 5;0, and two adult-directed corpora representing spoken and written language. Statistics were adjusted relative to shuffled corpora. Child-directed language was found to be more associative, repetitive and consistent than adult-directed language. Moreover, these statistical properties of child-directed language better predicted word acquisition than the same statistics in adult-directed language. Word frequency and repetitions were the best predictors within word classes (nouns, verbs, adjectives and function words). For all word classes combined, associative structure, contextual diversity and word repetitions best predicted language acquisition. These results support the hypothesis that child-directed language is structured in ways that facilitate language acquisition.

    Functional versus lexical: a cognitive dichotomy

    Modelling the acquisition of syntactic categories

    This research represents an attempt to model the child’s acquisition of syntactic categories. A computational model, based on the EPAM theory of perception and learning, is developed. The basic assumptions are that (1) syntactic categories are actively constructed by the child using distributional learning abilities; and (2) cognitive constraints on learning rate and memory capacity limit these learning abilities. We present simulations of the syntax acquisition of a single subject, where the model learns to build up multi-word utterances by scanning a sample of the speech addressed to the subject by his mother.

    Input and Intake in Language Acquisition

    This dissertation presents an approach for a productive way forward in the study of language acquisition, sealing the rift between claims of an innate linguistic hypothesis space and powerful domain-general statistical inference. This approach breaks language acquisition into its component parts, distinguishing the input in the environment from the intake encoded by the learner, and looking at how a statistical inference mechanism, coupled with a well-defined linguistic hypothesis space, could lead a learner to infer the grammar of their native language. This work draws on experimental work, corpus analyses and computational models of Tsez, Norwegian and English children acquiring word meanings, word classes and syntax to highlight the need for an appropriate encoding of the linguistic input in order to solve any given problem in language acquisition.

    The ‘nouniness’ of attributive adjectives and ‘verbiness’ of predicative adjectives: Evidence from phonology

    This article investigates prototypically attributive versus predicative adjectives in English in terms of the phonological properties that have been associated especially with nouns versus verbs in a substantial body of psycholinguistic research (e.g. Kelly 1992), which is often ignored in theoretical linguistic work on word classes. Inspired by Berg's (2000, 2009) 'cross-level harmony constraint', the hypothesis I test is that prototypically attributive adjectives not only align more with nouns than with verbs syntactically, semantically and pragmatically, but also phonologically, and likewise for prototypically predicative adjectives and verbs. I analyse the phonological structure of frequent adjectives from the Corpus of Contemporary American English (COCA), and show that the data do indeed support the hypothesis. Berg's 'cross-level harmony constraint' may thus apply not only to the entire word classes noun, verb and adjective, but also to these two adjectival subclasses. I discuss several theoretical issues that emerge. The facts are most readily accommodated in a usage-based model, such as Radical Construction Grammar (Croft 2001), where these adjectives are seen as forming two distinct but overlapping classes. Drawing also on recent research by Boyd & Goldberg (2011) and Hao (2015), I explore the possible nature and emergence of these classes in some detail.

    ADJECTIVISH INDONESIAN VERBS: A COGNITIVE SEMANTICS PERSPECTIVE

    There has been a deeply rooted belief that parts of speech can be discretely categorized. It is something widely accepted in linguistics. There is a tendency to take such an academic belief for granted. Therefore, it persists from time to time without its degree of empirical truth being critically examined. Those studying linguistics will sooner or later read many linguistics textbooks stating that once a word has its own category, there is no potential for the word to have another word category. Most people learning linguistics consider this to be necessarily so. This linguistic phenomenon is not just taken to be true; it comes to be taken as something conclusive. In fact, there are Indonesian verbs that behave adjectivishly. They are, to some extent, verbs, yet to another extent, they are adjectives. This is evidenced by the fact that they have the properties of adjectives. These linguistic phenomena demonstrate that there are Indonesian verbs that have a stronger quality of verbness. That is, there are Indonesian verbs that are verbier than others. Based on the data found, Indonesian transitive verbs have a higher potential to behave adjectivishly than Indonesian intransitive ones. A certain kind of Indonesian transitive verb can be treated adjectivishly. This finding shows that the degree of word category discreteness, particularly for verbs, is not clear-cut. It is possible that word categories can, to some extent, be fuzzy. The fuzzy quality can be traced to the adjectival properties attributed to the Indonesian transitive verbs. It means that categorizing word classes is not as simple as we thought before.