
    Measuring Thematic Fit with Distributional Feature Overlap

    In this paper, we introduce a new distributional method for modeling predicate-argument thematic fit judgments. We use a syntax-based DSM to build a prototypical representation of verb-specific roles: for every verb, we extract the most salient second-order contexts for each of its roles (i.e. the most salient dimensions of typical role fillers), and then compute thematic fit as a weighted overlap between the top features of candidate fillers and role prototypes. Our experiments show that our method consistently outperforms a baseline re-implementing a state-of-the-art system, and achieves better or comparable results to those reported in the literature for the other unsupervised systems. Moreover, it provides an explicit representation of the features characterizing verb-specific semantic roles.
    Comment: 9 pages, 2 figures, 5 tables, EMNLP 2017. Keywords: thematic fit, selectional preference, semantic role, DSMs, Distributional Semantic Models, Vector Space Models, VSMs, cosine, APSyn, similarity, prototyp
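
    A minimal sketch of the weighted-overlap idea, in the spirit of APSyn-style rank-weighted feature overlap; the toy feature weights and function names are illustrative assumptions, not the paper's exact formulation:

        # Hedged sketch: thematic fit as weighted overlap of top-ranked
        # features, in the spirit of APSyn. All data are toy assumptions.

        def top_features(weights, k):
            """Return the k highest-weighted features, best first."""
            return [f for f, _ in sorted(weights.items(), key=lambda kv: -kv[1])[:k]]

        def weighted_overlap(role_prototype, candidate, k=20):
            """Shared top-k features score the inverse of their average rank."""
            p = top_features(role_prototype, k)
            c = top_features(candidate, k)
            return sum(1.0 / ((p.index(f) + c.index(f)) / 2 + 1)
                       for f in set(p) & set(c))

        # Toy prototype for the patient role of "eat" vs. a candidate filler.
        prototype = {"food": 9.1, "edible": 7.4, "fresh": 6.2, "cooked": 5.0}
        filler    = {"food": 8.0, "cooked": 6.5, "sweet": 4.1, "edible": 3.3}
        print(weighted_overlap(prototype, filler, k=4))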

    Quantifying the dynamics of topical fluctuations in language

    The availability of large diachronic corpora has provided the impetus for a growing body of quantitative research on language evolution and meaning change. The central quantities in this research are token frequencies of linguistic elements in texts, with changes in frequency taken to reflect the popularity or selective fitness of an element. However, corpus frequencies may change for a wide variety of reasons, including purely random sampling effects, or because corpora are composed of contemporary media and fiction texts within which the underlying topics ebb and flow with cultural and socio-political trends. In this work, we introduce a simple model for controlling for topical fluctuations in corpora - the topical-cultural advection model - and demonstrate how it provides a robust baseline of variability in word frequency changes over time. We validate the model on a diachronic corpus spanning two centuries, and a carefully controlled artificial language change scenario, and then use it to correct for topical fluctuations in historical time series. Finally, we use the model to show that the emergence of new words typically corresponds with the rise of a trending topic. This suggests that some lexical innovations occur due to growing communicative need in a subspace of the lexicon, and that the topical-cultural advection model can be used to quantify this.
    Comment: Code to run the analyses described in this paper is now available at https://github.com/andreskarjus/topical_cultural_advection_model . A previous shorter version of this paper outlining the basic model appeared as an extended abstract in the proceedings of the Society for Computation in Linguistics (Karjus et al. 2018, Topical advection as a baseline model for corpus-based lexical dynamics).
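
    A minimal sketch of the advection idea as described above: the weighted mean log frequency change of a word's strongest context words, with PPMI-style weights. All numbers here are toy assumptions; the authors' actual implementation lives in the linked repository:

        # Hedged sketch of the advection baseline: a word's expected
        # frequency change is the weighted mean log frequency change of
        # its strongest context words. All numbers are toy assumptions.
        import math

        def advection(context_weights, freq_then, freq_now):
            """PPMI-weighted mean log change of the word's context words."""
            total = sum(context_weights.values())
            return sum(w * math.log(freq_now[c] / freq_then[c])
                       for c, w in context_weights.items()) / total

        ctx  = {"engine": 3.2, "steam": 2.5, "railway": 1.8}   # PPMI weights
        then = {"engine": 120, "steam": 300, "railway": 80}    # per million
        now  = {"engine": 150, "steam": 180, "railway": 95}
        print(advection(ctx, then, now))  # > 0 means the topic is rising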

    Extracting common sense knowledge via triple ranking using supervised and unsupervised distributional models

    Jebbara S, Basile V, Cabrio E, Cimiano P. Extracting common sense knowledge via triple ranking using supervised and unsupervised distributional models. Semantic Web. 2019;10(1):139-158.
    In this paper we are concerned with developing information extraction models that support the extraction of common sense knowledge from a combination of unstructured and semi-structured datasets. Our motivation is to extract manipulation-relevant knowledge that can support robots' action planning. We frame the task as a relation extraction task and, as proof-of-concept, validate our method on the task of extracting two types of relations: locative and instrumental relations. The locative relation relates objects to the prototypical places where the given object is found or stored; the instrumental relation relates objects to their prototypical purpose of use. While we extract these relations from text, our goal is not to extract specific textual mentions but rather, given an object as input, to produce a list of locations and uses ranked by `prototypicality'. We use distributional methods in embedding space, relying on the well-known skip-gram model to embed words into a low-dimensional distributional space and using cosine similarity to rank the various candidates. In addition, we present experiments that rely on the vector space model NASARI, which computes embeddings for disambiguated concepts and is thus semantically aware. While this distributional approach has been published before, we extend our framework with additional methods relying on neural networks that learn a score judging whether a given candidate pair actually expresses the desired relation. The network thus learns a scoring function using a supervised approach. While we use a ranking-based evaluation, the supervised model is trained on a binary classification task. The resulting score from the neural network and the cosine similarity of the distributional approach are both used to compute a ranking. We compare the different approaches, and parameterizations thereof, on the task of extracting the above-mentioned relations. We show that the distributional similarity approach performs very well on the task: the best-performing parameterization achieves an NDCG of 0.913, a Precision@1 of 0.400 and a Precision@3 of 0.423. The performance of the supervised learning approach, despite having been trained on positive and negative examples of the relation in question, is not as good as expected: it achieves an NDCG of 0.908, a Precision@1 of 0.454 and a Precision@3 of 0.387.
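
    A minimal sketch of the unsupervised ranking step: score candidate locations for an object by cosine similarity in embedding space. Random vectors stand in for a trained skip-gram model; the vocabulary and function names are assumptions:

        # Hedged sketch: rank candidate locations for an object by cosine
        # similarity in embedding space. Random vectors stand in for a
        # trained skip-gram model; names and sizes are assumptions.
        import numpy as np

        def cosine(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

        def rank_candidates(obj_vec, candidates):
            """Return (location, score) pairs, most prototypical first."""
            scored = [(w, cosine(obj_vec, v)) for w, v in candidates.items()]
            return sorted(scored, key=lambda x: -x[1])

        rng = np.random.default_rng(0)
        emb = {w: rng.normal(size=50)
               for w in ["cup", "kitchen", "garage", "shelf"]}
        locations = {w: emb[w] for w in ["kitchen", "garage", "shelf"]}
        print(rank_candidates(emb["cup"], locations))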

    Competition, selection and communicative need in language change: an investigation using corpora, computational modelling and experimentation

    Constant change is one of the few truly universal cross-linguistic properties of living languages. In this thesis I focus on lexical change, and ask why the introduction and spread of some words leads to competition and eventual extinction of words with similar functions, while in other cases semantically similar words are able to companionably co-exist for decades. I start out by using extensive computational simulations to evaluate a recently published method for differentiating selection and drift in language change. While I conclude this particular method still requires improvement to be reliably applicable to historical corpus data, my findings suggest that the approach in general, when properly evaluated, could have considerable future potential for better understanding the interplay of drift, selection and therefore competition in language change. In a series of corpus studies, I argue that the communicative needs of speakers play a significant role in how languages change, as they continue to be moulded to meet the needs of linguistic communities. I developed and evaluated computational methods for inferring a number of linguistic processes – changes in communicative need, competition between lexical items, and changes in colexification – directly from diachronic corpus data. Applying these new methods to massive historical corpora of multiple languages spanning several centuries, I show that communicative need modulates the outcome of competition between lexical items, and the colexification of concepts in semantic subspaces. I also conducted an experiment in the form of a dyadic artificial language communication game, the results of which demonstrate how speakers adapt their lexicons to the communicative needs of the situation. This combination of methods allows me to link actions of individual speakers at short timescales to population-level findings in large corpora at historical timescales, in order to show that language change is driven by communicative need.
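
    A minimal sketch of the kind of null model such drift-vs-selection simulations build on: Wright-Fisher-style resampling of a variant's frequency, where s = 0 gives pure drift and s > 0 selection. Parameters are illustrative assumptions, not the thesis's actual setup:

        # Hedged sketch: Wright-Fisher-style resampling of a variant's
        # frequency, the usual null model for separating drift (s = 0)
        # from selection (s > 0). Parameters are illustrative assumptions.
        import random

        def simulate(freq=0.5, pop=1000, gens=200, s=0.0):
            """Trajectory of variant frequency across generations."""
            traj = [freq]
            for _ in range(gens):
                w = freq * (1 + s)                 # selective advantage
                p = w / (w + (1 - freq))           # biased sampling prob.
                freq = sum(random.random() < p for _ in range(pop)) / pop
                traj.append(freq)
                if freq in (0.0, 1.0):             # fixation / extinction
                    break
            return traj

        print(simulate(s=0.0)[-1], simulate(s=0.05)[-1])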

    Similarity Models in Distributional Semantics using Task Specific Information

    In distributional semantics, the unsupervised learning approach has been widely used for a large number of tasks, whereas supervised learning has received less attention. In this dissertation, we investigate the supervised learning approach for semantic relatedness tasks in distributional semantics. The investigation considers mainly semantic similarity and semantic classification tasks. Existing and newly constructed datasets are used as input for the experiments. The new datasets are constructed from thesauruses such as Eurovoc, a multilingual thesaurus maintained by the Publications Office of the European Union. The meaning of the words in the datasets is represented using a distributional semantic approach, which collects co-occurrence information from large texts and represents words as high-dimensional vectors. English words are represented using the ukWaC corpus, and German words using the deWaC corpus. After representing each word as a high-dimensional vector, different supervised machine learning methods are applied to the selected tasks. The outputs of the supervised methods are evaluated by comparing task performance and accuracy with the results of state-of-the-art unsupervised machine learning methods. In addition, multi-relational matrix factorization is introduced as a supervised learning method in distributional semantics. This dissertation shows that multi-relational matrix factorization is a good alternative for integrating different sources of information about words in distributional semantics. The dissertation also introduces some new applications. One application analyzes a German company's website text and provides information about the company with a concept cloud visualization. The others are automatic recognition/disambiguation of Library of Congress Subject Headings and automatic identification of synonym relations in the Dutch Parliament thesaurus.
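
    A minimal sketch of multi-relational matrix factorization in the RESCAL spirit, where each relation matrix is approximated through a shared entity embedding; the sizes, toy data, and plain gradient loop are illustrative assumptions, not the dissertation's actual model:

        # Hedged sketch: multi-relational matrix factorization in the
        # RESCAL spirit, X_k ~ E @ W_k @ E.T with a shared entity matrix E.
        # Sizes, data, and the plain gradient loop are toy assumptions.
        import numpy as np

        rng = np.random.default_rng(0)
        n, r, k = 30, 5, 2                       # entities, rank, relations
        X = [(rng.random((n, n)) < 0.1).astype(float) for _ in range(k)]

        E = rng.normal(scale=0.1, size=(n, r))
        W = [rng.normal(scale=0.1, size=(r, r)) for _ in range(k)]
        lr = 0.01

        for _ in range(500):                     # squared-error gradient steps
            for i in range(k):
                err = E @ W[i] @ E.T - X[i]
                grad_E = err @ E @ W[i].T + err.T @ E @ W[i]
                grad_W = E.T @ err @ E
                E -= lr * grad_E / k
                W[i] -= lr * grad_W

        print(np.mean([np.abs(E @ W[i] @ E.T - X[i]).mean() for i in range(k)]))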