24 research outputs found

    Prototype categorisation and the emergence of a lexicon in an infinite world

    One of the least understood issues in language evolution is how hominins were able to ground and establish a shared lexicon. Recently, researchers have explored this issue using a variety of computational models, whose results suggest that a shared lexicon could have emerged spontaneously through a process of self-organisation. However, these models have relied on concept representations with little grounding in psychological theory and on an oversimplified environment. In this dissertation, I present a new computational model that attempts to address these problems. Agents' category representations are inspired by prototype theory, having central members and graded membership. The environment consists of an infinite number of objects and has a probabilistic structure that can be easily manipulated through model parameters. Despite the relatively complex model, simulation results are generally in line with previous ones and add further support to the self-organisation hypothesis. In addition, the speed and level of lexical convergence depend on the structure of the world, confirming that this is an aspect of past models which has received too little attention. Future work should investigate the vast parameter space in further detail and extend the simulations in various new directions
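    The abstract gives no implementation details, so the following is only a minimal, hypothetical sketch of one round of a naming-game-style simulation with prototype-based categories; every name, threshold and update rule below is an illustrative assumption, not the dissertation's actual model.

    import math
    import random

    # Hypothetical sketch only: agents hold prototype-based categories (a central
    # point plus graded membership that falls off with distance) and align their
    # words through repeated pairwise interactions. Parameter values are arbitrary.

    class Agent:
        def __init__(self):
            self.categories = []  # each: {"prototype": [floats], "word": str}

        def categorise(self, obj):
            """Return the closest category if the object's graded membership is
            high enough, otherwise create a new category around the object."""
            if self.categories:
                best = min(self.categories,
                           key=lambda c: math.dist(c["prototype"], obj))
                membership = math.exp(-math.dist(best["prototype"], obj))
                if membership > 0.5:  # arbitrary membership threshold
                    return best
            new = {"prototype": list(obj), "word": random_word()}
            self.categories.append(new)
            return new

    def random_word(length=4):
        return "".join(random.choice("bdgklmnprstvz") for _ in range(length))

    def play_round(speaker, hearer, world_dim=3):
        """The speaker names a randomly generated object; the hearer adopts the
        speaker's word for its own matching category (a crude alignment rule)."""
        obj = [random.random() for _ in range(world_dim)]  # object from an unbounded world
        word = speaker.categorise(obj)["word"]
        hearer.categorise(obj)["word"] = word

    agents = [Agent() for _ in range(10)]
    for _ in range(1000):
        speaker, hearer = random.sample(agents, 2)
        play_round(speaker, hearer)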

    Role of language in conceptual coordination

    Concepts are located within individual minds, whereas word forms are shared across entire language communities; even so, words and concepts are normally deemed to be tightly bound. In fact, at least to the extent that concepts vary, the relationship between words and concepts may not be as uniform or stable as is often assumed. Language may, however, itself mediate that relationship, through its entrenchment and use. Psychologists have already investigated language use in referential communication, but they have yet to focus in detail on the role of language in conceptual coordination. One of the obstacles has been the theoretical and methodological challenges that arise from seriously abandoning conceptual universals. To address this, an experimental framework was developed, based on sorting tasks in which participants freely partition a set of stimuli into categories, together with an objective measure for comparing two sorting outputs. Four experiments were then conducted to investigate whether people were conceptually coordinated before, during and after linguistic interaction. Experiment 1 was a cross-linguistic study of default coordination between native speakers. Participants both sorted items into groups and named them individually. There was a relatively high degree of categorisation agreement among speakers of the same language, but not nearly as high as the naming agreement. Experiments 2-4 investigated conceptual coordination during or immediately after linguistic interaction. Experimental manipulations involved the form of language use (full dialogue or only category labels) and the type of feedback (category groupings, labels, both, or neither). In particular, Experiment 2 investigated the effects of categorising a set of objects together, with or without dialogue, on subsequent individual categorisation. The results were inconclusive and revealed specific methodological issues, but yielded interesting data and were encouraging for the general framework. Experiment 3 modified the design while testing and extending the same general hypotheses. Participants carried out a sequence of categorisation tasks in which they tried to coordinate their categories, followed by individual categorisation and similarity tasks. The availability of dialogue and feedback was manipulated in the interactive tasks; during interaction, participants also received both kinds of feedback, except in the control condition. Pairs that could talk coordinated much better than the others, but feedback did not help. Experiment 4 examined the effects of the four feedback possibilities over a longer sequence of interactive tasks. In general, conceptual coordination was found to depend on grouping feedback only; however, by the end of the task, pairs who received both kinds of feedback did best. All three interactive experiments also measured lexical convergence between pairs. The results generally revealed a dissociation, with lexical alignment showing more convergence and occurring under a wider variety of conditions. Together with previous research, these findings show that language can bring about conceptual coordination. However, it appears that the richer the form of language use, the more conceptual convergence occurs and the more closely it becomes coupled with lexical convergence. The long-term effects, if any, are much weaker. These studies have implications for the general role of language in cognition and other important issues
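    The abstract mentions an objective measure for comparing two sorting outputs without naming it; the sketch below uses the Rand index, one common way to score agreement between two partitions of the same stimulus set, purely as an illustration of what such a measure can look like.

    from itertools import combinations

    # Illustration only: the measure itself is not named in the abstract, so the
    # Rand index is used here as one standard way to compare two partitions.

    def rand_index(sorting_a, sorting_b):
        """sorting_a and sorting_b map each stimulus to a category label."""
        pairs = list(combinations(sorting_a, 2))
        agree = sum(
            (sorting_a[x] == sorting_a[y]) == (sorting_b[x] == sorting_b[y])
            for x, y in pairs
        )
        return agree / len(pairs)

    # Two participants sorting the same six stimuli (invented data):
    p1 = {"owl": "birds", "crow": "birds", "bat": "mammals",
          "dog": "mammals", "eel": "fish", "cod": "fish"}
    p2 = {"owl": "flying", "crow": "flying", "bat": "flying",
          "dog": "ground", "eel": "water", "cod": "water"}
    print(rand_index(p1, p2))  # 1.0 would mean identical partitions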

    Molecular basis of FIR-mediated c-myc transcriptional control

    The far upstream element (FUSE) regulatory system promotes a peak in the concentration of c-Myc during the cell cycle. First, the FBP transcriptional activator binds to the FUSE DNA element upstream of the c-myc promoter. Then, FBP recruits its specific repressor (FIR), which acts as an on/off transcriptional switch. Here we describe the molecular basis of FIR recruitment, showing that the tandem RNA recognition motifs of FIR provide a platform for independent FUSE DNA and FBP protein binding and explaining the structural basis of the reversibility of the FBP-FIR interaction. We also show that the physical coupling between FBP and FIR is modulated by a flexible linker positioned sequentially to the recruiting element. Our data explain how the FUSE system precisely regulates c-myc transcription and suggest that a small change in FBP-FIR affinity leads to a substantial effect on c-Myc concentration. MRC Grant-in-aid U11757455

    CORDEX inflectional lookup data 1.0

    No full text
    The inflectional data lookup module serves as an optional component within the cordex library (https://github.com/clarinsi/cordex/) that significantly improves the quality of its results. The module consists of a pickled dictionary of 111,660 lemmas and maps these lemmas to their corresponding word forms. Each word form in the dictionary is accompanied by its MULTEXT-East morphosyntactic descriptions, relevant features (custom features extracted from the morphosyntactic descriptions with the help of https://gitea.cjvt.si/generic/conversion_utils), and its frequency within the Gigafida 2.0 corpus (http://hdl.handle.net/11356/1320), or Gigafida 1.0 when this information is unavailable. The dictionary is used to select the most frequent word form of a lemma that satisfies additional filtering conditions (e.g. find the most frequent word form of the lemma "centralen" in the singular, i.e. "centralni")
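    The exact schema of the pickled dictionary is not documented here, so the following sketch assumes a plausible layout merely to illustrate how the most frequent word form satisfying a filter could be selected; the field names, MSD tags, frequencies and file name are invented.

    import pickle

    # The real dictionary's schema is not documented in this entry; the layout
    # below (form, msd, features, frequency) and the file name are assumptions
    # made purely to illustrate the lookup described above.
    # lookup = pickle.load(open("inflectional_lookup.pkl", "rb"))

    def most_frequent_form(lookup, lemma, required_features):
        """Return the most frequent word form of `lemma` whose features include
        every key/value pair in `required_features`, or None if there is none."""
        candidates = [
            entry for entry in lookup.get(lemma, [])
            if all(entry["features"].get(k) == v for k, v in required_features.items())
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda entry: entry["frequency"])["form"]

    # Invented entry mirroring the example in the description:
    lookup = {
        "centralen": [
            {"form": "centralni", "msd": "Agpmsny",       # illustrative MSD tags
             "features": {"number": "singular"}, "frequency": 10234},
            {"form": "centralnih", "msd": "Agpmpg",
             "features": {"number": "plural"}, "frequency": 3121},
        ]
    }
    print(most_frequent_form(lookup, "centralen", {"number": "singular"}))  # centralni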

    Morphological patterns from the Sloleks 2.0 lexicon 1.0

    No full text
    This entry consists of XML files with 96,290 lexical units (nouns, verbs, adjectives, and adverbs) from the Sloleks Morphological Lexicon of Slovene 2.0 (http://hdl.handle.net/11356/1230) that include codes for morphological patterns. The pattern codes were designed based on a manual analysis of automatically extracted paradigms and were obtained as follows: The lexical units from Sloleks 2.0 were first automatically clustered into groups through a rule-based approach based on (1) a number of predetermined grammatical features from the MULTEXT-East Version 6 morphosyntactic specifications for Slovenian (http://nl.ijs.si/ME/V6/), such as part of speech, gender and properness for nouns, aspect for verbs, and (2) the differentiating characteristics of their morphological paradigms (i.e. their mutable word parts, which are similar to but not always overlapping with the linguistic definition of word endings – for example: čas-Ø; čas-a; čas-om / prijatelj-Ø; prijatelj-a; prijatelj-em / odstot-ek; odstot-ka; odstot-kom). More than 1,000 automatically extracted pattern candidates were subsequently linguistically analyzed, combined into groups, and hierarchically organized. As a result, every lexical unit in the XML file features a code (listed as ) corresponding to the relevant morphological paradigm in the hierarchy (available in the accompanying file titled "nssss_morphological_pattern_hierarchy_1.0.tsv"). Because the patterns were extracted from Sloleks 2.0, they reflect the decisions that were implemented in its initial compilation, particularly in terms of the degree of morphological variation documented in the lexicon (e.g. not all morphological variants are necessarily included in the lexicon) and paradigm integrity (for instance, some nouns in Sloleks 2.0 only feature singular or plural forms). It should be noted that non-standard word forms were not included in the design of the patterns. In addition, the XML file does not contain lexical units from Sloleks 2.0 that consist of word forms from more than one morphological paradigm (e.g. lesketati – lesketam / leskečem; or lojen – lojenega / lojnega), or other problematic units (such as those with missing or erroneous data)
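    As a toy illustration of the mutable word parts mentioned above (čas-Ø, čas-a, čas-om; odstot-ek, odstot-ka, odstot-kom), the sketch below splits a paradigm into a shared stem and its endings; this is not the rule-based clustering procedure actually used to build the patterns.

    import os

    # Toy sketch only: split a paradigm into a common stem and its endings, in the
    # spirit of the segmentation examples above; the real pattern extraction is a
    # more elaborate rule-based procedure over grammatical features as well.

    def stem_and_endings(forms):
        """Return (stem, endings): the longest common prefix of all forms and the
        remainders, with an empty ending written as the null morpheme Ø."""
        stem = os.path.commonprefix(forms)
        endings = [form[len(stem):] or "Ø" for form in forms]
        return stem, endings

    print(stem_and_endings(["čas", "časa", "časom"]))                # ('čas', ['Ø', 'a', 'om'])
    print(stem_and_endings(["odstotek", "odstotka", "odstotkom"]))   # ('odstot', ['ek', 'ka', 'kom'])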

    Slovenian datasets for contextual synonym and antonym detection

    No full text
    The Slovenian datasets for contextual synonym and antonym detection can be used for training machine learning classifiers, as described in the MSc thesis of Jasmina Pegan, "Semantic detection of synonyms and antonyms with contextual embeddings" (https://repozitorij.uni-lj.si/IzpisGradiva.php?id=141456). The datasets contain example pairs of synonyms and antonyms in context, together with additional information on each sense pair. Candidates for synonyms and antonyms were retrieved from the dataset created in the BSc thesis of Jasmina Pegan, "Antonym detection with word embeddings" (https://repozitorij.uni-lj.si/IzpisGradiva.php?id=110533). Example sentences were retrieved from the Comprehensive Slovenian-Hungarian Dictionary (VSMS) (https://www.clarin.si/repository/xmlui/handle/11356/1453). Each dataset is class-balanced and contains an equal number of examples and counterexamples. An example is a pair of example sentences in which the two words are synonyms/antonyms; a counterexample is a pair of example sentences in which the two words are not. Note that such a word pair can still be synonymous or antonymous in some sense of the two words, just not in the given context. The datasets are divided into two categories, datasets for synonyms and datasets for antonyms. Each category is further divided into base and updated datasets, each of which contains three files: a train, a validation and a test dataset. The base datasets include only manually reviewed sense pairs; these are generated from all pairs of VSMS sense examples for all confirmed pairs of antonym and synonym senses. The updated datasets include automatically generated sense pairs while constraining the maximal number of examples per word. In this way, the dataset is more balanced word-wise, but it is not fully manually reviewed and contains less accurate data. A single dataset entry contains information on the base word, followed by data on the synonym/antonym candidate; the last column indicates whether the sense pair is a pair of synonyms/antonyms or not. More details can be found in the included README file
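    The full column layout is documented in the README; the sketch below relies only on the convention stated above (the last column carries the synonym/antonym label) and assumes tab-separated files, so the delimiter and the file name in the commented call are guesses.

    import csv

    # The README documents the exact columns; the only convention relied on here is
    # that the last column carries the synonym/antonym label. The tab delimiter and
    # the file name in the commented call are assumptions.

    def load_by_label(path):
        """Group dataset rows by the value in their last (label) column."""
        groups = {}
        with open(path, encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                groups.setdefault(row[-1].strip(), []).append(row)
        return groups

    # groups = load_by_label("synonyms_base_train.tsv")  # hypothetical file name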

    Developmental corpus ccŠolar 1.0

    No full text
    The ccŠolar corpus contains 1693 texts collected between 2016 and 2018 as part of the project to upgrade the Šolar corpus. The aims of the project were to increase the size of the Šolar 1.0 corpus and to improve text balance across regions and education levels. For each text, information on the school (elementary or secondary), subject, level (grade or year), type of text, region and date of production is provided. The ccŠolar 1.0 corpus is offered separately because the new texts were collected under the CC BY 4.0 licence, which is more open than the licence of the earlier texts

    Frequency lists of collocations from the Gigafida 2.1 corpus

    No full text
    Frequency lists of collocations were extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/noske/run.cgi/corp_info?corpname=gfida21) using specialised scripts for extracting data from syntactically parsed corpora. The lists contain collocations with an absolute frequency of 10 or above, split into files corresponding to 81 predefined syntactic structures. A formal description of the syntactic structures, with information on the restrictions and representations applied to POS and dependency-parsing annotations, is included in the dataset. The lists are sorted by the absolute frequency of the collocations and include frequency information on the individual lemmas, together with the most frequent representative forms of the combined lemmas. The lists also include the logDice score for each collocation and the number of distinct forms of the lemmas appearing in corpus hits for that collocation
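    The logDice score mentioned above is a standard association measure for collocations (Rychlý 2008), defined as 14 + log2(2·f(x,y) / (f(x) + f(y))); the small sketch below computes it with made-up corpus counts, since the actual extraction scripts are not part of this entry.

    import math

    # logDice as commonly defined for collocations (Rychlý 2008):
    #   14 + log2(2 * f_xy / (f_x + f_y))
    # where f_xy is the frequency of the collocation and f_x, f_y the frequencies
    # of its component lemmas. The counts below are made up for illustration.

    def log_dice(f_xy, f_x, f_y):
        return 14 + math.log2(2 * f_xy / (f_x + f_y))

    print(round(log_dice(f_xy=150, f_x=4200, f_y=900), 2))  # ≈ 9.91 for these counts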

    Developmental corpus of Slovene (without language corrections) Šolar-Clear

    No full text
    Šolar-Clear is an adapted version of the Šolar 1.0 corpus, cf. http://hdl.handle.net/11356/1036. The Šolar(-Clear) corpus consists of texts written by students in Slovene primary and secondary schools. School essays form the majority of the corpus (64.2%), while other material includes texts created during lessons, such as text recapitulations or descriptions, examples of formal applications, etc. Unlike the original Šolar corpus, Šolar-Clear only includes the student texts; language corrections and other types of feedback from the teachers are not included. The corpus can thus be used for processing tasks where the inclusion of corrections hinders or complicates the procedures (e.g. comparative data extraction, training of language models, etc.)

    Thesaurus of Modern Slovene 1.0

    No full text
    This is an automatically created thesaurus of Slovene, built from Slovene data available in a comprehensive English-Slovenian dictionary, a monolingual dictionary, and a corpus. A network analysis of the word co-occurrence graph derived from the bilingual dictionary was used, together with additional information from the distributional thesaurus data available as part of the Sketch Engine tool (extracted from the 1.2-billion-word Gigafida corpus) and from the monolingual dictionary
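    As a rough illustration of the co-occurrence-graph idea behind such a thesaurus, the sketch below links Slovene translations that share an English headword and weights the links by the number of shared pivots; the data are invented, and the real method combines this with further network analysis and distributional information.

    from collections import defaultdict
    from itertools import combinations

    # Rough illustration of one ingredient of such a thesaurus: Slovene translations
    # that share an English headword become linked candidate synonyms, weighted by
    # the number of shared pivots. The data are invented and the actual method adds
    # network analysis and distributional information on top of this.

    translations = {            # English headword -> Slovene translations
        "house": ["hiša", "dom"],
        "home": ["dom", "domovanje"],
        "building": ["zgradba", "stavba", "hiša"],
    }

    edges = defaultdict(int)
    for slovene_words in translations.values():
        for a, b in combinations(sorted(slovene_words), 2):
            edges[(a, b)] += 1

    for (a, b), weight in sorted(edges.items(), key=lambda kv: -kv[1]):
        print(f"{a} -- {b} (shared English pivots: {weight})")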