Prototype categorisation and the emergence of a lexicon in an infinite world
One of the least understood issues in language evolution is how hominins were able to
ground and establish a shared lexicon. Recently, researchers have explored this issue using
a variety of computational models, whose results have suggested that a shared lexicon
could have emerged spontaneously through a process of self-organisation. However,
these models have used psychologically unrecognised concept representations and an
oversimplified environment. In this dissertation, I present a new computational model
in an attempt to address these problems. Agents' category representations are inspired
by prototype theory, having central members and graded membership. The environment
consists of an infinite number of objects, and has a probabilistic structure which
can be easily manipulated through model parameters. Despite the relatively complex
model, simulation results are generally in line with previous ones and add further support
to the self-organisation hypothesis. In addition, the speed and level of lexical convergence
depend on the world structure, confirming that this is an aspect of past models
which has seen too little attention. Future work should investigate the vast parameter
space in further detail, and extend the simulations in various new directions.
Role of language in conceptual coordination
Although concepts are located within individual minds, while word forms are shared
across entire language communities, words and concepts are normally deemed to be
tightly bound. But in fact, at least to the extent that concepts vary, the relationship
between words and concepts may not be as uniform or stable as is often assumed. Nevertheless,
language may itself mediate that relationship, through its entrenchment and
use. Psychologists have already investigated language use in referential communication,
but they have yet to focus in detail on the role of language in conceptual coordination.
One of the obstacles has been the theoretical and methodological challenges that arise
from seriously abandoning conceptual universals. To that end, an experimental framework
was developed based on sorting tasks in which participants freely partition a set
of stimuli into categories and an objective measure for comparing two outputs. Four
experiments were then conducted to investigate whether people were conceptually coordinated
before, during and after linguistic interaction.
Experiment 1 consisted of a cross-linguistic study looking at default coordination between
native speakers. Participants both sorted items into groups and named them individually.
There was a relatively high degree of categorisation agreement among speakers of
the same language, but not nearly as high as for naming agreement. Experiments 2-4
inquired into conceptual coordination during or immediately after linguistic interaction.
Experimental manipulations involved the form of language use (full dialogue or only
category labels), as well as the type of feedback (category groupings, labels, both, or
neither). In particular, Experiment 2 investigated the effects of categorising a set of objects
together, with or without dialogue, on subsequent individual categorisation. The results were inconclusive and revealed specific methodological issues, but yielded interesting
data and were encouraging for the general framework. Experiment 3 modified
the design while testing and extending the same general hypotheses. Participants carried
out a sequence of categorisation tasks in which they tried to coordinate their categories,
followed by individual categorisation and similarity tasks. The availability of dialogue
and feedback was manipulated in the interactive tasks. During interaction, they also
received both kinds of feedback, except in the control condition. Pairs that could talk
coordinated much better than the others, but feedback didn't help. Experiment 4 looked
into the effects of the four possibilities for feedback during a longer sequence of interactive
tasks. In general, conceptual coordination was found to depend on grouping feedback
only. However, by the end of the task, pairs who received both kinds of feedback
did best. All three interactive experiments also measured lexical convergence between
pairs. The results generally revealed a dissociation, with lexical alignment showing more
convergence and occurring under a wider variety of conditions.
Together with previous research, these findings show that language can bring about conceptual
coordination. However, it appears that the richer the form of language use, the
more conceptual convergence occurs, and the closer it gets coupled with lexical convergence.
The long-term effects, if any, are much weaker. These studies have implications
for the general role of language in cognition and other important issues.
Molecular basis of FIR-mediated c-myc transcriptional control
The far upstream element (FUSE) regulatory system promotes a peak in the concentration of c-Myc during the cell cycle. First, the FBP transcriptional activator binds to the FUSE DNA element upstream of the c-myc promoter. Then, FBP recruits its specific repressor (FIR), which acts as an on/off transcriptional switch. Here we describe the molecular basis of FIR recruitment, showing that the tandem RNA recognition motifs of FIR provide a platform for independent FUSE DNA and FBP protein binding and explaining the structural basis of the reversibility of the FBP-FIR interaction. We also show that the physical coupling between FBP and FIR is modulated by a flexible linker positioned sequentially to the recruiting element. Our data explain how the FUSE system precisely regulates c-myc transcription and suggest that a small change in FBP-FIR affinity leads to a substantial effect on c-Myc concentration. MRC Grant-in-aid U11757455
CORDEX inflectional lookup data 1.0
The inflectional data lookup module serves as an optional component within the cordex library (https://github.com/clarinsi/cordex/) that significantly improves the quality of the results. The module consists of a pickled dictionary of 111,660 lemmas, and maps these lemmas to their corresponding word forms.
Each word form in the dictionary is accompanied by its MULTEXT-East morphosyntactic descriptions, relevant features (custom features extracted from morphosyntactic descriptions with the help of https://gitea.cjvt.si/generic/conversion_utils), and its frequency within the Gigafida 2.0 corpus (http://hdl.handle.net/11356/1320), or Gigafida 1.0 when other information is unavailable. The dictionary is used to select the most frequent word form of a lemma that satisfies additional filtering conditions (e.g. find the most frequent word form of the lemma "centralen" in the singular, i.e. "centralni").
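The selection step described above can be sketched in a few lines of Python. This is only an illustration: the layout of `LOOKUP`, the feature names, the helper `most_frequent_form`, and the frequency counts are all assumptions made for the example and are not taken from the cordex library or from the actual pickled dictionary.

```python
from typing import Optional

# Hypothetical in-memory layout; the real pickled dictionary may be organised
# differently. The frequency counts are invented for illustration only.
LOOKUP = {
    "centralen": [
        {"form": "centralen", "features": {"number": "singular"}, "frequency": 120},
        {"form": "centralni", "features": {"number": "singular"}, "frequency": 950},
        {"form": "centralnih", "features": {"number": "plural"}, "frequency": 400},
    ]
}


def most_frequent_form(lookup: dict, lemma: str, **required: str) -> Optional[str]:
    """Return the most frequent word form of `lemma` whose features match
    every key/value pair in `required`, or None if nothing matches."""
    candidates = [
        entry
        for entry in lookup.get(lemma, [])
        if all(entry["features"].get(key) == value for key, value in required.items())
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry["frequency"])["form"]


singular_form = most_frequent_form(LOOKUP, "centralen", number="singular")
```

With the toy counts above, the singular filter keeps two candidates and the more frequent one is returned; an unknown lemma or an unsatisfiable filter yields `None`.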
Morphological patterns from the Sloleks 2.0 lexicon 1.0
This entry consists of XML files with 96,290 lexical units (nouns, verbs, adjectives, and adverbs) from the Sloleks Morphological Lexicon of Slovene 2.0 (http://hdl.handle.net/11356/1230) that include codes for morphological patterns.
The pattern codes were designed based on a manual analysis of automatically extracted paradigms and were obtained as follows: The lexical units from Sloleks 2.0 were first automatically clustered into groups through a rule-based approach based on (1) a number of predetermined grammatical features from the MULTEXT-East Version 6 morphosyntactic specifications for Slovenian (http://nl.ijs.si/ME/V6/), such as part of speech, gender and properness for nouns, aspect for verbs, and (2) the differentiating characteristics of their morphological paradigms (i.e. their mutable word parts, which are similar to but not always overlapping with the linguistic definition of word endings – for example: čas-Ø; čas-a; čas-om / prijatelj-Ø; prijatelj-a; prijatelj-em / odstot-ek; odstot-ka; odstot-kom).
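One simple way to picture the "mutable word parts" is to strip the longest prefix shared by all forms of a paradigm and group lexical units by what remains. The sketch below does exactly that; it is an assumption made for illustration, not the rule-based procedure actually applied to Sloleks 2.0, and the helper names and the extra noun "grad" are not from the source.

```python
import os
from collections import defaultdict


def mutable_parts(forms):
    """Strip the longest common prefix from every form and return the
    remaining parts; an empty remainder is written as "Ø"."""
    stem = os.path.commonprefix(forms)  # character-wise common prefix
    return tuple(form[len(stem):] or "Ø" for form in forms)


def group_by_pattern(paradigms):
    """Group lemmas whose forms share the same tuple of mutable parts."""
    groups = defaultdict(list)
    for lemma, forms in paradigms.items():
        groups[mutable_parts(forms)].append(lemma)
    return dict(groups)


# Three forms per noun, in the same order as the examples in the description.
PARADIGMS = {
    "čas": ["čas", "časa", "časom"],
    "grad": ["grad", "grada", "gradom"],  # extra noun added for illustration
    "prijatelj": ["prijatelj", "prijatelja", "prijateljem"],
    "odstotek": ["odstotek", "odstotka", "odstotkom"],
}

patterns = group_by_pattern(PARADIGMS)
```

Here "čas" and "grad" fall into the same group, while "odstotek" gets the parts "ek", "ka", "kom", which shows why the mutable part need not coincide with the linguistic ending.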
More than 1,000 automatically extracted pattern candidates were subsequently linguistically analyzed, combined into groups, and hierarchically organized. As a result, every lexical unit in the XML file features a code (listed as ) corresponding to the relevant morphological paradigm in the hierarchy (available in the accompanying file titled "nssss_morphological_pattern_hierarchy_1.0.tsv").
Because the patterns were extracted from Sloleks 2.0, they reflect the decisions that were implemented in its initial compilation, particularly in terms of the degree of morphological variation documented in the lexicon (e.g. not all morphological variants are necessarily included in the lexicon) and paradigm integrity (for instance, some nouns in Sloleks 2.0 only feature singular or plural forms). It should be noted that non-standard word forms were not included in the design of the patterns. In addition, the XML file does not contain lexical units from Sloleks 2.0 that consist of word forms from more than one morphological paradigm (e.g. lesketati – lesketam / leskečem; or lojen – lojenega / lojnega), or other problematic units (such as those with missing or erroneous data).
Slovenian datasets for contextual synonym and antonym detection
Slovenian datasets for contextual synonym and antonym detection can be used for training machine learning classifiers as described in the MSc thesis of Jasmina Pegan "Semantic detection of synonyms and antonyms with contextual embeddings" (https://repozitorij.uni-lj.si/IzpisGradiva.php?id=141456). Datasets contain example pairs of synonyms and antonyms in contexts together with additional information on a sense pair. Candidates for synonyms and antonyms were retrieved from the dataset created in the BSc thesis of Jasmina Pegan "Antonym detection with word embeddings" (https://repozitorij.uni-lj.si/IzpisGradiva.php?id=110533).
Example sentences were retrieved from The comprehensive Slovenian-Hungarian dictionary (VSMS) (https://www.clarin.si/repository/xmlui/handle/11356/1453). Each dataset is class balanced and contains an equal number of examples and counterexamples. An example is a pair of example sentences where the two words are synonyms/antonyms. A counterexample is a pair of example sentences where the two words are not synonyms/antonyms. Note that a word pair in a counterexample can still be synonymous or antonymous in some other sense of the two words (but not in the given context).
Datasets are divided into two categories, datasets for synonyms and datasets for antonyms. Each category is further divided into base and updated datasets. These contain three dataset files: train, validation and test dataset. Base datasets include only manually-reviewed sense pairs. These are generated from all pairs of VSMS sense examples for all confirmed pairs of antonym and synonym senses. Updated datasets include automatically generated sense pairs while constraining the maximal number of examples per word. In this way, the dataset is more balanced word-wise, but is not fully manually-reviewed and contains less accurate data.
A single dataset entry contains the information on the base word, followed by data on the synonym/antonym candidate. The last column indicates whether the sense pair is a pair of synonyms/antonyms or not. More details can be found in the included README file.
Developmental corpus ccŠolar 1.0
The ccŠolar corpus contains 1693 texts collected during 2016-2018, as part of the upgrade of the corpus Šolar project. The project aims were to increase the size of the Šolar 1.0 corpus and to improve text balance across regions and education level. For each text, the information on school (elementary or secondary), subject, level (grade or year), type of text, region and date of production is provided.
The ccŠolar 1.0 corpus is offered separately because the new texts were collected under CC BY 4.0 licence, a more open licence than the earlier texts.
Frequency lists of collocations from the Gigafida 2.1 corpus
Frequency lists of collocations were extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/noske/run.cgi/corp_info?corpname=gfida21) using specialised scripts for extraction of data from syntactically parsed corpora.
The lists contain collocations with absolute frequency 10 and above, split into files corresponding to 81 predefined syntactic structures. The formal description of syntactic structures with information on restrictions and representations applied to POS and dependency parsing annotations is included in the dataset.
The lists are sorted according to absolute frequency of collocations and include frequency information on individual lemmas, together with the most frequent representative forms of combined lemmas. The lists also include calculation of logDice score for collocations, and the number of distinct forms of lemmas appearing in corpus hits for a particular collocation.
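For readers unfamiliar with the logDice score, the sketch below uses the commonly cited definition, 14 + log2(2·f_xy / (f_x + f_y)), where f_xy is the collocation frequency and f_x, f_y are the frequencies of the two lemmas. It is assumed, not confirmed by the dataset description, that the lists use exactly this variant; the function name `log_dice` is chosen for the example.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice association score: 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: frequency of the collocation
    f_x, f_y: frequencies of the two individual lemmas
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Two lemmas that only ever occur together reach the theoretical maximum.
maximum_score = log_dice(10, 10, 10)
```

Because the score depends only on the ratio of frequencies, it stays comparable across corpora of different sizes, which is one reason it is often preferred for collocation lists.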
Developmental corpus of Slovene (without language corrections) Šolar-Clear
Šolar-Clear is an adapted version of the Šolar 1.0 corpus, cf. http://hdl.handle.net/11356/1036.
The Šolar(-Clear) corpus consists of texts written by students in Slovene primary and secondary schools. School essays form the majority of the corpus (64.2%) while other material includes texts created during lessons, such as text recapitulations or descriptions, examples of formal applications etc.
Unlike the original Šolar corpus, Šolar-Clear only includes student texts while language corrections and other types of feedback from the teachers are not included. The corpus can thus be used for processing tasks where the inclusion of corrections hinders or complicates the procedures (e.g. for comparative data extraction, training of language models etc.).
Thesaurus of Modern Slovene 1.0
This is an automatically created Slovene thesaurus from Slovene data available in a comprehensive English–Slovenian dictionary, a monolingual dictionary, and a corpus. A network analysis on the bilingual dictionary word co-occurrence graph was used, together with additional information from the distributional thesaurus data available as part of the Sketch Engine tool and extracted from the 1.2 billion word Gigafida corpus and the monolingual dictionary.