11,476 research outputs found

    Recognition times for 54 thousand Dutch words : data from the Dutch crowdsourcing project

    Get PDF
    We present a new database of Dutch word recognition times for a total of 54 thousand words, called the Dutch Crowdsourcing Project. The data were collected with an Internet vocabulary test. The database is limited to native Dutch speakers. Participants were asked to indicate which words they knew. Their response times were registered, even though the participants were not asked to respond as fast as possible. Still, the response times correlate around .7 with the response times of the Dutch Lexicon Projects for shared words. Also results of virtual experiments indicate that the new response times are a valid addition to the Dutch Lexicon Projects. This not only means that we have useful response times for some 20 thousand extra words, but we now also have data on differences in response latencies as a function of education and age. The new data correspond better to word use in the Netherlands

    Stabilizing knowledge through standards - A perspective for the humanities

    Get PDF
    It is usual to consider that standards generate mixed feelings among scientists. They are often seen as not really reflecting the state of the art in a given domain and a hindrance to scientific creativity. Still, scientists should theoretically be at the best place to bring their expertise into standard developments, being even more neutral on issues that may typically be related to competing industrial interests. Even if it could be thought of as even more complex to think about developping standards in the humanities, we will show how this can be made feasible through the experience gained both within the Text Encoding Initiative consortium and the International Organisation for Standardisation. By taking the specific case of lexical resources, we will try to show how this brings about new ideas for designing future research infrastructures in the human and social sciences

    Russian word sense induction by clustering averaged word embeddings

    Full text link
    The paper reports our participation in the shared task on word sense induction and disambiguation for the Russian language (RUSSE-2018). Our team was ranked 2nd for the wiki-wiki dataset (containing mostly homonyms) and 5th for the bts-rnc and active-dict datasets (containing mostly polysemous words) among all 19 participants. The method we employed was extremely naive. It implied representing contexts of ambiguous words as averaged word embedding vectors, using off-the-shelf pre-trained distributional models. Then, these vector representations were clustered with mainstream clustering techniques, thus producing the groups corresponding to the ambiguous word senses. As a side result, we show that word embedding models trained on small but balanced corpora can be superior to those trained on large but noisy data - not only in intrinsic evaluation, but also in downstream tasks like word sense induction.Comment: Proceedings of the 24rd International Conference on Computational Linguistics and Intellectual Technologies (Dialogue-2018

    TEI and LMF crosswalks

    Get PDF
    The present paper explores various arguments in favour of making the Text Encoding Initia-tive (TEI) guidelines an appropriate serialisation for ISO standard 24613:2008 (LMF, Lexi-cal Mark-up Framework) . It also identifies the issues that would have to be resolved in order to reach an appropriate implementation of these ideas, in particular in terms of infor-mational coverage. We show how the customisation facilities offered by the TEI guidelines can provide an adequate background, not only to cover missing components within the current Dictionary chapter of the TEI guidelines, but also to allow specific lexical projects to deal with local constraints. We expect this proposal to be a basis for a future ISO project in the context of the on going revision of LMF

    Competition is Good for Descriptions

    Get PDF
    • …
    corecore