223 research outputs found

    An autoencoder-based neural network model for selectional preference: evidence from pseudo-disambiguation and cloze tasks

    Intuitively, some predicates have a better fit with certain arguments than others. Usage-based models of language emphasize the importance of semantic similarity in shaping the structuring of constructions (form and meaning). In this study, we focus on modeling the semantics of transitive constructions in Finnish and present an autoencoder-based neural network model trained on semantic vectors based on Word2vec. This model builds on the distributional hypothesis, according to which semantic information is primarily shaped by contextual information. Specifically, we focus on the realization of the object. The performance of the model is evaluated in two tasks: a pseudo-disambiguation task and a cloze task. Additionally, we contrast the performance of the autoencoder with a previously implemented neural model. In general, the results show that our model achieves excellent performance on these tasks in comparison to the other models. The results are discussed in terms of usage-based construction grammar.
    Summary. Aki-Juhani Kyröläinen, M. Juhani Luotolahti and Filip Ginter: An autoencoder-based neural network model for selectional preference. Intuitively, some arguments seem to fit certain predicates better than others. Usage-based language models emphasize the importance of semantic similarity in shaping the structure of constructions (both form and meaning). In this study, we model the semantics of Finnish transitive constructions and present a neural network model, namely an autoencoder. The model builds on the hypothesis of distributional semantics, according to which semantic information is shaped primarily by context. More specifically, the study focuses on the object. We evaluate the model with both a pseudo-disambiguation task and a cloze task. We compare the results of the autoencoder with previously developed neural network models and show that our model performs very well in comparison with the other models. We present the results in the context of usage-based construction grammar. Keywords: neural network; autoencoder; semantic vector; usage-based model; Finnish

    Parsing Clinical Finnish: Experiments with Rule-Based and Statistical Dependency Parsers

    Proceedings of the 17th Nordic Conference of Computational Linguistics NODALIDA 2009. Editors: Kristiina Jokinen and Eckhard Bick. NEALT Proceedings Series, Vol. 4 (2009), 65-72. © 2009 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/9206

    Cell line name recognition in support of the identification of synthetic lethality in cancer from text

    Motivation: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating, for instance, the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus. Results: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers.

    Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022)

    The prevailing practice in academia is to evaluate model performance on in-domain evaluation data, typically set aside from the training corpus. However, in many real-world applications the data to which the model is applied may differ very substantially from the training data. In this paper, we focus on Finnish out-of-domain parsing by introducing a novel out-of-domain treebank, UD Finnish-OOD, comprising five very distinct data sources (web documents, clinical text, online discussions, tweets, and poetry) and a total of 19,382 syntactic words in 2,122 sentences, released under the Universal Dependencies framework. Together with the new treebank, we present an extensive out-of-domain parsing evaluation utilizing the available section-level information from three different Finnish UD treebanks (TDT, PUD, OOD). Compared to the previously existing treebanks, the new Finnish-OOD is shown to include sections that are more challenging for the general parser, creating an interesting evaluation setting and yielding valuable information for those applying the parser outside of its training domain.

    Contextual weighting for Support Vector Machines in literature mining: an application to gene versus protein name disambiguation

    BACKGROUND: The ability to distinguish between genes and proteins is essential for understanding biological text. Support Vector Machines (SVMs) have been proven to be very efficient in general data mining tasks. We explore their capability for the gene versus protein name disambiguation task. RESULTS: We incorporated into the conventional SVM a weighting scheme based on distances of context words from the word to be disambiguated. This weighting scheme increased the performance of SVMs by five percentage points, giving performance better than 85% as measured by the area under the ROC curve, and outperformed the Weighted Additive Classifier, which also incorporates the weighting, and the Naive Bayes classifier. CONCLUSION: We show that the performance of SVMs can be improved by the proposed weighting scheme. Furthermore, our results suggest that in this study the increase in classification performance due to the weighting is greater than that obtained by selecting the underlying classifier or the kernel part of the SVM.

    Learning to Extract Biological Event and Relation Graphs

    Proceedings of the 17th Nordic Conference of Computational Linguistics NODALIDA 2009. Editors: Kristiina Jokinen and Eckhard Bick. NEALT Proceedings Series, Vol. 4 (2009), 18-25. © 2009 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/9206