7 research outputs found

    Methods and algorithms for unsupervised learning of morphology

    This is an accepted manuscript of a chapter published by Springer in Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8403, in 2014, available online: https://doi.org/10.1007/978-3-642-54906-9_15 The accepted version of the publication may differ from the final published version. This paper is a survey of methods and algorithms for unsupervised learning of morphology. We describe the methods and algorithms used for morphological segmentation from a computational linguistics point of view. The survey covers morphological segmentation methods based on MDL (minimum description length), MLE (maximum likelihood estimation), MAP (maximum a posteriori), and parametric and non-parametric Bayesian approaches. A review of the evaluation schemes for unsupervised morphological segmentation is also provided, along with a summary of evaluation results on the Morpho Challenge evaluations. Published versio

    Dialog Modelling Experiments with Finnish One-to-One Chat Data

    We analyzed two conversational corpora in Finnish: public library question-answering (QA) data and private medical chat data. We developed response retrieval (ranking) models using TF-IDF, StarSpace, ESIM and BERT methods. These four represent techniques ranging from simple, classical ones to recent pretrained transformer neural networks. We evaluated the effect of different preprocessing strategies, including raw text, casing, lemmatization and spell-checking, for the different methods. Using our medical chat data, we also developed a novel three-stage preprocessing pipeline with speaker role classification. We found the BERT model pretrained on Finnish (FinBERT) to be an unambiguous winner in ranking accuracy, reaching 92.2% for the medical chat and 98.7% for the library QA in the 1-out-of-10 response ranking task, where the chance level was 10%. The best accuracies were reached using uncased text with spell-checking (BERT models) or lemmatization (non-BERT models). Preprocessing had less impact on BERT models than on the classical and other neural network models. Furthermore, we found the TF-IDF method still a strong baseline for the vocabulary-rich library QA task, even surpassing the more advanced StarSpace method. Our results highlight the complex interplay between preprocessing strategies and model type when choosing the optimal approach in chat-data modelling. Our study is the first work on dialogue modelling using neural networks for the Finnish language. It is also the first of its kind to use real medical chat data. Our work contributes towards the development of automated chatbots in the professional domain.

    Finnish as Source Language in Bilingual Question Answering
