    Learning Language from a Large (Unannotated) Corpus

    A novel approach to the fully automated, unsupervised extraction of dependency grammars and associated syntax-to-semantic-relationship mappings from large text corpora is described. The suggested approach builds on the authors' prior work with the Link Grammar, RelEx and OpenCog systems, as well as on a number of prior papers and approaches from the statistical language learning literature. If successful, this approach would enable the mining of all the information needed to power a natural language comprehension and generation system, directly from a large, unannotated corpus. Comment: 29 pages, 5 figures, research proposal.

    Russian word sense induction by clustering averaged word embeddings

    The paper reports our participation in the shared task on word sense induction and disambiguation for the Russian language (RUSSE-2018). Our team was ranked 2nd for the wiki-wiki dataset (containing mostly homonyms) and 5th for the bts-rnc and active-dict datasets (containing mostly polysemous words) among all 19 participants. The method we employed was extremely naive. It involved representing the contexts of ambiguous words as averaged word embedding vectors, using off-the-shelf pre-trained distributional models. These vector representations were then clustered with mainstream clustering techniques, producing groups that correspond to the senses of the ambiguous word. As a side result, we show that word embedding models trained on small but balanced corpora can be superior to those trained on large but noisy data, not only in intrinsic evaluation but also in downstream tasks like word sense induction. Comment: Proceedings of the 24th International Conference on Computational Linguistics and Intellectual Technologies (Dialogue-2018).
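
    The pipeline described in this abstract (average the embeddings of each context, then cluster the averages) can be illustrated with a minimal sketch. Everything concrete below is an assumption made for illustration: the 2-D toy embeddings, the English example contexts, the helper names, and the use of plain k-means. The abstract only says "mainstream clustering techniques" and does not name the models or algorithms actually used.

```python
import random

# Hypothetical toy embeddings (2-D for readability); real pre-trained
# distributional models would have hundreds of dimensions.
EMBEDDINGS = {
    "river": [1.0, 0.1], "water": [0.9, 0.0], "shore": [0.8, 0.2],
    "money": [0.1, 1.0], "loan": [0.0, 0.9], "deposit": [0.2, 0.8],
}

# Hypothetical contexts of one ambiguous word (e.g. "bank").
CONTEXTS = [
    ["river", "water"],
    ["shore", "river"],
    ["money", "loan"],
    ["deposit", "money"],
]


def average_vector(tokens, embeddings):
    """Represent a context as the mean of its known word vectors."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no known tokens in context")
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for each point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def induce_senses(contexts, embeddings, k):
    """Cluster averaged context vectors; each cluster stands for one sense."""
    points = [average_vector(ctx, embeddings) for ctx in contexts]
    return kmeans(points, k)


if __name__ == "__main__":
    print(induce_senses(CONTEXTS, EMBEDDINGS, k=2))
```

    In this toy setup the two "river" contexts end up in one cluster and the two "money" contexts in the other. The number of senses `k` is fixed by hand here, which is a simplification of the sketch.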

    Frequency vs. Association for Constraint Selection in Usage-Based Construction Grammar

    A usage-based Construction Grammar (CxG) posits that slot-constraints generalize from common exemplar constructions. But what is the best model of constraint generalization? This paper evaluates competing frequency-based and association-based models across eight languages using a metric derived from the Minimum Description Length paradigm. The experiments show that association-based models produce better generalizations across all languages by a significant margin.
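
    The contrast between frequency-based and association-based selection can be made concrete with a small sketch. The abstract does not say which association measure is used, so ΔP (a standard directional association measure) is assumed here purely for illustration. The counts and the filler names are invented toy data, not results from the paper.

```python
def delta_p(a, b, c, d):
    """Delta-P of outcome given cue: P(outcome | cue) - P(outcome | no cue).

    a: cue present, outcome present
    b: cue present, outcome absent
    c: cue absent,  outcome present
    d: cue absent,  outcome absent
    """
    return a / (a + b) - c / (c + d)


# Invented contingency counts (a, b, c, d) for two hypothetical slot
# fillers, where the cue is the filler and the outcome is the slot.
CANDIDATES = {
    "common_word": (50, 950, 50, 8950),  # frequent everywhere, weakly tied to the slot
    "rare_word": (20, 5, 80, 9895),      # infrequent, but mostly occurs in the slot
}


def rank_by_frequency(candidates):
    """Order fillers by raw co-occurrence count with the slot."""
    return sorted(candidates, key=lambda w: candidates[w][0], reverse=True)


def rank_by_association(candidates):
    """Order fillers by Delta-P association with the slot."""
    return sorted(candidates, key=lambda w: delta_p(*candidates[w]), reverse=True)


if __name__ == "__main__":
    print("frequency:  ", rank_by_frequency(CANDIDATES))
    print("association:", rank_by_association(CANDIDATES))
```

    With these toy counts the two rankings disagree: raw frequency prefers the filler that is common everywhere, while ΔP prefers the filler that occurs mostly inside the slot.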

    Towards the development of a problem solver for the monitoring and control of instrumentation in a grid environment

    This paper considers the issues involved in developing a generic problem solver to be used within a grid environment for the monitoring and control of instrumentation. The specific feature of such an environment is that the type of data to be processed, as well as the problem, is not always known in advance. Therefore, it is necessary to develop a problem solver architecture that addresses this issue. We propose to analyze the performance of the problem-solving algorithms available within the WEKA toolkit and to derive a decision tree that selects the best-performing algorithm for a given type of data. For this purpose the algorithms have been tested using 51 datasets either drawn from publicly available repositories or generated in a grid-enabled environment.

    Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline

    Morfessor is a family of probabilistic machine learning methods that find morphological segmentations for words of a natural language, based solely on raw text data. Since the release of the public implementations of the Morfessor Baseline and Categories-MAP methods in 2005, they have become popular as automatic tools for processing morphologically complex languages for applications such as speech recognition and machine translation. This report describes a new implementation of the Morfessor Baseline method. The new version not only fixes the main restrictions of the previous software, but also includes recent methodological extensions such as semi-supervised learning, which can make use of small amounts of manually segmented words. Experimental results for the various features of the implementation are reported for English and Finnish segmentation tasks.

    Constructions: a new unit of analysis for corpus-based discourse analysis

    We propose and assess the novel idea of using automatically induced constructions as a unit of analysis for corpus-based discourse analysis. Automated techniques are needed in order to elucidate important characteristics of corpora for social science research into topics, framing and argument structures. Compared with current techniques (keywords, n-grams, and collocations), constructions capture more linguistic patterning, including some grammatical phenomena. Recent advances in natural language processing mean that it is now feasible to automatically induce some constructions from large unannotated corpora. In order to assess how well constructions characterise the content of a corpus and how well they elucidate interesting aspects of different discourses, we analysed a corpus of climate change blogs. The utility of constructions for corpus-based discourse analysis was compared qualitatively with keywords, n-grams and collocations. We found that the unusually frequent constructions gave interesting and different insights into the content of the discourses and enabled better comparison of sub-corpora.