228 research outputs found

    Comparing Fifty Natural Languages and Twelve Genetic Languages Using Word Embedding Language Divergence (WELD) as a Quantitative Measure of Language Distance

    Full text link
    We introduce a new measure of distance between languages based on word embedding, called word embedding language divergence (WELD). WELD is defined as divergence between unified similarity distribution of words between languages. Using such a measure, we perform language comparison for fifty natural languages and twelve genetic languages. Our natural language dataset is a collection of sentence-aligned parallel corpora from bible translations for fifty languages spanning a variety of language families. Although we use parallel corpora, which guarantees having the same content in all languages, interestingly in many cases languages within the same family cluster together. In addition to natural languages, we perform language comparison for the coding regions in the genomes of 12 different organisms (4 plants, 6 animals, and two human subjects). Our result confirms a significant high-level difference in the genetic language model of humans/animals versus plants. The proposed method is a step toward defining a quantitative measure of similarity between languages, with applications in languages classification, genre identification, dialect identification, and evaluation of translations

    Sinonim Dan Word Sense Disambiguation Untuk Melengkapi Detektor Plagiat Dokumen Tugas Akhir

    Get PDF
    Plagiarism can be categorized into several levels: carbon copy, the addition of words, word substitutions, changing active into passive sentences, and paraphrase. In this research, the detection is only performed by local similarity assessment method. This research is categorized into 3 major processes: preprocessing, candidate determination, and calculation of similarity. In preprocessing, extraction and conversion of a PDF file into XML is performed. Stopword removal and stemming are also performed in this process. For Candidates determination, the process used VSM (Vector Space Model) algorithm using Lucene.NET. It will then calculate the similarity values of the candidates. Similarity values that meet the threshold will be processed in the third stage. The next process is detecting plagiarism at the level of carbon copy. The plagiarism of the substitution level will be determined by finding synonymous with Lesk algorithm and utilizing WordNet as a language dictionary. Lesk notice the words around it, before doing the search process is synonymous with Lesk, performed first sentence extractor. From this experiment, it is concluded that the determination of synonyms using WordNet and Lesk algorithm does not seem to increase its similarity value role. This is due to the difficulty of finding plagiarism by just substituting words. However, plagiarism at the level of carbon copy can be handled with the help of sentence matching

    Planting Trees in the Desert: Delexicalized Tagging and Parsing Combined

    Get PDF
    Various unsupervised and semi-supervised methods have been proposed to tag and parse an unseen language. We explore delexicalized parsing, proposed by (Zeman and Resnik, 2008), and delexicalized tagging, proposed by (Yu et al., 2016). For both approaches we provide a detailed evaluation on Universal Dependencies data (Nivre et al., 2016), a de-facto standard for multi-lingual morphosyntactic processing (while the previous work used other datasets). Our results confirm that in separation, each of the two delexicalized techniques has some limited potential when no annotation of the target language is available. However, if used in combination, their errors multiply beyond acceptable limits. We demonstrate that even the tiniest bit of expert annotation in the target language may contain significant potential and should be used if available

    Supervised Training on Synthetic Languages: A Novel Framework for Unsupervised Parsing

    Get PDF
    This thesis focuses on unsupervised dependency parsing—parsing sentences of a language into dependency trees without accessing the training data of that language. Different from most prior work that uses unsupervised learning to estimate the parsing parameters, we estimate the parameters by supervised training on synthetic languages. Our parsing framework has three major components: Synthetic language generation gives a rich set of training languages by mix-and-match over the real languages; surface-form feature extraction maps an unparsed corpus of a language into a fixed-length vector as the syntactic signature of that language; and, finally, language-agnostic parsing incorporates the syntactic signature during parsing so that the decision on each word token is reliant upon the general syntax of the target language. The fundamental question we are trying to answer is whether some useful information about the syntax of a language could be inferred from its surface-form evidence (unparsed corpus). This is the same question that has been implicitly asked by previous papers on unsupervised parsing, which only assumes an unparsed corpus to be available for the target language. We show that, indeed, useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of other languages, so it learns a feature extractor that works well. This thesis contains several large-scale experiments requiring hundreds of thousands of CPU-hours. To our knowledge, this is the largest study of unsupervised parsing yet attempted. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages in the training yields further improvement. (3) Despite being computed from unparsed corpora, our learned task-specific features beat previous works’ interpretable typological features that require parsed corpora or expert categorization of the language

    Methodological Tools for Linguistic Description and Typology.

    Get PDF
    International audienc

    Satellite Workshop On Language, Artificial Intelligence and Computer Science for Natural Language Processing Applications (LAICS-NLP): Discovery of Meaning from Text

    Get PDF
    This paper proposes a novel method to disambiguate important words from a collection of documents. The hypothesis that underlies this approach is that there is a minimal set of senses that are significant in characterizing a context. We extend Yarowsky’s one sense per discourse [13] further to a collection of related documents rather than a single document. We perform distributed clustering on a set of features representing each of the top ten categories of documents in the Reuters-21578 dataset. Groups of terms that have a similar term distributional pattern across documents were identified. WordNet-based similarity measurement was then computed for terms within each cluster. An aggregation of the associations in WordNet that was employed to ascertain term similarity within clusters has provided a means of identifying clusters’ root senses

    Probabilistic Modelling of Morphologically Rich Languages

    Full text link
    This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language. Such models play an important role in natural language processing tasks such as translation or speech recognition, but often rely on the simplistic assumption that words are opaque symbols. This assumption does not fit morphologically complex language well, where words can have rich internal structure and sub-word elements are shared across distinct word forms. Our approach is to encode basic notions of morphology into the assumptions of three different types of language models, with the intention that leveraging shared sub-word structure can improve model performance and help overcome data sparsity that arises from morphological processes. In the context of n-gram language modelling, we formulate a new Bayesian model that relies on the decomposition of compound words to attain better smoothing, and we develop a new distributed language model that learns vector representations of morphemes and leverages them to link together morphologically related words. In both cases, we show that accounting for word sub-structure improves the models' intrinsic performance and provides benefits when applied to other tasks, including machine translation. We then shift the focus beyond the modelling of word sequences and consider models that automatically learn what the sub-word elements of a given language are, given an unannotated list of words. We formulate a novel model that can learn discontiguous morphemes in addition to the more conventional contiguous morphemes that most previous models are limited to. This approach is demonstrated on Semitic languages, and we find that modelling discontiguous sub-word structures leads to improvements in the task of segmenting words into their contiguous morphemes.Comment: DPhil thesis, University of Oxford, submitted and accepted 2014. http://ora.ox.ac.uk/objects/uuid:8df7324f-d3b8-47a1-8b0b-3a6feb5f45c
    • …
    corecore