
    Learning morphology with Morfette

    Morfette is a modular, data-driven, probabilistic system which learns to perform joint morphological tagging and lemmatization from morphologically annotated corpora. The system is composed of two learning modules which are trained to predict morphological tags and lemmas using the Maximum Entropy classifier. The third module dynamically combines the predictions of the Maximum-Entropy models and outputs a probability distribution over tag-lemma pair sequences. The lemmatization module exploits the idea of recasting lemmatization as a classification task by using class labels which encode mappings from wordforms to lemmas. Experimental evaluation results and error analysis on three morphologically rich languages show that the system achieves high accuracy with no language-specific feature engineering or additional resources.
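
    The idea of treating lemmatization as classification can be illustrated with a minimal sketch. It assumes a simplified label scheme, (number of characters to strip from the end, suffix to append), which is not necessarily the encoding Morfette itself uses; the function names `lemma_class` and `apply_class` are hypothetical.

```python
from typing import Tuple

Label = Tuple[int, str]  # (chars to strip from the end, suffix to append)


def lemma_class(wordform: str, lemma: str) -> Label:
    """Encode a wordform -> lemma mapping as a reusable class label.

    Assumption: a simple suffix-rewrite label, chosen for illustration only.
    """
    # Length of the longest common prefix of wordform and lemma.
    i = 0
    while i < min(len(wordform), len(lemma)) and wordform[i] == lemma[i]:
        i += 1
    return (len(wordform) - i, lemma[i:])


def apply_class(wordform: str, label: Label) -> str:
    """Apply a class label to a (possibly unseen) wordform to get a lemma."""
    strip, suffix = label
    stem = wordform[: len(wordform) - strip]
    return stem + suffix


# The label learned from one pair generalizes to other wordforms.
label = lemma_class("studies", "study")
print(label, apply_class("carries", label))
```

    Because the label describes a transformation rather than a specific lemma, a classifier that predicts it can lemmatize wordforms it never saw during training.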

    Tagset Reductions in Morphosyntactic Tagging of Croatian Texts

    Morphosyntactic tagging of Croatian texts is performed with stochastic taggers by using a language model built on a manually annotated corpus implementing the Multext East version 3 specifications for Croatian. Tagging accuracy in this framework is basically predefined, i.e. proportionally dependent on two things: the size of the training corpus and the number of different morphosyntactic tags encompassed by that corpus. Since the 100 kw Croatia Weekly newspaper corpus by definition makes a rather small language model in terms of stochastic tagging of free domain texts, the paper presents an approach dealing with tagset reductions. Several meaningful subsets of the Croatian Multext-East version 3 morphosyntactic tagset specifications are created and applied to Croatian texts with the CroTag stochastic tagger, measuring overall tagging accuracy and F1-measures. Obtained results are discussed in terms of applying different reductions in different natural language processing systems and specific tasks defined by specific user requirements.
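
    A tagset reduction of this kind can be sketched as follows. The sketch assumes the reduction simply keeps the first few positions of a positional Multext-East tag; the subsets actually used in the paper are not given here, and the tags, the `reduce_tag` and `accuracy` helpers, and the sample data are illustrative only. It also rescores fixed predictions rather than retraining a tagger on the reduced tagset.

```python
from typing import List, Optional


def reduce_tag(msd: str, keep: Optional[int] = None) -> str:
    """Truncate a positional MSD tag to its first `keep` positions.

    keep=None leaves the tag unchanged; keep=1 keeps only the category
    (part of speech). This truncation rule is an assumption.
    """
    return msd[:keep]


def accuracy(gold: List[str], predicted: List[str],
             keep: Optional[int] = None) -> float:
    """Token-level accuracy after mapping both sides to the reduced tagset."""
    pairs = list(zip(gold, predicted))
    correct = sum(reduce_tag(g, keep) == reduce_tag(p, keep) for g, p in pairs)
    return correct / len(pairs)


# Hypothetical gold and predicted tags for three tokens.
gold = ["Ncmsn", "Vmip3s", "Afpmsn"]
pred = ["Ncmsa", "Vmip3s", "Ncmsn"]
print(accuracy(gold, pred), accuracy(gold, pred, keep=1))
```

    With fewer distinctions to get wrong, accuracy on the reduced tagset can only stay the same or rise, which is the trade-off between tag granularity and accuracy that the abstract describes.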

    Data sparsity in highly inflected languages: the case of morphosyntactic tagging in Polish

    In morphologically complex languages, many high-level tasks in natural language processing rely on accurate morphosyntactic analyses of the input. However, in light of the risk of error propagation in present-day pipeline architectures for basic linguistic pre-processing, the state of the art for morphosyntactic tagging is still not satisfactory. The main obstacle here is data sparsity inherent to natural language in general and highly inflected languages in particular. In this work, we investigate whether semi-supervised systems may alleviate the data sparsity problem. Our approach uses word clusters obtained from large amounts of unlabelled text in an unsupervised manner in order to provide a supervised probabilistic tagger with morphologically informed features. Our evaluations on a number of datasets for the Polish language suggest that this simple technique improves tagging accuracy, especially with regard to out-of-vocabulary words. This may prove useful to increase cross-domain performance of taggers, and to alleviate the dependency on large amounts of supervised training data, which is especially important from the perspective of less-resourced languages.
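
    How cluster identifiers might enter a tagger's feature set can be sketched as below. This is an assumption-laden illustration, not the paper's feature template: the `token_features` function, the feature names, and the cluster IDs are hypothetical, and the cluster table is assumed to have been induced beforehand from unlabelled text.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int,
                   clusters: Dict[str, str]) -> Dict[str, str]:
    """Build features for token i, adding word-cluster IDs where available.

    `clusters` maps lowercased words to cluster IDs (assumed precomputed).
    """
    word = tokens[i].lower()
    if i > 0:
        prev_cluster = clusters.get(tokens[i - 1].lower(), "UNK")
    else:
        prev_cluster = "BOS"  # beginning of sentence
    return {
        "word": word,
        "suffix3": word[-3:],
        "cluster": clusters.get(word, "UNK"),
        "prev_cluster": prev_cluster,
    }


# A word absent from the labelled data can still share a cluster
# with words the tagger has seen.
clusters = {"domu": "c17", "domem": "c17"}
print(token_features(["przed", "domem"], 1, clusters))
```

    The cluster feature gives the tagger evidence about words missing from the labelled training data, as long as they occurred in the unlabelled text used for clustering.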

    Digital Museum Consortia: A Prototype for Interconnected and Accessible Database Design

    The evolution of the internet and of the devices allowing access to it indicates that users trend toward networking and interconnectivity in their daily lives. Museums have started to tread into this territory (that is, crafting, managing, and maintaining an effective internet presence and ancillary content tools) on their own. However, many museums still rely upon the earliest types of education and interpretation tools, such as audio tours and recordings that address content from one collection. Moving beyond a single institution’s holdings, a shared database of museum content including photos of artifacts and objects, historic documents, and videos would allow users to examine pieces they enjoy and to find similar works at other locations. A single application providing museum collection capabilities and visitor access would benefit both sides. To support this claim, this thesis first provides a literature review of application use in museums that is supplemented by statistics of visitor use of museum mobile offerings. This historical overview yields a list of needs, interests, and obstacles to such an interconnective model. The third section constitutes the building blocks of such a model: database design, application design, and a web-accessible mirror site which are visualized in the prototyped content. The fourth section hypothesizes the future and expected impact of a shared network topology.

    Results from the Relativistic Heavy Ion Collider

    We describe the current status of the heavy ion research program at the Relativistic Heavy Ion Collider (RHIC). The new suite of experiments and the collider energies have opened up new probes of the medium created in the collisions. Our review focuses on the experimental discoveries to date at RHIC and their interpretation in the light of our present theoretical understanding of the dynamics of relativistic heavy ion collisions and of the structure of strongly interacting matter at high energy density. Comment: 47 pages, 10 figures, submitted to Annual Review of Nuclear and Particle Science. The authors invite and appreciate feedback about possible errors and/or inconsistencies in the manuscript.

    A gloss composition and context clustering based distributed word sense representation model

    In recent years, there has been an increasing interest in learning a distributed representation of word sense. Traditional context clustering based models usually require careful tuning of model parameters, and typically perform worse on infrequent word senses. This paper presents a novel approach which addresses these limitations by first initializing the word sense embeddings through learning sentence-level embeddings from WordNet glosses using a convolutional neural network. The initialized word sense embeddings are used by a context clustering based model to generate the distributed representations of word senses. Our learned representations outperform the publicly available embeddings on half of the metrics in the word similarity task and on 6 out of 13 subtasks in the analogical reasoning task, and give the best overall accuracy in the word sense effect classification task, which shows the effectiveness of our proposed distributed distribution learning model.