1,233 research outputs found
Learning morphology with Morfette
Morfette is a modular, data-driven, probabilistic system which learns to perform joint morphological tagging and lemmatization from morphologically annotated corpora. The system is composed of two learning modules which are trained to predict morphological tags and lemmas using the Maximum Entropy classifier. The third module dynamically combines the predictions of the Maximum-Entropy models and outputs a probability distribution over tag-lemma pair sequences. The lemmatization module exploits the idea of recasting lemmatization as a classification task by using class labels which encode mappings from wordforms to lemmas. Experimental evaluation results and error analysis on three morphologically rich languages show that the system achieves high accuracy with no language-specific feature engineering or additional resources
Tagset Reductions in Morphosyntactic Tagging of Croatian Texts
Morphosyntactic tagging of Croatian texts is performed with stochastic taggers by using a language model built on a manually annotated corpus implementing the Multext East version 3 specifications for Croatian. Tagging accuracy in this framework is basically predefined, i.e. proportionally dependent on two things: the size of the training corpus and the number of different morphosyntactic tags encompassed by that corpus. Since the 100 kw Croatia Weekly newspaper corpus by definition makes a rather small language model in terms of stochastic tagging of free domain texts, the paper presents an approach dealing with tagset reductions. Several meaningful subsets of the Croatian Multext-East version 3 morphosyntactic tagset specifications are created and applied on Croatian texts with the CroTag stochastic tagger, measuring overall tagging accuracy and F1-measures. Obtained results are discussed in terms of applying different reductions in different natural language processing systems and specific tasks defined by specific user requirements
Data sparsity in highly inflected languages: the case of morphosyntactic tagging in Polish
In morphologically complex languages, many high-level tasks in natural language
processing rely on accurate morphosyntactic analyses of the input. However, in
light of the risk of error propagation in present-day pipeline architectures for basic
linguistic pre-processing, the state of the art for morphosyntactic tagging is still
not satisfactory. The main obstacle here is data sparsity inherent to natural
language in general and highly inflected languages in particular.
In this work, we investigate whether semi-supervised systems may alleviate the
data sparsity problem. Our approach uses word clusters obtained from large
amounts of unlabelled text in an unsupervised manner in order to provide a
supervised probabilistic tagger with morphologically informed features. Our
evaluations on a number of datasets for the Polish language suggest that this simple
technique improves tagging accuracy, especially with regard to out-of-vocabulary
words. This may prove useful to increase cross-domain performance of taggers,
and to alleviate the dependency on large amounts of supervised training data,
which is especially important from the perspective of less-resourced languages
Digital Museum Consortia: A Prototype for Interconnected and Accessible Database Design
The evolution of the internet and of the devices allowing access to it indicates that users trend toward networking and interconnectivity in their daily lives. Museums have started to tread into this territory (that is, crafting, managing, and maintaining an effective internet presence and ancillary content tools) on their own. However, many museums still rely upon the earliest types of education and interpretation tools, such as audio tours and recordings that address content from one collection. Moving beyond a single institution’s holdings, a shared database of museum content including photos of artifacts and objects, historic documents, and videos would allow users to examine pieces they enjoy and to find similar works at other locations. A single application providing museum collection capabilities and visitor access would benefit both sides. To support this claim, this thesis first provides a literature review of application use in museums that is supplemented by statistics of visitor use of museum mobile offerings. This historical overview yields a list of needs, interests, and obstacles to such an interconnective model. The third section constitutes the building blocks of such a model: database design, application design, and a web-accessible mirror site, which are visualized in the prototyped content. The fourth section hypothesizes the future and expected impact of a shared network topology
Towards a People's Social Epidemiology: Envisioning a More Inclusive and Equitable Future for Social Epi Research and Practice in the 21st Century.
Social epidemiology has made critical contributions to understanding population health. However, translation of social epidemiology science into action remains a challenge, raising concerns about the impacts of the field beyond academia. With so much focus on issues related to social position, discrimination, racism, power, and privilege, there has been surprisingly little deliberation about the extent and value of social inclusion and equity within the field itself. Indeed, the challenge of translation/action might be more readily met through re-envisioning the role of the people within the research/practice enterprise, reimagining what "social" could, or even should, mean for the future of the field. A potential path forward rests at the nexus of social epidemiology, community-based participatory research (CBPR), and information and communication technology (ICT). Here, we draw from social epidemiology, CBPR, and ICT literatures to introduce A People's Social Epi, a multi-tiered framework for guiding social epidemiology in becoming more inclusive, equitable, and actionable for 21st century practice. In presenting this framework, we suggest the value of taking participatory, collaborative approaches anchored in CBPR and ICT principles and technological affordances, especially within the context of place-based and environmental research. We believe that such approaches present opportunities to create a social epidemiology that is of, with, and by the people, not simply about them. In this spirit, we suggest 10 ICT tools to "socialize" social epidemiology and outline 10 ways to move towards A People's Social Epi in practice
Results from the Relativistic Heavy Ion Collider
We describe the current status of the heavy ion research program at the
Relativistic Heavy Ion Collider (RHIC). The new suite of experiments and the
collider energies have opened up new probes of the medium created in the
collisions. Our review focuses on the experimental discoveries to date at RHIC
and their interpretation in the light of our present theoretical understanding
of the dynamics of relativistic heavy ion collisions and of the structure of
strongly interacting matter at high energy density.
Comment: 47 pages, 10 figures, submitted to Annual Review of Nuclear and
Particle Science. The authors invite and appreciate feedback about possible
errors and/or inconsistencies in the manuscript
A gloss composition and context clustering based distributed word sense representation model
In recent years, there has been an increasing interest in learning a distributed representation of word sense. Traditional context clustering based models usually require careful tuning of model parameters, and typically perform worse on infrequent word senses. This paper presents a novel approach which addresses these limitations by first initializing the word sense embeddings through learning sentence-level embeddings from WordNet glosses using a convolutional neural network. The initialized word sense embeddings are used by a context clustering based model to generate the distributed representations of word senses. Our learned representations outperform the publicly available embeddings on half of the metrics in the word similarity task and on 6 out of 13 subtasks in the analogical reasoning task, and give the best overall accuracy in the word sense effect classification task, which shows the effectiveness of our proposed distributed distribution learning model