Strong domain variation and treebank-induced LFG resources
In this paper we present a number of experiments to test the portability of existing treebank-induced LFG resources. We test the LFG parsing resources of Cahill et al. (2004) on the ATIS corpus, which represents a considerably different domain from the Penn-II Treebank Wall Street Journal sections from which the resources were induced. This testing shows under-performance at both c- and f-structure level as a result of the domain variation. We show that in order to adapt the LFG resources of Cahill et al. (2004) to this new domain, all that is necessary is to retrain the c-structure parser on data from the new domain.
Phrase extraction for machine translation
Statistical Machine Translation (SMT) developed in the late 1980s, based initially on a word-to-word translation process. However, such processes have difficulties when a good-quality translation is not strictly word-to-word. Easy cases can be handled by allowing insertion and deletion of single words, but more general word-reordering phenomena require a more general translation process. There is currently much interest in phrase-to-phrase models, which can overcome this problem but require that candidate phrases, together with their translations, be identified in the training corpora. Since phrase delimiters are not explicit, this gives rise to a new problem: that of phrase pair extraction. The current project proposes a phrase extraction algorithm which uses a window of n words around source and target words to extract equivalent phrases. The extracted phrases, together with their probabilities, are used as input to an existing Machine Translation system for the purpose of evaluating the phrase extraction algorithm.
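The window-based extraction described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the window size n, the tuple-based alignment input, and the sub-span enumeration are all assumptions made for the sake of a runnable example.

```python
def extract_phrase_pairs(src, tgt, links, n=2):
    """Extract candidate phrase pairs using a window of up to n words
    around each aligned source/target word position.

    src, tgt: lists of tokens.
    links: list of (src_idx, tgt_idx) word-alignment pairs.
    Returns a set of (source_phrase, target_phrase) string tuples.
    """
    pairs = set()
    for i, j in links:
        for w in range(1, n + 1):
            # window boundaries around the aligned positions, clipped to the sentence
            s_lo, s_hi = max(0, i - w + 1), min(len(src), i + w)
            t_lo, t_hi = max(0, j - w + 1), min(len(tgt), j + w)
            # emit every sub-span of each window that contains the aligned word
            for a in range(s_lo, i + 1):
                for b in range(i + 1, s_hi + 1):
                    for c in range(t_lo, j + 1):
                        for d in range(j + 1, t_hi + 1):
                            pairs.add((" ".join(src[a:b]), " ".join(tgt[c:d])))
    return pairs
```

In practice the extracted candidates would then be scored (e.g. by relative frequency) to produce the phrase-pair probabilities the abstract mentions.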
Exploring probabilistic grammars of symbolic music using PRISM
In this paper we describe how we used the logic-based probabilistic
programming language PRISM to conduct a systematic comparison
of several probabilistic models of symbolic music, including 0th and
1st order Markov models over pitches and intervals, and a probabilistic
grammar with two parameterisations. Using PRISM allows us to take
advantage of variational Bayesian methods for assessing the goodness of
fit of the models. When applied to a corpus of Bach chorales and the Essen
folk song collection, we found that, depending on various parameters, the
probabilistic grammars sometimes but not always out-perform the simple
Markov models. Examining how the models perform on smaller subsets
of pieces, we find that the simpler Markov models do out-perform the
best grammar-based model at the small end of the scale.
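As an illustration of the simplest models in the comparison above, a first-order Markov model over melodic intervals can be estimated by counting transitions. The sketch below uses plain maximum-likelihood estimation and is only a toy stand-in: it does not reproduce the paper's PRISM programs or its variational Bayesian model assessment.

```python
from collections import Counter, defaultdict

def fit_first_order(sequences):
    """Maximum-likelihood first-order Markov model over event symbols.

    sequences: list of lists of symbols (e.g. pitch intervals in semitones).
    Returns transition probabilities P(next | prev) as nested dicts.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {prev: {nxt: c / sum(cs.values()) for nxt, c in cs.items()}
            for prev, cs in counts.items()}
```

A 0th-order model would simply be the unconditional frequency of each symbol; the grammar-based models compared in the paper condition on richer structure than the previous event alone.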
Improving Machine Translation of Educational Content via Crowdsourcing
The limited availability of in-domain training data is a major issue in the training of application-specific neural machine translation models. Professional outsourcing of bilingual data collection is costly and often not feasible. In this paper we analyze the influence of using crowdsourcing as a scalable way to obtain translations of target in-domain data, bearing in mind that the translations can be of lower quality. We apply crowdsourcing with carefully designed quality controls to create parallel corpora for the educational domain by collecting translations of MOOC texts from English into eleven languages, which we then use to fine-tune neural machine translation models previously trained on general-domain data. The results of our research indicate that crowdsourced data collected with proper quality controls consistently yields performance gains over both general-domain baseline systems and systems fine-tuned with pre-existing in-domain corpora.
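One common quality-control mechanism for crowdsourced translation is to embed gold-standard items among the tasks and filter out workers who answer them poorly. The sketch below illustrates that general idea only; the exact-match scoring, the threshold, and the data layout are assumptions for illustration, not the quality controls actually used in the paper.

```python
from collections import Counter

def filter_workers(submissions, gold, min_accuracy=0.8):
    """Keep translations only from workers who meet a minimum accuracy
    on embedded gold-standard items.

    submissions: list of (worker_id, item_id, translation) tuples.
    gold: dict mapping gold item_id -> accepted reference translation.
    Returns the retained non-gold submissions, in input order.
    """
    correct, seen = Counter(), Counter()
    for worker, item, text in submissions:
        if item in gold:
            seen[worker] += 1
            correct[worker] += int(text == gold[item])
    trusted = {w for w in seen if correct[w] / seen[w] >= min_accuracy}
    return [(w, i, t) for w, i, t in submissions
            if w in trusted and i not in gold]
```

Real pipelines typically use softer scoring than exact string match (e.g. edit distance to the reference or manual adjudication), since multiple translations can be acceptable.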
libcloudph++ 0.2: single-moment bulk, double-moment bulk, and particle-based warm-rain microphysics library in C++
This paper introduces a library of algorithms for representing cloud
microphysics in numerical models. The library is written in C++, hence the name
libcloudph++. In the current release, the library covers three warm-rain
schemes: the single- and double-moment bulk schemes, and the particle-based
scheme with Monte-Carlo coalescence. The three schemes are intended for
modelling frameworks of different dimensionality and complexity ranging from
parcel models to multi-dimensional cloud-resolving (e.g. large-eddy)
simulations. A two-dimensional prescribed-flow framework is used in example
simulations presented in the paper with the aim of highlighting the library
features. The libcloudph++ and all its mandatory dependencies are free and
open-source software. The Boost.units library is used for zero-overhead
dimensional analysis of the code at compile time. The particle-based scheme is
implemented using the Thrust library, which makes it possible to leverage the
power of graphics processing units (GPUs) while retaining the possibility of
compiling the unchanged code for execution on single or multiple standard
processors (CPUs). The paper includes a complete description of the programming
interface (API) of the library and a performance analysis, including a
comparison of GPU and CPU setups.

Comment: The library description has been updated to the new library API (i.e.
the v0.1 -> v0.2 update). The key difference is that the model state variables
are now mixing ratios as opposed to densities. The particle-based scheme was
supplemented with the "particle recycling" process. Numerous editorial
corrections were made.