Empirical methods for the study of denotation in nominalizations in Spanish
This article deals with deverbal nominalizations in Spanish; concretely, we focus on the denotative distinction between event and result nominalizations. The goal of this work is twofold: first, to detect the most relevant features for this denotative distinction; and, second, to build an automatic classification system of deverbal nominalizations according to their denotation. We have based our study on theoretical hypotheses dealing with this semantic distinction and we have analyzed them empirically by means of Machine Learning techniques, which are the basis of the ADN-Classifier. This is the first tool that aims to automatically classify deverbal nominalizations into event, result, or underspecified denotation types in Spanish. The ADN-Classifier has helped us to quantitatively evaluate the validity of our claims regarding deverbal nominalizations. We set up a series of experiments in order to test the ADN-Classifier with different models and in different realistic scenarios depending on the knowledge resources and natural language processors available. The ADN-Classifier achieved good results (87.20% accuracy).
(In)definiteness Spread in Semitic Construct State: Does it Really Exist?
To argue against a long-established assumption is no easy task. In this article, I argue against one of those assumptions, namely (in)definiteness spread in the Semitic Construct State (CS). I argue that CSs are of two types: either definite or indefinite. The former refers to those CSs where the head N is syntactically definite, in the sense of having the definite article al-/ha- (the, Arabic/Hebrew), and the latter to those not having it. Three tenets constitute the crux of this paper: i) the controversy (in)definiteness spread gives rise to among Semitic scholars, ii) there does exist good evidence that the head N of CSs can take the definite article in Arabic and Hebrew, and iii) in Arabic the absence of the indefinite article (marker) -n on the head N has presumably to do with what I call VCR (= the Vowel Contextualization Rule), like several other similar phonological phenomena in the language. As for the specificity/uniqueness denoted by the head N of a CS in some contexts, I propose that such specificity has nothing to do with definiteness spread, but rather it may be linked to a Universal Grammar (UG) principle, which correlates specificity/uniqueness with possessivization cross-linguistically, or to a UG parameter in the case of Semitic CSs.
Proceedings
Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories.
Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti.
NEALT Proceedings Series, Vol. 9 (2010), 268 pages.
© 2010 The editors and contributors.
Published by Northern European Association for Language Technology (NEALT): http://omilia.uio.no/nealt
Electronically published at Tartu University Library (Estonia): http://hdl.handle.net/10062/15891
Translationese indicators for human translation quality estimation (based on English-to-Russian translation of mass-media texts)
A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.
Human translation quality estimation is a relatively new and challenging area of research,
because human translation quality is notoriously more subtle and subjective than machine
translation quality, which attracts much more attention and effort from the research community. At
the same time, human translation is routinely assessed by education and certification institutions,
as well as at translation competitions. Do the quality labels and scores generated
from real-life quality judgments align well with objective properties of translations? This
thesis puts this question to a test using machine learning methods.
Conceptually, this research is built around a hypothesis that linguistic properties characteristic
of translations, as a specific form of communication, can correlate with translation
quality. This assumption is often made in translation studies but has never been put to
a rigorous empirical test. Exploring translationese features in a quality estimation task
can help identify quality-related trends in translational behaviour and provide data-driven
insights into professionalism to improve training. Using translationese for quality estimation
fits well with the concept of quality in translation studies, because it is essentially a
document-level property. Linguistically-motivated translationese features are also more interpretable
than popular distributed representations and can explain linguistic differences
between quality categories in human translation.
We investigated (i) an extended set of Universal Dependencies-based morphosyntactic
features as well as two lexical feature sets capturing (ii) collocational properties of translations,
and (iii) ratios of vocabulary items in various frequency bands along with entropy
scores from n-gram models. To compare the performance of our feature sets in translationese
classifications and in quality estimation tasks against other representations, the
experiments were also run on tf-idf features, QuEst++ features and on contextualised
embeddings from a range of pre-trained language models, including the state-of-the-art
multilingual solution for machine translation quality estimation. Our major focus was on
document-level prediction; however, where the labels and features allowed, the experiments
were extended to the sentence level.
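As a rough illustration of feature set (iii) above, the sketch below computes frequency-band ratios and an entropy score for a tokenised text. It is a simplified stand-in under stated assumptions, not the feature extractor used in the thesis: the band boundaries, the `reference_ranks` mapping and both function names are hypothetical, and plain unigram entropy is used here in place of the n-gram model scores mentioned in the abstract.

```python
import math
from collections import Counter


def band_ratios(tokens, reference_ranks, bands=((1, 1000), (1001, 5000))):
    """Share of tokens whose reference-corpus frequency rank falls in each band.

    `reference_ranks` (hypothetical) maps a word to its frequency rank,
    1 being the most frequent. Words that are missing from the mapping, or
    ranked outside every band, are counted under the label "other".
    """
    counts = Counter()
    for tok in tokens:
        rank = reference_ranks.get(tok)
        label = "other"
        if rank is not None:
            for lo, hi in bands:
                if lo <= rank <= hi:
                    label = f"{lo}-{hi}"
                    break
        counts[label] += 1
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {label: n / total for label, n in counts.items()}


def unigram_entropy(tokens):
    """Shannon entropy (in bits) of the token distribution of one text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Each document would then contribute one ratio per band plus one entropy value to its feature vector; how those values relate to quality labels is an empirical question that the sketch does not address.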
The corpus used in this research includes English-to-Russian parallel subcorpora of student
and professional translations of mass-media texts, and a register-comparable corpus of
non-translations in the target language. Quality labels for various subsets of student translations
come from a number of real-life settings: translation competitions, graded student
translations, error annotations and direct assessment. We overview approaches to benchmarking
quality in translation and provide a detailed description of our own annotation
experiments.
Of the three proposed translationese feature sets, morphosyntactic features returned
the best results on all tasks. In many settings they were second only to contextualised
embeddings. At the same time, performance on various representations was contingent
on the type of quality captured by quality labels/scores. Using the outcomes of machine
learning experiments and feature analysis, we established that translationese properties of
translations were not equally reflected by various labels and scores. For example, professionalism
was much less related to translationese than expected. Labels from document-level
holistic assessment demonstrated maximum support for our hypothesis: lower-ranking
translations clearly exhibited more translationese. They bore more traces of mechanical
translational behaviours associated with following source language patterns whenever possible,
which led to the inflated frequencies of analytical passives, modal predicates, verbal
forms, especially copula verbs and verbs in the finite form. As expected, lower-ranking
translations were more repetitive and had longer, more complex sentences. Higher-ranking
translations were indicative of greater skill in recognising and counteracting translationese
tendencies. For document-level holistic labels as an approach to capture quality, translationese
indicators might provide a valuable contribution to an effective quality estimation
pipeline.
However, error-based scores, and especially scores from sentence-level direct assessment,
proved to be much less correlated with translationese and fluency issues in general. This was
confirmed by relatively low regression results across all representations that had access only
to the target language side of the dataset, by feature analysis and by correlation between
error-based scores and scores from direct assessment.
Investigation of the Distributions, Derivation, and Generalizations in Arabic Plural System
The Arabic plural system poses a challenge to current morphological accounts since the regular sound plural that is formed by suffixation contrasts with irregular broken plurals formed by internally modifying the singular stem. Although aspects of the Arabic plural system have been widely studied since the early ages of Arab grammarians (Abu Al-Saud 1971; Yaaqub 2004), there are several issues that remain unresolved and warrant further investigation. This dissertation uses a combination of statistical, qualitative, and computational approaches to provide a comprehensive account of several outstanding problems in Arabic nominal plurals.
Theoretical investigations of Arabic nominal plurals have led to conflicting results about the status of Arabic nominal ablaut as a minority default system (McCarthy & Prince 1990; Boudelaa & Gaskell 2002). Apart from this, little has been said about other aspects of the distribution of Arabic plurals, namely the interplay between regularity of the plural and its frequency in actual language use (Bybee 2001). This dissertation takes a usage-based approach to revisit the question of the status of Arabic as a minority-default system and to examine the interplay between the productivity of plural types and their frequency in actual language use. The results from the statistical distribution of sound and broken plurals are in line with the claim that Arabic pluralization is a minority default system. The results are also consistent with the prediction made by the usage-based model that patterns with low type frequency compensate for weak lexical strength through high token frequency.
The dissertation also investigates the role of singular stem weight in plural derivation. Numerous attempts have been made to model Arabic broken plurals, which fall into three main groups according to their specific morphological approach: Generative Morphophonology (Brame 1970; Levy 1971), Root-&-Pattern Morphology (McCarthy 1979; Hammond 1988), and Prosodic Morphology (McCarthy and Prince 1990). However, there has not been any investigation of the influence of the additive weight of singular stems on the derivation of plural forms. Results from qualitative and computational analyses provide evidence for the role of simple additive weight in the mapping of singular input stems to plural outputs in Arabic broken plurals. Stem weight does not completely determine the plural pattern. Rather, its role can be viewed as a (quasi) well-formedness condition on plural templates based on input forms.
The dissertation examines the types of information that are relevant to the Arabic plural system by performing a computational analysis on singular-plural pairs collected from a comprehensive corpus. The performance of multiple K-Nearest Neighbor (KNN) classifiers that use different combinations of factors to select plural patterns is compared to determine the importance of each factor. The results show that the CV template, vowel melody and semantic qualities of the singular all contribute to determining the shape of the plural template, though to varying degrees. The syllabic shape of the singular forms of Arabic nouns is the major factor in predicting their plural forms, followed by the vowel melody and the semantic features.
PhD, Linguistics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/168038/1/fahadal_1.pd
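The comparison of KNN classifiers over different factor combinations described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the toy records, the feature names (`cv`, `melody`, `sem`), the plural-pattern labels, the overlap distance and the leave-one-out evaluation are all hypothetical choices made for the example, and the records are invented placeholders, not corpus data.

```python
from collections import Counter

# Hypothetical toy records: categorical features of a singular noun
# paired with an abstract plural-pattern label. Not real corpus data.
TOY = [
    ({"cv": "CVCC", "melody": "a", "sem": "thing"}, "pattern_1"),
    ({"cv": "CVCC", "melody": "i", "sem": "thing"}, "pattern_1"),
    ({"cv": "CVCC", "melody": "a", "sem": "human"}, "pattern_1"),
    ({"cv": "CVVCVC", "melody": "a-i", "sem": "human"}, "pattern_2"),
    ({"cv": "CVVCVC", "melody": "a-i", "sem": "thing"}, "pattern_2"),
    ({"cv": "CVCVVC", "melody": "a-i", "sem": "human"}, "pattern_2"),
]


def overlap_distance(a, b, feats):
    """Number of selected features on which two records disagree."""
    return sum(a[f] != b[f] for f in feats)


def knn_predict(train, item, feats, k=3):
    """Majority label among the k training records closest to `item`."""
    nearest = sorted(train, key=lambda rec: overlap_distance(rec[0], item, feats))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(data, feats, k=3):
    """Leave-one-out accuracy of the KNN classifier using only `feats`."""
    hits = 0
    for i, (item, gold) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += knn_predict(train, item, feats, k) == gold
    return hits / len(data)


def compare_feature_sets(data, feature_sets, k=3):
    """Map each feature combination (joined by '+') to its accuracy."""
    return {"+".join(feats): loo_accuracy(data, feats, k) for feats in feature_sets}
```

Calling `compare_feature_sets(TOY, [["cv"], ["cv", "melody"], ["cv", "melody", "sem"]])` returns one accuracy per combination; on real singular-plural pairs, differences between those accuracies would indicate how much each factor contributes.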
SinSpeC 01
Volume 1 of the Working Papers of the SFB732 "Incremental Specification in Context" (SinSpeC)
The Restriction on Predicative Codas in Existential There-Clauses. Theoretical and Empirical Perspectives
This thesis investigates the so-called Predicate Restriction (PR) in English existential sentences. The general consensus in the literature is that only stage-level predicates (i.e. predicates denoting temporary properties, like 'sick') can appear in the coda of an existential sentence, while individual-level predicates (i.e. predicates that denote permanent properties, like 'tall') are excluded.
After a discussion of the various theoretical approaches to the PR (syntactic, semantic, as well as pragmatic), this thesis pursues two empirical studies, using both corpus data from the BNC and a judgment study. Both studies confirm the theoretical preference for SLPs (in a general sense), but also show that the distinction between predicates should not be reduced to a binary SLP–ILP choice. Predicates should rather be analysed in a more fine-grained system with multiple factors, as done in Jäger (2001).