25 research outputs found
DANSK and DaCy 2.6.0: Domain Generalization of Danish Named Entity Recognition
Named entity recognition is one of the cornerstones of Danish NLP, essential
for language technology applications within both industry and research.
However, Danish NER is inhibited by a lack of available datasets. As a
consequence, no current models are capable of fine-grained named entity
recognition, nor have they been evaluated for potential generalizability issues
across datasets and domains. To alleviate these limitations, this paper
introduces: 1) DANSK: a named entity dataset providing for high-granularity
tagging as well as within-domain evaluation of models across a diverse set of
domains; 2) DaCy 2.6.0 that includes three generalizable models with
fine-grained annotation; and 3) an evaluation of current state-of-the-art
models' ability to generalize across domains. The evaluation of existing and
new models revealed notable performance discrepancies across domains, which
should be addressed within the field. Shortcomings of the annotation quality of
the dataset and its impact on model training and evaluation are also discussed.
Despite these limitations, we advocate for the use of the new dataset DANSK
alongside further work on the generalizability within Danish NER.
Direct Causation: A New Approach to an Old Question
Causative constructions come in lexical and periphrastic variants, exemplified in English by Sam killed Lee and Sam caused Lee to die. While use of the former, the lexical causative, entails the truth of the latter, an entailment in the other direction does not hold. The source of this asymmetry is commonly ascribed to the lexical causative having an additional prerequisite of “direct causation,” such that the causative relation holds between a contiguous cause and effect (Fodor 1970, Katz 1970). However, this explanation encounters both empirical and theoretical problems (Neeleman & van de Koot 2012). To explain the source of the directness inferences (as well as other longstanding puzzles), we propose a formal analysis based on the framework of Structural Equation Models (SEMs) (Pearl 2000), which provides the necessary background for licensing causal inferences. Specifically, we provide a formalization of a 'sufficient set of conditions' within a model and demonstrate its role in the selectional parameters of causative descriptions. We argue that “causal sufficiency” is not a property of singular conditions, but rather of sets of conditions, which are individually necessary but only sufficient when taken together (a view originally motivated in the philosophical literature by Mackie 1965). We further introduce the notion of a “completion event” of a sufficient set, which is critical to explain the particular inferential profile of lexical causatives.
Multilingual Sentiment Normalization for Scandinavian Languages
In this paper, we address the challenge of multilingual sentiment analysis using a traditional lexicon and rule-based sentiment instrument that is tailored to capture sentiment patterns in a particular language. Focusing on a case study of three closely related Scandinavian languages (Danish, Norwegian, and Swedish) and using three tailored versions of VADER, we measure the relative degree of variation in valence using the OPUS corpus. We found that scores for Swedish are systematically skewed lower than Danish for translational pairs, and that scores for Norwegian are skewed higher than both other languages. We use a neural network to optimize the fit of Norwegian and Swedish, respectively, to Danish as the reference (target) language.
Speaker Attitude and Sexual Orientation Affect Phonetic Imitation
Numerous studies have documented the phenomenon of phonetic convergence: the process by which speakers alter their productions to become more similar on some phonetic or acoustic dimension to those of their interlocutor. Though social factors have been suggested as a motivator for imitation, few studies have established a tight connection between these extralinguistic factors and a speaker’s likelihood to imitate. The present study explores the effects of perceived sexual orientation and speaker attitude toward the interlocutor on the likelihood of imitation for extended VOT. Experimental results show that the extent of phonetic convergence (and divergence) depends on the perceived sexual orientation of the talker as well as whether the speaker is positively disposed to the interlocutor.
The Danish Gigaword Project
Danish is a North Germanic/Scandinavian language spoken primarily in Denmark,
a country with a tradition of technological and scientific innovation. However,
from a technological perspective, the Danish language has received relatively
little attention and, as a result, Danish language technology is hard to
develop, in part due to a lack of large or broad-coverage Danish corpora. This
paper describes the Danish Gigaword project, which aims to construct a
freely-available one billion word corpus of Danish text that represents the
breadth of the written language.
Causal and associational language in observational health research: a systematic evaluation
We estimated the degree to which language used in the high profile medical/public health/epidemiology literature implied causality using language linking exposures to outcomes and action recommendations; examined disconnects between language and recommendations; identified the most common linking phrases; and estimated how strongly linking phrases imply causality.
We searched and screened for 1,170 articles from 18 high-profile journals (65 per journal) published from 2010-2019. Based on written framing and systematic guidance, three reviewers rated the degree of causality implied in abstracts and full text for exposure/outcome linking language and action recommendations.
Reviewers rated the causal implication of exposure/outcome linking language as None (no causal implication) in 13.8%, Weak in 34.2%, Moderate in 33.2%, and Strong in 18.7% of abstracts. The implied causality of action recommendations was higher than the implied causality of linking sentences for 44.5% and commensurate for 40.3% of articles. The most common linking word in abstracts was “associate” (45.7%). Reviewers’ ratings of linking word roots were highly heterogeneous; over half of reviewers rated “association” as having at least some causal implication. This research undercuts the assumption that avoiding “causal” words leads to clarity of interpretation in medical research.
The Middle Construction in Mandarin Chinese
The middle is an unaccusative construction which expresses a modal generalization over events
(Keyser and Roeper 1984). Although the middle is not homogenous cross-linguistically (Ting 2006),
manifestations of the middle have been observed in most Indo-European languages. In this thesis, I
will develop criteria for middles based on cross-linguistic generalizations and argue for the existence of
a middle construction in Chinese. Chinese has a class of so-called 'notional passives,' unaccusative
sentences which display active morphology but receive passive interpretation. I will provide evidence
that the notional passive is distinct both structurally and semantically from the canonical Chinese
passive and demonstrate the inadequacy of the topic-comment account of such constructions proposed
by Li and Thompson (1981).
My account of the middle will crucially define it as a resultative form in Chinese, appearing
exclusively with Resultative Verb Compounds (RVCs). I will adopt Cheng and Huang's (1994)
classification of RVCs into four verbal subcategories (unergative, transitive, ergative, and causative)
and consider the syntactic and semantic properties of the resultative middle based on the argument
structure of its component predicates. Using data, I will analyze whether these Chinese middle verbs
pattern in a predictable, cross-linguistically consistent way, considering syntactic distribution,
aspectual composition, and semantic constraints on middle formation.