Centering Theory in natural text: a large-scale corpus study
We present an extensive corpus study of Centering Theory (CT), examining how adequately CT models coherence in a large body of natural text. A novel analysis of transition bigrams provides strong empirical support for several CT-related linguistic claims which so far have been investigated only on various small data sets. The study also reveals genre-based differences in texts’ degrees of entity coherence. Previous work has shown unsupervised CT-based coherence metrics to be unable to outperform a simple baseline. We identify two reasons: 1) these metrics assume that some transition types are more coherent and that they occur more frequently than others, but in our corpus the latter is not the case; and 2) the original sentence order of a document and a random permutation of its sentences differ mostly in the fraction of entity-sharing sentence pairs, exactly the factor measured by the baseline.
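The baseline factor named in this abstract, the fraction of entity-sharing sentence pairs, can be illustrated with a minimal sketch. It assumes that each sentence is already reduced to a set of entity strings and that only adjacent sentences are paired; the function name `entity_sharing_fraction` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Set


def entity_sharing_fraction(sentences: List[Set[str]]) -> float:
    """Fraction of adjacent sentence pairs that share at least one entity.

    Assumption: pairing adjacent sentences only; the paper may define
    the baseline differently.
    """
    if len(sentences) < 2:
        return 0.0
    pairs = list(zip(sentences, sentences[1:]))
    # A pair counts as entity-sharing when the two entity sets intersect.
    sharing = sum(1 for first, second in pairs if first & second)
    return sharing / len(pairs)


# Toy document: each sentence is represented by the entities it mentions.
original = [{"John", "store"}, {"John", "milk"}, {"milk"}, {"weather"}]
# One possible permutation of the same sentences.
permuted = [original[0], original[3], original[1], original[2]]

print(entity_sharing_fraction(original))
print(entity_sharing_fraction(permuted))
```

In this toy case the original order has more entity-sharing adjacent pairs than the permuted order, which is the kind of difference the abstract says the baseline measures.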
Situation entity annotation
This paper presents an annotation scheme for a new semantic annotation task with relevance for analysis and computation at both the clause level and the discourse level. More specifically, we label the finite clauses of texts with the type of situation entity (e.g., eventualities, statements about kinds, or statements of belief) they introduce to the discourse, following and extending work by Smith (2003). We take a feature-driven approach to annotation, with the result that each clause is also annotated with fundamental aspectual class, whether the main NP referent is specific or generic, and whether the situation evoked is episodic or habitual. This annotation is performed (so far) on three sections of the MASC corpus, with each clause labeled by at least two annotators. In this paper we present the annotation scheme, statistics of the corpus in its current version, and analyses of both inter-annotator agreement and intra-annotator consistency.
Automatic prediction of aspectual class of verbs in context
This paper describes a new approach to predicting the aspectual class of verbs in context, i.e., whether a verb is used in a stative or dynamic sense. We identify two challenging cases of this problem: when the verb is unseen in training data, and when the verb is ambiguous for aspectual class. A semi-supervised approach using linguistically-motivated features and a novel set of distributional features based on representative verb types allows us to predict classes accurately, even for unseen verbs. Many frequent verbs can be either stative or dynamic in different contexts, which has not been modeled by previous work; we use contextual features to resolve this ambiguity. In addition, we introduce two new datasets of clauses marked for aspectual class.
Taxonomic Loss for Morphological Glossing of Low-Resource Languages
Morpheme glossing is a critical task in automated language documentation and
can benefit other downstream applications greatly. While state-of-the-art
glossing systems perform very well for languages with large amounts of existing
data, it is more difficult to create useful models for low-resource languages.
In this paper, we propose the use of a taxonomic loss function that exploits
morphological information to make morphological glossing more performant when
data is scarce. We find that while the use of this loss function does not
outperform a standard loss function with regard to single-label prediction
accuracy, it produces better predictions when considering the top-n predicted
labels. We suggest this property makes the taxonomic loss function useful in a
human-in-the-loop annotation setting.
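The distinction this abstract draws between single-label accuracy and top-n predictions can be made concrete with a small sketch of top-n accuracy. It assumes that a model yields one score per candidate gloss label for each morpheme; the function name `top_n_accuracy`, the labels, and the scores are hypothetical illustrations and not results or code from the paper.

```python
from typing import Dict, List


def top_n_accuracy(scores: List[Dict[str, float]], gold: List[str], n: int) -> float:
    """Share of items whose gold label is among the n highest-scoring labels."""
    if not gold:
        return 0.0
    hits = 0
    for label_scores, gold_label in zip(scores, gold):
        # Rank candidate labels from highest to lowest score.
        ranked = sorted(label_scores, key=label_scores.get, reverse=True)
        if gold_label in ranked[:n]:
            hits += 1
    return hits / len(gold)


# Hypothetical model scores for three morphemes (illustrative values only).
scores = [
    {"PST": 0.5, "PRS": 0.4, "PL": 0.1},
    {"PL": 0.6, "SG": 0.3, "PST": 0.1},
    {"SG": 0.2, "PL": 0.45, "PRS": 0.35},
]
gold = ["PRS", "PL", "SG"]

for n in (1, 2, 3):
    print(n, top_n_accuracy(scores, gold, n))
```

With n = 1 this reduces to ordinary single-label accuracy. Larger n rewards a model that ranks the correct gloss near the top, which matters when a human annotator picks from a short list of suggestions.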
A Kind Introduction to Lexical and Grammatical Aspect, with a Survey of Computational Approaches
Aspectual meaning refers to how the internal temporal structure of situations
is presented. This includes whether a situation is described as a state or as
an event, whether the situation is finished or ongoing, and whether it is
viewed as a whole or with a focus on a particular phase. This survey gives an
overview of computational approaches to modeling lexical and grammatical aspect
along with intuitive explanations of the necessary linguistic concepts and
terminology. In particular, we describe the concepts of stativity, telicity,
habituality, perfective and imperfective, as well as influential inventories of
eventuality and situation types. We argue that because aspect is a crucial
component of semantics, especially when it comes to reporting the temporal
structure of situations in a precise way, future NLP approaches need to be able
to handle and evaluate it systematically in order to achieve human-level
language understanding.
Comment: Accepted at EACL 2023, camera-ready version
Bringing Active Learning to Life
Active learning has been applied to different NLP tasks, with the aim of limiting the amount of time and cost for human annotation. Most studies on active learning have only simulated the annotation scenario, using prelabelled gold standard data. We present the first active learning experiment for Word Sense Disambiguation with human annotators in a realistic environment, using fine-grained sense distinctions, and investigate whether AL can reduce annotation cost and boost classifier performance when applied to a real-world task.
LQVSumm: a corpus of linguistic quality violations in multi-document summarization
We present LQVSumm, a corpus of about 2000 automatically created extractive multi-document summaries from the TAC 2011 shared task on Guided Summarization, which we annotated with several types of linguistic quality violations. Examples of such violations include pronouns that lack antecedents or ungrammatical clauses. We give details on the annotation scheme and show that inter-annotator agreement is good given the open-ended nature of the task. The annotated summaries have previously been scored for Readability on a numeric scale by human annotators in the context of the TAC challenge; we show that the number of linguistic quality violations in a summary correlates with these intuitively assigned numeric scores. At the system level, the average number of violations marked in a system’s summaries achieves higher correlation with the Readability scores than current supervised state-of-the-art methods for assigning a single readability score to a summary. It is our hope that our corpus facilitates the development of methods that not only judge the linguistic quality of automatically generated summaries as a whole, but also allow for detecting, labeling, and fixing particular violations in a text.
Situation entity types: automatic classification of clause-level aspect
This paper describes the first robust approach to automatically labeling clauses with their situation entity type (Smith, 2003), capturing aspectual phenomena at the clause level which are relevant for interpreting both semantics at the clause level and discourse structure. Previous work on this task used a small data set from a limited domain, and relied mainly on words as features, an approach which is impractical in larger settings. We provide a new corpus of texts from 13 genres (40,000 clauses) annotated with situation entity types. We show that our sequence labeling approach using distributional information in the form of Brown clusters, as well as syntactic-semantic features targeted to the task, is robust across genres, reaching accuracies of up to 76%.
Impact of Social Networks on the Spread of Disease