Improving the translation environment for professional translators
When computer-aided translation systems are used in a typical professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view and from a purely technological one.
This paper describes the SCATE research on improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.
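As background for the first of these topics: fuzzy matching in translation-memory systems is commonly based on an edit-distance-style similarity between the query segment and stored source segments. The sketch below is a minimal illustration of that general technique, not the SCATE implementation; the 0.7 threshold and the (source, target) tuple format are assumptions for the example.

```python
from difflib import SequenceMatcher

# Minimal fuzzy-match sketch for a translation memory (TM).
# Illustrates the general technique only, not SCATE's method; the 0.7
# threshold and the (source, target) tuple format are assumptions.

def fuzzy_matches(query, tm_entries, threshold=0.7):
    """Return TM entries whose source segment is similar to the query.

    tm_entries: list of (source_segment, target_segment) tuples.
    Similarity is a character-level ratio in [0, 1]; production TM
    systems typically use token-level edit distance instead.
    """
    results = []
    for source, target in tm_entries:
        score = SequenceMatcher(None, query, source).ratio()
        if score >= threshold:
            results.append((score, source, target))
    return sorted(results, reverse=True)

tm = [("The cat sat on the mat.", "De kat zat op de mat."),
      ("Dogs bark loudly.", "Honden blaffen luid.")]
print(fuzzy_matches("The cat sat on a mat.", tm))
```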
The Gutenberg English Poetry Corpus: Exemplary Quantitative Narrative Analyses
This paper describes a corpus of about 3,000 English literary texts with about 250 million words extracted from the Gutenberg project, spanning a range of genres from both fiction and non-fiction and written by more than 130 authors (e.g., Darwin, Dickens, Shakespeare). Quantitative narrative analysis (QNA) is used to explore a cleaned subcorpus, the Gutenberg English Poetry Corpus (GEPC), which comprises over 100 poetic texts with around two million words from about 50 authors (e.g., Keats, Joyce, Wordsworth). Exemplary QNA studies show author similarities based on latent semantic analysis, significant topics for each author, and various text-analytic metrics for George Eliot’s poem “How Lisa Loved the King” and James Joyce’s “Chamber Music,” concerning, e.g., lexical diversity and sentiment. The GEPC is particularly suited for research in Digital Humanities, Computational Stylistics, or Neurocognitive Poetics, e.g., as a training and test corpus for stimulus development and control in empirical studies.
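To make one of the mentioned metrics concrete, lexical diversity is often measured as the type-token ratio (TTR): the number of distinct word types divided by the total number of tokens. The sketch below is a generic illustration of that measure, not the paper's exact pipeline; the lowercasing and regex tokenization are assumptions.

```python
import re

# Type-token ratio (TTR) as a simple lexical-diversity measure.
# Generic illustration, not the GEPC paper's exact pipeline; the
# lowercasing and \w+ tokenization are assumptions.

def type_token_ratio(text):
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

sample = ("I love thee with the breath, smiles, tears, of all my life; "
          "and, if God choose, I shall but love thee better after death.")
print(f"TTR = {type_token_ratio(sample):.3f}")
```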
Term-community-based topic detection with variable resolution
Network-based procedures for topic detection in huge text collections offer an intuitive alternative to probabilistic topic models. We present in detail a method that is designed specifically with the requirements of domain experts in mind. Like similar methods, it employs community detection in term co-occurrence graphs, but it is enhanced by a resolution parameter that can be used to change the targeted topic granularity. We also establish a term ranking and use semantic word embeddings to present term communities in a way that facilitates their interpretation. We demonstrate the application of our method on a widely used corpus of general news articles and show the results of detailed social-sciences expert evaluations of detected topics at various resolutions. A comparison with topics detected by Latent Dirichlet Allocation is also included. Finally, we discuss factors that influence topic interpretation.
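The resolution-parameter idea can be illustrated with standard graph tooling: Louvain community detection on a term co-occurrence graph, where a higher resolution yields more, smaller communities (finer-grained topics). The sketch below uses networkx's built-in Louvain implementation as a stand-in; the toy graph and edge weights are invented, and the paper's actual procedure additionally includes term ranking and embedding-based presentation not shown here.

```python
import networkx as nx

# Louvain community detection on a toy term co-occurrence graph.
# Stand-in illustration: the paper's full method adds term ranking and
# embedding-based presentation; the graph and edge weights are made up.
G = nx.Graph()
G.add_weighted_edges_from([
    ("market", "stocks", 5), ("stocks", "trading", 4),
    ("market", "trading", 3), ("election", "vote", 6),
    ("vote", "ballot", 4), ("election", "ballot", 3),
    ("stocks", "election", 1),  # weak cross-topic link
])

# A higher resolution favors more, smaller communities (finer topics).
for resolution in (0.5, 1.0, 2.0):
    communities = nx.community.louvain_communities(
        G, weight="weight", resolution=resolution, seed=42
    )
    print(resolution, [sorted(c) for c in communities])
```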
CrossNER: Evaluating Cross-Domain Named Entity Recognition
Cross-domain named entity recognition (NER) models can cope with the scarcity of NER samples in target domains. However, most existing NER benchmarks lack domain-specialized entity types or do not focus on a specific domain, leading to less effective cross-domain evaluation. To address these obstacles, we introduce CrossNER, a fully labeled cross-domain NER dataset spanning five diverse domains with specialized entity categories for each domain. We also provide a domain-related corpus, since using it to continue pre-training language models (domain-adaptive pre-training) is effective for domain adaptation. We then conduct comprehensive experiments to explore the effectiveness of leveraging different levels of the domain corpus and different pre-training strategies for domain-adaptive pre-training on the cross-domain task. Results show that focusing on the fraction of the corpus containing domain-specialized entities and using a more challenging pre-training strategy during domain-adaptive pre-training are beneficial for NER domain adaptation, and our proposed method consistently outperforms existing cross-domain NER baselines. Nevertheless, the experiments also illustrate the difficulty of this cross-domain NER task. We hope that our dataset and baselines will catalyze research in NER domain adaptation. The code and data are available at https://github.com/zliucr/CrossNER.
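Domain-adaptive pre-training, as described here, generally means continuing a model's masked-language-modeling objective on in-domain text before fine-tuning on the downstream task. The sketch below shows that generic recipe with Hugging Face transformers; it is not the authors' exact training setup, and the corpus file name "domain_corpus.txt" and all hyperparameters are placeholders.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Generic domain-adaptive pre-training recipe (continued MLM training);
# not the CrossNER authors' exact setup. The file "domain_corpus.txt"
# and all hyperparameters are placeholders.
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dapt-out", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()  # afterwards, fine-tune the adapted encoder for NER
```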