Improving the translation environment for professional translators
A typical professional translation workflow built around computer-aided translation systems has several stages with room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, from both a human-computer interaction point of view and a purely technological one.
This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.
The Gutenberg English Poetry Corpus: Exemplary Quantitative Narrative Analyses
This paper describes a corpus of about 3,000 English literary texts with about
250 million words extracted from the Gutenberg project that span a range of
genres from both fiction and non-fiction written by more than 130 authors
(e.g., Darwin, Dickens, Shakespeare). Quantitative narrative analysis (QNA) is
used to explore a cleaned subcorpus, the Gutenberg English Poetry Corpus
(GEPC), which comprises over 100 poetic texts with around two million words
from about 50 authors (e.g., Keats, Joyce, Wordsworth). Some exemplary QNA
studies show author similarities based on latent semantic analysis,
significant topics for each author or various text-analytic metrics for George
Eliot’s poem “How Lisa Loved the King” and James Joyce’s “Chamber Music,”
concerning, e.g., lexical diversity or sentiment analysis. The GEPC is
particularly suited for research in Digital Humanities, Computational
Stylistics, or Neurocognitive Poetics, e.g., as a training and test corpus for
stimulus development and control in empirical studies.
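As a small illustration of the kind of lexical-diversity metric mentioned in the abstract above, the sketch below computes a type-token ratio (distinct words divided by total words). The abstract does not say which diversity measure the authors used, so the choice of type-token ratio, the function name `type_token_ratio`, and the simple regular-expression tokenizer are all assumptions made for illustration.

```python
import re


def type_token_ratio(text):
    """Return distinct tokens / total tokens for `text` (0.0 if it has no tokens).

    Hypothetical helper: the tokenizer keeps only runs of ASCII letters and
    apostrophes, which is a simplification rather than the authors' pipeline.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    # Five tokens, four distinct ("the" repeats).
    print(type_token_ratio("The cat saw the dog"))
```

The type-token ratio falls as texts get longer, so comparisons across poems of very different lengths would need a length-corrected measure.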
Term-community-based topic detection with variable resolution
Network-based procedures for topic detection in huge text collections offer
an intuitive alternative to probabilistic topic models. We present in detail a
method that is especially designed with the requirements of domain experts in
mind. Like similar methods, it employs community detection in term
co-occurrence graphs, but it is enhanced by including a resolution parameter
that can be used for changing the targeted topic granularity. We also establish
a term ranking and use semantic word-embedding for presenting term communities
in a way that facilitates their interpretation. We demonstrate the application
of our method with a widely used corpus of general news articles and show the
results of detailed social-sciences expert evaluations of detected topics at
various resolutions. A comparison with topics detected by Latent Dirichlet
Allocation is also included. Finally, we discuss factors that influence topic
interpretation.
Comment: 31 pages, 6 figures
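The general idea in the abstract above, community detection in a term co-occurrence graph with a resolution parameter that controls topic granularity, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the greedy pairwise merging stands in for whatever community-detection algorithm the paper uses, the quality function is the standard generalized modularity with a resolution factor, and all names (`cooccurrence_graph`, `modularity`, `detect_term_communities`, `TOY_DOCS`) are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Toy corpus (assumption, for illustration only): two themes plus one bridging document.
TOY_DOCS = [
    "election vote party", "election vote party",
    "match goal team", "match goal team",
    "party team",
]


def cooccurrence_graph(docs):
    """Edge weight = number of documents in which both terms occur."""
    weights = Counter()
    for doc in docs:
        terms = sorted(set(doc.lower().split()))
        for a, b in combinations(terms, 2):
            weights[(a, b)] += 1
    return weights


def modularity(weights, communities, resolution=1.0):
    """Generalized modularity: sum over c of w_in(c)/m - resolution * (deg(c)/(2m))**2."""
    m = sum(weights.values())
    label = {term: i for i, comm in enumerate(communities) for term in comm}
    inside, degree = Counter(), Counter()
    for (a, b), w in weights.items():
        degree[label[a]] += w
        degree[label[b]] += w
        if label[a] == label[b]:
            inside[label[a]] += w
    return sum(inside[i] / m - resolution * (degree[i] / (2 * m)) ** 2
               for i in range(len(communities)))


def detect_term_communities(docs, resolution=1.0):
    """Greedily merge the pair of communities that most improves modularity."""
    weights = cooccurrence_graph(docs)
    terms = sorted({t for edge in weights for t in edge})
    communities = [frozenset([t]) for t in terms]
    best = modularity(weights, communities, resolution)
    while len(communities) > 1:
        candidates = []
        for i, j in combinations(range(len(communities)), 2):
            merged = [c for k, c in enumerate(communities) if k not in (i, j)]
            merged.append(communities[i] | communities[j])
            candidates.append((modularity(weights, merged, resolution), merged))
        score, merged = max(candidates, key=lambda item: item[0])
        if score <= best:  # no merge improves the partition any further
            break
        best, communities = score, merged
    return sorted(sorted(c) for c in communities)


if __name__ == "__main__":
    for gamma in (0.1, 1.0, 5.0):
        print(gamma, detect_term_communities(TOY_DOCS, resolution=gamma))
```

On this toy corpus, a resolution of 1.0 separates the politics terms from the sports terms, a low value such as 0.1 merges everything into a single community, and a high value such as 5.0 leaves every term on its own. The exhaustive pairwise search is only practical for very small vocabularies.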