Relation Extraction Datasets in the Digital Humanities Domain and their Evaluation with Word Embeddings
In this research, we manually create high-quality datasets in the digital
humanities domain for the evaluation of language models, specifically word
embedding models. The first step comprises the creation of unigram and n-gram
datasets for two fantasy novel book series, covering two task types each:
analogy and doesn't-match. This is followed by training models on the two book
series with various popular word embedding model types such as word2vec, GloVe,
fastText, and LexVec. Finally, we evaluate the suitability of word embedding
models for such specific relation extraction tasks given comparably small
corpus sizes. In the evaluations, we also investigate and analyze particular
aspects such as the impact of corpus term frequencies and task difficulty on
accuracy. The datasets, the underlying system, and the word embedding models
are available on GitHub and can easily be extended with new datasets and
tasks, be used to reproduce the presented results, or be transferred to other
domains.
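The two task types can be sketched in a few lines of NumPy: an analogy task is answered by nearest-neighbour search around vec(b) - vec(a) + vec(c), and a doesn't-match task by picking the word least similar on average to the rest. The words, 2-D vectors, and function names below are illustrative stand-ins, not the paper's datasets or implementation (which evaluates trained word2vec/GloVe-style models).

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors):
    """a is to b as c is to ? -- nearest neighbour of b - a + c, excluding inputs."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(target, candidates[w]))

def doesnt_match(words, vectors):
    """Return the word with the lowest mean similarity to the others (the intruder)."""
    def mean_sim(w):
        return sum(cos(vectors[w], vectors[o]) for o in words if o != w) / (len(words) - 1)
    return min(words, key=mean_sim)

# Toy 2-D embedding space standing in for a model trained on a novel corpus.
vectors = {
    "man":    np.array([1.0, 0.0]),
    "woman":  np.array([0.0, 1.0]),
    "king":   np.array([1.0, 1.0]),
    "queen":  np.array([0.1, 1.0]),
    "castle": np.array([0.5, 0.5]),
    "apple":  np.array([-1.0, 0.2]),
}

print(analogy("man", "king", "woman", vectors))                      # -> queen
print(doesnt_match(["king", "queen", "castle", "apple"], vectors))   # -> apple
```

Accuracy on a dataset is then simply the fraction of such tasks the model answers correctly.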
Translation Alignment Applied to Historical Languages: methods, evaluation, applications, and visualization
Translation alignment is an essential task in Digital Humanities and Natural
Language Processing: it aims to link words and phrases in the source text
with their equivalents in the translation. In addition to its importance in
teaching and learning historical languages, translation alignment builds
bridges between ancient and modern languages through which various linguistic
annotations can be transferred. This thesis focuses on word-level translation
alignment applied to historical languages in general and Ancient Greek and
Latin in particular. As the title indicates, the thesis addresses four
interdisciplinary aspects of translation alignment.
The starting point was developing Ugarit, an interactive annotation tool for
manual alignment, with the aim of gathering training data for an automatic
alignment model. This effort resulted in more than 190k accurate translation
pairs that I later used for supervised training. Ugarit has been used by many
researchers and scholars, as well as in classrooms at several institutions
for teaching and learning ancient languages. This resulted in a large,
diverse, crowd-sourced aligned parallel corpus that allowed us to conduct
experiments and qualitative analyses to detect recurring patterns in
annotators' alignment practice and in the generated translation pairs.
Further, I employed recent advances in NLP and language modeling to develop
an automatic alignment model for historical low-resource languages,
experimenting with various training objectives and proposing a training
strategy for historical languages that combines supervised and unsupervised
training on mono- and multilingual texts. I then integrated this alignment
model into other development workflows to project cross-lingual annotations
and induce bilingual dictionaries from parallel corpora.
Evaluation is essential to assess the quality of any model. To ensure best
practice, I reviewed the current evaluation procedure, identified its
limitations, and proposed two new evaluation metrics. Moreover, I introduced
a visual analytics framework to explore and inspect alignment gold-standard
datasets and to support quantitative and qualitative evaluation of
translation alignment models. In addition, I designed and implemented visual
analytics tools and reading environments for parallel texts and proposed
various visualization approaches to support different alignment-related
tasks, employing the latest advances in information visualization and best
practices. Overall, this thesis presents a comprehensive study comprising
manual and automatic alignment techniques, evaluation methods, and visual
analytics tools that aim to advance the field of translation alignment for
historical languages.
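Embedding-based word alignment of the kind described above can be illustrated with one common heuristic: score every source/target token pair by cosine similarity in a shared vector space, then keep the mutually best-matching pairs. This is a minimal sketch of that family of methods, not the thesis's actual model; all tokens and vectors below are invented for illustration.

```python
import numpy as np

def align(src_vecs, tgt_vecs):
    """Mutual-argmax word alignment over a cosine similarity matrix."""
    S = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    T = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sim = S @ T.T                               # shape: (src_len, tgt_len)
    pairs = []
    for i in range(sim.shape[0]):
        j = int(np.argmax(sim[i]))              # best target for source token i...
        if int(np.argmax(sim[:, j])) == i:      # ...and i is the best source for j
            pairs.append((i, j))
    return pairs

# Hypothetical shared-space vectors for a Latin sentence and its translation.
src_tokens = ["puer", "currit"]
tgt_tokens = ["the", "boy", "runs"]
src_vecs = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
tgt_vecs = np.array([[0.1, 0.1, 0.9],
                     [0.9, 0.1, 0.0],
                     [0.1, 0.9, 0.1]])

for i, j in align(src_vecs, tgt_vecs):
    print(src_tokens[i], "->", tgt_tokens[j])   # puer -> boy, currit -> runs
```

Note how the function word "the" is left unaligned: neither source token picks it as a mutual best match, which mirrors the one-to-null alignments common in gold-standard data.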
Creation and Evaluation of Datasets for Distributional Semantics Tasks in the Digital Humanities Domain
Word embeddings are already well studied in the general domain, usually
trained on large text corpora, and have been evaluated, for example, on word
similarity and analogy tasks, but also as input to downstream NLP processes.
In contrast, in this work we explore the suitability of word embedding
technologies in the specialized digital humanities domain. After training
embedding models of various types on two popular fantasy novel book series,
we evaluate their performance on two task types: term analogies and word
intrusion. To this end, we manually construct test datasets with domain
experts. Among the contributions is the evaluation of various word embedding
techniques on the different task types, with the finding that even embeddings
trained on small corpora perform well, for example on the word intrusion
task. Furthermore, we provide extensive, high-quality datasets in the digital
humanities domain for further investigation, as well as the implementation
needed to easily reproduce or extend the experiments.
TiFi: Taxonomy Induction for Fictional Domains [Extended version]
Taxonomies are important building blocks of structured knowledge bases, and their construction from text sources and Wikipedia has received much attention. In this paper we focus on the construction of taxonomies for fictional domains, using noisy category systems from fan wikis or text extraction as input. Such fictional domains are archetypes of entity universes that are poorly covered by Wikipedia, as are enterprise-specific knowledge bases and highly specialized verticals. Our fiction-targeted approach, called TiFi, consists of three phases: (i) category cleaning, identifying candidate categories that truly represent classes in the domain of interest; (ii) edge cleaning, selecting subcategory relationships that correspond to class subsumption; and (iii) top-level construction, mapping classes onto a subset of high-level WordNet categories. A comprehensive evaluation shows that TiFi is able to construct taxonomies for a diverse range of fictional domains such as Lord of the Rings, The Simpsons, or Greek Mythology with very high precision, and that it outperforms state-of-the-art baselines for taxonomy induction by a substantial margin.
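One reason edge cleaning is necessary is that noisy wiki category systems are not valid taxonomies: subcategory links can form cycles, while class subsumption must be acyclic. The sketch below shows a simple greedy filter that drops cycle-closing edges; this is a simplified stand-in for illustration only, not TiFi's actual edge-selection method, and the category names are invented examples.

```python
def creates_cycle(graph, child, parent):
    """Would adding child -> parent close a cycle? True iff parent already reaches child."""
    stack, seen = [parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return False

def clean_edges(edges):
    """Greedily keep subcategory edges, dropping any edge that would make the graph cyclic."""
    graph, kept = {}, []
    for child, parent in edges:
        if not creates_cycle(graph, child, parent):
            graph.setdefault(child, []).append(parent)
            kept.append((child, parent))
    return kept

# Hypothetical noisy category edges from a fan wiki; the last edge closes a cycle.
noisy_edges = [
    ("Hobbits", "Peoples of Middle-earth"),
    ("Peoples of Middle-earth", "Characters"),
    ("Characters", "Hobbits"),
]
kept = clean_edges(noisy_edges)
print(kept)   # the cycle-closing third edge is dropped
```

A real system would additionally score each surviving edge for whether it expresses genuine subsumption rather than mere thematic association, which is where the learning-based phases of an approach like TiFi come in.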
Supporting Methodology Transfer in Visualization Research with Literature-Based Discovery and Visual Text Analytics
The increasing specialization of science is driving the rapid fragmentation
of well-established disciplines into interdisciplinary communities. This
decomposition can be observed in a type of visualization research known as
problem-driven visualization research, in which teams of visualization
experts and domain experts collaborate in a specific area of knowledge such
as the digital humanities, bioinformatics, computer security, or sports
science. This thesis proposes a series of methods inspired by recent advances
in automatic text analysis and knowledge representation to promote proper
communication and knowledge transfer between these communities. The resulting
methods were combined in a visual text analytics interface oriented toward
scientific discovery, GlassViz, which was designed with these goals in mind.
The tool was first tested in the digital humanities domain to explore a
massive corpus of general-purpose visualization articles. GlassViz was
adapted in a later study to support different data sources representative of
these communities, showing evidence that the proposed approach is also a
valid alternative for addressing the problem of fragmentation in
visualization research.