The anatomy of a search and mining system for digital humanities: Search And Mining Tools for Language Archives (SAMTLA)
Abstract
Humanities researchers are faced with an overwhelming volume of digitised
primary source material, and "born digital" information, of relevance to their
research as a result of large-scale digitisation projects. The current digital tools
do not provide consistent support for analysing the content of digital archives
that are potentially large in scale, multilingual, and come in a range of data
formats. The current language-dependent, or project specific, approach to tool
development often puts the tools out of reach for many research disciplines in
the humanities. In addition, the tools can be incompatible with the way
researchers locate and compare the relevant sources. For instance, researchers
are interested in shared structural text patterns, known as "parallel passages",
that describe a specific cultural, social, or historical context relevant to their
research topic. Identifying these shared structural text patterns is challenging
due to their repeated yet highly variable nature, as a result of differences in
the domain, author, language, time period, and orthography.
The contribution of the thesis is a novel infrastructure that directly addresses
the need for generic, flexible, extendable, and sustainable digital tools
that are applicable to a wide range of digital archives and research in the
humanities. The infrastructure adopts a character-level n-gram Statistical
Language Model (SLM), stored in a space-optimised k-truncated suffix tree
data structure as its underlying data model. A character-level n-gram model
is a relatively new approach that is competitive with word-level n-gram models,
but has the added advantage that it is domain- and language-independent,
requiring little or no preprocessing of the document text, unlike word-level
models, which require some form of language-dependent tokenisation and stemming.
Character-level n-grams capture word internal features that are ignored
by word-level n-gram models, which provides greater flexibility in addressing
the information need of the user through tolerant search, and compensation
for erroneous query specification or spelling errors in the document text. Furthermore,
the SLM provides a unified approach to information retrieval and
text mining, where traditional approaches have tended to adopt separate data
models that are often ad-hoc or based on heuristic assumptions. In addition,
the performance of the character-level n-gram SLM was formally evaluated
through crowdsourcing, which demonstrated that its retrieval performance is
close to human-level performance.
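
As an illustration of the data model described above, the following is a minimal sketch of a k-truncated suffix trie that counts character n-grams and derives a maximum-likelihood conditional probability from those counts. The class name, method names, and the unsmoothed estimate are assumptions made for this example; the sketch does not reproduce the thesis's space optimisations or its actual language-model scoring.

```python
class KTruncatedSuffixTrie:
    """Illustrative trie holding counts for every character n-gram of length <= k."""

    def __init__(self, k):
        self.k = k
        self.root = {"count": 0, "children": {}}

    def add(self, text):
        # Insert the first k characters of every suffix of the text.
        for start in range(len(text)):
            node = self.root
            node["count"] += 1
            for ch in text[start:start + self.k]:
                node = node["children"].setdefault(
                    ch, {"count": 0, "children": {}}
                )
                node["count"] += 1

    def count(self, gram):
        # Number of positions at which `gram` occurs (0 if longer than k).
        node = self.root
        for ch in gram:
            node = node["children"].get(ch)
            if node is None:
                return 0
        return node["count"]

    def cond_prob(self, context, ch):
        # Unsmoothed estimate of P(ch | context), using at most k-1
        # characters of context. A real model would add smoothing.
        context = context[max(0, len(context) - (self.k - 1)):]
        denom = self.count(context)
        if denom == 0:
            return 0.0
        return self.count(context + ch) / denom


trie = KTruncatedSuffixTrie(k=3)
trie.add("banana")
print(trie.count("an"))          # occurrences of the bigram "an"
print(trie.cond_prob("a", "n"))  # estimate of P("n" | "a")
```

Because the trie is built directly from raw characters, no tokenisation or stemming step is needed before indexing, which is the property the abstract highlights.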
The proposed infrastructure supports the development of Samtla (Search
And Mining Tools for Language Archives), which provides humanities researchers
with digital tools for search, browsing, and text mining of digital archives
in any domain or language, within a single system. Samtla supersedes many of
the existing tools for humanities researchers by supporting the same or similar
functionality, but with a domain-independent and language-independent
approach. The functionality includes a browsing tool constructed
from the metadata and named entities extracted from the document text, and a
hybrid recommendation system for recommending related queries and documents.
Other tools are novel and were developed in response to
the specific needs of the researchers, such as the document comparison tool
for visualising shared sequences between groups of related documents. Furthermore,
Samtla is the first practical example of a system with an SLM as
its primary data model that supports the real research needs of several case
studies covering different areas of research in the humanities.
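
To make the idea of shared sequences between documents concrete, the sketch below finds the character n-grams that two texts have in common, along with their offsets in each text. The function names and the fixed n-gram length are assumptions for illustration only; this is not the comparison algorithm used by Samtla.

```python
def char_ngrams(text, n):
    """Map each character n-gram to the list of offsets where it starts."""
    positions = {}
    for i in range(len(text) - n + 1):
        positions.setdefault(text[i:i + n], []).append(i)
    return positions


def shared_ngrams(doc_a, doc_b, n):
    """Return n-grams present in both documents with their offsets in each."""
    grams_a = char_ngrams(doc_a, n)
    grams_b = char_ngrams(doc_b, n)
    return {g: (grams_a[g], grams_b[g]) for g in grams_a.keys() & grams_b.keys()}


shared = shared_ngrams("the king said", "a king spoke", 5)
for gram in sorted(shared):
    print(repr(gram), shared[gram])
```

Overlapping shared n-grams at consecutive offsets could then be merged into longer aligned spans for display, which is one plausible way to surface variable but related passages.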