The anatomy of a search and mining system for digital humanities : Search And Mining Tools for Language Archives (SAMTLA)

Harris, Martyn

The anatomy of a search and mining system for digital humanities : Search And Mining Tools for Language Archives (SAMTLA)

Authors: Martyn Harris
Publication date
Publisher

Abstract

Humanities researchers are faced with an overwhelming volume of digitised primary source material, and "born digital" information, of relevance to their research as a result of large-scale digitisation projects. The current digital tools do not provide consistent support for analysing the content of digital archives that are potentially large in scale, multilingual, and come in a range of data formats. The current language-dependent, or project specific, approach to tool development often puts the tools out of reach for many research disciplines in the humanities. In addition, the tools can be incompatible with the way researchers locate and compare the relevant sources. For instance, researchers are interested in shared structural text patterns, known as \parallel passages" that describe a specific cultural, social, or historical context relevant to their research topic. Identifying these shared structural text patterns is challenging due to their repeated yet highly variable nature, as a result of differences in the domain, author, language, time period, and orthography. The contribution of the thesis is a novel infrastructure that directly addresses the need for generic, flexible, extendable, and sustainable digital tools that are applicable to a wide range of digital archives and research in the humanities. The infrastructure adopts a character-level n-gram Statistical Language Model (SLM), stored in a space-optimised k-truncated suffix tree data structure as its underlying data model. A character-level n-gram model is a relatively new approach that is competitive with word-level n-gram models, but has the added advantage that it is domain and language-independent, requiring little or no preprocessing of the document text unlike word-level models that require some form of language-dependent tokenisation and stemming. Character-level n-grams capture word internal features that are ignored by word-level n-gram models, which provides greater exibility in addressing the information need of the user through tolerant search, and compensation for erroneous query specification or spelling errors in the document text. Furthermore, the SLM provides a unified approach to information retrieval and text mining, where traditional approaches have tended to adopt separate data models that are often ad-hoc or based on heuristic assumptions. In addition, the performance of the character-level n-gram SLM was formally evaluated through crowdsourcing, which demonstrates that the retrieval performance of the SLM is close to that of the human level performance. The proposed infrastructure, supports the development of the Samtla (Search And Mining Tools for Language Archives), which provides humanities researchers digital tools for search, browsing, and text mining of digital archives in any domain or language, within a single system. Samtla supersedes many of the existing tools for humanities researchers, by supporting the same or similar functionality of the systems, but with a domain-independent and languageindependent approach. The functionality includes a browsing tool constructed from the metadata and named entities extracted from the document text, a hybrid-recommendation system for recommending related queries and documents. However, some tools are novel tools and developed in response to the specific needs of the researchers, such as the document comparison tool for visualising shared sequences between groups of related documents. Furthermore, Samtla is the first practical example of a system with a SLM as its primary data model that supports the real research needs of several case studies covering different areas of research in the humanities

Similar works

Full text

Open in the Core reader

Download PDF

Available Versions

Birkbeck Institutional Research Online

oai:eprints.bbk.ac.uk.oai2:402...

Last time updated on 15/09/2020