Extracting knowledge from complex unstructured corpora: Text classification and a case study on the safeguarding domain

Abstract

Advances in internet, data collection, and data sharing technologies have led to an increase in the amount of unstructured information in the form of news, articles, and social media. Additionally, many specialised domains, such as medicine, law, and the social sciences, use unstructured documents as a main platform for collecting, storing, and sharing domain-specific knowledge. However, manually processing these documents is resource-consuming and error-prone, especially when the volume of documents that need annotating grows over time. Therefore, automated information extraction techniques have been widely used to analyse text efficiently and discover patterns. Specifically, text classification methods have become valuable in specialised domains for organising content, such as patient notes, and for supporting fast topic-based retrieval of information. However, many specialised domains suffer from a lack of data and from class imbalance because documents are hard to obtain. In addition, manual annotation needs to be performed by experts, which can be costly. This makes the application of supervised classification approaches challenging. In this thesis, we research methods for improving the performance of text classifiers in specialised domains with limited amounts of data and highly domain-specific terminology, where documents are annotated by domain experts. First, we study the applicability of traditional feature enhancement approaches that use publicly available resources to improve classifier performance in specialised domains. Then, we conduct extensive research into the suitability of existing classification algorithms and the importance of both domain-specific and task-specific data for few-shot classification, which helps identify classification strategies applicable to small datasets.
This provides the basis for developing a methodology for improving classifier performance in few-shot settings using text generation-based data augmentation techniques. Specifically, we aim to improve the quality of the generated data by using strategies for selecting class-representative samples from the original dataset, which are then used to produce additional training instances. We perform extensive analysis, considering multiple strategies, datasets, and few-shot text classification settings. Our study uses a corpus of safeguarding reports as an exemplary case study of a specialised domain with a small volume of data. The safeguarding reports contain valuable information about learning experiences and reflections on tackling serious crimes involving children and vulnerable adults. They carry great potential to improve multiagency work and help develop better crime prevention strategies. However, the lack of centralised access and the constant growth of the collection make manual analysis of the reports infeasible. Therefore, we collaborated with the Crime and Security Research Institute (CSRI) at Cardiff University to create a Wales Safeguarding Repository (WSR) that provides centralised access to the safeguarding reports and a means of automatic information extraction. The aim of the repository is to make the collection efficiently searchable and thus help free up resources and assist practitioners from health and social care agencies in making faster and more accurate decisions. In particular, we apply the methods identified in the thesis to support automated annotation of the documents using a thematic framework created by subject-matter experts. Our close work with domain experts throughout the thesis allowed us to incorporate expert knowledge into the classification and augmentation techniques, which proved beneficial for improving automated supervised methods for specialised domains.
