Automated machine learning plankton taxonomy pipeline

Abstract

Plankton taxonomy can be treated as a multi-class classification problem. Current state-of-the-art approaches that combine machine learning with phytoplankton taxonomy, such as MorphoCluster, use a convolutional neural network as a feature extractor and hierarchical density-based clustering (HDBSCAN) to classify plankton and identify outliers. These convolutional feature extractors achieved accuracies of 0.78 during classification. However, they are trained on clean datasets: they perform very well on previously encountered, well-defined classes but poorly on the raw datasets expected in field deployment. Raw plankton datasets are unbalanced; some classes have only one or two samples, whereas others have thousands. They also exhibit many inter-class similarities alongside significant size differences, and the images can be low-resolution and noisy. Phytoplankton species are also highly biodiverse, so there is always a high chance of a network encountering unknown sample types. Some samples, such as the various body parts of organisms, are easily confused with the species itself. Marine experts classifying plankton tend to group ambiguous samples according to the highest order to which they are confident the samples belong. This practice produces a dataset containing conflicting classes and forces the feature extraction network to overfit during training. This research aims to address these issues and presents a feature extraction methodology built on existing research and novel concepts. The proposed algorithm uses feature extraction methods designed around real-world sample sets and offers an alternative approach to optimizing the features extracted and supplied to the clustering algorithm. The proposed feature extraction methods achieved scores of 0.821 when tested on the same datasets as the general feature extractor.
The algorithm also includes auxiliary softmax classification branches, which report the class predictions of the feature extraction models. These branches allow the clusters formed by running HDBSCAN on the extracted features to be labelled autonomously. The result is a fully automated, semi-supervised plankton taxonomy pipeline that achieves a classification score of 0.775 on a real-life sample set.

Thesis (MA) -- Faculty of Engineering, the Built Environment, and Technology, 202
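
The autonomous cluster-labelling step can be illustrated with a minimal sketch. The abstract does not state how the auxiliary branch predictions are combined into a cluster label, so the majority vote used here is an assumption, as are the function name `label_clusters`, the `min_agreement` threshold, and the example class names. The cluster indices stand in for HDBSCAN output, which conventionally marks outliers with -1.

```python
from collections import Counter, defaultdict

NOISE = -1  # conventional HDBSCAN label for samples assigned to no cluster


def label_clusters(cluster_ids, predicted_classes, min_agreement=0.6):
    """Give each cluster a class name by majority vote (assumed rule).

    cluster_ids: cluster index per sample, NOISE for outliers.
    predicted_classes: class predicted per sample by an auxiliary branch.
    min_agreement: hypothetical threshold; a cluster whose most common
        class covers a smaller share of its members is labelled None.
    """
    if len(cluster_ids) != len(predicted_classes):
        raise ValueError("one prediction is required per clustered sample")

    # Tally the predicted classes inside each cluster, skipping outliers.
    votes = defaultdict(Counter)
    for cluster_id, predicted in zip(cluster_ids, predicted_classes):
        if cluster_id != NOISE:
            votes[cluster_id][predicted] += 1

    labels = {}
    for cluster_id, counter in votes.items():
        top_class, count = counter.most_common(1)[0]
        share = count / sum(counter.values())
        labels[cluster_id] = top_class if share >= min_agreement else None
    return labels


# Illustrative values only; these are not results from the thesis.
cluster_ids = [0, 0, 0, 1, 1, -1, 1, 2, 2]
predicted = ["copepod", "copepod", "diatom", "diatom", "diatom",
             "copepod", "diatom", "copepod", "diatom"]
cluster_labels = label_clusters(cluster_ids, predicted)
print(cluster_labels)
```

In this sketch, a cluster without a sufficiently dominant class is left unlabelled rather than forced into a class, which is one possible way to handle the ambiguous samples described above.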
