22 research outputs found

    An exploration of improvements to semi-supervised fuzzy c-means clustering for real-world biomedical data

    This thesis explores various detailed improvements to semi-supervised learning (using labelled data to guide clustering or classification of unlabelled data) with fuzzy c-means clustering (a ‘soft’ clustering technique which allows data patterns to be assigned to multiple clusters using membership values), with the primary aim of creating a semi-supervised fuzzy clustering algorithm that shows good performance on real-world data. Hence, there are two main objectives in this work. The first objective is to explore novel technical improvements to semi-supervised fuzzy c-means (ssFCM) that can address the problem of initialisation sensitivity and can improve results. The second objective is to apply the developed algorithm to real biomedical data, such as the Nottingham Tenovus Breast Cancer (NTBC) dataset, to create an automatic methodology for identifying stable subgroups which have previously been elicited semi-manually. Investigations were conducted into detailed improvements to the ssFCM algorithm framework, including a range of distance metrics, initialisation and feature selection techniques, and scaling parameter values. These methodologies were tested on different data sources to demonstrate their generalisation properties. Evaluation results were compared between methodologies to determine suitable techniques on various University of California, Irvine (UCI) benchmark datasets. Results were promising, suggesting that initialisation techniques, feature selection and scaling parameter adjustment can increase ssFCM performance. Based on these investigations, a novel ssFCM framework was developed and applied to the NTBC dataset, and various statistical and biological evaluations were conducted. This demonstrated highly significant improvement in agreement with previous classifications, with solutions that are biologically useful and clinically relevant in comparison with Soria's study [141]. On comparison with the latest NTBC study by Green et al. [63], similar clinical results have been observed, confirming stability of the subgroups. Two main contributions to knowledge have been made in this work. Firstly, the ssFCM framework has been improved through various technical refinements, which may be used together or separately. Secondly, the NTBC dataset has been successfully and automatically clustered (in a single algorithm) into clinical subgroups which had previously been elucidated semi-manually. While results are very promising, it is important to note that full, detailed validation of the framework has only been carried out on the NTBC dataset, and so there is a limit on the general conclusions that may be drawn. Future studies include applying the framework to other biomedical datasets and incorporating distance metric learning into ssFCM. In conclusion, an enhanced ssFCM framework has been proposed, and has been demonstrated to have highly significantly improved accuracy on the NTBC dataset.
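
The ‘soft’ assignment mentioned in the abstract above can be illustrated with the membership update of standard (unsupervised) fuzzy c-means. This is only a minimal sketch: it is not the thesis's ssFCM framework, and the function name `fcm_memberships`, the fuzzifier default `m=2.0` and the use of Euclidean distance are assumptions made for illustration.

```python
import math


def fcm_memberships(points, centres, m=2.0):
    """Standard fuzzy c-means membership update (illustrative, unsupervised).

    Each point receives one membership value per cluster centre; the values
    for a point sum to 1. `m` > 1 is the fuzzifier (assumed default 2.0).
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        dists = [math.dist(x, c) for c in centres]
        zero_flags = [d == 0.0 for d in dists]
        if any(zero_flags):
            # The point coincides with one or more centres: split full
            # membership among those centres to avoid dividing by zero.
            n_zero = sum(zero_flags)
            memberships.append([1.0 / n_zero if z else 0.0 for z in zero_flags])
            continue
        row = [
            1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
            for d_i in dists
        ]
        memberships.append(row)
    return memberships


# Hypothetical one-dimensional example with two cluster centres.
u = fcm_memberships([(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)],
                    [(0.0, 0.0), (4.0, 0.0)])
for row in u:
    print([round(v, 3) for v in row])
```

A point lying between the two centres receives partial membership of both clusters, with more weight on the nearer centre, which is the behaviour that distinguishes fuzzy from hard clustering.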

    Profiling Obese Subgroups in National Health and Nutritional Status Survey Data using Machine Learning Techniques: A Case Study from Brunei Darussalam

    The National Health and Nutritional Status Survey (NHANSS) is conducted annually by the Ministry of Health in Negara Brunei Darussalam to assess the health and nutritional patterns and characteristics of the population. The main aim of this study was to discover meaningful patterns (groups) in the obese sample of the NHANSS data by applying data reduction and interpretation techniques. The mixed nature of the variables (qualitative and quantitative) in the data set added novelty to the study. Accordingly, the Categorical Principal Component Analysis (CATPCA) technique was chosen to interpret the results. The relationships between obesity and lifestyle factors such as demography, socioeconomic status, physical activity, dietary behavior, and history of blood pressure and diabetes were determined based on the principal components generated by CATPCA. The results were validated using the split method to verify the authenticity of the generated groups. Based on the analysis and results, two subgroups were found in the data set, and the salient features of these subgroups have been reported. These results can be proposed for the betterment of the healthcare industry. Comment: A Case study of Obese Subgroups from Brunei Darussalam: 15 Pages, 4 figures, journa

    A methodology for automatic classification of breast cancer immunohistochemical data using semi-supervised fuzzy c-means

    Previously, a semi-manual method was used to identify six novel and clinically useful classes in the Nottingham Tenovus Breast Cancer dataset. 663 out of 1,076 patients were classified. The objectives of our work are threefold. Firstly, our primary objective is to use one single automatic method (post-initialisation) to reproduce the six classes for the 663 patients and to classify the remaining 413 patients. Secondly, we explore using semi-supervised fuzzy c-means (ssFCM) with various distance metrics and initialisation techniques to achieve this. Thirdly, the clinical characteristics of the 413 patients are examined by comparing them with those of the 663 patients. Our experiments use various amounts of labelled data and 10-fold cross validation to reproduce and evaluate the classification. ssFCM with Euclidean distance and the initialisation technique by Katsavounidis et al. produced the best results. It is then used to classify the 413 patients. Visual evaluation of the 413 patients’ classifications revealed characteristics in common with those previously reported. Examination of clinical characteristics indicates significant associations between classification and clinical parameters. More importantly, an association between classification and survival is shown by the survival curves.

    An Empirical Study of Cluster-Based MOEA/D Bare Bones PSO for Data Clustering

    Previously, cluster-based multi- or many-objective function techniques were proposed to reduce the Pareto set. Recently, researchers proposed such techniques to find better solutions in the objective space to solve engineering problems. In this work, we applied a cluster-based approach for solution selection in a multiobjective evolutionary algorithm based on decomposition (MOEA/D) with bare bones particle swarm optimization (BBPSO) for data clustering and investigated its clustering performance. In our previous work, we found that MOEA/D with BBPSO performed the best on 10 datasets. Here, we extend this work by applying a cluster-based approach tested on 13 UCI datasets. We compared it with six multiobjective evolutionary clustering algorithms from the existing literature and ten from our previous work. The proposed technique was found to perform well on datasets with highly overlapping clusters, such as CMC and Sonar. So far, we have found only one work that used a cluster-based MOEA for clustering data, the hierarchical topology multiobjective clustering algorithm. All other cluster-based MOEAs found were used to solve problems other than data clustering. By clustering Pareto solutions and evaluating new candidates against the found cluster representatives, local search is introduced in the solution selection process within the objective space, which can be effective on datasets with highly overlapping clusters. This is an added layer of search control in the objective space. The results are found to be promising, prompting different areas of future research, which are discussed, including the study of its effects with an increasing number of clusters as well as with other objective functions.

    A cluster analysis of population based cancer registry in Brunei Darussalam: an exploratory study

    Machine learning techniques have mostly been applied to gene expression cancer data. Socio-demographic data available in cancer registries could be explored to gain further insight into relationships between cancer types and their contributing factors. Moreover, less attention has been paid to analysing the mixed demographic data (numeric and categorical) from cancer registries and its association with cancer types. The aim of this study is to identify subgroups of patients with similar demographic characteristics in the population based cancer registry in Brunei Darussalam and to examine the prevalent cancer types in these subgroups. Four clustering algorithms are explored in the cluster analysis of the Brunei Darussalam Cancer Registry: Two-step, Partitioning Around Medoids, Agglomerative Hierarchical and Model-based. Gower distance was used to measure similarity for mixed data types. To evaluate the clusters found, cluster distribution and the Silhouette index were used for cluster quality, Cohen's Kappa Index for cluster stability and Cramer's V Coefficient for the clinical relevance of clusters. Six distinct demographic subgroups were consistently found by three algorithms, while the model-based clustering solutions were not considered for deeper analysis because highly imbalanced clusters were produced. The subgroups found have good cluster quality, moderate association with cancer types and high stability. The top three prevalent cancers associated with these subgroups were consistently identified using the three algorithms. Upon comparing the subgroups’ ages at diagnosis, we identify possible screening behaviours of specific subgroups, suggesting a need for early screening awareness programmes. This study demonstrates the use of cluster analysis in a cancer registry to identify demographic subgroups that could suggest potential areas in which to develop targeted and improved healthcare management strategies.
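
Since the abstract above relies on Gower distance for mixed numeric and categorical records, a minimal sketch of that measure may help. It assumes the common unweighted form: range-normalised absolute difference for numeric features, simple mismatch for categorical ones, averaged over all features. The function name, the `numeric_ranges` argument and the example records are hypothetical and are not taken from the study.

```python
def gower_distance(a, b, numeric_ranges):
    """Unweighted Gower distance between two mixed-type records.

    `numeric_ranges` maps the index of each numeric feature to its range
    (max - min) over the data set; every other index is treated as
    categorical. Missing values are not handled in this sketch.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            feature_range = numeric_ranges[i]
            # Numeric feature: absolute difference scaled into [0, 1].
            total += abs(x - y) / feature_range if feature_range > 0 else 0.0
        else:
            # Categorical feature: 0 if the categories match, 1 otherwise.
            total += 0.0 if x == y else 1.0
    return total / len(a)


# Hypothetical records: (age at diagnosis, sex, district).
patient_a = (30, "F", "urban")
patient_b = (50, "M", "urban")
print(gower_distance(patient_a, patient_b, numeric_ranges={0: 80}))
```

Because every per-feature contribution lies in [0, 1], the averaged distance does too, which lets numeric and categorical fields contribute on a comparable scale.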

    On Using Genetic Algorithm for Initialising Semi-supervised Fuzzy c-Means Clustering

    In a previous work, suitable initialisation techniques were incorporated with semi-supervised fuzzy c-means clustering (ssFCM) to improve clustering results on a trial and error basis. In this work, we present a single fully automatic version of an existing ssFCM framework which uses genetically modified prototypes (ssFCMGA). Initial prototypes are generated by a genetic algorithm (GA) to initialise the ssFCM algorithm without experimenting with different initialisation techniques. The framework is tested on a real biomedical dataset, NTBC, and on the Arrhythmia UCI dataset, using varying amounts of labelled data from 10% to 60% of the total data patterns. Different ssFCM threshold values and fitness functions for ssFCMGA are also investigated (sGAs). We used accuracy and NMI to measure class-label agreement, and the internal measures WSS, BSS, CH, CWB, DB and DU to evaluate the cluster quality of the clustering algorithms. Results are compared with those produced by the existing ssFCM. While ssFCMGA and sGAs produced a slightly lower level of agreement with known class labels than ssFCM based on accuracy and NMI, the other six measurements showed improvement in terms of compactness and well-separatedness (cluster quality), particularly when labelled data are low at 10%. Furthermore, the cluster quality is shown to improve further using ssFCMGA with a more complex fitness function (sGA2). This demonstrates that the application of GA in ssFCM improves cluster quality without exploration of different initialisation techniques.

    Segmentation for Multi-Rock Types on Digital Outcrop Photographs Using Deep Learning Techniques

    The basic identification and classification of sedimentary rocks into sandstone and mudstone are important in the study of sedimentology and are carried out by a sedimentologist. However, such manual activity involves countless hours of observation and data collection prior to any interpretation. When such activity is conducted in the field as part of an outcrop study, the sedimentologist is likely to be exposed to challenging conditions such as the weather and limited accessibility of the outcrops. This study uses high-resolution photographs acquired from a sedimentological study to test an alternative basic multi-rock identification through machine learning. While existing studies have effectively applied deep learning techniques to classify the rock types in field rock images, their approaches only handle a single rock-type classification per image. One study applied deep learning techniques to classify multi-rock types in each image; however, the test was performed on artificially overlaid images of different rock types in a test sample and not on naturally occurring rock surfaces of multiple rock types. To the best of our knowledge, no study has applied semantic segmentation to solve the multi-rock classification problem using digital photographs of multiple rock types. This paper presents the application of two state-of-the-art segmentation models, namely U-Net and LinkNet, to identify multiple rock types in digital photographs by segmenting the sandstone, mudstone, and background classes in a self-collected dataset of 102 images from a field in Brunei Darussalam. Four pre-trained networks, namely Resnet34, Inceptionv3, VGG16, and Efficientnetb7, were used as backbones for both models, and the performances of the individual models and their ensembles were compared. We also investigated the impact of image enhancement and different color representations on the performances of these segmentation models. The experimental results of this study show that among the individual models, LinkNet with Efficientnetb7 as a backbone had the best performance, with a mean intersection over union (MIoU) value of 0.8135 for all of the classes. The ensemble of U-Net models (with all four backbones) performed slightly better than LinkNet with Efficientnetb7, with an MIoU of 0.8201. When different color representations and image enhancements were explored, the best performance (MIoU = 0.8178) was observed for the L*a*b* color representation with Efficientnetb7 using U-Net segmentation. For the individual classes of interest (sandstone and mudstone), U-Net with Efficientnetb7 was found to be the best model for the segmentation. Thus, this study presents the potential of semantic segmentation in automating the reservoir characterization process, whereby patches of interest can be extracted from the rocks for much deeper study and modeling.
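
The MIoU figures quoted above are averages of per-class intersection over union. The following is a minimal sketch of that metric on flattened label maps. The function name `mean_iou`, the class encoding and the choice to skip classes that appear in neither map are assumptions for illustration; the paper's exact evaluation protocol is not given in the abstract.

```python
def mean_iou(pred, truth, num_classes):
    """Mean intersection over union for two flattened label maps.

    For each class c, IoU = |pred == c AND truth == c| / |pred == c OR truth == c|.
    Classes absent from both maps are skipped (an assumed convention).
    """
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical 6-pixel example: 0 = background, 1 = sandstone, 2 = mudstone.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, truth, num_classes=3))
```

In this toy case the per-class IoU values are 1/3, 2/3 and 1/2, so a single mislabelled pixel lowers the score for both the class it was assigned to and the class it belongs to.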

    FCD-AttResU-Net: An improved forest change detection in Sentinel-2 satellite images using attention residual U-Net

    Forest Change Detection (FCD) is a critical component of natural resource monitoring and conservation strategies, enabling informed decision-making. Various methods utilizing the power of artificial intelligence (AI) have been developed for detecting and categorizing changes in forest cover using remote sensing (RS) data. One prominent AI-powered approach is the U-Net, a deep learning (DL) architecture famous for its segmentation proficiency. However, the standard U-Net architecture fails to effectively capture intricate spatial dependencies and long-range contextual information present in remote sensing imagery. To address this research gap, we introduce an attention-residual-based novel DL model which leverages the U-Net architecture and Sentinel-2 satellite images to map alterations in forest vegetation cover in the tropical region. Our novel model enhances the U-Net architecture by seamlessly integrating the strengths of the U-Net, harnessing attention mechanisms strategically to amplify crucial features, and leveraging cutting-edge residual connections to facilitate the smooth flow of information and gradient propagation. These meticulous design choices enabled precise feature extraction, resulting in improved computational performance of the proposed method compared to the Standard U-Net, Deeplabv3+, Deep Res-U-Net, and Attention U-Net. The classification results demonstrate the enhanced efficiency of our model, achieving a Mean Intersection over Union (MIoU) of 0.9330 on our test dataset. This performance surpasses the Attention U-Net (0.9146), Standard U-Net (0.9029), Deeplabv3+ (0.9247), and Deep Res-U-Net (0.9282). The comparative analysis of ground truth reproductions unveiled the superior detection capabilities of our model in accurately identifying forest and non-forest polygons, surpassing both the standard U-Net and the U-Net augmented with an attention mechanism, along with other state-of-the-art techniques, thereby highlighting its enhanced efficacy. The model’s broad applicability can support forest managers and ecologists in rapidly evaluating the long-term ramifications of infrastructure initiatives, such as roads, on tropical forests, including those in Brunei.