984 research outputs found

    The effect of noise and sample size on an unsupervised feature selection method for manifold learning

    The research on unsupervised feature selection is scarce in comparison to that for supervised models, despite the fact that it is an important issue for many clustering problems. An unsupervised feature selection method for general Finite Mixture Models was recently proposed and subsequently extended to Generative Topographic Mapping (GTM), a manifold learning constrained mixture model that provides data visualization. Some results of a previous partial assessment of this unsupervised feature selection method for GTM suggested that its performance may be affected by insufficient sample size and by noisy data. In this brief study, we test such limitations of the method in some detail.
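
A rough illustration of the kind of sensitivity experiment described above: the sketch below (our code, not the authors') generates synthetic clusters with added uninformative noise features and tracks how a crude per-feature relevance score degrades as the sample size shrinks. `GaussianMixture` and the variance-ratio proxy are stand-ins for the GTM-based saliency estimates of the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def make_data(n, n_informative=2, n_noise=8):
    """Two well-separated clusters in the informative features,
    pure N(0, 1) noise in the remaining ones."""
    labels = rng.integers(0, 2, size=n)
    centres = np.where(labels[:, None] == 0, -2.0, 2.0)
    informative = centres + rng.normal(size=(n, n_informative))
    noise = rng.normal(size=(n, n_noise))
    return np.hstack([informative, noise])

def relevance_proxy(X, n_components=2):
    """Crude per-feature relevance: spread of the fitted component means
    relative to the within-component spread (a stand-in for saliencies)."""
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(X)
    between = gmm.means_.std(axis=0)
    within = np.sqrt(gmm.covariances_.mean(axis=0).diagonal())
    return between / within

for n in (50, 200, 1000, 5000):
    # With small n, the noise features' scores creep towards the
    # informative features' scores, blurring the selection.
    print(n, relevance_proxy(make_data(n)).round(2))
```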

    Visualisation of heterogeneous data with simultaneous feature saliency using Generalised Generative Topographic Mapping

    Most machine-learning algorithms are designed for datasets with features of a single type, whereas very little attention has been given to datasets with mixed-type features. We recently proposed a model to handle mixed types with a probabilistic latent variable formalism. This model, called the generalised generative topographic mapping (GGTM), describes the data by type-specific distributions that are conditionally independent given the latent space. It has often been observed that visualisations of high-dimensional datasets can be poor in the presence of noisy features. In this paper we therefore propose to extend the GGTM to estimate feature saliency values (GGTMFS) as an integrated part of the parameter learning process, using an expectation-maximisation (EM) algorithm. The efficacy of the proposed GGTMFS model is demonstrated on both synthetic and real datasets.
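
The feature-saliency mechanism can be sketched compactly for a plain diagonal Gaussian mixture (after Law et al. 2004), leaving out the GGTM's latent space and mixed-type noise models: each feature is explained either by the component densities (with probability `rho[d]`, its saliency) or by a common background density, and EM updates the saliencies alongside the mixture parameters. This is our simplified illustration, not the GGTMFS algorithm itself.

```python
import numpy as np
from scipy.stats import norm

def em_feature_saliency(X, k=2, iters=50, seed=0):
    """Minimal EM for a diagonal Gaussian mixture with per-feature
    saliencies rho[d]; the background density is kept fixed for brevity."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)                    # mixing weights
    mu = X[rng.choice(n, k, replace=False)]     # component means (k, d)
    sd = np.ones((k, d))                        # component std devs
    bg_mu, bg_sd = X.mean(0), X.std(0)          # common "irrelevant" density
    rho = np.full(d, 0.5)                       # feature saliencies

    for _ in range(iters):
        comp = norm.pdf(X[:, None, :], mu[None], sd[None])  # (n, k, d)
        bg = norm.pdf(X, bg_mu, bg_sd)[:, None, :]          # (n, 1, d)
        mix = rho * comp + (1 - rho) * bg                   # (n, k, d)
        # E-step: component responsibilities, computed in the log domain
        log_r = np.log(pi)[None] + np.log(mix + 1e-300).sum(-1)
        log_r -= log_r.max(1, keepdims=True)
        r = np.exp(log_r); r /= r.sum(1, keepdims=True)     # (n, k)
        # posterior that feature d was generated by the component density
        u = r[:, :, None] * (rho * comp) / (mix + 1e-300)   # (n, k, d)
        # M-step
        pi = r.mean(0)
        mu = (u * X[:, None, :]).sum(0) / (u.sum(0) + 1e-12)
        var = (u * (X[:, None, :] - mu) ** 2).sum(0) / (u.sum(0) + 1e-12)
        sd = np.sqrt(var + 1e-6)
        rho = u.sum((0, 1)) / n                             # saliency update
    return rho, mu, pi
```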

    The effect of noise and sample size on the performance of an unsupervised feature relevance determination method for manifold learning

    The research on unsupervised feature selection is scarce in comparison to that for supervised models, despite the fact that it is an important issue for many clustering problems. An unsupervised feature selection method for general Finite Mixture Models was recently proposed and subsequently extended to Generative Topographic Mapping (GTM), a manifold learning constrained mixture model that provides data clustering and visualization. Some results of previous research on this unsupervised feature selection method for GTM suggested that its performance may be affected by insufficient sample size and by noisy data. In this thesis, we test such limitations of the method in detail and outline some techniques that could provide an at least partial solution to the negative effect of the presence of uninformative noise. In particular, we provide a detailed account of a variational Bayesian formulation of feature relevance determination for GTM.

    A machine learning approach based on generative topographic mapping for disruption prevention and avoidance at JET

    Predictive capabilities greater than 95%, with very limited false alarms, are demanding requirements for reliable disruption prediction systems in tokamaks such as JET or, in the near future, ITER. The prediction of an upcoming disruption must be provided sufficiently far in advance to apply effective disruption avoidance or mitigation actions that prevent the machine from being damaged. In this paper, following the typical machine learning workflow, a generative topographic mapping (GTM) of the operational space of JET has been built using a set of disrupted and regularly terminated discharges. In order to build the predictive model, a suitable set of dimensionless, machine-independent, physics-based features has been synthesized, making use of 1D plasma profile information rather than simple zero-D time series. The use of such predictive features, together with the power of the GTM in fitting the model to the data, yields, in an unsupervised way, a 2D map of the multi-dimensional parameter space of JET in which it is possible to identify a boundary separating the disruption-free region from the disruption region. In addition to supporting operational-boundary studies, the GTM map can also be used for disruption prediction, exploiting the potential of the developed GTM toolbox to monitor the discharge dynamics. Following the trajectory of a discharge on the map through the different regions, an alarm is triggered depending on the disruption risk of those regions. The proposed approach has been evaluated on a training set and an independent test set, achieving very good performance with only one late detection and a limited number of false detections. The warning times are suitable for avoidance purposes and, more importantly, the detections are consistent with the physical causes and mechanisms that destabilize the plasma and lead to disruptions.
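
The trajectory-monitoring idea can be made concrete with a toy sketch (ours, not the JET GTM toolbox): assume the trained map has been summarised as a per-cell disruption risk, and raise an alarm when the discharge trajectory dwells in high-risk cells. The grid size, threshold, and patience values below are illustrative assumptions.

```python
import numpy as np

# Hypothetical per-cell risk on a trained 20x20 map, e.g. the fraction of
# disrupted training discharges whose samples were assigned to each cell.
risk_map = np.random.default_rng(1).random((20, 20))

def monitor(trajectory, risk_map, threshold=0.7, patience=3):
    """Trigger an alarm once the trajectory spends `patience` consecutive
    time steps in cells whose disruption risk exceeds `threshold`."""
    consecutive = 0
    for t, (i, j) in enumerate(trajectory):
        consecutive = consecutive + 1 if risk_map[i, j] > threshold else 0
        if consecutive >= patience:
            return t          # alarm time index
    return None               # regular termination, no alarm

# one map cell per time step: the node with highest GTM responsibility
trajectory = [(2, 3), (2, 4), (5, 9), (12, 15), (13, 15), (13, 16)]
print(monitor(trajectory, risk_map))
```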

    Preliminary theoretical results on a feature relevance determination method for Generative Topographic Mapping

    Feature selection (FS) has long been studied in classification and regression problems, following diverse approaches and resulting in a wide variety of methods, usually grouped as either filters or wrappers. In comparison, FS for unsupervised learning has received far less attention. For many real problems concerning unsupervised multivariate data clustering, FS becomes an issue of paramount importance, as results have to meet interpretability and actionability requirements. An FS method for Gaussian mixture models was recently defined in Law et al. (2004). Mixture models are well established as clustering methods, but their multivariate data visualization capabilities are limited. The Generative Topographic Mapping (Bishop et al. 1998a), a constrained mixture of distributions, was originally defined to overcome this limitation. In this brief report we provide the theoretical development of a feature relevance determination method for Generative Topographic Mapping, based on that defined in Law et al. (2004); with this method, the clustering results can be visualized on a low-dimensional latent space and interpreted in terms of a reduced subset of selected relevant features. [This document was revised on 8/11/2006.]
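
As background for the GTM side of this construction, here is a minimal sketch of the GTM E-step (after Bishop et al. 1998a): a regular grid of latent points is mapped through RBF basis functions into data space, and each data point receives posterior responsibilities over the grid nodes. The function names, the 2D grid, and the untrained weights are our illustrative choices.

```python
import numpy as np

def make_latent_grid(side):
    """A side x side grid of latent points in [-1, 1]^2."""
    g = np.linspace(-1, 1, side)
    return np.array([(a, b) for a in g for b in g])        # (K, 2)

def rbf_basis(Z, centres, width):
    """Gaussian RBF design matrix over the latent points."""
    d2 = ((Z[:, None, :] - centres[None]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))                  # (K, M)

def responsibilities(X, Phi, W, beta):
    """GTM E-step: posterior over grid nodes for each data point."""
    Y = Phi @ W                                 # nodes in data space (K, D)
    d2 = ((X[:, None, :] - Y[None]) ** 2).sum(-1)          # (N, K)
    logp = -0.5 * beta * d2
    logp -= logp.max(1, keepdims=True)                     # numerical stability
    R = np.exp(logp)
    return R / R.sum(1, keepdims=True)                     # (N, K)

Z = make_latent_grid(10)                        # 100 latent grid nodes
Phi = rbf_basis(Z, make_latent_grid(4), 0.3)    # 16 RBF centres
W = np.zeros((16, 5))                           # mapping weights (to be learnt by EM)
X = np.random.default_rng(0).normal(size=(200, 5))
print(responsibilities(X, Phi, W, beta=1.0).shape)   # (200, 100)
```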

    On Martian Surface Exploration: Development of Automated 3D Reconstruction and Super-Resolution Restoration Techniques for Mars Orbital Images

    Very high spatial resolution imaging and topographic (3D) data play an important role in modern Mars science research and engineering applications. This work describes a set of image processing and machine learning methods to produce the “best possible” high-resolution and high-quality 3D and imaging products from existing Mars orbital imaging datasets. The research is described in nine chapters, of which seven are based on separate published journal papers. These include: a) a hybrid photogrammetric processing chain that combines the advantages of different stereo matching algorithms to compute stereo disparity with optimal completeness, fine-scale detail, and minimised matching artefacts; b) image and 3D co-registration methods that correct a target image and/or 3D data to a reference image and/or 3D data, achieving robust cross-instrument multi-resolution 3D and image co-alignment; c) a deep learning network and processing chain that estimates pixel-scale surface topography from single-view imagery and outperforms traditional photogrammetric methods in terms of product quality and processing speed; d) a deep learning-based single-image super-resolution restoration (SRR) method to enhance the quality and effective resolution of Mars orbital imagery; e) a subpixel-scale 3D processing system using a combination of photogrammetric 3D reconstruction, SRR, and photoclinometric 3D refinement; and f) an optimised subpixel-scale 3D processing system that couples deep learning-based single-view SRR with deep learning-based 3D estimation to derive the best possible (in terms of visual quality, effective resolution, and accuracy) 3D products from present-epoch Mars orbital images. The resulting 3D imaging products are qualitatively and quantitatively evaluated, in comparison with products from the official NASA Planetary Data System (PDS) and/or ESA Planetary Science Archive (PSA) releases, and/or with products generated by different open-source systems. Examples of the scientific application of these novel 3D imaging products are discussed.
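
To give a flavour of item d), the snippet below sketches a deliberately tiny single-image SRR network in PyTorch, in the spirit of SRCNN-style models; it is our illustrative stand-in for this class of method, not the thesis's architecture.

```python
import torch
import torch.nn as nn

class TinySRNet(nn.Module):
    """Bicubic upsampling plus a small convolutional residual correction;
    a toy analogue of learned single-image super-resolution restoration."""
    def __init__(self, scale=2):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=scale, mode="bicubic")
        self.residual = nn.Sequential(
            nn.Conv2d(1, 64, 9, padding=4), nn.ReLU(),
            nn.Conv2d(64, 32, 5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 1, 5, padding=2),
        )

    def forward(self, x):
        up = self.upsample(x)
        return up + self.residual(up)   # learn only the high-frequency detail

lr_patch = torch.rand(1, 1, 64, 64)     # a low-resolution image patch
print(TinySRNet()(lr_patch).shape)      # torch.Size([1, 1, 128, 128])
```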

    A novel ensemble Beta-scale invariant map algorithm

    This research presents a novel topology preserving map (TPM) called Weighted Voting Supervision Beta-Scale Invariant Map (WeVoS-Beta-SIM), based on the application of the Weighted Voting Supervision (WeVoS) meta-algorithm to a novel family of learning rules called Beta-Scale Invariant Map (Beta-SIM). The aim of the novel TPM is to improve the original models (SIM and Beta-SIM) in terms of stability and topology preservation while preserving their original features, especially for radial datasets, where they are all designed to perform at their best. These scale-invariant TPMs have yielded very satisfactory results in previous research. The improvement is achieved by generating accurate topology maps effectively and efficiently. The WeVoS meta-algorithm is based on training an ensemble of networks and combining them to obtain a single map that includes the best features of each network in the ensemble. WeVoS-Beta-SIM is thoroughly analyzed and successfully demonstrated in this study over 14 diverse real benchmark datasets with varying numbers of samples and features, using three well-known quality measures. To present a complete study of its capabilities, results are compared with other topology preserving models such as Self Organizing Maps, Scale Invariant Map, Maximum Likelihood Hebbian Learning-SIM, Visualization Induced SOM, Growing Neural Gas and Beta-Scale Invariant Map. The results confirm that the novel algorithm improves on the single Beta-SIM algorithm in terms of topology preservation and stability without losing performance (the Beta-SIM itself having been shown to outperform other well-known algorithms). This improvement becomes more remarkable as the complexity of the datasets increases, in terms of the number of features and samples, and especially for radial datasets, where the Topographic Error improves.
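
The weighted-voting fusion at the heart of WeVoS can be caricatured in a few lines: train several maps, score each unit of each map by some quality measure, and blend homologous units' codebook vectors according to those scores. The sketch below is our simplification, not the published meta-algorithm, and the quality scores are random placeholders.

```python
import numpy as np

def wevos_combine(codebooks, qualities):
    """Fuse an ensemble of trained maps into a single map.
    codebooks: (E, K, D) weight vectors of E maps with K units each;
    qualities: (E, K) per-unit quality scores, higher is better
    (e.g. inverse quantization error on the training data)."""
    votes = qualities / qualities.sum(axis=0, keepdims=True)  # normalise per unit
    return (votes[:, :, None] * codebooks).sum(axis=0)        # fused (K, D) map

rng = np.random.default_rng(0)
ensemble = rng.normal(size=(5, 16, 3))   # 5 maps, a 4x4 grid, 3-D data
quality = rng.random((5, 16)) + 0.1      # placeholder per-unit scores
print(wevos_combine(ensemble, quality).shape)   # (16, 3)
```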

    Manifold Learning Approaches to Compressing Latent Spaces of Unsupervised Feature Hierarchies

    Field robots encounter dynamic unstructured environments containing a vast array of unique objects. In order to make sense of the world in which they are placed, they collect large quantities of unlabelled data with a variety of sensors. Producing robust and reliable applications depends entirely on the ability of the robot to understand the unlabelled data it obtains. Deep Learning techniques have had a high level of success in learning powerful unsupervised representations for a variety of discriminative and generative models. Applying these techniques to problems encountered in field robotics remains a challenging endeavour. Modern Deep Learning methods are typically trained with a substantial labelled dataset, while datasets produced in a field robotics context contain limited labelled training data. The primary motivation for this thesis stems from the problem of applying large-scale Deep Learning models to field robotics datasets that are label poor. While the lack of labelled ground truth data drives the desire for unsupervised methods, the need to improve model scaling is driven by two factors: performance and computational requirements. When utilising unsupervised layer outputs as representations for classification, the classification performance increases with layer size. Scaling up models with multiple large layers of features is problematic, as the size of each subsequent hidden layer scales with the size of the previous layer. This quadratic scaling, and the associated time required to train such networks, has prevented the adoption of large Deep Learning models beyond cluster computing. The contributions in this thesis are developed from the observation that parameters or filter elements learnt in Deep Learning systems are typically highly structured and contain related elements. Firstly, the structure of unsupervised filters is utilised to construct a mapping from the high-dimensional filter space to a low-dimensional manifold. This creates a significantly smaller representation for subsequent feature learning. This mapping, and its effect on the resulting encodings, highlights the need for the ability to learn highly overcomplete sets of convolutional features. Driven by this need, the unsupervised pretraining of Deep Convolutional Networks is developed to include a number of modern training and regularisation methods. These pretrained models are then used to provide initialisations for supervised convolutional models trained on small quantities of labelled data. By utilising pretraining, a significant increase in classification performance on a number of publicly available datasets is achieved. In order to apply these techniques to outdoor 3D Laser Illuminated Detection And Ranging data, we develop a set of resampling techniques to provide uniform input to Deep Learning models. The features learnt in these systems outperform the high-effort, hand-engineered features developed specifically for 3D data. The representation of a given signal is then reinterpreted as a combination of modes that exist on the learnt low-dimensional filter manifold. From this, we develop an encoding technique that allows the high-dimensional layer output to be represented as a combination of low-dimensional components. This allows the growth of subsequent layers to depend only on the intrinsic dimensionality of the filter manifold and not on the number of elements contained in the previous layer.
    Finally, the resulting unsupervised convolutional model, the encoding frameworks and the embedding methodology are used to produce a new unsupervised learning strategy that is able to encode images in terms of overcomplete filter spaces without producing an explosion in the size of the intermediate parameter spaces. This model produces classification results on par with state-of-the-art models, yet requires significantly less computational resources and is suitable for use in the constrained computational environment of a field robot.
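
The filter-manifold compression argument can be illustrated with a linear stand-in: treat each learnt convolutional filter as a point in filter space, fit a low-dimensional basis (plain PCA below, in place of the thesis's manifold methods), and let subsequent layers consume the low-dimensional codes rather than the full filter bank. The sizes, the synthetic filters, and the intrinsic dimensionality here are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "learnt" filters that genuinely lie near a low-dimensional
# manifold: 512 filters of size 7x7, generated from 16 latent factors.
latent = rng.normal(size=(512, 16))
factors = rng.normal(size=(16, 7 * 7))
filters = latent @ factors + 0.05 * rng.normal(size=(512, 7 * 7))

# PCA via SVD on the centred filter bank
mean = filters.mean(0)
_, _, Vt = np.linalg.svd(filters - mean, full_matrices=False)
k = 16                                 # assumed intrinsic dimensionality
basis = Vt[:k]                         # (16, 49) low-dimensional filter basis
codes = (filters - mean) @ basis.T     # each filter as 16 coefficients

# Later layers can now scale with k = 16 components instead of 512 filters.
reconstructed = codes @ basis + mean
rel_err = np.linalg.norm(filters - reconstructed) / np.linalg.norm(filters)
print(codes.shape, round(float(rel_err), 3))   # small reconstruction error
```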

    Visualisation of bioinformatics datasets

    Analysing the molecular polymorphism and interactions of DNA, RNA and proteins is of fundamental importance in biology. Predicting the functions of polymorphic molecules is important in order to design more effective medicines. Analysing major histocompatibility complex (MHC) polymorphism is important for mate choice, epitope-based vaccine design, transplantation rejection, etc. Most existing exploratory approaches cannot analyse these datasets because of the large number of molecules and the high number of descriptors per molecule. This thesis develops novel methods for data projection in order to explore high-dimensional biological datasets by visualising them in a low-dimensional space. With increasing dimensionality, some existing data visualisation methods, such as generative topographic mapping (GTM), become computationally intractable. We propose variants of these methods, in which log-transformations are used at certain steps of the expectation-maximisation (EM) based parameter learning process, to make them tractable for high-dimensional datasets. We demonstrate these variants on both a synthetic dataset and an electrostatic potential dataset of MHC class-I. We also propose to extend a latent trait model (LTM), suitable for visualising high-dimensional discrete data, to simultaneously estimate feature saliency as an integrated part of the parameter learning process of the visualisation model. This LTM variant not only gives better visualisation, by modifying the projection map based on feature relevance, but also helps users to assess the significance of each feature. Another problem that has received little attention in the literature is the visualisation of mixed-type data. We propose to combine GTM and LTM in a principled way, using an appropriate noise model for each type of data, in order to visualise mixed-type data in a single plot; we call this model a generalised GTM (GGTM). We further extend the GGTM to estimate feature saliencies while training the visualisation model, giving GGTM with feature saliency (GGTM-FS). We evaluate visualisation quality using metrics such as a distance distortion measure and rank-based measures: trustworthiness, continuity, and mean relative rank errors with respect to data space and latent space. Where labels are known, we also use KL divergence and nearest-neighbour classification error to quantify the separation between classes. We demonstrate the efficacy of the proposed models on both synthetic and real biological datasets, with a main focus on the MHC class-I dataset.
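
The "log-transformations at certain steps" of EM mentioned above are, in essence, log-sum-exp computations: with thousands of descriptors per molecule, per-point likelihoods underflow double precision, so responsibilities must be formed entirely in the log domain. A minimal sketch, with our own function names:

```python
import numpy as np
from scipy.special import logsumexp

def log_responsibilities(log_prior, log_lik):
    """E-step responsibilities computed without ever leaving the log domain.
    log_prior: (K,) log mixture/node weights; log_lik: (N, K) per-point,
    per-component log-likelihoods (sums over thousands of features)."""
    log_joint = log_prior[None, :] + log_lik
    return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))

# exp(-50000) underflows to 0.0, yet the responsibilities stay well defined
log_lik = np.array([[-50000.0, -50010.0]])
print(log_responsibilities(np.log([0.5, 0.5]), log_lik))  # ~[[1.0, 4.5e-05]]
```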

    Optimization of deepwater channel seismic reservoir characterization using seismic attributes and machine learning

    Accurate subsurface reservoir mapping is essential for resource exploration. In uncalibrated basins, seismic data, often limited in resolution, frequency, and quality, become the primary information source because well logs and core data are unavailable. Seismic attributes, while integral for understanding subsurface structures, limit interpreters to visually working with only three of them at once. Conversely, machine learning models, though capable of handling numerous attributes, are often seen as inscrutable "black boxes", which complicates the interpretation of their predictions and uncertainties. To address these challenges, a comprehensive approach was undertaken, involving a detailed 3D model from Chilean Patagonia's Tres Pasos Formation with synthetic seismic data. The synthetic data served as a benchmark for conducting sensitivity analysis on seismic attributes, offering insights for parameter and workflow optimization. The study also evaluated the uncertainty in unsupervised and supervised machine learning for deepwater facies prediction through qualitative and quantitative assessments. Key findings of the study include: 1) High-frequency data and smaller analysis windows provide clearer channel images, while low-frequency data and larger windows create composite appearances, particularly in small stratigraphic features. 2) GTM and SOM exhibited similar performance, with error rates around 2% for the predominant facies but significantly higher rates for individual channel-related facies. This suggests that unbalanced data result in higher errors for minor facies and that reducing the number of clusters, or using a simplified model, may better represent reservoir versus non-reservoir facies. 3) Resolution and data distribution significantly impact predictability, leading to non-uniqueness in cluster generation; this applies to supervised models as well, strengthening the argument that understanding the limitations of seismic data is crucial. 4) Uncertainty in seismic facies prediction is influenced by factors such as training attribute selection and original facies proportions (e.g., imbalanced data, variable errors, and data quality). While optimized random forests achieved an 80% accuracy rate, validation accuracy was lower, emphasizing the need to address uncertainties and their role in interpretation. Overall, the use of ground-truth seismic data derived from outcrops offers valuable insight into the strengths and challenges of machine learning in subsurface applications, where accurate predictions are critical for decision-making and safety in the energy sector.
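
Finding 2)'s point about imbalanced facies can be reproduced in a few lines: on stand-in "attribute" vectors where one facies makes up 10% of the data, overall accuracy looks healthy while recall on the minor facies collapses unless the classifier is reweighted. The data, proportions, and `class_weight` choice below are our assumptions, not the study's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for seismic attribute vectors (e.g. amplitude, coherence,
# curvature): 90% background facies, 10% channel-related facies.
n = 5000
X = rng.normal(size=(n, 8))
y = (rng.random(n) < 0.1).astype(int)
X[y == 1, :2] += 1.5          # give the minor facies only a weak signature

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" counters the imbalance; per-facies recall,
# not overall accuracy, is the number to watch.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                            random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, rf.predict(X_te)))
```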