870 research outputs found

    The effect of noise and sample size on an unsupervised feature selection method for manifold learning

    Research on unsupervised feature selection is scarce compared with that on supervised models, even though it is an important issue for many clustering problems. An unsupervised feature selection method for general Finite Mixture Models was recently proposed and subsequently extended to Generative Topographic Mapping (GTM), a constrained mixture model for manifold learning that provides data visualization. Some of the results of a previous partial assessment of this unsupervised feature selection method for GTM suggested that its performance may be affected by insufficient sample size and by noisy data. In this brief study, we test these limitations of the method in some detail. Postprint (published version)

    The effect of noise and sample size in the performance of an unsupervised feature relevance determination method for manifold learning

    Research on unsupervised feature selection is scarce compared with that on supervised models, even though it is an important issue for many clustering problems. An unsupervised feature selection method for general Finite Mixture Models was recently proposed and subsequently extended to Generative Topographic Mapping (GTM), a constrained mixture model for manifold learning that provides data clustering and visualization. Some of the results of previous research on this unsupervised feature selection method for GTM suggested that its performance may be affected by insufficient sample size and by noisy data. In this thesis, we test these limitations of the method in detail and outline some techniques that could provide at least a partial solution to the negative effect of uninformative noise. In particular, we provide a detailed account of a variational Bayesian formulation of feature relevance determination for GTM.

    Optimization of deepwater channel seismic reservoir characterization using seismic attributes and machine learning

    Accurate subsurface reservoir mapping is essential for resource exploration. In uncalibrated basins, seismic data, often limited in resolution, frequency, and quality, become the primary information source because well logs and core data are unavailable. Seismic attributes, while integral to understanding subsurface structures, visually limit interpreters to working with only three of them at once. Conversely, machine learning models, though capable of handling numerous attributes, are often seen as inscrutable "black boxes," which complicates the interpretation of their predictions and uncertainties. To address these challenges, a comprehensive approach was undertaken, involving a detailed 3D model from the Tres Pasos Formation in Chilean Patagonia with synthetic seismic data. The synthetic data served as a benchmark for sensitivity analysis on seismic attributes, offering insights for parameter and workflow optimization. The study also evaluated the uncertainty in unsupervised and supervised machine learning for deepwater facies prediction through qualitative and quantitative assessments. Key findings include: 1) High-frequency data and smaller analysis windows provide clearer channel images, while low-frequency data and larger windows create composite appearances, particularly in small stratigraphic features. 2) GTM and SOM exhibited similar performance, with error rates around 2% for predominant facies but significantly higher for individual channel-related facies. This suggests that unbalanced data results in higher errors for minor facies and that fewer clusters or a simplified model may better represent reservoir versus non-reservoir facies. 3) Resolution and data distribution significantly impact predictability, leading to non-uniqueness in cluster generation; this applies to supervised models as well and strengthens the argument that understanding the limitations of seismic data is crucial.
4) Uncertainty in seismic facies prediction is influenced by factors such as training attribute selection and original facies proportions (e.g., imbalanced data, variable errors, and data quality). While optimized random forests achieved an 80% accuracy rate, validation accuracy was lower, emphasizing the need to address uncertainties and their role in interpretation. Overall, the use of ground truth seismic data derived from outcrops offers valuable insights into the strengths and challenges of machine learning in subsurface applications, where accurate predictions are critical for decision-making and safety in the energy sector.

    On Martian Surface Exploration: Development of Automated 3D Reconstruction and Super-Resolution Restoration Techniques for Mars Orbital Images

    Very high spatial resolution imaging and topographic (3D) data play an important role in modern Mars science research and engineering applications. This work describes a set of image processing and machine learning methods to produce the “best possible” high-resolution and high-quality 3D and imaging products from existing Mars orbital imaging datasets. The research work is described in nine chapters of which seven are based on separate published journal papers. These include a) a hybrid photogrammetric processing chain that combines the advantages of different stereo matching algorithms to compute stereo disparity with optimal completeness, fine-scale details, and minimised matching artefacts; b) image and 3D co-registration methods that correct a target image and/or 3D data to a reference image and/or 3D data to achieve robust cross-instrument multi-resolution 3D and image co-alignment; c) a deep learning network and processing chain to estimate pixel-scale surface topography from single-view imagery that outperforms traditional photogrammetric methods in terms of product quality and processing speed; d) a deep learning-based single-image super-resolution restoration (SRR) method to enhance the quality and effective resolution of Mars orbital imagery; e) a subpixel-scale 3D processing system using a combination of photogrammetric 3D reconstruction, SRR, and photoclinometric 3D refinement; and f) an optimised subpixel-scale 3D processing system using coupled deep learning based single-view SRR and deep learning based 3D estimation to derive the best possible (in terms of visual quality, effective resolution, and accuracy) 3D products out of present epoch Mars orbital images. 
The resultant 3D imaging products from the new developments listed above are evaluated qualitatively and quantitatively in comparison with products from the official NASA Planetary Data System (PDS) and/or ESA Planetary Science Archive (PSA) releases, and/or with products generated with different open-source systems. Examples of the scientific application of these novel 3D imaging products are discussed.

    A machine learning approach based on generative topographic mapping for disruption prevention and avoidance at JET

    Predictive capabilities greater than 95% with very limited false alarms are demanding requirements for reliable disruption prediction systems in tokamaks such as JET or, in the near future, ITER. The prediction of an upcoming disruption must be provided sufficiently in advance to apply effective disruption avoidance or mitigation actions and prevent the machine from being damaged. In this paper, following the typical machine learning workflow, a generative topographic mapping (GTM) of the operational space of JET has been built using a set of disrupted and regularly terminated discharges. To build the predictive model, a suitable set of dimensionless, machine-independent, physics-based features has been synthesized, which makes use of 1D plasma profile information rather than simple 0D time series. The use of such predictive features, together with the power of the GTM in fitting the model to the data, yields, in an unsupervised way, a 2D map of the multi-dimensional parameter space of JET, where it is possible to identify a boundary separating the disruption-free region from the disruption region. In addition to helping in studies of operational boundaries, the GTM map can also be used for disruption prediction, exploiting the potential of the developed GTM toolbox to monitor the discharge dynamics. The trajectory of a discharge is followed on the map through the different regions, and an alarm is triggered depending on the disruption risk of those regions. The proposed approach to predicting disruptions has been evaluated on a training set and an independent test set and achieves very good performance, with only one tardy detection and a limited number of false detections. The warning times are suitable for avoidance purposes and, more importantly, the detections are consistent with the physical causes and mechanisms that destabilize the plasma and lead to disruptions. Peer reviewed

    A novel ensemble Beta-scale invariant map algorithm

    This research presents a novel topology preserving map (TPM) called Weighted Voting Supervision-Beta-Scale Invariant Map (WeVoS-Beta-SIM), based on the application of the Weighted Voting Supervision (WeVoS) meta-algorithm to a novel family of learning rules called Beta-Scale Invariant Map (Beta-SIM). The aim of the novel TPM is to improve on the original models (SIM and Beta-SIM) in terms of stability and topology preservation while preserving their original features, especially in the case of radial datasets, for which they are all designed to perform best. These scale-invariant TPMs have given very satisfactory results in previous research, generating accurate topology maps effectively and efficiently. The WeVoS meta-algorithm is based on training an ensemble of networks and combining them into a single network that includes the best features of each network in the ensemble. WeVoS-Beta-SIM is thoroughly analyzed and successfully demonstrated in this study on 14 diverse real benchmark datasets with varying numbers of samples and features, using three different well-known quality measures. To present a complete study of its capabilities, results are compared with other topology preserving models such as Self Organizing Maps, Scale Invariant Map, Maximum Likelihood Hebbian Learning-SIM, Visualization Induced SOM, Growing Neural Gas and Beta-Scale Invariant Map. The results confirm that the novel algorithm improves the quality of the single Beta-SIM algorithm in terms of topology preservation and stability without losing performance (where this algorithm has been shown to outperform other well-known algorithms). This improvement is more remarkable as the complexity of the datasets increases, in terms of number of features and samples, and especially in the case of radial datasets, where the Topographic Error improves.

    Visualisation of bioinformatics datasets

    Analysing the molecular polymorphism and interactions of DNA, RNA and proteins is of fundamental importance in biology. Predicting the functions of polymorphic molecules is important for designing more effective medicines. Analysing major histocompatibility complex (MHC) polymorphism is important for mate choice, epitope-based vaccine design, transplantation rejection and so on. Most existing exploratory approaches cannot analyse these datasets because of the large number of molecules and the high number of descriptors per molecule. This thesis develops novel methods for data projection in order to explore high-dimensional biological datasets by visualising them in a low-dimensional space. With increasing dimensionality, some existing data visualisation methods such as generative topographic mapping (GTM) become computationally intractable. We propose variants of these methods in which log-transformations are used at certain steps of the expectation maximisation (EM) based parameter learning process, making them tractable for high-dimensional datasets. We demonstrate these variants on both synthetic data and an electrostatic potential dataset of MHC class-I. We also propose to extend a latent trait model (LTM), suitable for visualising high-dimensional discrete data, to simultaneously estimate feature saliency as an integrated part of the parameter learning process of a visualisation model. This LTM variant not only gives better visualisation by modifying the projection map based on feature relevance, but also helps users to assess the significance of each feature. Another problem that is not much addressed in the literature is the visualisation of mixed-type data. We propose to combine GTM and LTM in a principled way, where appropriate noise models are used for each type of data, in order to visualise mixed-type data in a single plot. We call this model a generalised GTM (GGTM).
We also propose to extend the GGTM model to estimate feature saliencies while training a visualisation model; this is called GGTM with feature saliency (GGTM-FS). We demonstrate the effectiveness of these proposed models on both synthetic and real datasets. We evaluate visualisation quality using quality metrics such as the distance distortion measure and rank-based measures: trustworthiness, continuity, and mean relative rank errors with respect to data space and latent space. In cases where the labels are known, we also use the quality metrics of KL divergence and nearest neighbour classification error to determine the separation between classes. We demonstrate the efficacy of these proposed models on both synthetic and real biological datasets, with a main focus on the MHC class-I dataset.
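
The abstract above does not say which EM steps use log-transformations, so the following is only a minimal sketch of one common choice: computing the E-step responsibilities of a GTM-like mixture in the log domain with the log-sum-exp trick. The function name `log_responsibilities`, the toy data, and the assumption of equally weighted isotropic Gaussian components are illustrative and not taken from the thesis.

```python
import numpy as np

def log_responsibilities(X, centers, beta):
    """Log posterior responsibilities for K equally weighted isotropic Gaussians.

    X: (N, D) data, centers: (K, D) component means, beta: inverse variance.
    Working in the log domain avoids the underflow that exp(-beta/2 * dist^2)
    suffers when D is large.
    """
    n, d = X.shape
    # Squared Euclidean distance between every point and every centre: (N, K).
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Log density of each point under each component.
    log_p = 0.5 * d * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * sq_dist
    # Log-sum-exp over components: subtract the row maximum before exponentiating.
    row_max = log_p.max(axis=1, keepdims=True)
    log_norm = row_max + np.log(np.exp(log_p - row_max).sum(axis=1, keepdims=True))
    return log_p - log_norm

# Toy high-dimensional example (hypothetical data, D = 2000).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2000))
centers = rng.normal(size=(3, 2000))
R = np.exp(log_responsibilities(X, centers, beta=1.0))
```

With 2000 dimensions, the raw densities underflow to zero in double precision, so a direct normalisation would divide zero by zero; the log-domain version still returns finite responsibilities whose rows sum to one.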

    Towards music perception by redundancy reduction and unsupervised learning in probabilistic models

    PhD. The study of music perception lies at the intersection of several disciplines: perceptual psychology and cognitive science, musicology, psychoacoustics, and acoustical signal processing amongst others. Developments in perceptual theory over the last fifty years have emphasised an approach based on Shannon’s information theory and its basis in probabilistic systems, and in particular, the idea that perceptual systems in animals develop through a process of unsupervised learning in response to natural sensory stimulation, whereby the emerging computational structures are well adapted to the statistical structure of natural scenes. In turn, these ideas are being applied to problems in music perception. This thesis is an investigation of the principle of redundancy reduction through unsupervised learning, as applied to representations of sound and music. In the first part, previous work is reviewed, drawing on literature from some of the fields mentioned above, and an argument presented in support of the idea that perception in general and music perception in particular can indeed be accommodated within a framework of unsupervised learning in probabilistic models. In the second part, two related methods are applied to two different low-level representations. Firstly, linear redundancy reduction (Independent Component Analysis) is applied to acoustic waveforms of speech and music. Secondly, the related method of sparse coding is applied to a spectral representation of polyphonic music, which proves to be enough both to recognise that the individual notes are the important structural elements, and to recover a rough transcription of the music. Finally, the concepts of distance and similarity are considered, drawing in ideas about noise, phase invariance, and topological maps.
Some ecologically and information-theoretically motivated distance measures are suggested and put into practice in a novel method, using multidimensional scaling (MDS), for geometrically visualising the dependency structure in a distributed representation. Engineering and Physical Sciences Research Council

    A computational intelligence analysis of G protein-coupled receptor sequences for pharmacoproteomic applications

    Arguably, drug research has contributed more to the progress of medicine during the past decades than any other scientific factor. One of the main areas of drug research is related to the analysis of proteins. The world of pharmacology is becoming increasingly dependent on advances in the fields of genomics and proteomics. This dependency brings about the challenge of finding robust methods to analyze the complex data they generate. This challenge invites us to go one step further than traditional statistics and resort to approaches under the conceptual umbrella of artificial intelligence, including machine learning (ML), statistical pattern recognition and soft computing methods. Sound statistical principles are essential to trust the evidence base built through the use of such approaches. Statistical ML methods are thus at the core of the current thesis. More than 50% of drugs currently available target only four key protein families, of which almost 30% correspond to the G Protein-Coupled Receptors (GPCR) superfamily. This superfamily regulates the function of most cells in living organisms and is at the centre of the investigations reported in the current thesis. Not much is known about the 3D structure of these proteins. Fortunately, plenty of information regarding their amino acid sequences is readily available. The automatic grouping and classification of GPCRs into families, and of these into subtypes, based on sequence analysis may contribute significantly to ascertaining the pharmaceutically relevant properties of this protein superfamily. There is no biologically relevant manner of representing the symbolic sequences describing proteins using real-valued vectors. This does not preclude the possibility of analyzing them using principled methods. These may come, amongst others, from the field of statistical ML. In particular, kernel methods can be used for this purpose.
Moreover, the visualization of high-dimensional protein sequence data can be a key exploratory tool for finding meaningful information that might be obscured by their intrinsic complexity. That is why the objective of the research described in this thesis is twofold: first, the design of adequate visualization-oriented artificial intelligence-based methods for the analysis of GPCR sequential data, and second, the application of the developed methods in relevant pharmacoproteomic problems such as GPCR subtyping and protein alignment-free analysis.

    A variational Bayesian formulation for GTM: Theoretical foundations

    Generative Topographic Mapping (GTM) is a non-linear latent variable model of the manifold learning family that provides simultaneous visualization and clustering of high-dimensional data. It was originally formulated as a constrained mixture of Gaussian distributions, for which the adaptive parameters were determined by Maximum Likelihood (ML), using the Expectation-Maximization (EM) algorithm. In this paper, we define an alternative variational formulation of GTM that provides a full Bayesian treatment to a Gaussian Process (GP)-based variation of the model. Postprint (published version)
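
For orientation, the original maximum-likelihood GTM that this abstract contrasts with is usually written as the constrained Gaussian mixture below. The notation is assumed here rather than taken from the paper: $K$ latent grid points $\mathbf{u}_k$, basis functions $\boldsymbol{\phi}$, weight matrix $\mathbf{W}$, inverse noise variance $\beta$, and data dimension $D$.

```latex
p(\mathbf{x} \mid \mathbf{W}, \beta)
  = \frac{1}{K} \sum_{k=1}^{K}
    \left( \frac{\beta}{2\pi} \right)^{D/2}
    \exp\!\left( -\frac{\beta}{2}
      \left\lVert \mathbf{W}\boldsymbol{\phi}(\mathbf{u}_k) - \mathbf{x} \right\rVert^{2}
    \right)
```

The mixture centres are constrained to lie on the image of the latent grid, $\mathbf{W}\boldsymbol{\phi}(\mathbf{u}_k)$, which is what distinguishes GTM from an unconstrained Gaussian mixture; EM then fits $\mathbf{W}$ and $\beta$.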