2,917 research outputs found

    Soft topographic map for clustering and classification of bacteria

    This work presents a new method for clustering bacteria and building a topographic representation of their taxonomy. The method is based on the analysis of stable parts of the genome, the so-called “housekeeping genes”. It generates topographic maps of the bacterial taxonomy in which relations among different type strains can be visually inspected and verified. Two well-known DNA alignment algorithms are applied to the genomic sequences. The topographic maps are optimized to represent the similarity among the sequences according to their evolutionary distances. The experimental analysis is carried out on 147 type strains of the Gammaproteobacteria class by means of the 16S rRNA housekeeping gene. Complete sequences of the gene were retrieved from the NCBI public database. In the experimental tests, the maps show clusters of homologous type strains and reveal some singular cases potentially due to incorrect classification or erroneous annotations in the database.
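    The abstract does not say which evolutionary distance is computed from the aligned sequences. As a minimal sketch, assuming the Jukes-Cantor correction (a standard textbook choice, not necessarily the one used in the paper), the distance between two already-aligned sequences could be computed as follows. The function names and the toy sequences are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return mismatches / compared


def jukes_cantor(seq_a, seq_b):
    """Jukes-Cantor distance: d = -3/4 * ln(1 - 4p/3).

    Assumption: this model stands in for the unspecified
    evolutionary distance mentioned in the abstract.
    """
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        # The correction is undefined at saturation.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy aligned fragments (hypothetical, not real 16S rRNA data).
d = jukes_cantor("ACGTACGT", "ACGTACGA")
```

    A matrix of such pairwise distances is the kind of input a topographic map could then be optimized to preserve.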

    Neural Networks for Complex Data

    Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real-world problems, ranging from time-evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris

    Improved data visualisation through multiple dissimilarity modelling

    Popular dimension reduction and visualisation algorithms, such as Metric Multidimensional Scaling, t-distributed Stochastic Neighbour Embedding and the Gaussian Process Latent Variable Model, rely on the assumption that input dissimilarities are typically Euclidean. It is well known that this assumption does not hold for most datasets and that high-dimensional data often sits on a manifold of unknown global geometry. We present a method for improving the manifold charting process, coupled with Elastic MDS, such that we no longer assume that the manifold is Euclidean, or of any particular structure. We draw on the benefits of different dissimilarity measures, allowing their relative responsibilities, under a linear combination, to drive the visualisation process.
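    The idea of combining several dissimilarity measures linearly can be sketched as below. This is an illustration only: the three measures, the rescaling step and the hand-picked weights are assumptions, whereas the abstract describes responsibilities that drive the visualisation rather than fixed values, and the Elastic MDS stage is not reproduced here.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_dissimilarity(a, b):
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pairwise(points, measure):
    """Full pairwise dissimilarity matrix under one measure."""
    n = len(points)
    return [[measure(points[i], points[j]) for j in range(n)]
            for i in range(n)]


def rescale(matrix):
    """Divide by the largest entry so measures share a common scale."""
    largest = max(max(row) for row in matrix)
    if largest == 0:
        return [list(row) for row in matrix]
    return [[value / largest for value in row] for row in matrix]


def combine(matrices, weights):
    """Convex combination of dissimilarity matrices.

    The weights play the role of the 'relative responsibilities';
    here they are fixed by hand (an assumption for illustration).
    """
    if len(matrices) != len(weights):
        raise ValueError("need one weight per matrix")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(matrices[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, matrices))
             for j in range(n)] for i in range(n)]


points = [(1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
measures = (euclidean, manhattan, cosine_dissimilarity)
matrices = [rescale(pairwise(points, m)) for m in measures]
combined = combine(matrices, [0.5, 0.3, 0.2])
```

    The combined matrix can be passed to any embedding method that accepts precomputed dissimilarities.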

    Patterns and drivers of plant diversity across Australia

    Biodiversity analyses across continental extents are important in providing comprehensive information on patterns and likely drivers of diversity. For vascular plants in Australia, community-level diversity analyses have been restricted by the lack of a consistent plot-based survey dataset across the continent. To overcome these challenges, we collated and harmonised plot-based vegetation survey data from the major data sources across Australia and used them as the basis for modelling species richness (α-diversity) and community compositional dissimilarity (β-diversity), standardised to 400 m², with the aim of mapping diversity patterns and identifying potential environmental drivers. The harmonised Australian vegetation plot (HAVPlot) dataset includes 219 552 plots, of which we used 115 083 to analyse plant diversity. Models of species richness and compositional dissimilarity both explained approximately one-third of the variation in plant diversity across Australia (D² = 33.0% and 32.7%, respectively). The strongest environmental predictors for both aspects of diversity were a combination of temperature and precipitation, with soil texture and topographic heterogeneity also important. The fine-resolution (≈ 90 m) spatial predictions of species richness and compositional dissimilarity identify areas expected to be of particular importance for plant diversity, including south-western Australia, rainforests of eastern Australia and the Australian Alps. Arid areas of central and western Australia are predicted to support assemblages that are less speciose or unique; however, these areas are most in need of additional survey data to fill the spatial, environmental and taxonomic gaps in the HAVPlot dataset. The harmonised data and model predictions presented here provide new insight into plant diversity patterns across Australia, enabling a wide variety of future research, such as exploring changes in species abundances, linking compositional patterns to functional traits or undertaking conservation assessments for selected components of the flora.

    Probabilistic topographic information visualisation

    The focus of this thesis is the extension of topographic visualisation mappings to allow for the incorporation of uncertainty. Few visualisation algorithms in the literature are capable of mapping uncertain data, and fewer still are able to represent observation uncertainties in visualisations. As such, modifications are made to NeuroScale, Locally Linear Embedding, Isomap and Laplacian Eigenmaps to incorporate uncertainty in the observation and visualisation spaces. The proposed mappings are called Normally-distributed NeuroScale (N-NS), T-distributed NeuroScale (T-NS), Probabilistic LLE (PLLE), Probabilistic Isomap (PIso) and Probabilistic Weighted Neighbourhood Mapping (PWNM). These algorithms generate a probabilistic visualisation space, with each latent visualised point transformed to a multivariate Gaussian or T-distribution using a feed-forward RBF network. Two types of uncertainty are then characterised, dependent on the data and the mapping procedure. Data-dependent uncertainty is the inherent observation uncertainty, whereas mapping uncertainty is defined by the Fisher Information of a visualised distribution. The latter indicates how well the data has been interpolated, offering a level of ‘surprise’ for each observation. These new probabilistic mappings are tested on three datasets of vectorial observations and three datasets of real-world time series observations for anomaly detection. In order to visualise the time series data, a method for analysing observed signals and noise distributions, Residual Modelling, is introduced. The performance of the new algorithms on the tested datasets is compared qualitatively with the latent space generated by the Gaussian Process Latent Variable Model (GPLVM). A quantitative comparison using existing evaluation measures from the literature allows the performance of each mapping function to be compared. Finally, the mapping uncertainty measure is combined with NeuroScale to build a deep learning classifier, the Cascading RBF. This new structure is tested on the MNIST dataset, achieving world record performance whilst avoiding the flaws seen in other Deep Learning Machines.

    Feed-forward neural networks and topographic mappings for exploratory data analysis

    A recent novel approach to the visualisation and analysis of datasets, and one which is particularly applicable to those of a high dimension, is discussed in the context of real applications. A feed-forward neural network is utilised to effect a topographic, structure-preserving, dimension-reducing transformation of the data, with an additional facility to incorporate different degrees of associated subjective information. The properties of this transformation are illustrated on synthetic and real datasets, including the 1992 UK Research Assessment Exercise for funding in higher education. The method is compared and contrasted with established techniques for feature extraction, and related to topographic mappings, the Sammon projection and the statistical field of multidimensional scaling.
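    For readers unfamiliar with the Sammon projection mentioned above, its quality criterion (the Sammon stress) can be sketched in a few lines. This only evaluates a given low-dimensional configuration; it does not implement the neural network mapping described in the abstract, and the example points are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sammon_stress(high, low):
    """Sammon stress between original points and their projection.

    E = (1 / sum_{i<j} D_ij) * sum_{i<j} (D_ij - d_ij)^2 / D_ij,
    where D_ij are distances in the original space and d_ij are
    distances in the projected space.
    """
    if len(high) != len(low):
        raise ValueError("both configurations need the same number of points")
    total = 0.0
    error = 0.0
    n = len(high)
    for i in range(n):
        for j in range(i + 1, n):
            d_high = euclidean(high[i], high[j])
            d_low = euclidean(low[i], low[j])
            if d_high == 0:
                raise ValueError("duplicate points in the original space")
            total += d_high
            error += (d_high - d_low) ** 2 / d_high
    return error / total


# Three points lying in a plane embedded in 3-D (hypothetical data).
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
faithful = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # keeps all distances
collapsed = [(0.0,), (1.0,), (0.0,)]              # merges two points

stress_faithful = sammon_stress(original, faithful)
stress_collapsed = sammon_stress(original, collapsed)
```

    A projection that preserves every pairwise distance has zero stress, while dividing by the original distance makes errors on small distances count more heavily.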

    Improved data visualisation through nonlinear dissimilarity modelling

    Inherent to state-of-the-art dimension reduction algorithms is the assumption that global distances between observations are Euclidean, despite the potential for altogether non-Euclidean data manifolds. We demonstrate that a non-Euclidean manifold chart can be approximated by implementing a universal approximator over a dictionary of dissimilarity measures, building on recent developments in the field. This approach is transferable across domains, such that observations can be, for instance, vectors, distributions, graphs or time series. Our novel dissimilarity learning method is illustrated with four standard visualisation datasets, showing its benefits over the linear dissimilarity learning approach.