27 research outputs found

    Quasi-orthogonality and intrinsic dimensions as measures of learning and generalisation

    Full text link
    Finding the best architectures of learning machines, such as deep neural networks, is a well-known technical and theoretical challenge. Recent work by Mellor et al. (2021) showed that there may exist correlations between the accuracies of trained networks and the values of some easily computable measures defined on randomly initialised networks, which may enable searching tens of thousands of neural architectures without training. Mellor et al. used the Hamming distance evaluated over all ReLU neurons as such a measure. Motivated by these findings, we ask whether other, perhaps more principled, measures exist that could serve as determinants of the success of a given neural architecture. In particular, we examine whether the dimensionality and quasi-orthogonality of a neural network's feature space correlate with the network's performance after training. Using the same setup as Mellor et al., we show that dimensionality and quasi-orthogonality may jointly serve as discriminants of network performance. In addition to offering new opportunities to accelerate neural architecture search, our findings suggest important relationships between a network's final performance and properties of its randomly initialised feature space: data dimension and quasi-orthogonality.
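
    As an illustration only (the abstract does not specify the exact estimators), the sketch below shows one simple way to probe the two quantities on a randomly initialised ReLU network: quasi-orthogonality as the mean absolute cosine similarity between feature vectors of different inputs, and intrinsic dimension via a crude PCA variance threshold. The network, the data batch and both measures are assumptions for demonstration, not the measures used in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def random_relu_features(X, widths):
        """Propagate X through a randomly initialised ReLU MLP and
        return the activations of the last hidden layer."""
        H = X
        for w in widths:
            W = rng.normal(scale=1.0 / np.sqrt(H.shape[1]), size=(H.shape[1], w))
            H = np.maximum(H @ W, 0.0)          # ReLU
        return H

    def quasi_orthogonality(F):
        """Mean absolute off-diagonal cosine similarity between per-input
        feature vectors; values near 0 indicate a nearly orthogonal feature set."""
        G = F / (np.linalg.norm(F, axis=1, keepdims=True) + 1e-12)
        C = G @ G.T
        return np.abs(C[~np.eye(len(C), dtype=bool)]).mean()

    def pca_intrinsic_dimension(F, var_threshold=0.95):
        """Crude global ID proxy: number of principal components needed to
        explain `var_threshold` of the feature variance."""
        s = np.linalg.svd(F - F.mean(axis=0), compute_uv=False)
        ratios = np.cumsum(s**2) / np.sum(s**2)
        return int(np.searchsorted(ratios, var_threshold) + 1)

    X = rng.normal(size=(256, 32))               # stand-in for a data batch
    F = random_relu_features(X, widths=[128, 128, 64])
    print("quasi-orthogonality:", quasi_orthogonality(F))
    print("PCA-based ID:", pca_intrinsic_dimension(F))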

    Scikit-Dimension: A Python Package for Intrinsic Dimension Estimation

    No full text
    Dealing with uncertainty in applications of machine learning to real-life data critically depends on the knowledge of intrinsic dimensionality (ID). A number of methods have been suggested for estimating ID, but no standard package to easily apply them one by one or all at once had been implemented in Python. This technical note introduces scikit-dimension, an open-source Python package for intrinsic dimension estimation. The scikit-dimension package provides a uniform implementation of most of the known ID estimators, based on the scikit-learn application programming interface, to evaluate the global and local intrinsic dimension, as well as generators of synthetic toy and benchmark datasets widespread in the literature. The package is developed with tools for assessing code quality and coverage, unit testing and continuous integration. We briefly describe the package and demonstrate its use in a large-scale (more than 500 datasets) benchmarking of methods for ID estimation on real-life and synthetic data.
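
    A minimal usage sketch of the package (released as scikit-dimension and imported as skdim) is given below; the estimator classes and the fitted `dimension_` attribute follow the scikit-learn-style API described above, but exact names should be checked against the package documentation.

    import numpy as np
    import skdim  # pip install scikit-dimension

    rng = np.random.default_rng(0)
    # 1000 points lying on a 5-dimensional linear subspace embedded in 20 dimensions
    X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 20))

    # Global intrinsic dimension with two of the implemented estimators
    print("lPCA :", skdim.id.lPCA().fit(X).dimension_)
    print("TwoNN:", skdim.id.TwoNN().fit(X).dimension_)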

    Modelling the geometry of high-dimensional omics data: dimensionality, subspaces, trajectories

    No full text
    Modelling the geometrical structure of high-dimensional datasets is a fundamental problem for many fields of science that increasingly produce large-scale data with thousands of variables. While this data-rich era offers vast potential, it also poses a challenge for machine learning and requires new methodologies and software able to turn raw data with high noise, sparsity and dimensionality into insightful summaries and visualizations. Biology is a prime example of this challenge; recent breakthroughs in sequencing technology have allowed us to obtain, for the first time, large-scale data at the level of a single cell. Currently available datasets can describe millions of cells, offering an unprecedented wealth of information on individual cells' profiles (e.g., genetic, epigenetic, transcriptional) and their collective organization in biological tissues, such as tumors, organs or embryos. By analyzing individual cells in a tissue, we can identify sources of gene expression variability in human traits and diseases, create maps of different cell types or retrace the development of embryos into mature organisms. Improvements in single-cell data analysis pave the way to a better understanding of biology and to the development of new and more targeted treatments. This thesis contributed to the effort of developing new methods and tools to explore the geometry of high-dimensional datasets, with a focus on single-cell data analysis. It introduces methods and open-source software to better characterize the intrinsic dimension of datasets and to approximate the geometry of data point clouds using principal graphs.

    Local intrinsic dimensionality estimators based on concentration of measure

    No full text
    Intrinsic dimensionality (ID) is one of the most fundamental characteristics of multi-dimensional data point clouds. Knowing the ID is crucial for choosing the appropriate machine learning approach, as well as for understanding its behaviour and validating it. ID can be computed globally for the whole data distribution, or estimated locally at a point. In this paper, we introduce new local estimators of ID based on the linear separability of multi-dimensional data point clouds, which is one of the manifestations of concentration of measure. We empirically study the properties of these measures and compare them with other recently introduced ID estimators that exploit various other effects of measure concentration. The observed differences in the behaviour of the estimators can be used to anticipate their behaviour in practical applications.
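
    The separability-based estimators themselves are not detailed in this abstract; as a generic illustration of local (pointwise) ID estimation, the sketch below applies a simple PCA variance-threshold estimator inside k-nearest-neighbour neighbourhoods. The neighbourhood size and threshold are illustrative choices, and this is not the concentration-of-measure estimator proposed in the paper.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def local_pca_id(X, k=50, var_threshold=0.95):
        """Pointwise ID: for each point, run PCA on its k-NN neighbourhood and
        count the components explaining `var_threshold` of the local variance."""
        _, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
        ids = np.empty(len(X))
        for i, neigh in enumerate(idx):
            N = X[neigh] - X[neigh].mean(axis=0)
            s = np.linalg.svd(N, compute_uv=False)
            ratios = np.cumsum(s**2) / np.sum(s**2)
            ids[i] = np.searchsorted(ratios, var_threshold) + 1
        return ids

    rng = np.random.default_rng(0)
    # Noisy 2-D strip (locally two-dimensional manifold) embedded in 10 dimensions
    t = rng.uniform(0, 3 * np.pi, 2000)
    strip = np.c_[t * np.cos(t), rng.uniform(0, 5, 2000), t * np.sin(t)]
    X = np.c_[strip, 0.01 * rng.normal(size=(2000, 7))]
    print("median local ID:", np.median(local_pca_id(X)))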

    Lizard Brain: Tackling Locally Low-Dimensional Yet Globally Complex Organization of Multi-Dimensional Datasets

    No full text
    Machine learning deals with datasets characterized by high dimensionality. However, in many cases the intrinsic dimensionality of the datasets is surprisingly low. For example, the dimensionality of a robot's perception space can be large and multi-modal, but its variables can have more or less complex non-linear interdependencies. Thus, multi-dimensional data point clouds can be effectively located in the vicinity of principal varieties possessing locally small dimensionality but a globally complicated organization, which is sometimes difficult to represent with regular mathematical objects (such as manifolds). We review modern machine learning approaches for extracting low-dimensional geometries from multi-dimensional data and their applications in various scientific fields.

    Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets

    No full text
    Construction of graph-based approximations of multi-dimensional data point clouds is widely used in a variety of areas. Notable applications of such approximators include cellular trajectory inference in single-cell data analysis, analysis of clinical trajectories from synchronic datasets, and skeletonization of images. Several methods have been proposed to construct such approximating graphs, some based on computation of minimum spanning trees and some based on principal graphs generalizing principal curves. In this article we propose a methodology to compare and benchmark these two graph-based data approximation approaches, as well as to define their hyperparameters. The main idea is to avoid comparing graphs directly: first, a clustering of the data point cloud is induced from each graph approximation, and second, well-established methods are used to compare and score the data cloud partitionings induced by the graphs. In particular, mutual information-based approaches prove to be useful in this context. The induced clustering is based on decomposing a graph into non-branching segments and then clustering the data point cloud by the nearest segment. Such a method allows efficient comparison of graph-based data approximations of arbitrary topology and complexity. The method is implemented in Python using the standard scikit-learn library, which provides high speed and efficiency. As a demonstration of the methodology, we analyse and compare graph-based data approximation methods on synthetic as well as real-life single-cell datasets.
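
    A sketch of the comparison step described above is given below: each approximating graph (represented here, as an assumption, by an array of node coordinates and an edge list) is decomposed into non-branching segments, every data point is labelled by the segment of its nearest node, and the two induced partitions are scored with adjusted mutual information from scikit-learn. Helper names and the graph representation are illustrative, not the interface of the authors' implementation.

    import numpy as np
    import networkx as nx
    from scipy.spatial import cKDTree
    from sklearn.metrics import adjusted_mutual_info_score

    def segment_labels(node_pos, edges):
        """Assign a segment id to every graph node: non-branching paths form
        segments; each branching node (degree >= 3) gets its own id."""
        g = nx.Graph(edges)
        g.add_nodes_from(range(len(node_pos)))
        branch = {n for n in g if g.degree(n) >= 3}
        labels = np.full(len(node_pos), -1, dtype=int)
        next_id = 0
        for comp in nx.connected_components(g.subgraph(set(g) - branch)):
            labels[list(comp)] = next_id
            next_id += 1
        for b in branch:
            labels[b] = next_id
            next_id += 1
        return labels

    def induced_clustering(X, node_pos, edges):
        """Cluster data points by the segment of their nearest graph node."""
        seg = segment_labels(node_pos, edges)
        _, nearest = cKDTree(node_pos).query(X)
        return seg[nearest]

    # Hypothetical usage with two graph approximations of the same cloud X,
    # e.g. a minimum spanning tree and an elastic principal tree:
    #   labels_mst = induced_clustering(X, nodes_mst, edges_mst)
    #   labels_ept = induced_clustering(X, nodes_ept, edges_ept)
    #   print(adjusted_mutual_info_score(labels_mst, labels_ept))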

    Domain Adaptation Principal Component Analysis: Base Linear Method for Learning with Out-of-Distribution Data

    No full text
    Domain adaptation is a popular paradigm in modern machine learning which aims at tackling the problem of divergence (or shift) between the labeled training and validation datasets (source domain) and a potentially large unlabeled dataset (target domain). The task is to embed both datasets into a common space in which the source dataset is informative for training while the divergence between source and target is minimized. The most popular domain adaptation solutions are based on training neural networks that combine classification and adversarial learning modules, frequently making them both data-hungry and difficult to train. We present a method called Domain Adaptation Principal Component Analysis (DAPCA) that identifies a linear reduced data representation useful for solving the domain adaptation task. The DAPCA algorithm introduces positive and negative weights between pairs of data points and generalizes the supervised extension of principal component analysis. DAPCA is an iterative algorithm that solves a simple quadratic optimization problem at each iteration. The convergence of the algorithm is guaranteed, and the number of iterations is small in practice. We validate the suggested algorithm on previously proposed benchmarks for the domain adaptation task. We also show the benefit of using DAPCA in analyzing single-cell omics datasets in biomedical applications. Overall, DAPCA can serve as a practical preprocessing step in many machine learning applications, leading to reduced dataset representations that take into account possible divergence between source and target domains.
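
    The abstract describes the construction only at a high level; the sketch below shows the underlying linear-algebra step, a pairwise-weighted generalization of PCA. Given a signed weight W_ij for every pair of points, maximizing sum_ij W_ij * ||P x_i - P x_j||^2 over orthonormal projections P reduces to the top eigenvectors of X^T L X with L = diag(W 1) - W. The particular weights shown here, and the omitted iterative re-weighting between source and target domains, are assumptions for illustration rather than the DAPCA prescription.

    import numpy as np

    def weighted_pca(X, W, n_components=2):
        """Pairwise-weighted PCA: orthonormal projection maximising
        sum_ij W_ij * ||P x_i - P x_j||^2.  With W_ij = 1 for all pairs this
        reduces, up to scaling, to ordinary PCA on mean-centred data."""
        W = 0.5 * (W + W.T)                     # symmetrise the weights
        L = np.diag(W.sum(axis=1)) - W          # graph-Laplacian-like matrix
        eigval, eigvec = np.linalg.eigh(X.T @ L @ X)
        order = np.argsort(eigval)[::-1]        # largest eigenvalues first
        return eigvec[:, order[:n_components]]  # projection matrix, shape (d, k)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    X -= X.mean(axis=0)
    y = rng.integers(0, 2, size=200)

    # Illustrative supervised weights: push apart pairs with different labels,
    # pull together same-label pairs (signs and magnitudes are assumptions).
    W = np.where(y[:, None] != y[None, :], 1.0, -0.5)
    np.fill_diagonal(W, 0.0)

    P = weighted_pca(X, W, n_components=2)
    Z = X @ P                                   # reduced data representation
    print(Z.shape)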

    Trajectories, bifurcations and pseudotime in large clinical datasets: applications to myocardial infarction and diabetes data

    No full text
    Large observational clinical datasets are becoming increasingly available for mining associations between various disease traits and administered therapy. These datasets can be considered as representations of the landscape of all possible disease conditions, in which a concrete pathology develops through a number of stereotypical routes, characterized by 'points of no return' and 'final states' (such as lethal or recovery states). Extracting this information directly from the data remains challenging, especially in the case of synchronic (short-term follow-up) observations. Here we suggest a semi-supervised methodology for the analysis of large clinical datasets, characterized by mixed data types and missing values, through modelling the geometrical data structure as a bouquet of bifurcating clinical trajectories. The methodology is based on the application of elastic principal graphs, which can simultaneously address the tasks of dimensionality reduction, data visualization, clustering, feature selection and quantifying the geodesic distances (pseudotime) in partially ordered sequences of observations. The methodology allows positioning a patient on a particular clinical trajectory (pathological scenario) and characterizing the degree of progression along it, with a qualitative estimate of the uncertainty of the prognosis. Overall, our pseudotime quantification-based approach makes it possible to apply methods developed for dynamical disease phenotyping and illness trajectory analysis (diachronic data analysis) to synchronic observational data. We developed a tool, ClinTrajan, for clinical trajectory analysis, implemented in the Python programming language. We test the methodology on two large publicly available datasets: myocardial infarction complications and readmission of diabetic patients.
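
    The sketch below illustrates only the pseudotime step described above, for an already-fitted principal tree given as node coordinates plus an edge list (for example as produced by ClinTrajan or elpigraph; the tree construction and the choice of root node are assumed inputs). Each observation is projected onto its nearest node, and pseudotime is taken as the geodesic distance along the tree from the root.

    import numpy as np
    import networkx as nx
    from scipy.spatial import cKDTree

    def pseudotime(X, node_pos, edges, root=0):
        """Pseudotime of each data point: geodesic distance along the principal
        tree from `root` to the node nearest to the point."""
        g = nx.Graph()
        for a, b in edges:                      # edge weight = Euclidean length
            g.add_edge(a, b, weight=float(np.linalg.norm(node_pos[a] - node_pos[b])))
        dist_from_root = nx.single_source_dijkstra_path_length(g, root)
        _, nearest = cKDTree(node_pos).query(X) # project points to nearest node
        return np.array([dist_from_root[n] for n in nearest])

    # Toy example: a 4-node path graph standing in for a fitted principal tree
    node_pos = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
    edges = [(0, 1), (1, 2), (2, 3)]
    X = np.array([[0.1, 0.2], [1.9, -0.1], [2.8, 0.3]])
    print(pseudotime(X, node_pos, edges, root=0))  # increases along the path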