    Round table: "Why and how to teach massive data analysis (big data)"

    National audience. Recent years have seen great excitement around "big data". Big data raises new scientific challenges concerning data storage (volume), heterogeneity (variety), and real-time processing (velocity). These challenges fall within computer science, but also within statistics. In response, many institutions now offer modules, or even entire programs, dedicated to big data, as demand for specialists in this new field is very strong. The proposed round table will address the pedagogical issues raised by these new programs. Keywords: statistics education, big data.

    Statistical tests to compare motif count exceptionalities

    BACKGROUND: Finding over- or under-represented motifs in biological sequences is now a common task in genomics. Thanks to p-value calculation for motif counts, exceptional motifs are identified and represent candidate functional motifs. The present work addresses the related question of comparing the exceptionality of one motif in two different sequences. Simply comparing the motif count p-values in each sequence is not sufficient to decide whether the motif is significantly more exceptional in one sequence than in the other. A statistical test is required. RESULTS: We develop and analyze two statistical tests, an exact binomial test and an asymptotic likelihood ratio test, to decide whether the exceptionality of a given motif is equivalent or significantly different in two sequences of interest. For that purpose, motif occurrences are modeled by Poisson processes, with special care taken for overlapping motifs. Both tests can take the sequence compositions into account. As an illustration, we compare the octamer exceptionalities in the Escherichia coli K-12 backbone versus variable strain-specific loops. CONCLUSION: The exact binomial test is particularly well suited to small counts. For large counts, we advise using the likelihood ratio test, which is asymptotic but strongly correlated with the exact binomial test and very simple to use.
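    As a rough illustration of the kind of exact binomial comparison mentioned in this abstract, the sketch below uses the textbook conditional test for two Poisson counts: given the total count, the first count is binomial under the null hypothesis. This is not the authors' procedure. The function names, the expected counts `e1` and `e2` (standing in for what a background sequence model would predict), and the omission of any correction for overlapping motifs are all assumptions made for the example.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_binomial_pvalue(n1, n2, e1, e2):
    """Two-sided conditional test comparing two Poisson counts.

    n1, n2: observed motif counts in the two sequences.
    e1, e2: expected counts under a background model (hypothetical inputs).
    Null hypothesis: both counts deviate from expectation by the same factor,
    so that, given n1 + n2, n1 ~ Binomial(n1 + n2, e1 / (e1 + e2)).
    """
    n = n1 + n2
    p = e1 / (e1 + e2)
    observed = binomial_pmf(n1, n, p)
    # Sum the probabilities of all outcomes at most as likely as the observed
    # one; the small tolerance guards against floating-point ties.
    total = sum(
        pmf
        for k in range(n + 1)
        if (pmf := binomial_pmf(k, n, p)) <= observed * (1 + 1e-9)
    )
    return min(1.0, total)
```

    For example, `exact_binomial_pvalue(10, 0, 1.0, 1.0)` considers ten occurrences in one sequence and none in the other with equal expected counts; only the two extreme outcomes are as unlikely as the observed one, so the p-value is 2/1024.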

    From industry-wide parameters to aircraft-centric on-flight inference: improving aeronautics performance prediction with machine learning

    Aircraft performance models play a key role in airline operations, especially in planning a fuel-efficient flight. In practice, manufacturers provide guidelines which are slightly modified throughout the aircraft life cycle via the tuning of a single factor, enabling better fuel predictions. However, this approach has limitations; in particular, it does not reflect the evolution of each feature affecting aircraft performance. Our goal here is to overcome this limitation. The key contribution of the present article is to foster the use of machine learning to leverage the massive amounts of data continuously recorded during flights performed by an aircraft and to provide models reflecting its actual and individual performance. We illustrate our approach by focusing on the estimation of the drag and lift coefficients from recorded flight data. As these coefficients are not directly recorded, we resort to aerodynamic approximations. As a safety check, we provide bounds to assess the accuracy of both the aerodynamic approximation and the statistical performance of our approach. We provide numerical results on a collection of machine learning algorithms. We report excellent accuracy on real-life data and exhibit empirical evidence to support our modelling, consistent with aerodynamic principles. Comment: Published in Data-Centric Engineerin

    Predictive selection of a generative model using the AICp criterion

    International audience. Achieving good performance in discriminant analysis depends on the choice of model. The most widely used model selection criterion in this context is cross-validation. However, it requires substantial computation time and is subject to variability. This article introduces the notion of the predictive dimension of a generative model, which reflects the complexity of the generative model with respect to the prediction task. It allows us to construct an alternative model selection criterion, the AICp criterion. This criterion consists of the log-likelihood of the labels conditional on the covariates, penalized by the predictive dimension of the model. Unlike cross-validation, the AICp criterion is fast to compute. Moreover, experiments on real data demonstrate its usefulness.

    Simultaneous dimension reduction and multi-objective clustering using probabilistic factorial discriminant analysis

    International audience. In model-based clustering of quantitative data, it is often supposed that only one clustering variable explains the heterogeneity of all the other variables. However, when variables come from different sources, it is often unrealistic to suppose that the heterogeneity of the data can be explained by a single variable. Making such an assumption could lead to a large number of clusters that are difficult to interpret. A model-based multi-objective clustering is proposed; it assumes the existence of several latent clustering variables, each one explaining the heterogeneity of the data on some clustering projection. To estimate the parameters of the model, an EM algorithm is proposed; it mainly relies on a probabilistic reinterpretation of standard factorial discriminant analysis. The results are projections of the data onto principal clustering components, allowing a synthetic interpretation of the principal clusters present in the data. The behavior of the model is illustrated on simulated and real data.

    An end-to-end data-driven optimization framework for constrained trajectories

    Many real-world problems require optimizing trajectories under constraints. Classical approaches are often based on optimal control methods but require exact knowledge of the underlying dynamics and constraints, which can be challenging or even out of reach. In view of this, we leverage data-driven approaches to design a new end-to-end, dynamics-free framework for optimized and realistic trajectories. Trajectories are decomposed on a function basis, trading the initial infinite-dimensional problem on a multivariate functional space for a parameter optimization problem. A maximum a posteriori approach which incorporates information from data is then used to obtain a new penalized optimization problem. The penalty term narrows the search to a region centered on the data and includes estimated features of the problem. We apply our data-driven approach to two settings: aeronautics and sailing route optimization. The developed approach is implemented in the Python library PyRotor.
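    The basis decomposition and penalized problem described in this abstract can be pictured with a generic formulation. The following is only a sketch under stated assumptions, not the authors' exact model: the basis functions $\varphi_k$, the cost $F$, the Gaussian prior with mean $\bar{c}$ and covariance $\Sigma$ estimated from reference trajectories, and the weight $\kappa$ are all notation introduced here for illustration.

```latex
% Finite-dimensional representation of a trajectory y on a basis (assumed notation)
y(t) \approx \sum_{k=1}^{K} c_k \, \varphi_k(t), \qquad c = (c_1, \dots, c_K)^\top

% With a Gaussian prior c ~ N(\bar{c}, \Sigma) fitted to observed trajectories,
% a maximum a posteriori estimate leads to a penalized problem of the form
\min_{c \in \mathbb{R}^K} \; F(c) + \kappa \, (c - \bar{c})^\top \Sigma^{-1} (c - \bar{c})
```

    Under these assumptions, the quadratic penalty keeps the optimized coefficients close to those of the observed trajectories, which is one way to read the statement that the search is narrowed to a region centered on the data.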