79 research outputs found

    Apprentissage sur Données Massives; trois cas d'usage avec R, Python et Spark.

    Management and analysis of big data are systematically associated with a distributed data architecture in the Hadoop and now Spark frameworks. This article offers statisticians an introduction to these technologies by comparing the performance obtained through the direct use of three reference environments (R, Python Scikit-learn, Spark MLlib) on three public use cases: character recognition, film recommendation, and product categorization. The main result is that, while Spark is very efficient for data munging and for recommendation by collaborative filtering (non-negative factorization), current implementations of conventional learning methods (logistic regression, random forests) in MLlib or SparkML do not compete, or compete poorly, with the habitual use of these methods (R, Python Scikit-learn) in an integrated, that is non-distributed, architecture.

    Destination Prediction by Trajectory Distribution Based Model

    In this paper we propose a new method to predict the final destination of vehicle trips based on their initial partial trajectories. We first review how we obtain a clustering of trajectories that describes user behaviour. Then, we explain how we model the main traffic flow patterns by a mixture of 2D Gaussian distributions. This yields a density-based clustering of locations, which produces a data-driven grid of similar points within each pattern. We present how this model can be used to predict the final destination of a new trajectory from its first locations using a two-step procedure. First, we assign the new trajectory to the clusters it most likely belongs to. Second, we use characteristics of the trajectories inside these clusters to predict the final destination. Finally, we present experimental results of our methods for trajectory classification and final destination prediction on datasets of timestamped GPS locations of taxi trips. We test our methods on two different datasets to assess their capacity to adapt automatically to different subsets.

    Review & Perspective for Distance Based Trajectory Clustering

    In this paper we tackle the issue of clustering trajectories of geolocalized observations. Using clustering techniques based on the choice of a distance between the observations, we first provide a comprehensive review of the different distances used in the literature to compare trajectories. Then, based on the limitations of these methods, we introduce a new distance: the Symmetrized Segment-Path Distance (SSPD). We finally compare this new distance to the others according to the clustering results obtained with both hierarchical clustering and affinity propagation.

    Review & Perspective for Distance Based Clustering of Vehicle Trajectories

    In this paper we tackle the issue of clustering trajectories of geolocalized observations based on a distance between trajectories. We first provide a comprehensive review of the different distances used in the literature to compare trajectories. Then, based on the limitations of these methods, we introduce a new distance: the Symmetrized Segment-Path Distance (SSPD). We compare this new distance to the others according to the clustering results obtained with both hierarchical clustering and affinity propagation. We finally present a Python package, trajectory distance, which contains the methods for calculating the SSPD and the other distances reviewed in this paper.
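
    The abstract above names the SSPD but this listing does not define it. The following is a minimal sketch, assuming the commonly stated definition: the Segment-Path Distance from one trajectory to another is the mean, over the points of the first, of the distance to the nearest segment of the second, and the SSPD is the average of the two directions. It also assumes planar Euclidean coordinates (raw GPS latitude/longitude would first need a projection) and at least two points per trajectory. The function names are illustrative and are not the API of the package mentioned in the abstract.

```python
import numpy as np


def point_to_segment(p, a, b):
    """Euclidean distance from point p to the segment [a, b]."""
    ab = b - a
    denom = float(np.dot(ab, ab))
    if denom == 0.0:
        # Degenerate segment: a and b coincide.
        return float(np.linalg.norm(p - a))
    # Projection parameter of p on the line (a, b), clamped to the segment.
    t = min(1.0, max(0.0, float(np.dot(p - a, ab)) / denom))
    return float(np.linalg.norm(p - (a + t * ab)))


def point_to_trajectory(p, traj):
    """Distance from p to the closest segment of a polyline trajectory."""
    return min(
        point_to_segment(p, traj[i], traj[i + 1])
        for i in range(len(traj) - 1)
    )


def spd(t1, t2):
    """One-directional Segment-Path Distance from t1 to t2 (assumed definition)."""
    return float(np.mean([point_to_trajectory(p, t2) for p in t1]))


def sspd(t1, t2):
    """Symmetrized Segment-Path Distance: average of both directions."""
    t1 = np.asarray(t1, dtype=float)
    t2 = np.asarray(t2, dtype=float)
    return 0.5 * (spd(t1, t2) + spd(t2, t1))


if __name__ == "__main__":
    # Two parallel toy trajectories, one unit apart (hypothetical data).
    lower = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
    upper = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
    print(sspd(lower, upper))
```

    A pairwise matrix of such distances can then be passed to any clustering method that accepts precomputed distances, which is how hierarchical clustering and affinity propagation are typically applied to trajectories.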

    Genetic analysis of milking ability in Lacaune dairy ewes

    The milking ability of Lacaune ewes was characterised by derived traits of milk flow patterns, in an INRA experimental farm, from a divergent selection experiment in order to estimate the correlated effects of selection for protein and fat yields. The analysis of selected divergent line effects (involving 34 616 data and 1204 ewes) indicated an indirect improvement of milking traits (+17% for maximum milk flow and -10% for latency time) with a 25% increase in milk yield. Genetic parameters were estimated by multi-trait analysis with an animal model, on 751 primiparous ewes. The heritabilities of the traits expressed on an annual basis were high, especially for maximum flow (0.54) and for latency time (0.55). The heritabilities were intermediate for average flow (0.30), time at maximum flow (0.42) and phase of increasing flow (0.43), and low for the phase of decreasing flow (0.16) and the plateau of high flow (0.07). When considering test-day data, the heritabilities of maximum flow and latency time remained intermediate and stable throughout the lactation. Genetic correlations between milk yield and milking traits were all favourable, but latency time was less milk yield dependent (-0.22) than maximum flow (+0.46). It is concluded that the current dairy ewe selection based on milk solid yield is not antagonistic to milking ability

    Faire pâturer les chèvres : retour vers le futur

    National audienc

    Wikistat 2.0: Ressources pédagogiques pour l'Intelligence Artificielle

    Big data, data science, deep learning and artificial intelligence are the keywords of an intense hype tied to a rapidly evolving job market, which requires us to adapt the content of our university professional training programs. Which artificial intelligence is mostly concerned by the job offers? Which methodologies and technologies should be favored in the training programs? Which objectives, tools and educational resources do we need to put in place to meet these pressing needs? We answer these questions by describing the content and operational resources of the Data Science major of the Applied Mathematics specialty at INSA Toulouse. We focus on basic mathematics training (Optimization, Probability, Statistics), combined with the practical implementation of the best-performing statistical learning algorithms, using the most appropriate technologies and real examples. Given the high volatility of the technologies, it is imperative to train students in self-training, which will be their technology watch tool once they are in professional activity. This explains the structuring of the educational site github.com/wikistat into a set of tutorials. Finally, to motivate the thorough practice of these tutorials, a serious game is organized each year in the form of a prediction contest between Master's students in Applied Mathematics for AI.