284 research outputs found

    Mining a medieval social network by kernel SOM and related methods

    This paper briefly presents several ways to understand the organization of a large social network (several hundred people). We compare data mining approaches for clustering the vertices of a graph (spectral clustering, self-organizing algorithms, ...) and provide methods for representing the graph from these analyses. All these methods are illustrated on a medieval social network, and the way they can help to understand its organization is underlined.

    Discrimination de courbes par régression inverse fonctionnelle

    19 pages. National audience. Inverse regression methods such as SIR (Li, 1991) were developed in multivariate regression to avoid the well-known curse of dimensionality. They have recently been extended to functional data. Several approaches have been proposed, and we present here a review and comparison article addressing the case where the response variable is a vector of class-membership indicators. We show that inverse regression then leads to a discrimination method whose relevance is established on real and simulated data.

    Clustering a medieval social network by SOM using a kernel based distance measure

    6 pages. International audience. In order to explore the social organization of a medieval peasant community before the Hundred Years' War, we propose the use of an adaptation of the well-known Kohonen Self-Organizing Map to dissimilarity data. In this paper, the algorithm is used with a distance based on a kernel which allows the choice of a smoothing parameter to control the importance of local or global proximities.

    Storms prediction : Logistic regression vs random forest for unbalanced data

    The aim of this study is to compare two supervised classification methods on a crucial meteorological problem. The data consist of satellite measurements of cloud systems, which are to be classified as either convective or non-convective. Convective cloud systems correspond to lightning, and detecting such systems is of prime importance for thunderstorm monitoring and warning. Because the problem is highly unbalanced, we consider specific performance criteria and different strategies. This case study can be used in an advanced data mining course to illustrate the use of logistic regression and random forest on a real data set with unbalanced classes.

    Un résultat de consistance pour des SVM fonctionnels par interpolation spline

    This Note proposes a new methodology for function classification with Support Vector Machine (SVM). Rather than relying on projection on a truncated Hilbert basis as in our previous work, we use an implicit spline interpolation that allows us to compute SVM on the derivatives of the studied functions. To that end, we propose a kernel defined directly on the discretizations of the observed functions. We show that this method is universally consistent. Comment: 6 pages

    Neural Networks for Complex Data

    Artificial neural networks are simple and efficient machine learning tools. Defined originally in the traditional setting of simple vector data, neural network models have evolved to address more and more difficulties of complex real world problems, ranging from time evolving data to sophisticated data structures such as graphs and functions. This paper summarizes advances on those themes from the last decade, with a focus on results obtained by members of the SAMM team of Université Paris

    Random Forests for Big Data

    Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involves massive data, but it also often includes online data and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles, in a single and versatile framework, regression problems as well as two-class and multi-class classification problems. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities, such as out-of-bag error and variable importance, are addressed in these methods. Then, we formulate various remarks on random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets (15 and 120 million observations), one simulated and one from real-world data. One variant relies on subsampling, while three others are related to parallel implementations of random forests and involve either various adaptations of bootstrap to Big Data or "divide-and-conquer" approaches. The fifth variant relates to online learning of random forests. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.

    Analysis of the influence of a network on the values of its nodes : the use of spatial indexes.

    National audience. A growing amount of data is modeled by a graph that can sometimes be weighted: social networks, biological networks... In many situations, additional information is provided with these relational data, related to each node of the graph: this can be membership of a given social group (for social networks) or of a given protein family (for protein interaction networks). In this case, an important question is to understand whether the value of this additional variable is influenced by the network. This paper presents exploratory tools to address this question that are based on tests coming from the field of spatial statistics. The use of these tests is illustrated on several examples, all coming from the social network framework.

    Analyse de données pour des graphes étiquetés

    International audience. We propose a data mining method for a graph whose vertices are labeled. Two approaches are described and illustrated on a real data set: they provide a representation of the graph that combines information about its structure and about the values of its labels. This visualization can be used for interpretation purposes, to bring more nuanced information on the characterization of the graph's vertices.

    sexy-rgtk: a package for programming RGtk2 GUI in a user-friendly manner

    National audience. There are many different ways to program Graphical User Interfaces (GUI) in R. Lawrence and Verzani (2012) provide an overview of the available methods, describing ways to program R GUIs with RGtk2, qtbase and tcltk. More recently, the package shiny, for building interactive web applications, was also released (the first version was published in December 2012). By automatically indexing all objects and methods available in RGtk2, we developed a method for creating GTK2-based GUIs in a friendlier and more compact manner. Widgets are accessible with simple functions and options, as is more natural for an R programmer.