439 research outputs found

    Quand l'informatique observe les réseaux [When computer science observes networks]

    National audience. State of the art in two parts: 1) which disciplines are involved in the large-scale observation and exploitation of network traces (social networks, citation networks, networks of links between Web pages, ...); 2) what is being prepared in research laboratories.

    Slimming down a high-dimensional binary datatable: relevant eigen-subspace and substantial content

    ISBN: 978-3-7908-2603-6. International audience. Determining the number of relevant dimensions in the eigen-space of a data matrix is a central issue in many data-mining applications. We tackle here the sub-problem of finding the "right" dimensionality of a type of data matrix often encountered in text or usage mining: large, sparse, high-dimensional binary datatables. We apply a randomization test to this problem. We validate our approach first on artificial datasets, then on a real documentary data collection, i.e. 1900 documents described in a 3600-keyword dataspace, where the actual, intrinsic dimension appears to be 28 times smaller than the number of keywords, an important piece of information when preparing to cluster or discriminate such data. We also present preliminary results on the problem of clearing the datatable of non-essential information bits.
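
The randomization-test idea in this abstract can be illustrated with a minimal sketch. It is not the authors' exact procedure: it assumes a null model that shuffles the binary table while preserving its row and column sums ("checkerboard" swaps), a correspondence-analysis style normalization, and a simple rule that counts the leading non-trivial singular values exceeding the (1 - alpha) quantile of their randomized counterparts. The function names, `n_null`, `alpha`, and the choice of ten swap attempts per nonzero cell are all illustrative assumptions, and the table is assumed to have no empty row or column.

```python
import numpy as np


def swap_randomize(X, n_swaps, rng):
    """Shuffle a binary matrix with 2x2 checkerboard swaps.

    Row sums and column sums are preserved by construction.
    """
    X = X.copy()
    n_rows, n_cols = X.shape
    rows = rng.integers(0, n_rows, size=(n_swaps, 2))
    cols = rng.integers(0, n_cols, size=(n_swaps, 2))
    for (r1, r2), (c1, c2) in zip(rows, cols):
        # A swap is only valid on the pattern [[1, 0], [0, 1]].
        if X[r1, c1] and X[r2, c2] and not X[r1, c2] and not X[r2, c1]:
            X[r1, c1] = X[r2, c2] = 0
            X[r1, c2] = X[r2, c1] = 1
    return X


def ca_singular_values(X):
    """Non-trivial singular values of the margin-normalized table.

    Assumes no empty row or column (illustrative simplification).
    """
    X = np.asarray(X, dtype=float)
    r = X.sum(axis=1)
    c = X.sum(axis=0)
    S = X / np.sqrt(np.outer(r, c))
    # The largest singular value is always 1 (trivial factor): drop it.
    return np.linalg.svd(S, compute_uv=False)[1:]


def estimate_dimension(X, n_null=50, alpha=0.05, seed=0):
    """Count leading singular values that exceed their randomized quantile."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=int)
    observed = ca_singular_values(X)
    n_swaps = 10 * int(X.sum())  # assumed mixing budget, not a proven bound
    null = np.array([
        ca_singular_values(swap_randomize(X, n_swaps, rng))
        for _ in range(n_null)
    ])
    threshold = np.quantile(null, 1.0 - alpha, axis=0)
    k = 0
    for obs, thr in zip(observed, threshold):
        if obs <= thr:
            break
        k += 1
    return k
```

On a toy table made of two perfectly separated blocks of ones, this sketch should report a single relevant dimension, which matches the intuition that two clusters are separated by one axis. Real data would call for more randomizations and a check that the swap chain has mixed.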

    Visualiser les textes et les mots : approches numériques, approches par les graphes [Visualizing texts and words: numerical approaches, graph-based approaches]

    State of the art: 1) moving from a collection of texts to its vector representation; 2) techniques for visualizing a collection of vectors: by linear algebra, by graphs, and by hybrid techniques.

    Relevant Eigen-Subspace of a Graph: A Randomization Test.

    12 pages. International audience. Determining the number of relevant dimensions in the eigen-space of a graph Laplacian matrix is a central issue in many spectral graph-mining applications. We tackle here the sub-problem of finding the "right" dimensionality of Laplacian matrices, especially those often encountered in the domains of social or biological graphs: the ones underlying large, sparse, undirected and unweighted graphs with a power-law degree distribution. We apply a randomization test to this problem. We validate our approach first on an artificial sparse power-law graph with two intermingled clusters, then on two real-world social graphs ("Football-league", "Mexican Politician Network"), where the actual, intrinsic dimensions appear to be 11 and 2 respectively; we illustrate the optimality of the transformed dataspaces both visually and numerically, by means of a decision tree.

    Jean-Baptiste Estoup and the origins of Zipf's law: a stenographer with a scientific mind (1868-1950)

    International audience. Statistical distributions following a power law have been observed for over a century in many domains of the social sciences, as well as in the natural and life sciences. They are of utmost importance for those building models applicable to human activities (e.g. the "long tail" phenomena). We present here the life and accomplishments of J-B. Estoup, who was the first to notice this type of distribution in the language domain, and who inspired the subsequent formulations by G.K. Zipf and B. Mandelbrot. This study, first presented at the seminar on the history of probabilities and statistics held at Ecole des Hautes Etudes en Sciences Sociales in Paris on December 7, 2007, is also a family testimony, the author being the grandson of J-B. Estoup.

    Representing interaction in multiway contingency tables: MIDOVA, CA and log-linear model

    International audience. Besides CA and the log-linear model, which come from the statistics domain, other research streams originating in Artificial Intelligence have addressed the problem of interacting variables: we present here the extension to categorical variables of our results on extracting and statistically validating "itemsets" in boolean datatables. We named our method MIDOVA (Multidimensional Interaction Differential of Variation); it highlights and represents complex links between qualitative variables, including interaction, and is well-suited to socio-economic data. We compare it to the CA and log-linear model approaches, using the same 3-way example as Escofier and her colleagues. We show that our method is effective for general N-way interactions (N may be far greater than 3), whether symmetrical or not, and results both in easy and detailed interpretability, as CA does, and in statistical significance testing, as the log-linear model does in the case of few variables.

    Assessing livelihood and ecological benefits from restoration initiatives in the Philippines.

    Prostitute Praising Represented by Male Novelists in Post-1998 Religious Society

    Prostitute praising is represented by Remy Sylado in the novel Ca-Bau-Kan: Hanya Sebuah Dosa (1999) and by Arswendo Atmowiloto in the novel Dewi Kawi (2008). The problem addressed by this research is the praise of prostitutes in novels written by men in a religious society, amid the discourse on freedom of expression that spread in post-1998 Indonesia. The research aims to identify (1) how prostitute praising is represented by men in their novels, and (2) why male novelists produce such representations, by applying Stuart Hall's representation theory on the production of meaning through language and the production of knowledge through discourse. Applying the theory reveals that male novelists represent prostitute praising in private and public domains that are mixed up, and that the relation between men and women in these domains sides with men, as constructed by the post-1998 discursive formation involving the state and religions to uphold masculine domination.

    A Proposition for Fixing the Dimensionality of a Laplacian Low-rank Approximation of any Binary Data-matrix

    International audience. Laplacian low-rank approximations are much appreciated in the context of graph spectral methods and Correspondence Analysis. We address here the problem of determining the dimensionality K* of the relevant eigenspace of a general binary datatable by a statistically well-founded method. We propose 1) a general framework for graph adjacency matrices and any rectangular binary matrix, 2) a randomization test for fixing K*. We illustrate with both artificial and real data.

    Espaces intrinsèques des relations entre mots : une exploration multi-échelle. [Intrinsic spaces of the relations between words: a multi-scale exploration.]

    International audience. Determining the co-occurrence links between the words of a set of texts requires choosing a span, i.e. a segmentation into statistical units of varying size: from the simple N-gram (sliding span of N words) to the complete text, by way of the comma-delimited segment (virgulot), the sentence, the paragraph, etc. These links can give rise to various categorizations of the words, depending on the "focal length" used. Our study deals with a corpus of press articles (3 months of controversies on GMOs and endocrine disruptors), to which we apply 1) our Morph morpho-syntactic tagging procedure, so as to disambiguate, tag and lemmatize the sequence of forms present as well as possible, 2) our link validation test, by multiple randomizations of the presence matrix of the tagged lemmas in the textual units of the chosen level, 3) our procedure for determining the intrinsic dimension of this matrix, which yields an estimate of the number of relevant clusters for each level of granularity of the analysis. Our results show that the largest levels detect the "stories" discussed in the corpus, while those of intermediate grain detect first the styles, then the collocations, with a greater or lesser degree of fixedness. This approach 1) generalizes the unsupervised tagging approach of Schütze et al. (1995), based on word N-grams, 2) determines the optimal representation space for the words and the chosen text units, i.e. that of the first K* non-trivial factors of the correspondence analysis of the matrix (binary, so far), where K* is determined by a randomization test suited to any distribution of row and column totals.
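
The effect of the span on co-occurrence links can be shown with a small sketch. It covers only the sliding N-gram case and assumes the text is already tokenized (and, in the setting of the abstract, tagged and lemmatized); `presence_matrix` and `cooccurrence` are hypothetical helper names, not part of the Morph procedure.

```python
import numpy as np


def presence_matrix(tokens, span):
    """Binary unit-by-word matrix for sliding windows of `span` tokens.

    Each row is one window (statistical unit), each column one distinct token.
    """
    vocab = sorted(set(tokens))
    index = {word: j for j, word in enumerate(vocab)}
    n_units = max(len(tokens) - span + 1, 1)
    M = np.zeros((n_units, len(vocab)), dtype=int)
    for i in range(n_units):
        for word in tokens[i:i + span]:
            M[i, index[word]] = 1
    return M, vocab


def cooccurrence(M):
    """Number of units in which each pair of words appears together."""
    C = M.T @ M
    np.fill_diagonal(C, 0)  # ignore a word's co-occurrence with itself
    return C
```

With the toy sequence `"a b c a d".split()`, a span of 2 links only adjacent words, while a span of 5 (the whole sequence) links every pair, so the same corpus yields different link structures at different granularities. The binary matrix returned here is the kind of table to which a randomization test on row and column totals could then be applied.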