When computer science observes networks (Quand l'informatique observe les réseaux)
National audience. State of the art in two parts: 1) which disciplines are involved in the large-scale observation and exploitation of network traces (social networks, citation networks, networks of links between Web pages, ...); 2) what is in preparation in research laboratories
Slimming down a high-dimensional binary datatable: relevant eigen-subspace and substantial content
ISBN: 978-3-7908-2603-6. International audience. Determining the number of relevant dimensions in the eigen-space of a data matrix is a central issue in many data-mining applications. We tackle here the sub-problem of finding the "right" dimensionality of a type of data matrix often encountered in the domains of text or usage mining: large, sparse, high-dimensional binary datatables. We apply a randomization test to this problem. We validate our approach first on artificial datasets, then on a real documentary data collection, i.e. 1900 documents described in a dataspace of 3600 keywords, where the actual, intrinsic dimension appears to be 28 times smaller than the number of keywords, an important piece of information when preparing to cluster or discriminate such data. We also present preliminary results on the problem of clearing the datatable of non-essential information bits
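A minimal sketch of how a randomization test of this kind could be set up, in plain Python. The abstract does not spell out the null model or the test statistic, so everything here is an assumption: the table is shuffled with 2x2 "checkerboard" swaps that keep every row and column sum, eigenvalues of the Gram matrix are approximated by power iteration with deflation, and a rank is counted as relevant when its observed eigenvalue exceeds that of every shuffle. All function names are hypothetical.

```python
import random

def swap_randomize(table, n_swaps, rng):
    """Shuffle a 0/1 table with 2x2 checkerboard swaps (row and column sums unchanged)."""
    t = [row[:] for row in table]
    rows, cols = range(len(t)), range(len(t[0]))
    for _ in range(n_swaps):
        r1, r2 = rng.sample(rows, 2)
        c1, c2 = rng.sample(cols, 2)
        a, b = t[r1][c1], t[r1][c2]
        # Only a 10/01 or 01/10 pattern can be flipped without changing any margin.
        if a != b and t[r2][c1] == b and t[r2][c2] == a:
            t[r1][c1], t[r1][c2] = b, a
            t[r2][c1], t[r2][c2] = a, b
    return t

def gram(table):
    """Row-by-row Gram matrix X X^T (symmetric, positive semi-definite)."""
    return [[sum(x * y for x, y in zip(ri, rj)) for rj in table] for ri in table]

def top_eigenvalues(sym, k, rng, n_iter=200):
    """Approximate the k largest eigenvalues of a symmetric PSD matrix
    by power iteration with deflation (a rough estimate, not a full solver)."""
    n = len(sym)
    a = [row[:] for row in sym]
    values = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
        for _ in range(n_iter):
            w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sum(x * x for x in w) ** 0.5
            if norm < 1e-12:
                break  # the remaining spectrum is numerically zero
            v = [x / norm for x in w]
        w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(x * y for x, y in zip(v, w))  # Rayleigh quotient
        values.append(lam)
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                a[i][j] -= lam * v[i] * v[j]
    return values

def relevant_dimensions(table, k_max, n_shuffles=20, seed=0):
    """Count ranks whose eigenvalue beats every margin-preserving shuffle (assumed criterion)."""
    rng = random.Random(seed)
    n_swaps = 20 * len(table) * len(table[0])
    observed = top_eigenvalues(gram(table), k_max, rng)
    null = [top_eigenvalues(gram(swap_randomize(table, n_swaps, rng)), k_max, rng)
            for _ in range(n_shuffles)]
    k = 0
    for rank in range(1, k_max):  # rank 0 mostly reflects the margins, which shuffles keep
        if observed[rank] <= max(run[rank] for run in null):
            break
        k += 1
    return k

# Toy table: two disjoint blocks of five rows sharing five columns each.
blocks = [[1] * 5 + [0] * 5 for _ in range(5)] + [[0] * 5 + [1] * 5 for _ in range(5)]
k_star = relevant_dimensions(blocks, k_max=4)
```

On the toy table, the two blocks should account for a single dimension beyond the margins. On a table of realistic size, a sparse eigen-solver would replace the pure-Python power iteration.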
Visualizing texts and words: numerical approaches, graph-based approaches (Visualiser les textes et les mots : approches numériques, approches par les graphes)
State of the art: 1) going from a collection of texts to its vector representation; 2) techniques for visualizing a collection of vectors: by linear algebra, by graphs, and hybrid techniques
Relevant Eigen-Subspace of a Graph: A Randomization Test.
12 pages. International audience. Determining the number of relevant dimensions in the eigen-space of a graph Laplacian matrix is a central issue in many spectral graph-mining applications. We tackle here the sub-problem of finding the "right" dimensionality of Laplacian matrices, especially those often encountered in the domains of social or biological graphs: the ones underlying large, sparse, undirected and unweighted graphs with a power-law degree distribution. We apply a randomization test to this problem. We validate our approach first on an artificial sparse, power-law graph with two intermingled clusters, then on two real-world social graphs ("Football-league", "Mexican Politician Network"), where the actual, intrinsic dimensions appear to be 11 and 2 respectively; we illustrate the optimality of the transformed dataspaces both visually and numerically, by means of a decision tree
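For a graph, one plausible null model for such a test is degree-preserving rewiring: the Laplacian eigenvalues of the rewired graphs would then serve as the reference distribution. The abstract does not state which null model is used, so the double-edge swap below is an assumption, and the function names are hypothetical.

```python
import random

def degree_preserving_rewire(edges, n_swaps, rng):
    """Randomize an undirected simple graph by double-edge swaps; every node keeps its degree."""
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # try both orientations of the second edge
        # Proposed replacement: a-b, c-d  ->  a-d, c-b
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if a == d or c == b or new1 in edge_set or new2 in edge_set:
            continue  # would create a self-loop or a duplicate edge
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
    return edge_list

def degrees(edges):
    """Degree of each node, as a dict."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

# Toy graph: two 4-cliques joined by a single bridge (3-4).
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7), (3, 4)]
rewired = degree_preserving_rewire(edges, 200, random.Random(0))
```

The rewired graph has the same degree sequence as the original but, after enough swaps, none of its cluster structure is guaranteed to survive, which is what makes it usable as a baseline.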
Jean-Baptiste Estoup and the origins of Zipf's law: a stenographer with a scientific mind (1868-1950)
International audience. Power-law statistical distributions have been observed for over a century in many domains of the social sciences, as well as in the natural and life sciences. They are of the utmost importance for those building models applicable to human activities (e.g. the "long tail" phenomena). We present here the life and accomplishments of J-B. Estoup, who was the first to notice this type of distribution in the language domain, and who inspired the subsequent formulations by G.K. Zipf and B. Mandelbrot. This study, first presented at the seminar on the history of probabilities and statistics held at Ecole des Hautes Etudes en Sciences Sociales in Paris on December 7, 2007, is also a family testimony, the author being the grandson of J-B. Estoup
Representing interaction in multiway contingency tables: MIDOVA, CA and log-linear model
International audience. Besides CA and the log-linear model, which come from statistics, other research streams originating in Artificial Intelligence have addressed the problem of interacting variables: we present here the extension to categorical variables of our results on extracting and statistically validating "itemsets" in boolean datatables. We named our method MIDOVA (Multidimensional Interaction Differential of Variation); it highlights and represents complex links between qualitative variables, including interactions, and is well suited to socio-economic data. We compare it to the CA and log-linear model approaches, using the same 3-way example as Escofier and her colleagues. We show that our method is effective for general N-way interactions (N may be far greater than 3), whether symmetric or not, and yields both easy and detailed interpretability, as CA does, and statistical significance testing, as the log-linear model does in the case of few variables
Assessing livelihood and ecological benefits from restoration initiatives in the Philippines.
Prostitute Praising Represented by Male Novelists in Post-1998 Religious Society
Prostitute praising is represented by Remy Sylado in the novel Ca-Bau-Kan: Hanya Sebuah Dosa (1999) and by Arswendo Atmowiloto in the novel Dewi Kawi (2008). This research addresses prostitute praising in novels written by men in a religious society, amid the discourse on freedom of expression that spread in post-1998 Indonesia. It aims to identify (1) how prostitute praising is represented by male novelists and (2) why they produce such representations, applying Stuart Hall's representation theory on the production of meaning through language and of knowledge through discourse. The analysis reveals that male novelists represent prostitute praising in private and public domains that are mixed together, and that the relation between men and women in these domains sides with men, as constructed by a post-1998 discursive formation involving the state and religions to uphold masculine domination
A Proposition for Fixing the Dimensionality of a Laplacian Low-rank Approximation of any Binary Data-matrix
International audience. Laplacian low-rank approximations are widely used in graph spectral methods and Correspondence Analysis. We address here the problem of determining the dimensionality K* of the relevant eigenspace of a general binary datatable by a statistically well-founded method. We propose 1) a general framework covering graph adjacency matrices and any rectangular binary matrix, 2) a randomization test for fixing K*. We illustrate both with artificial and real data
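One standard way to treat a rectangular binary table and a graph adjacency matrix in the same framework is the scaling used in Correspondence Analysis, S = D_r^{-1/2} X D_c^{-1/2}, where D_r and D_c hold the row and column sums. The abstract gives no formulas, so taking this scaling as the one meant is an assumption. The sketch checks a known property of S: its leading singular value is 1, with singular vectors given by the square roots of the margins, which is why only the following K* dimensions carry structure.

```python
from math import sqrt, isclose

def ca_scale(table):
    """Scale a 0/1 table as x_ij / sqrt(r_i * c_j); assumes no empty row or column."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    scaled = [[x / sqrt(r * c) for x, c in zip(row, col_sums)]
              for row, r in zip(table, row_sums)]
    return scaled, row_sums, col_sums

table = [[1, 1, 0],
         [0, 1, 1],
         [1, 0, 1],
         [1, 1, 1]]
scaled, row_sums, col_sums = ca_scale(table)

# S applied to sqrt(column sums) gives sqrt(row sums): the trivial singular pair, value 1.
image = [sum(s * sqrt(c) for s, c in zip(row, col_sums)) for row in scaled]
trivial_pair_holds = all(isclose(v, sqrt(r)) for v, r in zip(image, row_sums))
```

For a symmetric adjacency matrix the same scaling gives the matrix whose difference from the identity is the normalized graph Laplacian.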
Intrinsic spaces of relations between words: a multi-scale exploration (Espaces intrinsèques des relations entre mots : une exploration multi-échelle)
International audience. Determining the co-occurrence links between the words of a set of texts requires choosing a span, that is, a segmentation into statistical units of varying size: from the simple N-gram (a sliding span of N words) up to the complete text, by way of the comma-delimited segment ("virgulot"), the sentence, the paragraph, etc. These links can give rise to various categorizations of the words, depending on the "focal length" used. Our study bears on a corpus of press articles (3 months of controversies over GMOs and endocrine disruptors), to which we apply 1) our Morph procedure for morpho-syntactic tagging, so as to disambiguate, tag and lemmatize the sequence of word forms as well as possible, 2) our link validation test, by multiple randomizations of the presence matrix of tagged lemmas in the textual units of the chosen level, 3) our procedure for determining the intrinsic dimension of this matrix, from which follows an estimate of the number of relevant clusters for each level of granularity of the analysis. Our results show that the largest levels detect the "stories" discussed in the corpus, while those of intermediate grain detect first the styles, then the collocations, which are fixed expressions to a greater or lesser degree. This approach 1) generalizes the unsupervised tagging of Schütze et al. (1995), based on word N-grams, and 2) determines the optimal representation space of the words and of the chosen text units, i.e. that of the first K* non-trivial factors of the correspondence analysis of the matrix (binary, so far), where K* is determined by a randomization test suited to any distribution of row and column totals
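The effect of the span on co-occurrence links can be illustrated with a small sketch. It assumes a pre-tokenized, lemmatized sequence (the Morph tagging step is not reproduced), and the helper names and the toy token list are hypothetical.

```python
from collections import Counter
from itertools import combinations

def sliding_spans(tokens, n):
    """All sliding windows of n consecutive tokens (the N-gram span)."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

def cooccurrence_counts(units):
    """Number of units in which each unordered pair of distinct words appears together."""
    counts = Counter()
    for unit in units:
        for pair in combinations(sorted(set(unit)), 2):
            counts[pair] += 1
    return counts

tokens = ["gmo", "maize", "ban", "gmo", "label"]

narrow = cooccurrence_counts(sliding_spans(tokens, 2))  # bigram span
wide = cooccurrence_counts(sliding_spans(tokens, 5))    # the whole text as one unit
```

With the bigram span, "label" and "maize" never co-occur, whereas the whole-text span links every pair of words: the chosen unit size changes which links exist before any validation test is applied.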