87 research outputs found

    Un modèle de mélange pour la classification croisée d'un tableau de données continue (A mixture model for co-clustering a continuous data table)

    National audience. Unlike standard clustering methods, co-clustering methods treat the rows and the columns of a data table simultaneously, seeking homogeneous blocks. In this article, we address co-clustering when the data table describes a set of individuals by quantitative variables. To this end, we propose a mixture model adapted to co-clustering, which leads to original criteria able to handle more complex situations than the criteria usually used in this context. The parameters are then estimated by a generalized EM (GEM) algorithm that maximizes the likelihood of the observed data. We also propose a new expression of the Bayesian information criterion, called BIC_B, adapted to our setting for assessing the number of blocks. Numerical experiments on synthetic data evaluate the performance of GEM and BIC_B and demonstrate the value of this approach.

    Graph Cuts with Arbitrary Size Constraints Through Optimal Transport

    A common way of partitioning graphs is through minimum cuts. One drawback of classical minimum cut methods is that they tend to produce small groups, which is why more balanced variants such as normalized and ratio cuts have seen more success. However, we believe that with these variants the balance constraints can be too restrictive for some applications, such as clustering imbalanced datasets, while not being restrictive enough when searching for perfectly balanced partitions. Here, we propose a new graph cut algorithm for partitioning graphs under arbitrary size constraints. We formulate the graph cut problem as a regularized Gromov-Wasserstein problem. We then propose to solve it using an accelerated proximal gradient descent (GD) algorithm, which has global convergence guarantees, yields sparse solutions, and incurs only an additional ratio of $\mathcal{O}(\log(n))$ compared to the classical spectral clustering algorithm, yet was observed to be more efficient.

    More Discriminative Sentence Embeddings via Semantic Graph Smoothing

    This paper explores an empirical approach to learning more discriminative sentence representations in an unsupervised fashion. Leveraging semantic graph smoothing, we enhance sentence embeddings obtained from pretrained models to improve results on text clustering and classification tasks. Our method, validated on eight benchmarks, demonstrates consistent improvements, showcasing the potential of semantic graph smoothing for improving sentence embeddings in supervised and unsupervised document categorization tasks. Comment: Accepted in EACL 202

    Exploring Topic Variants Through an Hybrid Biclustering Approach

    In large text corpora, analytic journalists need to identify facts, verify them by locating corroborating documents, and survey all related viewpoints. This requires them to make sense of document relationships at two levels of granularity: high-level topics and low-level topic variants. We propose a visual analytics tool that allows analytic journalists to verify and refine hypotheses without having to read all documents. Our system relies on a hybrid biclustering approach. A new Topic Weighted Map visualization conveys all top-level topics, reflecting their importance and their relative similarity. Coordinated multiple views then allow users to drill down into topic variants through an interactive term hierarchy visualization. Hence, the analyst can select, compare and filter out the subtle co-occurrences of terms shared by multiple documents to find interesting facts or stories. The usefulness of the tool is shown through a usage scenario and further assessed through a qualitative evaluation by an expert user.
    Translated from the French abstract: In large text corpora, analytic journalists look for documents and stories that corroborate facts, examining them from every angle. We present a visual analytics tool that lets them verify, refine and generate hypotheses without having to read all of the content. Our system relies on a hybrid biclustering approach. High-level topics are presented through a weighted topic map that reflects both their importance and their relative similarity. For each topic, an interactive hierarchical view gives an overview of all its variants, so as to identify documents treated from the same angle or sharing common facts. Multiple coordinated views allow a finer analysis by filtering, selecting and comparing topic variants with respect to the most interesting term co-occurrence patterns. The usefulness of the tool is shown through a usage scenario and then evaluated qualitatively by an analytic journalist.

    Scalable Multi-view Clustering via Explicit Kernel Features Maps

    The growing awareness of multi-view learning as an important component of data science and machine learning is a consequence of the increasing prevalence of multiple views in real-world applications, especially in the context of networks. In this paper we introduce a new scalability framework for multi-view subspace clustering. An efficient optimization strategy is proposed, leveraging kernel feature maps to reduce the computational burden while maintaining good clustering performance. The scalability of the algorithm means that it can be applied to large-scale datasets, including those with millions of data points, using a standard machine, in a few minutes. We conduct extensive experiments on real-world benchmark networks of various sizes in order to evaluate the performance of our algorithm against state-of-the-art multi-view subspace clustering methods and attributed-network multi-view approaches.

    A survey on recent advances in named entity recognition

    Named Entity Recognition seeks to extract substrings within a text that name real-world objects and to determine their type (for example, whether they refer to persons or organizations). In this survey, we first present an overview of recent popular approaches, and we also look at graph- and transformer-based methods, including Large Language Models (LLMs), that have not had much coverage in other surveys. Second, we focus on methods designed for datasets with scarce annotations. Third, we evaluate the performance of the main NER implementations on a variety of datasets with differing characteristics (as regards their domain, their size, and their number of classes). We thus provide a deep comparison of algorithms that are never considered together. Our experiments shed some light on how the characteristics of datasets affect the behavior of the methods that we compare. Comment: 30 page

    Generalized topographic block model

    Co-clustering leads to parsimony in data visualisation, with a number of parameters dramatically reduced in comparison to the dimensions of the data sample. Herein, we propose a new generalized approach for nonlinear mapping by a re-parameterization of the latent block mixture model. The densities modeling the blocks belong to an exponential family, of which the Gaussian, Bernoulli and Poisson laws are particular cases. The inference of the parameters is derived from the block expectation–maximization algorithm with a Newton–Raphson procedure at the maximization step. Empirical experiments with textual data validate the interest of our generalized model.

    Block Mixture Model for the Biclustering of Microarray Data

    This publication is a representation of what appears in the IEEE Digital Libraries. International audience. An attractive way to bicluster genes and conditions is to adopt a Block Mixture Model (BMM). Approaches based on a BMM rely on a Block Expectation Maximization (BEM) algorithm and/or a Block Classification Expectation Maximization (BCEM) algorithm. The drawback of these approaches is the difficulty of choosing a good initialization strategy for the BEM and BCEM algorithms. This paper introduces existing biclustering approaches that adopt a BMM and proposes a new fuzzy biclustering approach. Our approach makes it possible to choose a good initialization strategy for the BEM and BCEM algorithms.

    La politique de confidentialité d'un site marchand en tant que moyen pour renforcer la confiance des consommateurs en ligne (A merchant site's privacy policy as a means of strengthening online consumer trust)

    Whether it is regarded as a medium or as a place of purchase, the internet continues to raise trust problems for online consumers with respect to commercial transactions. For this reason, winning the trust of online consumers is becoming essential for companies specializing in e-commerce. To try to build consumer trust in the internet, and in merchant sites in particular, many tools have been put in place, such as trust seals and privacy policies for the protection of personal data and respect for privacy. The aim of this article is to identify, through a qualitative study, the means by which a merchant site's privacy policy can affect consumer trust.