87 research outputs found
A mixture model for the co-clustering of a continuous data table
In contrast to standard clustering methods, co-clustering methods process the set of rows and the set of columns of a data table simultaneously, seeking homogeneous blocks. In this article, we address co-clustering when the data table describes a set of individuals by quantitative variables and, to this end, we propose a mixture model suited to co-clustering, leading to original criteria that can handle more complex situations than the criteria usually used in this context. The parameters are then estimated by a generalized EM algorithm (GEM) maximizing the likelihood of the observed data. We further propose a new expression of the Bayesian information criterion, called BIC_B, adapted to our setting for assessing the number of blocks. Numerical experiments on synthetic data assess the performance of GEM and BIC_B and demonstrate the value of this approach.
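As a point of reference for the BIC_B criterion mentioned above, here is a minimal sketch of the standard Bayesian information criterion that BIC_B adapts; the block-specific penalty of BIC_B is not given in the abstract, so only the generic form is shown:

```python
import numpy as np

# Generic BIC: -2 * log-likelihood + (number of free parameters) * log(sample size).
# Lower is better; the penalty term discourages overly complex models.
def bic(log_likelihood, n_params, n_samples):
    return -2.0 * log_likelihood + n_params * np.log(n_samples)

# Example: a fitted model with log-likelihood -120, 10 parameters, 100 samples.
print(bic(-120.0, 10, 100))
```

BIC_B replaces the naive parameter count and sample size with quantities suited to a block (row x column) partition, which is what makes it appropriate for choosing the number of blocks.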
Graph Cuts with Arbitrary Size Constraints Through Optimal Transport
A common way of partitioning graphs is through minimum cuts. One drawback of
classical minimum cut methods is that they tend to produce small groups, which
is why more balanced variants such as normalized and ratio cuts have seen more
success. However, we believe that with these variants, the balance constraints
can be too restrictive for some applications, such as clustering imbalanced
datasets, while not being restrictive enough when searching for perfectly
balanced partitions. Here, we propose a new graph cut algorithm for
partitioning graphs under arbitrary size constraints. We formulate the graph
cut problem as a regularized Gromov-Wasserstein problem, and we propose to
solve it using an accelerated proximal gradient descent algorithm that has
global convergence guarantees and yields sparse solutions. It only incurs an
additional cost ratio compared to the classical spectral clustering algorithm,
while being observed to be more efficient in practice.
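The spectral clustering baseline that this abstract compares against can be sketched in a few lines. This is a generic normalized-cut illustration on a toy graph (Fiedler-vector sign split), not the authors' Gromov-Wasserstein method:

```python
import numpy as np

# Toy graph: two 3-node cliques joined by one weak bridge edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
A[2, 3] = A[3, 2] = 0.1          # weak bridge between the two groups

d = A.sum(axis=1)
L = np.diag(d) - A                                # unnormalized graph Laplacian
Lsym = np.diag(d ** -0.5) @ L @ np.diag(d ** -0.5)  # normalized Laplacian

# Fiedler vector: eigenvector of the second-smallest eigenvalue.
w, v = np.linalg.eigh(Lsym)       # eigh returns eigenvalues in ascending order
fiedler = v[:, 1]
labels = (fiedler > 0).astype(int)  # sign split approximates the normalized cut
```

The sign split recovers the two cliques; methods like the one above replace this fixed balance behavior with explicitly chosen group sizes.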
More Discriminative Sentence Embeddings via Semantic Graph Smoothing
This paper explores an empirical approach to learning more discriminative
sentence representations in an unsupervised fashion. Leveraging semantic graph
smoothing, we enhance sentence embeddings obtained from pretrained models to
improve results on text clustering and classification tasks. Our method,
validated on eight benchmarks, demonstrates consistent improvements, showcasing
the potential of semantic graph smoothing for improving sentence embeddings in
supervised and unsupervised document categorization.
Comment: Accepted in EACL 202
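The abstract does not spell out the smoothing operator, so here is a generic sketch of graph-based embedding smoothing of the kind such methods build on: a k-NN similarity graph over the embeddings, followed by neighborhood averaging. The graph construction, `k`, `alpha`, and the number of propagation steps are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))              # toy sentence embeddings (8 sentences, dim 4)

# Build a symmetric k-NN graph from cosine similarities.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
S = Xn @ Xn.T
np.fill_diagonal(S, -np.inf)             # exclude self-similarity
k = 2
A = np.zeros_like(S)
for i in range(len(X)):
    A[i, np.argsort(S[i])[-k:]] = 1.0    # connect each node to its k nearest neighbors
A = np.maximum(A, A.T)                   # symmetrize

# Row-normalized propagation: each embedding mixes with its graph neighbors.
P = A / A.sum(axis=1, keepdims=True)
alpha, steps = 0.5, 2
Xs = X.copy()
for _ in range(steps):
    Xs = (1 - alpha) * Xs + alpha * (P @ Xs)
```

After smoothing, semantically related sentences have more similar vectors, which is what improves downstream clustering and classification.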
Exploring Topic Variants Through an Hybrid Biclustering Approach
In large text corpora, analytic journalists need to identify facts, verify them by locating corroborating documents and survey all related viewpoints. This requires them to make sense of document relationships at two levels of granularity: high-level topics and low-level topic variants. We propose a visual analytics software allowing analytic journalists to verify and refine hypotheses without having to read all documents. Our system relies on a hybrid biclustering approach. A new Topic Weighted Map visualization conveys all top-level topics reflecting their importance and their relative similarity. Then, coordinated multiple views allow to drill down into topic variants through an interactive term hierarchy visualization. Hence, the analyst can select, compare and filter out the subtle co-occurrences of terms shared by multiple documents to find interesting facts or stories. The usefulness of the tool is shown through a usage scenario and further assessed through a qualitative evaluation by an expert user.Dans des corpus textuels volumineux, les journalistes analytiques cherchent des documents et des rĂ©cits qui corroborent des faits, en les examinant sous tous les angles. Nous prĂ©sentons un outil de visualisation analytique leur permettant de vĂ©rifier, dâaffiner et de gĂ©nĂ©rer des hypothĂšses sans avoir Ă lire la totalitĂ© des contenus. Notre systĂšme repose sur une approche hybride de biclustering. Les sujets de haut niveau sont prĂ©sentĂ©s via une carte pondĂ©rĂ©e de sujets, reflĂ©tant Ă la fois leur importance et leur similaritĂ© relative. Pour chaque sujet, une vue hiĂ©rarchique et interactive dresse un aperçu de toutes ses variantes, de maniĂšre Ă identifier les documents traitĂ©s sous un mĂȘme angle ou partageant des faits communs. Des vues multiples et coordonnĂ©es permettent une analyse plus fine, en filtrant, sĂ©lectionnant et comparant les variantes de sujet, au regard des motifs de co-occurrence de termes les plus intĂ©ressants. 
LâutilitĂ© de lâoutil est montrĂ©e par un scĂ©nario dâusage, puis Ă©valuĂ©e qualitativement par un journaliste analytique
Scalable Multi-view Clustering via Explicit Kernel Features Maps
Multi-view learning has grown into an important component of data science and
machine learning, as a consequence of the increasing prevalence of multiple
views in real-world applications, especially in the context of networks. In
this paper we introduce a new scalability framework for multi-view
subspace clustering. An efficient optimization strategy is proposed, leveraging
kernel feature maps to reduce the computational burden while maintaining good
clustering performance. The scalability of the algorithm means that it can be
applied to large-scale datasets, including those with millions of data points,
using a standard machine, in a few minutes. We conduct extensive experiments on
real-world benchmark networks of various sizes in order to evaluate the
performance of our algorithm against state-of-the-art multi-view subspace
clustering methods and attributed-network multi-view approaches.
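The abstract does not specify which explicit kernel feature maps are used; random Fourier features are one standard explicit approximation of the RBF kernel and give the flavor of the approach: map each view explicitly, then work in the concatenated feature space at linear cost. All sizes and parameters here are illustrative:

```python
import numpy as np

def rff(X, n_features=64, gamma=1.0, seed=0):
    """Random Fourier features approximating the RBF kernel exp(-gamma*||x-y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Two "views" of the same 6 data points; map each view, then concatenate.
rng = np.random.default_rng(1)
view1 = rng.normal(size=(6, 3))
view2 = rng.normal(size=(6, 5))
Z = np.hstack([rff(view1, 32), rff(view2, 32)])  # joint explicit feature space

# Inner products in Z approximate a sum of per-view RBF kernels,
# without ever forming an n x n kernel matrix over the full dataset.
K_approx = Z @ Z.T
```

Because `Z` has a fixed, small width regardless of the number of points, any linear-space clustering routine applied to it scales to millions of samples, which is the essence of the claimed scalability.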
A survey on recent advances in named entity recognition
Named Entity Recognition seeks to extract substrings within a text that name
real-world objects and to determine their type (for example, whether they refer
to persons or organizations). In this survey, we first present an overview of
recent popular approaches, but we also look at graph- and transformer-based
methods including Large Language Models (LLMs) that have not had much coverage
in other surveys. Second, we focus on methods designed for datasets with scarce
annotations. Third, we evaluate the performance of the main NER implementations
on a variety of datasets with differing characteristics (as regards their
domain, their size, and their number of classes). We thus provide a deep
comparison of algorithms that are never considered together. Our experiments
shed some light on how the characteristics of datasets affect the behavior of
the methods that we compare.
Comment: 30 page
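The core extraction task the survey addresses, turning token-level tags into typed entity spans, can be illustrated with a small BIO decoder. This is a generic sketch, not any surveyed system's implementation:

```python
def bio_to_spans(tokens, tags):
    """Decode a BIO tag sequence into (text, type, start, end) entity spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):   # trailing "O" flushes the last entity
        ends = tag == "O" or tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != etype)
        if ends and start is not None:
            spans.append((" ".join(tokens[start:i]), etype, start, i))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]        # stray "I-" treated leniently as "B-"
    return spans

tokens = ["Marie", "Curie", "worked", "in", "Paris"]
tags   = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
# -> [('Marie Curie', 'PER', 0, 2), ('Paris', 'LOC', 4, 5)]
```

Every NER method compared in such a survey, whether CRF-, graph-, transformer-, or LLM-based, ultimately produces output reducible to spans of this form, which is what makes cross-method evaluation possible.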
Generalized topographic block model
Co-clustering leads to parsimony in data visualisation, with the number of parameters dramatically reduced in comparison to the dimensions of the data sample. Herein, we propose a new generalized approach for nonlinear mapping via a re-parameterization of the latent block mixture model. The densities modeling the blocks belong to an exponential family, of which the Gaussian, Bernoulli and Poisson laws are particular cases. Inference of the parameters is derived from the block expectation-maximization algorithm with a Newton-Raphson procedure at the maximization step. Empirical experiments with textual data demonstrate the value of our generalized model.
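As a toy illustration of the Newton-Raphson maximization used at the M-step, here is the update for a plain Poisson log-likelihood with log link; the actual block-model updates are more involved, and this sketch only shows the mechanics:

```python
import numpy as np

# M-step sketch: maximize the Poisson log-likelihood sum_i [x_i*eta - exp(eta)]
# over the natural parameter eta = log(lambda), via Newton-Raphson.
x = np.array([2.0, 4.0, 3.0, 5.0, 1.0])
eta = 0.0                                   # initial guess
for _ in range(20):
    grad = x.sum() - len(x) * np.exp(eta)   # first derivative w.r.t. eta
    hess = -len(x) * np.exp(eta)            # second derivative (always negative)
    eta -= grad / hess                      # Newton-Raphson update
lam = np.exp(eta)                           # the closed form would give x.mean()
```

For exponential-family block densities, the same pattern applies per block, with the gradient and Hessian taken over the cells assigned to that block.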
Block Mixture Model for the Biclustering of Microarray Data
This publication is a representation of what appears in the IEEE Digital Libraries. An attractive way to perform biclustering of genes and conditions is to adopt a Block Mixture Model (BMM). Approaches based on a BMM operate through a Block Expectation Maximization (BEM) algorithm and/or a Block Classification Expectation Maximization (BCEM) one. The drawback of these approaches is the difficulty of choosing a good initialization strategy for the BEM and BCEM algorithms. This paper reviews existing biclustering approaches that adopt a BMM and proposes a new fuzzy biclustering one. Our approach makes it possible to choose a good initialization strategy for the BEM and BCEM algorithms.
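A minimal sketch of the kind of fuzzy (soft) row memberships such an initialization produces, using a toy two-component Gaussian posterior on row means; the centers and unit variance are illustrative assumptions, not the paper's procedure:

```python
import numpy as np

# Toy data matrix: the first two rows lie near 1, the last two near 4.
rows = np.array([[1.0, 1.2], [0.9, 1.1], [4.0, 4.2], [3.8, 4.1]])
centers = np.array([1.0, 4.0])            # assumed component centers (illustrative)

m = rows.mean(axis=1, keepdims=True)      # summarize each row by its mean
logp = -0.5 * (m - centers) ** 2          # unnormalized unit-variance Gaussian log-densities
resp = np.exp(logp - logp.max(axis=1, keepdims=True))
resp /= resp.sum(axis=1, keepdims=True)   # fuzzy memberships: each row sums to 1
```

Unlike a hard assignment, these soft memberships give BEM/BCEM a smoother starting point and reduce sensitivity to a poor initial partition.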
A merchant website's privacy policy as a means of strengthening online consumer trust
Whether regarded as a medium or as a place of purchase, the internet keeps raising trust issues for online consumers with respect to commercial transactions. For this reason, earning the trust of online consumers has become essential for companies in the e-commerce sector. In an attempt to build consumer trust in the internet, and in merchant websites in particular, many tools have been put in place, such as trust labels, but also privacy policies for the protection of personal data and respect for privacy. The objective of this article is to identify, on the basis of a qualitative study, the means by which a merchant website's privacy policy can have an impact on consumer trust.
- …