3 research outputs found

Efficient supervised and semi-supervised approaches for affiliations disambiguation

Authors: Bonvallot Valérie; Cuxac Pascal; Lamirel Jean-Charles
Publication venue: Springer Verlag
Publication date: 23/10/2012

The disambiguation of named entities is a challenge in many fields, such as scientometrics, social networks, record linkage, citation analysis, and the semantic web. Name ambiguities can arise from misspellings, typographical or OCR mistakes, abbreviations, and omissions. Searching for the names of persons or organizations is therefore difficult whenever a single name can appear in many different forms. This paper proposes two approaches to disambiguating the affiliations of authors of scientific papers in bibliographic databases. The first assumes that a training dataset is available and uses a Naive Bayes model. The second assumes that no learning resource is available and uses a semi-supervised approach that mixes soft clustering and Bayesian learning. The results are encouraging, and the approach is already partially applied in a scientific survey department. However, our experiments also highlight a limitation: the approach cannot efficiently process highly unbalanced data. Alternative solutions are possible for future developments, particularly the use of a recent clustering algorithm relying on feature maximization.

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Hal-Diderot

New efficient clustering quality indexes

Authors: Cuxac Pascal; Dugué Nicolas; Lamirel Jean-Charles
Publication venue: HAL CCSD
Publication date: 24/07/2016

This paper deals with a major challenge in clustering: optimal model selection. It presents new efficient clustering quality indexes relying on feature maximization, which is an alternative to the usual distributional measures relying on entropy, the Chi-square metric, or vector-based measures such as Euclidean distance or correlation distance. First experiments compare the behavior of these new indexes with that of usual cluster quality indexes based on Euclidean distance, on different kinds of test datasets for which ground truth is available. This comparison clearly highlights the superior accuracy and stability of the new method on these datasets, its efficiency from low- to high-dimensional ranges, and its tolerance to noise. Further experiments are then conducted on "real-life" textual data extracted from a multisource bibliographic database for which ground truth is unknown. These experiments show that the accuracy and stability of the new indexes make it possible to deal efficiently with diachronic analysis, whereas other indexes do not meet the requirements of this task.

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Variations to incremental growing neural gas algorithm based on label maximization

Authors: Cuxac Pascal; Lamirel Jean-Charles; Mall Raghvendra; Safi Ghada
Publication venue: HAL CCSD
Publication date: 31/07/2011

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot