231 research outputs found

    High-dimensional clustering

    High-dimensional (HD) data sets are now common, driven mostly by technological developments: automated variable acquisition, cheaper data storage, and more powerful standard computers for fast data management. All fields are affected by this general inflation in the number of variables; only the definition of "high" is domain dependent. In marketing, this number can be of order 10^2, in microarray gene expression between 10^2 and 10^4, in text mining 10^3 or more, of order 10^6 for single nucleotide polymorphism (SNP) data, etc. Sometimes many more variables can be involved, which is typically the case with discretized curves, for instance curves coming from temporal sequences. Such a technological revolution has a huge impact on other scientific fields, such as societal or mathematical ones. In particular, high-dimensional data management brings new challenges to statisticians, since standard (low-dimensional) data analysis methods struggle to apply directly to the new (high-dimensional) data sets. The reasons, sometimes linked, are twofold: combinatorial difficulties or a disastrously large increase in estimate variance. Data analysis methods are essential for providing a synthetic view of data sets, allowing data summary and data exploration for future decision making, for instance. This need is even more acute in the high-dimensional setting since, on the one hand, the large number of variables suggests that a lot of information is conveyed by the data but, on the other hand, such information may be hidden behind their volume.

    MASSICCC: A SaaS Platform for Clustering and Co-Clustering of Mixed Data


    Mixture models

    Finite mixture models are one of the probabilistic frameworks that reach an especially diverse community, including statisticians and practitioners (scientific or not). The initial reasons for encountering mixtures may differ between these communities, but they ultimately lead to close interconnections between them. Indeed, applied statisticians and practitioners usually discover finite mixture models through the many application fields where they have been successful, typically supervised, unsupervised, and semi-supervised classification and density estimation. The keys to these successes are both their high meaningfulness and their flexibility. However, this flexibility in turn raises algorithmic and mathematical questions for methodological and theoretical statisticians. In particular, it raises estimation and model selection issues, in both computational and mathematical respects. Solutions to these issues, however, benefit greatly from taking the initial application fields into account.

    BigStat for Big Data: Big Data clustering through the BigStat SaaS platform

    BigStat is a web platform devoted to clustering big data sets through two hosted software packages, MixtComp and BlockCluster. The former addresses mixed, missing and uncertain data in a moderate-dimensional setting, whereas the latter is devoted to high-dimensional data sets with non-mixed, non-missing and non-uncertain data. The mathematical foundations of both rely on mixture models and related algorithms.