    Unsupervised extremely randomized trees

    In this paper we present a method to compute dissimilarities on unlabeled data, based on extremely randomized trees. This method, Unsupervised Extremely Randomized Trees, is used jointly with a novel randomized labeling scheme, described here, which we call AddCl3. Unlike existing methods such as AddCl1 and AddCl2, no synthetic instances are generated, thus avoiding an increase in the size of the dataset. The empirical study of this method shows that Unsupervised Extremely Randomized Trees with AddCl3 provides competitive results regarding the quality of the resulting clusterings, while clearly outperforming previous similar methods in terms of running time.

    From patterned response dependency to structured covariate dependency: categorical-pattern-matching

    Data generated from a system of interest typically consist of measurements from an ensemble of subjects across multiple response and covariate features, and are naturally represented by one response matrix against one covariate matrix. Each of these two matrices is likely to contain heterogeneous data types simultaneously: continuous, discrete and categorical. Here a matrix is used as a practical platform that keeps hidden dependency among and between subjects and features intact on its lattice. Response and covariate dependency are computed individually and expressed through multiscale blocks via a newly developed computing paradigm named Data Mechanics. We propose a categorical pattern matching approach to establish causal linkages in the form of information flows from patterned response dependency to structured covariate dependency. The strength of an information flow is evaluated by applying combinatorial information theory. This unified platform for system knowledge discovery is illustrated through five data sets. In each illustrative case, an information flow is demonstrated as an organization of discovered knowledge loci via emergent visible and readable heterogeneity. This unified approach fundamentally resolves many long-standing issues in data analysis, including statistical modeling, multiple response, renormalization and feature selection, without involving man-made structures or distribution assumptions. The results reported here enhance the idea that linking patterns of response dependency to structures of covariate dependency is the true philosophical foundation underlying data-driven computing and learning in sciences. Comment: 32 pages, 10 figures, 3 box pictures

    Statistics in the Big Data era

    It is estimated that about 90% of the currently available data have been produced over the last two years. Of these, only 0.5% are effectively analysed and used. However, these data can be a great source of wealth, the oil of the 21st century, when analysed with the right approach. In this article, we illustrate some specificities of these data and the great interest that they can represent in many fields. We then consider some challenges to statistical analysis that emerge from analysing them, and suggest some strategies.