58 research outputs found

    Deep R Programming

    Full text link
    Deep R Programming is a comprehensive course on one of the most popular languages in data science (statistical computing, graphics, machine learning, data wrangling and analytics). It introduces the base language in-depth and is aimed at ambitious students, practitioners, and researchers who would like to become independent users of this powerful environment. This textbook is a non-profit project. Its online and PDF versions are freely available at . This early draft is distributed in the hope that it will be useful.Comment: Draft: v0.2.1 (2023-04-27

    A Framework for Benchmarking Clustering Algorithms

    Full text link
    The evaluation of clustering algorithms can involve running them on a variety of benchmark problems, and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate theses consider only a small number of datasets. Also, the fact that there can be many equally valid ways to cluster a given problem set is rarely taken into account. In order to overcome these limitations, we have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms. Furthermore, we have aggregated, polished, and standardised many clustering benchmark dataset collections referred to across the machine learning and data mining literature, and included new datasets of different dimensionalities, sizes, and cluster types. An interactive datasets explorer, the documentation of the Python API, a description of the ways to interact with the framework from other programming languages such as R or MATLAB, and other details are all provided at

    stringi: Fast and Portable Character String Processing in R

    Get PDF
    Effective processing of character strings is required at various stages of data analysis pipelines: from data cleansing and preparation, through information extraction, to report generation. Pattern searching, string collation and sorting, normalization, transliteration, and formatting are ubiquitous in text mining, natural language processing, and bioinformatics. This paper discusses and demonstrates how and why stringi, a mature R package for fast and portable handling of string data based on ICU (International Components for Unicode), should be included in each statistician's or data scientist's repertoire to complement their numerical computing and data wrangling skills

    Clustering with minimum spanning trees: How good can it be?

    Full text link
    Minimum spanning trees (MSTs) provide a convenient representation of datasets in numerous pattern recognition activities. Moreover, they are relatively fast to compute. In this paper, we quantify the extent to which they can be meaningful in data clustering tasks. By identifying the upper bounds for the agreement between the best (oracle) algorithm and the expert labels from a large battery of benchmark data, we discover that MST methods can overall be very competitive. Next, instead of proposing yet another algorithm that performs well on a limited set of examples, we review, study, extend, and generalise existing, the state-of-the-art MST-based partitioning schemes, which leads to a few new and interesting approaches. It turns out that the Genie method and the information-theoretic approaches often outperform the non-MST algorithms such as k-means, Gaussian mixtures, spectral clustering, BIRCH, and classical hierarchical agglomerative procedures

    Hierarchical Clustering with OWA-based Linkages, the Lance-Williams Formula, and Dendrogram Inversions

    Full text link
    Agglomerative hierarchical clustering based on Ordered Weighted Averaging (OWA) operators not only generalises the single, complete, and average linkages, but also includes intercluster distances based on a few nearest or farthest neighbours, trimmed and winsorised means of pairwise point similarities, amongst many others. We explore the relationships between the famous Lance-Williams update formula and the extended OWA-based linkages with weights generated via infinite coefficient sequences. Furthermore, we provide some conditions for the weight generators to guarantee the resulting dendrograms to be free from unaesthetic inversions

    The use of fuzzy relations in the assessment of information resources producers' performance

    Get PDF
    Abstract. The producers assessment problem has many important practical instances: it is an abstract model for intelligent systems evaluating e.g. the quality of computer software repositories, web resources, social networking services, and digital libraries. Each producer's performance is determined according not only to the overall quality of the items he/she outputted, but also to the number of such items (which may be different for each agent). Recent theoretical results indicate that the use of aggregation operators in the process of ranking and evaluation producers may not necessarily lead to fair and plausible outcomes. Therefore, to overcome some weaknesses of the most often applied approach, in this preliminary study we encourage the use of a fuzzy preference relation-based setting and indicate why it may provide better control over the assessment process

    Gini-stable Lorenz curves and their relation to the generalised Pareto distribution

    Full text link
    We introduce an iterative discrete information production process where we can extend ordered normalised vectors by new elements based on a simple affine transformation, while preserving the predefined level of inequality, G, as measured by the Gini index. Then, we derive the family of empirical Lorenz curves of the corresponding vectors and prove that it is stochastically ordered with respect to both the sample size and G which plays the role of the uncertainty parameter. We prove that asymptotically, we obtain all, and only, Lorenz curves generated by a new, intuitive parametrisation of the finite-mean Pickands' Generalised Pareto Distribution (GPD) that unifies three other families, namely: the Pareto Type II, exponential, and scaled beta distributions. The family is not only totally ordered with respect to the parameter G, but also, thanks to our derivations, has a nice underlying interpretation. Our result may thus shed a new light on the genesis of this family of distributions. Our model fits bibliometric, informetric, socioeconomic, and environmental data reasonably well. It is quite user-friendly for it only depends on the sample size and its Gini index

    gagolews/stringx: stringx_0.2.6

    No full text
    <h2>0.2.6 (2023-11-30)</h2> <ul> <li><p>[BACKWARD INCOMPATIBILITY] <code>strptime</code> fills missing fields based on today's midnight, due to a change in <em>stringi</em>-1.8.1.</p> </li> <li><p>[BUGFIX] #13: Subtracting of objects of class <code>POSIXxt</code> resulted in an error.</p> </li> </ul&gt

    A critique of the bounded fuzzy possibilistic method

    Full text link

    On the relation between effort-dominating and symmetric minitive aggregation operators

    Full text link
    • 

    corecore