Deep R Programming
Deep R Programming is a comprehensive course on one of the most popular
languages in data science (statistical computing, graphics, machine learning,
data wrangling and analytics). It introduces the base language in-depth and is
aimed at ambitious students, practitioners, and researchers who would like to
become independent users of this powerful environment. This textbook is a
non-profit project. Its online and PDF versions are freely available at
. This early draft is distributed in the hope
that it will be useful. (Comment: draft v0.2.1, 2023-04-27.)
A Framework for Benchmarking Clustering Algorithms
The evaluation of clustering algorithms can involve running them on a variety
of benchmark problems, and comparing their outputs to the reference,
ground-truth groupings provided by experts. Unfortunately, many research papers
and graduate theses consider only a small number of datasets. Also, the fact
that there can be many equally valid ways to cluster a given problem set is
rarely taken into account. In order to overcome these limitations, we have
developed a framework whose aim is to introduce a consistent methodology for
testing clustering algorithms. Furthermore, we have aggregated, polished, and
standardised many clustering benchmark dataset collections referred to across
the machine learning and data mining literature, and included new datasets of
different dimensionalities, sizes, and cluster types. An interactive datasets
explorer, the documentation of the Python API, a description of the ways to
interact with the framework from other programming languages such as R or
MATLAB, and other details are all provided on the project website.
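Scoring a clustering against expert-provided ground truth, as the framework above does, is typically based on a partition-similarity index. As a rough, self-contained illustration (not the framework's own API), here is the classical pair-counting Rand index in Python:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which two partitions agree
    (both points together in both, or separated in both)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Ground truth: two groups of three; the prediction misplaces one point.
truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
print(round(rand_index(truth, pred), 3))  # → 0.667
```

Note that such indices are label-invariant: relabelling the clusters does not change the score, which matters when, as the abstract stresses, several reference partitions of the same dataset can be equally valid.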
stringi: Fast and Portable Character String Processing in R
Effective processing of character strings is required at various stages of data analysis pipelines: from data cleansing and preparation, through information extraction, to report generation. Pattern searching, string collation and sorting, normalization, transliteration, and formatting are ubiquitous in text mining, natural language processing, and bioinformatics. This paper discusses and demonstrates how and why stringi, a mature R package for fast and portable handling of string data based on ICU (International Components for Unicode), should be included in each statistician's or data scientist's repertoire to complement their numerical computing and data wrangling skills.
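stringi itself is an R package built on ICU, but the class of problem it addresses is easy to demonstrate. The sketch below uses Python's standard unicodedata module (chosen for illustration only; it is not stringi's API) to show why Unicode normalization matters before comparing strings:

```python
import unicodedata

# "café" can be encoded two ways: é as one precomposed code point (NFC),
# or as "e" followed by a combining acute accent (NFD).
nfc = "caf\u00e9"
nfd = "cafe\u0301"

print(nfc == nfd)                                # False: code points differ
print(unicodedata.normalize("NFC", nfd) == nfc)  # True after normalization
```

Naive byte- or code-point-level comparison treats the two spellings as different strings; normalizing both sides first makes equality, sorting, and pattern matching behave as a human reader would expect.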
Clustering with minimum spanning trees: How good can it be?
Minimum spanning trees (MSTs) provide a convenient representation of datasets
in numerous pattern recognition activities. Moreover, they are relatively fast
to compute. In this paper, we quantify the extent to which they can be
meaningful in data clustering tasks. By identifying the upper bounds for the
agreement between the best (oracle) algorithm and the expert labels from a
large battery of benchmark data, we discover that MST methods can overall be
very competitive. Next, instead of proposing yet another algorithm that
performs well on a limited set of examples, we review, study, extend, and
generalise the existing state-of-the-art MST-based partitioning schemes, which
leads to a few new and interesting approaches. It turns out that the Genie
method and the information-theoretic approaches often outperform the non-MST
algorithms such as k-means, Gaussian mixtures, spectral clustering, BIRCH, and
classical hierarchical agglomerative procedures.
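The simplest MST-based partitioning scheme (a baseline, not the Genie method studied above) builds the minimum spanning tree and removes its k−1 heaviest edges, so the remaining connected components become the clusters. A minimal stdlib-only sketch:

```python
import math

def mst_clusters(points, k):
    """Cluster by building an MST (Prim's algorithm) and
    cutting the k-1 heaviest edges."""
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    # Prim's algorithm: grow the tree outwards from vertex 0.
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        u, v = min(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: dist(*e),
        )
        edges.append((dist(u, v), u, v))
        in_tree.add(v)
    # Keep the n-k shortest MST edges, then label connected components.
    kept = sorted(edges)[: n - k]
    adj = {i: set() for i in range(n)}
    for _, u, v in kept:
        adj[u].add(v)
        adj[v].add(u)
    labels, cur = [-1] * n, 0
    for s in range(n):
        if labels[s] == -1:
            stack = [s]
            while stack:
                x = stack.pop()
                if labels[x] == -1:
                    labels[x] = cur
                    stack.extend(adj[x])
            cur += 1
    return labels

# Two well-separated groups: three points near the origin, two far away.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(mst_clusters(pts, 2))  # → [0, 0, 0, 1, 1]
```

Cutting the longest edges is equivalent to single-linkage clustering; the paper's point is that smarter rules for choosing which MST edges to sever (e.g. Genie's inequality-aware criterion) yield far more robust partitions at the same low cost of computing the tree.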
Hierarchical Clustering with OWA-based Linkages, the Lance-Williams Formula, and Dendrogram Inversions
Agglomerative hierarchical clustering based on Ordered Weighted Averaging
(OWA) operators not only generalises the single, complete, and average
linkages, but also includes intercluster distances based on a few nearest or
farthest neighbours, trimmed and winsorised means of pairwise point
similarities, amongst many others. We explore the relationships between the
famous Lance-Williams update formula and the extended OWA-based linkages with
weights generated via infinite coefficient sequences. Furthermore, we provide
some conditions on the weight generators that guarantee that the resulting
dendrograms are free from unaesthetic inversions.
The use of fuzzy relations in the assessment of information resources producers' performance
The producers assessment problem has many important practical instances: it is an abstract model for intelligent systems evaluating, e.g., the quality of computer software repositories, web resources, social networking services, and digital libraries. Each producer's performance is determined not only by the overall quality of the items they have output, but also by the number of such items (which may differ between agents). Recent theoretical results indicate that the use of aggregation operators in the process of ranking and evaluating producers may not necessarily lead to fair and plausible outcomes. Therefore, to overcome some weaknesses of the most commonly applied approach, in this preliminary study we encourage the use of a fuzzy preference relation-based setting and indicate why it may provide better control over the assessment process.
Gini-stable Lorenz curves and their relation to the generalised Pareto distribution
We introduce an iterative discrete information production process where we
can extend ordered normalised vectors by new elements based on a simple affine
transformation, while preserving the predefined level of inequality, G, as
measured by the Gini index.
Then, we derive the family of empirical Lorenz curves of the corresponding
vectors and prove that it is stochastically ordered with respect to both the
sample size and G, which plays the role of the uncertainty parameter. We prove
that asymptotically, we obtain all, and only, Lorenz curves generated by a new,
intuitive parametrisation of the finite-mean Pickands' Generalised Pareto
Distribution (GPD) that unifies three other families, namely: the Pareto Type
II, exponential, and scaled beta distributions. The family is not only totally
ordered with respect to the parameter G, but also, thanks to our derivations,
has a nice underlying interpretation. Our result may thus shed new light on
the genesis of this family of distributions.
Our model fits bibliometric, informetric, socioeconomic, and environmental
data reasonably well. It is quite user-friendly, as it depends only on the
sample size and its Gini index.
gagolews/stringx: stringx_0.2.6
<h2>0.2.6 (2023-11-30)</h2>
<ul>
<li><p>[BACKWARD INCOMPATIBILITY] <code>strptime</code> fills missing fields based
on today's midnight, due to a change in <em>stringi</em>-1.8.1.</p>
</li>
<li><p>[BUGFIX] #13: Subtracting objects of class <code>POSIXxt</code> resulted in an error.</p>
</li>
</ul>
…