Deep R Programming
Deep R Programming is a comprehensive course on one of the most popular
languages in data science (statistical computing, graphics, machine learning,
data wrangling and analytics). It introduces the base language in-depth and is
aimed at ambitious students, practitioners, and researchers who would like to
become independent users of this powerful environment. This textbook is a
non-profit project. Its online and PDF versions are freely available at
. This early draft is distributed in the hope
that it will be useful. (Comment: draft v0.2.1, 2023-04-27.)
A Framework for Benchmarking Clustering Algorithms
The evaluation of clustering algorithms can involve running them on a variety
of benchmark problems, and comparing their outputs to the reference,
ground-truth groupings provided by experts. Unfortunately, many research papers
and graduate theses consider only a small number of datasets. Also, the fact
that there can be many equally valid ways to cluster a given problem set is
rarely taken into account. In order to overcome these limitations, we have
developed a framework whose aim is to introduce a consistent methodology for
testing clustering algorithms. Furthermore, we have aggregated, polished, and
standardised many clustering benchmark dataset collections referred to across
the machine learning and data mining literature, and included new datasets of
different dimensionalities, sizes, and cluster types. An interactive datasets
explorer, the documentation of the Python API, a description of the ways to
interact with the framework from other programming languages such as R or
MATLAB, and other details are all provided on the project website.
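Scoring a clustering against expert-provided ground truth, as the framework above does, is typically based on a partition-similarity index. As a rough, self-contained illustration (not the framework's own API), here is the classical pair-counting Rand index in Python:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which two partitions agree
    (both points together in both, or separated in both)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Ground truth: two groups of three; the prediction misplaces one point.
truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
print(round(rand_index(truth, pred), 3))  # → 0.667
```

Note that such indices are label-invariant: relabelling the clusters does not change the score, which matters when, as the abstract stresses, several reference partitions of the same dataset can be equally valid.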
stringi: Fast and Portable Character String Processing in R
Effective processing of character strings is required at various stages of data analysis pipelines: from data cleansing and preparation, through information extraction, to report generation. Pattern searching, string collation and sorting, normalization, transliteration, and formatting are ubiquitous in text mining, natural language processing, and bioinformatics. This paper discusses and demonstrates how and why stringi, a mature R package for fast and portable handling of string data based on ICU (International Components for Unicode), should be included in each statistician's or data scientist's repertoire to complement their numerical computing and data wrangling skills.
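stringi itself is an R package built on ICU, but the class of problem it addresses is easy to demonstrate. The sketch below uses Python's standard unicodedata module (chosen for illustration only; it is not stringi's API) to show why Unicode normalization matters before comparing strings:

```python
import unicodedata

# "café" can be encoded two ways: é as one precomposed code point (NFC),
# or as "e" followed by a combining acute accent (NFD).
nfc = "caf\u00e9"
nfd = "cafe\u0301"

print(nfc == nfd)                                # False: code points differ
print(unicodedata.normalize("NFC", nfd) == nfc)  # True after normalization
```

Naive byte- or code-point-level comparison treats the two spellings as different strings; normalizing both sides first makes equality, sorting, and pattern matching behave as a human reader would expect.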
Clustering with minimum spanning trees: How good can it be?
Minimum spanning trees (MSTs) provide a convenient representation of datasets
in numerous pattern recognition activities. Moreover, they are relatively fast
to compute. In this paper, we quantify the extent to which they can be
meaningful in data clustering tasks. By identifying the upper bounds for the
agreement between the best (oracle) algorithm and the expert labels from a
large battery of benchmark data, we discover that MST methods can overall be
very competitive. Next, instead of proposing yet another algorithm that
performs well on a limited set of examples, we review, study, extend, and
generalise the existing state-of-the-art MST-based partitioning schemes, which
leads to a few new and interesting approaches. It turns out that the Genie
method and the information-theoretic approaches often outperform the non-MST
algorithms such as k-means, Gaussian mixtures, spectral clustering, BIRCH, and
classical hierarchical agglomerative procedures.
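The simplest MST-based partitioning scheme (a baseline, not the Genie method studied above) builds the minimum spanning tree and removes its k−1 heaviest edges, so the remaining connected components become the clusters. A minimal stdlib-only sketch:

```python
import math

def mst_clusters(points, k):
    """Cluster by building an MST (Prim's algorithm) and
    cutting the k-1 heaviest edges."""
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    # Prim's algorithm: grow the tree outwards from vertex 0.
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        u, v = min(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: dist(*e),
        )
        edges.append((dist(u, v), u, v))
        in_tree.add(v)
    # Keep the n-k shortest MST edges, then label connected components.
    kept = sorted(edges)[: n - k]
    adj = {i: set() for i in range(n)}
    for _, u, v in kept:
        adj[u].add(v)
        adj[v].add(u)
    labels, cur = [-1] * n, 0
    for s in range(n):
        if labels[s] == -1:
            stack = [s]
            while stack:
                x = stack.pop()
                if labels[x] == -1:
                    labels[x] = cur
                    stack.extend(adj[x])
            cur += 1
    return labels

# Two well-separated groups: three points near the origin, two far away.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(mst_clusters(pts, 2))  # → [0, 0, 0, 1, 1]
```

Cutting the longest edges is equivalent to single-linkage clustering; the paper's point is that smarter rules for choosing which MST edges to sever (e.g. Genie's inequality-aware criterion) yield far more robust partitions at the same low cost of computing the tree.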
Hierarchical Clustering with OWA-based Linkages, the Lance-Williams Formula, and Dendrogram Inversions
Agglomerative hierarchical clustering based on Ordered Weighted Averaging
(OWA) operators not only generalises the single, complete, and average
linkages, but also includes intercluster distances based on a few nearest or
farthest neighbours, trimmed and winsorised means of pairwise point
similarities, amongst many others. We explore the relationships between the
famous Lance-Williams update formula and the extended OWA-based linkages with
weights generated via infinite coefficient sequences. Furthermore, we provide
some conditions on the weight generators that guarantee that the resulting
dendrograms are free from unaesthetic inversions.
The use of fuzzy relations in the assessment of information resources producers' performance
The producers assessment problem has many important practical instances: it is an abstract model for intelligent systems evaluating, e.g., the quality of computer software repositories, web resources, social networking services, and digital libraries. Each producer's performance is determined not only by the overall quality of the items they have output, but also by the number of such items (which may differ between agents). Recent theoretical results indicate that the use of aggregation operators in the process of ranking and evaluating producers may not necessarily lead to fair and plausible outcomes. Therefore, to overcome some weaknesses of the most commonly applied approach, in this preliminary study we encourage the use of a fuzzy preference relation-based setting and indicate why it may provide better control over the assessment process.
Gini-stable Lorenz curves and their relation to the generalised Pareto distribution
We introduce an iterative discrete information production process where we
can extend ordered normalised vectors by new elements based on a simple affine
transformation, while preserving the predefined level of inequality, G, as
measured by the Gini index.
Then, we derive the family of empirical Lorenz curves of the corresponding
vectors and prove that it is stochastically ordered with respect to both the
sample size and G, which plays the role of the uncertainty parameter. We prove
that asymptotically, we obtain all, and only, Lorenz curves generated by a new,
intuitive parametrisation of the finite-mean Pickands' Generalised Pareto
Distribution (GPD) that unifies three other families, namely: the Pareto Type
II, exponential, and scaled beta distributions. The family is not only totally
ordered with respect to the parameter G, but also, thanks to our derivations,
has a nice underlying interpretation. Our result may thus shed new light on
the genesis of this family of distributions.
Our model fits bibliometric, informetric, socioeconomic, and environmental
data reasonably well. It is quite user-friendly, as it depends only on the
sample size and its Gini index.
gagolews/stringx: stringx_0.2.6
<h2>0.2.6 (2023-11-30)</h2>
<ul>
<li><p>[BACKWARD INCOMPATIBILITY] <code>strptime</code> fills missing fields based
on today's midnight, due to a change in <em>stringi</em>-1.8.1.</p>
</li>
<li><p>[BUGFIX] #13: Subtracting objects of class <code>POSIXxt</code> resulted in an error.</p>
</li>
</ul>
…