A framework for benchmarking clustering algorithms
The evaluation of clustering algorithms can involve running them on a variety of benchmark problems, and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate theses consider only a small number of datasets. Also, the fact that there can be many equally valid ways to cluster a given problem set is rarely taken into account. In order to overcome these limitations, we have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms. Furthermore, we have aggregated, polished, and standardised many clustering benchmark dataset collections referred to across the machine learning and data mining literature, and included new datasets of different dimensionalities, sizes, and cluster types. An interactive datasets explorer, the documentation of the Python API, a description of the ways to interact with the framework from other programming languages such as R or MATLAB, and other details are all provided at https://clustering-benchmarks.gagolewski.com
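As a rough illustration of the evaluation workflow described above (and not of the framework's own Python API, which is documented at the address given), the following sketch clusters a synthetic stand-in for a benchmark dataset and compares the result with a reference partition using the adjusted Rand index from scikit-learn.

```python
# A minimal sketch of the benchmark workflow: cluster a dataset and compare
# the output with expert-provided ground-truth labels.  Data and algorithm
# choices below are illustrative assumptions, not the framework's own API.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# stand-in for a benchmark dataset with a reference partition
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
y_true = np.repeat([0, 1], 100)

y_pred = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# external validity measure: 1.0 means perfect agreement with the reference
print("adjusted Rand index:", adjusted_rand_score(y_true, y_pred))
```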
Minimalist Data Wrangling with Python
Minimalist Data Wrangling with Python is envisaged as a student's first introduction to data science, providing a high-level overview as well as discussing key concepts in detail. We explore methods for cleaning data gathered from different sources, transforming, selecting, and extracting features, performing exploratory data analysis and dimensionality reduction, identifying naturally occurring data clusters, modelling patterns in data, comparing data between groups, and reporting the results. This textbook is a non-profit project. Its online and PDF versions are freely available at https://datawranglingpy.gagolewski.com/.
Time to vote: Temporal clustering of user activity on Stack Overflow
Question-and-answer (Q&A) sites improve access to information and ease the transfer of knowledge. In recent years, they have grown in popularity and importance, enabling research on behavioral patterns of their users. We study the dynamics related to the casting of 7 M votes across a sample of 700 k posts on Stack Overflow, a large community of professional software developers. We employ log-Gaussian mixture modeling and Markov chains to formulate a simple yet elegant description of the considered phenomena. We indicate that the interevent times can naturally be clustered into three typical time scales: those occurring within hours, weeks, and months, and we show how the events become rarer and rarer as time passes. It turns out that a post's popularity shortly after publication is a weak predictor of its overall success, contrary to what was observed, for example, in the case of YouTube clips. Nonetheless, such sleeping beauties sometimes awake and can receive bursts of votes following each other relatively quickly.
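The log-Gaussian mixture idea can be sketched as follows on synthetic interevent times; the data, the component count, and all parameters below are stand-ins chosen only to illustrate the hours/weeks/months time scales, not the paper's actual fit.

```python
# Fit a 3-component Gaussian mixture to log interevent times (in hours);
# the components play the role of the "hours/weeks/months" time scales.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# synthetic stand-in for times between consecutive votes on a post, in hours
dt = np.concatenate([
    rng.lognormal(mean=np.log(5), sigma=0.8, size=4000),       # within hours
    rng.lognormal(mean=np.log(24 * 7), sigma=0.7, size=2000),  # within weeks
    rng.lognormal(mean=np.log(24 * 90), sigma=0.6, size=1000)  # within months
])

gm = GaussianMixture(n_components=3, random_state=1).fit(np.log(dt).reshape(-1, 1))
for m, w in sorted(zip(gm.means_.ravel(), gm.weights_)):
    print(f"typical scale ~ {np.exp(m):8.1f} hours, weight {w:.2f}")
```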
Power laws, the Price model, and the Pareto type-2 distribution
We consider a version of D. Price's model for the growth of a bibliographic network, where in each iteration a constant number of citations is randomly allocated according to a weighted combination of the accidental (uniformly distributed) and the preferential (rich-get-richer) rule. Instead of relying on the typical master equation approach, we formulate and solve this problem in terms of the rank–size distribution. We show that, asymptotically, such a process leads to a Pareto type-2 distribution with a new, appealingly interpretable parametrisation. We prove that the solution to the Price model expressed in terms of the rank–size distribution coincides with the expected values of order statistics in an independent Paretian sample. An empirical analysis of a large repository of academic papers yields a good fit not only in the tail of the distribution (as is usually the case in the power law-like framework), but also across a significantly larger fraction of the data domain.
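A minimal simulation of a Price-type process of this kind might look as follows; the number of papers, the citations per iteration, and the preferential-versus-accidental mixing probability are illustrative assumptions, and the paper itself proceeds analytically via the rank–size distribution rather than by simulation.

```python
# Illustrative simulation: each arriving paper distributes m citations, each
# allocated preferentially ("rich get richer") with probability p and
# uniformly at random otherwise.  Parameters chosen purely for illustration.
import numpy as np

rng = np.random.default_rng(2)
n_papers, m, p = 10_000, 3, 0.5     # papers, citations per step, preferential share
cites = np.zeros(n_papers, dtype=int)
endpoints = []                      # every received citation, recorded by paper id

for t in range(1, n_papers):        # paper t arrives and distributes m citations
    for _ in range(m):
        if endpoints and rng.random() < p:
            target = endpoints[rng.integers(len(endpoints))]  # preferential
        else:
            target = int(rng.integers(t))                     # accidental (uniform)
        cites[target] += 1
        endpoints.append(target)

# rank-size view: sorted citation counts, to be compared with a Pareto type-2 fit
rank_size = np.sort(cites)[::-1]
print(rank_size[:10])
```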
Accidentality in journal citation patterns
We study an agent-based model for generating citation distributions in complex networks of scientific papers, where a fraction of citations is allotted according to the preferential attachment rule (rich get richer) and the remainder is allocated accidentally (purely at random, uniformly). Previously, we derived and analysed such a process in the context of describing individual authors, but now we apply it to scientific journals in computer and information sciences. Based on the large DBLP dataset as well as the CORE (Computing Research and Education Association of Australasia) journal ranking, we find that the impact of journals is correlated with the degree of accidentality of their citation distribution. Citations to impactful journals tend to be more preferential, while citations to lower-ranked journals are distributed in a more accidental manner. Further, applied fields of research such as artificial intelligence seem to be driven by a stronger preferential component (and hence have a higher degree of inequality) than the more theoretical ones, e.g., mathematics and computation theory.
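For a quick feel of what a "higher degree of inequality" of a citation vector means, one can compute, e.g., the Gini coefficient, as in the sketch below; note that the paper quantifies accidentality through the model's mixing parameter, so the Gini index here is only a loose illustrative proxy.

```python
# Gini coefficient of a citation vector as a rough inequality summary
# (0 = all papers cited equally often, values near 1 = highly concentrated).
import numpy as np

def gini(x):
    """Gini coefficient of a non-negative vector."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    return (2 * np.sum(ranks * x) / (n * np.sum(x))) - (n + 1) / n

equal_journal = np.full(100, 50)                              # perfectly even citations
skewed_journal = np.random.default_rng(3).pareto(1.5, 100) * 10  # heavy-tailed citations
print(gini(equal_journal), gini(skewed_journal))
```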
Random generation of linearly constrained fuzzy measures and domain coverage performance evaluation
The random generation of fuzzy measures under complex linear constraints holds significance in various fields, including optimization, machine learning, decision making, and property investigation. However, most existing random generation methods primarily focus on addressing the monotonicity and normalization conditions inherent in the construction of fuzzy measures, rather than the linear constraints that are crucial for representing special families of fuzzy measures and additional preference information. In this paper, we present two categories of methods for generating linearly constrained fuzzy measures using linear programming models. These methods enable a comprehensive exploration and coverage of the entire feasible convex domain. The first category involves randomly selecting a subset and assigning measure values within the allowable range under the given linear constraints. The second category utilizes convex combinations of constrained extreme fuzzy measures and vertex fuzzy measures. We then employ indices of fuzzy measures, objective functions, and distances to domain boundaries to evaluate the coverage performance of these methods across the entire feasible domain. We further provide enhancement techniques to improve the coverage ratios. Finally, we discuss and demonstrate potential applications of these generation methods in practical scenarios.
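A minimal sketch of the basic building block, generating a random monotone and normalized fuzzy measure on a small ground set, is given below; it does not handle the additional linear constraints (which the paper addresses via linear programming and convex combinations of extreme measures), and it does not sample uniformly from the feasible polytope.

```python
# Generate a random fuzzy measure (monotone, normalized set function) on a
# 3-element ground set by assigning values in order of increasing cardinality,
# each within its currently admissible range.  Monotonicity and normalization
# only; no linear constraints, and not a uniform sample over the polytope.
import itertools
import numpy as np

rng = np.random.default_rng(4)
X = (0, 1, 2)                                   # ground set of 3 criteria
subsets = sorted(
    (frozenset(s) for r in range(len(X) + 1)
     for s in itertools.combinations(X, r)),
    key=len,
)

mu = {frozenset(): 0.0, frozenset(X): 1.0}      # boundary conditions
for A in subsets:
    if A in mu:
        continue
    # lower bound: values already assigned to proper subsets of A (monotonicity)
    lo = max(mu[B] for B in mu if B < A)
    mu[A] = rng.uniform(lo, 1.0)                # any value in the admissible range

for A in subsets:
    print(sorted(A), round(mu[A], 3))
```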
Hierarchical clustering with OWA-based linkages, the Lance–Williams formula, and dendrogram inversions
Agglomerative hierarchical clustering based on Ordered Weighted Averaging (OWA) operators not only generalises the single, complete, and average linkages, but also includes intercluster distances based on a few nearest or farthest neighbours, trimmed and winsorised means of pairwise point similarities, amongst many others. We explore the relationships between the famous Lance–Williams update formula and the extended OWA-based linkages with weights generated via infinite coefficient sequences. Furthermore, we provide conditions on the weight generators that guarantee that the resulting dendrograms are free from unaesthetic inversions.
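An OWA-based intercluster distance of the kind described above can be sketched as follows: sort all pairwise distances between two clusters and take an ordered weighted average of them; the weight vectors below show how the single, complete, and average linkages arise as special cases (the Lance–Williams analysis itself is not reproduced here).

```python
# Illustrative OWA-based linkage: an ordered weighted average of all pairwise
# distances between two clusters.  Weight (1,0,...,0) on the smallest distance
# recovers single linkage, (0,...,0,1) complete linkage, uniform weights the
# average linkage.
import numpy as np
from scipy.spatial.distance import cdist

def owa_linkage(A, B, weights):
    d = np.sort(cdist(A, B).ravel())            # pairwise distances, ascending
    w = np.asarray(weights, dtype=float)
    assert len(w) == d.size and np.isclose(w.sum(), 1.0)
    return float(np.dot(w, d))

rng = np.random.default_rng(5)
A, B = rng.normal(0, 1, (3, 2)), rng.normal(4, 1, (4, 2))
n = A.shape[0] * B.shape[0]

single   = np.eye(n)[0]                         # all weight on the smallest distance
complete = np.eye(n)[-1]                        # all weight on the largest distance
average  = np.full(n, 1.0 / n)                  # uniform weights
print(owa_linkage(A, B, single), owa_linkage(A, B, complete), owa_linkage(A, B, average))
```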
A benchmark-type generalization of the Sugeno integral with applications in bibliometrics
We propose a new generalization of the classical Sugeno integral motivated by the Hirsch, Woeginger, and other geometrically inspired indices of scientific impact. The new integral adapts better to the rank-size curve, as it allows for putting more emphasis on highly valued items and/or the tail of the distribution (level measure). We study its fundamental properties and give conditions guaranteeing the fulfillment of subadditivity as well as the Jensen, Liapunov, Hardy, Markov, and Paley–Zygmund type inequalities. We discuss its applications in scientometrics.
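As a point of reference for the classical (non-generalised) case: the Hirsch h-index equals the discrete Sugeno integral of the citation vector with respect to the counting measure, h = max_i min(i, x_(i)) for x sorted non-increasingly, which the following sketch computes.

```python
# The h-index as the Sugeno integral w.r.t. the counting measure:
# h = max_i min(i, x_(i)) with x sorted non-increasingly.  This covers only
# the classical special case, not the generalization proposed in the paper.
import numpy as np

def sugeno_counting(x):
    x = np.sort(np.asarray(x))[::-1]            # non-increasing order
    i = np.arange(1, x.size + 1)
    return int(np.max(np.minimum(i, x)))

citations = [10, 8, 5, 4, 3, 0, 0]
print("h-index =", sugeno_counting(citations))  # -> 4
```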
Fairness in the three-dimensional model for citation impact
We analyse the usefulness of Jain’s fairness measure and the related Prathap’s bibliometric z-index as proxies when estimating the parameters of the 3DSI (three dimensions of scientific impact) model.
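For reference, Jain's fairness index of a non-negative vector x of length n is J(x) = (Σ x_i)² / (n Σ x_i²); the sketch below computes it (Prathap's z-index and the 3DSI estimation procedure are left to the paper).

```python
# Jain's fairness index: J(x) = (sum x)^2 / (n * sum x^2), ranging from
# 1/n (all mass on a single item) to 1 (perfectly even distribution).
import numpy as np

def jain_fairness(x):
    x = np.asarray(x, dtype=float)
    return x.sum() ** 2 / (x.size * np.sum(x ** 2))

print(jain_fairness([5, 5, 5, 5]))    # 1.0  (perfectly fair)
print(jain_fairness([20, 0, 0, 0]))   # 0.25 (maximally concentrated)
```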
Interpretable reparameterisations of citation models
This paper aims to find the reasons why some citation models can predict a set of specific bibliometric indices extremely well. We show why fitting a model that preserves the total sum of a vector can be beneficial in the case of heavy-tailed data that are frequently observed in informetrics and similar disciplines. Based on this observation, we introduce reparameterised versions of the discrete generalised beta distribution (DGBD) and power law models that preserve the total sum of elements in a citation vector and, as a byproduct, enjoy much better predictive power for many bibliometric indices as well as partial cumulative sums. This also makes the underlying model parameters easier to fit numerically and more interpretable. Namely, just like in our recently introduced 3DSI (three dimensions of scientific impact) model, there is a clear distinction between the coefficients determining the total productivity (size), the total impact (sum), and those that affect the shape of the resulting theoretical curve.
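The DGBD rank-size curve is commonly written x_r = A (N + 1 − r)^b / r^a; one simple way to make such a model sum-preserving, in the spirit described above, is to fix the scale A so that the fitted values add up to the observed total, as in the sketch below (the paper's exact reparameterisation may differ).

```python
# Sum-preserving DGBD sketch: fit values x_r = A * (N + 1 - r)^b / r^a with
# the scale A fixed so that the fitted values add up to the observed total
# number of citations.  Shape parameters a, b are illustrative only.
import numpy as np

def dgbd_sum_preserving(total, N, a, b):
    r = np.arange(1, N + 1)
    shape = (N + 1 - r) ** b / r ** a        # shape of the rank-size curve
    A = total / shape.sum()                  # scale chosen to preserve the sum
    return A * shape

citations = np.array([120, 60, 35, 20, 12, 8, 5, 3, 2, 1])   # toy citation vector
fit = dgbd_sum_preserving(citations.sum(), citations.size, a=1.0, b=0.3)
print(fit.round(1), fit.sum(), citations.sum())              # sums coincide
```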
