A CLUE for CLUster Ensembles

Author: Kurt Hornik

Cluster ensembles are collections of individual solutions to a given clustering problem which are useful or necessary to consider in a wide range of applications. The R package clue provides an extensible computational environment for creating and analyzing cluster ensembles, with basic data structures for representing partitions and hierarchies, and facilities for computing on these, including methods for measuring proximity and obtaining consensus and "secondary" clusterings.
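
As a rough illustration of what "measuring proximity" between the members of a cluster ensemble can mean, the sketch below computes the Rand index between partitions given as label vectors. The Rand index is only one of many possible agreement measures and is chosen here as an assumption. The function name `rand_index` and the toy `ensemble` are hypothetical and are not part of the clue package.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two partitions agree.

    A pair counts as an agreement when both partitions put the two
    objects in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two objects")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# A toy ensemble: three partitions of the same four objects.
ensemble = [
    [0, 0, 1, 1],
    [0, 0, 1, 2],
    [1, 1, 0, 0],  # same grouping as the first, with labels swapped
]

# Pairwise agreement matrix over the ensemble.
agreement = [[rand_index(p, q) for q in ensemble] for p in ensemble]
```

Because the measure only looks at co-membership of pairs, relabeling the clusters of a partition does not change its agreement with the others.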

Research Papers in Economics

New probabilistic interest measures for association rules

Authors: Michael Hahsler, Kurt Hornik
Publication date: 07/02/2008

Mining association rules is an important technique for discovering meaningful patterns in transaction databases. Many different measures of interestingness have been proposed for association rules. However, these measures fail to take the probabilistic properties of the mined data into account. In this paper, we start by presenting a simple probabilistic framework for transaction data which can be used to simulate transaction data when no associations are present. We use such data and a real-world database from a grocery outlet to explore the behavior of confidence and lift, two popular interest measures used for rule mining. The results show that confidence is systematically influenced by the frequency of the items in the left-hand side of rules and that lift performs poorly at filtering random noise in transaction data. Based on the probabilistic framework we develop two new interest measures, hyper-lift and hyper-confidence, which can be used to filter or order mined association rules. The new measures show significantly better performance than lift for applications where spurious rules are problematic.
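
For readers unfamiliar with the two measures the abstract examines, here is a minimal sketch of how support, confidence and lift of a rule X -> Y are commonly computed from raw transactions. The function name `rule_measures` and the toy grocery data are assumptions made for illustration. The sketch does not implement the paper's hyper-lift or hyper-confidence.

```python
def rule_measures(transactions, lhs, rhs):
    """Return (support, confidence, lift) of the rule lhs -> rhs."""
    transactions = [set(t) for t in transactions]
    lhs, rhs = set(lhs), set(rhs)
    n = len(transactions)
    count_lhs = sum(lhs <= t for t in transactions)
    count_rhs = sum(rhs <= t for t in transactions)
    count_both = sum((lhs | rhs) <= t for t in transactions)
    if n == 0 or count_lhs == 0 or count_rhs == 0:
        raise ValueError("both sides of the rule must occur at least once")
    support = count_both / n               # P(X and Y)
    confidence = count_both / count_lhs    # P(Y | X)
    lift = confidence / (count_rhs / n)    # P(Y | X) / P(Y)
    return support, confidence, lift


# Hypothetical toy data, not taken from the paper.
baskets = [
    {"milk", "bread"},
    {"milk"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"bread"},
]
support, confidence, lift = rule_measures(baskets, {"milk"}, {"bread"})
```

A lift near 1 is what independence between the two sides would produce. On small counts, lift can deviate from 1 by chance alone, which is the noise problem the abstract refers to.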

arXiv.org e-Print Archive

CiteSeerX

mistr: A Computational Framework for Mixture and Composite Distributions

Authors: Kurt Hornik, Lukas Sablica
Publication venue: The R Foundation for Statistical Computing
Publication date: 01/01/2020

Finite mixtures and composite distributions allow the probabilistic representation of data to be modeled with more generality than simple distributions and are useful to consider in a wide range of applications. The R package mistr provides an extensible computational framework for creating, transforming, and evaluating these models, together with multiple methods for their visualization and description. In this paper we present the main computational framework of the package and illustrate its application. In addition, we provide and show functions for data modeling using two specific composite distributions as well as a numerical example where a composite distribution is estimated to describe the log-returns of selected stocks.
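
To make the notion of a finite mixture concrete, the sketch below evaluates the distribution function of a mixture as the weighted sum of its component distribution functions, F(x) = sum_i w_i F_i(x). The helper names `normal_cdf` and `mixture_cdf` and the choice of normal components are assumptions for illustration. The sketch does not reflect the mistr interface and does not cover composite (piecewise truncated) distributions.

```python
import math


def normal_cdf(x, mean=0.0, sd=1.0):
    """Distribution function of a normal distribution via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def mixture_cdf(x, weights, component_cdfs):
    """CDF of a finite mixture: the weighted sum of the component CDFs."""
    if len(weights) != len(component_cdfs):
        raise ValueError("one weight per component is required")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * cdf(x) for w, cdf in zip(weights, component_cdfs))


# Hypothetical example: a narrow and a wide normal, both centered at 0.
weights = [0.3, 0.7]
components = [
    lambda x: normal_cdf(x, mean=0.0, sd=1.0),
    lambda x: normal_cdf(x, mean=0.0, sd=2.0),
]
median_prob = mixture_cdf(0.0, weights, components)
```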

Elektronische Publikationen der Wirtschaftsuniversität Wien

Generalized and Customizable Sets in R

Authors: David Meyer, Kurt Hornik

We present data structures and algorithms for sets and some generalizations thereof (fuzzy sets, multisets, and fuzzy multisets) available for R through the sets package. Fuzzy (multi-)sets are based on dynamically bound fuzzy logic families. Further extensions include user-definable iterators and matching functions.
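
The idea of binding a fuzzy logic family at run time can be sketched as follows: a fuzzy set is a mapping from elements to membership degrees, and intersection and union are computed with whichever conjunction and disjunction the chosen family supplies. The names `ZADEH`, `PRODUCT` and `fuzzy_combine` are hypothetical, only two well-known families are shown, and none of this mirrors the interface of the sets package.

```python
# Each family is a pair (conjunction, disjunction) on degrees in [0, 1].
ZADEH = (min, max)
PRODUCT = (lambda a, b: a * b, lambda a, b: a + b - a * b)


def fuzzy_combine(a, b, op):
    """Combine two fuzzy sets element-wise with the given operation.

    Elements missing from a set have membership 0; elements whose
    resulting membership is 0 are dropped from the result.
    """
    result = {}
    for element in sorted(set(a) | set(b)):
        degree = op(a.get(element, 0.0), b.get(element, 0.0))
        if degree > 0.0:
            result[element] = degree
    return result


A = {"x": 0.5, "y": 0.8}
B = {"x": 0.4, "z": 1.0}

conj, disj = ZADEH            # swap in PRODUCT to change the logic family
zadeh_intersection = fuzzy_combine(A, B, conj)
zadeh_union = fuzzy_combine(A, B, disj)

conj, disj = PRODUCT
product_intersection = fuzzy_combine(A, B, conj)
product_union = fuzzy_combine(A, B, disj)
```

Passing the operations as parameters keeps the set code independent of the logic family, which is one simple way to read "dynamically bound".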

Research Papers in Economics

topicmodels: An R Package for Fitting Topic Models

Authors: Bettina Grün, Kurt Hornik

Topic models allow the probabilistic modeling of term frequency occurrences in documents. The fitted model can be used to estimate the similarity between documents as well as between a set of specified keywords using an additional layer of latent variables which are referred to as topics. The R package topicmodels provides basic infrastructure for fitting topic models based on data structures from the text mining package tm. The package includes interfaces to two algorithms for fitting topic models: the variational expectation-maximization algorithm provided by David M. Blei and co-authors and an algorithm using Gibbs sampling by Xuan-Hieu Phan and co-authors.

Research Papers in Economics

TSP--Infrastructure for the Traveling Salesperson Problem

Authors: Kurt Hornik, Michael Hahsler

The traveling salesperson (or, salesman) problem (TSP) is a well-known and important combinatorial optimization problem. The goal is to find the shortest tour that visits each city in a given list exactly once and then returns to the starting city. Despite this simple problem statement, solving the TSP is difficult since it belongs to the class of NP-complete problems. Besides its theoretical appeal, the importance of the TSP arises from the variety of its applications. Typical applications in operations research include vehicle routing, computer wiring, cutting wallpaper and job sequencing. The main application in statistics is combinatorial data analysis, e.g., reordering rows and columns of data matrices or identifying clusters. In this paper, we introduce the R package TSP which provides a basic infrastructure for handling and solving the traveling salesperson problem. The package features S3 classes for specifying a TSP and its (possibly optimal) solution as well as several heuristics to find good solutions. In addition, it provides an interface to Concorde, one of the best exact TSP solvers currently available.
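
The problem statement above can be made concrete with a small sketch: a function that measures the length of a closed tour, and a nearest-neighbor construction heuristic that builds one. Nearest neighbor is used here only as a familiar example of a tour heuristic. The function names and the four-city instance are assumptions and do not represent the interface of the TSP package.

```python
import math


def tour_length(points, tour):
    """Length of the closed tour, including the leg back to the start."""
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def nearest_neighbor_tour(points, start=0):
    """Greedy heuristic: always move to the closest unvisited city.

    Ties are broken in favor of the lowest city index. The result is a
    valid tour but is not guaranteed to be optimal.
    """
    unvisited = [i for i in range(len(points)) if i != start]
    tour = [start]
    while unvisited:
        current = tour[-1]
        nearest = min(unvisited, key=lambda i: math.dist(points[current], points[i]))
        unvisited.remove(nearest)
        tour.append(nearest)
    return tour


# Hypothetical instance: the four corners of a unit square.
cities = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
tour = nearest_neighbor_tour(cities, start=0)
length = tour_length(cities, tour)
```

On this instance the heuristic happens to walk the perimeter of the square. On larger instances its tours can be noticeably longer than the optimum, which is why exact solvers and improvement heuristics are used as well.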

Research Papers in Economics

Implications of probabilistic data modeling for rule mining

Authors: Michael Hahsler, Kurt Hornik, Thomas Reutterer
Publication venue: Institut für Statistik und Mathematik, WU Vienna University of Economics and Business
Publication date: 01/01/2005

Mining association rules is an important technique for discovering meaningful patterns in transaction databases. In the current literature, the properties of algorithms to mine associations are discussed in great detail. In this paper we investigate properties of transaction data sets from a probabilistic point of view. We present a simple probabilistic framework for transaction data and its implementation using the R statistical computing environment. The framework can be used to simulate transaction data when no associations are present. We use such data to explore the ability of confidence and lift, two popular interest measures used for rule mining, to filter noise. Based on the framework we develop the measure hyperlift and we compare this new measure to lift using simulated data and a real-world grocery database.
Series: Research Report Series / Department of Statistics and Mathematics
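
One simple way to generate transaction data with no associations, in the spirit of the framework described in this abstract, is to let every item enter every transaction independently with its own probability. The sketch below does this under stated assumptions: the number of transactions is fixed rather than random, and the function name `simulate_transactions` and the item probabilities are hypothetical.

```python
import random


def simulate_transactions(item_probs, n_transactions, seed=0):
    """Draw transactions in which items occur independently of each other.

    item_probs maps each item to the probability that it appears in any
    given transaction. A fixed seed makes the draw reproducible.
    """
    rng = random.Random(seed)
    return [
        {item for item, p in item_probs.items() if rng.random() < p}
        for _ in range(n_transactions)
    ]


# Hypothetical item probabilities, not taken from the paper.
item_probs = {"milk": 0.5, "bread": 0.4, "caviar": 0.01}
simulated = simulate_transactions(item_probs, n_transactions=1000, seed=42)

# Observed relative frequency of each item in the simulated data.
frequencies = {
    item: sum(item in t for t in simulated) / len(simulated)
    for item in item_probs
}
```

Any rule that an interest measure flags in data generated this way is spurious by construction, so such data gives a baseline for judging how well a measure filters noise.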

Elektronische Publikationen der Wirtschaftsuniversität Wien

Bookmaker Consensus and Agreement for the UEFA Champions League 2008/09

Authors: Kurt Hornik, Christoph Leitner, Achim Zeileis
Publication venue: Department of Statistics and Mathematics, WU Vienna University of Economics and Business
Publication date: 01/01/2009

Bookmakers' odds are an easily available source of "prospective" information that is thus often employed for forecasting the outcome of sports events. To investigate the statistical properties of bookmakers' odds from a variety of bookmakers for a number of different potential outcomes of a sports event, a class of mixed-effects models is explored, providing information about both consensus and (dis)agreement across bookmakers. In an empirical study for the UEFA Champions League, the most prestigious football club competition in Europe, model selection yields a simple and intuitive model with team-specific means for capturing consensus and team-specific standard deviations reflecting agreement across bookmakers. The resulting consensus forecast performs well in practice, exhibiting high correlation with the actual tournament outcome. Furthermore, the teams' agreement can be shown to be strongly correlated with the predicted consensus and can thus be incorporated in a more parsimonious model for agreement while preserving the same consensus fit.
Series: Research Report Series / Department of Statistics and Mathematics

Elektronische Publikationen der Wirtschaftsuniversität Wien

Prospects and Challenges in R Package Development

Authors: Kurt Hornik, Uwe Ligges, Stefan Theußl
Publication venue: Institute for Statistics and Mathematics, WU Vienna University of Economics and Business
Publication date: 01/01/2010

R, a software package for statistical computing and graphics, has evolved into the lingua franca of (computational) statistics. One of the cornerstones of R's success is the decentralized and modularized way of creating software using a multi-tiered development model: The R Development Core Team provides the "base system", which delivers basic statistical functionality, and many other developers contribute code in the form of extensions in a standardized format via so-called packages. In order to be accessible to a broader audience, packages are made available via standardized source code repositories. To support such a loosely coupled development model, repositories should be able to verify that the provided packages meet certain formal quality criteria and "work", both relative to the development of the base R system and with other packages (interoperability). However, established quality assurance systems and collaborative infrastructures typically face several challenges, some of which we will discuss in this paper.
Series: Research Report Series / Department of Statistics and Mathematics

Elektronische Publikationen der Wirtschaftsuniversität Wien