Fleet management in free-floating bike sharing systems using predictive modelling and explorative tools
For redistributing and operating bikes in a free-floating system, two measures are of highest priority. First, the expected number of rentals per day is important information for service providers when managing and servicing their fleet. The expected number of bookings is estimated with a simple model and with a more complex model based on meteorological information, since the number of rentals depends strongly on the current and forecasted weather. Second, knowing where service level violations will occur in the near future, at a fine spatial resolution, is important for redistributing bikes.
With this information, the service provider can set reward zones where service level violations are expected in the near future. To forecast a service level violation at a fine geographical resolution, the current distribution of bikes as well as the time and space information of past rentals have to be taken into account. A Markov chain model is formulated to integrate this information.
We develop a management tool that describes, in an explorative way, important information about past, present and predicted future rental counts in time and space, and that integrates all estimation procedures. The management tool runs in the browser and continuously updates the information and predictions, since the bike distribution over the observed area is in continuous flux and new data are generated continuously.
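To make the forecasting idea concrete, the following base-R sketch propagates the current bike distribution through a transition matrix and flags zones expected to fall below a service level, i.e., candidates for reward zones. It is only an illustration of the general approach; the zones, transition probabilities, horizon and threshold are made-up assumptions, not the authors' actual model.

```r
## Illustrative sketch (not the authors' model): forecast expected bikes per zone
## with a Markov chain and flag likely service level violations.
zones <- c("A", "B", "C")

## P[i, j]: assumed probability that a bike currently in zone i is in zone j
## after one time step (in practice estimated from past rentals in time and space).
P <- matrix(c(0.70, 0.20, 0.10,
              0.15, 0.75, 0.10,
              0.05, 0.25, 0.70),
            nrow = 3, byrow = TRUE, dimnames = list(zones, zones))

current <- c(A = 40, B = 15, C = 5)   # current bike counts per zone (assumed)
service_level <- 10                   # assumed minimum number of bikes per zone
h <- 3                                # forecast horizon in time steps

## Expected distribution after h steps: current %*% P^h
P_h <- Reduce(`%*%`, replicate(h, P, simplify = FALSE))
expected <- drop(current %*% P_h)

## Zones expected to violate the service level -> candidates for reward zones
names(expected)[expected < service_level]
```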
An Object-Oriented Framework for Statistical Simulation: The R Package simFrame
Simulation studies are widely used by statisticians to gain insight into the quality of developed methods. Usually some guidelines regarding, e.g., simulation designs, contamination, missing data models or evaluation criteria are necessary in order to draw meaningful conclusions. The R package simFrame is an object-oriented framework for statistical simulation, which allows researchers to make use of a wide range of simulation designs with a minimal effort of programming. Its object-oriented implementation provides clear interfaces for extensions by the user. Since statistical simulation is an embarrassingly parallel process, the framework supports parallel computing to increase computational performance. Furthermore, an appropriate plot method is selected automatically depending on the structure of the simulation results. In this paper, the implementation of simFrame is discussed in great detail and the functionality of the framework is demonstrated in examples for different simulation designs.
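To illustrate what such a framework automates, the following base-R sketch combines a sampling design, a simple contamination model and repeated evaluation of two competing estimators. It deliberately does not use the simFrame API; all distributions and parameters are made-up assumptions.

```r
## Base-R sketch of a design-based simulation of the kind simFrame automates
## (this is NOT the simFrame API; it only shows the ingredients).
set.seed(1)
pop <- rnorm(10000, mean = 10, sd = 2)              # assumed finite population

one_run <- function(pop, n = 100, eps = 0.05) {
  x <- sample(pop, n)                               # sampling design
  bad <- rbinom(n, 1, eps) == 1                     # contamination indicators
  x[bad] <- rnorm(sum(bad), mean = 100, sd = 5)     # contaminated observations
  c(mean = mean(x), median = median(x))             # estimators under comparison
}

res <- t(replicate(500, one_run(pop)))              # 500 simulation replications
colMeans(res)                                       # average estimates (true mean is 10)
boxplot(res, main = "Sampling + contamination: mean vs. median")
```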
Feedback-based integration of the whole process of data anonymization in a graphical interface
The interactive, web-based point-and-click application presented in this article allows anonymizing data without any knowledge of a programming language. Anonymization is important in data mining, but creating safe, anonymized data is by no means a trivial task. Both methodological issues and know-how from subject matter specialists should be taken into account when anonymizing data. Even though specialized software such as sdcMicro exists, it is often difficult for non-experts in a particular software environment and without programming skills to actually anonymize datasets without an appropriate app. The presented app is not restricted to applying disclosure limitation techniques but rather facilitates the entire anonymization process. The interface allows users to upload data to the system, modify them, and create an object defining the disclosure scenario. Once such a statistical disclosure control (SDC) problem has been defined, users can apply anonymization techniques to this object and get instant feedback on the impact on risk and data utility after SDC methods have been applied. Additional features, such as an undo button, the possibility to export the anonymized dataset or the required code for reproducibility reasons, as well as its interactive features, make it convenient for both experts and non-experts in R – the free software environment for statistical computing and graphics – to protect a dataset using this app.
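Assuming the application described here is the sdcApp() interface shipped with the sdcMicro package (an assumption based on the abstract, not stated explicitly above), it can be started from R as follows; the workflow of uploading data, defining the SDC problem, applying methods, inspecting risk and utility feedback, undoing steps and exporting data and code then takes place in the browser.

```r
## Launch the point-and-click anonymization app in the default browser
## (assumes the app described above is sdcApp() from the sdcMicro package).
# install.packages("sdcMicro")   # if not yet installed
library(sdcMicro)
sdcApp()
```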
Imputation with the R Package VIM
The package VIM is developed to explore and analyze the structure of missing values in data using visualization methods, to impute these missing values with the built-in imputation methods, and to verify the imputation process using visualization tools, as well as to produce high-quality graphics for publications. This article focuses on the different imputation techniques available in the package. Four imputation methods are currently implemented in VIM: hot-deck imputation, k-nearest neighbor imputation, regression imputation and iterative robust model-based imputation. All of these methods are implemented in a flexible manner with many options for customization. Furthermore, practical examples are provided to highlight the use of the implemented methods in real-world applications. In addition, the graphical user interface of VIM has been re-implemented from scratch, resulting in the package VIMGUI, which enables users without extensive R skills to access these imputation and visualization methods.
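The following short sketch shows such an imputation workflow on the sleep data set shipped with VIM; the choice of methods and parameters here is purely illustrative.

```r
## Explore the missing-value structure, then impute with two of the methods
## described above; parameter choices are illustrative only.
library(VIM)
data("sleep", package = "VIM")

aggr(sleep)                      # visualize the structure of missing values

imp_knn <- kNN(sleep, k = 5)     # k-nearest neighbor imputation
## kNN() appends logical indicator columns (e.g. Dream_imp) marking imputed cells
summary(imp_knn)

imp_hd <- hotdeck(sleep)         # hot-deck imputation as an alternative
```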
Die Theorie lebt in der Praxis : ein Interview mit Ernst Stadlober
The interview with Ernst Stadlober was conducted by Herwig Friedl and Matthias Templ on 18 December 2015. It paints a picture of Ernst Stadlober's professional career, from his beginnings, where he found his way, with a fixed seed and along a deterministic path, via random number generation to his very broad orientation within statistics. Many successfully applied research projects with partners from administration, industry and business testify to his success story, as does the exceptionally intensive supervision of students at TU Graz. One can rightly claim that Ernst Stadlober commands a broad spectrum of statistical methods and nevertheless managed to delve deeply into many specialized areas.
Ernst Stadlober's professional home was and is the Institute of Statistics at TU Graz, which he has also headed since 1998. In between, he held research stays at Stanford University (USA) and TH Darmstadt as well as a substitute professorship at the University of Kiel. To date he has supervised 12 doctoral theses and more than 90 diploma and master's theses. His teaching repertoire includes (applied) statistics, time series analysis, stochastic modelling and simulation, design of experiments, and more. In addition, he can now look back on more than 100 talks and about 80 publications in the fields of biostatistics, computational statistics and applied statistics.
A systematic overview on methods to protect sensitive data provided for various analyses
In view of the various methodological developments regarding the protection of sensitive data, especially with respect to privacy-preserving computation and federated learning, a conceptual categorization of and comparison between methods stemming from different fields is often desired. More concretely, it is important to provide guidance for practitioners, who often lack an overview of suitable approaches for certain scenarios, whether it is differential privacy for interactive queries, k-anonymity methods and synthetic data generation for data publishing, or secure federated analysis for multiparty computation without sharing the data itself. Here, we provide an overview based on central criteria describing a context for privacy-preserving data handling, which allows informed decisions in view of the many alternatives. Besides guiding the practice, this categorization of concepts and methods is intended as a step towards a comprehensive ontology for anonymization. We emphasize throughout the paper that there is no panacea and that context matters.
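As a small, concrete illustration of one of the scenarios mentioned above (chosen purely for illustration and not taken from the paper), the following sketch answers an interactive count query under differential privacy using the Laplace mechanism.

```r
## Illustrative Laplace mechanism for an interactive count query
## (epsilon, data and the query are made-up assumptions).
laplace_noise <- function(n, scale) {
  ## Laplace(0, scale) as the difference of two independent exponentials
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

dp_count <- function(x, condition, epsilon = 0.5) {
  true_count <- sum(condition(x))
  sensitivity <- 1   # adding/removing one record changes a count by at most 1
  true_count + laplace_noise(1, scale = sensitivity / epsilon)
}

set.seed(42)
income <- rlnorm(1000, meanlog = 10, sdlog = 0.5)
dp_count(income, function(v) v > 30000)   # noisy answer to "how many earn more than 30000?"
```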
An Open Source Approach for Modern Teaching Methods: The Interactive TGUI System
In order to facilitate teaching complex topics in an interactive way, the authors developed a computer-assisted teaching system, a graphical user interface named TGUI (Teaching Graphical User Interface). TGUI was introduced at the beginning of 2009 in the Austrian Journal of Statistics (Dinges and Templ 2009) as an effective instrument to train and teach staff on mathematical and statistical topics. While the fundamental principles were retained, the current TGUI system has undergone a complete redesign. The ultimate goal behind the reimplementation was to share the advantages of TGUI and provide teachers and people who need to hold training courses with a powerful tool that can enrich their lectures with interactive features. The idea was to go a step beyond current modular blended-learning systems (see, e.g., Da Rin 2003) or the related teaching technique of classroom voting (see, e.g., Cline 2006). In this paper the authors exemplify the basic idea and concept of TGUI by means of statistics seminars held at Statistics Austria. The powerful open source software R (R Development Core Team 2010a) is the backend for TGUI, which can therefore be used to process even complex statistical content. However, with specifically created content the interactive TGUI system can be used to support a wide range of courses and topics. The open source R packages TGUICore and TGUITeaching are freely available from the Comprehensive R Archive Network at http://CRAN.R-project.org/.
Statistical analysis of chemical element compositions in Food Science : problems and possibilities
In recent years, many analyses have been carried out to investigate the chemical components of food data. However, studies rarely consider the compositional pitfalls of such analyses. This is problematic, as it may lead to arbitrary results when non-compositional statistical analysis is applied to compositional datasets. In this study, compositional data analysis (CoDa), which is widely used in other research fields, is compared with classical statistical analysis to demonstrate how the results vary depending on the approach and to show the best possible statistical analysis. For example, honey and saffron are highly susceptible to adulteration and imitation, so the determination of their chemical elements requires the best possible statistical analysis. Our study demonstrated how principal component analysis (PCA) and classification results are influenced by the pre-processing steps conducted on the raw data, and by the replacement strategies for missing values and non-detects. Furthermore, it demonstrated the differences in results when compositional and non-compositional methods were applied. Our results suggested that the log-ratio analysis provided better separation between the pure and adulterated data, allowed for easier interpretability of the results, and yielded a higher classification accuracy. Similarly, it showed that classification with artificial neural networks (ANNs) works poorly if the CoDa pre-processing steps are left out. From these results, we advise the application of CoDa methods for analyses of the chemical elements of food and for the characterization and authentication of food products.
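The following minimal sketch, using simulated element concentrations rather than the study's food data, shows the compositional pre-processing step advocated here: a centred log-ratio (clr) transformation before PCA, contrasted with a naive PCA on the raw compositions.

```r
## clr transformation before PCA vs. naive PCA on raw compositions
## (simulated data; element names are illustrative only).
set.seed(1)
raw <- matrix(rlnorm(100 * 4), ncol = 4,
              dimnames = list(NULL, c("Ca", "K", "Mg", "Na")))
comp <- raw / rowSums(raw)               # close the data: only relative information matters

clr <- log(comp) - rowMeans(log(comp))   # centred log-ratio coordinates

pca_clr <- prcomp(clr)                   # CoDa-aware PCA on clr coordinates
pca_raw <- prcomp(comp, scale. = TRUE)   # naive PCA on raw compositions, for comparison

summary(pca_clr)
```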
Modeling and prediction of the impact factor of journals using open-access databases
This article is motivated by the author's work as editor-in-chief of the Austrian Journal of Statistics and contains detailed analyses of the impact of the Austrian Journal of Statistics.
The impact of a journal is typically expressed by journal metrics indicators. One of the most important ones, the journal impact factor, is calculated from the Web of Science (WoS) database by Clarivate Analytics.
It is known that newly established journals, or journals not belonging to big publishers, often face difficulties in being included, e.g., in the Science Citation Index (SCI), and thus do not receive a WoS journal impact factor, as is the case, for example, for the Austrian Journal of Statistics.
In this study, a novel approach is pursued to model and predict the WoS impact factor of journals using open-access or partly open-access databases such as Google Scholar, ResearchGate, and Scopus. I hypothesize a functional linear dependency between citation counts in these databases and the journal impact factor. These functional relationships enable the development of a model that may allow estimating the impact factor for new, small, and independent journals not listed in SCI. However, good results could only be achieved with robust linear regression and well-chosen models.
In addition, this study demonstrates that the WoS impact factor of SCI-listed journals can be successfully estimated without using the Web of Science database, and therefore the dependence of researchers and institutions on this popular database can be reduced. These results suggest that the statistical model developed here can be applied to predict the WoS impact factor using alternative open-access databases.
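The following sketch illustrates the modelling idea with made-up data: a robust linear regression of the WoS impact factor on citation counts from open databases. MASS::rlm() stands in for "robust linear regression"; the variable names, data and concrete model are assumptions, not those of the article.

```r
## Robust linear regression sketch: predict the WoS impact factor from
## open-database citation counts (all data simulated for illustration).
library(MASS)

set.seed(1)
journals <- data.frame(
  scholar_cites = rpois(60, 400),   # e.g. Google Scholar citation counts
  scopus_cites  = rpois(60, 250)    # e.g. Scopus citation counts
)
journals$wos_if <- 0.002 * journals$scholar_cites +
                   0.003 * journals$scopus_cites + rnorm(60, sd = 0.2)

fit <- rlm(wos_if ~ scholar_cites + scopus_cites, data = journals)
summary(fit)

## Predict the impact factor of a journal that is not listed in SCI
predict(fit, newdata = data.frame(scholar_cites = 350, scopus_cites = 200))
```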
Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro
The demand for data from surveys, censuses or registers containing sensitive information on people or enterprises has increased significantly over recent years. However, before data can be provided to the public or to researchers, confidentiality has to be respected for any data set possibly containing sensitive information about individual units. Confidentiality can be achieved by applying statistical disclosure control (SDC) methods to the data in order to decrease the disclosure risk. The R package sdcMicro serves as an easy-to-handle, object-oriented S4 class implementation of SDC methods to evaluate and anonymize confidential micro-data sets. It includes all popular disclosure risk and perturbation methods. The package performs automated recalculation of frequency counts, individual and global risk measures, information loss and data utility statistics after each anonymization step. All methods are highly optimized in terms of computational cost to be able to work with large data sets. Reporting facilities that summarize the anonymization process can also easily be used by practitioners. We describe the package and demonstrate its functionality with a complex household survey test data set that has been distributed by the International Household Survey Network.
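A short sketch of the programmatic sdcMicro workflow on the package's own test data is given below; the chosen key variables and methods are illustrative and not those of the article's case study.

```r
## Define the disclosure scenario, anonymize, and inspect recalculated risk
## (key variables follow the package's demo data; choices are illustrative).
library(sdcMicro)
data("testdata", package = "sdcMicro")

sdc <- createSdcObj(testdata,
                    keyVars   = c("urbrur", "roof", "walls", "water",
                                  "electcon", "relat", "sex"),  # categorical quasi-identifiers
                    numVars   = c("expend", "income", "savings"),
                    weightVar = "sampling_weight")

sdc <- localSuppression(sdc, k = 3)   # suppress values to achieve 3-anonymity
sdc <- microaggregation(sdc)          # perturb the continuous key variables

print(sdc, type = "risk")             # recalculated risk measures
print(sdc, type = "ls")               # local suppression summary
```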