Digital Data, Administrative Data, and Survey Compared: Updating the Classical Toolbox for Assessing Data Quality of Big Data, Exemplified by the Generation of Corruption Data

Abstract

In the digital age, new data types have become available that can potentially be used in social science research. Besides data originally created for scientific purposes (research-elicited data), administrative mass data (traditional-type big data) and data from digital devices (new-type big data) have become increasingly relevant for research. Both of the latter data types can be subsumed under the term “big data.” In this paper, we scrutinize the quality of administrative mass data on corruption in contrast to research-elicited data (e.g., survey data). Since data quality is crucial for measuring a social phenomenon such as corruption, we ask how a social phenomenon can be measured by means of data from these different sources. As a first step, we refer to the so-called Bick-Mueller model, which was developed in the 1980s to capture the special features and particularities of administrative mass data (traditional-type big data). We contrast this model with the so-called error approach that is typically applied in survey research. To account for new trends in data generation and application, we show the progress that has been made since Bick and Mueller introduced their model and discuss new features of digitalism and new technologies. We conclude that the features of the Bick-Mueller model are useful for tackling the particularities of administrative data and also, to some degree, of new-type big data. The “error” perspective that is inherent both in classical survey research and in the Bick-Mueller model also applies to new-type big data when it comes to assessing their quality. Moreover, data from these different sources can potentially complement each other. For this, researchers must be aware that none of these data sources actually measures corruption directly.
For answering specific research questions, it is crucial to consider the advantages and disadvantages of using specific data types.
