23 research outputs found

    Synthetic data for open and reproducible methodological research in social sciences and official statistics

    Open and reproducible research is receiving growing attention in the research community. Whereas empirical research benefits from research data centres and scientific use files, which allow data to be used in a safe environment or via remote access, methodological research suffers from a lack of adequate data sources. In the economic and social sciences, an additional difficulty is that complex survey designs are part of the data-generating process and must be taken into account when developing and applying estimators. This paper presents a synthetic but realistic dataset, based on social science data, that supports the evaluation and development of estimators in the social sciences. The focus is on comparable and reproducible research in a realistic framework that provides individual and household data. The dataset is made freely available to the research community as an open research data resource.

    Mixed-Integer Quadratic Optimization and Iterative Clustering Techniques for Semi-Supervised Support Vector Machines

    Among the most famous algorithms for solving classification problems are support vector machines (SVMs), which find a separating hyperplane for a set of labeled data points. In some applications, however, labels are only available for a subset of points. Furthermore, this subset can be non-representative, e.g., due to self-selection in a survey. Semi-supervised SVMs tackle the setting of labeled and unlabeled data and can often improve the reliability of the results. Moreover, additional information about the size of the classes can be available from undisclosed sources. We propose a mixed-integer quadratic optimization (MIQP) model that covers the setting of labeled and unlabeled data points as well as the overall number of points in each class. Since the MIQP's solution time grows rapidly as the number of variables increases, we introduce an iterative clustering approach to reduce the model's size. Moreover, we present an update rule for the required big-M values, prove the correctness of the iterative clustering method, and derive tailored dimension-reduction and warm-starting techniques. Our numerical results show that our approach achieves accuracy and precision similar to those of the MIQP formulation but at much lower computational cost. Thus, we can solve larger problems. Compared with the original SVM formulation, our approach achieves even better accuracy and precision for biased samples. Comment: 33 pages, 18 figures

    SAE TEACHING USING SIMULATIONS


    Mixed-integer programming techniques for the minimum sum-of-squares clustering problem

    The minimum sum-of-squares clustering problem is an important problem in data mining and machine learning with many applications in, e.g., medicine and the social sciences. However, it is known to be NP-hard in all relevant cases and is notoriously hard to solve to global optimality in practice. In this paper, we develop and test different tailored mixed-integer programming techniques to improve the performance of state-of-the-art MINLP solvers when applied to the problem, including cutting planes, propagation techniques, branching rules, and primal heuristics. Our extensive numerical study shows that our techniques significantly improve the performance of the open-source MINLP solver SCIP. Consequently, using our novel techniques, we can solve many instances that SCIP cannot solve without them, and we obtain much smaller gaps for those instances that still cannot be solved to global optimality.
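As an illustration of the objective this abstract refers to, the following minimal sketch computes the within-cluster sum of squared distances and finds the global optimum by brute-force enumeration. It is not the paper's method: the function names and the toy data are hypothetical, and enumeration over all assignments is feasible only for a handful of points, which is why tailored mixed-integer techniques are of interest.

```python
from itertools import product


def sum_of_squares(points, labels, k):
    """Within-cluster sum of squared Euclidean distances to the centroids."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue  # an empty cluster contributes nothing
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total


def brute_force_mssc(points, k):
    """Enumerate all k**n assignments (hypothetical helper, tiny n only)."""
    best_labels, best_value = None, float("inf")
    for labels in product(range(k), repeat=len(points)):
        value = sum_of_squares(points, labels, k)
        if value < best_value:
            best_labels, best_value = labels, value
    return best_labels, best_value


# Toy data (an assumption for illustration): two well-separated pairs.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels, value = brute_force_mssc(points, k=2)
print(labels, value)
```

The number of assignments grows as k to the power n, so even 30 points with 3 clusters are far out of reach for this enumeration.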

    Das Stichprobendesign des registergestützten Zensus 2011

    "Im Rahmen der europaweiten Zensus-Erhebungsrunde im Jahr 2011 wird zum ersten Mal seit 1987 auch im vereinigten Deutschland wieder eine Volkszählung stattfinden, diesmal allerdings nicht in Form einer Vollerhebung, sondern in Form einer kosten- und ressourcenschonenden registergestützten Erhebung. Diese wird flankiert durch eine Haushaltsstichprobe, aus der erstens in den Registern nicht erfasste Informationen gewonnen werden sollen und zweitens eine Abschätzung der Zahl der Karteileichen (KAL) und Fehlbestände (FEB) in den Melderegistern erfolgen soll. Aus den Register- und Stichprobendaten sollen möglichst verlässliche und genaue Schätzungen der Totalwerte vorgenommen werden. Ziel des von DESTATIS eingesetzten Stichprobenforschungsprojektes ist es, Antworten auf die Frage zu geben, welches Stichprobendesign unter den gegebenen Restriktionen empfohlen werden kann. Darüber hinaus sollen Schätzstrategien entwickelt werden, die zur Verwendung im Zensus 2011 vorgeschlagen werden können. Der vorliegende Aufsatz stellt einige wichtige Erkenntnisse aus dem Forschungsprojekt dar, wobei ein Schwerpunkt auf der Darstellung eines optimalen Stichprobendesigns liegt." (Autorenreferat)"Within the context of the Europe-wide census elicitation in 2011 there will be the first population census in reunified Germany. In contrast to the last German census in 1987, where all households were interviewed, the new census will be conducted by means of a cost- and resource-effective register-assisted census. In addition to the register information, a household sample will be drawn. On the one hand this sample will provide information that is not included in the register, on the other hand it will allow for the estimation of over- and undercounts in the register. Reliable estimates for total values of interest are to be derived from the register and sample data. 
The aim of the research project, which was initiated by DESTATIS, is to elaborate an efficient sample design as well as to develop estimation strategies which allow accurate estimates for the census 2011. This article presents some important findings from the research project. However, one focus is on the description of an optimal sample design." (author's abstract

    National and subnational short-term forecasting of COVID-19 in Germany and Poland during early 2021

    We compare forecasts of weekly case and death numbers for COVID-19 in Germany and Poland based on 15 different modelling approaches. These cover the period from January to April 2021 and address numbers of cases and deaths one and two weeks into the future, along with the respective uncertainties. We find that combining different forecasts into one forecast can enable better predictions. However, case numbers over longer periods were challenging to predict. Additional data sources, such as information about different versions of the SARS-CoV-2 virus present in the population, might improve forecasts in the future.
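The idea of combining several forecasts into one can be sketched as follows. This is only an illustration under assumptions: the abstract does not state which combination rule was used, so the median and mean combiners, the function name, and all numbers below are hypothetical.

```python
from statistics import mean, median


def combine_forecasts(forecasts, method="median"):
    """Combine point forecasts from several models, one target at a time.

    `forecasts` maps a model name to a list of point forecasts, all of the
    same length (e.g. one- and two-week-ahead values).
    """
    combiner = median if method == "median" else mean
    series = list(forecasts.values())
    n_targets = len(series[0])
    return [combiner([s[i] for s in series]) for i in range(n_targets)]


# Hypothetical weekly case forecasts (one and two weeks ahead).
forecasts = {
    "model_a": [100, 120],
    "model_b": [110, 150],
    "model_c": [300, 130],  # one model is far off for the first target
}
median_ensemble = combine_forecasts(forecasts, method="median")
mean_ensemble = combine_forecasts(forecasts, method="mean")
print(median_ensemble, mean_ensemble)
```

In this toy example the median is less affected by the single outlying model than the mean, which is one common reason for preferring it in ensembles.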

    The Forward Physics Facility at the High-Luminosity LHC


    SAE Teaching Using Simulations

    The increasing interest in applying small area estimation methods increases the need for training in small area estimation. To better understand the behaviour of small area estimators in practice, simulations are a feasible way of evaluating and teaching the properties of the estimators of interest. By designing such simulation studies, students gain a deeper understanding of small area estimation methods. Thus, we encourage the use of appropriate simulations as an additional interactive tool in teaching small area estimation methods.
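A classroom simulation of the kind described here might look like the following minimal sketch. Everything in it is an assumption made for illustration (the population model, the sample sizes, and the choice of a direct versus a simple synthetic estimator); it is not taken from the paper.

```python
import random


def simulate_sae(n_areas=5, pop_per_area=200, n_per_area=4, n_rep=200, seed=1):
    """Monte Carlo comparison of two estimators of area means."""
    rng = random.Random(seed)
    # Hypothetical population: common level + area effect + unit-level noise.
    population = []
    for _ in range(n_areas):
        effect = rng.gauss(0.0, 1.0)
        population.append(
            [10.0 + effect + rng.gauss(0.0, 3.0) for _ in range(pop_per_area)]
        )
    true_means = [sum(area) / len(area) for area in population]

    mse = {"direct": [0.0] * n_areas, "synthetic": [0.0] * n_areas}
    for _ in range(n_rep):
        # Simple random sample without replacement within each area.
        samples = [rng.sample(area, n_per_area) for area in population]
        overall = sum(sum(s) for s in samples) / (n_areas * n_per_area)
        for a, s in enumerate(samples):
            direct = sum(s) / n_per_area  # uses only the area's own sample
            # The synthetic estimator borrows strength: overall sample mean.
            mse["direct"][a] += (direct - true_means[a]) ** 2 / n_rep
            mse["synthetic"][a] += (overall - true_means[a]) ** 2 / n_rep
    return true_means, mse


true_means, mse = simulate_sae()
for a in range(len(true_means)):
    print(a, round(mse["direct"][a], 3), round(mse["synthetic"][a], 3))
```

Students can vary `n_per_area` or the spread of the area effects to see when the high variance of the direct estimator outweighs the bias of the synthetic one.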

    SMALL AREA ESTIMATION IN THE GERMAN CENSUS 2011

    In 2011, Germany conducted its first census since reunification. In contrast to a classical census, a register-assisted census was implemented using population register data and an additional sample. This paper provides an overview of how the sampling design recommendations were set up in order to fulfil legal requirements and to guarantee an optimal but still flexible source of information. The aim was to develop a design that fosters accurate estimation of the main objective of the census, the total population counts. Further, the design should also adequately support the application of small area estimation methods. Some empirical results are given to provide an assessment of selected methods. The research was conducted within the German Census Sampling and Estimation research project, financially supported by the German Federal Statistical Office.