
    Kernel Density Estimation for Heaped Data

    In self-reported data a phenomenon called "heaping" usually occurs, i.e. survey participants round the values of their income, weight or height to some degree. Additionally, respondents may be more prone to round down or up due to social desirability. Ignoring the heaping process introduces severe bias, in the form of spikes and bumps, when kernel density methods are applied naively to the rounded data. This work presents a generalized Stochastic Expectation Maximization (SEM) approach that accounts for heaping with potentially asymmetric rounding behaviour in univariate kernel density estimation. The introduced methods are applied to survey data of the German Socio-Economic Panel and exhibit very good performance in simulations.
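
The idea of undoing heaping with a stochastic EM loop can be sketched as follows. This is a simplified illustration, not the generalized method of the abstract: it assumes every respondent rounds symmetrically to a known step, uses a fixed bandwidth `h`, and draws from the current kernel density estimate on a discretized grid. The function names `kde` and `sem_heaped` and the example data are hypothetical.

```python
import math
import random


def kde(x, sample, h):
    """Gaussian kernel density estimate at point x with bandwidth h."""
    norm = len(sample) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample) / norm


def sem_heaped(rounded, step, h, n_iter=10, grid_pts=21, seed=0):
    """Redraw pseudo-samples inside each rounding interval, n_iter times."""
    rng = random.Random(seed)
    pseudo = list(rounded)  # initial guess: the heaped values themselves
    cell = step / grid_pts  # width of one sub-cell of a rounding interval
    for _ in range(n_iter):
        cache = {}  # density on the interval grid, computed once per heap
        draws = []
        for r in rounded:
            if r not in cache:
                grid = [r - step / 2 + cell * (k + 0.5) for k in range(grid_pts)]
                cache[r] = (grid, [kde(g, pseudo, h) for g in grid])
            grid, weights = cache[r]
            # Draw a sub-cell in proportion to the current density estimate,
            # then jitter uniformly inside that sub-cell.
            centre = rng.choices(grid, weights=weights)[0]
            draws.append(centre + rng.uniform(-0.5, 0.5) * cell)
        pseudo = draws  # the next density estimate is built on these draws
    return pseudo


# Hypothetical example: heights in cm, every respondent rounds to the nearest 10.
rng = random.Random(1)
true_heights = [rng.gauss(170.0, 10.0) for _ in range(300)]
rounded = [10.0 * round(v / 10.0) for v in true_heights]
pseudo = sem_heaped(rounded, step=10.0, h=4.0)
```

A kernel density estimate built on `pseudo` no longer shows spikes at multiples of 10. A fuller version would also estimate the rounding probabilities, allow asymmetric intervals, and average the density over iterations after a burn-in phase.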

    a case study for student residents in Berlin

    The transformation of area aggregates between non-hierarchical area systems is a standard problem of official statistics. We introduce a new method based on kernel density estimates. It is a modification of the SEM algorithm proposed by Gross et al. (2016), which was used to transform totals on rectangular areas into kernel density estimates. As a by-product of the routine one obtains simulated geo-coordinates for each unit. With these geo-coordinates it is possible to calculate case numbers for a new area system. The method is applied to student resident figures from Berlin, which are known only at the level of ZIP codes but are needed for administrative planning districts. Our method is evaluated on a similar, simulated data set with known exact geo-coordinates. The empirical part presents results for changes in the student residential areas between 2005 and 2015. It is demonstrated that the transformation via kernel density estimates offers additional useful features for displaying concentration areas.
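
Once simulated geo-coordinates are available, recounting them in a new area system comes down to a point-in-polygon tally. The sketch below shows only that last step, assuming the target areas are simple, non-overlapping polygons given as vertex lists. The function names, the two triangular "districts" and the three points are hypothetical, not data from the study.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def recount(points, areas):
    """Count points per area; areas maps an area name to its polygon."""
    counts = {name: 0 for name in areas}
    for x, y in points:
        for name, polygon in areas.items():
            if point_in_polygon(x, y, polygon):
                counts[name] += 1
                break  # areas are assumed not to overlap
    return counts


# Hypothetical target system: the unit square split along its diagonal.
areas = {
    "lower": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)],
    "upper": [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
}
simulated_points = [(0.8, 0.2), (0.2, 0.8), (0.6, 0.5)]
counts = recount(simulated_points, areas)
```

Because the simulated coordinates do not depend on the source area system, the same points can be tallied against any other partition without rerunning the simulation.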

    Die Glättung räumlicher Datensätze auf administrativen Flächen: Eine Fallstudie mit Berliner Wahldaten (Smoothing spatial data sets on administrative areas: a case study with Berlin election data)

    Composite spatial data at the level of administrative areas are often presented as maps. The aim is to detect regional differences in the concentration of subpopulations, such as elderly persons, ethnic minorities, low-educated persons, voters of a political party or persons with a certain disease. Thematic collections of such maps are presented in atlases. The standard presentation is the choropleth map, where each administrative unit is represented by a single value. These maps can be criticized on three counts: the implicit assumption of a uniform distribution within the area, the instability of the resulting map with respect to a change of the reference area, and the discontinuities at the borderlines of the reference areas, which inhibit the detection of regional clusters. To address these problems we use a density approach in the construction of maps. This approach does not enforce a locally uniform distribution, does not depend on a specific choice of area reference system, and produces no discontinuities in the displayed maps. Kernel density estimation is a standard procedure for estimating densities. However, these estimates need the geo-coordinates of the single units, which are not available because we only have access to the aggregates of some area system. To overcome this hurdle, we use a statistical simulation concept. It can be interpreted as a Simulated Expectation Maximisation (SEM) algorithm of Celeux et al. (1996). We simulate observations from the current density estimate which are consistent with the aggregation information (S-step). Then we apply the kernel density estimator to the simulated sample, which gives the next density estimate (E-step). This concept was first applied to grid data with rectangular areas for the display of ethnic minorities, see Groß et al. (2017). In a second application we demonstrated the use of this approach for the so-called "change of support" problem (Bradley et al. 2016). Here Groß et al. (2020) used the SEM algorithm to recalculate case numbers between non-hierarchical administrative area systems. Recently Rendtel et al. (2021) applied the SEM algorithm to display spatial-temporal clusters of coronavirus infections in Germany. Here we present three modifications of the basic SEM algorithm: 1) We introduce a boundary correction which removes the underestimation of kernel density estimates at the borders of the population area. 2) We take unsettled areas, like lakes, parks and industrial areas, into account in the computation of the kernel density. 3) We adapt the SEM algorithm for the computation of local percentages, which are especially important in voting analysis. We evaluate our approach against several standard maps by means of a local voting register with known addresses. In the empirical part we apply our approach to the display of voting results for the 2016 election of the Berlin parliament. We contrast our results with choropleth maps and show new possibilities for reporting spatial voting results.
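
The boundary underestimation mentioned in modification 1) can be illustrated in one dimension. The sketch below uses one common correction, dividing each Gaussian kernel by the share of its mass that falls inside the domain. Whether the paper uses exactly this rescaling is an assumption, and the function names and the evenly spaced sample are hypothetical.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kde_plain(x, sample, h):
    """Ordinary Gaussian KDE; leaks mass across the domain boundary."""
    return sum(norm_pdf((x - s) / h) for s in sample) / (len(sample) * h)


def kde_rescaled(x, sample, h, lo, hi):
    """KDE on [lo, hi]; each kernel is divided by its mass inside [lo, hi].

    Assumes every sample point lies inside [lo, hi].
    """
    if not lo <= x <= hi:
        return 0.0
    total = 0.0
    for s in sample:
        mass_inside = norm_cdf((hi - s) / h) - norm_cdf((lo - s) / h)
        total += norm_pdf((x - s) / h) / mass_inside
    return total / (len(sample) * h)


# Hypothetical sample spread evenly over the domain [0, 1].
sample = [(i + 0.5) / 50 for i in range(50)]
at_border_plain = kde_plain(0.0, sample, 0.1)
at_border_rescaled = kde_rescaled(0.0, sample, 0.1, 0.0, 1.0)
```

With this sample, `at_border_rescaled` is larger than `at_border_plain`, and the rescaled estimate integrates to one over the domain. Unsettled areas inside a city would need the same idea applied to a two-dimensional domain with holes, which this sketch does not cover.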

    Multivariate kernel density estimation applied to sensitive geo-referenced administrative data protected via measurement error

    Modern systems of official statistics require the timely estimation of area-specific densities of sub-populations. Ideally estimates should be based on precise geo-coded information, which is not available due to confidentiality constraints. One approach for ensuring confidentiality is by rounding the geo-coordinates. We propose multivariate non-parametric kernel density estimation that reverses the rounding process by using a Bayesian measurement error model. The methodology is applied to the Berlin register of residents for deriving density estimates of ethnic minorities and aged people. Estimates are used for identifying areas with a need for new advisory centres for migrants and infrastructure for older people.

    Simulated geo-coordinates as a tool for map-based regional analysis

    Map-based regional analysis aims to detect areas with a large concentration of certain populations. Here kernel density estimates (KDE) offer advantages over classical choropleth maps. However, kernel density estimation needs exact geo-coordinates. In a recent paper Groß et al. (2017) proposed a measurement error model which uses local aggregates for kernel density estimation. Their algorithm simulates "exact" geo-coordinates which reflect the information in the aggregates. In this article we suggest two extensions of this approach. First, we consider boundary constraints, which are usually ignored in the KDE framework. This concerns not only the outer limits of a municipality but also unsettled regions within a city like parks, lakes and industrial areas. Without a boundary correction, standard KDEs underestimate the density in the vicinity of boundaries. Here we propose a modification of the original algorithm which uses rescaled kernel functions. Second, regional maps often display local percentages, for example the voters for a particular party among all voters in each voting district. Here we derive a smooth representation of percentages which is based on the ratio of two densities. Again, the original algorithm is modified to cope with the estimation of a ratio of two densities. Our empirical examples refer to voting results from Berlin. It is shown that the proposed methodology reveals considerable regional insight which is not produced by standard choropleth maps.
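
The smooth representation of percentages as a ratio of two densities can be sketched as below. The example assumes exact (or simulated) coordinates are already at hand, uses a plain bivariate Gaussian product kernel with a common bandwidth `h`, and leaves out the boundary correction. The function names and the six voter locations are hypothetical.

```python
import math


def kde2d(x, y, points, h):
    """Bivariate Gaussian product-kernel density estimate at (x, y)."""
    norm = len(points) * 2.0 * math.pi * h * h
    return sum(
        math.exp(-0.5 * (((x - px) / h) ** 2 + ((y - py) / h) ** 2))
        for px, py in points
    ) / norm


def local_share(x, y, party_points, all_points, h):
    """Smoothed share of party voters among all voters near (x, y)."""
    # Multiplying each density by its sample size turns it into an intensity,
    # so the ratio is a local proportion rather than a ratio of shapes.
    numerator = len(party_points) * kde2d(x, y, party_points, h)
    denominator = len(all_points) * kde2d(x, y, all_points, h)
    return numerator / denominator if denominator > 0.0 else float("nan")


# Hypothetical voters: party supporters near (0, 0), other voters near (3, 3).
party_points = [(-0.2, 0.0), (0.0, 0.1), (0.2, -0.1)]
other_points = [(3.0, 3.0), (3.1, 2.9), (2.9, 3.2)]
all_points = party_points + other_points

share_near_party = local_share(0.0, 0.0, party_points, all_points, h=0.5)
share_near_others = local_share(3.0, 3.0, party_points, all_points, h=0.5)
```

Using the same bandwidth in numerator and denominator keeps the share between 0 and 1 whenever the party voters are a subset of all voters. Far from any data both densities approach zero and the ratio becomes numerically unstable, so a practical map would mask such regions.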

    Sex-Specific Associations of Brain-Derived Neurotrophic Factor and Cardiorespiratory Fitness in the General Population

    Brain-derived neurotrophic factor (BDNF) was initially considered to be neuron-specific. It is now known that this neurotrophin is also secreted peripherally by skeletal muscle cells and increases with exercise. Whether BDNF is related to cardiorespiratory fitness (CRF) is currently unclear. We analyzed the association of serum BDNF levels with CRF in the general population (Study of Health in Pomerania (SHIP-TREND) from Northeast Germany; n = 1607, 51% female; median age 48 years). Sex-stratified linear regression models adjusted for age, height, smoking, body fat, lean mass, physical activity, and depression analyzed the association between BDNF and maximal oxygen consumption (VO2peak), maximal oxygen consumption normalized for body weight (VO2peak/kg), and oxygen consumption at the anaerobic threshold (VO2@AT). In women, 1 mL/min higher VO2peak, VO2peak/kg, and VO2@AT were associated with a 2.43 pg/mL (95% confidence interval [CI]: 1.16 to 3.69 pg/mL; p = 0.0002), 150.66 pg/mL (95% CI: 63.42 to 237.90 pg/mL; p = 0.0007), and 2.68 pg/mL (95% CI: 0.5 to 4.8 pg/mL; p = 0.01) higher BDNF serum concentration, respectively. No significant associations were found in men. Further research is needed to understand the sex-specific association between CRF and BDNF. © 2019 by the authors. Licensee MDPI, Basel, Switzerland.

    Enolase represents a metabolic checkpoint controlling the differential exhaustion programmes of hepatitis virus-specific CD8+ T cells

    Objective: Exhausted T cells with limited effector function are enriched in chronic hepatitis B and C virus (HBV and HCV) infection. Metabolic regulation contributes to exhaustion, but it remains unclear how metabolism relates to different exhaustion states, how it is affected by antiviral therapy, and whether metabolic checkpoints regulate dysfunction. Design: Metabolic state, exhaustion and transcriptome of virus-specific CD8+ T cells from chronic HBV-infected (n=31) and HCV-infected patients (n=52) were determined ex vivo and during direct-acting antiviral (DAA) therapy. Metabolic flux and metabolic checkpoints were tested in vitro. Intrahepatic virus-specific CD8+ T cells were analysed by scRNA-Seq in a HBV-replicating murine in vivo model of acute and chronic infection. Results: HBV-specific (core 18-27, polymerase 455-463) and HCV-specific (NS3 1073-1081, NS3 1406-1415, NS5B 2594-2602) CD8+ T cell responses exhibit heterogeneous metabolic profiles connected to their exhaustion states. The metabolic state was connected to the exhaustion profile rather than the aetiology of infection. Mitochondrial impairment despite intact glucose uptake was prominent in severely exhausted T cells linked to elevated liver inflammation in chronic HCV infection and in HBV polymerase 455-463-specific CD8+ T cell responses. In contrast, relative metabolic fitness was observed in HBeAg-negative HBV infection in HBV core 18-27-specific responses. DAA therapy partially improved mitochondrial programmes in severely exhausted HCV-specific T cells and enriched metabolically fit precursors. We identified enolase as a metabolic checkpoint in exhausted T cells. Metabolic bypassing improved glycolysis and T cell effector function. Similarly, enolase deficiency was observed in intrahepatic HBV-specific CD8+ T cells in a murine model of chronic infection. Conclusion: Metabolism of HBV-specific and HCV-specific T cells is strongly connected to their exhaustion severity. Our results highlight enolase as a metabolic regulator of severely exhausted T cells. They connect differential bioenergetic fitness with distinct exhaustion subtypes and varying liver disease, with implications for therapeutic strategies.

    Measurement error models for survey statistics and economic archaeology

    The present work is concerned with so-called measurement error models in applied statistics. Data from two very different fields were analyzed and processed: on the one hand survey and register data, which are used in survey statistics, and on the other hand anthropological data on prehistoric skeletons. In both fields some variables cannot be measured with sufficient accuracy. This can be intentional, for reasons of data privacy, or it can result from measurement inaccuracies. This circumstance can be summarized under the headings measurement error or errors-in-variables. Such measurement errors can have severe effects on the statistical analysis, such as strongly biased estimates or greatly complicated graphical analysis. Despite these consequences, measurement errors are almost always ignored in applied statistical analyses. This work therefore develops corrections for well-known statistical methods such as (multivariate) kernel density estimation and nonparametric regression, based on specific applications. Many techniques for correcting measurement errors are feasible only for relatively simple measurement error models and statistical methods such as linear regression. In this work, therefore, an approach with so-called pseudo-samples is preferred. The developed algorithms can be classified as stochastic Expectation-Maximization methods or as fully Bayesian Markov-Chain-Monte-Carlo methods. The work is structured into two parts with a total of 5 chapters. Part I deals with two questions from survey statistics. Chapter 1 analyzes geographical coordinates of residences of people from certain population groups in Berlin, which had been anonymized by rounding. In order to obtain a useful non-parametric kernel density estimate of the population distribution, the rounding process was reversed by means of a stochastic Expectation-Maximization algorithm. In Chapter 2 this algorithm was greatly extended to model the distribution of responses in survey data. The heaping of certain values that usually occurs is modeled via rounding with unknown accuracy, treated as a random variable. Part II of this work deals with the results of the Emmy Noether project "Living conditions and biological standard of living in prehistory" (LiVES). A major component of the project was to merge three existing databases of prehistoric skeletons into a modern, web-based MySQL database. In Chapters 3 and 4 the already corrected data from the database were used for a preliminary analysis. The central research question in these chapters was: how did body height, as a proxy for the standard of living, develop across space and time in prehistory? Body height is reconstructed from the available long bone measurements. In this context, the author developed a fully Bayesian additive mixed measurement error model which models the spatial and temporal evolution of body height. In particular, the uncertainty in the chronological classification of the skeletons as well as the uncertainty concerning their sex were each taken into account by a Berkson error model. Finally, Chapter 5 deals with stature estimation and the question of how stature can be optimally estimated from the available long bones of the prehistoric skeletons.