Kernel Density Estimation for Heaped Data
Self-reported data usually exhibit a phenomenon called 'heaping', i.e.
survey participants round the values of their income, weight or height to some
degree. Additionally, respondents may be more prone to round off or up due to
social desirability. By ignoring the heaping process a severe bias in terms of
spikes and bumps is introduced when applying kernel density methods naively to
the rounded data. A generalized Stochastic Expectation Maximization (SEM)
approach accounting for heaping with potentially asymmetric rounding behaviour
in univariate kernel density estimation is presented in this work. The
introduced methods are applied to survey data of the German Socio-Economic
Panel and exhibit very good performance in simulations.
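The idea of reversing a rounding process with a stochastic EM scheme can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes symmetric rounding to a known multiple `width`, whereas the abstract describes unknown and possibly asymmetric rounding. The helper names `gaussian_kde` and `sem_unround`, the fixed bandwidth `h` and the grid discretisation are all assumptions made for this example.

```python
import math
import random


def gaussian_kde(sample, h):
    """Return a function that evaluates a Gaussian KDE with bandwidth h."""
    norm = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample)

    return density


def sem_unround(rounded, width, h, n_iter=15, burn_in=5, grid_size=21, seed=0):
    """Draw pseudo-samples of unrounded values (hypothetical helper).

    Assumes every value was rounded symmetrically to a multiple of `width`.
    """
    rng = random.Random(seed)
    pseudo = list(rounded)  # start from the heaped values themselves
    kept = []
    for it in range(n_iter):
        density = gaussian_kde(pseudo, h)  # density from the current pseudo-sample
        candidates = {}
        for r in set(rounded):
            # midpoints of grid_size sub-intervals of (r - width/2, r + width/2)
            grid = [r - width / 2 + width * (k + 0.5) / grid_size
                    for k in range(grid_size)]
            candidates[r] = (grid, [density(g) for g in grid])
        # S-step: redraw each value inside its own rounding interval,
        # with probability proportional to the current density estimate
        pseudo = [rng.choices(*candidates[r])[0] for r in rounded]
        if it >= burn_in:
            kept.append(pseudo)
    return kept
```

Pooling the retained pseudo-samples, for example `gaussian_kde([x for s in kept for x in s], h)`, gives a smoothed estimate without the spikes at the rounded values.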
a case study for student residents in Berlin
The transformation of area aggregates between non-hierarchical area systems is
a standard problem of official statistics. We introduce a new method which is
based on kernel density estimates. It is a modification of the SEM algorithm
proposed by Gross et al. (2016), which was used for the transformation of
totals on rectangular areas to kernel density estimates. As a by-product of
the routine one obtains simulated geo-coordinates for each unit. With the help
of these geo-coordinates it is possible to calculate case numbers for a new
area system. The method is applied to student resident figures from Berlin.
These are known only at the level of ZIP codes but they are needed for
administrative planning districts. Our method is evaluated on a similar,
simulated data set with known exact geo-coordinates. In the empirical part
results for changes in the student residential areas between 2005 and 2015 are
presented. It is demonstrated that the transformation via kernel density
estimates offers additional useful features to display concentration areas.
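Once simulated geo-coordinates are available, case numbers for a new area system follow from a point-in-polygon count. The sketch below assumes the target areas are simple polygons given as vertex lists and uses a standard ray-casting test. The function names and data layout are hypothetical and not taken from the paper.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # does the edge cross the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def case_numbers(points, areas):
    """Count simulated points per target area; areas maps a name to a polygon."""
    counts = {name: 0 for name in areas}
    for x, y in points:
        for name, polygon in areas.items():
            if point_in_polygon(x, y, polygon):
                counts[name] += 1
                break  # areas are assumed not to overlap
    return counts
```

Points that fall outside every target area are simply not counted. In practice one might average such counts over several simulated samples.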
Smoothing spatial data sets on administrative areas: a case study with Berlin election data
Composite spatial data on administrative area level are often presented as maps. The aim is to detect regional differences in the concentration of subpopulations, like elderly persons, ethnic minorities, low-educated persons, voters of a political party or persons with a certain disease. Thematic collections of such maps are presented in different atlases. The standard presentation is by choropleth maps, where each administrative unit is represented by a single value. These maps can be criticized in three respects: the implicit assumption of a uniform distribution within the area, the instability of the resulting map with respect to a change of the reference area, and the discontinuities of the maps at the borderlines of the reference areas, which inhibit the detection of regional clusters.
In order to address these problems we use a density approach in the construction of maps. This approach does not enforce a local uniform distribution. It does not depend on a specific choice of area reference system and there are no discontinuities in the displayed maps. Kernel density estimates are a standard estimation procedure for densities. However, these estimates need the geo-coordinates of the individual units, which are not available because we only have access to the aggregates of some area system. To overcome this hurdle, we use a statistical simulation concept. It can be interpreted as a Simulated Expectation Maximisation (SEM) algorithm in the sense of Celeux et al (1996). We simulate observations from the current density estimate which are consistent with the aggregation information (S-step). Then we apply the kernel density estimator to the simulated sample, which gives the next density estimate (E-step).
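The two alternating steps can be sketched for the simplest setting, counts on square grid cells. This is an illustration under stated assumptions rather than the published algorithm: the bandwidth `h` is fixed, at least one cell has a positive count, and each cell is discretised into `m` x `m` sub-cells with uniform jitter inside the chosen sub-cell. The names `kde2d` and `sem_from_cell_counts` are hypothetical.

```python
import math
import random


def kde2d(points, h):
    """Gaussian product-kernel density estimate for 2-D points."""
    norm = 1.0 / (len(points) * 2.0 * math.pi * h * h)

    def density(x, y):
        return norm * sum(
            math.exp(-0.5 * ((x - px) ** 2 + (y - py) ** 2) / (h * h))
            for px, py in points
        )

    return density


def sem_from_cell_counts(counts, cell, h, n_iter=8, m=4, seed=0):
    """counts maps (col, row) of a square grid cell to its case number."""
    rng = random.Random(seed)
    sub = cell / m  # side length of one sub-cell
    # candidate locations: midpoints of the m x m sub-cells of each cell
    cand = {
        (c, r): [((c + (i + 0.5) / m) * cell, (r + (j + 0.5) / m) * cell)
                 for i in range(m) for j in range(m)]
        for (c, r) in counts
    }

    def draw(key, n, weights=None):
        # pick sub-cells (uniformly if weights is None), then jitter inside them
        picks = rng.choices(cand[key], weights, k=n)
        return [(x + (rng.random() - 0.5) * sub, y + (rng.random() - 0.5) * sub)
                for x, y in picks]

    # start: spread the units of each cell uniformly over that cell
    points = []
    for key, n in counts.items():
        points.extend(draw(key, n))

    for _ in range(n_iter):
        density = kde2d(points, h)  # next density estimate from the current sample
        points = []
        for key, n in counts.items():  # S-step: simulate cell by cell
            if n > 0:
                weights = [density(x, y) for x, y in cand[key]]
                points.extend(draw(key, n, weights))
    return points
```

Because every unit is redrawn inside its own cell, each simulated sample reproduces the given aggregates exactly, while the kernel density estimate is free to vary within and across cells.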
This concept was first applied to grid data with rectangular areas, see Groß et al (2017), for the display of ethnic minorities. In a second application we demonstrated the use of this approach for the so-called “change of support” (Bradley et al 2016) problem. Here Groß et al (2020) used the SEM algorithm to recalculate case numbers between non-hierarchical administrative area systems. Recently Rendtel et al (2021) applied the SEM algorithm to display spatial-temporal clusters of Corona infections in Germany.
Here we present three modifications of the basic SEM algorithm: 1) We introduce a boundary correction which removes the underestimation of kernel density estimates at the borders of the population area. 2) We recognize unsettled areas, like lakes, parks and industrial areas, in the computation of the kernel density. 3) We adapt the SEM algorithm for the computation of local percentages which are important especially in voting analysis.
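The underestimation near a border, and one way to remove it, can be shown in one dimension. The sketch below renormalises each Gaussian kernel by the mass it keeps inside the admissible interval, which is one common form of boundary correction and an assumption here, not necessarily the correction used in the paper. The interval `[lower, upper]` stands in for the population area, all observations are assumed to lie inside it, and `bounded_kde` is a hypothetical name.

```python
import math


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bounded_kde(sample, h, lower, upper):
    """Gaussian KDE on [lower, upper] with per-kernel renormalisation."""
    # mass that each kernel keeps inside the admissible interval
    inside = [norm_cdf((upper - s) / h) - norm_cdf((lower - s) / h)
              for s in sample]
    c = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))

    def density(x):
        if x < lower or x > upper:
            return 0.0  # no mass outside the population area
        return c * sum(math.exp(-0.5 * ((x - s) / h) ** 2) / w
                       for s, w in zip(sample, inside))

    return density
```

Each rescaled kernel integrates to one over the interval, so the estimate as a whole does too. Kernels centred near a border are scaled up the most, which counteracts the loss of mass that an uncorrected estimate would place outside the area.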
We evaluate our approach against several standard maps by means of the local voting register with known addresses. In the empirical part we apply our approach for the display of voting results for the 2016 election of the Berlin parliament. We contrast our results against choropleth maps and show new possibilities for reporting spatial voting results.
Multivariate kernel density estimation applied to sensitive geo-referenced administrative data protected via measurement error
Modern systems of official statistics require the timely estimation of area-
specific densities of sub-populations. Ideally estimates should be based on
precise geo-coded information, which is not available due to confidentiality
constraints. One approach for ensuring confidentiality is by rounding the geo-
coordinates. We propose multivariate non-parametric kernel density estimation
that reverses the rounding process by using a Bayesian measurement error
model. The methodology is applied to the Berlin register of residents for
deriving density estimates of ethnic minorities and aged people. Estimates are
used for identifying areas with a need for new advisory centres for migrants
and infrastructure for older people.
Simulated geo-coordinates as a tool for map-based regional analysis
Map-based regional analysis aims to detect areas with a large
concentration of certain populations. Here kernel density estimates (KDE)
offer advantages over classical choropleth maps. However, kernel density
estimation needs exact geo-coordinates. In a recent paper Groß et al. (2017)
have proposed a measurement error model which uses local aggregates for kernel
density estimation. Their algorithm simulates "exact" geo-coordinates which
reflect the information on the aggregates. In this article we suggest two
extensions of this approach. First, we consider boundary constraints, which
are usually ignored in the KDE framework. This concerns not only the outer
limits of a municipality but also unsettled regions within a city like parks,
lakes and industrial areas. Without a boundary correction standard KDEs
underestimate the density in the vicinity of boundaries. Here we propose a
modification of the original algorithm which uses rescaled kernel functions.
Regional maps often display local percentages, for example, voters for a
special party among all voters in each voting district. Here we derive a
smooth representation of percentages which is based on the ratio of two
densities. Again, the original algorithm is modified to cope with the
estimation of a ratio of two densities. Our empirical examples refer to voting
results from Berlin. It is shown that the proposed methodology reveals a lot
of regional insight which is not produced by standard choropleth maps.
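A smooth local percentage built from the ratio of two densities can be illustrated as follows. The sketch assumes exact coordinates and one common bandwidth `h` for the subgroup and for the whole population, so that the normalising constants cancel and the share at a location reduces to a ratio of kernel sums. Both function names are hypothetical, and the paper's adaptation to aggregated data is not reproduced here.

```python
import math


def kernel_sum(x, y, points, h):
    """Unnormalised sum of Gaussian kernels centred at the given points."""
    return sum(
        math.exp(-0.5 * ((x - px) ** 2 + (y - py) ** 2) / (h * h))
        for px, py in points
    )


def local_share(x, y, group_points, all_points, h):
    """Smoothed share of the group among all units near (x, y).

    Equals n_g * f_g(x, y) / (n * f(x, y)) when both densities use the
    same bandwidth h; group_points must be a subset of all_points.
    """
    denom = kernel_sum(x, y, all_points, h)
    if denom == 0.0:
        return float("nan")  # no units anywhere near this location
    return kernel_sum(x, y, group_points, h) / denom
```

Because the subgroup is part of the whole population, the resulting value always lies between 0 and 1 and varies continuously across district borders.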
Sex-Specific Associations of Brain-Derived Neurotrophic Factor and Cardiorespiratory Fitness in the General Population
The brain-derived neurotrophic factor (BDNF) was initially considered to be neuron-specific. It is now known that this neurotrophin is also secreted peripherally by skeletal muscle cells and increases with exercise. Whether BDNF is related to cardiorespiratory fitness (CRF) is currently unclear. We analyzed the association of serum BDNF levels with CRF in the general population (Study of Health in Pomerania (SHIP-TREND) from Northeast Germany; n = 1607, 51% female; median age 48 years). Sex-stratified linear regression models adjusted for age, height, smoking, body fat, lean mass, physical activity, and depression analyzed the association between BDNF and maximal oxygen consumption (VO2peak), maximal oxygen consumption normalized for body weight (VO2peak/kg), and oxygen consumption at the anaerobic threshold (VO2@AT). In women, 1 mL/min higher VO2peak, VO2peak/kg, and VO2@AT were associated with a 2.43 pg/mL (95% confidence interval [CI]: 1.16 to 3.69 pg/mL; p = 0.0002), 150.66 pg/mL (95% CI: 63.42 to 237.90 pg/mL; p = 0.0007), and 2.68 pg/mL (95% CI: 0.5 to 4.8 pg/mL; p = 0.01) higher BDNF serum concentration, respectively. No significant associations were found in men. Further research is needed to understand the sex-specific association between CRF and BDNF. © 2019 by the authors. Licensee MDPI, Basel, Switzerland
Enolase represents a metabolic checkpoint controlling the differential exhaustion programmes of hepatitis virus-specific CD8+ T cells
Objective: Exhausted T cells with limited effector function are enriched in chronic hepatitis B and C virus (HBV and HCV) infection. Metabolic regulation contributes to exhaustion, but it remains unclear how metabolism relates to different exhaustion states, how it is impacted by antiviral therapy, and whether metabolic checkpoints regulate dysfunction. Design: Metabolic state, exhaustion and transcriptome of virus-specific CD8+ T cells from chronic HBV-infected (n=31) and HCV-infected patients (n=52) were determined ex vivo and during direct-acting antiviral (DAA) therapy. Metabolic flux and metabolic checkpoints were tested in vitro. Intrahepatic virus-specific CD8+ T cells were analysed by scRNA-Seq in a HBV-replicating murine in vivo model of acute and chronic infection. Results: HBV-specific (core 18-27, polymerase 455-463) and HCV-specific (NS3 1073-1081, NS3 1406-1415, NS5B 2594-2602) CD8+ T cell responses exhibit heterogeneous metabolic profiles connected to their exhaustion states. The metabolic state was connected to the exhaustion profile rather than the aetiology of infection. Mitochondrial impairment despite intact glucose uptake was prominent in severely exhausted T cells linked to elevated liver inflammation in chronic HCV infection and in HBV polymerase 455-463-specific CD8+ T cell responses. In contrast, relative metabolic fitness was observed in HBeAg-negative HBV infection in HBV core 18-27-specific responses. DAA therapy partially improved mitochondrial programmes in severely exhausted HCV-specific T cells and enriched metabolically fit precursors. We identified enolase as a metabolic checkpoint in exhausted T cells. Metabolic bypassing improved glycolysis and T cell effector function. Similarly, enolase deficiency was observed in intrahepatic HBV-specific CD8+ T cells in a murine model of chronic infection. Conclusion: Metabolism of HBV-specific and HCV-specific T cells is strongly connected to their exhaustion severity. Our results highlight enolase as a metabolic regulator of severely exhausted T cells. They connect differential bioenergetic fitness with distinct exhaustion subtypes and varying liver disease, with implications for therapeutic strategies.
Measurement error models for survey statistics and economic archaeology
The present work is concerned with so-called measurement error models in
applied statistics. Data from two very different fields were analyzed and
processed: on the one hand, survey and register data, which are used in
survey statistics, and on the other hand, anthropological data on
prehistoric skeletons. In both fields, some variables cannot be measured
with sufficient accuracy. This can be intentional, for example for data
protection reasons, or it can result from measurement inaccuracies. This
circumstance can be summarized under the headings measurement error or
errors-in-variables. Such measurement errors can have severe effects on
statistical analysis, such as strongly biased estimators or graphical
analyses that are much harder to carry out. Despite these consequences,
measurement errors are almost always ignored in applied statistical
analyses. This work therefore develops corrections for well-known
statistical methods such as (multivariate) kernel density estimation and
nonparametric regression, based on concrete applications. Many techniques
for correcting measurement errors are feasible only for relatively simple
measurement error models and statistical methods such as linear regression.
In this work, therefore, an approach with so-called pseudo-samples is
preferred. The developed algorithms can be classified as stochastic
Expectation-Maximization or as fully Bayesian Markov chain Monte Carlo
methods. The work is structured into two parts with a total of 5 chapters.
Part I deals with two questions from survey statistics. Chapter 1 analyzes
geo-coordinates of the residences of people from certain population groups
in Berlin, which had been anonymized by rounding. In order to obtain a
useful non-parametric kernel density estimate of the population
distribution, the rounding process was reversed by means of a stochastic
Expectation-Maximization algorithm. In Chapter 2, this algorithm was
greatly extended to model the distribution of responses in survey data. The
heaping of certain values that usually occurs there is modeled via rounding
with unknown accuracy, treated as a random variable. Part II of this work
deals with the results of the Emmy Noether project "Living conditions and
biological standard of living in prehistory" (LiVES). A major component of
the project was to merge three existing databases of prehistoric skeletons
into a modern, web-based MySQL database. Already corrected data from the
database were used for a preliminary analysis in Chapters 3 and 4. The
central research question in these chapters was how body height, as a proxy
for the standard of living, developed in prehistory. Body height is
reconstructed from the available long bone measurements. In this context, a
fully Bayesian additive mixed measurement error model was developed, which
models the spatio-temporal evolution of body height. In particular, the
uncertainty in the chronological classification of the skeletons and the
uncertainty concerning their sex were each accounted for by a Berkson error
model. Finally, Chapter 5 deals with stature estimation and the question of
how stature can be optimally estimated from the available long bones of the
prehistoric skeletons.
- …