27 research outputs found

    Random projections for Bayesian regression

    This article deals with random projections applied as a data reduction technique for Bayesian regression analysis. We show sufficient conditions under which the entire d-dimensional distribution is approximately preserved under random projections by reducing the number of data points from n to k ∈ O(poly(d/ε)) in the case n ≫ d. Under mild assumptions, we prove that evaluating a Gaussian likelihood function based on the projected data instead of the original data yields a (1+O(ε))-approximation in terms of the ℓ₂ Wasserstein distance. Our main result shows that the posterior distribution of Bayesian linear regression is approximated up to a small error depending on only an ε-fraction of its defining parameters. This holds when using arbitrary Gaussian priors or the degenerate case of uniform distributions over ℝ^d for β. Our empirical evaluations involve different simulated settings of Bayesian linear regression. Our experiments underline that the proposed method is able to recover the regression model up to small error while considerably reducing the total running time
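
    The data reduction step described in this abstract can be illustrated with a minimal sketch. It assumes a dense Gaussian projection matrix, a flat prior on β and a known unit noise variance; the abstract does not say which projection the authors use, and the function names and problem sizes below are hypothetical.

```python
import numpy as np

def gaussian_sketch(X, y, k, rng):
    """Project n data points down to k rows with a dense Gaussian matrix (an assumption)."""
    n = X.shape[0]
    S = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, n))
    return S @ X, S @ y

def flat_prior_posterior(X, y, sigma2=1.0):
    """Posterior of beta under a flat prior and known noise variance sigma2:
    Gaussian with mean (X'X)^{-1} X'y and covariance sigma2 * (X'X)^{-1}."""
    XtX = X.T @ X
    mean = np.linalg.solve(XtX, X.T @ y)
    cov = sigma2 * np.linalg.inv(XtX)
    return mean, cov

rng = np.random.default_rng(0)
n, d, k = 5000, 5, 400          # illustrative sizes with n much larger than d
X = rng.normal(size=(n, d))
beta_true = np.arange(1.0, d + 1.0)
y = X @ beta_true + rng.normal(size=n)

mean_full, cov_full = flat_prior_posterior(X, y)
SX, Sy = gaussian_sketch(X, y, k, rng)
mean_sketch, cov_sketch = flat_prior_posterior(SX, Sy)

# Relative distance between the two posterior means
rel_err = np.linalg.norm(mean_sketch - mean_full) / np.linalg.norm(mean_full)
print(f"relative error of posterior mean: {rel_err:.4f}")
```

    The regression on the projected data only touches k rows, so its cost no longer depends on n once the projection has been computed.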

    Bayesian and frequentist regression approaches for very large data sets

    This thesis is concerned with the analysis of frequentist and Bayesian regression models for data sets with a very large number of observations. Such large data sets pose a challenge when conducting regression analysis, because of the memory required (mainly for frequentist regression models) and the running time of the analysis (mainly for Bayesian regression models). I present two different approaches that can be employed in this setting. The first approach is based on random projections and reduces the number of observations to a manageable level as a first step before the regression analysis. The reduced number of observations depends on the number of variables in the data set and the desired goodness of the approximation. It is, however, independent of the number of observations in the original data set, making it especially useful for very large data sets. Theoretical guarantees for Bayesian linear regression are presented, which extend known guarantees for the frequentist case. The fundamental theorem covers Bayesian linear regression with arbitrary normal distributions or non-informative uniform distributions as prior distributions. I evaluate how close the posterior distributions of the original model and the reduced data set are for this theoretically covered case as well as for extensions towards hierarchical models and models using q-generalised normal distributions as prior. The second approach presents a transfer of the Merge & Reduce principle from data structures to regression models. In Computer Science, Merge & Reduce is employed in order to enable the use of static data structures in a streaming setting. Here, I present three possibilities of employing Merge & Reduce directly on regression models. This enables sequential or parallel analysis of subsets of the data set. The partial results are then combined in a way that recovers the regression model on the full data set well. This approach is suitable for a wide range of regression models.
I evaluate the performance on simulated and real world data sets using linear and Poisson regression models. Both approaches are able to recover regression models on the original data set well. They thus offer scalable versions of frequentist or Bayesian regression analysis for linear regression as well as extensions to generalised linear models, hierarchical models, and q-generalised normal distributions as prior distribution. Application on data streams or in distributed settings is also possible. Both approaches can be combined with multiple algorithms for frequentist or Bayesian regression analysis

    Streaming statistical models via Merge & Reduce

    Merge & Reduce is a general algorithmic scheme in the theory of data structures. Its main purpose is to transform static data structures, which support only queries, into dynamic data structures, which allow insertions of new elements, with as little overhead as possible. This can be used to turn classic offline algorithms for summarizing and analyzing data into streaming algorithms. We transfer these ideas to the setting of statistical data analysis in streaming environments. Our approach is conceptually different from previous settings where Merge & Reduce has been employed. Instead of summarizing the data, we combine the Merge & Reduce framework directly with statistical models. This enables performing computationally demanding data analysis tasks on massive data sets. The computations are divided into small tractable batches whose size is independent of the total number of observations n. The results are combined in a structured way at the cost of a bounded O(log n) factor in their memory requirements. It is only necessary, though nontrivial, to choose an appropriate statistical model and design merge and reduce operations on a casewise basis for the specific type of model. We illustrate our Merge & Reduce schemes on simulated and real-world data employing (Bayesian) linear regression models, Gaussian mixture models and generalized linear models
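
    The batching scheme described in this abstract can be sketched for the simplest case. The sketch assumes linear regression with a flat prior and unit noise variance, where a model is a (mean, precision) pair and two models merge by precision weighting; this is an illustration, not necessarily one of the merge operations designed by the authors, and all names are hypothetical. Because a merged model here is no larger than its inputs, the reduce step is trivial.

```python
import numpy as np

def fit_block(X, y):
    """Fit one batch: return the estimate and its precision matrix X'X."""
    P = X.T @ X
    m = np.linalg.solve(P, X.T @ y)
    return m, P

def merge(a, b):
    """Combine two models by precision-weighted averaging of their estimates."""
    (m1, P1), (m2, P2) = a, b
    P = P1 + P2
    m = np.linalg.solve(P, P1 @ m1 + P2 @ m2)
    return m, P

def merge_and_reduce(blocks):
    """Keep at most one model per level, like the digits of a binary counter."""
    buckets = {}            # level -> model
    max_stored = 0
    for X, y in blocks:
        model, level = fit_block(X, y), 0
        # Two models on the same level are merged and promoted one level up
        while level in buckets:
            model = merge(buckets.pop(level), model)
            level += 1
        buckets[level] = model
        max_stored = max(max_stored, len(buckets))
    # Fold the remaining models into the final result
    result = None
    for level in sorted(buckets):
        result = buckets[level] if result is None else merge(result, buckets[level])
    return result, max_stored

rng = np.random.default_rng(1)
d, block_size, n_blocks = 4, 200, 13
X = rng.normal(size=(block_size * n_blocks, d))
y = X @ np.array([0.5, -1.0, 2.0, 3.0]) + rng.normal(size=X.shape[0])
blocks = [(X[i:i + block_size], y[i:i + block_size])
          for i in range(0, X.shape[0], block_size)]

(m_stream, P_stream), max_stored = merge_and_reduce(blocks)
m_full, P_full = fit_block(X, y)
print(max_stored, np.max(np.abs(m_stream - m_full)))
```

    At any time the scheme stores at most about log2 of the number of batches many models, which is where the logarithmic memory factor comes from.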

    Providing Information by Resource-Constrained Data Analysis

    The Collaborative Research Center SFB 876 (Providing Information by Resource-Constrained Data Analysis) brings together the research fields of data analysis (Data Mining, Knowledge Discovery in Data Bases, Machine Learning, Statistics) and embedded systems and enhances their methods such that information from distributed, dynamic masses of data becomes available anytime and anywhere. The research center approaches these problems with new algorithms respecting the resource constraints in the different scenarios. This Technical Report presents the work of the members of the integrated graduate school

    Validity of stable isotope data in doping control: perspectives and proposals

    Δ13C and δ13C values of endogenous urinary steroids represent physiological random variables. Measurement uncertainty and biological scatter likewise contribute to the variances. The statistical distributions of negative controls are well investigated, but there is little knowledge about the corresponding distributions of steroid users. For these reasons valid discrimination of steroid users from non-users by 13C/12C analysis of endogenous steroids requires elaborate statistical treatment. Corresponding Bayesian approaches are presented following an introduction to the rationale. The use of mixture models appears appropriate. The distribution of routine data has been deconvolved and characterized accordingly. The mixture components, which presumably represent steroid users and non-users, exhibit considerable overlap. The validity of a given result depends on both the analytical uncertainty and the prior probability of doping offenses. Low analytical uncertainties but high prior probabilities facilitate valid detection of doping offenses. Two recommendations can be deduced. First, before starting a 13C/12C analysis, any initial suspicion should be well-substantiated. This precludes use of permissive criteria derived from the steroid profile. Secondly, knowledge of relevant 13C/12C distributions is required. This must cover representative numbers of authentic steroid users. Finally, it is desirable that the conditional probability for steroid administration rather than the measurement uncertainty is calculated and reported. This quantity possesses superior validity and it is largely independent of laboratory bias. The findings suggest and facilitate flexible handling of decision limits. Proposals for the evaluation of stable isotope data are presented. Copyright (c) 2012 John Wiley & Sons, Ltd
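
    The dependence of a result's validity on the prior probability, as stated in this abstract, follows from Bayes' rule applied to a two-component mixture. The sketch below assumes both components are Gaussian; every numerical value is a made-up placeholder chosen only for illustration and is not taken from the article.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_user_probability(x, prior,
                               mean_non=1.0, sd_non=1.0,    # hypothetical non-user component
                               mean_user=4.0, sd_user=1.5):  # hypothetical user component
    """P(steroid user | measured value x) for a two-component Gaussian mixture."""
    like_user = prior * normal_pdf(x, mean_user, sd_user)
    like_non = (1.0 - prior) * normal_pdf(x, mean_non, sd_non)
    return like_user / (like_user + like_non)

# The same measured value is judged very differently under different priors
for prior in (0.01, 0.10, 0.50):
    print(prior, round(posterior_user_probability(3.5, prior), 3))
```

    The loop shows how the same measurement supports a much weaker conclusion when the initial suspicion, expressed as the prior, is low.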

    Random projections for Bayesian regression

    This article introduces random projections applied as a data reduction technique for Bayesian regression analysis. We show sufficient conditions under which the entire d-dimensional distribution is preserved under random projections by reducing the number of data points from n to k ∈ O(poly(d/ε)) in the case n ≫ d. Under mild assumptions, we prove that evaluating a Gaussian likelihood function based on the projected data instead of the original data yields a (1+O(ε))-approximation in the ℓ₂ Wasserstein distance. Our main result states that the posterior distribution of a Bayesian linear regression is approximated up to a small error depending on only an ε-fraction of its defining parameters when using either improper non-informative priors or arbitrary Gaussian priors. Our empirical evaluations involve different simulated settings of Bayesian linear regression. Our experiments underline that the proposed method is able to recover the regression model while considerably reducing the total run-time

    Mortality Among Very Low-Birthweight Infants in Hospitals Serving Minority Populations

    Objective. We investigated whether the proportion of Black very low-birth-weight (VLBW) infants treated by hospitals is associated with neonatal mortality for Black and White VLBW infants. Methods. We analyzed medical records linked to secondary data sources for 74050 Black and White VLBW infants (501 g to 1500 g) treated by 332 hospitals participating in the Vermont Oxford Network from 1995 to 2000. Hospitals where more than 35% of VLBW infants treated were Black were defined as “minority-serving.” Results. Compared with hospitals where less than 15% of the VLBW infants were Black, minority-serving hospitals had significantly higher risk-adjusted neonatal mortality rates (White infants: odds ratio [OR]=1.30, 95% confidence interval [CI] = 1.09, 1.56; Black infants: OR = 1.29, 95% CI = 1.01, 1.64; Pooled: OR = 1.28, 95% CI=1.10, 1.50). Higher neonatal mortality in minority-serving hospitals was not explained by either hospital or treatment variables. Conclusions. Minority-serving hospitals may provide lower quality of care to VLBW infants compared with other hospitals. Because VLBW Black infants are disproportionately treated by minority-serving hospitals, higher neonatal mortality rates at these hospitals may contribute to racial disparities in infant mortality in the United States

    Influence of breast cancer risk factors and intramammary biotransformation on estrogen homeostasis in the human breast

    Understanding intramammary estrogen homeostasis constitutes the basis of understanding the role of lifestyle factors in breast cancer etiology. Thus, the aim of the present study was to identify variables influencing levels of the estrogens present in normal breast glandular and adipose tissues (GLT and ADT, i.e., 17β-estradiol, estrone, estrone-3-sulfate, and 2-methoxy-estrone) by multiple linear regression models. Explanatory variables (exVARs) considered were (a) levels of metabolic precursors as well as levels of transcripts encoding proteins involved in estrogen (biotrans)formation, (b) data on breast cancer risk factors (i.e., body mass index, BMI, intake of estrogen-active drugs, and smoking) collected by questionnaire, and (c) tissue characteristics (i.e., mass percentage of oil, oil%, and lobule type of the GLT). Levels of estrogens in GLT and ADT were influenced by both extramammary production (menopausal status, intake of estrogen-active drugs, and BMI) thus showing that variables known to affect levels of circulating estrogens influence estrogen levels in breast tissues as well for the first time. Moreover, intratissue (biotrans)formation (by aromatase, hydroxysteroid-17beta-dehydrogenase 2, and beta-glucuronidase) influenced intratissue estrogen levels, as well. Distinct differences were observed between the exVARs exhibiting significant influence on (a) levels of specific estrogens and (b) the same dependent variables in GLT and ADT. Since oil% and lobule type of GLT influenced levels of some estrogens, these variables may be included in tissue characterization to prevent sample bias. In conclusion, evidence for the intracrine activity of the human breast supports biotransformation-based strategies for breast cancer prevention. The susceptibility of estrogen homeostasis to systemic and tissue-specific modulation renders both beneficial and adverse effects of further variables associated with lifestyle and the environment possible

    Influence of breast cancer risk factors on proliferation and DNA damage in human breast glandular tissues: role of intracellular estrogen levels, oxidative stress and estrogen biotransformation

    Breast cancer etiology is associated with both proliferation and DNA damage induced by estrogens. Breast cancer risk factors (BCRF) such as body mass index (BMI), smoking, and intake of estrogen-active drugs were recently shown to influence intratissue estrogen levels. Thus, the aim of the present study was to investigate the influence of BCRF on estrogen-induced proliferation and DNA damage in 41 well-characterized breast glandular tissues derived from women without breast cancer. Influence of intramammary estrogen levels and BCRF on estrogen receptor (ESR) activation, ESR-related proliferation (indicated by levels of marker transcripts), oxidative stress (indicated by levels of GCLC transcript and oxidative derivatives of cholesterol), and levels of transcripts encoding enzymes involved in estrogen biotransformation was identified by multiple linear regression models. Metabolic fluxes to adducts of estrogens with DNA (E-DNA) were assessed by a metabolic network model (MNM) which was validated by comparison of calculated fluxes with data on methoxylated and glucuronidated estrogens determined by GC- and UHPLC-MS/MS. Intratissue estrogen levels significantly influenced ESR activation and fluxes to E-DNA within the MNM. Likewise, all BCRF directly and/or indirectly influenced ESR activation, proliferation, and key flux constraints influencing E-DNA (i.e., levels of estrogens, CYP1B1, SULT1A1, SULT1A2, and GSTP1). However, no unambiguous total effect of BCRF on proliferation became apparent. Furthermore, BMI was the only BCRF to indeed influence fluxes to E-DNA (via congruent adverse influence on levels of estrogens, CYP1B1 and SULT1A2)