Bounded Influence Regression in the Presence of Heteroskedasticity of Unknown Form
In a regression model with conditional heteroskedasticity of unknown form, we propose a general class of M-estimators scaled by nonparametric estimates of the conditional standard deviations of the dependent variable. We give regularity conditions under which these estimators are asymptotically equivalent to M-estimators scaled by the true conditional standard deviations. The practical performance of these estimators is investigated through a Monte Carlo experiment.
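To make the idea of a scaled M-estimator concrete, the following is a minimal sketch and not the authors' estimator. It assumes a straight-line model, Huber's psi function with the conventional tuning constant 1.345, and that the conditional standard deviations `sigma` have already been estimated by some nonparametric smoother (that step is not shown). All function names are hypothetical.

```python
def huber_weight(u, c=1.345):
    """IRLS weight psi(u)/u for Huber's psi; equals 1 inside [-c, c]."""
    a = abs(u)
    return 1.0 if a <= c else c / a


def weighted_line_fit(x, y, w):
    """Closed-form weighted least squares for y = b0 + b1 * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def scaled_huber_fit(x, y, sigma, c=1.345, n_iter=200, tol=1e-10):
    """Huber M-estimate of (b0, b1) with residuals scaled by sigma[i].

    sigma is assumed to be a pre-computed estimate of the conditional
    standard deviation at each observation (hypothetical input).
    Solved by iteratively reweighted least squares (IRLS).
    """
    # Start from ordinary weighted least squares with weights 1/sigma^2.
    w = [1.0 / s ** 2 for s in sigma]
    b0, b1 = weighted_line_fit(x, y, w)
    for _ in range(n_iter):
        # Weight = psi(u)/u divided by sigma^2, with u the scaled residual.
        w = [huber_weight((yi - b0 - b1 * xi) / s, c) / s ** 2
             for xi, yi, s in zip(x, y, sigma)]
        new_b0, new_b1 = weighted_line_fit(x, y, w)
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1
```

Calling `scaled_huber_fit(x, y, sigma)` with equal `sigma` values reduces to an ordinary Huber fit; unequal values downweight observations in high-variance regions before the robust weights are applied.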
Yet another breakdown point notion: EFSBP - illustrated at scale-shape models
The breakdown point in its different variants is one of the central notions
to quantify the global robustness of a procedure. We propose a simple
supplementary variant which is useful in situations where we have no obvious or
only partial equivariance: Extending the Donoho and Huber (1983) Finite Sample
Breakdown Point, we propose the Expected Finite Sample Breakdown Point to
produce less configuration-dependent values while still preserving the finite
sample aspect of the former definition. We apply this notion for joint
estimation of scale and shape (with only scale-equivariance available),
exemplified for generalized Pareto, generalized extreme value, Weibull, and
Gamma distributions. In these settings, we are interested in highly-robust,
easy-to-compute initial estimators; to this end we study Pickands-type and
Location-Dispersion-type estimators and compute their respective breakdown
points. Comment: 21 pages, 4 figures
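As an illustration of a quantile-based initial estimator of this kind, here is a minimal sketch of one common Pickands-type construction for the generalized Pareto distribution with location 0, built from the sample median and third quartile. The exact variant studied in the paper may differ; the function names are hypothetical.

```python
import math
import statistics


def gpd_quantile(p, xi, beta):
    """Quantile function of the generalized Pareto distribution (location 0)."""
    if xi == 0.0:
        return -beta * math.log1p(-p)  # exponential limit
    return beta / xi * ((1.0 - p) ** (-xi) - 1.0)


def pickands_gpd(sample):
    """Estimate (shape xi, scale beta) from the median and the 0.75 quantile.

    Uses Q(3/4) - Q(1/2) = 2**xi * Q(1/2) for the GPD, which gives
    xi = log2((q34 - q12) / q12) and beta = xi * q12**2 / (q34 - 2 * q12).
    """
    quartiles = statistics.quantiles(sample, n=4)
    q12, q34 = quartiles[1], quartiles[2]
    if q12 <= 0.0 or q34 <= q12:
        # Degenerate configuration: the estimator is not defined here.
        raise ValueError("sample quantiles do not admit a Pickands-type estimate")
    xi = math.log2((q34 - q12) / q12)
    denom = q34 - 2.0 * q12
    if denom == 0.0:
        beta = q12 / math.log(2.0)  # xi == 0: exponential case
    else:
        beta = xi * q12 ** 2 / denom
    return xi, beta
```

The `ValueError` branch marks sample configurations in which the estimator fails outright, which is the kind of configuration dependence that a breakdown point defined on a fixed sample has to contend with.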
Private Drinking Water Wells as a Source of Exposure to Perfluorooctanoic Acid (PFOA) in Communities Surrounding a Fluoropolymer Production Facility
BACKGROUND: The C8 Health Project was established in 2005 to collect data on perfluorooctanoic acid (PFOA, or C8) and human health in Ohio and West Virginia communities contaminated by a fluoropolymer production facility. OBJECTIVE: We assessed PFOA exposure via contaminated drinking water in a subset of C8 Health Project participants who drank water from private wells. METHODS: Participants provided demographic information and residential, occupational, and medical histories. Laboratory analyses were conducted to determine serum-PFOA concentrations. PFOA data were collected from 2001 through 2005 from 62 private drinking water wells. We examined the relationship between drinking water and PFOA levels in serum using robust regression methods. As a comparison with regression models, we used a first-order, single-compartment pharmacokinetic model to estimate the serum:drinking-water concentration ratio at steady state. RESULTS: The median serum PFOA concentration in 108 study participants who used private wells was 75.7 μg/L, approximately 20 times greater than the levels in the U.S. general population but similar to those of local residents who drank public water. Each 1 μg/L increase in PFOA levels in drinking water was associated with an increase in serum concentrations of 141.5 μg/L (95% confidence interval, 134.9-148.1). The serum:drinking-water concentration ratio for the steady-state pharmacokinetic model was 114. CONCLUSIONS: PFOA-contaminated drinking water is a significant contributor to PFOA levels in serum in the study population. Regression methods and pharmacokinetic modeling produced similar estimates of the relationship
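For readers unfamiliar with the pharmacokinetic comparison, the sketch below shows the generic form of a first-order, single-compartment model and its steady-state serum:drinking-water ratio. It is a textbook formulation, not the study's code. The parameter names (water intake, absorbed fraction, half-life, volume of distribution) are assumptions, and the abstract does not report the values that were used.

```python
import math


def steady_state_ratio(water_intake_l_per_day, absorbed_fraction,
                       half_life_days, volume_of_distribution_l):
    """Steady-state serum:water concentration ratio.

    Model: dC/dt = intake * absorbed * C_water / Vd - k * C,
    with elimination rate k = ln(2) / half-life. Setting dC/dt = 0
    gives C_serum / C_water = intake * absorbed / (k * Vd).
    """
    k = math.log(2.0) / half_life_days
    return water_intake_l_per_day * absorbed_fraction / (k * volume_of_distribution_l)


def serum_concentration(t_days, water_conc, ratio, half_life_days, c0=0.0):
    """Serum concentration after t_days of constant exposure, starting at c0."""
    k = math.log(2.0) / half_life_days
    c_ss = ratio * water_conc  # steady-state serum level
    return c_ss + (c0 - c_ss) * math.exp(-k * t_days)
```

Because the ratio is inversely proportional to the elimination rate, a long serum half-life alone is enough to produce a ratio far above 1.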
On the Schoenberg Transformations in Data Analysis: Theory and Illustrations
The class of Schoenberg transformations, embedding Euclidean distances into
higher dimensional Euclidean spaces, is presented, and derived from theorems on
positive definite and conditionally negative definite matrices. Original
results on the arc lengths, angles and curvature of the transformations are
proposed, and visualized on artificial data sets by classical multidimensional
scaling. A simple distance-based discriminant algorithm illustrates the theory,
intimately connected to the Gaussian kernels of Machine Learning.
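One standard example makes the link to Gaussian kernels explicit. The notation is an assumption, not taken from the paper: $D_{ij}$ denotes a squared Euclidean distance between objects $i$ and $j$, and $\lambda > 0$ is a free parameter.

```latex
\begin{align*}
  \tilde{D}_{ij} &= \varphi(D_{ij}) = 1 - \exp(-\lambda D_{ij}), \\
  K_{ij} &= \exp(-\lambda D_{ij}) \quad \text{(Gaussian kernel, so } K_{ii} = 1\text{)}, \\
  \tilde{D}_{ij} &= \tfrac{1}{2}\left(K_{ii} + K_{jj} - 2K_{ij}\right).
\end{align*}
```

Since the Gaussian kernel is positive definite, $K_{ii} + K_{jj} - 2K_{ij}$ is a squared Euclidean distance in the kernel feature space, so $\tilde{D}$ is again a squared Euclidean distance, embedded in a higher-dimensional space.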
Combining estimates of interest in prognostic modelling studies after multiple imputation: current practice and guidelines
Background: Multiple imputation (MI) provides an effective approach to handle missing covariate
data within prognostic modelling studies, as it can properly account for the missing data
uncertainty. The multiply imputed datasets are each analysed using standard prognostic modelling
techniques to obtain the estimates of interest. The estimates from each imputed dataset are then
combined into one overall estimate and variance, incorporating both the within and between
imputation variability. Rubin's rules for combining these multiply imputed estimates are based on
asymptotic theory. The resulting combined estimates may be more accurate if the posterior
distribution of the population parameter of interest is better approximated by the normal
distribution. However, the normality assumption may not be appropriate for all the parameters of
interest when analysing prognostic modelling studies, such as predicted survival probabilities and
model performance measures.
Methods: Guidelines for combining the estimates of interest when analysing prognostic modelling
studies are provided. A literature review is performed to identify current practice for combining
such estimates in prognostic modelling studies.
Results: Methods for combining all reported estimates after MI were not well reported in the
current literature. Rubin's rules without applying any transformations were the standard approach
used, when any method was stated.
Conclusion: The proposed simple guidelines for combining estimates after MI may lead to a wider
and more appropriate use of MI in future prognostic modelling studies.
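The combination step described in the Background can be sketched as follows. This is a minimal illustration of Rubin's rules, plus one way of pooling a probability on a transformed scale. The logit transformation and the delta-method variance are assumptions chosen for illustration, not necessarily the transformation the guidelines recommend; the function names are hypothetical.

```python
import math
import statistics


def rubin_pool(estimates, variances):
    """Combine m imputed-data estimates with Rubin's rules.

    Returns (pooled estimate, total variance), where
    total = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two estimates with matching variances")
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)      # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (m - 1 divisor)
    return pooled, within + (1.0 + 1.0 / m) * between


def pool_probability_logit(probs, variances):
    """Pool probabilities on the logit scale, then back-transform.

    Variances are moved to the logit scale with the delta method:
    var(logit p) is approximately var(p) / (p * (1 - p))**2.
    The returned variance is on the logit scale.
    """
    logits = [math.log(p / (1.0 - p)) for p in probs]
    logit_vars = [v / (p * (1.0 - p)) ** 2 for p, v in zip(probs, variances)]
    pooled, total = rubin_pool(logits, logit_vars)
    return 1.0 / (1.0 + math.exp(-pooled)), total
```

Pooling on a transformed scale keeps the back-transformed estimate inside (0, 1) and tends to make the normal approximation behind Rubin's rules more plausible for quantities such as predicted survival probabilities.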
Assessing Levels of Attention Using Low Cost Eye Tracking
The emergence of mobile eye trackers embedded in next generation smartphones
or VR displays will make it possible to trace not only what objects we look at
but also the level of attention in a given situation. Exploring whether we can
quantify the engagement of a user interacting with a laptop, we apply mobile
eye tracking in an in-depth study over 2 weeks with nearly 10,000 observations
to assess pupil size changes, related to attentional aspects of alertness,
orientation and conflict resolution. Visually presenting conflicting cues and
targets we hypothesize that it's feasible to measure the allocated effort when
responding to confusing stimuli. Although such experiments are normally carried
out in a lab, we are able to differentiate between sustained alertness and
complex decision making even with low cost eye tracking "in the wild". From a
quantified self perspective of individual behavioral adaptation, the
correlations between the pupil size and the task dependent reaction time and
error rates may in the longer term provide a foundation for adapting smartphone
content and interaction to the user's perceived level of attention. Comment: 12 pages, 6 figures, 2 tables. The final publication will be
available at Springer via http://dx.doi.org/DOIxxx, when published as part of
the HCI International 2016 Conference Proceedings
Locating previously unknown patterns in data-mining results: a dual data- and knowledge-mining method
BACKGROUND: Data mining can be utilized to automate analysis of substantial amounts of data produced in many organizations. However, data mining produces large numbers of rules and patterns, many of which are not useful. Existing methods for pruning uninteresting patterns have only begun to automate the knowledge acquisition step (which is required for subjective measures of interestingness), hence leaving a serious bottleneck. In this paper we propose a method for automatically acquiring knowledge to shorten the pattern list by locating the novel and interesting ones. METHODS: The dual-mining method is based on automatically comparing the strength of patterns mined from a database with the strength of equivalent patterns mined from a relevant knowledgebase. When these two estimates of pattern strength do not match, a high "surprise score" is assigned to the pattern, identifying the pattern as potentially interesting. The surprise score captures the degree of novelty or interestingness of the mined pattern. In addition, we show how to compute p values for each surprise score, thus filtering out noise and attaching statistical significance. RESULTS: We have implemented the dual-mining method using scripts written in Perl and R. We applied the method to a large patient database and a biomedical literature citation knowledgebase. The system estimated association scores for 50,000 patterns, composed of disease entities and lab results, by querying the database and the knowledgebase. It then computed the surprise scores by comparing the pairs of association scores. Finally, the system estimated statistical significance of the scores. CONCLUSION: The dual-mining method eliminates more than 90% of patterns with strong associations, thus identifying them as uninteresting. We found that the pruning of patterns using the surprise score matched the biomedical evidence in the 100 cases that were examined by hand. 
The method automates the acquisition of knowledge, thus reducing dependence on knowledge elicited from human experts, which is usually a rate-limiting step.
Fish Consumption and Mercury Exposure among Louisiana Recreational Anglers
Background: Methylmercury (MeHg) exposure assessments among average fish consumers in the United States may underestimate exposures among U.S. subpopulations with high intakes of regionally specific fish.
Objectives: We examined relationships among fish consumption, estimated mercury (Hg) intake, and measured Hg exposure within one such potentially highly exposed group, recreational anglers in the state of Louisiana, USA.
Methods: We surveyed 534 anglers in 2006 using interviews at boat launches and fishing tournaments combined with an Internet-based survey method. Hair samples from 402 of these anglers were collected and analyzed for total Hg. Questionnaires provided information on species-specific fish consumption during the 3 months before the survey.
Results: Anglers’ median hair-Hg concentration was 0.81 μg/g (n = 398; range, 0.02–10.7 μg/g); 40% of participants had levels >1 μg/g, which approximately corresponds to the U.S. Environmental Protection Agency’s reference dose. Fish consumption and Hg intake were significantly positively associated with hair-Hg. Participants reported consuming nearly 80 different fish types, many of which are specific to the region. Unlike the general U.S. population, which acquires most of its Hg from commercial seafood sources, approximately 64% of participants’ fish meals and 74% of their estimated Hg intake came from recreationally caught seafood.
Conclusions: Study participants had relatively elevated hair-Hg concentrations and reported consumption of a wide variety of fish, particularly locally caught fish. This group represents a highly exposed subpopulation with an exposure profile that differs from fish consumers in other regions of the United States, suggesting a need for more regionally specific exposure estimates and public health advisories. ISSN: 1552-9924; ISSN: 0091-676
Infinitesimally Robust Estimation in General Smoothly Parametrized Models
We describe the shrinking neighborhood approach of Robust Statistics, which
applies to general smoothly parametrized models, especially, exponential
families. Equal generality is achieved by object oriented implementation of the
optimally robust estimators. We evaluate the estimates on real datasets from
literature by means of our R packages ROptEst and RobLox.