Boosted Classification Trees and Class Probability/Quantile Estimation
The standard by which binary classifiers are usually judged, misclassification error, assumes equal costs of misclassifying the two classes or, equivalently, classifying at the 1/2 quantile of the conditional class probability function P[y = 1|x]. Boosted classification trees are known to perform quite well for such problems. In this article we consider the use of standard, off-the-shelf boosting for two more general problems: 1) classification with unequal costs or, equivalently, classification at quantiles other than 1/2, and 2) estimation of the conditional class probability function P[y = 1|x]. We first examine whether the latter problem, estimation of P[y = 1|x], can be solved with LogitBoost, and with AdaBoost when combined with a natural link function. The answer is negative: both approaches are often ineffective because they overfit P[y = 1|x] even though they perform well as classifiers. A major negative point of the present article is the disconnect between class probability estimation and classification. Next we consider the practice of over/under-sampling of the two classes. We present an algorithm that uses AdaBoost in conjunction with Over/Under-Sampling and Jittering of the data (“JOUS-Boost”). This algorithm is simple, yet successful, and it preserves the advantage of relative protection against overfitting, but for arbitrary misclassification costs and, equivalently, arbitrary quantile boundaries. We then use collections of classifiers obtained from a grid of quantiles to form estimators of class probabilities. The estimates of the class probabilities compare favorably to those obtained by a variety of methods across both simulated and real data sets
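The equivalence between unequal misclassification costs and classification at a quantile other than 1/2, stated in the abstract above, can be illustrated with a minimal sketch. It is not the JOUS-Boost algorithm. It assumes the class probability p = P[y = 1|x] is known, and the cost convention (a false positive costs c, a false negative costs 1 - c) and all function names are assumptions made for illustration.

```python
def expected_costs(p, c):
    """Expected cost of each decision when P[y = 1|x] = p.

    Assumed convention: a false positive costs c, a false negative costs 1 - c.
    """
    cost_predict_1 = c * (1.0 - p)    # wrong with probability 1 - p
    cost_predict_0 = (1.0 - c) * p    # wrong with probability p
    return cost_predict_1, cost_predict_0


def cost_minimizing_label(p, c):
    """Pick the label with the smaller expected cost."""
    cost_predict_1, cost_predict_0 = expected_costs(p, c)
    return 1 if cost_predict_1 < cost_predict_0 else 0


def classify_at_quantile(p, c):
    """Threshold the class probability at the quantile c."""
    return 1 if p > c else 0
```

Under this convention c(1 - p) < (1 - c)p reduces to p > c, so the two rules agree whenever p differs from c. With c = 1/2 this is the usual misclassification-error rule.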
Comment: Boosting Algorithms: Regularization, Prediction and Model Fitting
The authors are doing the readers of Statistical Science a true service with a well-written and up-to-date overview of boosting that originated with the seminal algorithms of Freund and Schapire. Equally, we are grateful for high-level software that will permit a larger readership to experiment with, or simply apply, boosting-inspired model fitting. The authors show us a world of methodology that illustrates how a fundamental innovation can penetrate every nook and cranny of statistical thinking and practice. They introduce the reader to one particular interpretation of boosting and then give a display of its potential with extensions from classification (where it all started) to least squares, exponential family models, survival analysis, to base-learners other than trees such as smoothing splines, to degrees of freedom and regularization, and to fascinating recent work in model selection. The uninitiated reader will find that the authors did a nice job of presenting a certain coherent and useful interpretation of boosting. The other reader, though, who has watched the business of boosting for a while, may have quibbles with the authors over details of the historic record and, more importantly, over their optimism about the current state of theoretical knowledge. In fact, as much as the statistical view has proven fruitful, it has also resulted in some ideas about why boosting works that may be misconceived, and in some recommendations that may be misguided
Geocoding health data with Geographic Information Systems: a pilot study in northeast Italy for developing a standardized data-acquiring format
Introduction. Geographic Information Systems (GIS) have
become an innovative and somewhat crucial tool for analyzing
relationships between public health data and environment. This
study, though focusing on a Local Health Unit of northeastern
Italy, could be taken as a benchmark for developing a standardized
national data-acquiring format, providing step-by-step
instructions on the manipulation of address elements specific to
the Italian language and traditions.
Methods. Geocoding analysis was carried out on a health database
comprising 268,517 records of the Local Health Unit of
Rovigo in the Veneto region, covering the 10 years from 2001
to 2010. The Map Service provided by the Environmental
Systems Research Institute (ESRI, Redlands, CA) and
ArcMap 10.0 by ESRI® were, respectively, the reference data and
the GIS software employed in the geocoding process.
Results. The first geocoding attempt produced a poor-quality
result, with only about 40% of the addresses matched. A procedure
of manual standardization was performed to enhance the
quality of the results; consequently, a set of guiding principles to be
followed when geocoding health data was set out.
High-level geocoding detail will provide a more precise geographic
representation of health-related events.
Conclusions. The main achievement of this study was to outline
some of the difficulties encountered during the geocoding of
health data and to put forward a set of guidelines, which could
be useful to facilitate the process and enhance the quality of the
results. Public health informatics represents an emerging specialty
that focuses on the application of information science
and technology to public health practice and research. Therefore,
this study could draw the attention of the National Health Service
to the underestimated problem of geocoding accuracy in
health-related data for environmental risk assessment
Is depression a real risk factor for acute myocardial infarction mortality? A retrospective cohort study
Background: Depression has been associated with a higher risk of cardiovascular events and higher mortality in patients with one or more comorbidities. This study investigated whether continuous use of antidepressants (ADs), taken as a proxy for a state of depression, prior to acute myocardial infarction (AMI) is associated with higher subsequent mortality. The outcome assessed was mortality by AD use. Methods: A retrospective cohort study was conducted in the Veneto Region on hospital discharge records with a primary diagnosis of AMI in 2002-2015. Subsequent deaths were ascertained from mortality records. Drug purchases were used to identify AD users. A descriptive analysis was conducted on patients' demographics and clinical data. Survival after discharge was assessed with a Kaplan-Meier survival analysis and Cox's multiple regression model. Results: Among 3985 hospital discharge records considered, 349 (8.8%) patients were classified as 'AD users'. The mean AMI-related hospitalization rate was 164.8/100,000 population/year, and declined significantly from 204.9 in 2002 to 130.0 in 2015, but only for AD users (-40.4%). The mean overall follow-up was 4.6 ± 4.1 years. Overall, 523 patients (13.1%) died within 30 days of their AMI. The remainder survived a mean 5.3 ± 4.0 years. After adjusting for potential confounders, use of antidepressants was independently associated with mortality (adj OR=1.75, 95% CI: 1.40-2.19). Conclusions: Our findings show that AD users hospitalized for AMI have a worse prognosis in terms of mortality. The use of routinely-available records can prove an efficient way to monitor trends in the state of health of specific subpopulations, enabling the early identification of AMI survivors with a history of antidepressant use
Identifying research priorities to advance climate services
Climate services involve the timely production, translation, and delivery of useful climate data, information, and knowledge for societal decision-making. They rely on a range of expertise and are underpinned by research in climate and related sciences, sectoral applications (e.g., agriculture, water, health, energy, disasters), and a number of social science fields, including political science, sociology, anthropology, and economics. Feedback and engagement between these research communities and the communities involved in developing and/or using climate services is thus critical, ensuring that climate services are built on the best available science and providing researchers with guidance regarding priority challenges in the development of climate services that should warrant their attention. This paper reports the results of an international survey to gauge community perspective on research priorities for climate services, highlighting several areas in which respondents agree on the need for future work. The survey results indicate an overarching interest in research that can better connect climate information to users, particularly around the communication of climate information, the mapping of climate information needs, and the evaluation and prioritization of capacity building efforts. They also reveal significant interest in climate research to advance the skill of forecasts at subseasonal-to-seasonal scales – considered more broadly useful to decision makers than information at the end-of-century timescale – and to identify the drivers of extreme events. To support climate-related research, survey respondents underscore the need to continually develop and maintain the observational network. In analyzing these results, the paper offers guidance to researchers and to other members of the climate services community that may find these priorities useful in directing their own work to address the challenges posed by climate variability and change
Inequalities and Positive-Definite Functions Arising From a Problem in Multidimensional Scaling
We solve the following variational problem: Find the maximum of $E\|X-Y\|$ subject to $E\|X\|^2 \le 1$, where X and Y are i.i.d. random n-vectors, and $\|\cdot\|$ is the usual Euclidean norm on $\mathbb{R}^n$. This problem arose from an investigation into multidimensional scaling, a data analytic method for visualizing proximity data. We show that the optimal X is unique and is (1) uniform on the surface of the unit sphere, for dimensions $n \ge 3$, (2) circularly symmetric with a scaled version of the radial density $\rho/(1-\rho^2)^{1/2}$, $0 \le \rho \le 1$, for $n = 2$, and (3) uniform on an interval centered at the origin, for $n = 1$ (Plackett's theorem). By proving spherical symmetry of the solution, a reduction to a radial problem is achieved. The solution is then found using the Wiener-Hopf technique for (real) $n < 3$. The results are reminiscent of classical potential theory, but they cannot be reduced to it. Along the way, we obtain results of independent interest: for any i.i.d. random n-vectors X and Y, $E\|X-Y\| \le E\|X+Y\|$. Further, the kernel $K_{p,\beta}(x,y) = \|x+y\|_p^{\beta} - \|x-y\|_p^{\beta}$, $x, y \in \mathbb{R}^n$ and $\|x\|_p = (\sum |x_i|^p)^{1/p}$, is positive-definite, that is, it is the covariance of a random field, $K_{p,\beta}(x,y) = E[Z(x)Z(y)]$ for some real-valued random process Z(x), for $1 \le p \le 2$ and $0 < \beta \le p \le 2$ (but not for $\beta > p$ or $p > 2$ in general). Although this is an easy consequence of known results, it appears to be new in a strict sense. In the radial problem, the average distance $D(r_1,r_2)$ between two spheres of radii $r_1$ and $r_2$ is used as a kernel. We derive properties of $D(r_1,r_2)$, including nonnegative definiteness on signed measures of zero integral
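The inequality between the expected norms of X - Y and X + Y for i.i.d. X and Y, stated in the abstract above, can be checked numerically with a short Monte Carlo sketch. The choice of distribution (a Gaussian in three dimensions whose first coordinate is shifted), the sample size, and the function name are assumptions made for illustration. A simulation of this kind only illustrates the claim for one distribution and proves nothing.

```python
import math
import random


def mean_norms(n_dim=3, shift=1.0, n_pairs=20000, seed=0):
    """Estimate E||X - Y|| and E||X + Y|| for i.i.d. X, Y.

    Illustrative assumption: X is standard Gaussian in n_dim dimensions
    with its first coordinate shifted by `shift`.
    """
    rng = random.Random(seed)
    total_diff = 0.0
    total_sum = 0.0
    for _ in range(n_pairs):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_dim)]
        y = [rng.gauss(0.0, 1.0) for _ in range(n_dim)]
        x[0] += shift
        y[0] += shift
        total_diff += math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
        total_sum += math.sqrt(sum((a + b) ** 2 for a, b in zip(x, y)))
    return total_diff / n_pairs, total_sum / n_pairs


est_diff, est_sum = mean_norms()
print(f"E||X-Y|| ~ {est_diff:.3f}, E||X+Y|| ~ {est_sum:.3f}")
```

With shift = 0 the distribution is symmetric about the origin, so X - Y and X + Y have the same distribution and the two estimates should be close to equal. A nonzero shift separates them.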
Uncompleted Emergency Department Care (UEDC): A 5-year population-based study in the Veneto Region, Italy
Introduction: Uncompleted visits to emergency departments (UEDC) are a patient safety concern. The purpose of this study was to investigate risk factors for UEDC, describing not only the sociodemographic characteristics of patients who left against medical advice (AMA) and those who left without being seen (LWBS), but also the characteristics of their access to the emergency department (ED) and of the hospital structure. Methods: This was a cross sectional study on anonymized administrative data in a population-based ED database. Results: A total of 9,147,415 patients attended EDs in the Veneto Region from 2011 to 2015. The UEDC rate was 28.7‰, with a slightly higher rate of AMA than of LWBS (15.3‰ vs 13.4‰). Age, sex, citizenship, and residence were sociodemographic factors associated with UEDC, and so were certain characteristics of access, such as mode of admission, type of referral, emergency level, waiting time before being seen, and type of medical issue (trauma or other). Some characteristics of the hospital structure, such as the type of hospital and the volume of patients managed, could also be associated with UEDC. Conclusion: Cases of UEDC, which may involve patients who leave AMA and those who LWBS, differ considerably from other cases managed at the ED. The present findings are important for the purpose of planning and staffing health services. Decision-makers should identify and target the factors associated with UEDC to minimize walkouts from public hospital EDs
Damaging de novo mutations diminish motor skills in children on the autism spectrum
In individuals with autism spectrum disorder (ASD), de novo mutations have previously been shown to be significantly correlated with lower IQ but not with the core characteristics of ASD: deficits in social communication and interaction and restricted interests and repetitive patterns of behavior. We extend these findings by demonstrating in the Simons Simplex Collection that damaging de novo mutations in ASD individuals are also significantly and convincingly correlated with measures of impaired motor skills. This correlation is not explained by a correlation between IQ and motor skills. We find that IQ and motor skills are distinctly associated with damaging mutations and, in particular, that motor skills are a more sensitive indicator of mutational severity than is IQ, as judged by mutational type and target gene. We use this finding to propose a combined classification of phenotypic severity: mild (little impairment of either), moderate (impairment mainly to motor skills), and severe (impairment of both IQ and motor skills)
Measuring shared variants in cohorts of discordant siblings with applications to autism
We develop a method of analysis [affected to discordant sibling pairs (A2DS)] that tests if shared variants contribute to a disorder. Using a standard measure of genetic relation, test individuals are compared with a cohort of discordant sibling pairs (CDS) to derive a comparative similarity score. We ask if a test individual is more similar to an unrelated affected than to the unrelated unaffected sibling from the CDS and then sum over such individuals and pairs. Statistical significance is judged by randomly permuting the affected status in the CDS. In the analysis of published genotype data from the Simons Simplex Collection (SSC) and the Autism Genetic Resource Exchange (AGRE) cohorts of children with autism spectrum disorder (ASD), we find strong statistical significance that the affected are more similar to the affected than to the unaffected of the CDS (P value approximately 0.00001). Fathers in multiplex families have marginally greater similarity (P value = 0.02) to unrelated affected individuals. These results do not depend on ethnic matching or gender
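The permutation step described in the abstract above can be sketched roughly as follows. This is not the authors' A2DS implementation. It assumes precomputed similarity scores, takes the statistic to be the summed difference "similarity to affected minus similarity to unaffected", and assumes that permuting affected status means swapping the two labels within each sibling pair with probability 1/2. All names are hypothetical.

```python
import random


def permutation_pvalue(sim_aff, sim_unaff, n_perm=2000, seed=0):
    """One-sided permutation p-value for a summed similarity difference.

    sim_aff[i][j]   : similarity of test individual i to the affected
                      sibling of pair j (assumed precomputed)
    sim_unaff[i][j] : similarity of test individual i to the unaffected
                      sibling of pair j
    """
    rng = random.Random(seed)
    n_tests = len(sim_aff)
    n_pairs = len(sim_aff[0])
    # Contribution of each sibling pair, summed over test individuals.
    pair_diffs = [
        sum(sim_aff[i][j] - sim_unaff[i][j] for i in range(n_tests))
        for j in range(n_pairs)
    ]
    observed = sum(pair_diffs)
    hits = 0
    for _ in range(n_perm):
        # Swapping labels within a pair flips the sign of its contribution.
        permuted = sum(d if rng.random() < 0.5 else -d for d in pair_diffs)
        if permuted >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)
```

The smallest p-value this estimate can return is 1 / (n_perm + 1), so resolving a value near 0.00001 would need on the order of 100,000 permutations or more.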
The Power to See: A New Graphical Test of Normality
Many statistical procedures assume the underlying data generating process involves Gaussian errors. Among the well-known procedures are ANOVA, multiple regression, linear discriminant analysis and many more. There are a few popular procedures that are commonly used to test for normality such as the Kolmogorov-Smirnov test and the Shapiro-Wilk test. Excluding the Kolmogorov-Smirnov testing procedure, these methods do not have a graphical representation. As such these testing methods offer very little insight as to how the observed process deviates from the normality assumption. In this paper we discuss a simple new graphical procedure which provides confidence bands for a normal quantile-quantile plot. These bands define a test of normality and are much narrower in the tails than those related to the Kolmogorov-Smirnov test. Correspondingly the new procedure has much greater power to detect deviations from normality in the tails
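The remark about Kolmogorov-Smirnov bands being wide in the tails can be illustrated with a small sketch. It does not implement the new procedure from the abstract above. It builds a Kolmogorov-Smirnov-type band for the ordered observations from the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, a conservative stand-in for the exact Kolmogorov-Smirnov critical value, under the simplifying assumption that the null hypothesis is a standard normal with known mean and variance. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def dkw_qq_band(n, alpha=0.05):
    """KS-type band for the i-th ordered observation, i = 1..n.

    Returns (plotting_quantile, lower, upper) per order statistic,
    assuming a standard normal null (illustrative simplification).
    """
    std = NormalDist()
    # DKW: sup |F_n - F| <= eps with probability at least 1 - alpha.
    eps = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))
    rows = []
    for i in range(1, n + 1):
        lo_p = i / n - eps          # F(x_(i)) >= F_n(x_(i)) - eps
        hi_p = (i - 1) / n + eps    # F(x_(i)) <= F_n(x_(i)-) + eps
        lower = std.inv_cdf(lo_p) if lo_p > 0.0 else -math.inf
        upper = std.inv_cdf(hi_p) if hi_p < 1.0 else math.inf
        rows.append((std.inv_cdf((i - 0.5) / n), lower, upper))
    return rows


band = dkw_qq_band(100)
print(band[0], band[49], band[-1])
```

For n = 100 the band is finite near the median but unbounded on one side for the most extreme order statistics, which is the sense in which a band of this type says little about tail behavior.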