Boosting insights in insurance tariff plans with tree-based machine learning methods
Pricing actuaries typically operate within the framework of generalized
linear models (GLMs). With the upswing of data analytics, our study focuses
on machine learning methods to develop full tariff plans built from both the
frequency and severity of claims. We adapt the loss functions used in the
algorithms such that the specific characteristics of insurance data are
carefully incorporated: highly unbalanced count data with excess zeros and
varying exposure on the frequency side combined with scarce, but potentially
long-tailed data on the severity side. A key requirement is the need for
transparent and interpretable pricing models which are easily explainable to
all stakeholders. We therefore focus on machine learning with decision trees:
starting from simple regression trees, we work towards more advanced ensembles
such as random forests and boosted trees. We show how to choose the optimal
tuning parameters for these models in an elaborate cross-validation scheme,
present visualization tools to obtain insights from the resulting models, and
evaluate the economic value of these new modeling approaches. Boosted trees
outperform the classical GLMs, allowing the insurer to form profitable
portfolios and to guard against potential adverse risk selection.
Optimization of Signal Significance by Bagging Decision Trees
An algorithm for optimization of signal significance or any other
classification figure of merit suited for analysis of high energy physics (HEP)
data is described. This algorithm trains decision trees on many bootstrap
replicas of training data with each tree required to optimize the signal
significance or any other chosen figure of merit. New data are then classified
by a simple majority vote of the built trees. The performance of this algorithm
has been studied using a search for the radiative leptonic decay B->gamma l nu
at BaBar and shown to be superior to that of all other attempted classifiers
including such powerful methods as boosted decision trees. In the B->gamma e nu
channel, the described algorithm increases the expected signal significance
from 2.4 sigma obtained by an original method designed for the B->gamma l nu
analysis to 3.0 sigma. Comment: 8 pages, 2 figures, 1 table.
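The bagging-and-vote scheme can be sketched in a few lines. Note that the off-the-shelf trees below split on Gini impurity, whereas the algorithm described trains each tree to optimize a significance figure of merit such as s/sqrt(s+b); the data here are synthetic, and the sketch illustrates only the bootstrap-replica training and majority vote:

```python
# Illustrative sketch (not the paper's implementation): decision trees
# trained on bootstrap replicas, combined by simple majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

n_trees = 25
trees = []
for _ in range(n_trees):
    idx = rng.integers(0, n, size=n)        # bootstrap replica of training data
    trees.append(DecisionTreeClassifier(max_depth=4).fit(X[idx], y[idx]))

votes = np.sum([t.predict(X) for t in trees], axis=0)
y_hat = (votes > n_trees / 2).astype(int)   # simple majority vote

def significance(s, b):
    """Signal significance figure of merit s / sqrt(s + b), which the
    paper's trees would optimize at each split instead of Gini."""
    return s / np.sqrt(s + b) if s + b > 0 else 0.0
```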
Efficient Version-Space Reduction for Visual Tracking
Discriminative trackers employ a classification approach to separate the
target from its background. To cope with variations of the target's shape and
appearance, the classifier is updated online with different samples of the
target and the background. Sample selection, labeling, and classifier updating
are prone to various sources of error that drift the tracker. We
introduce the use of an efficient version space shrinking strategy to reduce
the labeling errors and enhance its sampling strategy by measuring the
uncertainty of the tracker about the samples. The proposed tracker utilizes an
ensemble of classifiers that represent different hypotheses about the target,
diversifies them using boosting to provide a larger and more consistent
coverage of the version space, and tunes the classifiers' weights in voting.
system adjusts the model update rate by promoting the co-training of the
short-memory ensemble with a long-memory oracle. The proposed tracker
outperformed state-of-the-art trackers on different sequences bearing various
tracking challenges. Comment: CRV'17 Conference.
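The weighted-voting mechanism at the core of the ensemble can be sketched generically. This is illustrative only, with made-up scores and weights rather than anything from the paper:

```python
# Illustrative sketch: an ensemble of classifiers combined by weighted
# voting, the mechanism the abstract describes for covering the version
# space. Scores and weights below are hypothetical.
import numpy as np

def weighted_vote(scores, weights):
    """Combine per-classifier scores (+1 target / -1 background) by
    weighted voting; returns the sign of the weighted sum per sample."""
    scores = np.asarray(scores, dtype=float)   # shape (n_classifiers, n_samples)
    weights = np.asarray(weights, dtype=float)
    return np.sign(weights @ scores)

# three hypothetical ensemble members scoring four candidate patches
scores = [[+1, -1, +1, +1],
          [+1, +1, -1, +1],
          [-1, -1, -1, +1]]
weights = [0.6, 0.3, 0.1]                      # tuned by the tracker
labels = weighted_vote(scores, weights)
```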
California Almond Yield Prediction at the Orchard Level With a Machine Learning Approach.
California's almond growers face challenges with nitrogen management as new legislatively mandated nitrogen management strategies for almond have been implemented. These regulations require that growers apply nitrogen to meet, but not exceed, the annual N demand for crop and tree growth and nut production. To accurately predict seasonal nitrogen demand, growers therefore need to estimate block-level almond yield early in the growing season so that timely N management decisions can be made. However, methods to predict almond yield are not currently available. To fill this gap, we developed statistical models using Stochastic Gradient Boosting, a machine learning approach, for early-season yield projection and mid-season yield updates over individual orchard blocks. We collected yield records of 185 orchards, dating back to 2005, from major almond growers in the Central Valley of California. A large set of variables was extracted as predictors, including weather and orchard characteristics from remote sensing imagery. Our results showed that the predicted orchard-level yield agreed well with the independent yield records. For both the early-season (March) and mid-season (June) predictions, a coefficient of determination (R²) of 0.71 and a ratio of performance to interquartile distance (RPIQ) of 2.6 were found on average. We also identified several key determinants of yield from the modeling results. Almond yield generally increased dramatically with orchard age until about 7 years of age; higher long-term mean maximum temperature during April-June enhanced yield in the southern orchards, while a larger amount of precipitation in March reduced yield, especially in northern orchards. Remote sensing metrics such as annual maximum vegetation indices were also dominant variables for predicting yield potential.
While these results are promising, further refinement is needed; the availability of larger data sets and the incorporation of additional variables and methodologies will be required for the model to be used as a fertilization decision support tool for growers. Our study has demonstrated the potential of automated almond yield prediction to help growers manage N adaptively, comply with mandated requirements, and ensure industry sustainability.
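The modeling and evaluation pipeline can be sketched as follows. The features are synthetic stand-ins for the kinds of predictors named above (orchard age, spring temperature, March precipitation, peak vegetation index), not the authors' data, and `subsample=0.5` is what makes the gradient boosting "stochastic":

```python
# Minimal sketch, assuming synthetic stand-ins for the paper's predictors.
# Shows Stochastic Gradient Boosting plus the two reported metrics, R^2 and
# RPIQ (interquartile range of observations divided by RMSE of predictions).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 600
age = rng.uniform(3, 25, size=n)             # orchard age, years
temp = rng.normal(24, 2, size=n)             # mean max temperature, Apr-Jun
precip = rng.uniform(0, 120, size=n)         # March precipitation, mm
ndvi = rng.uniform(0.5, 0.9, size=n)         # annual maximum vegetation index
yield_ = (np.minimum(age, 7) * 150 + 40 * temp - 5 * precip
          + 2000 * ndvi + rng.normal(scale=150, size=n))
X = np.column_stack([age, temp, precip, ndvi])

X_tr, X_te, y_tr, y_te = train_test_split(X, yield_, random_state=0)
model = GradientBoostingRegressor(subsample=0.5, random_state=0)  # stochastic
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

r2 = r2_score(y_te, pred)
q1, q3 = np.percentile(y_te, [25, 75])
rpiq = (q3 - q1) / np.sqrt(np.mean((y_te - pred) ** 2))  # IQR / RMSE
```

RPIQ is scale-free like R², but normalizes the error by the spread of the observations, which makes it comparable across orchards with different yield ranges.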
VAT tax gap prediction: a 2-steps Gradient Boosting approach
Tax evasion is the illegal evasion of taxes by individuals, corporations, and
trusts. The revenue loss from tax avoidance can undermine the effectiveness and
equity of government policies. A standard measure of tax evasion is the tax
gap, which can be estimated as the difference between the total amount of tax
theoretically collectable and the total amount of tax actually collected in a
given period. This paper presents an original contribution to the bottom-up
approach, based on results from fiscal audits, through the use of machine
learning. The major disadvantage of bottom-up approaches is selection bias
when audited taxpayers are not randomly selected, as in the case of audits
performed by the Italian Revenue Agency. Our proposal, based on a 2-steps
Gradient Boosting model, produces a robust tax gap estimate and embeds a
solution that corrects for the selection bias without requiring any
assumptions on the underlying data distribution. The 2-steps Gradient Boosting
approach is used to estimate the Italian value-added tax (VAT) gap for
individual firms on the basis of fiscal and administrative income tax return
data gathered from the Tax Administration Data Base for the fiscal year 2011.
The proposed method significantly boosts predictive performance with respect
to classical parametric approaches. Comment: 27 pages, 4 figures, 8 tables.
Presented at the NTTS 2019 conference. Under review at another peer-reviewed
journal.
State-of-the-art on research and applications of machine learning in the building life cycle
Fueled by big data, powerful and affordable computing resources, and advanced algorithms, machine learning has been explored and applied in buildings research over the past decades and has demonstrated its potential to enhance building performance. This study systematically surveyed how machine learning has been applied at different stages of the building life cycle. By conducting a literature search on the Web of Knowledge platform, we found 9579 papers in this field and selected 153 papers for an in-depth review. The number of published papers is increasing year by year, with a focus on building design, operation, and control. However, no study was found using machine learning in building commissioning. There are successful pilot studies on fault detection and diagnosis of HVAC equipment and systems, load prediction, energy baseline estimation, load shape clustering, occupancy prediction, and learning occupant behaviors and energy use patterns. None of the existing studies have been adopted broadly by the building industry, due to common challenges including (1) lack of large-scale labeled data to train and validate the model, (2) lack of model transferability, which prevents a model trained on one data-rich building from being used in another building with limited data, (3) lack of strong justification of the costs and benefits of deploying machine learning, and (4) performance that might not be reliable and robust for the stated goals, as a method might work for some buildings but not generalize to others. Findings from the study can inform future machine learning research to improve occupant comfort, energy efficiency, demand flexibility, and resilience of buildings, as well as inspire young researchers in the field to explore multidisciplinary approaches that integrate building science, computing science, data science, and social science.