
    Fast Approximation of the Shapley Values Based on Order-of-Addition Experimental Designs

    The Shapley value is originally a concept in econometrics used to fairly distribute both gains and costs among players in a coalition game. In recent decades, its application has been extended to other areas such as marketing, engineering, and machine learning. For example, it produces reasonable solutions for problems in sensitivity analysis, local model explanation in interpretable machine learning, node importance in social networks, attribution models, etc. However, its heavy computational burden has long been recognized but rarely investigated. Specifically, in a d-player coalition game, calculating a Shapley value requires the evaluation of d! or 2^d marginal contribution values, depending on whether we take the permutation or the combination formulation of the Shapley value. Hence it becomes infeasible to calculate the Shapley value when d is reasonably large. A common remedy is to take a random sample of the permutations as a surrogate for the complete list of permutations. We find that an advanced sampling scheme can be designed to yield much more accurate estimates of the Shapley value than simple random sampling (SRS). Our sampling scheme is based on combinatorial structures from the field of design of experiments (DOE), particularly order-of-addition experimental designs for studying how the ordering of components affects the output. We show that the obtained estimates are unbiased and can sometimes deterministically recover the original Shapley value. Both theoretical and simulation results show that our DOE-based sampling scheme outperforms SRS in terms of estimation accuracy. Surprisingly, it is also slightly faster than SRS. Lastly, real data analyses are conducted for the C. elegans nervous system and the 9/11 terrorist network.
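    As a concrete reference point for the permutation formulation and the SRS baseline the abstract contrasts with, here is a minimal Python sketch; the DOE/order-of-addition sampler itself is not reproduced, and all names (shapley_exact, shapley_srs, the toy game) are illustrative assumptions rather than the authors' code.

```python
import itertools
import random
from typing import Callable, Dict, Sequence

def shapley_exact(players: Sequence[int], v: Callable[[frozenset], float]) -> Dict[int, float]:
    """Exact Shapley values via the permutation formula: d! coalition evaluations."""
    phi = {i: 0.0 for i in players}
    perms = list(itertools.permutations(players))
    for perm in perms:
        coalition = frozenset()
        for i in perm:
            # marginal contribution of player i given the players preceding it in perm
            phi[i] += v(coalition | {i}) - v(coalition)
            coalition = coalition | {i}
    return {i: phi[i] / len(perms) for i in players}

def shapley_srs(players: Sequence[int], v: Callable[[frozenset], float],
                n_perms: int = 1000, seed: int = 0) -> Dict[int, float]:
    """SRS estimate: average marginal contributions over randomly sampled permutations.
    Unbiased, but typically higher variance than a structured (e.g. DOE-based) sample."""
    rng = random.Random(seed)
    phi = {i: 0.0 for i in players}
    for _ in range(n_perms):
        perm = list(players)
        rng.shuffle(perm)
        coalition = frozenset()
        for i in perm:
            phi[i] += v(coalition | {i}) - v(coalition)
            coalition = coalition | {i}
    return {i: phi[i] / n_perms for i in players}

if __name__ == "__main__":
    # Toy 3-player game, purely illustrative: v(S) = (sum of S)^2.
    game = lambda S: float(sum(S)) ** 2
    print(shapley_exact((1, 2, 3), game))
    print(shapley_srs((1, 2, 3), game, n_perms=500))
```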

    Risk controlled decision trees and random forests for precision medicine

    Statistical methods generating individualized treatment rules (ITRs) often focus on maximizing expected benefit, but these rules may expose patients to excess risk. For instance, aggressive treatment of type 2 diabetes (T2D) with insulin therapies may result in an ITR that controls blood glucose levels but increases rates of hypoglycemia, diminishing the appeal of the ITR. This work proposes two methods to identify risk-controlled ITRs (rcITR), a class of ITRs that maximizes benefit while controlling risk at a prespecified threshold. A novel penalized recursive partitioning algorithm is developed which optimizes an unconstrained, penalized value function. The final rule is a risk-controlled decision tree (rcDT) that is easily interpretable. A natural extension of the rcDT model, risk-controlled random forests (rcRF), is also proposed. Simulation studies demonstrate the robustness of rcRF modeling. Three variable importance measures are proposed to further guide clinical decision-making. Both rcDT and rcRF procedures can be applied to data from randomized controlled trials or observational studies. An extensive simulation study interrogates the performance of the proposed methods. A data analysis of the DURABLE diabetes trial, in which two therapeutics were compared, is additionally presented. An R package implements the proposed methods (https://github.com/kdoub5ha/rcITR).
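    As a rough illustration of the "unconstrained, penalized value function" idea (not the rcDT/rcRF algorithm itself), a candidate rule can be scored by an inverse-probability-weighted benefit value with a penalty on excess risk; the function names, threshold, and penalty weight below are assumptions made for this sketch.

```python
import numpy as np

def ipw_value(outcome, treatment, propensity, recommended):
    """Inverse-probability-weighted value of a rule: the estimated mean outcome
    if every patient followed the rule's recommendation."""
    follows = (np.asarray(treatment) == np.asarray(recommended)).astype(float)
    w = follows / np.asarray(propensity)
    return float(np.sum(w * np.asarray(outcome)) / np.sum(w))

def penalized_value(benefit, risk, treatment, propensity, recommended,
                    risk_threshold=0.05, penalty=10.0):
    """Benefit value minus a penalty that activates only when the rule's
    estimated risk exceeds the prespecified threshold."""
    vb = ipw_value(benefit, treatment, propensity, recommended)
    vr = ipw_value(risk, treatment, propensity, recommended)
    return vb - penalty * max(0.0, vr - risk_threshold)
```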

    Net benefit index: Assessing the influence of a biomarker for individualized treatment rules

    One central task in precision medicine is to establish individualized treatment rules (ITRs) for patients with heterogeneous responses to different therapies. Motivated by a randomized clinical trial comparing two drugs, pioglitazone and gliclazide, in Type 2 diabetic patients, we consider the problem of utilizing promising candidate biomarkers to improve an existing ITR. This calls for a biomarker evaluation procedure that can gauge the added value of individual biomarkers. We propose an assessment analytic, termed the net benefit index (NBI), that quantifies the contrast between the gain and loss of treatment benefits when a biomarker enters the ITR to reallocate patients to treatments. We optimize reallocation schemes via outcome weighted learning (OWL), from which the optimal treatment group labels are generated by a weighted support vector machine (SVM). To account for sampling uncertainty in assessing a biomarker, we propose an NBI-based test for a significant improvement over the existing ITR, where the empirical null distribution is constructed via stratified permutation by treatment arms. Applying NBI to the motivating diabetes trial, we found that baseline fasting insulin is an important biomarker that leads to an improvement over an existing ITR based only on the patient's baseline fasting plasma glucose (FPG), age, and body mass index (BMI) to reduce FPG over a period of 52 weeks.
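    The stratified permutation test could look roughly like the sketch below, under the assumption that the candidate biomarker is permuted within each treatment arm and that improvement_stat stands in for the NBI-based improvement statistic (the NBI itself and the OWL/weighted-SVM fit are defined in the paper, not here).

```python
import numpy as np

def stratified_permutation_pvalue(improvement_stat, outcome, arm, biomarker,
                                  n_perm=2000, seed=1):
    """Permutation p-value where the candidate biomarker is shuffled within each
    treatment arm, so the arm structure of the trial is preserved under the null."""
    rng = np.random.default_rng(seed)
    biomarker = np.asarray(biomarker)
    arm = np.asarray(arm)
    observed = improvement_stat(outcome, arm, biomarker)
    null = np.empty(n_perm)
    for b in range(n_perm):
        permuted = biomarker.copy()
        for a in np.unique(arm):
            idx = np.flatnonzero(arm == a)        # permute only within this arm
            permuted[idx] = rng.permutation(permuted[idx])
        null[b] = improvement_stat(outcome, arm, permuted)
    return (1 + np.count_nonzero(null >= observed)) / (1 + n_perm)
```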

    A Data Encryption and Approximate Recovery Strategy Based on Double Random Sorting and CGAN

    Measurement data in power systems may be attacked and tampered with during transmission. To address this problem, a data encryption and approximate recovery strategy based on double random sorting and CGAN is proposed. The double random sorting encryption algorithm is designed around the plaintext form of the measurement data. The randomness of the pseudo-random numbers, the random sequences, and the random-sequence insertion ensures the uncertainty of the number of data blocks, the sorting order, and the insertion locations of the random sequences, which improves the security of the measurement data. In addition, an approximate recovery strategy based on the CGAN is proposed to approximately recover abnormal samples, which ensures the accuracy of decryption. Finally, the feasibility and effectiveness of the proposed method are analyzed on historical wind power and photovoltaic power datasets.
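    The following Python sketch only illustrates the general double-random-sorting idea (key-derived block count, shuffled block order, random-sequence insertion); it is not the paper's algorithm, and the CGAN-based approximate recovery of tampered samples is not sketched here.

```python
import random

def double_random_sort_encrypt(values, key):
    """Conceptual sketch: (1) split the plaintext measurement sequence into a
    key-derived number of blocks, (2) shuffle the block order, (3) insert a random
    decoy sequence at a key-derived position. Returns ciphertext plus the metadata
    an authorized receiver would need for exact recovery."""
    rng = random.Random(key)
    n_blocks = rng.randint(2, max(2, len(values) // 2))
    size = -(-len(values) // n_blocks)                     # ceiling division
    blocks = [values[i:i + size] for i in range(0, len(values), size)]
    order = list(range(len(blocks)))
    rng.shuffle(order)                                     # first randomization: block order
    shuffled = [blocks[i] for i in order]
    decoy = [rng.uniform(min(values), max(values)) for _ in range(size)]
    pos = rng.randint(0, len(shuffled))                    # second randomization: decoy position
    shuffled.insert(pos, decoy)
    cipher = [x for blk in shuffled for x in blk]
    meta = {"order": order, "decoy_pos": pos, "block_size": size, "length": len(values)}
    return cipher, meta
```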

    Climatic and human impact on the environment: Insight from the tetraether lipid temperature reconstruction in the Beibu Gulf, China

    Organic proxies have been widely used to reconstruct temperature and terrestrial input variations at the interface between land and sea. However, the impact of human activities on these proxies has been little constrained, which may lead to misinterpretation of sediment profiles. To study the climatic and human impacts on the environment and on proxies in coastal regions, glycerol dialkyl glycerol tetraethers (GDGTs) were analyzed in a sediment core spanning 115 years from the Beibu Gulf, China. The results indicate that isoprenoid GDGTs in the Beibu Gulf were mainly derived from Thaumarchaeota, while the branched GDGTs originated mainly from terrestrial soil bacteria. The TEX86 index was applied to reconstruct the temporal SST variation. The results show that the variation of TEX86-derived SST is controlled mainly by the East Asian Monsoon system and partly by ENSO events. The mean annual air temperature (MAAT) derived from the methylation index of branched tetraethers and the cyclization index of branched tetraethers (MBT/CBT) agreed with the MAAT record at the Qinzhou Station before 1982 AD. However, population growth and economic development after 1982 AD reduced terrestrial organic matter input and increased nutrients, leading to the thriving of Thaumarchaeota and an increasing proportion of in situ produced branched GDGTs, which created the scatter in the MBT/CBT-reconstructed MAAT after 1982 AD. Therefore, the human influence on GDGT proxies can be constrained in this way to avoid misinterpretation of paleoclimate and paleoenvironment records.
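    For orientation on the proxy mentioned above, the TEX86 index is conventionally computed from the relative abundances of isoprenoid GDGTs; the formula below is the standard literature definition rather than anything stated in this abstract, and the calibration converting TEX86 to SST varies between studies.

```latex
% Standard TEX86 definition from the GDGT literature; [Cren'] denotes the
% crenarchaeol regio-isomer. The TEX86-to-SST calibration is study-dependent
% and is omitted here.
\[
\mathrm{TEX}_{86} =
  \frac{[\mathrm{GDGT\text{-}2}] + [\mathrm{GDGT\text{-}3}] + [\mathrm{Cren}']}
       {[\mathrm{GDGT\text{-}1}] + [\mathrm{GDGT\text{-}2}] + [\mathrm{GDGT\text{-}3}] + [\mathrm{Cren}']}
\]
```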

    An Algorithm for Generating Individualized Treatment Decision Trees and Random Forests

    With new treatments and novel technology available, precision medicine has become a key topic in the new era of healthcare. Traditional statistical methods for precision medicine focus on subgroup discovery through identifying interactions between a few markers and treatment regimes. However, given the large scale and high dimensionality of modern datasets, it is difficult to detect the interactions between treatment and high-dimensional covariates. Recently, novel approaches have emerged that seek to directly estimate individualized treatment rules (ITR) via maximizing the expected clinical reward by using, for example, support vector machines (SVM) or decision trees. The latter enjoys great popularity in clinical practice due to its interpretability. In this article, we propose a new reward function and a novel decision tree algorithm to directly maximize rewards. We further improve a single tree decision rule by an ensemble decision tree algorithm, ITR random forests. Our final decision rule is an average over single decision trees and it is a soft probability rather than a hard choice. Depending on how strong the treatment recommendation is, physicians can make decisions based on our model along with their own judgment and experience. Performance of ITR forest and tree methods is assessed through simulations along with applications to a randomized controlled trial (RCT) of 1385 patients with diabetes and an EMR cohort of 5177 patients with diabetes. ITR forest and tree methods are implemented using the statistical software R (https://github.com/kdoub5ha/ITR.Forest). Supplementary materials for this article are available online.
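    A minimal sketch of the "soft probability" aggregation described above, assuming each fitted tree exposes a predict method returning 0/1 treatment recommendations; this hypothetical interface is not the ITR.Forest API, and the actual implementation lives in the R package linked in the abstract.

```python
import numpy as np

def forest_treatment_probability(trees, X):
    """Average the 0/1 treatment recommendations of individual trees into a soft
    score in [0, 1]; e.g. 0.85 indicates a strong recommendation for treatment 1."""
    votes = np.column_stack([tree.predict(X) for tree in trees])
    return votes.mean(axis=1)

def recommend(trees, X, threshold=0.5):
    """Convert the soft score into a hard rule only when a decision is required;
    scores near 0.5 flag cases where clinical judgment should carry more weight."""
    return (forest_treatment_probability(trees, X) >= threshold).astype(int)
```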