
    Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values

    This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: noisy, with missing entries, and with imbalance in the classes of interest, leading to serious bias in predictive modeling. Since standard data mining methods often produce poor performance measures, we argue for the development of specialized techniques for data preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework combining the cost-sensitive SVM with the expectation-maximization (EM) imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values, as well as on real data in health applications, and show that our multilevel SVM-based method produces faster, more accurate, and more robust classification results.
    Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
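    The "imputation by iterated regression" idea mentioned in this abstract can be sketched in a few lines: start from a crude fill, then repeatedly re-estimate each incomplete column by regressing it on the others. The function below is a hypothetical minimal illustration of that loop (not the paper's algorithm, which pairs imputation with a multilevel cost-sensitive SVM):

    ```python
    import numpy as np

    def iterative_regression_impute(X, n_iter=10):
        """EM-style sketch: fill missing entries with column means, then
        repeatedly re-predict each incomplete column via linear regression
        on the remaining columns, fitted on the observed rows."""
        X = np.array(X, dtype=float)
        missing = np.isnan(X)
        # Initial fill: mean of the observed entries in each column.
        col_means = np.nanmean(X, axis=0)
        X[missing] = np.take(col_means, np.where(missing)[1])
        for _ in range(n_iter):
            for j in range(X.shape[1]):
                rows = missing[:, j]
                if not rows.any():
                    continue
                others = np.delete(X, j, axis=1)
                A = np.column_stack([others, np.ones(len(X))])  # add intercept
                # Fit only on rows where column j was actually observed.
                coef, *_ = np.linalg.lstsq(A[~rows], X[~rows, j], rcond=None)
                X[rows, j] = A[rows] @ coef  # refresh the imputed entries
        return X
    ```

    With strongly correlated columns, a few sweeps are typically enough for the imputed values to stabilize near the regression predictions.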

    Estimating Causal Effects with Matching Methods in the Presence and Absence of Bias Cancellation

    This paper explores the implications of possible bias cancellation using Rubin-style matching methods with complete and incomplete data. After reviewing the naïve causal estimator and the approaches of Heckman and Rubin to the causal estimation problem, we show how missing data can complicate the estimation of average causal effects in different ways, depending upon the nature of the missing mechanism. While - contrary to published assertions in the literature - bias cancellation does not generally occur when the multivariate distribution of the errors is symmetric, bias cancellation has been observed to occur for the case where selection into training is the treatment variable, and earnings is the outcome variable. A substantive rationale for bias cancellation is offered, which conceptualizes bias cancellation as the result of a mixture process based on two distinct individual-level decision-making models. While the general properties are unknown, the existence of bias cancellation appears to reduce the average bias in both OLS and matching methods relative to the symmetric distribution case. Analysis of simulated data under a set of different scenarios suggests that matching methods do better than OLS in reducing that portion of bias that comes purely from the error distribution (i.e., from "selection on unobservables"). This advantage often holds for the incomplete data case as well. Matching appears to offer no advantage over OLS in reducing the impact of bias due purely to selection on unobservable variables when the error variables are generated by standard multivariate normal distributions, which lack the bias-cancellation property.

    A Posterior Probability Approach for Gene Regulatory Network Inference in Genetic Perturbation Data

    Inferring gene regulatory networks is an important problem in systems biology. However, these networks can be hard to infer from experimental data because of the inherent variability in biological data as well as the large number of genes involved. We propose a fast, simple method for inferring regulatory relationships between genes from knockdown experiments in the NIH LINCS dataset by calculating posterior probabilities, incorporating prior information. We show that the method is able to find previously identified edges from TRANSFAC and JASPAR, and we discuss the merits and limitations of this approach.
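    The core posterior-probability step described in this abstract amounts to a Bayes update per candidate edge. The sketch below is illustrative only: the Gaussian likelihoods, the prior, and all parameter values are assumptions for demonstration, not the paper's calibration against LINCS data:

    ```python
    import math

    def edge_posterior(z, prior=0.1, mu_edge=-2.0, sigma_edge=1.0, sigma_null=1.0):
        """Posterior probability that a regulator->target edge exists, given the
        target gene's expression z-score after knocking down the regulator.
        Hypothesis 'edge': z ~ N(mu_edge, sigma_edge^2) (knockdown shifts the target).
        Hypothesis 'no edge': z ~ N(0, sigma_null^2) (target is unaffected)."""
        def normal_pdf(x, mu, s):
            return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        like_edge = normal_pdf(z, mu_edge, sigma_edge)
        like_null = normal_pdf(z, 0.0, sigma_null)
        num = prior * like_edge
        return num / (num + (1 - prior) * like_null)
    ```

    A strongly shifted z-score pushes the posterior toward 1, while a z-score near zero leaves it below the prior; known TRANSFAC/JASPAR edges could enter through a higher per-edge prior.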

    The use of multilegged arguments to increase confidence in safety claims for software-based systems: A study based on a BBN analysis of an idealized example

    The work described here concerns the use of so-called multi-legged arguments to support dependability claims about software-based systems. The informal justification for the use of multi-legged arguments is similar to that used to support the use of multi-version software in pursuit of high reliability or safety. Just as a diverse, 1-out-of-2 system might be expected to be more reliable than each of its two component versions, so a two-legged argument might be expected to give greater confidence in the correctness of a dependability claim (e.g. a safety claim) than would either of the argument legs alone. Our intention here is to treat these argument structures formally, in particular by presenting a formal probabilistic treatment of ‘confidence’, which will be used as a measure of efficacy. This will enable claims for the efficacy of the multi-legged approach to be made quantitatively, answering questions such as ‘How much extra confidence about a system’s safety will I have if I add a verification argument leg to an argument leg based upon statistical testing?’ For this initial study, we concentrate on a simplified and idealized example of a safety system in which interest centres upon a claim about the probability of failure on demand. Our approach is to build a BBN (“Bayesian Belief Network”) model of a two-legged argument, and manipulate this analytically via parameters that define its node probability tables. The aim here is to obtain greater insight than is afforded by the more usual BBN treatment, which involves merely numerical manipulation. We show that the addition of a diverse second argument leg can, indeed, increase confidence in a dependability claim: in a reasonably plausible example the doubt in the claim is reduced to one third of the doubt present in the original single leg. 
However, we also show that there can be some unexpected and counter-intuitive subtleties here; for example, an entirely supportive second leg can sometimes undermine an original argument, resulting overall in less confidence than came from this original argument. Our results are neutral on the issue of whether such difficulties will arise in real life - i.e., when real experts judge real systems.
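    The basic "extra leg reduces doubt" effect can be reproduced with a toy Bayes calculation. The sketch below assumes the two legs are conditionally independent given the claim, which is a simplification; the paper's BBN analysis does not require this, and its counter-intuitive cases arise precisely when the legs interact through shared nodes:

    ```python
    def posterior_two_legs(prior, p_e1_given_c, p_e1_given_not_c,
                           p_e2_given_c, p_e2_given_not_c):
        """Confidence in a dependability claim C after both argument legs
        report supportive evidence E1 and E2, with P(Ei | C) and P(Ei | not C)
        given, and E1, E2 treated as conditionally independent given C."""
        num = prior * p_e1_given_c * p_e2_given_c
        den = num + (1 - prior) * p_e1_given_not_c * p_e2_given_not_c
        return num / den
    ```

    Setting the second leg's likelihoods to 1 recovers the single-leg posterior, so the doubt reduction from adding a diverse leg can be read off directly by comparing the two calls.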

    Capturing Preferences Under Incomplete Scenarios Using Elicited Choice Probabilities.

    Manski (1999) proposed an approach for dealing with a particular form of respondent uncertainty in discrete choice settings, particularly relevant in survey-based research when the uncertainty stems from the incomplete description of the choice scenarios. Specifically, he suggests eliciting choice probabilities from respondents rather than their single choice of an alternative. A recent paper in IER by Blass et al. (2010) further develops the approach and presents the first empirical application. This paper extends the literature in a number of directions, examining the linkage between elicited choice probabilities and the more common discrete choice elicitation format. We also provide the first convergent validity test of the elicited choice probability format vis-à-vis the standard discrete choice format in a split sample experiment. Finally, we discuss the differences between welfare measures that can be derived from elicited choice probabilities versus those that can be obtained from discrete choice responses.
    Keywords: discrete choice; elicited choice probabilities
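    Working with elicited choice probabilities typically starts by transforming each response to log-odds, which linearizes the link to the underlying utility difference in a logit-style model. The helper below is a minimal sketch; the clipping rule for boundary answers of exactly 0 or 1 (and the choice of eps) is our assumption, not a detail taken from the papers cited above:

    ```python
    import math

    def log_odds(p, eps=0.01):
        """Map an elicited choice probability p in [0, 1] to log-odds
        log(p / (1 - p)). Responses at the boundaries are clipped to
        [eps, 1 - eps] so the transform stays finite; eps = 0.01 is an
        arbitrary illustrative choice."""
        p = min(max(p, eps), 1.0 - eps)
        return math.log(p / (1.0 - p))
    ```

    A respondent who reports a 50% chance maps to log-odds of zero; the transformed responses can then be regressed on attribute differences between the alternatives.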

    Clouds, p-boxes, fuzzy sets, and other uncertainty representations in higher dimensions

    Uncertainty modeling in real-life applications faces serious problems such as the curse of dimensionality and a lack of sufficient statistical data. In this paper we give a survey of methods for uncertainty handling and describe the latest progress toward real-life applications with respect to these problems. We compare different methods and highlight their relationships. We give an intuitive introduction to the concept of potential clouds, our latest approach, which successfully copes with both higher dimensions and incomplete information.