
    Variable selection with error control: Another look at stability selection

    Stability Selection was recently introduced by Meinshausen and Bühlmann (2010) as a very general technique designed to improve the performance of a variable selection algorithm. It is based on aggregating the results of applying a selection procedure to subsamples of the data. We introduce a variant, called Complementary Pairs Stability Selection (CPSS), and derive bounds both on the expected number of variables included by CPSS that have low selection probability under the original procedure, and on the expected number of high selection probability variables that are excluded. These results require no assumptions (e.g. exchangeability) on the underlying model or on the quality of the original selection procedure. Under reasonable shape restrictions, the bounds can be further tightened, yielding improved error control and thereby increasing the applicability of the methodology. This is the accepted manuscript version; the final published version is available from Wiley at http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2011.01034.x/abstract
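As a rough illustration of the subsampling scheme, the sketch below draws complementary half-samples and records empirical selection frequencies. The base selector `top_q_corr`, the function names, and all parameter values are hypothetical stand-ins for illustration; in practice the paper's error bounds, rather than an ad hoc 0.9 cut-off, should guide the threshold.

```python
import numpy as np

def cpss_frequencies(X, y, select, B=50, q=3, seed=None):
    """Complementary Pairs Stability Selection (sketch).

    `select(X, y, q)` is any base procedure returning the indices of q
    selected variables; the name and signature are illustrative.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(B):
        perm = rng.permutation(n)
        half = n // 2
        # A complementary pair: two disjoint halves of the sample.
        for idx in (perm[:half], perm[half:2 * half]):
            counts[select(X[idx], y[idx], q)] += 1
    return counts / (2 * B)  # empirical selection frequency per variable

# Toy base selector: the q largest absolute marginal covariances with y.
def top_q_corr(X, y, q):
    return np.argsort(np.abs(X.T @ (y - y.mean())))[-q:]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = X[:, 0] + 0.1 * rng.standard_normal(100)
freq = cpss_frequencies(X, y, top_q_corr, B=25, q=3, seed=1)
stable = np.flatnonzero(freq >= 0.9)  # keep high-frequency variables
```

Here variable 0 carries nearly all the signal, so its selection frequency across subsamples is close to 1, while noise variables are selected only sporadically.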

    Random intersection trees

    Finding interactions between variables in large and high-dimensional datasets is often a serious computational challenge. Most approaches build up interaction sets incrementally, adding variables in a greedy fashion. The drawback is that potentially informative high-order interactions may be overlooked. Here, we propose an alternative approach for classification problems with binary predictor variables, called Random Intersection Trees. It works by starting with a maximal interaction that includes all variables, and then gradually removing variables if they fail to appear in randomly chosen observations of a class of interest. We show that informative interactions are retained with high probability, and that the computational complexity of our procedure is of order p^κ, where κ can reach values as low as 1 for very sparse data; in many more general settings, it will still beat the exponent s obtained when using a brute-force search constrained to order-s interactions. In addition, by using some new ideas based on min-wise hash schemes, we are able to further reduce the computational cost. Interactions found by our algorithm can be used for predictive modelling in various forms, but they are also often of interest in their own right as useful characterisations of what distinguishes a certain class from others. This is the author's accepted manuscript; the final version can be found in the Journal of Machine Learning Research at jmlr.csail.mit.edu/papers/volume15/shah14a/shah14a.pdf
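A minimal sketch of the intersection step: each path starts from the full variable set and intersects it with randomly chosen observations from the class of interest, so only interactions frequent in that class tend to survive. The real algorithm grows a branching tree and adds min-wise hashing; this sketch flattens it into independent paths, and all names and parameters are assumptions.

```python
import numpy as np

def random_intersection_paths(X_class, depth=8, n_paths=50, seed=None):
    """Core intersection step of Random Intersection Trees (sketch).

    X_class: binary 0/1 matrix of observations from the class of
    interest. Variables absent from a sampled observation are dropped
    from the candidate interaction.
    """
    rng = np.random.default_rng(seed)
    n, p = X_class.shape
    found = set()
    for _ in range(n_paths):
        S = np.arange(p)  # start from the maximal interaction
        for _ in range(depth):
            obs = X_class[rng.integers(n)]  # a random observation
            S = S[obs[S] == 1]              # keep variables it contains
        if len(S) > 0:
            found.add(tuple(int(j) for j in S))
    return found

rng = np.random.default_rng(0)
# Variables 0 and 1 are always active together in this class; the
# remaining variables are sparse noise that rarely survives 8 intersections.
X1 = (rng.random((200, 10)) < 0.3).astype(int)
X1[:, [0, 1]] = 1
patterns = random_intersection_paths(X1, seed=1)
```

A noise variable survives a path only if it appears in all `depth` sampled observations (probability 0.3^8 here), which is the sense in which informative interactions are retained while uninformative ones are discarded.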

    The xyz algorithm for fast interaction search in high-dimensional data

    When performing regression on a data set with p variables, it is often of interest to go beyond using main linear effects and include interactions as products of individual variables. For small-scale problems, these interactions can be computed explicitly, but doing so naively leads to a computational complexity of at least O(p^2), which can be prohibitive if p is very large. We introduce a new randomised algorithm that is able to discover interactions with high probability and, under mild conditions, has a runtime that is subquadratic in p. We show that strong interactions can be discovered in almost linear time, whilst finding weaker interactions requires O(p^α) operations for 1 < α < 2, depending on their strength. The underlying idea is to transform interaction search into a closest pair problem, which can be solved efficiently in subquadratic time. The algorithm is called xyz and is implemented in the language R. We demonstrate its efficiency in application to genome-wide association studies, where more than 10^11 interactions can be screened in under 280 seconds with a single-core 1.2 GHz CPU. Isaac Newton Trust Early Career Support Scheme
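The closest-pair reduction can be illustrated for ±1 data: if y is driven by the product x_j·x_k, then the columns x_j and z_k = x_k∘y agree on many rows, and hashing all columns by their values on small random row subsets throws such pairs into the same bucket with high probability, avoiding an O(p^2) scan. This is a simplified sketch; the function and parameter names are not those of the xyz package.

```python
import numpy as np
from collections import defaultdict

def interaction_candidates(X, y, n_repeats=20, subsample=8, seed=None):
    """Sketch of xyz's reduction of interaction search to closest pairs.

    X has entries in {-1, +1} and y in {-1, +1}. The pair (j, k) has a
    strong interaction when columns x_j and z_k = x_k * y agree often;
    bucketing both families of columns by their values on a random row
    subsample surfaces such pairs without comparing all pairs.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Z = X * y[:, None]
    candidates = set()
    for _ in range(n_repeats):
        rows = rng.choice(n, size=subsample, replace=False)
        buckets = defaultdict(lambda: ([], []))
        for j in range(p):
            buckets[tuple(X[rows, j])][0].append(j)  # x-columns
            buckets[tuple(Z[rows, j])][1].append(j)  # z-columns
        for xs, zs in buckets.values():
            candidates.update((j, k) for j in xs for k in zs if j != k)
    return candidates

rng = np.random.default_rng(0)
X = rng.choice([-1, 1], size=(200, 30))
y = X[:, 0] * X[:, 1]  # response is a pure two-way interaction
cands = interaction_candidates(X, y, seed=1)
```

For this pure interaction, x_0 and z_1 = x_1·y are identical columns, so they collide on every subsample and the pair (0, 1) is always surfaced.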

    Modelling Interactions in High-dimensional Data with Backtracking

    We study the problem of high-dimensional regression when there may be interacting variables. Approaches using sparsity-inducing penalty functions such as the Lasso can be useful for producing interpretable models. However, when the number of variables runs into the thousands, so that even two-way interactions number in the millions, these methods may become computationally infeasible. Typically, variable screening based on model fits using only main effects must be performed first. One problem with screening is that important variables may be missed if they are only useful for prediction when certain interaction terms are also present in the model. To tackle this issue, we introduce a new method we call Backtracking. It can be incorporated into many existing high-dimensional methods based on penalty functions, and works by iteratively building increasing sets of candidate interactions. Models fitted on the main effects and interactions selected early on in this process guide the selection of future interactions. By also making use of previous fits for computation, as well as performing calculations in parallel, the overall run-time of the algorithm can be greatly reduced. The effectiveness of our method when applied to regression and classification problems is demonstrated on simulated and real data sets. In the case of using Backtracking with the Lasso, we also give some theoretical support for our procedure.
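The flavour of the approach can be conveyed by a much-simplified loop: fit a lasso on the current candidate set, add products of currently active variables as new candidates, and refit. Everything below, including the plain coordinate-descent solver and all parameter values, is an illustrative stand-in rather than the Backtracking algorithm itself, which interleaves this with the penalty path and reuses earlier fits.

```python
import numpy as np
from itertools import combinations

def lasso_cd(M, y, lam, sweeps=100):
    """Plain coordinate-descent lasso (helper for the sketch)."""
    n, p = M.shape
    beta = np.zeros(p)
    for _ in range(sweeps):
        for j in range(p):
            r = y - M @ beta + M[:, j] * beta[j]  # partial residual
            z = M[:, j] @ r / n
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / (M[:, j] @ M[:, j] / n)
    return beta

def iterative_interaction_lasso(X, y, lam=0.1, rounds=2):
    """Grow candidate interactions from variables active in earlier fits."""
    feats = [(j,) for j in range(X.shape[1])]  # main effects only at first
    M = X.copy()
    for _ in range(rounds):
        beta = lasso_cd(M, y, lam)
        active = [feats[j] for j in np.flatnonzero(np.abs(beta) > 1e-6)]
        singles = sorted({j for f in active for j in f})
        for j, k in combinations(singles, 2):
            if (j, k) not in feats:  # new candidate interaction
                feats.append((j, k))
                M = np.column_stack([M, X[:, j] * X[:, k]])
    beta = lasso_cd(M, y, lam)
    return {feats[j] for j in np.flatnonzero(np.abs(beta) > 1e-6)}

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 15))
y = X[:, 0] + X[:, 1] + 2 * X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(200)
selected = iterative_interaction_lasso(X, y)
```

Because the main effects of variables 0 and 1 are selected in the first round, their product enters the candidate set in the next round without ever enumerating all O(p^2) interactions.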

    Right singular vector projection graphs: fast high dimensional covariance matrix estimation under latent confounding

    In this work we consider the problem of estimating a high-dimensional p × p covariance matrix Σ, given n observations of confounded data with covariance Σ + ΓΓ^T, where Γ is an unknown p × q matrix of latent factor loadings. We propose a simple and scalable estimator based on projection onto the right singular vectors of the observed data matrix, which we call RSVP. Our theoretical analysis of this method reveals that, in contrast to PCA-based approaches, RSVP is able to cope well with settings where the smallest eigenvalue of Γ^T Γ is close to the largest eigenvalue of Σ, as well as settings where the eigenvalues of Γ^T Γ are diverging fast. It is also able to handle data that may have heavy tails, requiring only that the data has an elliptical distribution. RSVP does not require knowledge or estimation of the number of latent factors q, but only recovers Σ up to an unknown positive scale factor. We argue this suffices in many applications, for example if an estimate of the correlation matrix is desired. We also show that by using subsampling, we can further improve the performance of the method. We demonstrate the favourable performance of RSVP through simulation experiments and an analysis of gene expression datasets collated by the GTEx consortium. Supported by an EPSRC First Grant and the Alan Turing Institute under the EPSRC grant EP/N510129/1
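To make the setting concrete, the following simulation (with made-up dimensions) draws confounded observations and confirms that the naive sample covariance targets Σ + ΓΓ^T rather than Σ, which is the bias any deconfounding estimator such as RSVP must correct. The code illustrates the confounding model only, not the RSVP estimator itself.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, n = 50, 3, 100_000  # illustrative dimensions

# True covariance Sigma (diagonal for simplicity) and latent loadings Gamma.
sigma_diag = rng.uniform(0.5, 1.5, size=p)
Gamma = rng.standard_normal((p, q))

# Confounded observation: x = Sigma^{1/2} e + Gamma h with independent
# standard normal noise e and latent factors h, so Cov(x) = Sigma + Gamma Gamma^T.
E = rng.standard_normal((n, p)) * np.sqrt(sigma_diag)
H = rng.standard_normal((n, q))
X = E + H @ Gamma.T

S = X.T @ X / n                                 # naive sample covariance
target = np.diag(sigma_diag) + Gamma @ Gamma.T  # what it actually estimates
err = np.abs(S - target).max()
```

Even with 100,000 observations the sample covariance converges to the confounded matrix, so no amount of data alone separates Σ from ΓΓ^T without structural assumptions.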

    Cigarette smoking in adolescents with asthma in Jordan: Impact of peer-led education in high schools

    Background: Peer-led smoking prevention programs focus on teaching adolescents, especially those with asthma, who are affected most by cigarettes, refusal skills to lower their intention to smoke. The purpose of this study was to determine the impact of a peer-led asthma education program on students who were smokers in terms of self-efficacy to resist smoking, asthma knowledge and asthma-related quality of life.

    Hypothyroidism in polycystic ovarian syndrome: a comparative study of clinical characteristics, metabolic and hormonal parameters in euthyroid and hypothyroid polycystic ovarian syndrome women

    Background: This study was conducted to examine the influence of hypothyroidism on the pathophysiology and features of polycystic ovarian syndrome (PCOS) with respect to clinical characteristics and hormonal and metabolic profile.
    Methods: 102 euthyroid PCOS and 18 hypothyroid PCOS women were included in this cross-sectional study after applying inclusion and exclusion criteria. The study subjects were assessed for various signs and symptoms such as recent weight gain, obesity, abnormal hair growth, hirsutism, hair loss, acne, acanthosis nigricans and infertility. Various hormonal and metabolic parameters were evaluated, viz. luteinizing hormone (LH), follicle stimulating hormone (FSH), LH:FSH ratio, testosterone, prolactin, dehydroepiandrosterone (DHEA), fasting insulin and fasting blood glucose. BMI and HOMA values were calculated.
    Results: The association of hirsutism, excessive hair growth, hair loss, acanthosis nigricans, acne and infertility was not significant between the two groups. The majority of patients in both groups were overweight/obese. BMI and the number of patients complaining of weight gain were significantly greater in hypothyroid PCOS women. While no statistical difference in LH, FSH, LH:FSH ratio, prolactin or testosterone levels was found, serum DHEA was significantly lower in the hypothyroid PCOS group. No statistical difference in fasting blood glucose or insulin levels was found between the two groups. Though both groups showed insulin resistance, HOMA values were significantly higher in hypothyroid PCOS women.
    Conclusions: The presence of hypothyroidism significantly increased the severity of insulin resistance as well as obesity in PCOS. This could have adverse metabolic consequences. Concurrent occurrence of both disorders could also possibly affect other features of PCOS, viz. hair loss and infertility.
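The abstract does not spell out the derived quantities; the standard formulas for BMI and the HOMA insulin-resistance index are as below (assuming glucose in mg/dL and insulin in µU/mL, which matches common lab reports but is an assumption here, as the study's units are not stated).

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms over height in metres squared."""
    return weight_kg / height_m ** 2

def homa_ir(glucose_mg_dl, insulin_uu_ml):
    """HOMA-IR = (fasting glucose [mg/dL] * fasting insulin [uU/mL]) / 405.
    Equivalent to glucose [mmol/L] * insulin / 22.5."""
    return glucose_mg_dl * insulin_uu_ml / 405

print(round(bmi(80, 1.65), 1))     # 29.4, in the overweight range
print(round(homa_ir(100, 12), 2))  # 2.96
```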

    Analysis of Attribute Selection and Classification Algorithm Applied to Hepatitis Patients

    Data mining techniques are widely used for classification, attribute selection and prediction in bioinformatics because they help to discover meaningful new correlations, patterns and trends by sifting through large volumes of data, using pattern recognition technologies as well as statistical and mathematical techniques. Hepatitis is one of the most important health problems in the world. Many studies have been performed on the diagnosis of hepatitis disease, but medical diagnosis is a difficult, largely visual task that is mostly done by doctors. Therefore, this research analyses attribute selection and classification algorithms applied to hepatitis patients. To achieve this goal, the WEKA tool is used to conduct experiments with different attribute selectors and classification algorithms. The hepatitis dataset used is taken from the UC Irvine repository. This research considers the attribute selectors CfsSubsetEval, WrapperSubsetEval, GainRatioSubsetEval and CorrelationAttributeEval. The classification algorithms used are NaiveBayesUpdateable, SMO, KStar, RandomTree and SimpleLogistic. The classification models are compared in terms of running time and accuracy. It concludes that the best attribute selector is CfsSubsetEval, while the best classifier is SMO, whose performance is better than the other classification techniques for hepatitis patients.
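The paper's experiments use WEKA's Java implementations; as a rough, language-shifted illustration of the same selection-then-classification pipeline, the sketch below ranks attributes by correlation with the class (a crude analogue of CorrelationAttributeEval) and classifies with a nearest-centroid rule, which is far simpler than SMO. The synthetic data merely mimics the UCI hepatitis set's shape of 155 instances and 19 attributes; nothing here reproduces the paper's results.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k_inf = 155, 19, 5  # shape mimics the UCI hepatitis data
y = rng.integers(0, 2, size=n)
X = rng.standard_normal((n, p))
X[:, :k_inf] += y[:, None]  # only the first k_inf attributes carry signal

# Attribute selection: rank by absolute correlation with the class label.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
selected = np.argsort(corr)[-8:]

# Classification: nearest class centroid on the selected attributes.
train, test = np.arange(120), np.arange(120, n)
centroids = np.stack([X[np.ix_(train[y[train] == c], selected)].mean(axis=0)
                      for c in (0, 1)])
dists = np.linalg.norm(X[np.ix_(test, selected)][:, None, :] - centroids, axis=2)
accuracy = (dists.argmin(axis=1) == y[test]).mean()
```

The split into an attribute-selection stage and an independent classification stage is the point being illustrated: either stage can be swapped out, which is exactly the comparison the paper runs across WEKA's selectors and classifiers.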

    Modelling high-dimensional categorical data using nonconvex fusion penalties

    We propose a method for estimation in high-dimensional linear models with nominal categorical data. Our estimator, called SCOPE, fuses levels together by making their corresponding coefficients exactly equal. This is achieved using the minimax concave penalty on differences between the order statistics of the coefficients for a categorical variable, thereby clustering the coefficients. We provide an algorithm for exact and efficient computation of the global minimum of the resulting nonconvex objective in the case of a single variable with potentially many levels, and use this within a block coordinate descent procedure in the multivariate case. We show that an oracle least squares solution that exploits the unknown level fusions is a limit point of the coordinate descent with high probability, provided the true levels have a certain minimum separation; these conditions are known to be minimal in the univariate case. We demonstrate the favourable performance of SCOPE across a range of real and simulated datasets. An R package, CatReg, implementing SCOPE for linear models, together with a version for logistic regression, is available on CRAN.
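The penalty structure can be written down directly. Below, `mcp` is the standard minimax concave penalty (Zhang, 2010) and `scope_penalty` applies it to consecutive differences of the sorted coefficients of one categorical variable, so that levels with nearby coefficients are pulled exactly together while well-separated clusters incur only a bounded cost. This is a sketch of the penalty only; the exact scaling used by SCOPE/CatReg may differ.

```python
import numpy as np

def mcp(t, lam, gamma):
    """Minimax concave penalty, applied elementwise to |t|."""
    a = np.abs(t)
    return np.where(a <= gamma * lam,
                    lam * a - a ** 2 / (2 * gamma),  # concave rise
                    gamma * lam ** 2 / 2)            # flat beyond gamma*lam

def scope_penalty(beta, lam=1.0, gamma=3.0):
    """MCP on differences of the order statistics of one categorical
    variable's level coefficients (sketch of the SCOPE penalty)."""
    return mcp(np.diff(np.sort(beta)), lam, gamma).sum()

clustered = np.array([0.0, 0.0, 0.0, 5.0, 5.0])  # two fused groups of levels
spread = np.array([0.0, 1.0, 2.0, 3.0, 5.0])     # no fusion
```

With λ = 1 and γ = 3, the clustered coefficients cost only the single saturated gap (γλ²/2 = 1.5), whereas the evenly spread ones cost more; this bounded cost for large gaps is the incentive to fuse nearby levels without over-shrinking well-separated ones.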