Variable selection with error control: Another look at stability selection
Stability Selection was recently introduced by Meinshausen and Bühlmann
(2010) as a very general technique designed to improve the performance of a
variable selection algorithm. It is based on aggregating the results of
applying a selection procedure to subsamples of the data. We introduce a
variant, called Complementary Pairs Stability Selection (CPSS), and derive
bounds both on the expected number of variables included by CPSS that have low
selection probability under the original procedure, and on the expected number
of high selection probability variables that are excluded. These results
require no (e.g. exchangeability) assumptions on the underlying model or on the
quality of the original selection procedure. Under reasonable shape
restrictions, the bounds can be further tightened, yielding improved error
control, and therefore increasing the applicability of the methodology.
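As a concrete illustration, here is a minimal sketch of the complementary pairs subsampling scheme, with the lasso (via glmnet) standing in for the base selection procedure; the values of B, q and tau are illustrative placeholders, not the paper's recommendations.

```r
## Minimal sketch of CPSS: aggregate selections over complementary
## pairs of subsamples. The lasso (glmnet) is an illustrative stand-in
## for the base selection procedure.
library(glmnet)

cpss <- function(x, y, B = 50, q = 10, tau = 0.6) {
  n <- nrow(x); p <- ncol(x)
  counts <- numeric(p)
  for (b in seq_len(B)) {
    half <- sample(n, floor(n / 2))            # random half of the data
    for (idx in list(half, setdiff(seq_len(n), half))) {
      fit <- glmnet(x[idx, ], y[idx])
      k <- max(which(fit$df <= q))             # largest lambda with <= q active
      sel <- which(fit$beta[, k] != 0)
      counts[sel] <- counts[sel] + 1
    }
  }
  prop <- counts / (2 * B)                     # selection proportions
  which(prop >= tau)                           # the stable set
}
```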
Random intersection trees
Finding interactions between variables in large and high-dimensional datasets
is often a serious computational challenge. Most approaches build up
interaction sets incrementally, adding variables in a greedy fashion. The
drawback is that potentially informative high-order interactions may be
overlooked. Here, we propose an alternative approach for classification
problems with binary predictor variables, called Random Intersection Trees. It
works by starting with a maximal interaction that includes all variables, and
then gradually removing variables if they fail to appear in randomly chosen
observations of a class of interest. We show that informative interactions are
retained with high probability, and the computational complexity of our
procedure is of order p^κ for a value of κ that can reach values as low as 1
for very sparse data; in many more general settings, it will still beat the
exponent s obtained when using a brute force search constrained to order-s
interactions. In addition, by using some new ideas based on min-wise
hash schemes, we are able to further reduce the computational cost.
Interactions found by our algorithm can be used for predictive modelling in
various forms, but they are also often of interest in their own right as useful
characterisations of what distinguishes a certain class from others.
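The core intersection step can be sketched without the tree structure or the min-wise hashing refinements; the depth D and repetition count B below are illustrative choices, not the paper's.

```r
## Simplified sketch of the core idea (no tree, no min-wise hashing):
## repeatedly intersect the active variable sets of randomly chosen
## observations from the class of interest. Informative interactions
## tend to survive the intersections; noise variables are pruned away.
random_intersections <- function(x1, D = 5, B = 100) {
  ## x1: binary matrix of observations from the class of interest
  n <- nrow(x1)
  candidates <- list()
  for (b in seq_len(B)) {
    s <- which(x1[sample(n, 1), ] == 1)        # start from one observation
    for (d in seq_len(D - 1)) {
      s <- intersect(s, which(x1[sample(n, 1), ] == 1))
      if (length(s) == 0) break
    }
    if (length(s) > 0) candidates[[length(candidates) + 1]] <- s
  }
  unique(candidates)                           # candidate interactions
}
```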
The xyz algorithm for fast interaction search in high-dimensional data
When performing regression on a data set with p variables, it is often of interest to go beyond using main linear effects and include interactions as products between individual variables. For small-scale problems, these interactions can be computed explicitly but this leads to a computational complexity of at least O(p^2) if done naively. This cost can be prohibitive if p is very large. We introduce a new randomised algorithm that is able to discover interactions with high probability and under mild conditions has a runtime that is subquadratic in p. We show that strong interactions can be discovered in almost linear time, whilst finding weaker interactions requires O(p^α) operations for 1 < α < 2 depending on their strength. The underlying idea is to transform interaction search into a closest pair problem which can be solved efficiently in subquadratic time. The algorithm is called xyz and is implemented in the language R. We demonstrate its efficiency for application to genome-wide association studies, where more than 10^11 interactions can be screened in under 280 seconds with a single-core 1.2 GHz CPU.
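A toy sketch of the closest pair reduction (not the R package's actual implementation) might look as follows, assuming predictors and response coded in {-1, 1}; the subset size L and repetition count R are placeholders. The point is that an interaction x_j * x_k matching y means the columns x_j and y * x_k agree on most observations, so hashing columns by their sign pattern on a few random rows turns interaction search into finding hash collisions.

```r
## Toy sketch of the xyz reduction: hash columns of x and of y*x by
## their pattern on a small random row subset; colliding columns are
## candidate interaction pairs.
xyz_sketch <- function(x, y, L = 10, R = 50) {
  ## x: n x p matrix with entries in {-1, 1}; y: length-n vector in {-1, 1}
  n <- nrow(x); p <- ncol(x)
  z <- x * y                                   # column j holds y * x_j
  pairs <- list()
  for (r in seq_len(R)) {
    idx <- sample(n, L)                        # random subset of observations
    key_x <- apply(x[idx, , drop = FALSE], 2, paste, collapse = "")
    key_z <- apply(z[idx, , drop = FALSE], 2, paste, collapse = "")
    buckets <- split(seq_len(p), key_x)        # hash columns of x by pattern
    for (j in seq_len(p)) {
      hits <- buckets[[key_z[j]]]              # columns colliding with y * x_j
      for (k in setdiff(hits, j))
        pairs[[length(pairs) + 1]] <- sort(c(j, k))
    }
  }
  unique(pairs)                                # candidate interaction pairs
}
```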
Modelling Interactions in High-dimensional Data with Backtracking
We study the problem of high-dimensional regression when there may be
interacting variables. Approaches using sparsity-inducing penalty functions
such as the Lasso can be useful for producing interpretable models. However,
when the number of variables runs into the thousands, and so even two-way
interactions number in the millions, these methods may become computationally
infeasible. Typically variable screening based on model fits using only main
effects must be performed first. One problem with screening is that important
variables may be missed if they are only useful for prediction when certain
interaction terms are also present in the model.
To tackle this issue, we introduce a new method we call Backtracking. It can
be incorporated into many existing high-dimensional methods based on penalty
functions, and works by building increasing sets of candidate interactions
iteratively. Models fitted on the main effects and interactions selected early
on in this process guide the selection of future interactions. By also making
use of previous fits for computation, as well as performing calculations in
parallel, the overall run-time of the algorithm can be greatly reduced.
The effectiveness of our method when applied to regression and classification
problems is demonstrated on simulated and real data sets. In the case of using
Backtracking with the Lasso, we also give some theoretical support for our
procedure.
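A rough sketch of this iterative scheme, simplified and using cv.glmnet as the penalised fitting method (an illustrative stand-in, not the paper's exact algorithm): fit on main effects, add candidate interactions among the selected variables, refit on the enlarged design, and repeat.

```r
## Rough sketch of the backtracking idea with the lasso. The number of
## iterations and the cross-validated lambda choice are illustrative.
library(glmnet)

backtracking_lasso <- function(x, y, iters = 3) {
  p <- ncol(x)
  inter <- matrix(0L, nrow = 0, ncol = 2)      # interaction pairs added so far
  design <- x
  for (it in seq_len(iters)) {
    fit <- cv.glmnet(design, y)
    beta <- as.numeric(coef(fit, s = "lambda.min"))[-1]
    sel_main <- which(beta[seq_len(p)] != 0)   # currently selected main effects
    if (length(sel_main) < 2) break
    inter <- unique(rbind(inter, t(combn(sel_main, 2))))
    design <- cbind(x, x[, inter[, 1], drop = FALSE] *
                       x[, inter[, 2], drop = FALSE])
  }
  list(fit = fit, interactions = inter)
}
```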
Right singular vector projection graphs: fast high dimensional covariance matrix estimation under latent confounding
In this work we consider the problem of estimating a high-dimensional p × p
covariance matrix Σ, given n observations of confounded data with covariance
Σ + ΓΓ^T, where Γ is an unknown p × q matrix of latent factor loadings. We
propose a simple and scalable
estimator based on the projection on to the right singular vectors of the
observed data matrix, which we call RSVP. Our theoretical analysis of this
method reveals that in contrast to PCA-based approaches, RSVP is able to cope
well with settings where the smallest eigenvalue of Γ^T Γ is close
to the largest eigenvalue of Σ, as well as settings where the
eigenvalues of Γ^T Γ are diverging fast. It is also able to handle
data that may have heavy tails and only requires that the data has an
elliptical distribution. RSVP does not require knowledge or estimation of the
number of latent factors q, but only recovers Σ up to an unknown
positive scale factor. We argue this suffices in many applications, for example
if an estimate of the correlation matrix is desired. We also show that by using
subsampling, we can further improve the performance of the method. We
demonstrate the favourable performance of RSVP through simulation experiments
and an analysis of gene expression datasets collated by the GTEx consortium.
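A minimal sketch, under the assumption that the estimator is proportional to V V^T, the projection onto the right singular vectors V of the data matrix (consistent with recovery only up to scale), with optional subsample averaging; the subsample size m and repetition count B are illustrative, not the paper's recommendations.

```r
## Sketch of a projection-based estimator in the spirit of RSVP:
## V V^T from the SVD of the data matrix, optionally averaged over
## random subsamples of the rows.
rsvp_sketch <- function(x, B = 0, m = floor(nrow(x) / 2)) {
  proj <- function(z) {
    v <- svd(z)$v                    # right singular vectors of z
    tcrossprod(v)                    # V V^T, a p x p matrix
  }
  if (B == 0) return(proj(x))
  est <- 0
  for (b in seq_len(B))
    est <- est + proj(x[sample(nrow(x), m), , drop = FALSE])
  est / B                            # subsample-averaged estimate
}
```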
Cigarette smoking in adolescents with asthma in Jordan: Impact of peer-led education in high schools
Background: Peer-led smoking prevention programs focus on teaching adolescents, especially those with asthma who are affected most by cigarettes, refusal skills to lower their intention to smoke. The purpose of this study was to determine the impact of a peer-led asthma education program on students who were smokers in terms of self-efficacy to resist smoking, asthma knowledge and asthma-related quality of life.
Goodness-of-fit tests for high dimensional linear models
We propose a framework for constructing goodness-of-fit tests in both low and high dimensional linear models. We advocate applying regression methods to the scaled residuals following either an ordinary least squares or lasso fit to the data, and using some proxy for prediction error as the final test statistic. We call this family residual prediction tests. We show that simulation can be used to obtain the critical values for such tests in the low dimensional setting and demonstrate using both theoretical results and extensive numerical studies that some form of the parametric bootstrap can do the same when the high dimensional linear model is under consideration. We show that residual prediction tests can be used to test for significance of groups or individual variables as special cases, and here they compare favourably with state-of-the-art methods, but we also argue that they can be designed to test for as diverse model misspecifications as heteroscedasticity and non-linearity.
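A conceptual sketch of one such test, using the lasso for both the initial fit and the residual regression (an illustrative choice, not the authors' implementation), with a parametric bootstrap for calibration:

```r
## Sketch of a residual prediction test: if the scaled residuals from a
## lasso fit remain predictable from the design, the model is suspect.
## Calibration is by parametric bootstrap under the fitted null model.
library(glmnet)

rp_test <- function(x, y, B = 100) {
  stat <- function(yy) {
    fit <- cv.glmnet(x, yy)
    r <- as.numeric(yy - predict(fit, x, s = "lambda.min"))
    r <- r / sqrt(mean(r^2))                   # scaled residuals
    refit <- cv.glmnet(x, r)
    min(refit$cvm)                             # prediction-error proxy
  }
  obs <- stat(y)
  fit0 <- cv.glmnet(x, y)
  mu <- as.numeric(predict(fit0, x, s = "lambda.min"))
  sigma <- sqrt(mean((y - mu)^2))
  null_stats <- replicate(B, stat(mu + rnorm(length(y), sd = sigma)))
  mean(null_stats <= obs)                      # small p-value flags misfit
}
```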
Hypothyroidism in polycystic ovarian syndrome: a comparative study of clinical characteristics, metabolic and hormonal parameters in euthyroid and hypothyroid polycystic ovarian syndrome women
Background: This study was conducted to examine the influence of hypothyroidism on the pathophysiology and features of polycystic ovarian syndrome (PCOS) with respect to clinical characteristics and the hormonal and metabolic profile. Methods: 102 euthyroid PCOS and 18 hypothyroid PCOS women were included in this cross-sectional study after considering inclusion and exclusion criteria. The study subjects were assessed for various signs and symptoms such as recent weight gain, obesity, abnormal hair growth, hirsutism, hair loss, acne, acanthosis nigricans and infertility. Various hormonal and metabolic parameters were evaluated, viz. luteinizing hormone (LH), follicle stimulating hormone (FSH), LH:FSH ratio, testosterone, prolactin, dehydroepiandrosterone (DHEA), fasting insulin and fasting blood glucose. BMI and HOMA values were calculated. Results: The association of hirsutism, excessive hair growth, hair loss, acanthosis nigricans, acne and infertility was not significant between the two groups. The majority of patients in both groups were overweight or obese. BMI and the number of patients complaining of weight gain were significantly greater in hypothyroid PCOS women. While no statistical difference in LH, FSH, LH:FSH ratio, prolactin and testosterone levels was found, the serum DHEA level was significantly lower in the hypothyroid PCOS group. No statistical difference in fasting blood glucose and insulin levels was found between the two groups. Though both groups showed insulin resistance, HOMA values were significantly higher in hypothyroid PCOS women. Conclusions: The presence of hypothyroidism significantly increased the severity of insulin resistance as well as obesity in PCOS. This could have adverse metabolic consequences. The concurrent occurrence of both these disorders could also possibly affect other features of PCOS, viz. hair loss and infertility.
Analysis of Attribute Selection and Classification Algorithm Applied to Hepatitis Patients
Data mining techniques are widely used for classification, attribute selection and prediction in the field of bioinformatics because they help to discover meaningful new correlations, patterns and trends by sifting through large volumes of data, using pattern recognition technologies as well as statistical and mathematical techniques. Hepatitis is one of the most important health problems in the world. Many studies have been performed on the diagnosis of hepatitis disease, but medical diagnosis is a difficult and largely visual task that is mostly done by doctors. Therefore, this research analyses attribute selection and classification algorithms applied to hepatitis patients. To achieve this goal, the WEKA tool is used to conduct experiments with different attribute selectors and classification algorithms. The hepatitis dataset used is taken from the UC Irvine repository. This research considers various attribute selectors, namely CfsSubsetEval, WrapperSubsetEval, GainRatioSubsetEval and CorrelationAttributeEval. The classification algorithms used in this research are NaiveBayesUpdatable, SMO, KStar, RandomTree and SimpleLogistic. The classification models are compared on running time and accuracy. The research concludes that the best attribute selector is CfsSubsetEval, while the best classifier is SMO because its performance is better than the other classification techniques for hepatitis patients.
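For readers wanting to reproduce this kind of comparison from R, a hedged sketch using the RWeka interface to WEKA is below; the file name hepatitis.arff, the class attribute name Class, and the 10-fold evaluation are assumptions for illustration, not details from the study.

```r
## Sketch of an attribute-selection and classification comparison via
## RWeka. File name and class attribute name are assumed, not given
## in the study.
library(RWeka)

hepatitis <- read.arff("hepatitis.arff")       # UCI hepatitis dataset (assumed file)
## rank attributes by gain ratio (one of several selectors compared)
print(GainRatioAttributeEval(Class ~ ., data = hepatitis))
## fit and cross-validate an SMO classifier, reported best in the study
model <- SMO(Class ~ ., data = hepatitis)
evaluate_Weka_classifier(model, numFolds = 10)
```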
Modelling high-dimensional categorical data using nonconvex fusion penalties
We propose a method for estimation in high-dimensional linear models with
nominal categorical data. Our estimator, called SCOPE, fuses levels together by
making their corresponding coefficients exactly equal. This is achieved using
the minimax concave penalty on differences between the order statistics of the
coefficients for a categorical variable, thereby clustering the coefficients.
We provide an algorithm for exact and efficient computation of the global
minimum of the resulting nonconvex objective in the case with a single variable
with potentially many levels, and use this within a block coordinate descent
procedure in the multivariate case. We show that an oracle least squares
solution that exploits the unknown level fusions is a limit point of the
coordinate descent with high probability, provided the true levels have a
certain minimum separation; these conditions are known to be minimal in the
univariate case. We demonstrate the favourable performance of SCOPE across a
range of real and simulated datasets. An R package CatReg implementing SCOPE
for linear models and also a version for logistic regression is available on
CRAN.
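The penalty described in the abstract is concrete enough to sketch: below is a minimal illustration of applying the minimax concave penalty (MCP) to successive differences of the sorted coefficients of a single categorical variable. The tuning values lambda and gamma are placeholders, and this shows only the penalty term, not the exact-computation algorithm or the CatReg package API.

```r
## Standard MCP: lambda*|u| - u^2/(2*gamma) for |u| <= gamma*lambda,
## and the constant gamma*lambda^2/2 beyond that.
mcp <- function(u, lambda, gamma) {
  ifelse(abs(u) <= gamma * lambda,
         lambda * abs(u) - u^2 / (2 * gamma),
         gamma * lambda^2 / 2)
}

## SCOPE-style penalty on one categorical variable's coefficients:
## MCP on the gaps between ordered level coefficients, encouraging
## exact fusion of nearby levels.
scope_penalty <- function(beta, lambda = 0.1, gamma = 8) {
  d <- diff(sort(beta))            # gaps between order statistics
  sum(mcp(d, lambda, gamma))
}
```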