Sparsity Oriented Importance Learning for High-dimensional Linear Regression
Model selection uncertainty is now well recognized as non-negligible, so data analysts should no longer be satisfied with the output of a single final model from a model selection process, regardless of its sophistication. To improve reliability and reproducibility in model choice, one constructive approach is to make good use of a sound variable importance measure. Although interesting importance measures are available and increasingly used in data analysis, little theoretical justification has been provided for them. In this paper, we propose a new variable importance measure, sparsity oriented importance learning (SOIL), for high-dimensional regression from a sparse linear modeling perspective, taking variable selection uncertainty into account via a sensible model weighting. The SOIL method is theoretically shown to have the inclusion/exclusion property: when the model weights are properly concentrated around the true model, the SOIL importance can well separate the variables in the true model from the rest. In particular, even if the signal is weak, SOIL rarely gives variables not in the true model significantly higher importance values than those in the true model. Extensive simulations in several illustrative settings, and real data examples with guided simulations, show desirable properties of the SOIL importance in contrast to other importance measures.
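The abstract does not spell out the estimator, but its core idea — a variable's importance as the total weight of the candidate models that contain it — can be sketched. In the minimal Python sketch below, the candidate models (all subsets of size at most two) and the BIC-type weights are illustrative assumptions, not the paper's actual candidate-model construction or weighting schemes.

```python
# Illustrative SOIL-style importance: a variable's importance is the total
# weight of the candidate sparse models that include it. Candidate models and
# BIC-type weights here are simplifying assumptions, not the paper's scheme.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)  # true model: {0, 1}

def bic(model_vars):
    """BIC of an OLS fit restricted to the given variables."""
    fit = LinearRegression().fit(X[:, model_vars], y)
    rss = np.sum((y - fit.predict(X[:, model_vars])) ** 2)
    return n * np.log(rss / n) + (len(model_vars) + 1) * np.log(n)

# Candidate models: all subsets of size <= 2 (a stand-in for a real
# solution path from a sparse-regression method).
candidates = [list(c) for r in (1, 2) for c in combinations(range(p), r)]
bics = np.array([bic(m) for m in candidates])
weights = np.exp(-(bics - bics.min()) / 2)
weights /= weights.sum()

# SOIL-style importance: weighted inclusion frequency of each variable.
importance = np.zeros(p)
for w, m in zip(weights, candidates):
    importance[m] += w
print(np.round(importance, 3))  # variables 0 and 1 should dominate
```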
Context-dependent feature analysis with random forests
Feature selection is often more complicated than identifying a single subset of input variables that would together explain the output. There may be interactions that depend on contextual information, i.e., variables that turn out to be relevant only in some specific circumstances. In this setting, the contribution of this paper is to extend the random forest variable importances framework in order (i) to identify variables whose relevance is context-dependent and (ii) to characterize as precisely as possible the effect of contextual information on these variables. The usage and the relevance of our framework for highlighting context-dependent variables are illustrated on both artificial and real datasets.

Comment: Accepted for presentation at UAI 201
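As a rough illustration of context-dependent relevance (not the paper's exact conditional-importance estimator), the sketch below compares a random forest's permutation importances across the two values of a simulated binary context variable; the data-generating process is an assumption made for the example.

```python
# Rough probe of context dependence: compare permutation importances of a
# random forest across the two values of a binary context variable. This is
# an illustration, not the paper's extended importance framework itself.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
n = 2000
context = rng.integers(0, 2, size=n)          # contextual variable
x1, x2 = rng.normal(size=(2, n))
# x1 matters only when context == 1; x2 matters everywhere.
y = np.where(context == 1, 3.0 * x1, 0.0) + x2 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2, context])

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
for c in (0, 1):
    mask = context == c
    imp = permutation_importance(forest, X[mask], y[mask],
                                 n_repeats=10, random_state=0)
    print(f"context={c}:", np.round(imp.importances_mean, 2))
# x1's importance should be near zero in context 0 and large in context 1.
```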
Fitting Prediction Rule Ensembles with R Package pre
Prediction rule ensembles (PREs) are sparse collections of rules, offering highly interpretable regression and classification models. This paper presents the R package pre, which derives PREs through the methodology of Friedman and Popescu (2008). The implementation and functionality of package pre are described and illustrated through an application to a dataset on the prediction of depression. Furthermore, the accuracy and sparsity of PREs are compared with those of single trees, random forests, and lasso regression on four benchmark datasets. Results indicate that pre derives ensembles with predictive accuracy comparable to that of random forests, while using a smaller number of variables for prediction.
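pre itself is an R package; the Python sketch below only illustrates the Friedman and Popescu (2008) recipe that pre builds on — harvest rules as leaf memberships of shallow trees, then let a lasso select a sparse subset — under simplified, assumed settings, not the package's actual implementation or defaults.

```python
# Minimal sketch of the prediction-rule-ensemble idea: shallow boosted trees
# generate rules (encoded as 0/1 leaf memberships), and a lasso keeps a
# sparse, interpretable subset. Requires scikit-learn >= 1.2.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(2)
n, p = 500, 5
X = rng.normal(size=(n, p))
y = 2.0 * (X[:, 0] > 0) + (X[:, 1] > 1) + 0.1 * rng.normal(size=n)

# Shallow boosted trees act as the rule generator.
booster = GradientBoostingRegressor(n_estimators=50, max_depth=2,
                                    learning_rate=0.1, random_state=0)
booster.fit(X, y)

# Each (tree, leaf) pair defines a rule; encode leaf membership as 0/1.
leaves = booster.apply(X).reshape(len(X), -1)
rules = OneHotEncoder(sparse_output=False).fit_transform(leaves)

# Lasso over rule indicators yields a sparse rule ensemble.
lasso = LassoCV(cv=5, random_state=0).fit(rules, y)
print("rules kept:", int(np.sum(lasso.coef_ != 0)), "of", rules.shape[1])
```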
Identifying features predictive of faculty integrating computation into physics courses
Computation is a central aspect of 21st-century physics practice; it is used to model complicated systems, to simulate impossible experiments, and to analyze mountains of data. Physics departments and their faculty are increasingly recognizing the importance of teaching computation to their students. We recently completed a national survey of faculty in physics departments to understand the state of computational instruction and the factors that underlie that instruction. The data collected from the faculty responding to the survey included a variety of scales, binary questions, and numerical responses. We then used Random Forest, a supervised learning technique, to explore the factors that are most predictive of whether a faculty member decides to include computation in their physics courses. We find that experience using computation with students in their research (or lack thereof) and various personal beliefs are most predictive of a faculty member having experience teaching computation. Interestingly, we find demographic and departmental factors to be less useful predictors in our model. The results of this study inform future efforts to promote greater integration of computation into the physics curriculum and comment on the current state of computational instruction across the United States.
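A hedged sketch of the kind of analysis described — a random forest over mixed survey responses, with predictors ranked by feature importance — might look as follows. The file and column names are hypothetical placeholders, not the study's actual survey items.

```python
# Sketch: random forest over mixed survey data, ranked by impurity-based
# feature importances. "faculty_survey.csv" and all column names below are
# hypothetical placeholders, not the study's actual data.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

survey = pd.read_csv("faculty_survey.csv")    # hypothetical file
features = ["research_computation_experience", "belief_computation_valuable",
            "department_size", "years_teaching"]  # placeholder predictors
X = pd.get_dummies(survey[features])          # encode categorical scales
y = survey["teaches_computation"]             # placeholder binary outcome

forest = RandomForestClassifier(n_estimators=500, random_state=0)
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())

forest.fit(X, y)
ranking = pd.Series(forest.feature_importances_, index=X.columns)
print(ranking.sort_values(ascending=False))
```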
Bridging designs for conjoint analysis: The issue of attribute importance.
Conjoint analysis studies involving many attributes and attribute levels often occur in practice. Because such studies can cause respondent fatigue and a lack of cooperation, it is important to design data collection tasks that reduce these problems. Bridging designs, which split the attributes into two or more task subsets with overlapping attributes, can presumably lower task difficulty in such cases. In this paper, we present results of a study examining how predictive validity is affected by bridging design decisions: whether important or unimportant attributes serve as the links (bridges) between card-sort tasks, and the degree of balance and consistency in estimated attribute importance across tasks. We also propose a new symmetric procedure, Symbridge, to scale the bridged conjoint solutions.
Nutrient Estimation from 24-Hour Food Recalls Using Machine Learning and Database Mapping: A Case Study with Lactose.
The Automated Self-Administered 24-Hour Dietary Assessment Tool (ASA24) is a free dietary recall system that outputs fewer nutrients than the Nutrition Data System for Research (NDSR). NDSR uses the Nutrition Coordinating Center (NCC) Food and Nutrient Database; both require a license. Manually looking up ASA24 foods in NDSR is time-consuming but is currently the only way to acquire NCC-exclusive nutrients. Using lactose as an example, we evaluated machine learning and database matching methods to estimate this NCC-exclusive nutrient from ASA24 reports. ASA24-reported foods were manually looked up in NDSR to obtain lactose estimates and were split into training (n = 378) and test (n = 189) datasets. Nine machine learning models were developed to predict lactose from the nutrients common to ASA24 and the NCC database. Database matching algorithms were developed to match an NCC food to each ASA24 food using either the nutrients alone ("Nutrient-Only") or the nutrients plus the food descriptions ("Nutrient + Text"). For both methods, the lactose values were compared to the manual curation. Among the machine learning models, the XGB-Regressor performed best on the held-out test data (R² = 0.33). For the database matching method, Nutrient + Text matching yielded the best lactose estimates (R² = 0.76), a vast improvement over the status quo of no estimate. These results suggest that computational methods can successfully estimate an NCC-exclusive nutrient for foods reported in ASA24.
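A minimal sketch of the machine-learning arm of this workflow is given below; the file names and nutrient columns are placeholders, not the actual ASA24/NCC fields, and the model settings are assumptions rather than the paper's tuned configuration.

```python
# Sketch: gradient-boosted trees predicting a missing nutrient (lactose)
# from the nutrients two databases share. Files and column names are
# hypothetical; the paper's best model was an XGBoost regressor (R^2 = 0.33).
import pandas as pd
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

train = pd.read_csv("asa24_train.csv")        # hypothetical file, n = 378
test = pd.read_csv("asa24_test.csv")          # hypothetical file, n = 189
shared = ["energy_kcal", "protein_g", "fat_g", "carbohydrate_g",
          "sugar_g", "calcium_mg"]            # stand-ins for shared nutrients

model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(train[shared], train["lactose_g"])  # targets from manual NDSR lookup
print("test R^2:", r2_score(test["lactose_g"], model.predict(test[shared])))
```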