Covariate dimension reduction for survival data via the Gaussian process latent variable model
The analysis of high dimensional survival data is challenging, primarily due
to the problem of overfitting which occurs when spurious relationships are
inferred from data that subsequently fail to exist in test data. Here we
propose a novel method of extracting a low dimensional representation of
covariates in survival data by combining the popular Gaussian Process Latent
Variable Model (GPLVM) with a Weibull Proportional Hazards Model (WPHM). The
combined model offers a flexible non-linear probabilistic method of detecting
and extracting any intrinsic low dimensional structure from high dimensional
data. By reducing the covariate dimension we aim to diminish the risk of
overfitting and increase the robustness and accuracy with which we infer
relationships between covariates and survival outcomes. In addition, we can
simultaneously combine information from multiple data sources by expressing
multiple datasets in terms of the same low dimensional space. We present
results from several simulation studies that illustrate a reduction in
overfitting and an increase in predictive performance, as well as successful
detection of intrinsic dimensionality. We provide evidence that it is
advantageous to combine dimensionality reduction with survival outcomes rather
than performing unsupervised dimensionality reduction on its own. Finally, we
use our model to analyse experimental gene expression data and detect and
extract a low dimensional representation that allows us to distinguish high and
low risk groups with superior accuracy compared to doing regression on the
original high dimensional data.
Spike and slab variable selection: Frequentist and Bayesian strategies
Variable selection in the linear regression model takes many apparent faces
from both frequentist and Bayesian standpoints. In this paper we introduce a
variable selection method referred to as a rescaled spike and slab model. We
study the importance of prior hierarchical specifications and draw connections
to frequentist generalized ridge regression estimation. Specifically, we study
the usefulness of continuous bimodal priors to model hypervariance parameters,
and the effect scaling has on the posterior mean through its relationship to
penalization. Several model selection strategies, some frequentist and some
Bayesian in nature, are developed and studied theoretically. We demonstrate the
importance of selective shrinkage for effective variable selection in terms of
risk misclassification, and show this is achieved using the posterior from a
rescaled spike and slab model. We also show how to verify a procedure's ability
to reduce model uncertainty in finite samples using a specialized forward
selection strategy. Using this tool, we illustrate the effectiveness of
rescaled spike and slab models in reducing model uncertainty.
Comment: Published at http://dx.doi.org/10.1214/009053604000001147 in the
Annals of Statistics (http://www.imstat.org/aos/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
Orthogonalized smoothing for rescaled spike and slab models
Rescaled spike and slab models are a new Bayesian variable selection method
for linear regression models. In high dimensional orthogonal settings such
models have been shown to possess optimal model selection properties. We review
background theory and discuss applications of rescaled spike and slab models to
prediction problems involving orthogonal polynomials. We first consider global
smoothing and discuss potential weaknesses. Some of these deficiencies are
remedied by using local regression. The local regression approach relies on an
intimate connection between local weighted regression and weighted generalized
ridge regression. An important implication is that one can trace the effective
degrees of freedom of a curve as a way to visualize and classify curvature.
Several motivating examples are presented.
Comment: Published at http://dx.doi.org/10.1214/074921708000000192 in the IMS
Collections (http://www.imstat.org/publications/imscollections.htm) by the
Institute of Mathematical Statistics (http://www.imstat.org)
Variable importance in binary regression trees and forests
We characterize and study variable importance (VIMP) and pairwise variable
associations in binary regression trees. A key component involves the node mean
squared error for a quantity we refer to as a maximal subtree. The theory
naturally extends from single trees to ensembles of trees and applies to
methods like random forests. This is useful because while importance values
from random forests are used to screen variables, for example they are used to
filter high throughput genomic data in Bioinformatics, very little theory
exists about their properties.
Comment: Published at http://dx.doi.org/10.1214/07-EJS039 in the Electronic
Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
Building Morphological Chains for Agglutinative Languages
In this paper, we build morphological chains for agglutinative languages by
using a log-linear model for the morphological segmentation task. The model is
based on the unsupervised morphological segmentation system called
MorphoChains. We extend the MorphoChains log-linear model by expanding the
candidate space recursively to cover more split points for agglutinative
languages such as Turkish, whereas in the original model candidates are
generated by considering only binary segmentation of each word. The results
show that we improve the state-of-the-art Turkish scores by 12%, reaching an
F-measure of 72%, and the English scores by 3%, reaching an F-measure of 74%.
Eventually, the system outperforms both MorphoChains and other well-known
unsupervised morphological segmentation systems. The results indicate that
candidate generation plays an important role in such an unsupervised log-linear
model that is learned using contrastive estimation with negative samples.
Comment: 10 pages, accepted and presented at CICLing 2017 (18th
International Conference on Intelligent Text Processing and Computational
Linguistics)
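The contrast between binary and recursive candidate generation can be illustrated with a small sketch. This is not the MorphoChains implementation; the function names `binary_splits` and `recursive_segmentations` and the `min_len` parameter are hypothetical, and the sketch only shows how allowing repeated splits enlarges the candidate space for a suffix-rich word.

```python
def binary_splits(word, min_len=1):
    """All ways to cut a word into exactly two pieces (hypothetical helper)."""
    return [(word[:i], word[i:])
            for i in range(min_len, len(word) - min_len + 1)]


def recursive_segmentations(word, min_len=1):
    """All segmentations obtained by splitting the remainder again and again.

    Illustrative assumption: every position is a legal split point as long
    as each piece has at least min_len characters.
    """
    results = [(word,)]  # the unsegmented word is always a candidate
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        for rest in recursive_segmentations(tail, min_len):
            results.append((head,) + rest)
    return results


if __name__ == "__main__":
    # Turkish "evleri" can be read as ev + ler + i, which needs two split points.
    print(len(binary_splits("evleri")))
    print(("ev", "ler", "i") in recursive_segmentations("evleri"))
```

A binary split can only propose a two-piece analysis such as ev + leri, while the recursive enumeration also reaches the three-morpheme analysis, at the cost of a candidate set that grows exponentially with word length.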
Industry Dynamics: Aggregate Uncertainty, Heterogeneity, and the Entry and Exit of Firms
The aim of this paper is to investigate the relationship between heterogeneity at the firm level and the aggregate behavior of the industry as a whole. Empirical evidence on firm behavior documents the existence of widespread heterogeneity among firms in any industry, both cross-sectionally and over time. At the cross-sectional level, data indicate that industries are made up of firms of different sizes, ages, and productivity. These differences between firms imply that firm responses to aggregate changes in the industry are varied and not exactly alike, as predicted by the representative firm models. It has also been noted that there exist differences in the evolution of firms with similar initial conditions, i.e., firms entering an industry at any given point in time follow very different capital growth paths. Whether these differences, be it at the cross-sectional level or over time, have a significant impact on the aggregate dynamics of the industry is the main focus of this paper. In the past few years, there has been an increase in interest in heterogeneous-agent models. The motivation for this has been the empirical observation of heterogeneity in the response of agents to aggregate changes. Thus, incorporating heterogeneity into models with aggregate uncertainty appears to be the logical next step. Incorporating heterogeneity allows us to evaluate whether heterogeneity at the microeconomic level plays a significant role in aggregate dynamics. Computing the equilibrium for such models has been a problem due to the lack of an analytical or closed-form solution. The trend has been towards using sophisticated computational techniques to arrive at the equilibrium. The use of computational methods has allowed us to solve and quantitatively analyze a whole range of models which would not have been possible otherwise.
The purpose of the paper is to develop a dynamic model of firm and industry behavior that can be used to understand the relationship between firm-level decisions, aggregate uncertainty and the business cycle. The model developed assumes heterogeneous firms, and studies the investment behavior of these firms in response to aggregate shocks. It also looks at whether these differences at the microeconomic level have an impact on aggregate industry dynamics. Dropping the assumption of a representative firm also allows for the incorporation of firm entry and exit into this model. This enables us to study the changes in industry size and composition over time. This paper makes two contributions. First, it develops a model of investment behavior that reflects features observed in microeconomic data on firm behavior. It then studies the impact of these features on aggregate investment dynamics and the evolution of industries. In relaxing assumptions such as that of a representative firm, this paper provides a richer and more realistic framework within which to study investment dynamics. The second contribution of the paper is methodological. In the absence of an analytical solution, it develops a computational technique that allows for such a model to be solved for an approximate (numerical) equilibrium.
Economic growth and the environment
As the UK economy emerges from the downturn, attention is shifting to how best to return it to sustained and durable economic growth. But what does sustained and durable economic growth mean in the context of the natural environment? The UK and the global economy face significant environmental challenges, from averting dangerous climate change to halting biodiversity loss and protecting our ecosystems. There has been debate over whether it is possible to achieve economic growth whilst also tackling these challenges. This paper does not try to answer the question of what the sustainable level of economic growth might be, but instead examines the link between economic growth and the environment, and the role of environmental policy in managing the provision and use of natural assets. Many question the value of continued growth in GDP, given its limitations – including as a measure of wellbeing – and some evidence of its diminishing benefits within rich countries. However, it remains essential to support continued improvements in factors that affect people’s wellbeing, from health and employment to education and quality of life, and to help the government deliver on a range of policy objectives – economic, social, and environmental.
Keywords: Environmental policy; Natural Environment; Natural Capital; Growth; Sustainable Growth
Characterizing L2Boosting
We consider L2Boosting, a special case of Friedman's generic boosting
algorithm applied to linear regression under L2-loss. We study L2Boosting
for an arbitrary regularization parameter and derive an exact closed form
expression for the number of steps taken along a fixed coordinate direction.
This relationship is used to describe L2Boosting's solution path, to
describe new tools for studying its path, and to characterize some of the
algorithm's unique properties, including active set cycling, a property where
the algorithm spends lengthy periods of time cycling between the same
coordinates when the regularization parameter is arbitrarily small. Our fixed
descent analysis also reveals a repressible condition that limits the
effectiveness of L2Boosting in correlated problems by preventing desirable
variables from entering the solution path. As a simple remedy, a data
augmentation method similar to that used for the elastic net is used to
introduce L2-penalization and is shown, in combination with decorrelation,
to reverse the repressible condition and circumvent L2Boosting's
deficiencies in correlated problems. In itself, this presents a new explanation
for why the elastic net is successful in correlated problems and why methods
like LAR and lasso can perform poorly in such settings.
Comment: Published at http://dx.doi.org/10.1214/12-AOS997 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
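As a rough illustration of the kind of algorithm the abstract above refers to, the following is a minimal sketch of componentwise boosting for linear regression under squared-error loss. It is a generic textbook-style version, not the authors' code; the function name `l2boost`, the parameter names `nu` (regularization or step-size parameter) and `n_steps`, and the recorded `path` of selected coordinates are all assumptions made for illustration.

```python
import numpy as np


def l2boost(X, y, nu=0.1, n_steps=200):
    """Componentwise boosting under squared-error loss (illustrative sketch).

    At each step the single column that best fits the current residual is
    chosen, and its coefficient is moved a fraction nu of the way towards
    the least-squares fit of the residual on that column.
    """
    n, p = X.shape
    beta = np.zeros(p)
    resid = np.asarray(y, dtype=float).copy()
    col_ss = (X ** 2).sum(axis=0)   # squared norm of each column
    path = []                       # coordinate selected at each step
    for _ in range(n_steps):
        # Least-squares coefficient of the residual on each single column.
        coefs = X.T @ resid / col_ss
        # Drop in residual sum of squares each column would achieve.
        gains = coefs ** 2 * col_ss
        j = int(np.argmax(gains))
        beta[j] += nu * coefs[j]
        resid -= nu * coefs[j] * X[:, j]
        path.append(j)
    return beta, path


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 3))
    y = 2.0 * X[:, 0] - 1.0 * X[:, 2]   # noiseless toy response
    beta, path = l2boost(X, y, nu=0.1, n_steps=500)
    print(np.round(beta, 3))
```

Inspecting `path` shows how many consecutive steps are spent on one coordinate before another is selected, which is the kind of quantity the abstract describes in closed form; with a small `nu` the same few indices tend to recur many times.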