
    Covariate dimension reduction for survival data via the Gaussian process latent variable model

    The analysis of high-dimensional survival data is challenging, primarily due to overfitting, which occurs when spurious relationships inferred from training data fail to hold in test data. Here we propose a novel method of extracting a low-dimensional representation of covariates in survival data by combining the popular Gaussian Process Latent Variable Model (GPLVM) with a Weibull Proportional Hazards Model (WPHM). The combined model offers a flexible, non-linear, probabilistic method of detecting and extracting any intrinsic low-dimensional structure from high-dimensional data. By reducing the covariate dimension we aim to diminish the risk of overfitting and increase the robustness and accuracy with which we infer relationships between covariates and survival outcomes. In addition, we can simultaneously combine information from multiple data sources by expressing multiple datasets in terms of the same low-dimensional space. We present results from several simulation studies that illustrate a reduction in overfitting and an increase in predictive performance, as well as successful detection of intrinsic dimensionality. We provide evidence that it is advantageous to combine dimensionality reduction with survival outcomes rather than performing unsupervised dimensionality reduction on its own. Finally, we use our model to analyse experimental gene expression data, detecting and extracting a low-dimensional representation that allows us to distinguish high- and low-risk groups with superior accuracy compared to regression on the original high-dimensional data.
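    The survival side of the model is easy to write down. Below is a minimal numpy/scipy sketch of the Weibull proportional-hazards log-likelihood, assuming the low-dimensional covariates Z have already been produced by the GPLVM (here they are simply passed in); the parameterisation and all names are our own illustration, not the paper's code.

```python
import numpy as np
from scipy.optimize import minimize

def wphm_negloglik(params, Z, t, delta):
    """Negative Weibull proportional-hazards log-likelihood.

    Z: (n, q) low-dimensional covariates (in the paper, GPLVM latents).
    t: (n,) survival times; delta: (n,) 1 = event, 0 = censored.
    params = [log_lam, log_k, beta_1, ..., beta_q].
    """
    log_lam, log_k, beta = params[0], params[1], params[2:]
    lam, k = np.exp(log_lam), np.exp(log_k)
    eta = Z @ beta                                   # linear predictor
    log_h = np.log(lam * k) + (k - 1) * np.log(t) + eta   # log hazard
    H = lam * t**k * np.exp(eta)                     # cumulative hazard
    return -np.sum(delta * log_h - H)

# toy usage: two latent dimensions, the first informative
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))
t = rng.weibull(1.5, 200) * np.exp(-0.5 * Z[:, 0])
delta = np.ones(200)                                 # no censoring here
fit = minimize(wphm_negloglik, np.zeros(4), args=(Z, t, delta))
```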

    Spike and slab variable selection: Frequentist and Bayesian strategies

    Variable selection in the linear regression model takes many apparent faces from both frequentist and Bayesian standpoints. In this paper we introduce a variable selection method referred to as a rescaled spike and slab model. We study the importance of prior hierarchical specifications and draw connections to frequentist generalized ridge regression estimation. Specifically, we study the usefulness of continuous bimodal priors to model hypervariance parameters, and the effect scaling has on the posterior mean through its relationship to penalization. Several model selection strategies, some frequentist and some Bayesian in nature, are developed and studied theoretically. We demonstrate the importance of selective shrinkage for effective variable selection in terms of risk misclassification, and show this is achieved using the posterior from a rescaled spike and slab model. We also show how to verify a procedure's ability to reduce model uncertainty in finite samples using a specialized forward selection strategy. Using this tool, we illustrate the effectiveness of rescaled spike and slab models in reducing model uncertainty.
    Comment: Published at http://dx.doi.org/10.1214/009053604000001147 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
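    As a rough illustration of the hierarchy, here is a two-point simplification in numpy: each hypervariance is either a spike (near zero) or a slab, which is what produces selective shrinkage. The paper's rescaled model instead places continuous bimodal priors on the hypervariances and rescales the response; the values below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def spike_slab_draw(p, w=0.5, v_spike=1e-4, v_slab=1.0):
    """Draw p regression coefficients from a spike-and-slab prior.

    Each coordinate gets hypervariance v_slab with probability w
    (the slab, allowing a large effect) or v_spike otherwise
    (the spike, shrinking the coefficient towards zero).
    """
    in_slab = rng.random(p) < w
    variances = np.where(in_slab, v_slab, v_spike)
    beta = rng.normal(0.0, np.sqrt(variances))
    return beta, in_slab

beta, in_slab = spike_slab_draw(10)   # mostly near-zero, a few large
```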

    Orthogonalized smoothing for rescaled spike and slab models

    Rescaled spike and slab models are a new Bayesian variable selection method for linear regression models. In high dimensional orthogonal settings such models have been shown to possess optimal model selection properties. We review background theory and discuss applications of rescaled spike and slab models to prediction problems involving orthogonal polynomials. We first consider global smoothing and discuss potential weaknesses. Some of these deficiencies are remedied by using local regression. The local regression approach relies on an intimate connection between local weighted regression and weighted generalized ridge regression. An important implication is that one can trace the effective degrees of freedom of a curve as a way to visualize and classify curvature. Several motivating examples are presented.
    Comment: Published at http://dx.doi.org/10.1214/074921708000000192 in the IMS Collections (http://www.imstat.org/publications/imscollections.htm) by the Institute of Mathematical Statistics (http://www.imstat.org).
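    The connection to weighted generalized ridge regression is easy to make concrete. A minimal sketch, assuming an orthonormal polynomial basis (obtained here via QR on a Vandermonde matrix) and a per-coordinate penalty vector: in that setting both the fit and the effective degrees of freedom, the trace of the hat matrix, have closed forms. The basis, grid, and penalties are our own illustrative choices.

```python
import numpy as np

def gen_ridge(Q, y, lam):
    """Generalized ridge on an orthonormal design (Q'Q = I).

    lam: per-coordinate penalty vector. The effective degrees of
    freedom, trace(Q (I + diag(lam))^-1 Q') = sum 1/(1 + lam_j),
    can be traced as lam varies to visualise retained curvature.
    """
    z = Q.T @ y                        # coefficients of the LS fit
    beta = z / (1.0 + lam)             # coordinatewise shrinkage
    edf = np.sum(1.0 / (1.0 + lam))    # trace of the hat matrix
    return Q @ beta, edf

# orthonormal polynomial basis of degree 5 on a grid
x = np.linspace(-1, 1, 200)
Q, _ = np.linalg.qr(np.vander(x, 6, increasing=True))
y = np.sin(np.pi * x) + 0.1 * np.random.default_rng(2).normal(size=200)
fit, edf = gen_ridge(Q, y, lam=np.array([0, 0, 1, 1, 10, 10.0]))
```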

    Variable importance in binary regression trees and forests

    We characterize and study variable importance (VIMP) and pairwise variable associations in binary regression trees. A key component involves the node mean squared error for a quantity we refer to as a maximal subtree. The theory naturally extends from single trees to ensembles of trees and applies to methods like random forests. This is useful because, while importance values from random forests are used to screen variables (for example, to filter high-throughput genomic data in bioinformatics), very little theory exists about their properties.
    Comment: Published at http://dx.doi.org/10.1214/07-EJS039 in the Electronic Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of Mathematical Statistics (http://www.imstat.org).
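    The paper's VIMP is defined through maximal subtrees inside a tree; as a stand-in for readers unfamiliar with importance measures, here is the generic permutation form of importance that random forests popularised, written for any fitted classifier. This is not the paper's definition, only the screening quantity it is theorising about.

```python
import numpy as np

def permutation_vimp(predict, X, y, rng=None):
    """Importance of each covariate = drop in accuracy after
    permuting that covariate (breaking its link with y) while
    leaving all other covariates intact.
    """
    rng = rng or np.random.default_rng(0)
    baseline = np.mean(predict(X) == y)
    vimp = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        vimp[j] = baseline - np.mean(predict(X_perm) == y)
    return vimp
```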

    Building Morphological Chains for Agglutinative Languages

    In this paper, we build morphological chains for agglutinative languages by using a log-linear model for the morphological segmentation task. The model is based on the unsupervised morphological segmentation system called MorphoChains. We extend the MorphoChains log-linear model by expanding the candidate space recursively to cover more split points for agglutinative languages such as Turkish, whereas in the original model candidates are generated by considering only binary segmentations of each word. The results show that we improve the state-of-the-art Turkish scores by 12%, reaching an F-measure of 72%, and improve the English scores by 3%, reaching an F-measure of 74%. The system thus outperforms both MorphoChains and other well-known unsupervised morphological segmentation systems. The results indicate that candidate generation plays an important role in such an unsupervised log-linear model learned using contrastive estimation with negative samples.
    Comment: 10 pages; accepted and presented at CICLing 2017 (18th International Conference on Intelligent Text Processing and Computational Linguistics).
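    The key change, recursive expansion of the candidate space, can be sketched in a few lines. The scoring side (the log-linear model trained by contrastive estimation) is omitted; this only shows how multiple split points arise for an agglutinative word, with the Turkish example and the minimum-morph-length parameter being our own assumptions.

```python
def segmentations(word, min_len=2):
    """All segmentations of `word` into morphs of length >= min_len.

    The original MorphoChains model considers a single binary split
    (parent, suffix); recursing on the remainder covers the
    multi-suffix words typical of agglutinative languages.
    """
    segs = [[word]]                          # the unsplit candidate
    for i in range(min_len, len(word) - min_len + 1):
        for rest in segmentations(word[i:], min_len):
            segs.append([word[:i]] + rest)
    return segs

# Turkish "evlerden" ("from the houses") = ev + ler + den
assert ["ev", "ler", "den"] in segmentations("evlerden")
```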

    Industry Dynamics: Aggregate Uncertainty, Heterogeneity, and the Entry and Exit of Firms

    The aim of this paper is to investigate the relationship between heterogeneity at the firm level and the aggregate behavior of the industry as a whole. Empirical evidence on firm behavior documents widespread heterogeneity among firms in any industry, both cross-sectionally and over time. At the cross-sectional level, the data indicate that industries are made up of firms of different sizes, ages, and productivity levels. These differences imply that firms' responses to aggregate changes in the industry are varied, rather than identical as representative-firm models predict. It has also been noted that firms with similar initial conditions evolve differently, i.e., firms entering an industry at any given point in time follow very different capital growth paths. Whether these differences, be it at the cross-sectional level or over time, have a significant impact on the aggregate dynamics of the industry is the main focus of this paper.
    In the past few years, there has been growing interest in heterogeneous-agent models, motivated by the empirical observation of heterogeneity in agents' responses to aggregate changes. Incorporating heterogeneity into models with aggregate uncertainty is thus the logical next step, and allows us to evaluate whether heterogeneity at the microeconomic level plays a significant role in aggregate dynamics. Computing the equilibrium for such models has been a problem due to the lack of an analytical or closed-form solution. The trend has been towards using sophisticated computational techniques to arrive at the equilibrium; these methods have made it possible to solve and quantitatively analyze a whole range of models that would not have been tractable otherwise.
    The purpose of the paper is to develop a dynamic model of firm and industry behavior that can be used to understand the relationship between firm-level decisions, aggregate uncertainty, and the business cycle. The model assumes heterogeneous firms and studies their investment behavior in response to aggregate shocks, asking whether these differences at the microeconomic level have an impact on aggregate industry dynamics. Dropping the assumption of a representative firm also allows firm entry and exit to be incorporated into the model, which enables us to study changes in industry size and composition over time. This paper makes two contributions. First, it develops a model of investment behavior that reflects features observed in microeconomic data on firm behavior, and studies the impact of these features on aggregate investment dynamics and the evolution of industries; in relaxing assumptions such as that of a representative firm, it provides a richer and more realistic framework within which to study investment dynamics. The second contribution is methodological: in the absence of an analytical solution, the paper develops a computational technique that allows the model to be solved for an approximate (numerical) equilibrium.
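    To make the computational approach concrete, here is a deliberately stripped-down value-function-iteration sketch for a single firm's investment problem under a two-state aggregate shock. All grids, the profit function, and parameter values are our own illustrative assumptions, and the sketch ignores the entry/exit margin and the industry-equilibrium fixed point that the paper's full method must solve.

```python
import numpy as np

beta, delta = 0.95, 0.1                       # discounting, depreciation
k_grid = np.linspace(0.1, 10.0, 60)           # capital grid
z_grid = np.array([0.9, 1.1])                 # aggregate shock states
P = np.array([[0.8, 0.2],                     # shock transition matrix
              [0.2, 0.8]])

def profit(k, z):
    return z * k**0.7                         # decreasing returns to scale

V = np.zeros((k_grid.size, z_grid.size))      # value on the (k, z) grid
for _ in range(1000):
    EV = V @ P.T                              # EV[k', z] = E[V(k', z') | z]
    V_new = np.empty_like(V)
    for iz, z in enumerate(z_grid):
        # payoff[ik, ik'] = profit today minus investment k' - (1-d)k
        inv = k_grid[None, :] - (1 - delta) * k_grid[:, None]
        payoff = profit(k_grid, z)[:, None] - inv
        V_new[:, iz] = (payoff + beta * EV[:, iz][None, :]).max(axis=1)
    if np.abs(V_new - V).max() < 1e-8:        # Bellman iteration converged
        break
    V = V_new
```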

    Economic growth and the environment

    As the UK economy emerges from the downturn, attention is shifting to how best to return it to sustained and durable economic growth. But what does sustained and durable economic growth mean in the context of the natural environment? The UK and the global economy face significant environmental challenges, from averting dangerous climate change to halting biodiversity loss and protecting our ecosystems. There has been debate over whether it is possible to achieve economic growth whilst also tackling these challenges. This paper does not try to answer the question of what the sustainable level of economic growth might be, but instead examines the link between economic growth and the environment, and the role of environmental policy in managing the provision and use of natural assets. Many question the value of continued growth in GDP, given its limitations (including as a measure of wellbeing) and some evidence of its diminishing benefits within rich countries. However, it remains essential to support continued improvements in factors that affect people's wellbeing, from health and employment to education and quality of life, and to help the government deliver on a range of policy objectives: economic, social, and environmental.
    Keywords: environmental policy; natural environment; natural capital; growth; sustainable growth

    Characterizing $L_2$Boosting

    We consider $L_2$Boosting, a special case of Friedman's generic boosting algorithm applied to linear regression under $L_2$-loss. We study $L_2$Boosting for an arbitrary regularization parameter and derive an exact closed-form expression for the number of steps taken along a fixed coordinate direction. This relationship is used to describe $L_2$Boosting's solution path, to describe new tools for studying its path, and to characterize some of the algorithm's unique properties, including active set cycling, a property where the algorithm spends lengthy periods of time cycling between the same coordinates when the regularization parameter is arbitrarily small. Our fixed descent analysis also reveals a repressible condition that limits the effectiveness of $L_2$Boosting in correlated problems by preventing desirable variables from entering the solution path. As a simple remedy, a data augmentation method similar to that used for the elastic net is used to introduce $L_2$-penalization and is shown, in combination with decorrelation, to reverse the repressible condition and circumvent $L_2$Boosting's deficiencies in correlated problems. In itself, this presents a new explanation for why the elastic net is successful in correlated problems and why methods like LAR and the lasso can perform poorly in such settings.
    Comment: Published at http://dx.doi.org/10.1214/12-AOS997 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
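    A minimal sketch of componentwise $L_2$Boosting for linear regression: the step size nu plays the role of the regularization parameter, and recording which coordinate is chosen at each step makes the active-set cycling described above directly visible. Names and the setup are ours, not the paper's code.

```python
import numpy as np

def l2boost(X, y, nu=0.1, steps=500):
    """Componentwise L2Boosting under L2-loss.

    At each step, fit every coordinate to the current residual by
    least squares, pick the one giving the largest loss reduction,
    and move its coefficient a fraction nu of the way.
    """
    beta = np.zeros(X.shape[1])
    r = y.astype(float).copy()                 # current residual
    col_ss = (X**2).sum(axis=0)
    path = []                                  # chosen coordinate per step
    for _ in range(steps):
        g = X.T @ r / col_ss                   # univariate LS coefficients
        j = int(np.argmax(g**2 * col_ss))      # max L2-loss reduction
        beta[j] += nu * g[j]
        r -= nu * g[j] * X[:, j]
        path.append(j)                         # inspect path for cycling
    return beta, path
```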