10 research outputs found
Regression-based heterogeneity analysis to identify overlapping subgroup structure in high-dimensional data
Heterogeneity is a hallmark of complex diseases. Regression-based
heterogeneity analysis, which is directly concerned with outcome-feature
relationships, has led to a deeper understanding of disease biology. Such an
analysis identifies the underlying subgroup structure and estimates the
subgroup-specific regression coefficients. However, most of the existing
regression-based heterogeneity analyses can only address disjoint subgroups;
that is, each sample is assigned to only one subgroup. In reality, some samples
have multiple labels, for example, many genes have several biological
functions, and some cells of pure cell types transition into other types over
time, which suggests that their outcome-feature relationships (regression
coefficients) can be a mixture of relationships in more than one subgroup, and
as a result, the disjoint subgrouping results can be unsatisfactory. To this
end, we develop a novel approach to regression-based heterogeneity analysis,
which takes into account possible overlaps between subgroups and high data
dimensions. A subgroup membership vector is introduced for each sample, which
is combined with a loss function. Considering the lack of information arising
from small sample sizes, a norm penalty is developed for each membership
vector to encourage similarity in its elements. A sparse penalization is also
applied for regularized estimation and feature selection. Extensive simulations
demonstrate its superiority over direct competitors. The analysis of Cancer
Cell Line Encyclopedia data and lung cancer data from The Cancer Genome Atlas
shows that the proposed approach can identify an overlapping subgroup structure
with favorable performance in prediction and stability. Comment: 33 pages, 16 figures
Robust, fuzzy, and parsimonious clustering based on mixtures of Factor Analyzers
A clustering algorithm that combines the advantages of fuzzy clustering and robust statistical estimators is presented. It is based on mixtures of Factor Analyzers, combined with trimming and the constrained estimation of scatter matrices in a modified maximum likelihood approach. The algorithm generates a set of membership values, which are used to fuzzy-partition the data set and to contribute to the robust estimates of the mixture parameters. Modeling clusters by Gaussian Factor Analysis allows for dimension reduction and for discovering local linear structures in the data. Applying the new methodology to artificial data shows that it is resistant to different types of contamination. The tuning parameters (the trimming level, the fuzzifier parameter, the number of clusters, and the value of the scatter matrices constraint) are briefly discussed, together with some heuristic tools for choosing them. Finally, a real data set is analyzed to show how intermediate membership values are estimated for observations lying where clusters overlap, while cluster cores are composed of observations that are assigned to a cluster in a crisp way. Ministerio de Economía y Competitividad grant MTM2017-86061-C2-1-P, and Consejería de Educación de la Junta de Castilla y León and FEDER grants VA005P17 and VA002G1
A comparison of the CAR and DAGAR spatial random effects models with an application to diabetics rate estimation in Belgium
When hierarchically modelling an epidemiological phenomenon on a finite collection of sites in space, one must always take a latent spatial effect into account in order to capture the correlation structure that links the phenomenon to the territory. In this work, we compare two autoregressive spatial models that can be used for this purpose: the classical CAR model and the more recent DAGAR model. Unlike the former, the latter has a desirable property: its ρ parameter can be naturally interpreted as the average neighbor pair correlation, and this parameter can be directly estimated when the effect is modelled using a DAGAR rather than a CAR structure. As an application, we model the diabetics rate in Belgium in 2014 and show the adequacy of these models in predicting the response variable when no covariates are available.
A Statistical Approach to the Alignment of fMRI Data
Multi-subject functional magnetic resonance imaging (fMRI) studies are critical. Anatomical and functional structure varies across subjects, so image alignment is necessary. We define a probabilistic model to describe functional alignment. By imposing a prior distribution, namely the matrix Fisher von Mises distribution, on the orthogonal transformation parameter, the anatomical information is embedded in the estimation of the parameters, i.e., the combination of spatially distant voxels is penalized. Real applications show an improvement in classification and in the interpretability of the results compared to various functional alignment methods.
Methods for Modelling Response Styles
Abstract
Rating scales are ubiquitous in empirical research, especially in the social sciences, where they are used for measuring abstract concepts such as opinion or attitude. Survey questions typically employ rating scales, for example when persons are asked to self-report their perceptions of films or their job satisfaction. Yet, using a rating scale is subjective. Some persons may use only the middle of the rating scale, whilst others choose to use only the extremes.
Consequently, persons with the same opinion may very well answer the same survey question using different ratings. This leads to the response style problem: How can we take into account that different ratings can potentially have different meanings to different persons when analyzing such data?
This dissertation makes methodological and empirical contributions towards modelling rating scale data while accounting for such differences in response styles. The general approach is to identify individuals in the data who exhibit similar response styles, and to extract substantive information only within such groups. These elements naturally lead to the synthesis of cluster analysis and dimensionality reduction methods. In order to identify these response styles, responses to multiple survey questions are used to assess within-subject rating scale usage. Both non-parametric and parametric approaches are formulated and studied, and accompanying open-source software implementations are made available. The added value of the developed algorithms is illustrated by applying them to empirical data. Applications range from sensometrics and brand studies to psychology and political science.
SIS 2017. Statistics and Data Science: new challenges, new generations
The 2017 SIS Conference aims to highlight the crucial role of Statistics in Data Science. In this new domain of ‘meaning’ extracted from data, the increasing amount of data produced and made available in databases has brought new challenges. These involve different fields: statistics, machine learning, information and computer science, optimization, and pattern recognition. Together, these fields make a considerable contribution to the analysis of ‘Big data’, open data, and relational and complex data, both structured and unstructured. The aim is to collect contributions from the different domains of Statistics on high-dimensional data quality validation, sample extraction, dimensionality reduction, pattern selection, data modelling, hypothesis testing, and confirming conclusions drawn from the data.