Heterogeneity is a hallmark of complex diseases. Regression-based
heterogeneity analysis, which is directly concerned with outcome-feature
relationships, has led to a deeper understanding of disease biology. Such an
analysis identifies the underlying subgroup structure and estimates the
subgroup-specific regression coefficients. However, most existing
regression-based heterogeneity analyses can only address disjoint subgroups;
that is, each sample is assigned to exactly one subgroup. In reality, some samples
carry multiple labels: for example, many genes have several biological
functions, and some cells of pure cell types transition into other types over
time. This suggests that their outcome-feature relationships (regression
coefficients) can be a mixture of the relationships in more than one subgroup, so
disjoint subgrouping results can be unsatisfactory. To this
end, we develop a novel approach to regression-based heterogeneity analysis,
which takes into account possible overlaps between subgroups and high data
dimensions. A subgroup membership vector is introduced for each sample, which
is combined with a loss function. To compensate for the limited information
available from small sample sizes, an l2-norm penalty is imposed on each membership
vector to encourage similarity among its elements. A sparse penalty is also
applied for regularized estimation and feature selection. Extensive simulations
demonstrate the superiority of the proposed approach over direct competitors. The analysis of Cancer
Cell Line Encyclopedia data and lung cancer data from The Cancer Genome Atlas
shows that the proposed approach can identify an overlapping subgroup structure
with favorable performance in prediction and stability.

Comment: 33 pages, 16 figures
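
As a rough illustration of the ingredients named above (a membership vector per sample combined with a loss, an l2-type penalty on each membership vector, and a sparse penalty on the coefficients), the following minimal sketch evaluates one possible penalized objective. Everything specific here is an assumption made for illustration and is not taken from the paper: the squared-error loss, the mixing of subgroup coefficients as `P @ B`, the penalty form (squared distance of each membership vector from the uniform vector), the lasso penalty, and the names `overlapping_objective`, `lam_membership`, and `lam_sparse`.

```python
import numpy as np


def overlapping_objective(X, y, B, P, lam_membership=0.1, lam_sparse=0.1):
    """Evaluate a hypothetical penalized objective for overlapping subgroups.

    X: (n, p) features; y: (n,) outcomes.
    B: (K, p) subgroup-specific regression coefficients.
    P: (n, K) membership vectors, one row per sample (rows assumed to sum to 1).
    """
    n, K = P.shape
    # Assumption: each sample's coefficients are a membership-weighted
    # mixture of the K subgroup coefficient vectors.
    sample_coef = P @ B                          # shape (n, p)
    fitted = np.sum(X * sample_coef, axis=1)     # row-wise inner products
    loss = 0.5 * np.mean((y - fitted) ** 2)      # assumed squared-error loss
    # Assumption: l2-type penalty pulling each membership vector toward the
    # uniform vector, which encourages similarity among its elements.
    membership_pen = lam_membership * np.sum((P - 1.0 / K) ** 2)
    # Assumption: lasso penalty for sparsity and feature selection.
    sparse_pen = lam_sparse * np.sum(np.abs(B))
    return loss + membership_pen + sparse_pen


# Tiny usage example with two samples, two features, two subgroups.
X = np.eye(2)
B = np.eye(2)
P_uniform = np.full((2, 2), 0.5)   # fully overlapping memberships
P_disjoint = np.eye(2)             # each sample in exactly one subgroup
y = np.array([0.5, 0.5])
obj_uniform = overlapping_objective(X, y, B, P_uniform)
obj_disjoint = overlapping_objective(X, y, B, P_disjoint)
```

In this toy setup the uniform memberships incur no membership penalty, while the disjoint memberships do; in an actual fit the tuning parameters would balance that pull against the loss.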