A new strategy is proposed for building easy-to-interpret predictive models
in the context of a high-dimensional dataset, with a large number of highly
correlated explanatory variables. The strategy is based on a first step of
variable clustering using the CLustering of Variables around Latent Variables
(CLV) method. The hierarchical clustering dendrogram is then explored in
order to select the explanatory variables sequentially, in a
group-wise fashion. To fit the model, the dendrogram is used as
the base-learner in an L2-boosting procedure. The proposed approach, named
lmCLV, is illustrated on a simulated toy example in which the clusters
and the predictive equation are known in advance, and on a real case study dealing
with the authentication of orange juices based on 1H-NMR spectroscopic
analysis. In both illustrative examples, the procedure achieved predictive
performance similar to that of other methods, while being easier to
interpret. It is available in the R package ClustVarLV.
Comment: 24 pages, 7 figures