Why use component-based methods in sensory science?
This paper discusses the advantages of using so-called component-based methods in sensory science. For instance, principal component analysis (PCA) and partial least squares (PLS) regression are used widely in the field; here we discuss these and other methods for handling one block of data, as well as several blocks of data. Component-based methods all share a common feature: they define linear combinations of the variables to achieve data compression, interpretation, and prediction. The common properties of the component-based methods are listed and their advantages illustrated by examples. The paper equips practitioners with a list of solid and concrete arguments for using this methodology.
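To make the idea of a component as a linear combination of the variables concrete, the following minimal Python sketch computes the first principal component of a small data table by power iteration on the sample covariance matrix. It is an illustration only and is not taken from the paper: the function name `first_principal_component`, the iteration count, and the toy data are assumptions, and power iteration is just one simple way to obtain the leading component.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (weights, scores) of the first principal component.

    rows: list of observations, each a list of p numeric variables.
    The weights define the linear combination of the centered variables;
    the scores are that combination evaluated for every observation.
    Assumes the data have nonzero variance and that the starting vector
    is not orthogonal to the leading eigenvector.
    """
    n, p = len(rows), len(rows[0])
    # Center each variable on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    w = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        v = [sum(cov[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    # Component scores: one linear combination per observation.
    scores = [sum(c[j] * w[j] for j in range(p)) for c in centered]
    return w, scores


if __name__ == "__main__":
    toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]  # hypothetical data
    weights, scores = first_principal_component(toy)
    print("weights:", weights)
    print("scores:", scores)
```

In this toy table the second variable is exactly twice the first, so a single component carries all of the variance, which is the data-compression property the abstract refers to.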
A primer on correlation-based dimension reduction methods for multi-omics analysis
The continuing advances of omic technologies mean that it is now more
tangible to measure the numerous features collectively reflecting the molecular
properties of a sample. When multiple omic methods are used, statistical and
computational approaches can exploit these large, connected profiles.
Multi-omics is the integration of different omic data sources from the same
biological sample. In this review, we focus on correlation-based dimension
reduction approaches for single omic datasets, followed by methods for pairs of
omics datasets, before detailing further techniques for three or more omic
datasets. We also briefly detail network methods that apply when three or more omic
datasets are available and that complement correlation-oriented tools. To aid
readers new to this area, these are all linked to relevant R packages that can
implement these procedures. Finally, we discuss scenarios of experimental
design and present road maps that simplify the selection of appropriate
analysis methods. This review will guide researchers through the emerging
methods for multi-omics and help them integrate diverse omic datasets
appropriately and embrace the opportunity of population multi-omics. Comment: 30 pages, 2 figures, 6 tables
Regularized generalized canonical correlation analysis for multiblock or multigroup data analysis
This paper presents an overview of methods for the analysis of data structured in blocks of variables or in groups of individuals. More specifically, regularized generalized canonical correlation analysis (RGCCA), which is a unifying approach for multiblock data analysis, is extended to be also a unifying tool for multigroup data analysis. The versatility and usefulness of our approach are illustrated on two real datasets.
Modelling with heterogeneity
When collecting survey data for a specific study, it is usual to have some background information, for example in the form of socio-demographic variables. In our context, these variables may be useful in identifying potential sources of heterogeneity. Resolving the heterogeneity may mean performing distinct analyses of the main variables for distinct and homogeneous segments of the data, defined in terms of the segmentation variables. In 2009 Gastón Sánchez proposed the PATHMOX algorithm with the aim of automatically detecting heterogeneous segments within the PLS-PM methodology. This technique, based on recursive partitioning, produces a segmentation tree with a distinct path model in each node. At each node PATHMOX searches among all splits based on the segmentation variables and chooses the one resulting in the maximal difference between the PLS-PM models in the child nodes. Starting from the work of Sánchez, the purpose of the thesis is to extend PATHMOX in the following points:
1. Extension of the PATHMOX approach to detect which constructs differentiate segments. The PATHMOX approach uses an F-global test to identify the best split into heterogeneous segments. Following the same approach, the testing can be extended to find which endogenous constructs, and which relationships between constructs, are responsible for the difference between the segments.
2. Extension of the PATHMOX approach to deal with the factor invariance problem. Originally PATHMOX adapted the estimation of constructs to each detected segment; that is, once a split is performed, the PLS-PM model is recalculated in every child node. This leads to the problem of invariance: if the estimates of the latent variables are recalculated in each terminal node of the tree, we cannot be sure that the behavior of two individuals who belong to two different terminal nodes is comparable. To solve this problem we will propose an invariance test based on the χ² (chi-squared) distribution, whose goal is to test whether the measurement models of the terminal nodes can be considered equal to one another.
3. Extension of the PATHMOX approach to overcome the parametric hypothesis of the F-test. One criticism of the PATHMOX approach, applied in the context of partial least squares path modeling, is that it compares two structural models with a parametric test based on the hypothesis that the residuals are normally distributed. PLS-PM is generally used to model survey data, which are often asymmetrically distributed (skewed). Moreover, the PLS-PM methodology makes no assumptions about the distribution of the data. Hence, the parametric F-test used in PATHMOX may represent a limitation of the methodology. To overcome this limitation, we will extend the test to the context of LAD robust regression.
4. Generalization of the PATHMOX algorithm to any type of modeling methodology. The PATHMOX algorithm has been proposed to analyze heterogeneity in the context of partial least squares path modeling. However, the algorithm can be applied to many other kinds of methodologies, given an appropriate split criterion. To generalize PATHMOX we will consider three distinct scenarios: regression analysis (OLS, LAD, and GLM regression) and principal component analysis.
5. Implementation of the methodology as a dedicated library for the R software.
In a scientific study, the analysis focuses on the variables collected to answer the study's research questions. However, many studies also collect additional variables, such as socio-demographic variables: sex, social status, age. These are known as segmentation variables, since they can be useful in identifying potential sources of heterogeneity. Analyzing heterogeneity means performing distinct analyses for distinct homogeneous groups defined by the segmentation variables. Often, if some prior knowledge is available, this heterogeneity can be controlled by defining segments a priori. However, sufficient knowledge to define the groups a priori is not always available. Moreover, many segmentation variables may be available for analyzing heterogeneity with an appropriate algorithm. One algorithm developed for this purpose is PATHMOX, proposed by Gastón Sánchez in 2009. This technique, using recursive partitioning, produces a segmentation tree with a distinct model associated with each node. At each node, PATHMOX searches among all segmentation variables for the one that produces the maximal difference between the models of the child nodes. Taking the work of Gastón Sánchez as its starting point, this thesis proposes to:
1. Extend PATHMOX to identify the constructs responsible for the differences. PATHMOX allows distinct models to be detected in a data set without identifying groups a priori. However, PATHMOX is a global criterion. To identify the distinct equations and coefficients responsible for the splits, we will introduce the F-block and F-coefficient tests.
2. Extend PATHMOX to solve the invariance problem. In the context of PLS-PM (Partial Least Squares Path Modeling), PATHMOX works by fixing the causal relationships among the latent variables, and the goal is to identify models whose path coefficients are as different as possible, without placing any restriction on the measurement model. Therefore, each time a significant difference is identified and two child nodes are defined, the causal relationships among the latent variables are the same in both child models, but the estimate of each latent variable is recalculated, and we cannot be sure that the behavior of two individuals who belong to two different nodes is comparable. To solve this problem we will propose an invariance test based on the χ² (chi-squared) distribution, whose goal is to verify whether the models of the terminal nodes can be considered equal to one another.
3. Extend PATHMOX to overcome the parametric hypothesis of the F-test. One criticism of PATHMOX, applied in the context of PLS-PM, is that the algorithm compares two structural models with a parametric test based on the hypothesis that the residuals are normally distributed. To overcome this limitation, we will extend the test to compare two robust LAD regressions in the context of PLS.
4. Generalize the PATHMOX algorithm to any type of methodology. The PATHMOX algorithm has been proposed to analyze heterogeneity in the context of PLS-PM. However, the algorithm can be applied to many other types of methodologies, given an appropriate split criterion. To generalize PATHMOX we will consider three distinct scenarios: regression models (OLS, LAD, and GLM models) and principal component analysis.
5. Implement the methodology as a dedicated library for the R software.
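The split search described in this abstract can be illustrated with a deliberately simplified Python sketch. It is not the thesis implementation: the PLS-PM path models are replaced by a simple linear regression of one response on one predictor, a Chow-type F statistic stands in for the F-global test, and only binary "one level versus the rest" splits are tried. The function names `ols_rss`, `chow_f`, and `best_split`, the minimum segment size, and the toy data are all assumptions made for illustration.

```python
def ols_rss(xs, ys):
    """Residual sum of squares of the simple regression y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx  # assumes x is not constant within the segment
    a = my - b * mx
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))


def chow_f(xs, ys, in_left):
    """Chow-type F statistic comparing one pooled model with two segment models."""
    k = 2  # parameters per model: intercept and slope
    left = [(x, y) for x, y, g in zip(xs, ys, in_left) if g]
    right = [(x, y) for x, y, g in zip(xs, ys, in_left) if not g]
    rss_pooled = ols_rss(xs, ys)
    rss_split = ols_rss(*zip(*left)) + ols_rss(*zip(*right))
    n = len(xs)
    # Assumes the segment models do not fit perfectly (rss_split > 0).
    return ((rss_pooled - rss_split) / k) / (rss_split / (n - 2 * k))


def best_split(xs, ys, segmentation, min_size=3):
    """Return (F, variable, level) for the split with the largest F statistic.

    segmentation maps each segmentation variable name to a list of labels,
    one per observation.
    """
    best = None
    for name, labels in segmentation.items():
        for level in sorted(set(labels)):
            in_left = [lab == level for lab in labels]
            n_left = sum(in_left)
            if min(n_left, len(in_left) - n_left) < min_size:
                continue  # skip splits that leave a segment too small to fit
            f = chow_f(xs, ys, in_left)
            if best is None or f > best[0]:
                best = (f, name, level)
    return best


if __name__ == "__main__":
    # Hypothetical data: the slope is about +1 in group "a" and -1 in group "b".
    xs = [1, 2, 3, 4, 1, 2, 3, 4]
    ys = [1.1, 1.9, 3.1, 3.9, 4.1, 2.9, 2.1, 0.9]
    seg = {"group": ["a"] * 4 + ["b"] * 4, "other": ["u", "v"] * 4}
    print(best_split(xs, ys, seg))
```

Applying `best_split` recursively to each resulting segment, and stopping when no split is significant or the segments become too small, would give a segmentation tree in the spirit of the recursive partitioning described above.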
The Environment and Child Development: A Multivariate Approach
The environment that a child grows up in has a profound effect on their development. For example, key outcomes such as academic ability or behaviour, cognitive ability and the neurobiology of the brain have been found to be associated with a child’s environment. However, the factors that make up a child’s environment are highly complex, and yet the majority of research treats socioeconomic status (SES) as a single number. In addition, the environment is related to several aspects of a child’s development, yet there is very little research considering how these multiple levels of development relate to each other and interact. This thesis builds on the current literature by investigating how multiple aspects of a child’s environment combine to create an environmental profile that is associated with positive child development. We endeavour to address three questions:
1) Which environmental factors most strongly relate to a child’s academic ability, behaviour, cognitive ability and neural development?
2) Does the wider environment mediate the relationship between standard measures of SES and child development?
3) How might the environment impact academic and behaviour outcomes? In particular, is this relationship mediated by a child’s cognition or the structural and functional connectivity of their brain?
Children aged 7-11 years (N=97) and their caregivers took part in this study. Several environmental domains and child behaviour were assessed through caregiver and child questionnaires. Academic and cognitive ability were measured using behavioural assessments. Resting-state functional connectivity was measured using a magnetoencephalography (MEG) scan, and structural connectivity was measured in an optional MRI scan on a separate visit (N=87).
Partial Least Squares (PLS) methods identified significant relationships between the environment and child development. Multiple environmental domains were found to be reliably related to each aspect of child development. Furthermore, the wider environmental domains mediated the association between SES measures and each aspect of child development. Finally, cognition and the structural connectivity of a child’s brain mediated the association between the environment and academic outcomes. This was not found for the behavioural outcomes.
This thesis provides key advances towards addressing the considerable methodological challenge presented in the investigation of the complex relationships between a child’s environment and multiple aspects of their development. We believe that this work will complement the available research to date and provide important detail to enable practitioners and policymakers to better support children at risk from disadvantaged environments. Funded by the Medical Research Council
SIS 2017. Statistics and Data Science: new challenges, new generations
The 2017 SIS Conference aims to highlight the crucial role of Statistics in Data Science. In this new domain of ‘meaning’ extracted from data, the growing amount of data produced and available in databases has brought new challenges. These involve different fields: statistics, machine learning, information and computer science, optimization, and pattern recognition. Together, these fields make a considerable contribution to the analysis of ‘Big data’, open data, relational and complex data, both structured and unstructured. The aim is to collect contributions from the different domains of Statistics on high-dimensional data quality validation, sample extraction, dimension reduction, pattern selection, data modelling, hypothesis testing, and confirming conclusions drawn from the data.