
    Combining Shapley value and statistics to the analysis of gene expression data in children exposed to air pollution

    Background: In gene expression analysis, statistical tests for differential gene expression provide lists of candidate genes that each have, individually, a sufficiently low p-value. However, interpreting each single p-value within complex systems involving several interacting genes is problematic. In parallel, over the last sixty years, game theory has been applied to political and social problems to assess the power of interacting agents in forcing a decision and, more recently, to represent the relevance of genes in response to certain conditions. Results: In this paper we introduce a bootstrap procedure to test the null hypothesis that each gene has the same relevance between two conditions, where the relevance is represented by the Shapley value of a particular coalitional game defined on a microarray data-set. This method, called Comparative Analysis of Shapley value (CASh for short), is applied to data concerning gene expression in children differentially exposed to air pollution. The results provided by CASh are compared with the results from a parametric statistical test for differential gene expression. Both lists of genes, from CASh and from the t-test, are informative enough to discriminate exposed subjects on the basis of their gene expression profiles. While many genes are selected by both CASh and the parametric test, the biological interpretation of the differences between the two selections is more interesting, suggesting a different interpretation of the main biological pathways in gene expression regulation for exposed individuals. A simulation study suggests that CASh offers more power than the t-test for detecting differential gene expression variability. Conclusion: CASh is successfully applied to gene expression analysis of a data-set where the joint expression behavior of genes may be critical to characterize the expression response to air pollution. We demonstrate a synergistic effect between coalitional games and statistics that resulted in a selection of genes with a potential impact on the regulation of complex pathways.
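
As a rough illustration of what "the Shapley value of a coalitional game defined on a microarray data-set" can look like, the sketch below computes exact Shapley values for a tiny game. The game definition is an assumption made for this example (the worth of a coalition of genes is the fraction of samples whose set of abnormally expressed genes lies entirely inside the coalition); the abstract does not spell out the game, and the names `supports`, `genes`, `v` and `shapley_values` are hypothetical. The bootstrap test itself is not shown.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, v):
    """Exact Shapley values by enumerating all coalitions (exponential cost)."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            # Probability weight of a coalition of this size preceding player i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (v(s | {i}) - v(s))
        phi[i] = total
    return phi


# Hypothetical toy data: for each sample, the genes flagged as abnormally expressed.
supports = [{"g1"}, {"g1", "g2"}, {"g2", "g3"}, {"g1", "g2", "g3"}]
genes = ["g1", "g2", "g3"]


def v(coalition):
    # Assumed game: fraction of samples whose non-empty support is inside the coalition.
    return sum(1 for s in supports if s and s <= coalition) / len(supports)


phi = shapley_values(genes, v)
for gene in genes:
    print(gene, round(phi[gene], 4))
```

Under this assumed game each sample splits its weight equally among the genes in its support, so a gene that is abnormal in many samples with small supports gets a high value. Exact enumeration is only feasible for a handful of genes.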

    Using coalitional games on biological networks to measure centrality and power of genes

    Motivation: The interpretation of gene interaction in biological networks generates the need for a meaningful ranking of network elements. Classical centrality analysis ranks network elements according to their importance but may fail to reflect the power of each gene in interaction with the others. Results: We introduce a new approach using coalitional games to evaluate the centrality of genes in networks, taking into account the genes' interactions. The Shapley value for coalitional games is used to express the power of each gene in interaction with the others and to stress the centrality of certain hub genes in the regulation of biological pathways of interest. The main improvement of this contribution, with respect to previous applications of game theory to gene expression analysis, is a finer resolution of the gene interaction investigated in the model, which is based on pairwise relationships of genes in the network. In addition, the new approach allows for the integration of a priori knowledge about genes playing a key function in a certain biological process. An approximation method for practical computation on large biological networks, together with a comparison with other centrality measures, is also presented.
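
The abstract mentions an approximation method for large networks without describing it. One common way to approximate Shapley values, shown here only as a sketch and not as the paper's method, is to average marginal contributions over random orderings of the nodes. The game used below is an assumption chosen for illustration (the worth of a set of genes is the number of network edges with both endpoints in the set), and `edges_inside`, `sampled_shapley` and the toy graph are hypothetical names and data.

```python
import random


def edges_inside(coalition, edges):
    """Assumed game: number of edges whose two endpoints both lie in the coalition."""
    return sum(1 for a, b in edges if a in coalition and b in coalition)


def sampled_shapley(nodes, edges, n_permutations=2000, seed=0):
    """Monte Carlo Shapley estimate: average marginal contribution over random orderings."""
    rng = random.Random(seed)
    totals = {n: 0.0 for n in nodes}
    order = list(nodes)
    for _ in range(n_permutations):
        rng.shuffle(order)
        coalition = set()
        previous = 0
        for node in order:
            coalition.add(node)
            current = edges_inside(coalition, edges)
            totals[node] += current - previous  # marginal contribution of this node
            previous = current
    return {n: t / n_permutations for n, t in totals.items()}


# Hypothetical toy network of five genes.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]

estimate = sampled_shapley(nodes, edges)
degree = {n: sum(1 for e in edges if n in e) for n in nodes}
for n in nodes:
    print(n, round(estimate[n], 3), "exact:", degree[n] / 2)
```

For this particular game the exact Shapley value of a node is half its degree, which gives a convenient check on the sampled estimate. Games that encode richer pairwise relationships or prior knowledge would generally not have such a closed form, which is where sampling becomes useful.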

    Identification of gene-gene interactions for Alzheimer's disease using co-operative game theory

    Thesis (Ph.D.)--Boston University. The multifactorial nature of Alzheimer's Disease (AD) suggests that complex gene-gene interactions are present in AD pathways. Contemporary approaches to detect such interactions in genome-wide data are mathematically and computationally challenging. We investigated gene-gene interactions for AD using a novel algorithm based on cooperative game theory in 15 genome-wide association study (GWAS) datasets comprising a total of 11,840 AD cases and 10,931 cognitively normal elderly controls from the Alzheimer Disease Genetics Consortium (ADGC). We adapted this approach, which was developed originally for solving multi-dimensional problems in economics and social sciences, to compute a Shapley value statistic to identify genetic markers that contribute most to coalitions of SNPs in predicting AD risk. Treating each GWAS dataset as an independent discovery set, markers were ranked according to their contribution to coalitions formed with other markers. Using a backward elimination strategy, markers with low Shapley values were eliminated and the statistic was recalculated iteratively. We tested all two-way interactions between top Shapley markers in regression models which included the two SNPs (main effects) and a term for their interaction. Models yielding a p-value < 0.05 for the interaction term were evaluated in each of the other datasets and the results from all datasets were combined by meta-analysis. Statistically significant interactions were observed with multiple marker combinations in the APOE regions. My analyses also revealed statistically strong interactions between markers in 6 regions: CTNNA3-ATP11A (p=4.1e-07), CSMD1-PRKCQ (p=3.5e-08), DCC-UNC5CL (p=5.9e-08), CNTNAP2-RFC3 (p=1.16e-07), AACS-TSHZ3 (p=2.64e-07) and CAMK4-MMD (p=3.3e-07). The Shapley value algorithm outperformed Chi-Square and ReliefF in detecting known interactions between APOE and GAB2 in a previously published GWAS dataset. It was also more accurate than competing filtering methods in identifying simulated epistatic SNPs that are additive in nature, but its accuracy was low in identifying non-linear interactions. The game theory algorithm revealed strong interactions between markers in novel genes with weak main effects, which would have been overlooked if only markers with strong marginal association with AD were tested. This method will be a valuable tool for identifying gene-gene interactions for complex diseases and other traits.
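
The backward elimination step described above (drop low-scoring markers, then recompute the statistic on the survivors) can be sketched generically as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: `toy_score` is a hypothetical stand-in for the Shapley value statistic, the `synergy` table and marker names `m1` to `m5` are invented, and the `keep` and `drop_fraction` parameters are illustrative choices.

```python
def backward_eliminate(markers, score_fn, keep=2, drop_fraction=0.5):
    """Iteratively drop the lowest-scoring markers, rescoring the survivors each round."""
    remaining = list(markers)
    while len(remaining) > keep:
        scores = score_fn(remaining)  # recomputed on the current marker set only
        ranked = sorted(remaining, key=lambda m: scores[m], reverse=True)
        n_drop = max(1, int(len(ranked) * drop_fraction))
        n_keep = max(keep, len(ranked) - n_drop)
        remaining = ranked[:n_keep]
    return remaining


# Hypothetical pairwise "synergy" between markers; a placeholder for a real statistic.
synergy = {
    frozenset({"m1", "m2"}): 3.0,
    frozenset({"m1", "m3"}): 1.0,
    frozenset({"m3", "m4"}): 2.0,
    frozenset({"m4", "m5"}): 2.5,
    frozenset({"m2", "m5"}): 0.5,
}


def toy_score(current):
    # Score of a marker: total synergy with the other markers still in play.
    present = set(current)
    return {
        m: sum(w for pair, w in synergy.items() if m in pair and pair <= present)
        for m in current
    }


selected = backward_eliminate(["m1", "m2", "m3", "m4", "m5"], toy_score)
print(selected)
```

In this toy run `m4` scores highest in the first round but is dropped in the second, once its partners `m3` and `m5` have been removed. That is the reason for recomputing the statistic after each elimination rather than ranking once.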

    A review of common statistical methods for dealing with multiple pollutant mixtures and multiple exposures

    Traditional environmental epidemiology has consistently focused on studying the impact of single exposures on specific health outcomes, treating concurrent exposures as variables to be controlled. However, as the environment continues to change, humans increasingly face complex exposures to multi-pollutant mixtures. In this context, accurately assessing the impact of multi-pollutant mixtures on health has become a central concern in current environmental research. At the same time, the continuous development and optimization of statistical methods offers robust support for handling large datasets, strengthening the capability to conduct in-depth research on the effects of multiple exposures on health. To examine complicated exposure mixtures, we introduce commonly used statistical methods and their developments, such as weighted quantile sum regression, Bayesian kernel machine regression, toxic equivalency analysis, and others, delineating their applications, advantages, weaknesses, and the interpretability of their results. The review also provides guidance for researchers studying multi-pollutant mixtures, helping them select appropriate statistical methods and use R software for more accurate and comprehensive assessments of the impact of multi-pollutant mixtures on human health.

    CLADAG 2021 BOOK OF ABSTRACTS AND SHORT PAPERS

    The book collects the short papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS). The meeting has been organized by the Department of Statistics, Computer Science and Applications of the University of Florence, under the auspices of the Italian Statistical Society and the International Federation of Classification Societies (IFCS). CLADAG is a member of the IFCS, a federation of national, regional, and linguistically-based classification societies. It is a non-profit, non-political scientific organization, whose aims are to further classification research

    Non-communicable Diseases, Big Data and Artificial Intelligence

    This reprint includes 15 articles in the field of non-communicable diseases, big data, and artificial intelligence, overviewing the most recent advances in AI and their application potential in 3P medicine.

    Gaining Insight into Determinants of Physical Activity using Bayesian Network Learning

    Contains full text: 228326pre.pdf (preprint version, Open Access); 228326pub.pdf (publisher's version, Open Access). BNAIC/BeneLearn 202

    Occurrence and effects of pharmaceuticals in estuaries

    Pharmaceuticals have been identified as emerging contaminants of concern due to their widespread occurrence in the aquatic environment and their potential to be biologically active, yet the implications of their presence in the environment are not fully known. The plethora of commercially available pharmaceuticals makes it unfeasible to carry out detailed investigations on all of these compounds, and prioritisation schemes can provide a useful tool to determine how best to direct resources. Different prioritisation schemes were applied to the fifty most prescribed drugs in the UK, and their results were compared in order to assess the efficacy of these schemes. Many failed to accurately identify these risks, but a holistic approach using more than one method to generate a priority list of compounds may provide better protection for the environment. To date, most monitoring and ecotoxicological studies have focused on pharmaceuticals in freshwater, and there is less understanding of their occurrence and effects in estuaries. In order to gain insight into their spatio-temporal patterns, five pharmaceuticals were monitored in the Humber Estuary every other month for twelve months. Patterns in their spatial and temporal occurrence were related to source points, consumption patterns and environmental conditions. Eleven further estuaries were monitored to give an overall picture of pharmaceutical pollution in the UK. The Humber Estuary contained the highest levels of pharmaceuticals, and its concentrations of ibuprofen were the highest measured globally. Finally, ragworms (Hediste diversicolor) were exposed to diclofenac and metformin in a controlled experimental exposure, and the expression of selected target genes, ATP synthase and c-amp activated protein kinase, was measured. The highest level of metformin (1 µg l⁻¹) was found to significantly increase expression of ATP synthase, indicating that this drug induces environmental stress in H. diversicolor. Overall, this body of research has further contributed to the knowledge of pharmaceuticals as emerging contaminants in estuaries, and information on the occurrence, current levels and biological effects of the drugs studied may be of interest to regulators in their management decisions for such environments.

    Supplement 1

    The 24rd Norwegian Conference on Epidemiolog

    Enabling cardiovascular multimodal, high dimensional, integrative analytics

    While the understanding of cardiovascular morbidity has traditionally relied on the acquisition and interpretation of health data, advances in health technologies have enabled us to collect far larger amounts of health data. This thesis explores the application of advanced analytics that use powerful mechanisms for integrating health data across different modalities and dimensions into a single, holistic environment to better understand different diseases, with a focus on cardiovascular conditions. Different statistical methodologies are applied across a number of case studies, supported by a novel methodology to integrate and simplify data collection. The work culminates in showing how different data modalities explain different effects on morbidity: blood biomarkers, electrocardiogram recordings, RNA-Seq measurements, and population effects together build an understanding of a person's morbidity. More specifically, explainable artificial intelligence methods were employed on structured datasets from patients with atrial fibrillation to improve screening for the disease. Omics datasets, including RNA-sequencing and genotype datasets, were examined and new biomarkers were discovered, allowing a better understanding of atrial fibrillation. Electrocardiogram signal data were used to assess the early risk prediction of heart failure, enabling clinicians to use this novel approach to estimate future incidence. Population-level data were applied to the identification of associations and temporal trajectories of diseases to better understand disease dependencies in different clinical cohorts.