
    Finding Statistically Significant Interactions between Continuous Features

    The search for higher-order feature interactions that are statistically significantly associated with a class variable is of high relevance in fields such as genetics and healthcare, but the combinatorial explosion of the candidate space makes this problem extremely challenging in terms of computational efficiency and proper correction for multiple testing. While recent progress has been made on this challenge for binary features, we here present the first solution for continuous features. We propose an algorithm that overcomes the combinatorial explosion of the search space of higher-order interactions by deriving a lower bound on the p-value of each interaction, which enables us to massively prune interactions that can never reach significance and thereby gain statistical power. In our experiments, our approach efficiently detects all significant interactions in a variety of synthetic and real-world datasets. Comment: 13 pages, 5 figures, 2 tables; accepted to the 28th International Joint Conference on Artificial Intelligence (IJCAI 2019).
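The pruning principle described above — compute, for each candidate interaction, the smallest p-value it could ever attain, and discard the candidate if even that best case misses the significance threshold — can be sketched generically. The bound values below are hypothetical stand-ins, not the paper's actual lower bound for continuous features:

```python
def prune_candidates(candidates, p_lower_bound, alpha=0.05):
    """Keep only candidates whose minimum attainable p-value can still
    reach the significance threshold alpha; all other candidates can
    never be significant and are pruned without being tested."""
    return [c for c in candidates if p_lower_bound(c) <= alpha]

# Toy illustration: candidate interactions labelled with hypothetical
# pre-computed minimum attainable p-values.
bounds = {"AB": 0.001, "AC": 0.20, "BC": 0.04}
kept = prune_candidates(bounds, bounds.get, alpha=0.05)  # ["AB", "BC"]
```

Because "AC" cannot reach significance even in the best case, it is discarded before any (costly) test is run, which is where the statistical-power gain comes from: fewer tested hypotheses means a milder multiple-testing correction.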

    DEvIANT: Discovering Significant Exceptional (Dis-)Agreement Within Groups

    We strive to find contexts (i.e., subgroups of entities) under which exceptional (dis-)agreement occurs among a group of individuals, in any type of data featuring individuals (e.g., parliamentarians, customers) performing observable actions (e.g., votes, ratings) on entities (e.g., legislative procedures, movies). To this end, we introduce the problem of discovering statistically significant exceptional contextual intra-group agreement patterns. To handle the sparsity inherent in voting and rating data, we use Krippendorff's Alpha to assess agreement among individuals. We devise a branch-and-bound algorithm, named DEvIANT, to discover such patterns. DEvIANT exploits both closure operators and tight optimistic estimates. We derive analytic approximations of the confidence intervals (CIs) associated with patterns for a computationally efficient significance assessment. We prove that these approximate CIs are nested along the specialization of patterns, which allows us to incorporate pruning properties into DEvIANT to quickly discard non-significant patterns. An empirical study on several datasets demonstrates the efficiency and usefulness of DEvIANT. Technical report associated with the ECML/PKDD 2019 paper entitled "DEvIANT: Discovering Significant Exceptional (Dis-)Agreement Within Groups".
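The agreement measure at the heart of DEvIANT, Krippendorff's Alpha, compares observed to expected disagreement over all pairable ratings. A minimal sketch for nominal data follows (the paper's setting may use other metrics, e.g. ordinal, and this sketch ignores DEvIANT's pattern search entirely):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal data: 1 - D_o / D_e, computed
    from the standard coincidence matrix.

    units: one list of ratings per entity, holding the labels assigned
    by the individuals who rated it. Units with fewer than two ratings
    are not pairable and are skipped.
    """
    coincidence = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        # every ordered pair of ratings within a unit, weighted 1/(m-1)
        for a, b in permutations(ratings, 2):
            coincidence[(a, b)] += 1.0 / (m - 1)
    n_c = Counter()
    for (a, _), w in coincidence.items():
        n_c[a] += w
    n = sum(n_c.values())
    if n <= 1:
        return 1.0
    d_obs = sum(w for (a, b), w in coincidence.items() if a != b) / n
    d_exp = sum(n_c[a] * n_c[b] for a, b in permutations(n_c, 2)) / (n * (n - 1))
    return 1.0 if d_exp == 0 else 1.0 - d_obs / d_exp
```

Perfect agreement yields Alpha = 1, while systematic disagreement drives it negative; DEvIANT looks for contexts where this value deviates exceptionally from the overall level.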

    Molecular analysis of the midbrain dopaminergic niche during neurogenesis

    Midbrain dopaminergic (mDA) neurons degenerate in Parkinson’s disease and are one of the main targets for cell replacement therapies. However, a comprehensive view of the signals and cell types contributing to mDA neurogenesis is not yet available. By analyzing the transcriptome of the mouse ventral midbrain at the tissue and single-cell levels during mDA neurogenesis, we found that three recently identified radial glia types 1-3 (Rgl1-3) contribute to different key aspects of mDA neurogenesis. While Rgl3 expressed most extracellular matrix components and multiple ligands for pathways controlling mDA neuron development, such as Wnt and Shh, Rgl1-2 expressed most receptors. Moreover, we found that specific transcription factor networks explain the transcriptome of each individual radial glia type and suggest a function for it: a network controlling neurogenesis in Rgl1, progenitor maintenance in Rgl2, and secretion of the factors forming the mDA niche in Rgl3. Our results thus uncover a broad repertoire of developmental signals expressed by each midbrain cell type during mDA neurogenesis. Cells identified for their emerging importance are Rgl3, a niche cell type, and Rgl1, a neurogenic progenitor that expresses ARNTL, a transcription factor that we find is required for mDA neurogenesis.

    Statistical analysis of large-scale genomic data using pathway information

    Doctoral dissertation, Seoul National University Graduate School, Interdisciplinary Program in Bioinformatics, College of Natural Sciences, February 2018. Taesung Park.

    In the past two decades, rapid advances in DNA sequencing technology have enabled extensive investigations into human genetic architecture, especially the identification of genetic variants associated with complex traits. In particular, genome-wide association studies (GWAS) have played a key role in identifying genetic associations between single nucleotide variants (SNVs) and many complex biological pathologies. However, the genetic variants identified by many successful GWAS have explained only a modest part of the heritability of most phenotypes, and many hypotheses have been proposed to address this so-called missing-heritability issue, such as rare-variant association, gene-gene interaction, and multi-omics integration. Methods for rare-variant analysis arose from extending individual variant-level approaches to the gene level, and gene-level approaches to multiple phenotypes. Following this trend, and as the number of publicly available biological resources grows, recent methods for analyzing rare variants utilize pathway knowledge as a priori information. Many statistical methods for pathway-based analysis of rare variants have accordingly been proposed, but they analyze pathways individually. Neglecting the correlations between multiple pathways can produce misleading results, and pathway-based analysis of large-scale genetic datasets imposes a massive computational burden. Moreover, while a number of methods for pathway-based rare-variant analysis of multiple phenotypes have been proposed, none considers a unified model that incorporates multiple pathways. In this thesis, we propose novel statistical methods to analyze large-scale genetic datasets using pathway information: a Pathway-based approach using HierArchical components of collapsed RAre variants Of High-throughput sequencing data (PHARAOH), and PHARAOH-multi. PHARAOH extends generalized structured component analysis and is implemented within the framework of generalized linear models, to accommodate phenotype data arising from a variety of exponential-family distributions. PHARAOH constructs a single hierarchical model consisting of collapsed gene-level summaries and pathways, and analyzes all pathways simultaneously by imposing ridge-type penalties on both gene and pathway coefficient estimates; hence our method considers the correlations among pathways and handles an entire dataset in a single model. PHARAOH-multi further extends the original model to multivariate analysis while keeping the advantages of the original approach: it can identify associations between multiple phenotypes and multiple pathways in a single model, with the genes within pathways modeled as a hierarchy. Through simulation studies, PHARAOH was shown to have higher statistical power than existing pathway-based methods. In addition, a detailed simulation study of PHARAOH-multi demonstrated the advantages of multivariate over univariate analysis, and comparison studies showed the proposed approach to outperform existing multivariate pathway-based methods. Finally, we analyzed whole-exome sequencing data from a Korean population study to compare the proposed methods with previous pathway-based methods, using validated pathway databases. PHARAOH discovered 13 pathways for the liver enzymes, and PHARAOH-multi identified 8 pathways for multiple metabolic traits. Through a replication study using an independent, large-scale exome-chip dataset, we replicated many of the pathways discovered by the proposed methods and showed their biological relationship to the target traits.
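PHARAOH's full model estimates gene- and pathway-level components jointly under a GLM with ridge penalties. As a heavily simplified two-step sketch of the collapsing-plus-ridge idea only (not the actual joint estimation procedure), one can sum minor-allele counts per gene to form burden scores and then fit a ridge-penalized regression on them:

```python
import numpy as np

def collapse_genes(genotypes, gene_index):
    """Collapse rare-variant genotypes (n_samples x n_variants array of
    minor-allele counts 0/1/2) into per-gene burden scores by summing
    counts over each gene's variants.

    gene_index maps gene name -> list of variant column indices.
    Returns the sorted gene names and an n_samples x n_genes matrix.
    """
    genes = sorted(gene_index)
    burden = np.column_stack(
        [genotypes[:, gene_index[g]].sum(axis=1) for g in genes]
    )
    return genes, burden

def ridge_fit(X, y, lam=1.0):
    """Ridge-penalized least squares: solves (X'X + lam*I) beta = X'y.
    The penalty shrinks coefficients, which is what lets all genes and
    pathways be kept in one model without overfitting."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

In PHARAOH itself the burden scores feed latent pathway components and both levels carry their own ridge penalty, tuned jointly; the sketch above only illustrates why collapsing plus shrinkage keeps the problem well-posed when genes outnumber samples.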

    Multi-agent statistical discriminative sub-trajectory mining and an application to NBA basketball

    Improvements in tracking technology through optical and computer vision systems have enabled a greater understanding of the movement-based behaviour of multiple agents, including in team sports. In this study, a Multi-Agent Statistically Discriminative Sub-Trajectory Mining (MA-Stat-DSM) method is proposed that takes a set of binary-labelled agent trajectory matrices as input and incorporates Hausdorff distance to identify sub-matrices that statistically significantly discriminate between the two groups of labelled trajectory matrices. Utilizing 2015/16 SportVU NBA tracking data, agent trajectory matrices representing attacks, each consisting of the trajectories of five agents (the ball, shooter, last passer, shooter defender, and last passer defender), were truncated to the time interval following the receipt of the ball by the last passer, and labelled as effective or ineffective based on a definition of attack effectiveness that we devise in the current study. After identifying appropriate parameters for MA-Stat-DSM by iteratively applying it to all matches involving the two top- and two bottom-placed teams from the 2015/16 NBA season, the method was applied to selected matches and could identify and visualize the portions of plays (e.g., involving passing, on-, and/or off-the-ball movements) that were most relevant in rendering attacks effective or ineffective.
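The distance at the core of MA-Stat-DSM is the Hausdorff distance between sub-trajectories. A minimal sketch for a single pair of 2-D trajectories follows (the method itself compares multi-agent trajectory matrices, which this sketch does not cover):

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two trajectories, each an
    (n_points x dim) array of coordinates: the largest distance from any
    point of one trajectory to its nearest point on the other."""
    # pairwise Euclidean distances between the two point sets, via broadcasting
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

Unlike a point-by-point comparison, this distance is defined even when the two sub-trajectories have different lengths, which is what makes it usable for mining variable-length sub-matrices.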

    Discovering robust dependencies from data

    Science revolves around forming hypotheses, designing experiments, collecting data, and testing hypotheses. It was not until recently, with the advent of modern hardware and data analytics, that science shifted towards a big-data-driven paradigm that has led to unprecedented success across various fields. Perhaps the most astounding feature of this new era is that interesting hypotheses can now be discovered automatically from observational data. This dissertation investigates knowledge discovery procedures that do exactly this. In particular, we seek algorithms that discover the most informative models, able to compactly “describe” aspects of the phenomena under investigation, in both supervised and unsupervised settings. We consider interpretable models in the form of subsets of the original variable set. We want the models to capture all possible interactions (e.g., linear, non-linear) between all types of variables (e.g., discrete, continuous), and, lastly, we want their quality to be meaningfully assessed. For this, we employ information-theoretic measures: the fraction of information for the supervised setting, and the normalized total correlation for the unsupervised one. The former measures the reduction in uncertainty of the target variable conditioned on a model; the latter measures the information overlap of the variables included in a model. Without access to the true underlying data-generating process, we estimate these measures from observational data. This process is prone to statistical errors, which in our case manifest as biases towards larger models and can render results utterly random, hindering further analysis. We correct this behavior with notions from statistical learning theory. In particular, we propose regularized estimators that are unbiased under the hypothesis of independence, leading to robust estimation from limited data samples and for arbitrary dimensionalities.
    Moreover, we do this for models consisting of both discrete and continuous variables. Lastly, to discover the top-scoring models, we derive effective optimization algorithms for exact, approximate, and heuristic search. These algorithms are powered by admissible, tight, and efficient-to-compute bounding functions for our proposed estimators, which can be used to greatly prune the search space. Overall, the products of this dissertation can successfully assist data analysts with data exploration, discovering powerful description models, or concluding that no satisfactory models exist, implying that new experiments and data are required for the phenomena under investigation. This statement is supported by Materials Science researchers who corroborated our discoveries.
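The supervised score discussed above, the fraction of information, normalizes the mutual information between a model and the target by the target's entropy. A minimal plug-in sketch for discrete variables follows (the dissertation's contribution is precisely the regularized, bias-corrected versions of such estimators, which this sketch omits):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Plug-in Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def fraction_of_information(xs, ys):
    """Plug-in estimate of F(X; Y) = I(X; Y) / H(Y): the fraction of the
    target's uncertainty removed by conditioning on X. The plain plug-in
    estimator is biased upward on small samples, which is the problem
    the regularized estimators address."""
    h_y = entropy(ys)
    if h_y == 0:
        return 0.0
    mutual_info = entropy(xs) + h_y - entropy(list(zip(xs, ys)))
    return mutual_info / h_y
```

A score of 1 means the model determines the target completely; 0 means it carries no information about it — which is why, without bias correction, spurious scores above 0 on independent data are so misleading.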

    Statistical tools for general association testing and control of false discoveries in group testing

    In modern applications of high-throughput technologies, it is important to identify pairwise associations between variables, and desirable to use methods that are powerful and sensitive to a variety of association relationships. In the first part of the dissertation, we describe RankCover, a new non-parametric test of association between two variables that measures the concentration of paired ranked points. Here 'concentration' is quantified using a disk-covering statistic similar to those employed in spatial data analysis. Analysis of simulated datasets demonstrates that the method is robust and often powerful in comparison to competing general association tests. We also illustrate RankCover in the analysis of several real datasets, and use it to propose a method of testing the association of two variables while controlling for the effect of a third. In the second part of the dissertation, we describe statistical methodologies for testing hypotheses that can be collected into groups, with each group showing potentially different characteristics. Methods to control the family-wise error rate or false discovery rate in group testing have been proposed earlier, but they may not easily apply to expression quantitative trait loci (eQTL) data, for which certain structured alternatives may be defensible and enable the researcher to avoid overly conservative approaches. In an empirical Bayesian setting, we propose a new method to control the false discovery rate (FDR) for grouped hypotheses. Here, each gene forms a group, with the SNPs annotated to the gene corresponding to individual hypotheses. Heterogeneity of effect sizes across groups is handled by introducing a random-effects component. Our method, entitled Random Effects model and testing procedure for Group-level FDR control (REG-FDR), assumes a model for the alternative hypotheses of the eQTL data and controls the FDR by adaptive thresholding.
    Finally, we propose Z-REG-FDR, an approximate version of REG-FDR that uses only the Z-statistics of association between genotype and expression at each SNP. Simulations demonstrate that Z-REG-FDR performs similarly to REG-FDR but with much improved computational speed. We further propose an extension of Z-REG-FDR to a multi-tissue setting, providing the basis for gene-based multi-tissue analysis.
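For contrast with REG-FDR's adaptive thresholding, the baseline Benjamini-Hochberg step-up procedure, which ignores the group structure that REG-FDR exploits, can be sketched as follows (a generic illustration, not the dissertation's method):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Standard Benjamini-Hochberg step-up procedure for FDR control at
    level q. Sorts the p-values, finds the largest rank k such that
    p_(k) <= q * k / m, and rejects the k smallest p-values.
    Returns the (sorted) indices of the rejected hypotheses."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            n_reject = rank
    return sorted(order[:n_reject])
```

Plain BH treats every SNP-level hypothesis identically; REG-FDR instead models gene-level groups with random effects and adapts the threshold to them, which is what makes it less conservative on structured eQTL data.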