
    Discovering robust dependencies from data

    Science revolves around forming hypotheses, designing experiments, collecting data, and testing. It was not until recently, with the advent of modern hardware and data analytics, that science shifted towards a big-data-driven paradigm that led to unprecedented success across various fields. Perhaps the most astounding feature of this new era is that interesting hypotheses can now be automatically discovered from observational data. This dissertation investigates knowledge discovery procedures that do exactly this. In particular, we seek algorithms that discover the most informative models able to compactly “describe” aspects of the phenomena under investigation, in both supervised and unsupervised settings. We consider interpretable models in the form of subsets of the original variable set. We want the models to capture all possible interactions (e.g., linear, non-linear) between all types of variables (e.g., discrete, continuous), and lastly, we want their quality to be meaningfully assessed. For this, we employ information-theoretic measures: the fraction of information for the supervised setting, and the normalized total correlation for the unsupervised one. The former measures the uncertainty reduction of the target variable conditioned on a model, and the latter measures the information overlap of the variables included in a model. Without access to the true underlying data-generating process, we estimate these measures from observational data. This process is prone to statistical errors, and in our case the errors manifest as biases towards larger models. This can lead to situations where the results are utterly random, thereby hindering further analysis. We correct this behavior with notions from statistical learning theory. In particular, we propose regularized estimators that are unbiased under the hypothesis of independence, leading to robust estimation from limited data samples and arbitrary dimensionalities. Moreover, we do this for models consisting of both discrete and continuous variables. Lastly, to discover the top-scoring models, we derive effective optimization algorithms for exact, approximate, and heuristic search. These algorithms are powered by admissible, tight, and efficient-to-compute bounding functions for our proposed estimators, which can be used to greatly prune the search space. Overall, the products of this dissertation can successfully assist data analysts with exploring data, discovering powerful description models, or concluding that no satisfactory models exist, implying that new experiments and data are required for the phenomena under investigation. This statement is supported by Materials Science researchers who corroborated our discoveries.
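    For concreteness, here is a hedged sketch of the two scores as they are commonly defined in the information-theoretic literature (our notation; the dissertation's exact normalizations may differ). For a model X and target Y, the fraction of information is

        F(X; Y) = (H(Y) - H(Y | X)) / H(Y) = I(X; Y) / H(Y),

    i.e., the fraction of the target's uncertainty removed by conditioning on the model. For a model X = (X_1, ..., X_k), the total correlation is

        C(X) = H(X_1) + ... + H(X_k) - H(X_1, ..., X_k),

    typically normalized by an upper bound (e.g., sum_i H(X_i) - max_i H(X_i)) so that scores are comparable across model sizes.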

    Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks

    We present a procedure for effective estimation of entropy and mutual information from small-sample data, and apply it to the problem of inferring high-dimensional gene association networks. Specifically, we develop a James-Stein-type shrinkage estimator, resulting in a procedure that is highly efficient statistically as well as computationally. Despite its simplicity, we show that it outperforms eight other entropy estimation procedures across a diverse range of sampling scenarios and data-generating models, even in cases of severe undersampling. We illustrate the approach by analyzing E. coli gene expression data and computing an entropy-based gene-association network. A computer program is available that implements the proposed shrinkage estimator. Comment: 18 pages, 3 figures, 1 table
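    As a rough illustration of the estimator's shape, here is a minimal Python sketch of a James-Stein-type shrinkage entropy estimate that shrinks the empirical cell frequencies towards the uniform distribution; the shrinkage-intensity formula follows the authors' published estimator, but the function and variable names are ours.

        import numpy as np

        def shrinkage_entropy(counts):
            """James-Stein-type shrinkage estimate of entropy (in nats).

            counts: array of observed cell counts of a discrete distribution.
            """
            counts = np.asarray(counts, dtype=float)
            n = counts.sum()
            p_ml = counts / n                                  # maximum-likelihood frequencies
            target = np.full(len(counts), 1.0 / len(counts))   # uniform shrinkage target
            # estimated optimal shrinkage intensity, clipped to [0, 1]
            var_ml = (p_ml * (1.0 - p_ml) / (n - 1.0)).sum()
            denom = ((target - p_ml) ** 2).sum()
            lam = 1.0 if denom == 0.0 else min(1.0, max(0.0, var_ml / denom))
            p_shrink = lam * target + (1.0 - lam) * p_ml
            nz = p_shrink > 0
            return -np.sum(p_shrink[nz] * np.log(p_shrink[nz]))

    Mutual information then follows from three such entropy estimates, via I(X; Y) = H(X) + H(Y) - H(X, Y).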

    Massively-Parallel Feature Selection for Big Data

    We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data settings (high dimensionality and/or sample size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix both in terms of rows (samples, training examples) and columns (features). By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP manages to rely only on computations local to a partition while minimizing communication costs. Then, it employs powerful and safe (asymptotically sound) heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Our empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores, while it dominates other competitive algorithms in its class.
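    As a sketch of the meta-analysis ingredient, the snippet below combines per-partition p-values of a conditional independence test and applies an Early Dropping filter. Fisher's method is one standard combiner and stands in for whatever rule the paper actually specifies; the function names and the alpha threshold are our assumptions.

        import numpy as np
        from scipy.stats import chi2

        def fisher_combine(pvals):
            """Combine independent per-partition p-values via Fisher's method."""
            pvals = np.clip(np.asarray(pvals, dtype=float), 1e-300, 1.0)
            stat = -2.0 * np.log(pvals).sum()       # ~ chi2 with 2k dof under H0
            return chi2.sf(stat, df=2 * len(pvals))

        def early_dropping(candidates, partition_pvals, alpha=0.05):
            """Drop features whose combined p-value indicates conditional independence."""
            return [f for f in candidates
                    if fisher_combine(partition_pvals[f]) <= alpha]

    Each worker only ships a handful of p-values per feature instead of raw data, which is what keeps communication costs low.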

    Microarray analysis of autoimmune diseases by machine learning procedures

    Microarray-based global gene expression profiling, with the use of sophisticated statistical algorithms, is providing new insights into the pathogenesis of autoimmune diseases. We have applied a novel statistical technique for gene selection, based on machine learning approaches, to analyze microarray expression data gathered from patients with systemic lupus erythematosus (SLE) and primary antiphospholipid syndrome (PAPS), two autoimmune diseases of unknown genetic origin that share many common features. The methodology included a combination of three data discretization policies, a consensus gene selection method, and a multivariate correlation measurement. A set of 150 genes was found to discriminate SLE and PAPS patients from healthy individuals. Statistical validations demonstrate the relevance of this gene set from both a univariate and a multivariate perspective. Moreover, functional characterization of these genes identified an interferon-regulated gene signature, consistent with previous reports. It also revealed the existence of other regulatory pathways, including those regulated by PTEN, TNF, and BCL-2, which are altered in SLE and PAPS. Remarkably, a significant number of these genes carry E2F binding motifs in their promoters, suggesting a role for E2F in the regulation of autoimmunity.
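    As an illustration of one methodological ingredient, here is a minimal sketch of equal-frequency discretization, a common policy of the kind the study combines (the paper's three specific policies are not reproduced here, and the function name is ours):

        import numpy as np

        def equal_frequency_discretize(x, k=3):
            """Map a continuous expression profile to k equal-frequency bins (labels 0..k-1)."""
            cuts = np.quantile(x, np.linspace(0.0, 1.0, k + 1)[1:-1])  # k-1 interior cut points
            return np.digitize(x, cuts)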

    A Bayesian network approach to feature selection in mass spectrometry data

    One of the key goals of current cancer research is the identification of biologic molecules that allow non-invasive detection of existing cancers or cancer precursors. One way to begin this process of biomarker discovery is to use time-of-flight mass spectroscopy to identify proteins or other molecules in tissue or serum that correlate with certain cancers. However, there are many difficulties associated with the output of such experiments: the distribution of protein abundances in a population is unknown, the mass spectroscopy measurements have high variability, and high correlations between variables cause problems for popular data mining methods. To mitigate these issues, Bayesian inductive methods, combined with non-model-dependent information-theoretic scoring, are used to find feature sets and build classifiers for mass spectroscopy data from blood serum. Such methods show improvement over existing measures and naturally incorporate measurement uncertainties. The resulting Bayesian network models are applied to three blood serum data sets: one artificially generated, one from a 2004 leukemia study, and another from a 2007 prostate cancer study. The feature sets obtained appear to show sufficient stability under cross-validation to provide not only biomarker candidates but also families of features for further biochemical analysis.
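    For flavor, here is a minimal sketch of a non-model-dependent information-theoretic score of the kind described: the plug-in mutual information between a discretized feature and the class label. This is a hypothetical stand-in for the paper's exact scoring function.

        import numpy as np
        from collections import Counter

        def mutual_information(feature, labels):
            """Plug-in mutual information (in nats) between two discrete sequences."""
            n = len(feature)
            c_xy = Counter(zip(feature, labels))
            c_x = Counter(feature)
            c_y = Counter(labels)
            mi = 0.0
            for (x, y), c in c_xy.items():
                # p(x, y) * log[ p(x, y) / (p(x) p(y)) ]
                mi += (c / n) * np.log(c * n / (c_x[x] * c_y[y]))
            return mi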

    Experiment-Based Validation and Uncertainty Quantification of Partitioned Models: Improving Predictive Capability of Multi-Scale Plasticity Models

    Partitioned analysis involves coupling of constituent models that resolve their own scales or physics by exchanging inputs and outputs in an iterative manner. Through partitioning, simulations of complex physical systems are becoming ever more present in scientific modeling, making Verification and Validation of partitioned models, for the purpose of quantifying the predictive capability of their simulations, increasingly important. Parameterization of the constituent models as well as the coupling interface requires a significant amount of information about the system, which is often imprecisely known. Consequently, uncertainties as well as biases in constituent models and their interface lead to concerns about the accumulation and compensation of these uncertainties and errors during the iterative procedures of partitioned analysis. Furthermore, partitioned analysis relies on the availability of reliable constituent models for each component of a system. When a constituent is unavailable, assumptions must be made to represent the coupling relationship, often through uncertain parameters that are then calibrated. This dissertation contributes to the field of computational modeling by presenting novel methods that take advantage of the transparency of partitioned analysis to compare constituent models with separate-effect experiments (measurements contained to the constituent domain) and coupled models with integral-effect experiments (measurements capturing behavior of the full system). The methods developed herein focus on these two types of experiments, seeking to maximize the information that can be gained from each, thus progressing our capability to assess and improve the predictive capability of partitioned models through inverse analysis. The importance of this study stems from the need to make coupled models available for widespread use for predicting the behavior of complex systems with confidence to support decision-making in high-risk scenarios. Methods proposed herein address the challenges currently limiting the predictive capability of coupled models through a focused analysis with available experiments. Bias-corrected partitioned analysis takes advantage of separate-effect experiments to reduce parametric uncertainty and quantify systematic bias at the constituent level, followed by an integration of bias correction into the coupling framework, thus ‘correcting’ the constituent model during coupling iterations and preventing the accumulation of errors in the final predictions. Model bias is the result of assumptions made in the modeling process, often due to a lack of understanding of the underlying physics. Such is the case when a constituent model of a system component is entirely unavailable and cannot be developed due to lack of knowledge. However, if this constituent model were available and coupled to existing models of the other system components, bias in the coupled system would be reduced. This dissertation proposes a novel statistical inference method for developing empirical constituent models, where integral-effect experiments are used to infer relationships missing from system models. Thus, the proposed inverse analysis may be implemented to infer underlying coupled relationships, not only improving the predictive capability of models by producing empirical constituents to allow for coupling, but also advancing our fundamental understanding of dependencies in the coupled system.
    Throughout this dissertation, the applicability and feasibility of the proposed methods are demonstrated with advanced multi-scale and multi-physics material models simulating complex material behaviors under extreme loading conditions, thus specifically contributing advancements to the material modeling community.
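    As a schematic of the bias-corrected coupling idea, here is a minimal sketch under our own notation: f and g are two constituent models exchanging inputs and outputs, and delta is a bias-correction term (in practice inferred from separate-effect experiments) applied inside the coupling loop rather than to the final prediction. All names are illustrative assumptions, not code from the dissertation.

        def coupled_solve(f, g, delta, u0, tol=1e-8, max_iter=100):
            """Fixed-point coupling iteration with in-loop bias correction.

            f: constituent 1, maps v -> u; g: constituent 2, maps u -> v
            delta: learned correction for constituent 1's systematic bias
            """
            u = u0
            for _ in range(max_iter):
                v = g(u)
                u_new = f(v) + delta(v)   # correct the constituent during coupling
                if abs(u_new - u) < tol:  # converged coupled state
                    return u_new, v
                u = u_new
            raise RuntimeError("coupling iterations did not converge")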

    Causal relationships between frequency bands of extracellular signals in visual cortex revealed by an information theoretic analysis

    Characterizing how different cortical rhythms interact, and how their interaction changes with sensory stimulation, is important for gathering insights into how these rhythms are generated and what sensory function they may play. Concepts from information theory, such as Transfer Entropy (TE), offer principled ways to quantify the amount of causation between different frequency bands of the signal recorded from extracellular electrodes; yet these techniques are hard to apply to real data. To address these issues, in this study we develop a method to compute, fast and reliably, the amount of TE from experimental time series of extracellular potentials. The method consists of efficiently adapting the calculation of TE to analog signals and providing appropriate sampling-bias corrections. We then used this method to quantify the strength and significance of causal interactions between frequency bands of field potentials and spikes recorded from the primary visual cortex of anaesthetized macaques, both during spontaneous activity and during binocular presentation of naturalistic color movies. Causal interactions between different frequency bands were prominent when considering the signals at a fine (ms) temporal resolution, and occurred with a very short (ms-scale) delay. The interactions were much less prominent and significant at coarser temporal resolutions. At high temporal resolution, we found strong bidirectional causal interactions between gamma-band (40–100 Hz) and slower field potentials when considering signals recorded within a distance of 2 mm. The interactions involving gamma-band signals were stronger during movie presentation than in the absence of stimuli, suggesting a strong role of the gamma cycle in processing naturalistic stimuli. Moreover, the phase of gamma oscillations played a stronger role than their amplitude in increasing causation with slower field potentials and spikes during stimulation. The dominant direction of causality ran from MUA or gamma-frequency-band signals to lower-frequency signals, suggesting that hierarchical correlations between lower- and higher-frequency cortical rhythms originate from the faster rhythms.
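    As a rough sketch of the underlying quantity, the snippet below computes a plug-in transfer entropy TE(X -> Y) with one-step histories on quantile-discretized signals; the paper's efficient analog-signal adaptation and sampling-bias corrections are not reproduced here, and all names are ours.

        import numpy as np
        from collections import Counter

        def transfer_entropy(x, y, bins=8):
            """Plug-in TE(X -> Y) in nats, one-step histories, quantile binning."""
            xd = np.digitize(x, np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1]))
            yd = np.digitize(y, np.quantile(y, np.linspace(0, 1, bins + 1)[1:-1]))
            y_next, y_past, x_past = yd[1:], yd[:-1], xd[:-1]
            n = len(y_next)
            c_xyz = Counter(zip(y_next, y_past, x_past))
            c_yz = Counter(zip(y_past, x_past))
            c_xy = Counter(zip(y_next, y_past))
            c_y = Counter(y_past)
            te = 0.0
            for (yn, yp, xp), c in c_xyz.items():
                # p(yn | yp, xp) / p(yn | yp), estimated from joint counts
                te += (c / n) * np.log((c * c_y[yp]) / (c_yz[(yp, xp)] * c_xy[(yn, yp)]))
            return te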

    Design of Evolutionary Methods Applied to the Learning of Bayesian Network Structures

    Bayesian Network, Ahmed Rebai (Ed.), ISBN: 978-953-307-124-4, pp. 13-38
