4,427 research outputs found

    Parallel Implementation of Efficient Search Schemes for the Inference of Cancer Progression Models

    Full text link
    The emergence and development of cancer is a consequence of the accumulation over time of genomic mutations involving a specific set of genes, which provides the cancer clones with a functional selective advantage. In this work, we model the order of accumulation of such mutations during the progression, which eventually leads to the disease, by means of probabilistic graphic models, i.e., Bayesian Networks (BNs). We investigate how to perform the task of learning the structure of such BNs, according to experimental evidence, adopting a global optimization meta-heuristics. In particular, in this work we rely on Genetic Algorithms, and to strongly reduce the execution time of the inference -- which can also involve multiple repetitions to collect statistically significant assessments of the data -- we distribute the calculations using both multi-threading and a multi-node architecture. The results show that our approach is characterized by good accuracy and specificity; we also demonstrate its feasibility, thanks to a 84x reduction of the overall execution time with respect to a traditional sequential implementation

    Efficient computational strategies to learn the structure of probabilistic graphical models of cumulative phenomena

    Full text link
    Structural learning of Bayesian Networks (BNs) is a NP-hard problem, which is further complicated by many theoretical issues, such as the I-equivalence among different structures. In this work, we focus on a specific subclass of BNs, named Suppes-Bayes Causal Networks (SBCNs), which include specific structural constraints based on Suppes' probabilistic causation to efficiently model cumulative phenomena. Here we compare the performance, via extensive simulations, of various state-of-the-art search strategies, such as local search techniques and Genetic Algorithms, as well as of distinct regularization methods. The assessment is performed on a large number of simulated datasets from topologies with distinct levels of complexity, various sample size and different rates of errors in the data. Among the main results, we show that the introduction of Suppes' constraints dramatically improve the inference accuracy, by reducing the solution space and providing a temporal ordering on the variables. We also report on trade-offs among different search techniques that can be efficiently employed in distinct experimental settings. This manuscript is an extended version of the paper "Structural Learning of Probabilistic Graphical Models of Cumulative Phenomena" presented at the 2018 International Conference on Computational Science

    A Profile Likelihood Analysis of the Constrained MSSM with Genetic Algorithms

    Full text link
    The Constrained Minimal Supersymmetric Standard Model (CMSSM) is one of the simplest and most widely-studied supersymmetric extensions to the standard model of particle physics. Nevertheless, current data do not sufficiently constrain the model parameters in a way completely independent of priors, statistical measures and scanning techniques. We present a new technique for scanning supersymmetric parameter spaces, optimised for frequentist profile likelihood analyses and based on Genetic Algorithms. We apply this technique to the CMSSM, taking into account existing collider and cosmological data in our global fit. We compare our method to the MultiNest algorithm, an efficient Bayesian technique, paying particular attention to the best-fit points and implications for particle masses at the LHC and dark matter searches. Our global best-fit point lies in the focus point region. We find many high-likelihood points in both the stau co-annihilation and focus point regions, including a previously neglected section of the co-annihilation region at large m_0. We show that there are many high-likelihood points in the CMSSM parameter space commonly missed by existing scanning techniques, especially at high masses. This has a significant influence on the derived confidence regions for parameters and observables, and can dramatically change the entire statistical inference of such scans.Comment: 47 pages, 8 figures; Fig. 8, Table 7 and more discussions added to Sec. 3.4.2 in response to referee's comments; accepted for publication in JHE

    Automating biomedical data science through tree-based pipeline optimization

    Full text link
    Over the past decade, data science and machine learning has grown from a mysterious art form to a staple tool across a variety of fields in academia, business, and government. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning---pipeline design. We implement a Tree-based Pipeline Optimization Tool (TPOT) and demonstrate its effectiveness on a series of simulated and real-world genetic data sets. In particular, we show that TPOT can build machine learning pipelines that achieve competitive classification accuracy and discover novel pipeline operators---such as synthetic feature constructors---that significantly improve classification accuracy on these data sets. We also highlight the current challenges to pipeline optimization, such as the tendency to produce pipelines that overfit the data, and suggest future research paths to overcome these challenges. As such, this work represents an early step toward fully automating machine learning pipeline design.Comment: 16 pages, 5 figures, to appear in EvoBIO 2016 proceeding

    The Optimisation of Stochastic Grammars to Enable Cost-Effective Probabilistic Structural Testing

    Get PDF
    The effectiveness of probabilistic structural testing depends on the characteristics of the probability distribution from which test inputs are sampled at random. Metaheuristic search has been shown to be a practical method of optimis- ing the characteristics of such distributions. However, the applicability of the existing search-based algorithm is lim- ited by the requirement that the software’s inputs must be a fixed number of numeric values. In this paper we relax this limitation by means of a new representation for the probability distribution. The repre- sentation is based on stochastic context-free grammars but incorporates two novel extensions: conditional production weights and the aggregation of terminal symbols represent- ing numeric values. We demonstrate that an algorithm which combines the new representation with hill-climbing search is able to effi- ciently derive probability distributions suitable for testing software with structurally-complex input domains
    corecore