    BOOL-AN: A method for comparative sequence analysis and phylogenetic reconstruction

    A novel discrete mathematical approach is proposed as an additional tool for molecular systematics that does not require prior statistical assumptions about the evolutionary process. The method is based on algorithms that generate mathematical representations directly from DNA/RNA or protein sequences and then output numerical (scalar or vector) and visual (graph) characteristics. The binary-encoded sequence information is transformed into a compact analytical form, called the Iterative Canonical Form (ICF) of Boolean functions, which can then be used as a generalized molecular descriptor. The method provides raw vector data for calculating different distance matrices, which in turn can be analyzed by neighbor-joining or UPGMA to derive a phylogenetic tree, or by principal coordinates analysis to obtain an ordination scattergram. The new method and the associated software for inferring phylogenetic trees are called Boolean analysis, or BOOL-AN.
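
    The abstract does not reproduce the ICF construction itself, but the pipeline around it can be sketched: binary-encode the sequences, compute a distance matrix over the binary descriptors, and cluster with UPGMA. In the hedged Python sketch below, the 2-bit nucleotide encoding and the Hamming distance are illustrative assumptions standing in for BOOL-AN's actual descriptor.

        # A minimal sketch, assuming a 2-bit encoding and Hamming distance
        # in place of the ICF descriptor described in the abstract.
        import numpy as np
        from scipy.cluster.hierarchy import linkage
        from scipy.spatial.distance import squareform

        BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}

        def encode(seq):
            """Flatten a DNA string into a binary descriptor vector."""
            return np.array([bit for base in seq for bit in BITS[base]])

        seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTGTACGA"}
        names = list(seqs)
        vecs = np.array([encode(seqs[n]) for n in names])

        # Pairwise normalized Hamming distances between binary descriptors.
        m = len(names)
        D = np.zeros((m, m))
        for i in range(m):
            for j in range(i + 1, m):
                D[i, j] = D[j, i] = np.mean(vecs[i] != vecs[j])

        # UPGMA is average-linkage hierarchical clustering on the distances.
        tree = linkage(squareform(D), method="average")
        print(tree)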

    Learning and Designing Stochastic Processes from Logical Constraints

    Stochastic processes offer a flexible mathematical formalism for modelling and reasoning about systems. Most analysis tools, however, start from the premise that models are fully specified, so that any parameters controlling the system's dynamics must be known exactly. As this is seldom the case, many methods have been devised over the last decade to infer (learn) such parameters from observations of the state of the system. In this paper, we depart from this approach by assuming that our observations are qualitative properties encoded as the satisfaction of linear temporal logic formulae, as opposed to quantitative observations of the state of the system. An important feature of this approach is that it naturally unifies the system identification and system design problems, where the properties, instead of observations, represent requirements to be satisfied. We develop a principled statistical estimation procedure based on maximising the likelihood of the system's parameters, using recent ideas from statistical machine learning. We demonstrate the efficacy and broad applicability of our method on a range of simple but non-trivial examples, including rumour spreading in social networks and hybrid models of gene regulation.
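
    One way to see the estimation principle: each experiment reports only whether a temporal-logic property held, so the likelihood of a parameter is Bernoulli in the (unknown) satisfaction probability, which can in turn be estimated by simulation. The Python sketch below uses a toy rumour-spreading model and a crude Monte Carlo grid search; the paper's actual machinery is more sophisticated, and the model, property, and observation counts here are all illustrative assumptions.

        # A minimal sketch, assuming a toy rumour model and grid-search MLE.
        import math
        import random

        def rumour_spread(rate, n_agents=20, t_max=10.0):
            """One Gillespie run of an assumed rumour-spreading model."""
            informed, t = 1, 0.0
            while informed < n_agents:
                total_rate = rate * informed * (n_agents - informed)
                t += random.expovariate(total_rate)
                if t > t_max:
                    break
                informed += 1
            return informed == n_agents  # property: "all informed by t_max"

        def log_lik(rate, k=7, m=10, runs=1000):
            """Bernoulli log-likelihood of k of m observed satisfactions."""
            p = sum(rumour_spread(rate) for _ in range(runs)) / runs
            p = min(max(p, 1e-9), 1 - 1e-9)  # guard the log at 0 and 1
            return k * math.log(p) + (m - k) * math.log(1 - p)

        # Crude grid-search maximum-likelihood estimate of the rate.
        best_rate = max((r / 100 for r in range(1, 100)), key=log_lik)
        print(best_rate)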

    Numeric Input Relations for Relational Learning with Applications to Community Structure Analysis

    Most work in the area of statistical relational learning (SRL) is focused on discrete data, even though a few approaches for hybrid SRL models that combine numerical and discrete variables have been proposed. In this paper we distinguish numerical random variables, for which a probability distribution is defined by the model, from numerical input variables that are only used for conditioning the distribution of discrete response variables. We show how numerical input relations can easily be used in the Relational Bayesian Network (RBN) framework, and that existing inference and learning methods need only minor adjustments to be applied in this generalized setting. The resulting framework provides natural relational extensions of classical probabilistic models for categorical data. We demonstrate the usefulness of RBN models with numeric input relations through several examples. In particular, we use the augmented RBN framework to define probabilistic models for multi-relational (social) networks in which the probability of a link between two nodes depends on numeric latent feature vectors associated with the nodes. A generic learning procedure can be used to obtain a maximum-likelihood fit of model parameters and latent feature values for a variety of models that can be expressed in the high-level RBN representation. Specifically, we propose a model that allows us to interpret learned latent feature values as community centrality degrees, by which we can identify nodes that are central to one community, nodes that are hubs between communities, and nodes that are isolated. In a multi-relational setting, the model also provides a characterization of how different relations are associated with each community.
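
    The latent-feature link model admits a compact sketch outside the RBN framework: the probability of a link is a logistic function of the two nodes' latent vectors, fit by gradient ascent on the likelihood. In the Python sketch below, the dot-product link function, the dimensions, the learning rate, and the synthetic adjacency data are all assumptions, not the paper's RBN specification.

        # A minimal sketch, assuming a logistic dot-product link model.
        import numpy as np

        rng = np.random.default_rng(0)
        n_nodes, dim = 30, 2
        A = rng.integers(0, 2, size=(n_nodes, n_nodes))  # toy link data
        A = np.triu(A, 1)
        A = A + A.T                                      # symmetric, no self-links

        Z = rng.normal(scale=0.1, size=(n_nodes, dim))   # latent feature vectors

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        for _ in range(500):                 # gradient ascent on the log-likelihood
            P = sigmoid(Z @ Z.T)             # modelled link probabilities
            R = A - P
            np.fill_diagonal(R, 0)           # ignore self-pairs
            Z += 0.01 * (R @ Z)

        # Large components of Z play the role of community centrality degrees.
        print(Z.round(2))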

    Computational Hardness of Certifying Bounds on Constrained PCA Problems

    Given a random n×n symmetric matrix W drawn from the Gaussian orthogonal ensemble (GOE), we consider the problem of certifying an upper bound on the maximum value of the quadratic form x⊤Wx over all vectors x in a constraint set S ⊂ ℝⁿ. For a certain class of normalized constraint sets S we show that, conditional on certain complexity-theoretic assumptions, there is no polynomial-time algorithm certifying a better upper bound than the largest eigenvalue of W. A notable special case included in our results is the hypercube S = {±1/√n}ⁿ, which corresponds to the problem of certifying bounds on the Hamiltonian of the Sherrington-Kirkpatrick spin glass model from statistical physics. Our proof proceeds in two steps. First, we give a reduction from the detection problem in the negatively-spiked Wishart model to the above certification problem. We then give evidence that this Wishart detection problem is computationally hard below the classical spectral threshold, by showing that no low-degree polynomial can (in expectation) distinguish the spiked and unspiked models. This method for identifying computational thresholds was proposed in a sequence of recent works on the sum-of-squares hierarchy, and is believed to be correct for a large class of problems. Our proof can be seen as constructing a distribution over symmetric matrices that appears computationally indistinguishable from the GOE, yet is supported on matrices whose maximum quadratic form over x ∈ S is much larger than that of a GOE matrix.
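
    The objects in the abstract are easy to instantiate numerically: sample a normalized GOE matrix, take its largest eigenvalue as the polynomial-time spectral certificate, and compare it with a heuristic lower bound on the hypercube maximum obtained by sign-rounding the top eigenvector. In the Python sketch below, the rounding step is a standard heuristic assumed for illustration, not part of the paper's reduction.

        # A minimal sketch of the certificate-vs-optimum gap on the hypercube.
        import numpy as np

        rng = np.random.default_rng(0)
        n = 500
        G = rng.normal(size=(n, n))
        W = (G + G.T) / np.sqrt(2 * n)        # GOE scaled so lambda_max -> 2

        evals, evecs = np.linalg.eigh(W)
        lam_max = evals[-1]                   # the spectral certificate

        x = np.sign(evecs[:, -1]) / np.sqrt(n)  # round into S = {+-1/sqrt(n)}^n
        lower = x @ W @ x                        # value achieved on the hypercube

        # The certificate (~2) exceeds the true hypercube optimum (~1.526,
        # the SK ground-state energy); the hardness result says no
        # polynomial-time certifier can close this gap.
        print(f"certificate = {lam_max:.3f}, rounded value = {lower:.3f}")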

    Unconventional machine learning of genome-wide human cancer data

    Recent advances in high-throughput genomic technologies, coupled with exponential increases in computer processing power and memory, have allowed us to interrogate the complex aberrant molecular underpinnings of human disease from a genome-wide perspective. While the deluge of genomic information is expected to increase, a bottleneck in conventional high-performance computing is rapidly approaching. Inspired in part by recent advances in physical quantum processors, we evaluated several unconventional machine learning (ML) strategies on actual human tumor data. Here we show for the first time the efficacy of multiple annealing-based ML algorithms for classification of high-dimensional, multi-omics human cancer data from The Cancer Genome Atlas. To assess algorithm performance, we compared these classifiers to a variety of standard ML methods. Our results indicate the feasibility of using annealing-based ML to provide competitive classification of human cancer types and associated molecular subtypes, as well as superior performance with smaller training datasets, thus providing compelling empirical evidence for the potential future application of unconventional computing architectures in the biomedical sciences.
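
    The annealing-based training idea can be illustrated classically: treat the classifier's training error as an energy and minimize it with simulated annealing. In the Python sketch below, the linear classifier, the synthetic data, the energy function, and the cooling schedule are all illustrative assumptions; the paper itself evaluates physical annealing hardware on TCGA multi-omics data.

        # A minimal sketch, assuming simulated annealing over linear weights.
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 20))                    # toy "omics" features
        y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=100))

        def energy(w):
            """Training error: the quantity the annealer minimizes."""
            return np.sum(np.sign(X @ w) != y)

        w = rng.normal(size=20)
        T = 1.0
        for _ in range(5000):
            cand = w + rng.normal(scale=0.1, size=20)     # nearby candidate
            dE = energy(cand) - energy(w)
            if dE <= 0 or rng.random() < np.exp(-dE / T):
                w = cand                                  # Metropolis acceptance
            T *= 0.999                                    # geometric cooling

        print("training errors:", energy(w), "of", len(y))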