
    Non-Negative Matrix Factorization for Learning Alignment-Specific Models of Protein Evolution

    Models of protein evolution currently come in two flavors: generalist and specialist. Generalist models (e.g. PAM, JTT, WAG) adopt a one-size-fits-all approach, where a single model is estimated from a number of different protein alignments. Specialist models (e.g. mtREV, rtREV, HIVbetween) can be estimated when a large quantity of data is available for a single organism or gene, and are intended for use on that organism or gene only. Unsurprisingly, specialist models outperform generalist models, but in most instances there simply are not enough data available to estimate them. We propose a method for estimating alignment-specific models of protein evolution in which the complexity of the model is adapted to suit the richness of the data. Our method uses non-negative matrix factorization (NNMF) to learn a set of basis matrices from a general dataset containing a large number of alignments of different proteins, thus capturing the dimensions of important variation. It then learns a set of weights that are specific to the organism or gene of interest and for which only a smaller dataset is available. Thus the alignment-specific model is obtained as a weighted sum of the basis matrices. Having been constrained to vary along only as many dimensions as the data justify, the model has far fewer parameters than would be required to estimate a specialist model. We show that our NNMF procedure produces models that outperform existing methods on all but one of 50 test alignments. The basis matrices we obtain confirm the expectation that amino acid properties tend to be conserved, and allow us to quantify, on specific alignments, how the strength of conservation varies across different properties. We also apply our new models to phylogeny inference and show that the resulting phylogenies are different from, and have improved likelihood over, those inferred under standard models.
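
    The core construction — basis matrices learned by NNMF from a general dataset, then per-alignment non-negative weights — can be sketched in a few lines. The toy below only illustrates that structure, assuming flattened 20x20 empirical matrices as input and using non-negative least squares for the weights; the paper itself estimates the weights by likelihood on the alignment of interest, and all data here are synthetic.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Stand-in "general dataset": empirical 20x20 amino-acid matrices from many
# alignments, each flattened into a row (synthetic values here).
general = np.abs(rng.standard_normal((200, 400)))

# Learn k non-negative basis matrices capturing the main dimensions of variation.
k = 5
nmf = NMF(n_components=k, init="nndsvda", max_iter=1000, random_state=0)
nmf.fit(general)
basis = nmf.components_                 # shape (k, 400): k flattened basis matrices

# For the alignment of interest, fit only k non-negative weights and build the
# alignment-specific model as the weighted sum of the basis matrices.
target = np.abs(rng.standard_normal(400))      # its flattened empirical matrix
weights, _ = nnls(basis.T, target)             # k weights >= 0
alignment_specific = (weights @ basis).reshape(20, 20)
print(weights.round(3))
```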

    Testing product states, quantum Merlin-Arthur games and tensor optimisation

    We give a test that can distinguish efficiently between product states of n quantum systems and states which are far from product. If applied to a state psi whose maximum overlap with a product state is 1-epsilon, the test passes with probability 1-Theta(epsilon), regardless of n or the local dimensions of the individual systems. The test uses two copies of psi. We prove correctness of this test as a special case of a more general result regarding stability of maximum output purity of the depolarising channel. A key application of the test is to quantum Merlin-Arthur games with multiple Merlins, where we obtain several structural results that had been previously conjectured, including the fact that efficient soundness amplification is possible and that two Merlins can simulate many Merlins: QMA(k)=QMA(2) for k>=2. Building on a previous result of Aaronson et al., this implies that there is an efficient quantum algorithm to verify 3-SAT with constant soundness, given two unentangled proofs of O(sqrt(n) polylog(n)) qubits. We also show how QMA(2) with log-sized proofs is equivalent to a large number of problems, some related to quantum information (such as testing separability of mixed states) as well as problems without any apparent connection to quantum mechanics (such as computing injective tensor norms of 3-index tensors). As a consequence, we obtain many hardness-of-approximation results, as well as potential algorithmic applications of methods for approximating QMA(2) acceptance probabilities. Finally, our test can also be used to construct an efficient test for determining whether a unitary operator is a tensor product, which is a generalisation of classical linearity testing. Comment: 44 pages, 1 figure, 7 appendices; v6: added references, rearranged sections, added discussion of connections to classical CS. Final version to appear in the Journal of the ACM.
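
    As a rough illustration of the test itself (two copies of psi, with each pair of corresponding subsystems projected onto its symmetric subspace, i.e. a swap test per pair), here is a small numpy sketch; the function name and the example states are mine, not from the paper.

```python
import numpy as np

def product_test_accept_prob(psi, dims):
    """Acceptance probability of the product test on two copies of psi.

    psi is a pure state on n subsystems with local dimensions dims; the test
    projects each pair of corresponding subsystems of the two copies onto its
    symmetric subspace (i.e. runs a swap test on every pair) and accepts if
    all pairs pass.
    """
    n = len(dims)
    psi = np.asarray(psi, dtype=complex).reshape(dims)
    two_copies = np.tensordot(psi, psi, axes=0)        # axes A_1..A_n, B_1..B_n
    projected = two_copies.copy()
    for i in range(n):                                 # commuting projectors (I + SWAP_i)/2
        projected = 0.5 * (projected + np.swapaxes(projected, i, n + i))
    return float(np.real(np.vdot(two_copies, projected)))

# A product state |0>|1>|0> passes with certainty ...
prod = np.kron([1, 0], np.kron([0, 1], [1, 0]))
print(product_test_accept_prob(prod, (2, 2, 2)))       # 1.0

# ... while a 3-qubit GHZ state, far from product, is rejected with constant probability.
ghz = np.zeros(8)
ghz[0] = ghz[7] = 1 / np.sqrt(2)
print(product_test_accept_prob(ghz, (2, 2, 2)))        # 0.625
```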

    Testing probability distributions underlying aggregated data

    In this paper, we analyze and study a hybrid model for testing and learning probability distributions. Here, in addition to samples, the testing algorithm is provided with one of two different types of oracles to the unknown distribution D over [n]. More precisely, we define both the dual and cumulative dual access models, in which the algorithm A can both sample from D and, respectively, for any i in [n], (i) query the probability mass D(i) (query access), or (ii) get the total mass of {1,...,i}, i.e. sum_{j=1}^i D(j) (cumulative access). These two models, by generalizing the previously studied sampling and query oracle models, allow us to bypass the strong lower bounds established for a number of problems in these settings, while capturing several interesting aspects of these problems -- and providing new insight on the limitations of the models. Finally, we show that while the testing algorithms can be in most cases strictly more efficient, some tasks remain hard even with this additional power.
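
    A minimal sketch of the three access types (sampling, probability-mass query, cumulative-mass query) as Python methods may make the model concrete; the class and method names below are illustrative, not taken from the paper.

```python
import bisect
import random

class DualOracle:
    """Toy oracle over [n] exposing the three access types in the abstract:
    sampling, probability-mass queries and cumulative-mass queries.
    (Class and method names are illustrative, not from the paper.)"""

    def __init__(self, weights, seed=0):
        total = float(sum(weights))
        self.p = [w / total for w in weights]       # D(1), ..., D(n)
        self.cum, acc = [], 0.0
        for pi in self.p:
            acc += pi
            self.cum.append(acc)                    # D(1) + ... + D(i)
        self.rng = random.Random(seed)

    def sample(self):                               # ordinary sampling access
        return bisect.bisect_left(self.cum, self.rng.random()) + 1

    def query(self, i):                             # dual access: D(i)
        return self.p[i - 1]

    def cquery(self, i):                            # cumulative dual access: sum_{j<=i} D(j)
        return self.cum[i - 1]

oracle = DualOracle([5, 1, 1, 1, 2])
print(oracle.sample(), oracle.query(1), oracle.cquery(3))   # a draw, 0.5, ~0.7
```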

    Estimating Local Function Complexity via Mixture of Gaussian Processes

    Real world data often exhibit inhomogeneity, e.g., the noise level, the sampling distribution or the complexity of the target function may change over the input space. In this paper, we try to isolate local function complexity in a practical, robust way. This is achieved by first estimating the locally optimal kernel bandwidth as a functional relationship. Specifically, we propose Spatially Adaptive Bandwidth Estimation in Regression (SABER), which employs a mixture of experts consisting of multinomial kernel logistic regression as a gate and Gaussian process regression models as experts. Using the locally optimal kernel bandwidths, we deduce an estimate of the local function complexity by drawing parallels to the theory of locally linear smoothing. We demonstrate the usefulness of local function complexity for model interpretation and active learning in quantum chemistry experiments and fluid dynamics simulations. Comment: 19 pages, 16 figures.
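
    A very rough, non-authoritative sketch of the mixture-of-experts idea follows: GP experts with fixed candidate bandwidths, a plain multinomial logistic gate trained on which expert fits best locally (the paper uses a kernel logistic gate and a principled training procedure), and a local bandwidth read off as the gate-weighted average. All data and parameter choices below are made up for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 300)[:, None]
y = np.sin(4 * np.pi * X[:, 0] ** 2) + 0.1 * rng.standard_normal(300)   # frequency grows with x

# Experts: GP regressors with fixed candidate bandwidths (RBF length scales).
bandwidths = np.array([0.02, 0.05, 0.1, 0.2])
cv_residuals = []
for h in bandwidths:
    gp = GaussianProcessRegressor(RBF(h, "fixed") + WhiteKernel(0.01, "fixed"))
    pred = cross_val_predict(gp, X, y, cv=5)          # out-of-fold predictions
    cv_residuals.append(np.abs(pred - y))

# Gate: which bandwidth fits best at each point, learned with plain multinomial
# logistic regression (a crude stand-in for the kernel logistic gate).
best = np.argmin(np.column_stack(cv_residuals), axis=1)
gate = LogisticRegression(max_iter=2000).fit(X, best)

# Local bandwidth = gate-weighted average of the candidates; shorter bandwidths
# flag regions of higher local function complexity.
local_h = gate.predict_proba(X) @ bandwidths[gate.classes_]
order = np.argsort(X[:, 0])
print(local_h[order][:5].round(3), local_h[order][-5:].round(3))
```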

    On Counting Triangles through Edge Sampling in Large Dynamic Graphs

    Traditional frameworks for dynamic graphs have relied on processing only the stream of edges added into or deleted from an evolving graph, but not any additional related information such as the degrees or neighbor lists of nodes incident to the edges. In this paper, we propose a new edge sampling framework for big-graph analytics in dynamic graphs which enhances the traditional model by enabling the use of additional related information. To demonstrate the advantages of this framework, we present a new sampling algorithm, called Edge Sample and Discard (ESD). It generates an unbiased estimate of the total number of triangles, which can be continuously updated in response to both edge additions and deletions. We provide a comparative analysis of the performance of ESD against two current state-of-the-art algorithms in terms of accuracy and complexity. The results of the experiments performed on real graphs show that, with the help of the neighborhood information of the sampled edges, the accuracy achieved by our algorithm is substantially better. We also characterize the impact of properties of the graph on the performance of our algorithm by testing on several Barabasi-Albert graphs. Comment: A short version of this article appeared in Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2017).
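
    The flavor of such an edge-sampling estimator, for the insertion-only case, can be sketched as follows; this is a simplified stand-in rather than the ESD algorithm itself (ESD also handles deletions), and the function name and parameters are mine. Each sampled edge consults the current neighbor lists — exactly the kind of extra information the proposed framework allows.

```python
import random
from collections import defaultdict

def estimate_triangles(edge_stream, p=0.1, seed=0):
    """Unbiased triangle-count estimate from an insertion-only edge stream.

    Each arriving edge is examined with probability p; for an examined edge
    (u, v) we query the current neighbor lists and count the triangles this
    edge closes.  Every triangle is closed by exactly its last edge, so the
    1/p scaling makes the expectation equal to the true triangle count.
    """
    rng = random.Random(seed)
    adj = defaultdict(set)
    estimate = 0.0
    for u, v in edge_stream:
        if rng.random() < p:
            estimate += len(adj[u] & adj[v]) / p   # triangles closed by (u, v)
        adj[u].add(v)
        adj[v].add(u)
    return estimate

# Toy check on a small clique: K_5 has C(5,3) = 10 triangles.
edges = [(i, j) for i in range(5) for j in range(i + 1, 5)]
print(estimate_triangles(edges, p=1.0))   # exact count: 10.0
```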

    Adaptive Compressed Sensing for Support Recovery of Structured Sparse Sets

    This paper investigates the problem of recovering the support of structured signals via adaptive compressive sensing. We examine several classes of structured support sets, and characterize the fundamental limits of accurately recovering such sets through compressive measurements, while simultaneously providing adaptive support recovery protocols that perform near optimally for these classes. We show that by adaptively designing the sensing matrix we can attain significant performance gains over non-adaptive protocols. These gains arise from the fact that adaptive sensing can: (i) better mitigate the effects of noise, and (ii) better capitalize on the structure of the support sets. Comment: to appear in IEEE Transactions on Information Theory.
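
    As a toy illustration of why adaptively chosen measurements help, the sketch below locates a single nonzero entry by bisection with noisy subset-sum measurements, concentrating later measurements on an ever smaller candidate region. It is not one of the paper's protocols for structured support sets; names and parameters are assumptions.

```python
import numpy as np

def adaptive_locate(x, sigma=0.5, reps=5, rng=None):
    """Locate the single nonzero entry of x using adaptive subset measurements.

    Each measurement is y = <a, x> + N(0, sigma^2), where a is the indicator
    vector of a candidate half-interval.  We recurse into the half with the
    larger averaged response, so the sensing budget is concentrated on an
    ever smaller region -- the basic gain adaptivity offers over a fixed,
    non-adaptive sensing matrix.
    """
    rng = rng or np.random.default_rng(0)
    lo, hi = 0, len(x)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        y = []
        for a_lo, a_hi in ((lo, mid), (mid, hi)):
            a = np.zeros(len(x))
            a[a_lo:a_hi] = 1.0
            y.append(np.mean([a @ x + sigma * rng.standard_normal() for _ in range(reps)]))
        lo, hi = (lo, mid) if abs(y[0]) > abs(y[1]) else (mid, hi)
    return lo

x = np.zeros(1024)
x[317] = 3.0
print(adaptive_locate(x))   # recovers index 317 with high probability
```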

    Metrics and models to support the development of hybrid information systems

    The research described here concerns the development of metrics and models to support the development of hybrid (conventional/knowledge based) integrated systems. The thesis argues from the point that, although it is well known that estimating the cost, duration and quality of information systems is a difficult task, it is far from clear what sorts of tools and techniques would adequately support a project manager in the estimation of these properties. A literature review shows that metrics (measurements) and estimating tools have been developed for conventional systems since the 1960s, while there has been very little research on metrics for knowledge based systems (KBSs). Furthermore, although there are a number of theoretical problems with many of the "classic" metrics developed for conventional systems, it also appears that the tools which such metrics can be used to develop are not widely used by project managers. A survey of large UK companies was carried out, which confirmed this continuing state of affairs. Before any useful tools could be developed, therefore, it was important to find out why project managers were not using these tools already. By characterising those companies that use software cost estimating (SCE) tools against those which could but do not, it was possible to recognise the involvement of the client/customer in the process of estimation. Pursuing this point, a model of the early estimating and planning stages (the EEPS model) was developed to test exactly where estimating takes place. The EEPS model suggests that estimating could take place either before a fully-developed plan has been produced, or while this plan is being produced. If it were the former, then SCE tools would be particularly useful since there is very little other data available from which to produce an estimate. A second survey, however, indicated that project managers see estimating as being essentially the latter, at which point project management tools are available to support the process. It would seem, therefore, that SCE tools are not being used because project management tools are being used instead. The issue here is not with the method of developing an estimating model or tool, but in the way in which "an estimate" is intimately tied to an understanding of what tasks are being planned. Current SCE tools are perceived by project managers as targeting the wrong point of estimation. A model (called TABATHA) is then presented which describes how an estimating tool based on an analysis of tasks would fit into the planning stage. The issue of whether metrics can be usefully developed for hybrid systems (which also contain KBS components) is tested by extending a number of "classic" program size and structure metrics to a KBS language, Prolog. Measurements of lines of code, Halstead's operators/operands, McCabe's cyclomatic complexity, Henry & Kafura's data flow fan-in/out and post-release reported errors were taken for a set of 80 commercially-developed LPA Prolog programs. By redefining the metric counts for Prolog it was found that estimates of program size and error-proneness comparable to the best conventional studies are possible. This suggests that metrics can be usefully applied to KBS languages, such as Prolog, and thus that the development of metrics and models to support the development of hybrid information systems is both feasible and useful.
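
    To make the idea of redefining metric counts for Prolog concrete, here is a deliberately crude sketch that computes lines of code and a McCabe-style complexity (treating extra clauses per predicate and explicit branch operators as decision points); the counting rules and the sample program are mine, not the thesis's definitions.

```python
import re
from collections import Counter

PROLOG_SRC = """
max(X, Y, X) :- X >= Y.
max(_, Y, Y).
abs_val(X, X) :- X >= 0, !.
abs_val(X, Y) :- Y is -X.
"""

def prolog_size_metrics(src):
    """Crude size/structure counts for a Prolog source string.

    Lines of code = non-blank, non-comment lines.  A McCabe-style count is
    approximated as 1 + extra clauses per predicate + explicit branch
    operators (';' and '->'), on the view that clause selection plays the
    role branching plays in a conventional language.
    """
    lines = [l for l in src.splitlines() if l.strip() and not l.strip().startswith("%")]
    heads = Counter()
    for l in lines:
        m = re.match(r"\s*([a-z]\w*)", l)   # predicate name of the clause head
        if m:
            heads[m.group(1)] += 1
    branches = src.count(";") + src.count("->")
    cyclomatic = 1 + sum(c - 1 for c in heads.values()) + branches
    return {"loc": len(lines), "clauses": dict(heads), "cyclomatic": cyclomatic}

print(prolog_size_metrics(PROLOG_SRC))
# {'loc': 4, 'clauses': {'max': 2, 'abs_val': 2}, 'cyclomatic': 3}
```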

    The Power of an Example: Hidden Set Size Approximation Using Group Queries and Conditional Sampling

    We study a basic problem of approximating the size of an unknown set S in a known universe U. We consider two versions of the problem. In both versions the algorithm can specify subsets T ⊆ U. In the first version, which we refer to as the group query or subset query version, the algorithm is told whether T ∩ S is non-empty. In the second version, which we refer to as the subset sampling version, if T ∩ S is non-empty, then the algorithm receives a uniformly selected element from T ∩ S. We study the difference between these two versions under different conditions on the subsets that the algorithm may query/sample, both in the case that the algorithm is adaptive and in the case where it is non-adaptive. In particular we focus on a natural family of allowed subsets, which correspond to intervals, as well as variants of this family.
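
    To make the group query version concrete, here is a toy estimator that uses only emptiness queries on random subsets (not the interval-structured families studied in the paper); the function name, the doubling schedule and the constants are mine.

```python
import math
import random

def estimate_set_size(n, group_query, trials=200, rng=None):
    """Rough |S| estimate from group (emptiness) queries on random subsets.

    Doubles the query-set size m until about half of the random size-m subsets
    intersect S; at that point P[hit] ~ 1 - (1 - |S|/n)^m ~ 1/2, so
    |S| ~ n * ln(2) / m.  A toy estimator, good to a small constant factor.
    """
    rng = rng or random.Random(0)
    universe = range(n)
    m = 1
    while m <= n:
        hits = sum(group_query(set(rng.sample(universe, m))) for _ in range(trials))
        if hits / trials >= 0.5:
            return n * math.log(2) / m
        m *= 2
    return 0.0

S = set(random.Random(1).sample(range(100_000), 750))        # hidden set, |S| = 750
print(estimate_set_size(100_000, lambda T: bool(T & S)))     # within a small constant factor of 750
```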