Non-Negative Matrix Factorization for Learning Alignment-Specific Models of Protein Evolution
Models of protein evolution currently come in two flavors: generalist and specialist. Generalist models (e.g. PAM, JTT, WAG) adopt a one-size-fits-all approach, where a single model is estimated from a number of different protein alignments. Specialist models (e.g. mtREV, rtREV, HIVbetween) can be estimated when a large quantity of data is available for a single organism or gene, and are intended for use on that organism or gene only. Unsurprisingly, specialist models outperform generalist models, but in most instances there simply are not enough data available to estimate them. We propose a method for estimating alignment-specific models of protein evolution in which the complexity of the model is adapted to suit the richness of the data. Our method uses non-negative matrix factorization (NNMF) to learn a set of basis matrices from a general dataset containing a large number of alignments of different proteins, thus capturing the dimensions of important variation. It then learns a set of weights that are specific to the organism or gene of interest and for which only a smaller dataset is available. Thus the alignment-specific model is obtained as a weighted sum of the basis matrices. Having been constrained to vary along only as many dimensions as the data justify, the model has far fewer parameters than would be required to estimate a specialist model. We show that our NNMF procedure produces models that outperform existing methods on all but one of 50 test alignments. The basis matrices we obtain confirm the expectation that amino acid properties tend to be conserved, and allow us to quantify, on specific alignments, how the strength of conservation varies across different properties. We also apply our new models to phylogeny inference and show that the resulting phylogenies are different from, and have improved likelihood over, those inferred under standard models.
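To make the weighted-sum construction concrete, here is a minimal sketch, assuming hypothetical flattened 20x20 replacement matrices and using scikit-learn's NMF plus non-negative least squares in place of the paper's actual estimation procedure; all shapes and names are illustrative.

```python
# Sketch: learn basis matrices via NMF from a large general dataset, then
# fit only a few non-negative weights on the small dataset of interest.
# Hypothetical data shapes; not the paper's actual estimation procedure.
import numpy as np
from sklearn.decomposition import NMF
from scipy.optimize import nnls

rng = np.random.default_rng(0)
general = rng.random((500, 400))   # 500 alignments x flattened 20x20 matrices

# Learn k basis matrices capturing the main dimensions of variation.
k = 5
nmf = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
nmf.fit(general)
bases = nmf.components_            # k x 400; each row is one basis matrix

# For the target alignment, estimate only k weights, not a full matrix.
target = rng.random(400)           # flattened matrix of the target alignment
weights, _ = nnls(bases.T, target)
model = weights @ bases            # alignment-specific model: weighted sum
print(model.reshape(20, 20).shape)
```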
Testing product states, quantum Merlin-Arthur games and tensor optimisation
We give a test that can distinguish efficiently between product states of n
quantum systems and states which are far from product. If applied to a state
psi whose maximum overlap with a product state is 1-epsilon, the test passes
with probability 1-Theta(epsilon), regardless of n or the local dimensions of
the individual systems. The test uses two copies of psi. We prove correctness
of this test as a special case of a more general result regarding stability of
maximum output purity of the depolarising channel. A key application of the
test is to quantum Merlin-Arthur games with multiple Merlins, where we obtain
several structural results that had been previously conjectured, including the
fact that efficient soundness amplification is possible and that two Merlins
can simulate many Merlins: QMA(k)=QMA(2) for k>=2. Building on a previous
result of Aaronson et al, this implies that there is an efficient quantum
algorithm to verify 3-SAT with constant soundness, given two unentangled proofs
of O(sqrt(n) polylog(n)) qubits. We also show how QMA(2) with log-sized proofs
is equivalent to a large number of problems, some related to quantum
information (such as testing separability of mixed states) as well as problems
without any apparent connection to quantum mechanics (such as computing
injective tensor norms of 3-index tensors). As a consequence, we obtain many
hardness-of-approximation results, as well as potential algorithmic
applications of methods for approximating QMA(2) acceptance probabilities.
Finally, our test can also be used to construct an efficient test for
determining whether a unitary operator is a tensor product, which is a
generalisation of classical linearity testing.

Comment: 44 pages, 1 figure, 7 appendices; v6: added references, rearranged sections, added discussion of connections to classical CS. Final version to appear in J. of the ACM
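For intuition, a small numerical sketch of the product test's acceptance probability, computed from the swap-test expansion P_accept = 2^{-n} * sum over subsets S of Tr[rho_S^2] on two copies of psi; the states below are illustrative examples, not from the paper.

```python
# Numerical check of the two-copy product test's acceptance probability via
# P_accept = 2^{-n} * sum_{S subset of [n]} Tr[rho_S^2], where each term
# equals <psi,psi| SWAP_S |psi,psi>. For intuition only.
import itertools
import numpy as np

def product_test_acceptance(psi, dims):
    """Acceptance probability of the product test on two copies of psi."""
    n = len(dims)
    # |psi> tensor |psi> as a tensor with 2n subsystem axes.
    t = np.tensordot(psi.reshape(dims), psi.reshape(dims), axes=0)
    total = 0.0
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            swapped = t
            for i in S:                 # SWAP_S exchanges the copies on S
                swapped = np.swapaxes(swapped, i, n + i)
            total += np.vdot(t, swapped).real
    return total / 2 ** n

product = np.kron([1, 0], [1, 0]).astype(complex)            # |00>
entangled = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
print(product_test_acceptance(product, (2, 2)))    # 1.0: product states pass
print(product_test_acceptance(entangled, (2, 2)))  # 0.75: far from product
```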
Testing probability distributions underlying aggregated data
In this paper, we analyze and study a hybrid model for testing and learning
probability distributions. Here, in addition to samples, the testing algorithm
is provided with one of two different types of oracles to the unknown
distribution D over [n]. More precisely, we define both the dual and
cumulative dual access models, in which the algorithm can both sample from D
and, respectively, for any i in [n],
- query the probability mass D(i) (query access); or
- get the total mass of {1,...,i}, i.e. D(1)+...+D(i) (cumulative
access).
These two models, by generalizing the previously studied sampling and query
oracle models, allow us to bypass the strong lower bounds established for a
number of problems in these settings, while capturing several interesting
aspects of these problems -- and providing new insight on the limitations of
the models. Finally, we show that while the testing algorithms can in most
cases be strictly more efficient, some tasks remain hard even with this
additional power.
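A minimal sketch of the two access models, assuming the distribution is represented as an explicit probability vector; the class and method names are illustrative, not from the paper.

```python
# Sketch of the dual and cumulative dual access models: the tester can
# draw samples and, in addition, query point masses or prefix masses.
# Illustrative interface; names are not from the paper.
import bisect
import random

class DualAccessOracle:
    def __init__(self, probs):
        self.probs = probs
        self.cum = []                  # prefix sums D(1)+...+D(i)
        total = 0.0
        for p in probs:
            total += p
            self.cum.append(total)

    def sample(self):
        """Standard sampling access: draw i with probability D(i)."""
        idx = bisect.bisect_left(self.cum, random.random())
        return min(idx, len(self.probs) - 1)

    def query(self, i):
        """Dual access: return the probability mass D(i)."""
        return self.probs[i]

    def cumulative(self, i):
        """Cumulative dual access: return the total mass of {1,...,i}."""
        return self.cum[i]

oracle = DualAccessOracle([0.1, 0.2, 0.3, 0.4])
print(oracle.sample(), oracle.query(2), oracle.cumulative(2))
```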
Estimating Local Function Complexity via Mixture of Gaussian Processes
Real world data often exhibit inhomogeneity, e.g., the noise level, the
sampling distribution or the complexity of the target function may change over
the input space. In this paper, we try to isolate local function complexity in
a practical, robust way. This is achieved by first estimating the locally
optimal kernel bandwidth as a functional relationship. Specifically, we propose
Spatially Adaptive Bandwidth Estimation in Regression (SABER), which employs
the mixture of experts consisting of multinomial kernel logistic regression as
a gate and Gaussian process regression models as experts. Using the locally
optimal kernel bandwidths, we deduce an estimate to the local function
complexity by drawing parallels to the theory of locally linear smoothing. We
demonstrate the usefulness of local function complexity for model
interpretation and active learning in quantum chemistry experiments and fluid
dynamics simulations.

Comment: 19 pages, 16 figures
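A heavily simplified sketch of the gated mixture-of-experts architecture, assuming scikit-learn GP experts with fixed RBF bandwidths and a multinomial logistic-regression gate trained on per-point residuals; this illustrates the idea only and is not the paper's SABER estimator.

```python
# Toy mixture of GP experts with a logistic-regression gate (illustrative,
# not the paper's SABER procedure): each expert uses a fixed RBF bandwidth;
# the gate learns which bandwidth is locally appropriate.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
f = lambda x: np.where(x < 0, np.sin(x), np.sin(8 * x))  # inhomogeneous target
y = f(X[:, 0]) + 0.1 * rng.standard_normal(400)
Xtr, ytr, Xva, yva = X[:200], y[:200], X[200:], y[200:]

bandwidths = [0.1, 0.5, 2.0]
experts = [GaussianProcessRegressor(kernel=RBF(h), alpha=0.1**2,
                                    optimizer=None).fit(Xtr, ytr)
           for h in bandwidths]

# Assign each validation point to the expert with the smallest residual,
# then train a gate to predict that assignment from the input location.
residuals = np.stack([np.abs(e.predict(Xva) - yva) for e in experts], axis=1)
gate = LogisticRegression(max_iter=1000).fit(Xva, residuals.argmin(axis=1))

# The gated bandwidth is a rough proxy for local function complexity:
# a smaller locally optimal bandwidth suggests more complex local behaviour.
local_bw = np.array(bandwidths)[gate.predict(Xva)]
print(local_bw[:5])
```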
On Counting Triangles through Edge Sampling in Large Dynamic Graphs
Traditional frameworks for dynamic graphs have relied on processing only the
stream of edges added into or deleted from an evolving graph, but not any
additional related information such as the degrees or neighbor lists of nodes
incident to the edges. In this paper, we propose a new edge sampling framework
for big-graph analytics in dynamic graphs which enhances the traditional model
by enabling the use of additional related information. To demonstrate the
advantages of this framework, we present a new sampling algorithm, called Edge
Sample and Discard (ESD). It generates an unbiased estimate of the total number
of triangles, which can be continuously updated in response to both edge
additions and deletions. We provide a comparative analysis of the performance
of ESD against two current state-of-the-art algorithms in terms of accuracy and
complexity. The results of the experiments performed on real graphs show that,
with the help of the neighborhood information of the sampled edges, the
accuracy achieved by our algorithm is substantially better. We also
characterize the impact of properties of the graph on the performance of our
algorithm by testing on several Barabasi-Albert graphs.

Comment: A short version of this article appeared in Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2017)
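A minimal sketch of the sample-and-discard idea for an insertion-only stream (deletions would decrement symmetrically), assuming the framework lets the estimator query current neighbor lists; the bookkeeping below is illustrative rather than the paper's exact ESD algorithm.

```python
# Sketch of an edge-sampling triangle estimator on an insertion stream.
# Each triangle is completed by exactly one arriving edge, so counting the
# common neighbours of a p-fraction of arriving edges, scaled by 1/p,
# gives an unbiased estimate. Illustrative, not the exact ESD bookkeeping.
import random
from collections import defaultdict

def estimate_triangles(edge_stream, p=0.1, seed=0):
    random.seed(seed)
    adj = defaultdict(set)      # current graph, maintained by the framework
    estimate = 0.0
    for u, v in edge_stream:
        if random.random() < p:
            # Query the neighbour lists (the extra information the
            # framework exposes) and count triangles this edge closes.
            estimate += len(adj[u] & adj[v]) / p
        adj[u].add(v)           # the sampled edge itself is then discarded;
        adj[v].add(u)           # only the graph structure is updated
    return estimate

edges = [(0, 1), (1, 2), (0, 2), (2, 3), (0, 3)]  # contains two triangles
print(estimate_triangles(edges, p=1.0))           # p=1 recovers exact count
```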
Adaptive Compressed Sensing for Support Recovery of Structured Sparse Sets
This paper investigates the problem of recovering the support of structured
signals via adaptive compressive sensing. We examine several classes of
structured support sets, and characterize the fundamental limits of accurately
recovering such sets through compressive measurements, while simultaneously
providing adaptive support recovery protocols that perform near optimally for
these classes. We show that by adaptively designing the sensing matrix we can
attain significant performance gains over non-adaptive protocols. These gains
arise from the fact that adaptive sensing can: (i) better mitigate the effects
of noise, and (ii) better capitalize on the structure of the support sets.

Comment: To appear in IEEE Transactions on Information Theory
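A toy sketch of why adaptivity helps, in the spirit of sequential refinement (distilled sensing) rather than the paper's specific protocols: the sensing budget is repeatedly refocused on coordinates that still look active, so most nulls are discarded while the true support survives.

```python
# Toy sequential-refinement sketch (distilled-sensing style, not the
# paper's protocols): repeatedly measure surviving coordinates and discard
# those that look inactive, refocusing sensing on a shrinking candidate set.
import numpy as np

rng = np.random.default_rng(1)
n, k = 10_000, 20
x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = 3.0                        # weak positive signal on the support

candidates = np.arange(n)
for _ in range(5):
    # Measure each surviving coordinate once with unit-variance noise.
    obs = x[candidates] + rng.standard_normal(candidates.size)
    # Keep coordinates with positive observations: the true signal survives
    # with high probability; about half the nulls are discarded per round.
    candidates = candidates[obs > 0]

print(candidates.size)                    # a few hundred survivors out of 10k
print(set(support) <= set(candidates))    # likely True at this SNR
```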
Metrics and models to support the development of hybrid information systems
The research described here concerns the development of metrics and models to support the development of hybrid (conventional/knowledge based) integrated systems. The thesis argues from the point that, although it is well known that estimating the cost, duration and quality of information systems is a difficult task, it is far from clear what sorts of tools and techniques would adequately support a project manager in the estimation of these properties.

A literature review shows that metrics (measurements) and estimating tools have been developed for conventional systems since the 1960s, while there has been very little research on metrics for knowledge based systems (KBSs). Furthermore, although there are a number of theoretical problems with many of the "classic" metrics developed for conventional systems, it also appears that the tools which such metrics can be used to develop are not widely used by project managers. A survey was carried out of large UK companies which confirmed this continuing state of affairs.

Before any useful tools could be developed, therefore, it was important to find out why project managers were not using these tools already. By characterising those companies that use software cost estimating (SCE) tools against those which could but do not, it was possible to recognise the involvement of the client/customer in the process of estimation. Pursuing this point, a model of the early estimating and planning stages (the EEPS model) was developed to test exactly where estimating takes place. The EEPS model suggests that estimating could take place either before a fully-developed plan has been produced, or while this plan is being produced. If it were the former, then SCE tools would be particularly useful, since there is very little other data available from which to produce an estimate. A second survey, however, indicated that project managers see estimating as being essentially the latter, at which point project management tools are available to support the process. It would seem, therefore, that SCE tools are not being used because project management tools are being used instead. The issue here is not with the method of developing an estimating model or tool, but with the way in which "an estimate" is intimately tied to an understanding of what tasks are being planned. Current SCE tools are perceived by project managers as targeting the wrong point of estimation. A model (called TABATHA) is then presented which describes how an estimating tool based on an analysis of tasks would fit into the planning stage.

The issue of whether metrics can be usefully developed for hybrid systems (which also contain KBS components) is tested by extending a number of "classic" program size and structure metrics to a KBS language, Prolog. Measurements of lines of code, Halstead's operators/operands, McCabe's cyclomatic complexity, Henry & Kafura's data flow fan-in/out and post-release reported errors were taken for a set of 80 commercially-developed LPA Prolog programs. By re-defining the metric counts for Prolog, it was found that estimates of program size and error-proneness comparable to the best conventional studies are possible. This suggests that metrics can be usefully applied to KBS languages such as Prolog, and thus the development of metrics and models to support the development of hybrid information systems is both feasible and useful.
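As a toy illustration of what redefining classic size and structure metrics for Prolog can look like, the sketch below computes naive token-based counts over Prolog source text; the counting rules are illustrative stand-ins, not the thesis's redefined metrics.

```python
# Toy illustration of classic size/structure metrics applied to Prolog
# source: naive token-based counts, not the thesis's redefined metrics.
import re

def prolog_metrics(source: str) -> dict:
    lines = [line for line in source.splitlines()
             if line.strip() and not line.strip().startswith('%')]
    # Crude McCabe-style count: one path per clause terminator, plus one
    # per disjunction (';') and if-then ('->') inside clause bodies.
    clauses = source.count('.')
    branches = source.count(';') + source.count('->')
    # Crude Halstead-style operand count: distinct atoms and variables.
    operands = set(re.findall(r"\b[a-zA-Z_]\w*\b", source))
    return {
        "lines_of_code": len(lines),
        "cyclomatic_estimate": clauses + branches,
        "distinct_operands": len(operands),
    }

example = """
max(X, Y, X) :- X >= Y.
max(_, Y, Y).
abs(X, A) :- ( X < 0 -> A is -X ; A is X ).
"""
print(prolog_metrics(example))
```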
The Power of an Example: Hidden Set Size Approximation Using Group Queries and Conditional Sampling
We study a basic problem of approximating the size of an unknown set S in a
known universe U. We consider two versions of the problem. In both versions
the algorithm can specify subsets T of U. In the first version, which we
refer to as the group query or subset query version, the algorithm is told
whether T ∩ S is non-empty. In the second version, which we refer to as the
subset sampling version, if T ∩ S is non-empty, then the algorithm receives
a uniformly selected element of T ∩ S. We study the difference between
these two versions under different conditions on the subsets that the
algorithm may query/sample, both in the case that the algorithm is adaptive
and in the case where it is non-adaptive. In particular we focus on a
natural family of allowed subsets, which correspond to intervals, as well as
variants of this family.
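A minimal sketch of size approximation from group queries, assuming unrestricted random subsets (not the interval family the paper focuses on): each element joins T independently with probability p, so Pr[T ∩ S is empty] = (1-p)^|S|, and inverting an empirical estimate of that probability yields |S|. Illustrative, not the paper's algorithm.

```python
# Sketch: estimate |S| from group (subset) queries using random subsets.
# Pr[T ∩ S empty] = (1 - p)^|S| when each element joins T with probability
# p, so |S| ≈ log(empty_fraction) / log(1 - p). Not the paper's algorithm.
import math
import random

def estimate_set_size(universe, is_nonempty, p=0.005, trials=500, seed=0):
    rng = random.Random(seed)
    empty = 0
    for _ in range(trials):
        T = {u for u in universe if rng.random() < p}
        if not is_nonempty(T):
            empty += 1
    frac = max(empty / trials, 1.0 / trials)   # avoid log(0)
    return math.log(frac) / math.log(1.0 - p)

universe = range(10_000)
S = set(random.Random(42).sample(range(10_000), 200))
is_nonempty = lambda T: bool(T & S)            # the group-query oracle
print(round(estimate_set_size(universe, is_nonempty)))  # roughly 200
```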