Non-Negative Matrix Factorization for Learning Alignment-Specific Models of Protein Evolution
Models of protein evolution currently come in two flavors: generalist and specialist. Generalist models (e.g. PAM, JTT, WAG) adopt a one-size-fits-all approach, where a single model is estimated from a number of different protein alignments. Specialist models (e.g. mtREV, rtREV, HIVbetween) can be estimated when a large quantity of data is available for a single organism or gene, and are intended for use on that organism or gene only. Unsurprisingly, specialist models outperform generalist models, but in most instances there simply are not enough data available to estimate them. We propose a method for estimating alignment-specific models of protein evolution in which the complexity of the model is adapted to suit the richness of the data. Our method uses non-negative matrix factorization (NNMF) to learn a set of basis matrices from a general dataset containing a large number of alignments of different proteins, thus capturing the dimensions of important variation. It then learns a set of weights that are specific to the organism or gene of interest and for which only a smaller dataset is available. Thus the alignment-specific model is obtained as a weighted sum of the basis matrices. Having been constrained to vary along only as many dimensions as the data justify, the model has far fewer parameters than would be required to estimate a specialist model. We show that our NNMF procedure produces models that outperform existing methods on all but one of 50 test alignments. The basis matrices we obtain confirm the expectation that amino acid properties tend to be conserved, and allow us to quantify, on specific alignments, how the strength of conservation varies across different properties. We also apply our new models to phylogeny inference and show that the resulting phylogenies are different from, and have improved likelihood over, those inferred under standard models.
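To make the weighted-sum construction concrete, here is a minimal sketch, assuming hypothetical flattened 20x20 replacement matrices and using scikit-learn's NMF plus non-negative least squares in place of the paper's actual estimation procedure; all shapes and names are illustrative.

```python
# Sketch: learn basis matrices via NMF from a large general dataset, then
# fit only a few non-negative weights on the small dataset of interest.
# Hypothetical data shapes; not the paper's actual estimation procedure.
import numpy as np
from sklearn.decomposition import NMF
from scipy.optimize import nnls

rng = np.random.default_rng(0)
general = rng.random((500, 400))   # 500 alignments x flattened 20x20 matrices

# Learn k basis matrices capturing the main dimensions of variation.
k = 5
nmf = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
nmf.fit(general)
bases = nmf.components_            # k x 400; each row is one basis matrix

# For the target alignment, estimate only k weights, not a full matrix.
target = rng.random(400)           # flattened matrix of the target alignment
weights, _ = nnls(bases.T, target)
model = weights @ bases            # alignment-specific model: weighted sum
print(model.reshape(20, 20).shape)
```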
Testing product states, quantum Merlin-Arthur games and tensor optimisation
We give a test that can distinguish efficiently between product states of n
quantum systems and states which are far from product. If applied to a state
psi whose maximum overlap with a product state is 1-epsilon, the test passes
with probability 1-Theta(epsilon), regardless of n or the local dimensions of
the individual systems. The test uses two copies of psi. We prove correctness
of this test as a special case of a more general result regarding stability of
maximum output purity of the depolarising channel. A key application of the
test is to quantum Merlin-Arthur games with multiple Merlins, where we obtain
several structural results that had been previously conjectured, including the
fact that efficient soundness amplification is possible and that two Merlins
can simulate many Merlins: QMA(k)=QMA(2) for k>=2. Building on a previous
result of Aaronson et al, this implies that there is an efficient quantum
algorithm to verify 3-SAT with constant soundness, given two unentangled proofs
of O(sqrt(n) polylog(n)) qubits. We also show how QMA(2) with log-sized proofs
is equivalent to a large number of problems, some related to quantum
information (such as testing separability of mixed states) as well as problems
without any apparent connection to quantum mechanics (such as computing
injective tensor norms of 3-index tensors). As a consequence, we obtain many
hardness-of-approximation results, as well as potential algorithmic
applications of methods for approximating QMA(2) acceptance probabilities.
Finally, our test can also be used to construct an efficient test for
determining whether a unitary operator is a tensor product, which is a
generalisation of classical linearity testing.

Comment: 44 pages, 1 figure, 7 appendices; v6: added references, rearranged sections, added discussion of connections to classical CS. Final version to appear in J. of the ACM
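For intuition, a small numerical sketch of the product test's acceptance probability, computed from the swap-test expansion P_accept = 2^{-n} * sum over subsets S of Tr[rho_S^2] on two copies of psi; the states below are illustrative examples, not from the paper.

```python
# Numerical check of the two-copy product test's acceptance probability via
# P_accept = 2^{-n} * sum_{S subset of [n]} Tr[rho_S^2], where each term
# equals <psi,psi| SWAP_S |psi,psi>. For intuition only.
import itertools
import numpy as np

def product_test_acceptance(psi, dims):
    """Acceptance probability of the product test on two copies of psi."""
    n = len(dims)
    # |psi> tensor |psi> as a tensor with 2n subsystem axes.
    t = np.tensordot(psi.reshape(dims), psi.reshape(dims), axes=0)
    total = 0.0
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            swapped = t
            for i in S:                 # SWAP_S exchanges the copies on S
                swapped = np.swapaxes(swapped, i, n + i)
            total += np.vdot(t, swapped).real
    return total / 2 ** n

product = np.kron([1, 0], [1, 0]).astype(complex)            # |00>
entangled = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
print(product_test_acceptance(product, (2, 2)))    # 1.0: product states pass
print(product_test_acceptance(entangled, (2, 2)))  # 0.75: far from product
```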
Testing probability distributions underlying aggregated data
In this paper, we analyze and study a hybrid model for testing and learning
probability distributions. Here, in addition to samples, the testing algorithm
is provided with one of two different types of oracles to the unknown
distribution D over [n]. More precisely, we define both the dual and
cumulative dual access models, in which the algorithm can both sample from D
and, respectively, for any i in [n],
- query the probability mass D(i) (query access); or
- get the total mass of {1,...,i}, i.e. D(1)+...+D(i) (cumulative
access).
These two models, by generalizing the previously studied sampling and query
oracle models, allow us to bypass the strong lower bounds established for a
number of problems in these settings, while capturing several interesting
aspects of these problems -- and providing new insight on the limitations of
the models. Finally, we show that while the testing algorithms can in most
cases be strictly more efficient, some tasks remain hard even with this
additional power.
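A minimal sketch of the two access models, assuming the distribution is represented as an explicit probability vector; the class and method names are illustrative, not from the paper.

```python
# Sketch of the dual and cumulative dual access models: the tester can
# draw samples and, in addition, query point masses or prefix masses.
# Illustrative interface; names are not from the paper.
import bisect
import random

class DualAccessOracle:
    def __init__(self, probs):
        self.probs = probs
        self.cum = []                  # prefix sums D(1)+...+D(i)
        total = 0.0
        for p in probs:
            total += p
            self.cum.append(total)

    def sample(self):
        """Standard sampling access: draw i with probability D(i)."""
        idx = bisect.bisect_left(self.cum, random.random())
        return min(idx, len(self.probs) - 1)

    def query(self, i):
        """Dual access: return the probability mass D(i)."""
        return self.probs[i]

    def cumulative(self, i):
        """Cumulative dual access: return the total mass of {1,...,i}."""
        return self.cum[i]

oracle = DualAccessOracle([0.1, 0.2, 0.3, 0.4])
print(oracle.sample(), oracle.query(2), oracle.cumulative(2))
```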
Estimating Local Function Complexity via Mixture of Gaussian Processes
Real world data often exhibit inhomogeneity, e.g., the noise level, the
sampling distribution or the complexity of the target function may change over
the input space. In this paper, we try to isolate local function complexity in
a practical, robust way. This is achieved by first estimating the locally
optimal kernel bandwidth as a functional relationship. Specifically, we propose
Spatially Adaptive Bandwidth Estimation in Regression (SABER), which employs
the mixture of experts consisting of multinomial kernel logistic regression as
a gate and Gaussian process regression models as experts. Using the locally
optimal kernel bandwidths, we deduce an estimate to the local function
complexity by drawing parallels to the theory of locally linear smoothing. We
demonstrate the usefulness of local function complexity for model
interpretation and active learning in quantum chemistry experiments and fluid
dynamics simulations.

Comment: 19 pages, 16 figures
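A heavily simplified sketch of the gated mixture-of-experts architecture, assuming scikit-learn GP experts with fixed RBF bandwidths and a multinomial logistic-regression gate trained on per-point residuals; this illustrates the idea only and is not the paper's SABER estimator.

```python
# Toy mixture of GP experts with a logistic-regression gate (illustrative,
# not the paper's SABER procedure): each expert uses a fixed RBF bandwidth;
# the gate learns which bandwidth is locally appropriate.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
f = lambda x: np.where(x < 0, np.sin(x), np.sin(8 * x))  # inhomogeneous target
y = f(X[:, 0]) + 0.1 * rng.standard_normal(400)
Xtr, ytr, Xva, yva = X[:200], y[:200], X[200:], y[200:]

bandwidths = [0.1, 0.5, 2.0]
experts = [GaussianProcessRegressor(kernel=RBF(h), alpha=0.1**2,
                                    optimizer=None).fit(Xtr, ytr)
           for h in bandwidths]

# Assign each validation point to the expert with the smallest residual,
# then train a gate to predict that assignment from the input location.
residuals = np.stack([np.abs(e.predict(Xva) - yva) for e in experts], axis=1)
gate = LogisticRegression(max_iter=1000).fit(Xva, residuals.argmin(axis=1))

# The gated bandwidth is a rough proxy for local function complexity:
# a smaller locally optimal bandwidth suggests more complex local behaviour.
local_bw = np.array(bandwidths)[gate.predict(Xva)]
print(local_bw[:5])
```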
On Counting Triangles through Edge Sampling in Large Dynamic Graphs
Traditional frameworks for dynamic graphs have relied on processing only the
stream of edges added into or deleted from an evolving graph, but not any
additional related information such as the degrees or neighbor lists of nodes
incident to the edges. In this paper, we propose a new edge sampling framework
for big-graph analytics in dynamic graphs which enhances the traditional model
by enabling the use of additional related information. To demonstrate the
advantages of this framework, we present a new sampling algorithm, called Edge
Sample and Discard (ESD). It generates an unbiased estimate of the total number
of triangles, which can be continuously updated in response to both edge
additions and deletions. We provide a comparative analysis of the performance
of ESD against two current state-of-the-art algorithms in terms of accuracy and
complexity. The results of the experiments performed on real graphs show that,
with the help of the neighborhood information of the sampled edges, the
accuracy achieved by our algorithm is substantially better. We also
characterize the impact of properties of the graph on the performance of our
algorithm by testing on several Barabasi-Albert graphs.

Comment: A short version of this article appeared in Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2017)
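A minimal sketch of the sample-and-discard idea for an insertion-only stream (deletions would decrement symmetrically), assuming the framework lets the estimator query current neighbor lists; the bookkeeping below is illustrative rather than the paper's exact ESD algorithm.

```python
# Sketch of an edge-sampling triangle estimator on an insertion stream.
# Each triangle is completed by exactly one arriving edge, so counting the
# common neighbours of a p-fraction of arriving edges, scaled by 1/p,
# gives an unbiased estimate. Illustrative, not the exact ESD bookkeeping.
import random
from collections import defaultdict

def estimate_triangles(edge_stream, p=0.1, seed=0):
    random.seed(seed)
    adj = defaultdict(set)      # current graph, maintained by the framework
    estimate = 0.0
    for u, v in edge_stream:
        if random.random() < p:
            # Query the neighbour lists (the extra information the
            # framework exposes) and count triangles this edge closes.
            estimate += len(adj[u] & adj[v]) / p
        adj[u].add(v)           # the sampled edge itself is then discarded;
        adj[v].add(u)           # only the graph structure is updated
    return estimate

edges = [(0, 1), (1, 2), (0, 2), (2, 3), (0, 3)]  # contains two triangles
print(estimate_triangles(edges, p=1.0))           # p=1 recovers exact count
```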
Adaptive Compressed Sensing for Support Recovery of Structured Sparse Sets
This paper investigates the problem of recovering the support of structured
signals via adaptive compressive sensing. We examine several classes of
structured support sets, and characterize the fundamental limits of accurately
recovering such sets through compressive measurements, while simultaneously
providing adaptive support recovery protocols that perform near optimally for
these classes. We show that by adaptively designing the sensing matrix we can
attain significant performance gains over non-adaptive protocols. These gains
arise from the fact that adaptive sensing can: (i) better mitigate the effects
of noise, and (ii) better capitalize on the structure of the support sets.

Comment: To appear in IEEE Transactions on Information Theory
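A toy sketch of why adaptivity helps, in the spirit of sequential refinement (distilled sensing) rather than the paper's specific protocols: the sensing budget is repeatedly refocused on coordinates that still look active, so most nulls are discarded while the true support survives.

```python
# Toy sequential-refinement sketch (distilled-sensing style, not the
# paper's protocols): repeatedly measure surviving coordinates and discard
# those that look inactive, refocusing sensing on a shrinking candidate set.
import numpy as np

rng = np.random.default_rng(1)
n, k = 10_000, 20
x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = 3.0                        # weak positive signal on the support

candidates = np.arange(n)
for _ in range(5):
    # Measure each surviving coordinate once with unit-variance noise.
    obs = x[candidates] + rng.standard_normal(candidates.size)
    # Keep coordinates with positive observations: the true signal survives
    # with high probability; about half the nulls are discarded per round.
    candidates = candidates[obs > 0]

print(candidates.size)                    # a few hundred survivors out of 10k
print(set(support) <= set(candidates))    # likely True at this SNR
```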
Metrics and models to support the development of hybrid information systems
The research described here concerns the development of metrics and models to support the development of hybrid (conventional/knowledge based) integrated systems. The thesis argues from the point that, although it is well known that estimating the cost, duration and quality of information systems is a difficult task, it is far from clear what sorts of tools and techniques would adequately support a project manager in the estimation of these properties.

A literature review shows that metrics (measurements) and estimating tools have been developed for conventional systems since the 1960s, while there has been very little research on metrics for knowledge based systems (KBSs). Furthermore, although there are a number of theoretical problems with many of the "classic" metrics developed for conventional systems, it also appears that the tools which such metrics can be used to develop are not widely used by project managers. A survey was carried out of large UK companies which confirmed this continuing state of affairs.

Before any useful tools could be developed, therefore, it was important to find out why project managers were not using these tools already. By characterising those companies that use software cost estimating (SCE) tools against those which could but do not, it was possible to recognise the involvement of the client/customer in the process of estimation. Pursuing this point, a model of the early estimating and planning stages (the EEPS model) was developed to test exactly where estimating takes place. The EEPS model suggests that estimating could take place either before a fully-developed plan has been produced, or while this plan is being produced. If it were the former, then SCE tools would be particularly useful, since there is very little other data available from which to produce an estimate. A second survey, however, indicated that project managers see estimating as being essentially the latter, at which point project management tools are available to support the process. It would seem, therefore, that SCE tools are not being used because project management tools are being used instead. The issue here is not with the method of developing an estimating model or tool, but with the way in which "an estimate" is intimately tied to an understanding of what tasks are being planned. Current SCE tools are perceived by project managers as targeting the wrong point of estimation. A model (called TABATHA) is then presented which describes how an estimating tool based on an analysis of tasks would fit into the planning stage.

The issue of whether metrics can be usefully developed for hybrid systems (which also contain KBS components) is tested by extending a number of "classic" program size and structure metrics to a KBS language, Prolog. Measurements of lines of code, Halstead's operators/operands, McCabe's cyclomatic complexity, Henry & Kafura's data flow fan-in/out and post-release reported errors were taken for a set of 80 commercially-developed LPA Prolog programs. By re-defining the metric counts for Prolog, it was found that estimates of program size and error-proneness comparable to the best conventional studies are possible. This suggests that metrics can be usefully applied to KBS languages such as Prolog, and thus the development of metrics and models to support the development of hybrid information systems is both feasible and useful.
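As a toy illustration of what redefining classic size and structure metrics for Prolog can look like, the sketch below computes naive token-based counts over Prolog source text; the counting rules are illustrative stand-ins, not the thesis's redefined metrics.

```python
# Toy illustration of classic size/structure metrics applied to Prolog
# source: naive token-based counts, not the thesis's redefined metrics.
import re

def prolog_metrics(source: str) -> dict:
    lines = [line for line in source.splitlines()
             if line.strip() and not line.strip().startswith('%')]
    # Crude McCabe-style count: one path per clause terminator, plus one
    # per disjunction (';') and if-then ('->') inside clause bodies.
    clauses = source.count('.')
    branches = source.count(';') + source.count('->')
    # Crude Halstead-style operand count: distinct atoms and variables.
    operands = set(re.findall(r"\b[a-zA-Z_]\w*\b", source))
    return {
        "lines_of_code": len(lines),
        "cyclomatic_estimate": clauses + branches,
        "distinct_operands": len(operands),
    }

example = """
max(X, Y, X) :- X >= Y.
max(_, Y, Y).
abs(X, A) :- ( X < 0 -> A is -X ; A is X ).
"""
print(prolog_metrics(example))
```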
The Power of an Example: Hidden Set Size Approximation Using Group Queries and Conditional Sampling
We study a basic problem of approximating the size of an unknown set S in a
known universe U. We consider two versions of the problem. In both versions
the algorithm can specify subsets T of U. In the first version, which we
refer to as the group query or subset query version, the algorithm is told
whether T ∩ S is non-empty. In the second version, which we refer to as the
subset sampling version, if T ∩ S is non-empty, then the algorithm receives
a uniformly selected element of T ∩ S. We study the difference between
these two versions under different conditions on the subsets that the
algorithm may query/sample, both in the case that the algorithm is adaptive
and in the case where it is non-adaptive. In particular we focus on a
natural family of allowed subsets, which correspond to intervals, as well as
variants of this family.
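A minimal sketch of size approximation from group queries, assuming unrestricted random subsets (not the interval family the paper focuses on): each element joins T independently with probability p, so Pr[T ∩ S is empty] = (1-p)^|S|, and inverting an empirical estimate of that probability yields |S|. Illustrative, not the paper's algorithm.

```python
# Sketch: estimate |S| from group (subset) queries using random subsets.
# Pr[T ∩ S empty] = (1 - p)^|S| when each element joins T with probability
# p, so |S| ≈ log(empty_fraction) / log(1 - p). Not the paper's algorithm.
import math
import random

def estimate_set_size(universe, is_nonempty, p=0.005, trials=500, seed=0):
    rng = random.Random(seed)
    empty = 0
    for _ in range(trials):
        T = {u for u in universe if rng.random() < p}
        if not is_nonempty(T):
            empty += 1
    frac = max(empty / trials, 1.0 / trials)   # avoid log(0)
    return math.log(frac) / math.log(1.0 - p)

universe = range(10_000)
S = set(random.Random(42).sample(range(10_000), 200))
is_nonempty = lambda T: bool(T & S)            # the group-query oracle
print(round(estimate_set_size(universe, is_nonempty)))  # roughly 200
```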