
    Evolutionary Inference via the Poisson Indel Process

    We address the problem of the joint statistical inference of phylogenetic trees and multiple sequence alignments from unaligned molecular sequences. This problem is generally formulated in terms of string-valued evolutionary processes along the branches of a phylogenetic tree. The classical evolutionary process, the TKF91 model, is a continuous-time Markov chain model comprised of insertion, deletion and substitution events. Unfortunately this model gives rise to an intractable computational problem: the computation of the marginal likelihood under the TKF91 model is exponential in the number of taxa. In this work, we present a new stochastic process, the Poisson Indel Process (PIP), in which the complexity of this computation is reduced to linear. The new model is closely related to the TKF91 model, differing only in its treatment of insertions, but the new model has a global characterization as a Poisson process on the phylogeny. Standard results for Poisson processes allow key computations to be decoupled, which yields the favorable computational profile of inference under the PIP model. We present illustrative experiments in which Bayesian inference under the PIP model is compared to separate inference of phylogenies and alignments. Comment: 33 pages, 6 figures
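The decoupling mentioned above rests on elementary Poisson-process properties: the number of events on the tree is Poisson-distributed with mean proportional to total tree length, and event locations fall uniformly along the branches. A minimal sketch of sampling insertion events under these assumptions (illustrative only; this is not the paper's inference machinery, and the function name is ours):

```python
import random

def sample_insertions(branch_lengths, rate):
    """Sample insertion events on a phylogeny as a homogeneous Poisson
    process: the event count is Poisson(rate * total tree length), and
    each event lands on a branch with probability proportional to that
    branch's length. Illustrative sketch, not the PIP inference itself."""
    total = sum(branch_lengths.values())
    # Draw the Poisson count via exponential waiting times.
    n, t = 0, random.expovariate(rate)
    while t < total:
        n += 1
        t += random.expovariate(rate)
    branches = list(branch_lengths)
    weights = [branch_lengths[b] / total for b in branches]
    return [random.choices(branches, weights)[0] for _ in range(n)]
```

Because the count and the locations are independent, the per-branch computations decouple, which is the property the PIP model exploits.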

    Demonstration of a scaling advantage for a quantum annealer over simulated annealing

    The observation of an unequivocal quantum speedup remains an elusive objective for quantum computing. The D-Wave quantum annealing processors have been at the forefront of experimental attempts to address this goal, given their relatively large numbers of qubits and programmability. A complete determination of the optimal time-to-solution (TTS) using these processors has not been possible to date, preventing definitive conclusions about the presence of a scaling advantage. The main technical obstacle has been the inability to verify an optimal annealing time within the available range. Here we overcome this obstacle and present a class of problem instances for which we observe an optimal annealing time using a D-Wave 2000Q processor over a range spanning up to more than 2000 qubits. This allows us to perform an optimal TTS benchmarking analysis and a comparison to several classical algorithms, including simulated annealing, spin-vector Monte Carlo, and discrete-time simulated quantum annealing. We establish the first example of a scaling advantage for an experimental quantum annealer over classical simulated annealing: we find that the D-Wave device exhibits certifiably better scaling than simulated annealing, with 95% confidence, over the range of problem sizes that we can test. However, we do not find evidence for a quantum speedup: simulated quantum annealing exhibits the best scaling by a significant margin. Our construction of instance classes with verifiably optimal annealing times opens up the possibility of generating many new such classes, paving the way for further definitive assessments of scaling advantages using current and future quantum annealing devices. Comment: 26 pages, 22 figures. v2: Updated benchmarking results with additional analysis. v3: Updated to published version
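Time-to-solution, the metric at the center of this benchmarking analysis, is conventionally defined as the total anneal time needed to observe the ground state at least once with 99% probability; the optimal annealing time is the single-run time that minimizes it. A sketch under that standard definition (the function name is ours):

```python
import math

def time_to_solution(t_run, p_success, target=0.99):
    """Standard TTS metric: expected total run time to find the ground
    state at least once with probability `target`, given per-run success
    probability `p_success` and single-run (anneal) time `t_run`."""
    if p_success >= 1.0:
        return t_run
    if p_success <= 0.0:
        return math.inf
    return t_run * math.log(1 - target) / math.log(1 - p_success)
```

The abstract's point is that this quantity can only be benchmarked fairly when the annealing time minimizing it lies inside the hardware's programmable range, which the constructed instance classes make verifiable.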

    A genotyping protocol for multiple tissue types from the polyploid tree species Sequoia sempervirens (Cupressaceae).

    Premise of the study: Identifying clonal lineages in asexually reproducing plants using microsatellite markers is complicated by the possibility of nonidentical genotypes from the same clonal lineage due to somatic mutations, null alleles, and scoring errors. We developed and tested a clonal identification protocol that is robust to these issues for the asexually reproducing hexaploid tree species coast redwood (Sequoia sempervirens).
    Methods: Microsatellite data from four previously published and two newly developed primers were scored using a modified protocol, and clones were identified using Bruvo genetic distances. The effectiveness of this clonal identification protocol was assessed using simulations and by genotyping a test set of paired samples of different tissue types from the same trees.
    Results: Data from simulations showed that our protocol allowed us to accurately identify clonal lineages. Multiple test samples from the same trees were identified correctly, although certain tissue type pairs had larger genetic distances on average.
    Discussion: The methods described in this paper will allow for the accurate identification of coast redwood clones, facilitating future studies of the reproductive ecology of this species. The techniques used in this paper can be applied to studies of other clonal organisms as well.
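The clonal-assignment step can be pictured as single-linkage grouping under a distance threshold: samples whose pairwise (e.g. Bruvo) genetic distance falls below a calibrated cutoff share a lineage, which tolerates the small distances introduced by somatic mutations or scoring errors. A sketch, assuming pairwise distances are computed elsewhere and the threshold is calibrated as in the protocol:

```python
def clonal_lineages(distances, threshold):
    """Group samples into clonal lineages by single-linkage clustering:
    two samples share a lineage if their genetic distance is below
    `threshold`. `distances` maps frozenset({a, b}) -> float.
    Illustrative sketch, not the published protocol itself."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:          # path-halving union-find
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pair, d in distances.items():
        a, b = tuple(pair)
        find(a); find(b)               # register both samples
        if d < threshold:
            parent[find(a)] = find(b)  # merge lineages
    groups = {}
    for s in parent:
        groups.setdefault(find(s), set()).add(s)
    return list(groups.values())
```

Single linkage is the natural choice here: a chain of near-identical genotypes (e.g. differing by one somatic mutation each) still collapses into one lineage.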

    Strategies for automatic planning: A collection of ideas

    The main goal of the Jet Propulsion Laboratory (JPL) is to obtain science return from interplanetary probes. The uplink process is concerned with communicating commands to a spacecraft in order to achieve science objectives. There are two main parts to the development of the command file which is sent to a spacecraft. First, the activity planning process integrates the science requests for utilization of spacecraft time into a feasible sequence. Then the command generation process converts the sequence into a set of commands. The development of a feasible sequence plan is an expensive and labor-intensive process requiring many months of effort. In order to save time and manpower in the uplink process, automation of parts of this process is desired. There is an ongoing effort to develop automatic planning systems. This has met with some success, but has also been informative about the nature of this effort. It is now clear that innovative techniques and state-of-the-art technology will be required in order to produce a system which can provide automatic sequence planning. As part of this effort, a survey of the literature was conducted, looking for known techniques that may be applicable to our work. Descriptions of and references for these methods are given, together with ideas for applying the techniques to automatic planning.

    Protein Function Prediction using Phylogenomics, Domain Architecture Analysis, Data Integration, and Lexical Scoring

    “As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally.” (Radivojac, Clark, Oron, et al. 2013) Toward this goal, three new protein function annotation tools were developed, which produce trustworthy and concise protein annotations, are easy to obtain and install, and are capable of processing large sets of proteins with reasonable computational resource demands. Especially for high-throughput analysis, e.g. on genome scale, these tools improve over existing tools both in ease of use and accuracy. They are dubbed:
    • Automated Assignment of Human Readable Descriptions (AHRD) (github.com/groupschoof/AHRD; Hallab, Klee, Srinivas, and Schoof 2014),
    • AHRD on gene clusters, and
    • Phylogenetic predictions of Gene Ontology (GO) terms with specific calibrations (PhyloFun v2).
    “AHRD” assigns human readable descriptions (HRDs) to query proteins and was developed to mimic the decision-making process of an expert curator. To this end it processes the descriptions of reference proteins obtained by searching selected databases with BLAST (Altschul, Madden, Schaffer, et al. 1997). Here, the trust a user puts into results found in each of these databases can be weighted separately. In the next step the descriptions of the found homologous proteins are filtered, removing accessions and species information, and finally discarding uninformative candidate descriptions such as “putative protein”. Afterwards a dictionary of meaningful words is constructed from those found in the remaining candidates. Here another filter is applied to ignore words that convey no information, such as the word “protein” itself. In a lexical approach each word is assigned a score based on its frequency in all candidate descriptions, the sequence alignment quality associated with the candidate reference proteins, and the aforementioned trust put into the database the reference was obtained from.
    Subsequently each candidate description is assigned a score, which is computed from the respective scores of the meaningful words contained in that candidate. Also incorporated into this score is the description’s frequency among all regarded candidates. In the final step the highest-scoring description is assigned to the query protein. The performance of this lexical algorithm, implemented in “AHRD”, was then compared with that of competitive methods, namely Blast2GO and “best Blast”, where the latter simply passes the description of the best-scoring hit to the query protein. To enable this comparison, and in the absence of a robust evaluation procedure, a new method to measure the accuracy of textual human readable protein descriptions was developed and applied with success. Here, the accuracy of each assigned description was inferred with the frequently used “F-measure”, the harmonic mean of precision and recall, which we computed by regarding meaningful words appearing in both the reference and the assigned descriptions as true positives. The results showed that “AHRD” not only outperforms its competitors by far, but is also very robust and thus does not require carefully selected parameters. In fact, AHRD’s robustness was demonstrated through cross validation and the use of three different reference sets. The second annotation tool, “AHRD on gene clusters”, uses conserved protein domains from the InterPro database (Apweiler, Attwood, Bairoch, et al. 2000) to annotate clusters of homologous proteins. In a first step the domains found in each cluster are filtered such that only the most informative are retained. For example, family descriptions are discarded if more detailed sub-family descriptions are also found annotated to members of the cluster. Subsequently, the most frequent candidate description is assigned, favoring those of type “family” over “domain”.
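The lexical scoring described above can be sketched as follows. The database-trust and alignment-quality weights AHRD also uses are omitted, and the stopword list is a minimal stand-in for its token blacklist, so this is a simplified illustration rather than AHRD's actual implementation:

```python
import re
from collections import Counter

# Minimal stand-in for AHRD's blacklist of uninformative tokens.
STOPWORDS = {"protein", "putative", "predicted", "uncharacterized"}

def score_descriptions(candidates):
    """Simplified lexical scoring: each informative word is scored by
    its frequency across all candidate descriptions, and a description
    is scored by the scores of its distinct words plus its own
    frequency among the candidates. Returns (best description, scores)."""
    token_lists = [
        [w for w in re.findall(r"[a-z0-9-]+", d.lower()) if w not in STOPWORDS]
        for d in candidates
    ]
    word_freq = Counter(w for tokens in token_lists for w in tokens)
    desc_freq = Counter(candidates)
    scores = {}
    for desc, tokens in zip(candidates, token_lists):
        scores[desc] = sum(word_freq[w] for w in set(tokens)) + desc_freq[desc]
    return max(scores, key=scores.get), scores
```

Even this stripped-down version shows the key behavior: descriptions built from words that recur across many homologs outrank both rare wordings and uninformative boilerplate.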
    Finally the third tool, “PhyloFun (v2)”, was developed to annotate large sets of query proteins with terms from the Gene Ontology. This work focused on extending the “belief propagation” (Pearl 1988) algorithm implemented in the “Sifter” annotation tool (Engelhardt, Jordan, Muratore, and Brenner 2005; Engelhardt, Jordan, Srouji, and Brenner 2011). Jöcker had developed a phylogenetic pipeline generating the input that was fed into the Sifter program. This pipeline executes stringent sequence similarity searches in a database of selected reference proteins and reconstructs a phylogenetic tree from the found orthologs and inparalogs. This tree is then used by the Sifter program and interpreted as a “Bayesian network”, into which the GO term annotations of the homologous reference proteins are fed as “diagnostic evidence” (Pearl 1988). The current strength of belief, the probability that this evidence is also the true state of ancestral tree nodes, is then spread recursively through the tree towards its root, and afterwards back towards the tips. These, of course, include the query protein, which in the final step is annotated with those GO terms that have the strongest belief. Note that during this recursive belief propagation a given GO term’s annotation probability depends on both the length of the currently processed branch and the type of evolutionary event that took place. This event can be either “speciation” or “duplication”, such that function mutation becomes more likely on longer branches and particularly after “duplication” events. A particular goal in extending this algorithm was to base the annotation probability of a given GO term not on a preconceived model of function evolution among homologous proteins, as implemented in Sifter, but instead to compute these annotation probabilities from empirical measurements.
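The root-ward half of this belief propagation can be sketched as a two-state sum-product (pruning) recursion, in which the probability that a GO term is gained or lost grows with branch length. The symmetric two-state transition model below is an assumption for illustration, not Sifter's or PhyloFun's actual model, and the tip-ward pass is omitted:

```python
import math

def transition(t, mu=1.0):
    """Symmetric two-state transition matrix for branch length t:
    the GO term flips (gain/loss) with probability
    p = (1 - exp(-2*mu*t)) / 2. Assumed model, for illustration only."""
    p = (1 - math.exp(-2 * mu * t)) / 2
    return [[1 - p, p], [p, 1 - p]]

def upward(node, tree, evidence, mu=1.0):
    """Root-ward message pass: returns the likelihood of the subtree
    below `node` for states 0 (term absent) and 1 (term present).
    `tree` maps internal nodes to lists of (child, branch_length);
    `evidence` maps leaves to P(term present) (diagnostic evidence)."""
    if node not in tree:                  # leaf: clamp to evidence
        p = evidence.get(node, 0.5)
        return [1 - p, p]
    like = [1.0, 1.0]
    for child, t in tree[node]:
        msg = upward(child, tree, evidence, mu)
        T = transition(t, mu)
        for s in (0, 1):
            like[s] *= sum(T[s][c] * msg[c] for c in (0, 1))
    return like
```

The downward pass would combine these root-ward messages into posterior beliefs at the query tip; annotation then means keeping the GO terms with the strongest belief, as described above.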
    To achieve this, calibrations were computed for each GO term separately: reference proteins annotated with a given GO term were investigated such that the probability of function loss could be assessed empirically for decreasing sequence homology among related proteins. A second goal was to overcome errors in the identification of the type of evolutionary events. These errors arose from missing knowledge of the true species trees, which, in version 1 of the PhyloFun pipeline, are compared with the actual protein trees in order to tell “duplication” from “speciation” events (Zmasek and Eddy 2001). As reliable reference species trees are scarce or in many cases unavailable, the part of the algorithm incorporating the type of evolutionary event was discarded. Finally, the third goal postulated for the development of PhyloFun’s version 2 was to enable easy installation, usage, and calibration on the latest available knowledge. This was motivated by observations made during the application of the first version of PhyloFun, in which maintaining the knowledge base was barely feasible. This obstacle was overcome in version 2 of PhyloFun by obtaining the required reference data directly from publicly available databases. The accuracy and performance of the new PhyloFun version 2 was assessed and compared with selected competitive methods. These were chosen based on their widespread usage as well as their applicability to large sets of query proteins without exceeding reasonable time and computational resource requirements. The measurement of each method’s performance was carried out on a “gold standard” of 1000 selected reference proteins, obtained from the UniProt/Swiss-Prot public database (Boeckmann, Bairoch, Apweiler, et al. 2003), all of which had GO term annotations made by expert curators and mostly based on experimental verification.
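The per-term empirical calibration can be sketched as binning annotated homolog pairs by sequence distance and estimating the fraction that lost the term in each bin. The data layout here is assumed for illustration; the actual PhyloFun calibration procedure may differ:

```python
from collections import defaultdict

def calibrate_loss(pairs, bin_width=0.1):
    """Empirical calibration sketch for one GO term: `pairs` is a list
    of (sequence_distance, term_retained) observations for homolog
    pairs where one member carries the term. Returns, per distance bin,
    the observed probability that the term was lost."""
    bins = defaultdict(lambda: [0, 0])     # bin index -> [losses, total]
    for dist, retained in pairs:
        b = int(dist / bin_width)
        bins[b][0] += 0 if retained else 1
        bins[b][1] += 1
    return {round(b * bin_width, 10): losses / total
            for b, (losses, total) in sorted(bins.items())}
```

Replacing Sifter's preconceived evolution model with curves estimated this way, per GO term, is exactly the shift from assumed to measured function-loss probabilities described above.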
    Subsequently the performance assessment was executed with a slightly modified version of the “Critical Assessment of Function Annotation” (CAFA) experiment (Radivojac, Clark, Oron, et al. 2013). CAFA compares the performance of different protein function annotation tools on a worldwide scale using a provided set of reference proteins. Here, the predictions the competitors deliver are evaluated using the already introduced “F-measure”. Our performance evaluation showed that PhyloFun outperformed all of its competitors. Its use is further recommended by the highly accurate phylogenetic trees the pipeline computes for each query and its homologous reference proteins. In conclusion, three new tools addressing important problems in the computational prediction of protein function were developed and, in two cases, their performance assessed. Here, both AHRD and PhyloFun (v2) outperformed their competitors. Further arguments for the usage of all three tools are that they are easy to install and use and have reasonable resource demands. Because of these results, publications on AHRD and PhyloFun (v2) are in preparation, and AHRD is already being applied by researchers worldwide.
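The word-level F-measure used in these evaluations can be written down directly: meaningful words shared by the reference and the assigned description count as true positives, and precision and recall are combined by their harmonic mean. A sketch, with a minimal stopword set standing in for the real meaningful-word filter:

```python
def description_f1(reference, predicted, stop=frozenset({"protein", "putative"})):
    """Word-level F-measure for a predicted protein description:
    true positives are meaningful words shared by reference and
    prediction; F1 = 2 * precision * recall / (precision + recall)."""
    ref = {w for w in reference.lower().split() if w not in stop}
    pred = {w for w in predicted.lower().split() if w not in stop}
    if not ref or not pred:
        return 0.0
    tp = len(ref & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Scoring on word sets rather than exact string matches is what makes the metric usable for free-text descriptions, where two correct annotations rarely agree verbatim.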