12 research outputs found

    Parametric inference of recombination in HIV genomes

    Recombination is an important event in the evolution of HIV. It affects the global spread of the pandemic as well as evolutionary escape from host immune response and from drug therapy within single patients. Comprehensive computational methods are needed for detecting recombinant sequences in large databases, and for inferring the parental sequences. We present a hidden Markov model to annotate a query sequence as a recombinant of a given set of aligned sequences. Parametric inference is used to determine all optimal annotations for all parameters of the model. We show that the inferred annotations recover most features of established hand-curated annotations. Thus, parametric analysis of the hidden Markov model is feasible for HIV full-length genomes, and it improves the detection and annotation of recombinant forms. All computational results, reference alignments, and C++ source code are available at http://bio.math.berkeley.edu/recombination/. Comment: 20 pages, 5 figures

    On Computable Protein Functions

    Proteins are biological machines that perform the majority of functions necessary for life. Nature has evolved many different proteins, each of which performs a subset of an organism’s functional repertoire. One aim of biology is to solve the sparse, high-dimensional problem of annotating all proteins with their true functions. Experimental characterisation remains the gold standard for assigning function, but is a major bottleneck due to resource scarcity. In this thesis, we develop a variety of computational methods to predict protein function, reduce the functional search space for proteins, and guide the design of experimental studies. Our methods take two distinct approaches: protein-centric methods that predict the functions of a given protein, and function-centric methods that predict which proteins perform a given function. We applied our methods to help solve a number of open problems in biology. First, we identified new proteins involved in the progression of Alzheimer’s disease using proteomics data of brains from a fly model of the disease. Second, we predicted novel plastic hydrolase enzymes in a large data set of 1.1 billion protein sequences from metagenomes. Finally, we optimised a neural network method that extracts a small number of informative features from protein networks, which we used to predict functions of fission yeast proteins.

    Analysis of recombination in molecular sequence data

    We present Recco, a new and fast method for analyzing recombination in a multiple alignment. Recco is based on a dynamic program that explains one sequence in the alignment with the other sequences using mutation and recombination. The dynamic program allows for an intuitive visualization of the optimal solution and also introduces a parameter α controlling the number of recombinations in the solution. Recco performs a parametric analysis with respect to α and orders all Pareto-optimal solutions by increasing number of recombinations. α is also directly related to the Savings value, a quantitative and intuitive measure of the preference for recombination in the solution. The Savings value and the solutions have a simple interpretation in terms of the ancestry of the sequences in the alignment, so the output of the method is usually easy to understand. The distribution of the Savings value for non-recombining alignments is estimated by processing column permutations of the alignment, and p-values are provided for recombination in the alignment, in a sequence, and at a breakpoint position. Recco also uses the p-values to suggest a single solution, or recombinant structure, for the explained sequence. Recco is validated on a large set of simulated alignments, where its recombination detection performance is superior to that of all current methods. The analysis of real alignments confirmed that Recco is among the best methods for recombination analysis and further supported that its results are often easier to understand than those of other methods.
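    The kind of dynamic program described in this abstract can be illustrated with a minimal sketch. It is not Recco's implementation: the function name `mosaic_cost`, the unit mismatch cost, and the flat per-switch penalty `alpha` are assumptions chosen for illustration, and the abstract does not specify Recco's actual cost model. The sketch explains a query as a mosaic of reference sequences, charging 1 per mismatch and `alpha` per switch between references.

```python
def mosaic_cost(query, references, alpha):
    """Explain `query` as a mosaic of equally long `references`.

    Hypothetical cost model (an assumption, not Recco's):
    cost = number of mismatches + alpha * number of reference switches.
    Returns (cost, path), where path[i] is the reference copied at column i.
    """
    n, k = len(query), len(references)
    if n == 0:
        return 0.0, []
    # cost[j]: cheapest explanation of columns 0..i that copies column i from reference j
    cost = [float(references[j][0] != query[0]) for j in range(k)]
    back = []  # one list of back-pointers per column i >= 1
    for i in range(1, n):
        best_prev = min(range(k), key=cost.__getitem__)
        new_cost, pointers = [], []
        for j in range(k):
            stay = cost[j]                    # keep copying from reference j
            switch = cost[best_prev] + alpha  # recombine from the cheapest reference
            if stay <= switch:
                prev, base = j, stay
            else:
                prev, base = best_prev, switch
            new_cost.append(base + (references[j][i] != query[i]))
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)
    # Trace the optimal path back from the cheapest final state.
    end = min(range(k), key=cost.__getitem__)
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return cost[end], path


if __name__ == "__main__":
    refs = ["AAAAAAAA", "CCCCCCCC"]
    for alpha in (1, 10):
        print(alpha, mosaic_cost("AAAACCCC", refs, alpha))
```

    In this toy setting, a small `alpha` makes a breakpoint cheaper than the mismatches it avoids, while a large `alpha` suppresses recombination. Sweeping `alpha` and collecting the distinct optimal solutions is the idea behind a parametric analysis.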

    Fast and numerically stable parametric alignment of biosequences

    Parametric alignment is an approach towards dealing with the uncertainties that are inherent in the cost parameters involved in the alignment of biosequences. In parametric alignment, a selected set of cost parameters is kept variable. The resulting space of optimal alignments is then analyzed with respect to the dependencies on these cost parameters. The number n of variable parameters is called the dimension of the parametric alignment problem. If the cost function is linear in the variable parameters, then the alignment space has the structure of a subdivision of the n-dimensional Euclidean space into a collection of convex polyhedra. Each polyhedron represents a region of parameter settings that lead to the same set of optimal alignments. Efficient algorithms have been put forth for computing the subdivisions arising from one- and two-dimensional parametric alignment. General algorithmic schemes have been developed for n > 2, but these algorithms are inefficient in practice. To our knowledge, all algorithms devised so far are plagued by numerical instabilities. We propose an algorithm for two-dimensional parametric alignment which is both faster than existing algorithms and numerically stable. We report on experimental results with this algorithm on biological data and comment on the potential role of parametric alignment in the analysis of biosequences.
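    The polyhedral structure stated in this abstract follows directly from linearity. A short derivation, in notation that does not appear in the abstract: the symbols $\theta$ (the vector of variable parameters), $c_i(A)$ (the feature counts of an alignment $A$), and $R(A)$ (the region where $A$ is optimal) are assumptions introduced here for illustration. The block assumes `amsmath` for `\operatorname` and `\text`.

```latex
\[
  \operatorname{cost}_\theta(A) = c_0(A) + \sum_{i=1}^{n} \theta_i \, c_i(A)
\]
\[
  R(A) = \bigl\{ \theta \in \mathbb{R}^n :
         \operatorname{cost}_\theta(A) \le \operatorname{cost}_\theta(A')
         \text{ for all alignments } A' \bigr\}
       = \bigcap_{A'} \Bigl\{ \theta :
         \sum_{i=1}^{n} \theta_i \bigl( c_i(A) - c_i(A') \bigr)
         \le c_0(A') - c_0(A) \Bigr\}
\]
```

    Each set in the intersection is a closed half-space, and two sequences admit only finitely many alignments, so $R(A)$ is a convex polyhedron. The regions of all optimal alignments together cover the parameter space.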

    WTEC Panel Report on International Assessment of Research and Development in Simulation-Based Engineering and Science
