Pair HMM based gap statistics for re-evaluation of indels in alignments with affine gap penalties: Extended Version
Although computationally aligning sequences is a crucial step in the vast
majority of comparative genomics studies, our understanding of alignment biases
still needs to be improved. To infer true structural or homologous regions,
computational alignments need further evaluation. It has been shown that the
accuracy of aligned positions can drop substantially, in particular around gaps.
Here we focus on re-evaluation of score-based alignments with affine gap
penalty costs. We exploit their relationships with pair hidden Markov models
and develop efficient algorithms to identify gaps that are
significant in terms of length and multiplicity. We evaluate our statistics
with respect to the well-established structural alignments from SABmark and
find that indel reliability substantially increases with their significance, in
particular in worst-case twilight zone alignments. This shows that our
statistics can reliably complement other methods, which mostly focus on the
reliability of match positions.
Comment: 17 pages, 7 figures
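As a rough illustration of the quantities the abstract above refers to (gap length and gap multiplicity under affine gap costs), the following sketch extracts the indel runs from a pairwise alignment and scores each one. It is not the paper's pair-HMM statistic. The function names, the "-" gap symbol, the default penalty values and the convention cost = open + extend * (length - 1) are all assumptions made for this example.

```python
from itertools import groupby


def gap_runs(aligned_a, aligned_b, gap="-"):
    """Return (side, length) for each maximal gap run in a pairwise alignment.

    side is "A" when the gap characters sit in aligned_a and "B" when they
    sit in aligned_b. Both rows must have the same length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    states = []
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            raise ValueError("column with gaps in both rows")
        states.append("A" if x == gap else "B" if y == gap else "M")
    # Collapse consecutive identical states; keep only the gap runs.
    return [(s, len(list(g))) for s, g in groupby(states) if s != "M"]


def affine_gap_cost(length, gap_open=10.0, gap_extend=1.0):
    """Affine cost of one gap (assumed convention: the opening cost covers
    the first gapped column, each further column costs gap_extend)."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    runs = gap_runs("AC--GT-A", "ACTTG-CA")
    print(runs)                                   # gap runs with their lengths
    print(len(runs))                              # multiplicity (number of gaps)
    print([affine_gap_cost(n) for _, n in runs])  # per-gap affine cost
```

The number of runs corresponds to the multiplicity and each run length to the gap length; a significance test such as the one described in the abstract would be applied on top of these raw counts.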
Robust Subgraph Generation Improves Abstract Meaning Representation Parsing
The Abstract Meaning Representation (AMR) is a representation for open-domain
rich semantics, with potential use in fields like event extraction and machine
translation. Node generation, typically done using a simple dictionary lookup,
is currently an important limiting factor in AMR parsing. We propose a small
set of actions that derive AMR subgraphs by transformations on spans of text,
which allows for more robust learning of this stage. Our set of construction
actions generalizes better than the previous approach, and can be learned with a
simple classifier. We improve on the previous state-of-the-art result for AMR
parsing, boosting end-to-end performance by 3 F1 points on both the LDC2013E117 and
LDC2014T12 datasets.
Comment: To appear in ACL 201
Detection of recombination in DNA multiple alignments with hidden markov models
Conventional phylogenetic tree estimation methods assume that all sites in a DNA multiple alignment have the same evolutionary history. This assumption is violated in data sets from certain bacteria and viruses due to recombination, a process that leads to the creation of mosaic sequences from different strains and, if undetected, causes systematic errors in phylogenetic tree estimation. In the current work, a hidden Markov model (HMM) is employed to detect recombination events in multiple alignments of DNA sequences. The emission probabilities in a given state are determined by the branching order (topology) and the branch lengths of the respective phylogenetic tree, while the transition probabilities depend on the global recombination probability. The present study improves on an earlier heuristic parameter optimization scheme and shows how the branch lengths and the recombination probability can be optimized in a maximum likelihood sense by applying the expectation maximization (EM) algorithm. The novel algorithm is tested on a synthetic benchmark problem and is found to clearly outperform the earlier heuristic approach. The paper concludes with an application of this scheme to a DNA sequence alignment of the argF gene from four Neisseria strains, where a likely recombination event is clearly detected.
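The HMM structure described in the abstract above (hidden states are tree topologies, transitions are governed by a single recombination probability) can be sketched with a small Viterbi decoder. This is a simplified illustration, not the paper's EM procedure. The per-site log-likelihoods are taken as given input, and the function name, the uniform initial distribution and the rule that a switch is spread evenly over the other topologies are assumptions made for this example.

```python
import math


def viterbi_topologies(site_loglik, rho):
    """Most probable topology path along an alignment.

    site_loglik[t][k] is log P(column t | topology k), assumed precomputed.
    rho is the assumed probability of changing topology between adjacent
    sites (0 < rho < 1); at least two topologies are required.
    """
    n_states = len(site_loglik[0])
    log_stay = math.log(1.0 - rho)
    log_switch = math.log(rho / (n_states - 1))

    # Uniform prior over topologies at the first site (assumption).
    score = [math.log(1.0 / n_states) + ll for ll in site_loglik[0]]
    back = []
    for column in site_loglik[1:]:
        new_score, pointers = [], []
        for k in range(n_states):
            # Best predecessor state for topology k at this site.
            best_prev = max(
                range(n_states),
                key=lambda j: score[j] + (log_stay if j == k else log_switch),
            )
            trans = log_stay if best_prev == k else log_switch
            new_score.append(score[best_prev] + trans + column[k])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    # Toy input: three sites favour topology 0, then three favour topology 2.
    toy = [[-1.0, -5.0, -5.0]] * 3 + [[-5.0, -5.0, -1.0]] * 3
    print(viterbi_topologies(toy, rho=0.1))
```

A change of state along the decoded path marks a candidate recombination breakpoint. In the setting the abstract describes, the branch lengths and the recombination probability would be estimated by EM rather than fixed by hand as they are here.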
Structure and functional motifs of GCR1, the only plant protein with a GPCR fold?
Whether GPCRs exist in plants is a fundamental biological question. Interest in deorphanizing new G
protein coupled receptors (GPCRs) arises because of their importance in signaling. Within plants, this
is controversial, as genome analysis has identified 56 putative GPCRs, including GCR1, which is
reportedly a remote homologue to class A, B and E GPCRs. Of these, GCR2 is not a GPCR; more
recently it has been proposed that none are, not even GCR1. We have addressed this disparity
between genome analysis and biological evidence through a structural bioinformatics study, involving
fold recognition methods, from which only GCR1 emerges as a strong candidate. To further probe
GCR1, we have developed a novel helix alignment method, which has been benchmarked against the
class A - class B - class F GPCR alignments. In addition, we have presented a mutually consistent
set of alignments of GCR1 homologues to class A, class B and class F GPCRs, and shown that GCR1
is closer to class A and/or class B GPCRs than class A, class B or class F GPCRs are to each other.
To further probe GCR1, we have aligned transmembrane helix 3 of GCR1 to each of the 6 GPCR
classes. Variability comparisons provide additional evidence that GCR1 homologues have the GPCR
fold. From the alignments and a GCR1 comparative model we have identified motifs that are common
to GCR1, class A, B and E GPCRs. We discuss the possibilities that emerge from this controversial
evidence that GCR1 has a GPCR fold.
Human-chimpanzee alignment: Ortholog Exponentials and Paralog Power Laws
Genomic subsequences conserved between closely related species such as human
and chimpanzee exhibit an exponential length distribution, in contrast to the
algebraic length distribution observed for sequences shared between distantly
related genomes. We find that the former exponential can be further decomposed
into an exponential component primarily composed of orthologous sequences, and
a truncated algebraic component primarily composed of paralogous sequences.
Comment: Main text: 31 pages, 13 figures, 1 table; Supplementary materials: 9
pages, 9 figures, 1 table
Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints
The inapplicability of amino acid covariation methods to small protein
families has limited their use for structural annotation of whole genomes.
Recently, deep learning has shown promise in allowing accurate residue-residue
contact prediction even for shallow sequence alignments. Here we introduce
DMPfold, which uses deep learning to predict inter-atomic distance bounds, the
main chain hydrogen bond network, and torsion angles, which it uses to build
models in an iterative fashion. DMPfold produces more accurate models than two
popular methods for a test set of CASP12 domains, and works just as well for
transmembrane proteins. Applied to all Pfam domains without known structures,
confident models for 25% of these so-called dark families were produced in
under a week on a small 200 core cluster. DMPfold provides models for 16% of
human proteome UniProt entries without structures, generates accurate models
with fewer than 100 sequences in some cases, and is freely available.
Comment: JGG and SMK contributed equally to the work
New encouraging developments in contact prediction: Assessment of the CASP11 results
This article provides a report on the state-of-the-art in the prediction of intra-molecular residue-residue contacts in proteins
based on the assessment of the predictions submitted to the CASP11 experiment. The assessment emphasis is placed on the
accuracy in predicting long-range contacts. Twenty-nine groups participated in contact prediction in CASP11. At least eight
of them used the recently developed evolutionary coupling techniques, with the top group (CONSIP2) reaching precision of
27% on target proteins that could not be modeled by homology. This result indicates a breakthrough in the development of
methods based on the correlated mutation approach. Successful prediction of contacts was shown to be practically helpful
in modeling three-dimensional structures; in particular target T0806 was modeled exceedingly well with accuracy not yet
seen for ab initio targets of this size (>250 residues).
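To make the precision figure quoted above concrete, here is a minimal sketch of how precision over long-range contact predictions can be computed. The function name and input layout are hypothetical, and the default threshold of 24 residues of sequence separation for "long-range" is an assumption of this example; the exact evaluation protocol is defined in the assessment itself.

```python
def long_range_precision(predicted, true_contacts, min_separation=24, top_n=None):
    """Fraction of evaluated long-range predictions that are true contacts.

    predicted: iterable of (i, j, score) residue-pair predictions.
    true_contacts: iterable of (i, j) pairs observed in the structure.
    min_separation: minimum |i - j| for a pair to count as long-range
        (assumed threshold for this sketch).
    top_n: if given, only the top_n highest-scoring long-range pairs
        are evaluated.
    """
    truth = {tuple(sorted(pair)) for pair in true_contacts}
    ranked = sorted(
        (p for p in predicted if abs(p[0] - p[1]) >= min_separation),
        key=lambda p: p[2],
        reverse=True,
    )
    if top_n is not None:
        ranked = ranked[:top_n]
    if not ranked:
        return 0.0
    hits = sum(1 for i, j, _ in ranked if tuple(sorted((i, j))) in truth)
    return hits / len(ranked)


if __name__ == "__main__":
    preds = [(1, 30, 0.9), (2, 40, 0.8), (5, 10, 0.95), (3, 50, 0.4)]
    truth = {(1, 30), (3, 50)}
    print(long_range_precision(preds, truth))
    print(long_range_precision(preds, truth, top_n=2))
```

The short-range pair (5, 10) is excluded before ranking, so its high score does not affect the result.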