64,327 research outputs found
Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model
Recently exciting progress has been made on protein contact prediction, but
the predicted contacts for proteins without many sequence homologs is still of
low quality and not very useful for de novo structure prediction. This paper
presents a new deep learning method that predicts contacts by integrating both
evolutionary coupling (EC) and sequence conservation information through an
ultra-deep neural network formed by two deep residual networks. This deep
neural network allows us to model very complex sequence-contact relationship as
well as long-range inter-contact correlation. Our method greatly outperforms
existing contact prediction methods and leads to much more accurate
contact-assisted protein folding. Tested on three datasets of 579 proteins, the
average top L long-range prediction accuracy obtained our method, the
representative EC method CCMpred and the CASP11 winner MetaPSICOV is 0.47, 0.21
and 0.30, respectively; the average top L/10 long-range accuracy of our method,
CCMpred and MetaPSICOV is 0.77, 0.47 and 0.59, respectively. Ab initio folding
using our predicted contacts as restraints can yield correct folds (i.e.,
TMscore>0.6) for 203 test proteins, while that using MetaPSICOV- and
CCMpred-predicted contacts can do so for only 79 and 62 proteins, respectively.
Further, our contact-assisted models have much better quality than
template-based models. Using our predicted contacts as restraints, we can (ab
initio) fold 208 of the 398 membrane proteins with TMscore>0.5. By contrast,
when the training proteins of our method are used as templates, homology
modeling can only do so for 10 of them. One interesting finding is that even if
we do not train our prediction models with any membrane proteins, our method
works very well on membrane protein prediction. Finally, in recent blind CAMEO
benchmark our method successfully folded 5 test proteins with a novel fold
Telomere length de novo assembly of all 7 chromosomes and mitogenome sequencing of the model entomopathogenic fungus, Metarhizium brunneum, by means of a novel assembly pipeline
BackgroundMore accurate and complete reference genomes have improved understanding of gene function, biology, and evolutionary mechanisms. Hybrid genome assembly approaches leverage benefits of both long, relatively error-prone reads from third-generation sequencing technologies and short, accurate reads from second-generation sequencing technologies, to produce more accurate and contiguous de novo genome assemblies in comparison to using either technology independently. In this study, we present a novel hybrid assembly pipeline that allowed for both mitogenome de novo assembly and telomere length de novo assembly of all 7 chromosomes of the model entomopathogenic fungus, Metarhizium brunneum.ResultsThe improved assembly allowed for better ab initio gene prediction and a more BUSCO complete proteome set has been generated in comparison to the eight current NCBI reference Metarhizium spp. genomes. Remarkably, we note that including the mitogenome in ab initio gene prediction training improved overall gene prediction. The assembly was further validated by comparing contig assembly agreement across various assemblers, assessing the assembly performance of each tool. Genomic synteny and orthologous protein clusters were compared between Metarhizium brunneum and three other Hypocreales species with complete genomes, identifying core proteins, and listing orthologous protein clusters shared uniquely between the two entomopathogenic fungal species, so as to further facilitate the understanding of molecular mechanisms underpinning fungal-insect pathogenesis.ConclusionsThe novel assembly pipeline may be used for other haploid fungal species, facilitating the need to produce high-quality reference fungal genomes, leading to better understanding of fungal genomic evolution, chromosome structuring and gene regulation
A max-margin model for efficient simultaneous alignment and folding of RNA sequences
Motivation: The need for accurate and efficient tools for computational RNA structure analysis has become increasingly apparent over the last several years: RNA folding algorithms underlie numerous applications in bioinformatics, ranging from microarray probe selection to de novo non-coding RNA gene prediction
Using ESTs to improve the accuracy of de novo gene prediction
BACKGROUND: ESTs are a tremendous resource for determining the exon-intron structures of genes, but even extensive EST sequencing tends to leave many exons and genes untouched. Gene prediction systems based exclusively on EST alignments miss these exons and genes, leading to poor sensitivity. De novo gene prediction systems, which ignore ESTs in favor of genomic sequence, can predict such "untouched" exons, but they are less accurate when predicting exons to which ESTs align. TWINSCAN is the most accurate de novo gene finder available for nematodes and N-SCAN is the most accurate for mammals, as measured by exact CDS gene prediction and exact exon prediction. RESULTS: TWINSCAN_EST is a new system that successfully combines EST alignments with TWINSCAN. On the whole C. elegans genome TWINSCAN_EST shows 14% improvement in sensitivity and 13% in specificity in predicting exact gene structures compared to TWINSCAN without EST alignments. Not only are the structures revealed by EST alignments predicted correctly, but these also constrain the predictions without alignments, improving their accuracy. For the human genome, we used the same approach with N-SCAN, creating N-SCAN_EST. On the whole genome, N-SCAN_EST produced a 6% improvement in sensitivity and 1% in specificity of exact gene structure predictions compared to N-SCAN. CONCLUSION: TWINSCAN_EST and N-SCAN_EST are more accurate than TWINSCAN and N-SCAN, while retaining their ability to discover novel genes to which no ESTs align. Thus, we recommend using the EST versions of these programs to annotate any genome for which EST information is available. TWINSCAN_EST and N-SCAN_EST are part of the TWINSCAN open source software package
Comparative analysis of RNA secondary structure accuracy on predicted RNA 3D models
RNA structure is conformationally dynamic, and accurate all-atom tertiary (3D) structure modeling of RNA remains challenging with the prevailing tools. Secondary structure (2D) information is the standard prerequisite for most RNA 3D modeling. Despite several 2D and 3D structure prediction tools proposed in recent years, one of the challenges is to choose the best combination for accurate RNA 3D structure prediction. Here, we benchmarked seven small RNA PDB structures (40 to 90 nucleotides) with different topologies to understand the effects of different 2D structure predictions on the accuracy of 3D modeling. The current study explores the blind challenge of 2D to 3D conversions and highlights the performances of de novo RNA 3D modeling from their predicted 2D structure constraints. Our results show that conformational sampling-based methods such as SimRNA and IsRNA1 depend less on 2D accuracy, whereas motif-based methods account for 2D evidence. Our observations illustrate the disparities in available 3D and 2D prediction methods and may further offer insights into developing topology-specific or family-specific RNA structure prediction pipelines
Prediction of motif-mediated viral mimicry through the integration of host-pathogen interactions.
One of the mechanisms viruses use in hijacking host cellular machinery is mimicking Short Linear Motifs (SLiMs) in host proteins to maintain their life cycle inside host cells. In the face of the escalating volume of virus-host protein-protein interactions (vhPPIs) documented in databases; the accurate prediction of molecular mimicry remains a formidable challenge due to the inherent degeneracy of SLiMs. Consequently, there is a pressing need for computational methodologies to predict new instances of viral mimicry. Our present study introduces a DMI-de-novo pipeline, revealing that vhPPIs catalogued in the VirHostNet3.0 database effectively capture domain-motif interactions (DMIs). Notably, both affinity purification coupled mass spectrometry and yeast two-hybrid assays emerged as good approaches for delineating DMIs. Furthermore, we have identified new vhPPIs mediated by SLiMs across different viruses. Importantly, the de-novo prediction strategy facilitated the recognition of several potential mimicry candidates implicated in the subversion of host cellular proteins. The insights gleaned from this research not only enhance our comprehension of the mechanisms by which viruses co-opt host cellular machinery but also pave the way for the development of novel therapeutic interventions
Recommended from our members
Kevlar: A Mapping-Free Framework for Accurate Discovery of De Novo Variants.
De novo genetic variants are an important source of causative variation in complex genetic disorders. Many methods for variant discovery rely on mapping reads to a reference genome, detecting numerous inherited variants irrelevant to the phenotype of interest. To distinguish between inherited and de novo variation, sequencing of families (parents and siblings) is commonly pursued. However, standard mapping-based approaches tend to have a high false-discovery rate for de novo variant prediction. Kevlar is a mapping-free method for de novo variant discovery, based on direct comparison of sequences between related individuals. Kevlar identifies high-abundance k-mers unique to the individual of interest. Reads containing these k-mers are partitioned into disjoint sets by shared k-mer content for variant calling, and preliminary variant predictions are sorted using a probabilistic score. We evaluated Kevlar on simulated and real datasets, demonstrating its ability to detect both de novo single-nucleotide variants and indels with high accuracy
Chronic copper gBAM for fish : investigating possibilities and limitations of a generalised bioavailability model (gBAM) for predicting chronic copper toxicity to freshwater fish : final report
This study developed a fully parameterised de novo chronic Cu generalised bioavailability model (gBAM) for juvenile rainbow trout that incorporates toxicity modifying effects of pH, dissolved organic carbon (DOC), Ca, Mg, and Na. This gBAM genreally performs better than a previously developed biotic ligand model (BLM), notably with respect to a more accurate prediction of the effect of pH on chronic copper toxicity. The extrapolation of this model to early life stages of fish is encouraging. The gBAM developed in this study is the first model to integrate all available data on the chronic toxicity of copper to fish in a consistent way
Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints
The inapplicability of amino acid covariation methods to small protein
families has limited their use for structural annotation of whole genomes.
Recently, deep learning has shown promise in allowing accurate residue-residue
contact prediction even for shallow sequence alignments. Here we introduce
DMPfold, which uses deep learning to predict inter-atomic distance bounds, the
main chain hydrogen bond network, and torsion angles, which it uses to build
models in an iterative fashion. DMPfold produces more accurate models than two
popular methods for a test set of CASP12 domains, and works just as well for
transmembrane proteins. Applied to all Pfam domains without known structures,
confident models for 25% of these so-called dark families were produced in
under a week on a small 200 core cluster. DMPfold provides models for 16% of
human proteome UniProt entries without structures, generates accurate models
with fewer than 100 sequences in some cases, and is freely available.Comment: JGG and SMK contributed equally to the wor
SLIQ: Simple Linear Inequalities for Efficient Contig Scaffolding
Scaffolding is an important subproblem in "de novo" genome assembly in which
mate pair data are used to construct a linear sequence of contigs separated by
gaps. Here we present SLIQ, a set of simple linear inequalities derived from
the geometry of contigs on the line that can be used to predict the relative
positions and orientations of contigs from individual mate pair reads and thus
produce a contig digraph. The SLIQ inequalities can also filter out unreliable
mate pairs and can be used as a preprocessing step for any scaffolding
algorithm. We tested the SLIQ inequalities on five real data sets ranging in
complexity from simple bacterial genomes to complex mammalian genomes and
compared the results to the majority voting procedure used by many other
scaffolding algorithms. SLIQ predicted the relative positions and orientations
of the contigs with high accuracy in all cases and gave more accurate position
predictions than majority voting for complex genomes, in particular the human
genome. Finally, we present a simple scaffolding algorithm that produces linear
scaffolds given a contig digraph. We show that our algorithm is very efficient
compared to other scaffolding algorithms while maintaining high accuracy in
predicting both contig positions and orientations for real data sets.Comment: 16 pages, 6 figures, 7 table
- …