20,802 research outputs found
The Parallelism Motifs of Genomic Data Analysis
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some of these genomic data analysis problems require large scale computational
platforms to meet both the memory and computational requirements. These
applications differ from scientific simulations that dominate the workload on
high end parallel systems today and place different requirements on programming
support, software libraries, and parallel architectural design. For example,
they involve irregular communication patterns such as asynchronous updates to
shared data structures. We consider several problems in high performance
genomics analysis, including alignment, profiling, clustering, and assembly for
both single genomes and metagenomes. We identify some of the common
computational patterns or motifs that help inform parallelization strategies
and compare our motifs to some of the established lists, arguing that at least
two key patterns, sorting and hashing, are missing
Comprehensive structural classification of ligand binding motifs in proteins
Comprehensive knowledge of protein-ligand interactions should provide a
useful basis for annotating protein functions, studying protein evolution,
engineering enzymatic activity, and designing drugs. To investigate the
diversity and universality of ligand binding sites in protein structures, we
conducted the all-against-all atomic-level structural comparison of over
180,000 ligand binding sites found in all the known structures in the Protein
Data Bank by using a recently developed database search and alignment
algorithm. By applying a hybrid top-down-bottom-up clustering analysis to the
comparison results, we determined approximately 3000 well-defined structural
motifs of ligand binding sites. Apart from a handful of exceptions, most
structural motifs were found to be confined within single families or
superfamilies, and to be associated with particular ligands. Furthermore, we
analyzed the components of the similarity network and enumerated more than 4000
pairs of ligand binding sites that were shared across different protein folds.Comment: 13 pages, 8 figure
Protein Secondary Structure Prediction Using Cascaded Convolutional and Recurrent Neural Networks
Protein secondary structure prediction is an important problem in
bioinformatics. Inspired by the recent successes of deep neural networks, in
this paper, we propose an end-to-end deep network that predicts protein
secondary structures from integrated local and global contextual features. Our
deep architecture leverages convolutional neural networks with different kernel
sizes to extract multiscale local contextual features. In addition, considering
long-range dependencies existing in amino acid sequences, we set up a
bidirectional neural network consisting of gated recurrent unit to capture
global contextual features. Furthermore, multi-task learning is utilized to
predict secondary structure labels and amino-acid solvent accessibility
simultaneously. Our proposed deep network demonstrates its effectiveness by
achieving state-of-the-art performance, i.e., 69.7% Q8 accuracy on the public
benchmark CB513, 76.9% Q8 accuracy on CASP10 and 73.1% Q8 accuracy on CASP11.
Our model and results are publicly available.Comment: 8 pages, 3 figures, Accepted by International Joint Conferences on
Artificial Intelligence (IJCAI
Coevolved mutations reveal distinct architectures for two core proteins in the bacterial flagellar motor
Switching of bacterial flagellar rotation is caused by large domain movements of the FliG protein triggered by binding of the signal protein CheY to FliM. FliG and FliM form adjacent multi-subunit arrays within the basal body C-ring. The movements alter the interaction of the FliG C-terminal (FliGC) "torque" helix with the stator complexes. Atomic models based on the Salmonella entrovar C-ring electron microscopy reconstruction have implications for switching, but lack consensus on the relative locations of the FliG armadillo (ARM) domains (amino-terminal (FliGN), middle (FliGM) and FliGC) as well as changes during chemotaxis. The generality of the Salmonella model is challenged by the variation in motor morphology and response between species. We studied coevolved residue mutations to determine the unifying elements of switch architecture. Residue interactions, measured by their coevolution, were formalized as a network, guided by structural data. Our measurements reveal a common design with dedicated switch and motor modules. The FliM middle domain (FliMM) has extensive connectivity most simply explained by conserved intra and inter-subunit contacts. In contrast, FliG has patchy, complex architecture. Conserved structural motifs form interacting nodes in the coevolution network that wire FliMM to the FliGC C-terminal, four-helix motor module (C3-6). FliG C3-6 coevolution is organized around the torque helix, differently from other ARM domains. The nodes form separated, surface-proximal patches that are targeted by deleterious mutations as in other allosteric systems. The dominant node is formed by the EHPQ motif at the FliMMFliGM contact interface and adjacent helix residues at a central location within FliGM. The node interacts with nodes in the N-terminal FliGc α-helix triad (ARM-C) and FliGN. ARM-C, separated from C3-6 by the MFVF motif, has poor intra-network connectivity consistent with its variable orientation revealed by structural data. ARM-C could be the convertor element that provides mechanistic and species diversity.JK was supported by Medical Research Council grant U117581331. SK was supported by seed funds from Lahore University of Managment Sciences (LUMS) and the Molecular Biology Consortium
Kernel methods in genomics and computational biology
Support vector machines and kernel methods are increasingly popular in
genomics and computational biology, due to their good performance in real-world
applications and strong modularity that makes them suitable to a wide range of
problems, from the classification of tumors to the automatic annotation of
proteins. Their ability to work in high dimension, to process non-vectorial
data, and the natural framework they provide to integrate heterogeneous data
are particularly relevant to various problems arising in computational biology.
In this chapter we survey some of the most prominent applications published so
far, highlighting the particular developments in kernel methods triggered by
problems in biology, and mention a few promising research directions likely to
expand in the future
- …