2,700 research outputs found
ProtNN: Fast and Accurate Nearest Neighbor Protein Function Prediction based on Graph Embedding in Structural and Topological Space
Studying the function of proteins is important for understanding the
molecular mechanisms of life. The number of publicly available protein
structures has grown extremely large. Still, the determination of
the function of a protein structure remains a difficult, costly, and time
consuming task. The difficulties are often due to the essential role of spatial
and topological structures in the determination of protein functions in living
cells. In this paper, we propose ProtNN, a novel approach for protein function
prediction. Given an unannotated protein structure and a set of annotated
proteins, ProtNN finds the nearest neighbor annotated structures based on
protein-graph pairwise similarities. Given a query protein, ProtNN finds the
nearest neighbor reference proteins based on a graph representation model and a
pairwise similarity between vector embeddings of both query and reference
protein-graphs in structural and topological spaces. ProtNN assigns to the
query protein the function with the highest number of votes across the set of k
nearest neighbor reference proteins, where k is a user-defined parameter.
Experimental evaluation demonstrates that ProtNN is able to accurately classify
several datasets in an extremely fast runtime compared to state-of-the-art
approaches. We further show that ProtNN is able to scale up to a whole PDB
dataset in a single-process mode with no parallelization, with a
thousands-fold runtime gain compared to state-of-the-art
approaches.
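The voting step described in the ProtNN abstract above can be illustrated with a minimal sketch. It assumes each protein graph has already been turned into a fixed-length vector; the Euclidean distance, the function name `knn_predict`, and the toy labels are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def knn_predict(query, references, k=3):
    """Return the majority label among the k references nearest to `query`.

    `references` is a list of (embedding, label) pairs. The embeddings are
    assumed to be precomputed vectors; Euclidean distance is an assumption.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Rank all reference proteins by distance to the query embedding.
    ranked = sorted(references, key=lambda ref: math.dist(query, ref[0]))
    # Count the labels of the k closest references and take the top vote.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D embeddings with made-up function labels.
toy_refs = [
    ((0.0, 0.0), "hydrolase"),
    ((0.0, 1.0), "hydrolase"),
    ((5.0, 5.0), "transferase"),
    ((5.0, 6.0), "transferase"),
]
print(knn_predict((0.3, 0.2), toy_refs, k=3))
```

Sorting every reference is the simplest choice for a sketch; a spatial index would be a natural replacement for large reference sets.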
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.
Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
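One way to see how a cover by hyperspheres can speed up similarity search is a two-stage range query: first compare the query against sphere centers, then search only inside the spheres that could contain a hit. The sketch below is a toy illustration under stated assumptions (a true metric, a greedy cover, hypothetical function names `build_cover` and `range_search`); it is not the code of the tools named in the abstract above.

```python
import math


def build_cover(points, cover_radius):
    """Greedily group points into spheres of radius `cover_radius`.

    Each point joins the first existing center within `cover_radius`,
    otherwise it becomes a new center. Returns a list of (center, members).
    """
    clusters = []
    for p in points:
        for center, members in clusters:
            if math.dist(p, center) <= cover_radius:
                members.append(p)
                break
        else:
            clusters.append((p, [p]))
    return clusters


def range_search(query, clusters, cover_radius, r):
    """Return all points within distance r of `query`.

    Coarse stage: by the triangle inequality, a point within r of the query
    must belong to a sphere whose center is within r + cover_radius.
    Fine stage: scan only the members of those candidate spheres.
    """
    hits = []
    for center, members in clusters:
        if math.dist(query, center) <= r + cover_radius:
            hits.extend(m for m in members if math.dist(query, m) <= r)
    return hits


# Toy data: two tight groups and one outlier.
data = [(0.0, 0.0), (0.5, 0.0), (1.0, 1.0), (10.0, 10.0), (10.5, 10.0), (20.0, 0.0)]
cover = build_cover(data, cover_radius=1.0)
print(range_search((0.4, 0.1), cover, cover_radius=1.0, r=1.0))
```

The coarse stage costs one distance computation per sphere, so the fewer spheres needed to cover the data, the less work the search does before the fine stage.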
Codon Bias Patterns of 's Interacting Proteins
Synonymous codons, i.e., DNA nucleotide triplets coding for the same amino
acid, are used differently across the variety of living organisms. The
biological meaning of this phenomenon, known as codon usage bias, is still
controversial. In order to shed light on this point, we propose a new codon
bias index, , that is based on the competition between cognate and
near-cognate tRNAs during translation, without being tuned to the usage bias of
highly expressed genes. We perform a genome-wide evaluation of codon bias for
, comparing with other widely used indices: , , and
. We show that and capture similar information by being
positively correlated with gene conservation, measured by ERI, and
essentiality, whereas, and appear to be less sensitive to
evolutionary-functional parameters. Notably, the rate of variation of and
with ERI allows one to obtain sets of genes that consistently belong to
specific clusters of orthologous genes (COGs). We also investigate the
correlation of codon bias at the genomic level with the network features of
protein-protein interactions in . We find that the most densely
connected communities of the network share a similar level of codon bias (as
measured by and ). Conversely, a small difference in codon bias
between two genes is, statistically, a prerequisite for the corresponding
proteins to interact. Importantly, among all codon bias indices, turns
out to have the most coherent distribution over the communities of the
interactome, pointing to the significance of competition among cognate and
near-cognate tRNAs for explaining codon usage adaptation.
Scoring Protein Relationships in Functional Interaction Networks Predicted from Sequence Data
The abundance of diverse biological data from various sources constitutes a rich source of knowledge with the power to advance our understanding of organisms. Exploiting it requires computational methods that integrate these data effectively and elucidate local and genome-wide functional connections between protein pairs, thus enabling functional inferences for uncharacterized proteins. These biological data are primarily in the form of sequences, which determine functions, although functional properties of a protein can often be predicted from just the domains it contains. Protein sequences and domains can therefore be used to predict pairwise functional relationships between proteins, contributing to function prediction for uncharacterized proteins and ensuring that knowledge is gained from sequencing efforts. In this work, we introduce information-theoretic approaches to score protein-protein functional interaction pairs predicted from protein sequence similarity and conserved protein signature matches. The proposed schemes are effective for data-driven scoring of connections between protein pairs. We applied these schemes to the Mycobacterium tuberculosis proteome to produce a homology-based functional network of the organism with high confidence and coverage. We use the network to predict functions of uncharacterized proteins.