Application of compression-based distance measures to protein sequence classification: a methodological study
Abstract
Motivation: Distance measures built on the notion of text compression have been used for the comparison and classification of entire genomes and mitochondrial genomes. The present study was undertaken in order to explore their utility in the classification of protein sequences.
Results: We constructed compression-based distance measures (CBMs) using the Lempel-Ziv and the PPMZ compression algorithms and compared their performance with that of the Smith–Waterman algorithm and BLAST, using nearest neighbour or support vector machine classification schemes. The datasets included a subset of the SCOP protein structure database to test distant protein similarities, 3-phosphoglycerate kinase sequences selected from archaean, bacterial and eukaryotic species, as well as low- and high-complexity sequence segments of the human proteome. CBM values show a dependence on the length and the complexity of the sequences compared. In classification tasks, CBMs performed especially well on distantly related proteins, where the performance of a combined measure, constructed from a CBM and a BLAST score, approached or even slightly exceeded that of the Smith–Waterman algorithm and two hidden Markov model-based algorithms.
Contact: [email protected]
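As a rough illustration of how a compression-based distance can drive nearest neighbour classification, the sketch below uses Python's zlib (a DEFLATE compressor from the Lempel-Ziv family) in place of the compressors named in the abstract, together with the widely used normalized compression distance formula. The choice of formula, the function names and the toy sequences are assumptions made for illustration; they are not the measures or data of the study.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (stand-in compressor)."""
    return len(zlib.compress(data, 9))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    Assumed formula: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size and xy is the concatenation.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)

def nearest_neighbour(query: str, labelled: list[tuple[str, str]]) -> str:
    """Return the label of the labelled sequence closest to the query."""
    closest = min(labelled, key=lambda item: ncd(query, item[0]))
    return closest[1]

# Toy strings over the amino acid alphabet (hypothetical, not database entries).
SEQ_A = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV"
SEQ_A_VARIANT = SEQ_A[:35] + "W" + SEQ_A[36:]   # one substitution
SEQ_B = "GWCHPNDYFTEGMWCHPQNDYWTCHGMNPFWYDCHEGTNPWMYFCHDGWNPTEYCHMGWFDNPYTCHEGWM"

if __name__ == "__main__":
    training = [(SEQ_A, "family A"), (SEQ_B, "family B")]
    print(ncd(SEQ_A, SEQ_A_VARIANT), ncd(SEQ_A, SEQ_B))
    print(nearest_neighbour(SEQ_A_VARIANT, training))
```

With inputs this short, the fixed overhead of the compressor accounts for a noticeable share of every compressed size, which is one way a length dependence of the kind mentioned in the abstract can arise.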
Chemical rule-based filtering of MS/MS spectra
Abstract
Motivation: Identification of proteins by mass spectrometry–based proteomics requires automated interpretation of peptide tandem mass spectrometry spectra. The effectiveness of peptide identification can be greatly improved by filtering out extraneous noise peaks before the subsequent database searching steps.
Results: Here we present a novel chemical rule-based filtering algorithm, termed CRF, which makes use of the predictable patterns (rules) of collision-induced peptide fragmentation. The algorithm selects peak pairs that obey the common fragmentation rules within plausible limits of mass tolerance as well as peak intensity and produces spectra that can be subsequently submitted to any search engine. CRF increases the positive predictive value and decreases the number of random matches and thus improves performance by 15–20% in terms of peptide annotation using search engines, such as X!Tandem. Importantly, the algorithm also achieves data compression rates of ∼75%.
Availability: The MATLAB source code and a web server are available at http://hydrax.icgeb.trieste.it/CRFilter/
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online
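The idea of keeping only peak pairs that satisfy a fragmentation rule can be sketched with a single, well-known rule: for singly charged fragments, a complementary b/y ion pair has m/z values that sum to the neutral peptide mass plus two proton masses. The code below is a minimal sketch under that assumption. The abstract does not list the rules CRF applies, so the function name, the default mass tolerance and the relative intensity cutoff are all hypothetical, and the sketch is written in Python rather than the MATLAB of the released source.

```python
PROTON = 1.007276  # proton mass in Da

def filter_complementary_pairs(peaks, neutral_mass, tol=0.5, min_rel_intensity=0.01):
    """Keep peaks that take part in at least one complementary pair.

    peaks: list of (mz, intensity) tuples, singly charged fragments assumed.
    neutral_mass: neutral mass of the precursor peptide in Da.
    tol: hypothetical mass tolerance in Da.
    min_rel_intensity: hypothetical cutoff relative to the most intense peak.
    """
    if not peaks:
        return []
    base = max(inten for _, inten in peaks)
    strong = [(mz, inten) for mz, inten in peaks if inten >= min_rel_intensity * base]
    target = neutral_mass + 2 * PROTON
    keep = set()
    for i in range(len(strong)):
        for j in range(i + 1, len(strong)):
            # Rule check: the two m/z values must sum to the target within tol.
            if abs(strong[i][0] + strong[j][0] - target) <= tol:
                keep.update((i, j))
    return [strong[k] for k in sorted(keep)]

if __name__ == "__main__":
    spectrum = [(300.0, 50.0), (702.0146, 80.0), (450.3, 5.0), (120.1, 100.0)]
    print(filter_complementary_pairs(spectrum, neutral_mass=1000.0))
```

The filtered list keeps the input format, so it could be written back out as a reduced spectrum for whichever search engine is used downstream.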
Application of a simple likelihood ratio approximant to protein sequence classification
Abstract
Motivation: Likelihood ratio approximants (LRA) have been widely used for model comparison in statistics. The present study was undertaken in order to explore their utility as a scoring (ranking) function in the classification of protein sequences.
Results: We used a simple LRA based on the maximal similarity (or minimal distance) scores of the two top-ranking sequence classes. The scoring methods (Smith–Waterman, BLAST, local alignment kernel and compression-based distances) were compared on datasets designed to test sequence similarities between proteins distantly related in terms of structure or evolution. It was found that LRA-based scoring can significantly outperform simple scoring methods.
Contact: [email protected].
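One way to read the scoring idea is sketched below: for each class, take the maximal similarity between the query and any member of the class, then compare the best class with the runner-up. The abstract does not give the exact form of the approximant, so the use of a plain ratio, the function name and the requirement of positive similarity scores are assumptions for illustration only.

```python
import math

def lra_score(class_scores):
    """Return (best_label, ratio) for one query sequence.

    class_scores: dict mapping a class label to a non-empty list of
    positive similarity scores between the query and members of that class.
    ratio is the top class maximum divided by the second class maximum
    (an assumed form of the approximant).
    """
    if len(class_scores) < 2:
        raise ValueError("at least two classes are needed")
    # Maximal similarity per class, sorted from best to worst.
    ranked = sorted(
        ((max(scores), label) for label, scores in class_scores.items()),
        reverse=True,
    )
    (best, best_label), (second, _) = ranked[0], ranked[1]
    ratio = best / second if second > 0 else math.inf
    return best_label, ratio

if __name__ == "__main__":
    query_scores = {"a": [10.0, 40.0], "b": [20.0], "c": [5.0]}
    print(lra_score(query_scores))
```

A ratio close to 1 means the two best classes are nearly tied, while a larger ratio indicates a clearer separation. For distance measures the same sketch would use per-class minima and invert the ratio.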
A Protein Classification Benchmark collection for machine learning
Protein classification by machine learning algorithms is now widely used in structural and functional annotation of proteins. The Protein Classification Benchmark collection was created in order to provide standard datasets on which the performance of machine learning methods can be compared. It is primarily meant for method developers and users interested in comparing methods under standardized conditions. The collection contains datasets of sequences and structures, and each set is subdivided into positive/negative, training/test sets in several ways. There is a total of 6405 classification tasks, 3297 on protein sequences, 3095 on protein structures and 10 on protein coding regions in DNA. Typical tasks include the classification of structural domains in the SCOP and CATH databases based on their sequences or structures, as well as various functional and taxonomic classification problems. In the case of hierarchical classification schemes, the classification tasks can be defined at various levels of the hierarchy (such as classes, folds, superfamilies, etc.). For each dataset there are distance matrices available that contain all vs. all comparisons of the data, based on various sequence or structure comparison methods, as well as a set of classification performance measures computed with various classifier algorithms.
Emergence of Collective Territorial Defense in Bacterial Communities: Horizontal Gene Transfer Can Stabilize Microbiomes
Multispecies bacterial communities such as the microbiota of the gastrointestinal tract can be remarkably stable and resilient even though they consist of cells and species that compete for resources and also produce a large number of antimicrobial agents. Computational modeling suggests that horizontal transfer of resistance genes may greatly contribute to the formation of stable and diverse communities capable of protecting themselves with a battery of antimicrobial agents while preserving a varied metabolic repertoire of the constituent species. In other words, horizontal transfer of resistance genes makes a community compatible in terms of exoproducts and capable of maintaining a varied and mature metagenome. The same property may allow microbiota to protect a host organism, or if used as a microbial therapy, to purge pathogens and restore a protective environment.
GaIn: Human Gait Inference for Lower Limbic Prostheses for Patients Suffering from Double Trans-Femoral Amputation
Several studies have analyzed human gait data obtained from inertial gyroscope and accelerometer sensors mounted on different parts of the body. In this article, we take a step further in gait analysis and provide a methodology for predicting the movements of the legs, which can be applied in prostheses to imitate the missing part of the leg in walking. In particular, we propose a method, called GaIn, to control non-invasive, robotic, prosthetic legs. GaIn can infer the movements of both missing shanks and feet for humans suffering from double trans-femoral amputation using biologically inspired recurrent neural networks. Predictions are performed for casual walking-related activities such as walking, taking stairs, and running, based on thigh movement. In our experimental tests, GaIn achieved a 4.55° prediction error for shank movements on average. However, a patient’s intention to stand up and sit down cannot be inferred from thigh movements. In fact, intention causes thigh movements while the shanks and feet remain roughly still. The GaIn system can be triggered by thigh muscle activities measured with electromyography (EMG) sensors to make robotic prosthetic legs perform standing up and sitting down actions. The GaIn system has low prediction latency and is fast and computationally inexpensive enough to be deployed on mobile platforms and portable devices.