On Prediction Using Variable Order Markov Models
This paper is concerned with algorithms for prediction of discrete sequences
over a finite alphabet, using variable order Markov models. The class of such
algorithms is large and in principle includes any lossless compression
algorithm. We focus on six prominent prediction algorithms, including Context
Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic
Suffix Trees (PSTs). We discuss the properties of these algorithms and compare
their performance using real life sequences from three domains: proteins,
English text and music pieces. The comparison is made with respect to
prediction quality as measured by the average log-loss. We also compare
classification algorithms based on these predictors with respect to a number of
large protein classification tasks. Our results indicate that a "decomposed"
CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in
sequence prediction tasks. Somewhat surprisingly, a different algorithm, which
is a modification of the Lempel-Ziv compression algorithm, significantly
outperforms all algorithms on the protein classification problems.
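The average log-loss used above as the measure of prediction quality can be illustrated with a minimal sketch. It assumes a fixed-order Markov predictor with add-one smoothing, which is a deliberate simplification and not one of the six variable order algorithms the paper compares; the function name `average_log_loss` and its parameters are hypothetical.

```python
import math
from collections import Counter, defaultdict


def average_log_loss(train, test, order=2, alphabet=None):
    """Average log-loss in bits per symbol of a simple Markov predictor.

    Assumption: a fixed-order context model with add-one smoothing,
    used only to illustrate the metric.
    """
    if alphabet is None:
        alphabet = sorted(set(train) | set(test))
    # Count how often each symbol follows each context in the training data.
    counts = defaultdict(Counter)
    for i in range(len(train)):
        context = train[max(0, i - order):i]
        counts[context][train[i]] += 1
    # Sum -log2 P(symbol | context) over the test sequence.
    total = 0.0
    for i in range(len(test)):
        context = test[max(0, i - order):i]
        seen = counts.get(context, Counter())
        p = (seen[test[i]] + 1) / (sum(seen.values()) + len(alphabet))
        total -= math.log2(p)
    return total / len(test)
```

A lower value means the predictor assigned higher probability to the symbols that actually occurred. With no training data and a two-symbol alphabet, every prediction is uniform and the loss is exactly 1 bit per symbol.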
Diagnosing serious infections in acutely ill children in ambulatory care (ERNIE 2 study protocol, part A): diagnostic accuracy of a clinical decision tree and added value of a point-of-care C-reactive protein test and oxygen saturation
Background: Acute illness is the most common presentation of children to ambulatory care. In contrast, serious infections are rare and often present at an early stage. To avoid complications or death, early recognition and adequate referral are essential. In a recent large study, children were included prospectively to construct a symptom-based decision tree with a sensitivity and negative predictive value of nearly 100%. To reduce the number of false positives, point-of-care tests, which provide an immediate result at the bedside, might be useful. The most likely candidates are C-reactive protein testing and pulse oximetry.
Methods: This is a diagnostic accuracy study of signs, symptoms and point-of-care tests for serious infections. Acutely ill children presenting to a family physician or paediatrician will be included consecutively in Flanders, Belgium. Children testing positive on the decision tree will get a point-of-care C-reactive protein test. Children testing negative will be randomly assigned to receive either a point-of-care C-reactive protein test or usual care. The outcome of interest is hospital admission for more than 24 hours with a serious infection within 10 days. Aiming to include over 6500 children, we will report the diagnostic accuracy of the decision tree (+/- the point-of-care C-reactive protein test or pulse oximetry) in terms of sensitivity, specificity, positive and negative likelihood ratios, and positive and negative predictive values. New diagnostic algorithms will be constructed through classification and regression tree and multiple logistic regression analysis.
Discussion: We aim to improve detection of serious infections and to present a practical tool for diagnostic triage of acutely ill children in primary care. We also aim to reduce the number of investigations and admissions in children with non-serious infections.
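The accuracy measures named in the Methods can be computed from a 2x2 table of test result against outcome. The sketch below uses the standard definitions; the function name `diagnostic_accuracy` is hypothetical, and it assumes every cell count is non-zero so that no division by zero occurs.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Standard diagnostic accuracy measures from a 2x2 table.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives and true negatives. Assumes all counts are
    non-zero (no guard against division by zero).
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),  # positive predictive value
        "npv": tn / (tn + fn),  # negative predictive value
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }


# Illustrative counts only; these are not data from the study.
metrics = diagnostic_accuracy(tp=9, fp=40, fn=1, tn=150)
```

With a rare outcome, a rule can combine high sensitivity and a high negative predictive value with a low positive predictive value, which is why the protocol looks to additional tests to reduce false positives.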
Diagnosis of Ovarian Cancer Using Decision Tree Classification of Mass Spectral Data
Recent reports from our laboratory and others support the SELDI ProteinChip technology as a potential clinical diagnostic tool when combined with n-dimensional analysis algorithms. The objective of this study was to determine whether the commercially available classification algorithm Biomarker Patterns Software (BPS), which is based on classification and regression trees (CART), would be effective in discriminating ovarian cancer from benign diseases and healthy controls. Serum protein mass spectrum profiles from 139 subjects (patients with ovarian cancer, patients with benign pelvic diseases, and healthy women) were analyzed using the BPS software. A decision tree using five protein peaks achieved an accuracy of 81.5% in the cross-validation analysis and 80% in a blinded set of samples in differentiating ovarian cancer from the control groups. The potential, advantages, and drawbacks of the BPS system as a bioinformatic tool for the analysis of SELDI high-dimensional proteomic data are discussed.
NcPred for accurate nuclear protein prediction using n-mer statistics with various classification algorithms
Prediction of nuclear proteins is one of the major challenges in genome annotation. A method, NcPred, is described for predicting nuclear proteins with higher accuracy by exploiting n-mer statistics with different classification algorithms, namely Alternating Decision (AD) Tree, Best First (BF) Tree, Random Tree and Adaptive (Ada) Boost. On the BaCello dataset [1], NcPred improves accuracy by about 20% with Random Tree and sensitivity by about 10% with Ada Boost for animal proteins compared to existing techniques. It also increases the accuracy of fungal protein prediction by 20% and recall by 4% with AD Tree. For human proteins, accuracy is improved by about 25% and sensitivity by about 10% with BF Tree. Performance analysis of NcPred clearly demonstrates its advantage over contemporary in silico nuclear protein classification methods.
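As a rough illustration of what n-mer statistics can look like, the sketch below counts overlapping n-mers in a protein sequence and normalises them to relative frequencies. The abstract does not specify how NcPred derives its features, so the overlapping-window scheme, the function name `nmer_frequencies` and the example sequence are all assumptions.

```python
from collections import Counter


def nmer_frequencies(sequence, n=2):
    """Relative frequency of each overlapping n-mer in a sequence.

    Assumption: overlapping windows with a step of one residue; this is
    a generic feature extractor, not NcPred's documented procedure.
    """
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}  # sequence shorter than n
    return {nmer: count / total for nmer, count in counts.items()}


# Hypothetical four-residue sequence with three overlapping 2-mers.
features = nmer_frequencies("MKKL", n=2)
```

Frequency vectors of this kind could then be passed to a tree-based classifier such as those listed in the abstract.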
AmorProt: Amino Acid Molecular Fingerprints Repurposing based Protein Fingerprint
As protein therapeutics play an important role in almost all medical fields,
numerous studies have been conducted on proteins using artificial intelligence.
Artificial intelligence has enabled data driven predictions without the need
for expensive experiments. Nevertheless, unlike the various molecular
fingerprint algorithms that have been developed, protein fingerprint algorithms
have rarely been studied. In this study, we proposed the amino acid molecular
fingerprints repurposing based protein (AmorProt) fingerprint, a protein
sequence representation method that effectively uses the molecular fingerprints
corresponding to 20 amino acids. Subsequently, the performances of the tree
based machine learning and artificial neural network models were compared using
(1) amyloid classification and (2) isoelectric point regression. Finally, the
applicability and advantages of the developed platform were demonstrated
through a case study and the following experiments: (3) comparison of dataset
dependence with feature based methods; (4) feature importance analysis; and (5)
protein space analysis. Consequently, the significantly improved model
performance and data set independent versatility of the AmorProt fingerprint
were verified. The results revealed that the current protein representation
method can be applied to various fields related to proteins, such as predicting
their fundamental properties or interaction with ligands.
On the hierarchical classification of G Protein-Coupled Receptors
Motivation: G protein-coupled receptors (GPCRs) play an important role in many physiological systems by transducing an extracellular signal into an intracellular response. Over 50% of all marketed drugs are targeted towards a GPCR. There is considerable interest in developing an algorithm that could effectively predict the function of a GPCR from its primary sequence. Such an algorithm is useful not only in identifying novel GPCR sequences but in characterizing the interrelationships between known GPCRs.
Results: An alignment-free approach to GPCR classification has been developed using techniques drawn from data mining and proteochemometrics. A dataset of over 8000 sequences was constructed to train the algorithm. This represents one of the largest GPCR datasets currently available. A predictive algorithm was developed based upon the simplest reasonable numerical representation of the protein's physicochemical properties. A selective top-down approach was developed, which used a hierarchical classifier to assign sequences to subdivisions within the GPCR hierarchy. The predictive performance of the algorithm was assessed against several standard data mining classifiers and further validated against Support Vector Machine-based GPCR prediction servers. The selective top-down approach achieves significantly higher accuracy than standard data mining methods in almost all cases.
Random forests with random projections of the output space for high dimensional multi-label classification
We adapt the idea of random projections applied to the output space, so as to
enhance tree-based ensemble methods in the context of multi-label
classification. We show how learning time complexity can be reduced without
affecting computational complexity and accuracy of predictions. We also show
that random output space projections may be used in order to reach different
bias-variance tradeoffs, over a broad panel of benchmark problems, and that
this may lead to improved accuracy while significantly reducing the
computational burden of the learning stage.
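The core idea of projecting the output space can be sketched as follows: a binary label matrix with many columns is multiplied by a random Gaussian matrix to give a much smaller set of real-valued targets on which trees could be grown. This shows the projection step only and is a generic Gaussian random projection; the function name `project_labels`, the scaling and the seed handling are assumptions, and the abstract does not describe the authors' exact construction.

```python
import random


def project_labels(label_matrix, n_components, seed=0):
    """Project an (n_samples x n_labels) label matrix to n_components columns.

    Assumption: a dense Gaussian projection scaled by 1/sqrt(n_components);
    this is only one of several possible random projection schemes.
    """
    rng = random.Random(seed)
    n_labels = len(label_matrix[0])
    scale = 1.0 / n_components ** 0.5
    # One random row per original label, one column per projected output.
    projection = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(n_components)]
        for _ in range(n_labels)
    ]
    return [
        [sum(y * projection[j][k] for j, y in enumerate(row))
         for k in range(n_components)]
        for row in label_matrix
    ]


# Hypothetical example: 3 samples, 6 labels, projected down to 2 outputs.
labels = [
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
projected = project_labels(labels, n_components=2)
```

Fitting trees on the projected targets means each split is scored over `n_components` outputs instead of the full label set, which is where a saving in learning time would come from.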