Multi-TGDR: a regularization method for multi-class classification in microarray experiments
Background
With microarray technology becoming mature and popular, the selection and use
of a small number of relevant genes for accurate classification of samples is a
hot topic in the circles of biostatistics and bioinformatics. However, most of
the developed algorithms lack the ability to handle multiple classes, which is
arguably a common application. Here, we propose an extension to an existing
regularization algorithm called Threshold Gradient Descent Regularization
(TGDR) to specifically tackle multi-class classification of microarray data.
When there are several microarray experiments addressing the same/similar
objectives, one option is to use the meta-analysis version of TGDR (Meta-TGDR),
which considers the classification task as a combination of classifiers with the
same structure/model while allowing the parameters to vary across studies.
However, the original Meta-TGDR extension did not offer a solution for
prediction on independent samples. Here, we propose an explicit method to
estimate the overall coefficients of the biomarkers selected by Meta-TGDR. This
extension permits broader applicability and allows a comparison between the
predictive performance of Meta-TGDR and TGDR using an independent testing set.
Results
Using real-world applications, we demonstrated that the proposed multi-TGDR
framework works well and that the number of selected genes is smaller than the
sum over all individualized binary TGDRs. Additionally, Meta-TGDR and TGDR on
the batch-effect-adjusted pooled data provided approximately the same results.
By adding a Bagging procedure in each application, stability and good predictive
performance are warranted.
Conclusions
Compared with Meta-TGDR, TGDR is less computationally intensive and does not
require samples of all classes in each study. On the adjusted data, it has
approximately the same predictive performance as Meta-TGDR. Thus, it is highly
recommended.
ProLanGO: Protein Function Prediction Using Neural Machine Translation Based on a Recurrent Neural Network
With the development of next generation sequencing techniques, it is fast and
cheap to determine protein sequences but relatively slow and expensive to
extract useful information from protein sequences because of limitations of
traditional biological experimental techniques. Protein function prediction has
been a long-standing challenge in filling the gap between the huge number of
protein sequences and their known functions. In this paper, we propose a novel
method to convert the protein function prediction problem into a language
translation problem, from the newly proposed protein sequence language "ProLan"
to the protein function language "GOLan", and build a neural machine translation
model based on recurrent neural networks to translate the "ProLan" language into
the "GOLan" language. We blindly tested our method by participating in the third
Critical Assessment of Function Annotation (CAFA 3) in 2016, and also evaluated
the performance of our method on selected proteins whose functions were released
after the CAFA competition. The good performance on the training and testing
datasets demonstrates that our newly proposed method is a promising direction for
protein function prediction. In summary, we propose, for the first time, a
method that converts the protein function prediction problem into a language
translation problem and applies a neural machine translation model for protein
function prediction. Comment: 13 pages, 5 figures
Kernel methods in genomics and computational biology
Support vector machines and kernel methods are increasingly popular in
genomics and computational biology, due to their good performance in real-world
applications and strong modularity that makes them suitable to a wide range of
problems, from the classification of tumors to the automatic annotation of
proteins. Their ability to work in high dimension, to process non-vectorial
data, and the natural framework they provide to integrate heterogeneous data
are particularly relevant to various problems arising in computational biology.
In this chapter we survey some of the most prominent applications published so
far, highlighting the particular developments in kernel methods triggered by
problems in biology, and mention a few promising research directions likely to
expand in the future.
An Extensive Empirical Comparison of Probabilistic Hierarchical Classifiers in Datasets of Ageing-Related Genes
This study comprehensively evaluates the performance of 5 types of probabilistic hierarchical classification methods used for predicting Gene Ontology (GO) terms related to ageing. Of those tested, a new hybrid of a Local Hierarchical Classifier (LHC) and the Predictive Clustering Tree algorithm (LHC-PCT) had the best predictive accuracy. We also tested the impact of two types of variations in most hierarchical classification algorithms, namely: (a) changing the base algorithm (we tested Naive Bayes and Support Vector Machines), and (b) using or not using the Correlation-based Feature Selection (CFS) algorithm in a pre-processing step. In total, we evaluated the predictive performance of 17 variations of hierarchical classifiers across 15 datasets of ageing- and longevity-related genes. We conclude that the LHC-PCT algorithm ranks better across several tests (7 out of 12). In addition, we interpreted the models generated by the PCT algorithm to show how hierarchical classification algorithms can be used to extract biological insights from the ageing-related datasets that we compiled.