An empirical comparison of supervised machine learning techniques in bioinformatics
Research in bioinformatics is driven by experimental data.
Current biological databases are populated by vast amounts of
experimental data. Machine learning has been widely applied to
bioinformatics and has gained a lot of success in this research
area. At present, with various learning algorithms available in the
literature, researchers are facing difficulties in choosing the best
method that can apply to their data. We performed an empirical
study on 7 individual learning systems and 9 different combined
methods on 4 different biological data sets, and provide some
suggested issues to be considered when answering the following
questions: (i) How does one choose which algorithm is best
suited to their data set? (ii) Are combined methods better than
a single approach? (iii) How does one compare the effectiveness
of a particular algorithm to the others?
Recommended from our members
Multi-class protein fold classification using a new ensemble machine learning approach.
Protein structure classification represents an important process in understanding the associations
between sequence and structure as well as possible functional and evolutionary relationships.
Recent structural genomics initiatives and other high-throughput experiments have populated the
biological databases at a rapid pace. The amount of structural data has made traditional methods,
such as manual inspection of protein structures, impossible. Machine learning has been
widely applied to bioinformatics and has gained a lot of success in this research area. This work
proposes a novel ensemble machine learning method that improves the coverage of the classifiers
under the multi-class imbalanced sample sets by integrating knowledge induced from different base
classifiers, and we illustrate this idea in classifying multi-class SCOP protein fold data. We have
compared our approach with PART and show that our method improves the sensitivity of the
classifier in protein fold classification. Furthermore, we have extended this method to learning over
multiple data types, preserving the independence of their corresponding data sources, and show
that our new approach performs at least as well as the traditional technique over a single joined
data source. These experimental results are encouraging, and can be applied to other bioinformatics
problems similarly characterised by multi-class imbalanced data sets held in multiple data
sources.
Integrative machine learning approach for multi-class SCOP protein fold classification
Classification and prediction of protein structure has been a central research theme in structural bioinformatics. Due to the imbalanced distribution of proteins over multiple SCOP classes, most discriminative machine learning methods suffer from the well-known "False Positives" problem when learning over these types of problems. We have devised eKISS, an ensemble machine learning method specifically designed to increase the coverage of positive examples when learning under multi-class imbalanced data sets. We have applied eKISS to classify 25 SCOP folds and show that our learning system improved over classical learning methods.
Characterisation of FAD-family folds using a machine learning approach
Flavin adenine dinucleotide (FAD) and its derivatives play a crucial role in
biological processes. They are major organic cofactors and electron carriers
in both enzymatic activities and biochemical pathways. We have analysed
the relationships between sequence and structure of FAD-containing proteins
using a machine learning approach. Decision trees were generated using the
C4.5 algorithm as a means of automatically generating rules from biological
databases (TOPS, CATH and PDB). These rules were then used as
background knowledge for an ILP system to characterise the four different
classes of FAD-family folds classified in Dym and Eisenberg (2001). These
FAD-family folds are: glutathione reductase (GR), ferredoxin reductase (FR),
p-cresol methylhydroxylase (PCMH) and pyruvate oxidase (PO). Each FAD-family
was characterised by a set of rules. The "knowledge patterns"
generated from this approach are a set of rules containing conserved sequence
motifs, secondary structure sequence elements and folding information.
Every rule was then verified using statistical evaluation on the measured
significance of each rule. We show that this machine learning approach is
capable of learning and discovering interesting patterns from large biological
databases and can generate "knowledge patterns" that characterise the FAD-containing
proteins, and at the same time classify these proteins into four
different families.
Shrewd Selection Speeds Surfing: Use Smart EXP3!
In this paper, we explore the use of multi-armed bandit online learning
techniques to solve distributed resource selection problems. As an example, we
focus on the problem of network selection. Mobile devices often have several
wireless networks at their disposal. While choosing the right network is vital
for good performance, a decentralized solution remains a challenge. The
impressive theoretical properties of multi-armed bandit algorithms, like EXP3,
suggest that it should work well for this type of problem. Yet, its real-world
performance lags far behind. The main reasons are the hidden cost of switching
networks and its slow rate of convergence. We propose Smart EXP3, a novel
bandit-style algorithm that (a) retains the good theoretical properties of
EXP3, (b) bounds the number of switches, and (c) yields significantly better
performance in practice. We evaluate Smart EXP3 using simulations, controlled
experiments, and real-world experiments. Results show that it stabilizes at the
optimal state, achieves fairness among devices and gracefully deals with
transient behaviors. In real-world experiments, it can achieve 18% faster
downloads than alternative strategies. We conclude that multi-armed bandit
algorithms can play an important role in distributed resource selection
problems, when practical concerns, such as switching costs and convergence
time, are addressed. Comment: Full pape
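For readers unfamiliar with the baseline named in this abstract, the following is a minimal sketch of the standard EXP3 bandit algorithm applied to a toy network-selection setting. It is not Smart EXP3: it does not bound switching and ignores switching cost. The function name `exp3`, the `reward_fn` callback, the exploration rate `gamma=0.1`, and the per-network reward values are all assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def exp3(reward_fn, n_arms, n_rounds, gamma=0.1, seed=0):
    """Run standard EXP3 and return how often each arm (network) was chosen.

    reward_fn(arm) must return a reward in [0, 1], e.g. a normalised bit rate.
    gamma is the exploration rate (hypothetical default, not from the paper).
    """
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    pulls = [0] * n_arms

    for _ in range(n_rounds):
        total = sum(weights)
        # Mix the weight-proportional distribution with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / n_arms for w in weights]

        # Sample one arm according to the mixed distribution.
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(arm)

        # Importance-weighted estimate: unbiased for the chosen arm's reward.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / n_arms)

        # Rescale so the largest weight is 1; probabilities are unchanged,
        # but the weights cannot overflow on long runs.
        top = max(weights)
        weights = [w / top for w in weights]

        pulls[arm] += 1

    return pulls


if __name__ == "__main__":
    # Three hypothetical networks with fixed normalised rates.
    rates = [0.2, 0.5, 0.9]
    print(exp3(lambda arm: rates[arm], n_arms=3, n_rounds=5000))
```

Because each round draws a fresh arm from the probability distribution, plain EXP3 keeps switching between networks even after it has identified the best one, which is the practical cost the abstract refers to.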
Ensemble machine learning on gene expression data for cancer classification
Whole genome RNA expression studies permit systematic approaches to understanding the correlation between gene expression profiles and disease states or different developmental stages of a cell. Microarray analysis provides quantitative information about the complete transcription profile of cells, which facilitates drug and therapeutics development, disease diagnosis, and understanding of basic cell biology. One of the challenges in microarray analysis, especially in cancerous gene expression profiles, is to identify genes or groups of genes that are highly expressed in tumour cells but not in normal cells, and vice versa. Previously, we have shown that ensemble machine learning consistently performs well in classifying biological data. In this paper, we focus on three different supervised machine learning techniques in cancer classification, namely the C4.5 decision tree, bagged decision trees, and boosted decision trees. We have performed classification tasks on seven publicly available cancer microarray data sets and compared the classification/prediction performance of these methods. We have observed that ensemble learning (bagged and boosted decision trees) often performs better than single decision trees in this classification task.
Some Remarks About the Two-Matrix Penner Model and the Kazakov-Migdal Model
I consider the Hermitean two-matrix model with a logarithmic potential which
is associated in the one-matrix case with the Penner model. Using loop
equations I find an explicit solution of the model at large N (or in the
spherical approximation) and demonstrate that it solves the corresponding
Riemann-Hilbert problem. I construct the potential of the Kazakov-Migdal model
on a D-dimensional lattice, which turns out to be a sum of two logarithms as
well, whose large-N solution is given by the same formulas. In the "naive"
continuum limit this potential recovers in D<4 dimensions the standard scalar
theory with quartic self-interaction. I exploit the solution to calculate
explicitly the pair correlator of gauge fields in the Kazakov-Migdal model with
the logarithmic potential. Comment: Latex, 13 pages, NBI-HE-93-2
Designing successful executive program on creativity: Theoretical approaches and practical challenges in Asia