754 research outputs found
Bounded Coordinate-Descent for Biological Sequence Classification in High Dimensional Predictor Space
We present a framework for discriminative sequence classification where the
learner works directly in the high dimensional predictor space of all
subsequences in the training set. This is made possible by a new
coordinate-descent algorithm coupled with a bound on the magnitude of the
gradient, which allows discriminative subsequences to be selected quickly. We characterize the
loss functions for which our generic learning algorithm can be applied and
present concrete implementations for logistic regression (binomial
log-likelihood loss) and support vector machines (squared hinge loss).
Application of our algorithm to protein remote homology detection and remote
fold recognition results in performance comparable to that of state-of-the-art
methods (e.g., kernel support vector machines). Unlike state-of-the-art
classifiers, the resulting classification models are simply lists of weighted
discriminative subsequences and can thus be interpreted and related to the
biological problem.
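As a rough illustration of the selection step described in this abstract, the sketch below computes the gradient of the loss with respect to every subsequence weight and picks the coordinate with the largest magnitude. It is a brute-force stand-in, not the authors' algorithm: it enumerates features explicitly instead of using the gradient bound, and it assumes that features are contiguous substrings of bounded length with binary occurrence indicators. The names `substrings`, `best_coordinate`, and `max_len` are hypothetical.

```python
import math


def logistic_loss_grad(margin):
    """Derivative of the binomial log-likelihood loss log(1 + exp(-m)) w.r.t. the margin m."""
    return -1.0 / (1.0 + math.exp(margin))


def squared_hinge_grad(margin):
    """Derivative of the squared hinge loss max(0, 1 - m)^2 w.r.t. the margin m."""
    return -2.0 * max(0.0, 1.0 - margin)


def substrings(s, max_len):
    """All distinct contiguous substrings of s with length 1..max_len (assumed feature set)."""
    return {s[i:i + k] for k in range(1, max_len + 1) for i in range(len(s) - k + 1)}


def best_coordinate(seqs, labels, margins, loss_grad, max_len=3):
    """Return (feature, gradient) for the feature with the largest |gradient|.

    With margin m_i = y_i * f(x_i) and binary indicators x_ij, the gradient
    for feature j is sum_i loss'(m_i) * y_i * x_ij.
    """
    grads = {}
    for s, y, m in zip(seqs, labels, margins):
        g = loss_grad(m) * y
        for feat in substrings(s, max_len):
            grads[feat] = grads.get(feat, 0.0) + g
    return max(grads.items(), key=lambda kv: abs(kv[1]))
```

Calling `best_coordinate(seqs, labels, [0.0] * len(seqs), logistic_loss_grad)` corresponds to the first iteration, when all weights are zero. Passing `squared_hinge_grad` instead swaps in the second loss named in the abstract.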
Fluctuations of the Longest Common Subsequence for Sequences of Independent Blocks
The problem of the fluctuation of the Longest Common Subsequence (LCS) of two
i.i.d. sequences of length n has been open for decades. There exist
contradicting conjectures on the topic. Chvatal and Sankoff conjectured in 1975
that asymptotically the order should be , while Waterman conjectured
in 1994 that asymptotically the order should be . A contiguous substring
consisting only of one type of symbol is called a block. In the present work,
we determine the order of the fluctuation of the LCS for a special model of
sequences consisting of i.i.d. blocks whose lengths are uniformly distributed
on the set , with a given positive integer. We showed that
the fluctuation in this model is asymptotically of order , which confirms
Waterman's conjecture. For achieving this goal, we developed a new method which
allows us to reformulate the problem of the order of the variance as a
(relatively) low-dimensional optimization problem. Comment: PDFLatex, 40 pages
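For readers unfamiliar with the two objects this abstract refers to, here is a minimal sketch of the LCS length (the classic dynamic program) and of the block decomposition of a string. The function names are hypothetical and the code is not taken from the paper.

```python
from itertools import groupby


def lcs_length(x, y):
    """Length of the longest common subsequence of x and y.

    Classic dynamic program using O(len(x) * len(y)) time and O(len(y)) memory.
    """
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def block_lengths(s):
    """Lengths of the maximal runs of identical symbols (the 'blocks') in s."""
    return [len(list(group)) for _, group in groupby(s)]
```

For example, `block_lengths("0011101")` splits the string into the blocks `00`, `111`, `0`, `1`.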
A Monte Carlo Approach to the Fluctuation Problem in Optimal Alignments of Random Strings
The problem of determining the correct order of fluctuation of the optimal alignment score of two random strings of length n has been open for several decades. It is known [12] that the biased expected effect of a random letter-change on the optimal score implies an order of fluctuation linear in √n. However, in many situations where such a biased effect is observed empirically, it has been impossible to prove analytically. The main result of this paper shows that when the rescaled limit of the optimal alignment score increases in a certain direction, then the biased effect exists. On the basis of this result one can quantify a confidence level for the existence of such a biased effect, and hence of an order √n fluctuation, based on simulation of optimal alignment scores. This is an important step forward, as the correct order of fluctuation was previously known only for certain special distributions [12],[13],[5],[10]. To illustrate the usefulness of our new methodology, we apply it to optimal alignments of strings written in the DNA alphabet. As scoring function, we use the BLASTZ default substitution matrix together with a realistic gap penalty. BLASTZ is one of the most widely used sequence alignment methodologies in bioinformatics. For this DNA setting, we show that with a high level of confidence, the fluctuation of the optimal alignment score is of order Θ(√n). An important special case of optimal alignment score is the Longest Common Subsequence (LCS) of random strings. For binary sequences with equiprobable symbols the question of the fluctuation of the LCS remains open. The symmetry in that case does not allow for our method. On the other hand, in real-life DNA sequences, it is not the case that all letters occur with the same frequency. So, for many real-life situations, our method makes it possible to determine the order of the fluctuation up to a high confidence level.
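The quantity studied in this abstract can be explored with a crude simulation: draw pairs of random DNA strings with non-uniform letter frequencies, compute the optimal global alignment score, and look at the spread of the scores. The sketch below does only that. It does not reproduce the paper's confidence procedure, and the scoring values (`match=1`, `mismatch=-1`, `gap=-2`) are placeholder assumptions, not the BLASTZ defaults.

```python
import random
from statistics import variance


def alignment_score(x, y, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score with a linear gap penalty.

    The scoring parameters are illustrative placeholders, not BLASTZ values.
    """
    prev = [j * gap for j in range(len(y) + 1)]
    for i, a in enumerate(x, start=1):
        curr = [i * gap]
        for j, b in enumerate(y, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def simulate_scores(n, trials, weights, seed=0):
    """Scores of `trials` independent pairs of random DNA strings of length n.

    `weights` gives the (assumed) relative frequencies of A, C, G, T.
    """
    rng = random.Random(seed)
    letters = "ACGT"
    scores = []
    for _ in range(trials):
        x = rng.choices(letters, weights=weights, k=n)
        y = rng.choices(letters, weights=weights, k=n)
        scores.append(alignment_score(x, y))
    return scores


if __name__ == "__main__":
    # Compare the sample variance with n for a few string lengths.
    for n in (50, 100, 200):
        scores = simulate_scores(n, trials=200, weights=[0.4, 0.1, 0.1, 0.4])
        print(n, variance(scores), variance(scores) / n)
```

If the fluctuation is of order √n, the ratio in the last column should stay roughly bounded away from zero and infinity as n grows. A few hundred trials give only a noisy impression.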
A comprehensive comparison of metaheuristics for the repetition-free longest common subsequence problem
This paper deals with an NP-hard string problem from the bio-informatics field: the repetition-free longest common subsequence problem. This problem has attracted increasing interest in recent years, which has resulted in the application of several pure as well as hybrid metaheuristics. However, the literature lacks a comprehensive comparison between those approaches. Moreover, it has been shown that general-purpose integer linear programming solvers are very efficient for solving many of the problem instances that have been used so far in the literature. Therefore, in this work we extend the available benchmark set, adding larger instances to which integer linear programming solvers can no longer be applied. Moreover, we provide a comprehensive comparison of the approaches found in the literature. Based on the results we propose a hybrid between two of the best methods, which turns out to inherit the complementary strengths of both methods. Peer Reviewed. Postprint (author's final draft)
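To make the problem statement concrete: a repetition-free common subsequence of two strings is a common subsequence in which no symbol occurs more than once. The sketch below gives a feasibility check and an exhaustive exact solver that is usable only on tiny instances. It is not one of the metaheuristics or the integer linear programming model compared in the paper, and all function names are hypothetical.

```python
from functools import lru_cache


def is_subsequence(s, t):
    """True if s can be obtained from t by deleting symbols."""
    it = iter(t)
    return all(ch in it for ch in s)


def is_repetition_free_common_subsequence(s, x, y):
    """True if s is a common subsequence of x and y with no repeated symbol."""
    return len(set(s)) == len(s) and is_subsequence(s, x) and is_subsequence(s, y)


def rflcs_length_bruteforce(x, y):
    """Exact repetition-free LCS length by memoized exhaustive search.

    The state includes the set of symbols already used, so the running time
    is exponential in the alphabet size; intended for small examples only.
    """
    @lru_cache(maxsize=None)
    def best(i, j, used):
        if i == len(x) or j == len(y):
            return 0
        result = max(best(i + 1, j, used), best(i, j + 1, used))
        if x[i] == y[j] and x[i] not in used:
            result = max(result, 1 + best(i + 1, j + 1, used | frozenset([x[i]])))
        return result

    return best(0, 0, frozenset())
```

For `x = "aabb"` and `y = "abab"`, the ordinary LCS has length 3 (for example `abb`), while the repetition-free variant is limited to length 2 (`ab`).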