27,814 research outputs found
A machine learning based framework to identify and classify long terminal repeat retrotransposons
Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-LEARNER, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework towards LTR retrotransposons, a particular type of TEs characterized by having long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: REPEATMASKER, CENSOR and LTRDIGEST. In contrast to these methods, TE-LEARNER is the first to incorporate machine learning techniques, outperforming these methods in terms of predictive performance , while able to learn models and make predictions efficiently. Moreover, we show that our method was able to identify TEs that none of the above method could find, and we investigated TE-LEARNER'S predictions which did not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE
Algorithm Selection Framework for Cyber Attack Detection
The number of cyber threats against both wired and wireless computer systems
and other components of the Internet of Things continues to increase annually.
In this work, an algorithm selection framework is employed on the NSL-KDD data
set and a novel paradigm of machine learning taxonomy is presented. The
framework uses a combination of user input and meta-features to select the best
algorithm to detect cyber attacks on a network. Performance is compared between
a rule-of-thumb strategy and a meta-learning strategy. The framework removes
the conjecture of the common trial-and-error algorithm selection method. The
framework recommends five algorithms from the taxonomy. Both strategies
recommend a high-performing algorithm, though not the best performing. The work
demonstrates the close connectedness between algorithm selection and the
taxonomy for which it is premised.Comment: 6 pages, 7 figures, 1 table, accepted to WiseML '2
Unsupervised Discovery of Phonological Categories through Supervised Learning of Morphological Rules
We describe a case study in the application of {\em symbolic machine
learning} techniques for the discovery of linguistic rules and categories. A
supervised rule induction algorithm is used to learn to predict the correct
diminutive suffix given the phonological representation of Dutch nouns. The
system produces rules which are comparable to rules proposed by linguists.
Furthermore, in the process of learning this morphological task, the phonemes
used are grouped into phonologically relevant categories. We discuss the
relevance of our method for linguistics and language technology
A review of multi-instance learning assumptions
Multi-instance (MI) learning is a variant of inductive machine learning, where each learning example contains a bag of instances instead of a single feature vector. The term commonly refers to the supervised setting, where each bag is associated with a label. This type of representation is a natural fit for a number of real-world learning scenarios, including drug activity prediction and image classification, hence many MI learning algorithms have been proposed. Any MI learning method must relate instances to bag-level class labels, but many types of relationships between instances and class labels are possible. Although all early work in MI learning assumes a specific MI concept class known to be appropriate for a drug activity prediction domain; this ‘standard MI assumption’ is not guaranteed to hold in other domains. Much of the recent work in MI learning has concentrated on a relaxed view of the MI problem, where the standard MI assumption is dropped, and alternative assumptions are considered instead. However, often it is not clearly stated what particular assumption is used and how it relates to other assumptions that have been proposed. In this paper, we aim to clarify the use of alternative MI assumptions by reviewing the work done in this area
Boosting Classifiers for Drifting Concepts
This paper proposes a boosting-like method to train a classifier ensemble from data streams. It naturally adapts to concept drift and allows to quantify the drift in terms of its base learners. The algorithm is empirically shown to outperform learning algorithms that ignore concept drift. It performs no worse than advanced adaptive time window and example selection strategies that store all the data and are thus not suited for mining massive streams. --
- …