3 research outputs found
Prediction using step-wise L1, L2 regularization and feature selection for small data sets with large number of features
Background: Machine learning methods are now used for many biological prediction problems involving drugs, ligands or polypeptide segments of a protein. Building a prediction model requires a so-called training data set of molecules with measured target properties. For many such problems the size of the training data set is limited, because measurements have to be performed in a wet lab. Furthermore, the problems considered are often complex, so it is not clear which molecular descriptors (features) are suitable for establishing a strong correlation with the target property. In many applications all available descriptors are used. This can lead to difficult machine learning problems when thousands of descriptors are considered and only a few (e.g. fewer than one hundred) molecules are available for training.
Results: The CoEPrA contest provides four data sets that are typical of biological regression problems (few molecules in the training data set and thousands of descriptors). We applied the same two-step training procedure to all four regression tasks. In the first stage, we used optimized L1 regularization to select the most relevant features, reducing the initial set of more than 6,000 features to about 50. In the second stage, we used only the features selected in the first stage and applied a milder L2 regularization, which generally yielded a further improvement in prediction performance. Our linear model employed a soft loss function, which minimizes the influence of outliers.
Conclusions: The proposed two-step method showed good results on all four CoEPrA regression tasks. It may therefore be useful for many other biological prediction problems in which only a small number of molecules, each described by thousands of descriptors, are available for training.
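The two-step procedure described in the Results can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain squared loss in place of the soft loss function mentioned in the abstract, uses a simple coordinate-descent solver, and runs on synthetic data. All names (`fit_linear`, `two_stage_fit`, `make_toy_data`) and penalty values are hypothetical.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (proximal step of the L1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_linear(X, y, l1=0.0, l2=0.0, n_sweeps=200):
    """Coordinate descent for 0.5*||y - Xw||^2 + l1*||w||_1 + 0.5*l2*||w||^2.

    Assumption: squared loss stands in for the soft loss of the abstract.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)  # equals y - Xw because w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(d)]
    for _ in range(n_sweeps):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) + col_sq[j] * w[j]
            new_wj = soft_threshold(rho, l1) / (col_sq[j] + l2)
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_wj
    return w


def two_stage_fit(X, y, l1_strong, l2_mild):
    """Stage 1: L1 selects features. Stage 2: refit them with a mild L2 penalty."""
    stage1 = fit_linear(X, y, l1=l1_strong)
    selected = [j for j, wj in enumerate(stage1) if wj != 0.0]
    X_selected = [[row[j] for j in selected] for row in X]
    weights = fit_linear(X_selected, y, l2=l2_mild)
    return selected, weights


def make_toy_data(n=40, d=100, seed=0):
    """Synthetic data: more features than samples, only features 0 and 1 matter."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [3.0 * row[0] - 2.0 * row[1] + 0.1 * rng.gauss(0.0, 1.0) for row in X]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    selected, weights = two_stage_fit(X, y, l1_strong=20.0, l2_mild=0.1)
    print("selected features:", selected)
    print("stage-2 weights:", [round(w, 3) for w in weights])
```

The L1 stage shrinks the weights it keeps, so the refit with a weaker L2 penalty mainly removes that shrinkage bias on the surviving features. In practice both penalty strengths would be tuned, for example by cross-validation.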
Exploring classification strategies with the CoEPrA 2006 contest
Motivation: In silico methods to classify compounds as potential drugs that bind to a specific target are becoming increasingly important for drug design. To build classification devices, training sets of drugs with known activities are needed. For many such classification problems, not only qualitative but also quantitative information on a specific property (e.g. binding affinity) is available. The latter can be used to build a regression scheme to predict this property for new compounds. Predicting a compound property explicitly is generally more difficult than classifying whether the property lies below or above a given threshold value. Hence, an indirect classification that is based on regression may lead to poorer results than a direct classification scheme. In fact, researchers are initially interested only in classifying compounds as potential drugs. The activities of these compounds are subsequently measured in a wet lab.
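The difference between predicting a property explicitly and classifying it against a threshold can be made concrete with a small sketch. The affinity values, predictions and threshold below are invented for illustration and do not come from the CoEPrA data; the helper names are hypothetical.

```python
import math


def rmse(true_values, predicted_values):
    """Root-mean-square error of a regression prediction."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return math.sqrt(sum(squared) / len(squared))


def classify_by_threshold(values, threshold):
    """Indirect classification: label a compound active if its value reaches the threshold."""
    return [v >= threshold for v in values]


def accuracy(true_labels, predicted_labels):
    """Fraction of compounds whose predicted label matches the true label."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


# Hypothetical measured and predicted binding affinities for four compounds.
measured = [4.0, 5.5, 7.2, 8.9]
predicted = [2.0, 5.9, 6.6, 11.0]
threshold = 6.0  # hypothetical activity cut-off

regression_error = rmse(measured, predicted)
class_accuracy = accuracy(
    classify_by_threshold(measured, threshold),
    classify_by_threshold(predicted, threshold),
)
```

Here the regression error is large (about 1.49 affinity units) although every compound ends up on the correct side of the threshold. A regression model is penalized for errors that are irrelevant to the class label, which is one way to read the abstract's point that a direct classifier may outperform one derived from regression.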
Regression und Klassifikation biochemischer Systeme mit Hilfe der DemPRED-Bibliothek (Regression and classification of biochemical systems using the DemPRED library)
Contents:
1 Introduction
2 Publications
2.1 Predicting human volume of distribution and clearance of drugs using automated feature selection
2.2 Prediction using step-wise L1, L2 regularization and feature selection for small data sets with large number of features
2.3 Exploring classification strategies with the CoEPrA 2006 contest
2.4 Predicting Protein Complex Geometries with Linear Scoring Functions
3 Discussion
4 Summary
5 Summary in German
Statutory Declaration
References

In silico predictions of particular properties of biologically active molecules
can dramatically reduce time and costs needed to measure these properties in a
wet lab. Nevertheless, implementing state-of-the-art prediction
techniques from scratch requires expert knowledge of machine learning methods
and considerable programming skills. Hence, there is a demand for
powerful yet easy-to-use libraries, which users can employ and extend to build
their own models given a particular prediction task. During my PhD I developed
such a library called DemPRED. The core of DemPRED consists of a linear
scoring function. This scoring function can be combined with various loss
functions, which makes DemPRED suitable for classification and regression. In
cases where a linear model is not flexible enough, DemPRED makes use of the
kernel trick to transform the linear core into a nonlinear one. DemPRED
contains many additional routines, which help users to generate reliable
prediction models. These include various quality measures as well as
resampling strategies and routines for saving and loading generated models.
DemPRED includes various regularization and feature selection strategies,
which make this library especially suitable for prediction tasks where few
observations are described by thousands of descriptors. The object-oriented
implementation of DemPRED allows users to extend the built-in routines or
replace them with their own. During my PhD I successfully used DemPRED on
various classification and regression problems such as predicting major
histocompatibility complex II (MHC II) epitopes, predicting human volume of
distribution and clearance, and detecting protein interface regions. The
predictive power of all generated models was as good as or even better than
that of other state-of-the-art classification and regression techniques.

Summary in German (translated): Despite advanced measurement techniques,
determining molecular properties can be very time-consuming and expensive for
most biochemical processes. This is particularly true when the properties of
large molecular databases are to be investigated. To speed up the measurement
process, laboratory experiments are nowadays increasingly supplemented by
computer-aided prediction methods. In this way even large databases can be
examined in a fraction of the time that would otherwise be needed in the lab.
Without suitable tools, however, generating a meaningful computer-aided
prediction model can likewise be complicated and time-consuming. For this
reason there is a demand for easy-to-use and extensible program libraries that
provide the basic functions for generating prediction models. During my PhD I
developed such a library, called DemPRED. At its core, DemPRED is based on a
linear model that can be combined with various loss functions. In cases where
a linear model does not provide the necessary flexibility, DemPRED can be
extended to a nonlinear model by means of the kernel trick. The DemPRED
library also offers a number of additional functions that help the user to
generate good prediction models. During my PhD I used DemPRED to predict a
wide variety of biochemical processes. Among other things, I developed models
for predicting MHC II binding epitopes, human distribution and clearance
coefficients, and protein interaction surfaces. The quality of the generated
prediction models was in most cases better than, or at least comparable to,
that of other techniques used to date.
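
The design described in the DemPRED summary, a linear scoring function combined with interchangeable loss functions, might look roughly like the following sketch. This is not DemPRED's API, which the abstract does not show: the function names, the gradient-descent trainer and the two example losses are assumptions chosen for illustration, and the kernel extension is left out.

```python
import math


def score(w, b, x):
    """Linear scoring function: weighted sum of the features plus a bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def squared_loss_grad(s, y):
    """Derivative of 0.5*(s - y)^2 with respect to the score s (regression)."""
    return s - y


def logistic_loss_grad(s, y):
    """Derivative of log(1 + exp(-y*s)) with respect to s, labels y in {-1, +1}."""
    return -y / (1.0 + math.exp(y * s))


def train(X, y, loss_grad, l2=0.0, learning_rate=0.1, epochs=2000):
    """Batch gradient descent on the mean loss plus 0.5*l2*||w||^2.

    The same linear core serves regression or classification depending on
    which loss gradient is passed in.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [l2 * wi for wi in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            g = loss_grad(score(w, b, xi), yi) / n
            for j in range(d):
                grad_w[j] += g * xi[j]
            grad_b += g
        w = [wi - learning_rate * gj for wi, gj in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b


if __name__ == "__main__":
    # Regression: recover y = 2x + 1 with the squared loss.
    w, b = train([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0], squared_loss_grad)
    print("regression:", round(w[0], 3), round(b, 3))
    # Classification: separate negative from positive inputs with the logistic loss.
    w, b = train([[-2.0], [-1.0], [1.0], [2.0]], [-1, -1, 1, 1], logistic_loss_grad)
    print("classification scores:", [round(score(w, b, [x]), 2) for x in (-2.0, 2.0)])
```

Passing the loss in as a function keeps the scoring function and the training loop unchanged across tasks, which is the kind of separation the summary attributes to the library.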