3 research outputs found

    Prediction using step-wise L1, L2 regularization and feature selection for small data sets with large number of features

    Background: Machine learning methods are now used for many biological prediction problems involving drugs, ligands, or polypeptide segments of a protein. Building a prediction model requires a so-called training data set of molecules with measured target properties. For many such problems the size of the training data set is limited, because the measurements have to be performed in a wet lab. Furthermore, the problems considered are often complex, so that it is not clear which molecular descriptors (features) are suitable for establishing a strong correlation with the target property. In many applications all available descriptors are used. This can lead to difficult machine learning problems when thousands of descriptors are considered and only a few (e.g. fewer than one hundred) molecules are available for training.
    Results: The CoEPrA contest provides four data sets that are typical of biological regression problems (few molecules in the training data set and thousands of descriptors). We applied the same two-step training procedure to all four regression tasks. In the first stage, we used optimized L1 regularization to select the most relevant features, which reduced the initial set of more than 6,000 features to about 50. In the second stage, we used only the features selected in the first stage and applied a milder L2 regularization, which generally yielded a further improvement in prediction performance. Our linear model employed a soft loss function that minimizes the influence of outliers.
    Conclusions: The proposed two-step method showed good results on all four CoEPrA regression tasks. It may therefore be useful for many other biological prediction problems in which only a small number of molecules, described by thousands of descriptors, are available for training.
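The two-step idea in this abstract (L1 regularization to select features, then a milder L2 regularization on the selected features only) can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps in a plain squared loss for the soft loss mentioned in the abstract, uses coordinate descent as the solver, and runs on synthetic data. The names `fit_linear` and `two_stage_fit` and the regularization strengths are assumptions chosen for illustration; the abstract describes the L1 strength as optimized, which is not done here.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the proximal step for an L1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_linear(X, y, l1=0.0, l2=0.0, sweeps=100):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + l1*||w||_1 + (l2/2)*||w||^2.

    Assumption: squared loss and no intercept (the data below are zero-mean).
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw because w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_wj = soft_threshold(rho, l1) / (col_sq[j] + l2)
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_wj
    return w

def two_stage_fit(X, y, l1, l2):
    """Stage 1: L1 fit to select features. Stage 2: L2 fit on the selected features."""
    w_stage1 = fit_linear(X, y, l1=l1)
    selected = [j for j, wj in enumerate(w_stage1) if wj != 0.0]
    X_selected = [[row[j] for j in selected] for row in X]
    w_stage2 = fit_linear(X_selected, y, l2=l2)
    return selected, w_stage2

# Synthetic stand-in for "few molecules, many descriptors":
# 30 samples, 200 features, only features 3 and 17 carry signal.
random.seed(0)
n, p = 30, 200
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[3] - 1.5 * row[17] + random.gauss(0, 0.1) for row in X]

selected, weights = two_stage_fit(X, y, l1=0.5, l2=0.01)
print("selected features:", selected)
print("stage-2 weights:", [round(wj, 3) for wj in weights])
```

The L1 stage shrinks the surviving coefficients toward zero as a side effect of selecting them; refitting with only a mild L2 penalty removes most of that shrinkage, which is one plausible reading of why the second stage helped in the reported experiments.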

    Exploring classification strategies with the CoEPrA 2006 contest

    Motivation: In silico methods that classify compounds as potential drugs binding to a specific target are becoming increasingly important for drug design. Building such classifiers requires training sets of drugs with known activities. For many such classification problems, not only qualitative but also quantitative information on a specific property (e.g. binding affinity) is available. The latter can be used to build a regression scheme that predicts this property for new compounds. Predicting a compound property explicitly is generally more difficult than deciding whether the property lies below or above a given threshold value. Hence, an indirect classification based on regression may lead to poorer results than a direct classification scheme. In fact, researchers are initially interested only in classifying compounds as potential drugs. The activities of these compounds are subsequently measured in wet lab
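The distinction between indirect classification (regress on the quantitative property, then threshold the prediction) and direct classification (train on the binary labels) can be made concrete with a small sketch. Everything here is an assumption for illustration: the synthetic "activity" data, the linear score trained by gradient descent, the squared and logistic losses, and all function names. It shows only how the two pipelines differ, not which one performs better on real compound data.

```python
import math
import random

def score(w, b, x):
    """Linear score w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def fit_linear_score(X, targets, loss_gradient, steps=1000, lr=0.1):
    """Full-batch gradient descent; loss_gradient(s, t) is d(loss)/d(score)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * p, 0.0
        for x, t in zip(X, targets):
            g = loss_gradient(score(w, b, x), t)
            for j in range(p):
                grad_w[j] += g * x[j] / n
            grad_b += g / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def squared_gradient(s, target):
    """Gradient of 0.5 * (s - target)^2 with respect to the score."""
    return s - target

def logistic_gradient(s, label):
    """Gradient of the log-loss for a label in {0, 1}: sigmoid(s) - label."""
    if s >= 0:
        sigmoid = 1.0 / (1.0 + math.exp(-s))
    else:
        e = math.exp(s)
        sigmoid = e / (1.0 + e)
    return sigmoid - label

def make_data(n, rng, threshold):
    """Hypothetical compounds: two descriptors, a noisy activity, a binary label."""
    X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
    activity = [x[0] - 0.5 * x[1] + rng.gauss(0, 0.1) for x in X]
    labels = [1 if a > threshold else 0 for a in activity]
    return X, activity, labels

def accuracy(predicted, labels):
    return sum(p == t for p, t in zip(predicted, labels)) / len(labels)

rng = random.Random(0)
threshold = 0.0
X_train, activity_train, labels_train = make_data(40, rng, threshold)
X_test, _, labels_test = make_data(200, rng, threshold)

# Indirect: regress on the measured activity, then compare with the threshold.
w_reg, b_reg = fit_linear_score(X_train, activity_train, squared_gradient)
indirect = [1 if score(w_reg, b_reg, x) > threshold else 0 for x in X_test]

# Direct: train on the binary labels only; the sign of the score is the class.
w_clf, b_clf = fit_linear_score(X_train, labels_train, logistic_gradient)
direct = [1 if score(w_clf, b_clf, x) > 0.0 else 0 for x in X_test]

acc_indirect = accuracy(indirect, labels_test)
acc_direct = accuracy(direct, labels_test)
print("indirect accuracy:", acc_indirect)
print("direct accuracy:", acc_direct)
```

On this easy synthetic problem the two routes are expected to behave similarly; the abstract's argument concerns harder cases, where fitting the exact property value spends model capacity on detail that the threshold decision does not need.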

    Regression und Klassifikation biochemischer Systeme mit Hilfe der DemPRED-Bibliothek (Regression and Classification of Biochemical Systems Using the DemPRED Library)

    Contents:
    1 Introduction
    2 Publications
    2.1 Predicting human volume of distribution and clearance of drugs using automated feature selection
    2.2 Prediction using step-wise L1, L2 regularization and feature selection for small data sets with large number of features
    2.3 Exploring classification strategies with the CoEPrA 2006 contest
    2.4 Predicting Protein Complex Geometries with Linear Scoring Functions
    3 Discussion
    4 Summary
    5 Summary in German
    Statutory Declaration
    References

    Summary: In silico predictions of particular properties of biologically active molecules can dramatically reduce the time and cost needed to measure these properties in a wet lab. Nevertheless, implementing state-of-the-art prediction techniques from scratch requires expert knowledge of machine learning methods and substantial programming skills. Hence, there is a demand for powerful yet easy-to-use libraries that users can employ and extend to build their own models for a particular prediction task. During my PhD I developed such a library, called DemPRED. The core of DemPRED consists of a linear scoring function. This scoring function can be combined with various loss functions, which makes DemPRED suitable for both classification and regression. In cases where a linear model is not flexible enough, DemPRED makes use of the kernel trick to transform the linear core into a nonlinear one. DemPRED contains many additional routines that help users generate reliable prediction models. These include various quality measures as well as resampling strategies and routines for saving and loading generated models. DemPRED also includes various regularization and feature selection strategies, which make the library especially suitable for prediction tasks where few observations are described by thousands of descriptors. The object-oriented implementation of DemPRED allows users to extend the built-in routines and replace them with their own. During my PhD I successfully used DemPRED on various classification and regression problems, such as predicting major histocompatibility complex II (MHC II) epitopes, predicting human volume of distribution and clearance, and detecting protein interface regions. The predictive power of all generated models was as good as or better than that of other state-of-the-art classification and regression techniques.

    Summary in German (translated): Despite advanced measurement techniques, determining molecular properties can be very time-consuming and expensive for most biochemical processes. This is especially true when the properties of large molecular databases are to be investigated. To speed up the measurement process, laboratory experiments are now increasingly complemented by computer-based prediction methods. In this way, even large databases can be examined in a fraction of the time otherwise required in the laboratory. Without suitable tools, however, generating a meaningful computer-based prediction model can likewise be complicated and time-consuming. For this reason there is a demand for easy-to-use and extensible program libraries that provide the basic functions for generating prediction models. During my PhD I developed such a library, called DemPRED. At its core, DemPRED is based on a linear model that can be combined with various loss functions. In cases where a linear model does not provide the necessary flexibility, DemPRED can be extended to a nonlinear model by means of the kernel trick. The DemPRED library also offers a number of additional functions that help the user generate good prediction models. During my PhD I used DemPRED to predict a wide variety of biochemical processes. Among other things, I developed models for predicting MHC II binding epitopes, human distribution and clearance coefficients, and protein interaction surfaces. The quality of the generated prediction models was in most cases better than, or at least comparable to, that of other techniques used so far.
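The summary mentions using the kernel trick to turn a linear scoring function into a nonlinear one. The sketch below illustrates that general idea with a kernel perceptron; it is not DemPRED's API or algorithm, and every name in it (`rbf_kernel`, `train_kernel_perceptron`, `predict`) is hypothetical. The score is written in dual form, as a weighted sum of kernel evaluations against the training points, so replacing the linear kernel with an RBF kernel changes the decision boundary without changing the training loop. The XOR pattern is used because no linear score can separate it.

```python
import math

def linear_kernel(a, b):
    """Ordinary dot product; with this kernel the score is a plain linear function."""
    return sum(ai * bi for ai, bi in zip(a, b))

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel; gamma = 1.0 is an arbitrary illustrative choice."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def dual_score(alpha, X, y, kernel, x_new):
    """Score in dual form: sum_i alpha_i * y_i * K(x_i, x_new). No bias term."""
    return sum(a * yi * kernel(xi, x_new) for a, xi, yi in zip(alpha, X, y))

def train_kernel_perceptron(X, y, kernel, epochs=20):
    """Perceptron updates in dual form: count the mistakes made on each point."""
    alpha = [0] * len(X)
    for _ in range(epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(X, y)):
            if yi * dual_score(alpha, X, y, kernel, xi) <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break  # every training point is on the correct side
    return alpha

def predict(alpha, X, y, kernel, x_new):
    return 1 if dual_score(alpha, X, y, kernel, x_new) > 0 else -1

# XOR-style labels in {-1, +1}: not separable by any linear score.
X = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
y = [-1, -1, 1, 1]

alpha = train_kernel_perceptron(X, y, rbf_kernel)
predictions = [predict(alpha, X, y, rbf_kernel, x) for x in X]
print("dual coefficients:", alpha)
print("predictions:", predictions)
```

Passing `linear_kernel` instead of `rbf_kernel` recovers a purely linear model, which is the sense in which the kernel leaves the "linear core" untouched while changing what it can represent.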