
    A machine learning heuristic to identify biologically relevant and minimal biomarker panels from omics data

    Background: Investigations into novel biomarkers using omics techniques generate large amounts of data. Due to their size and number of attributes, these data are suitable for analysis with machine learning methods. A key component of typical machine learning pipelines for omics data is feature selection, which reduces the raw high-dimensional data to a tractable number of features. Feature selection must balance the objective of using as few features as possible against maintaining high predictive power. This balance is crucial when the goal of data analysis is the identification of highly accurate but small panels of biomarkers with potential clinical utility. In this paper we propose a heuristic for the selection of very small feature subsets, via an iterative feature elimination process guided by rule-based machine learning, called RGIFE (Rule-guided Iterative Feature Elimination). We use this heuristic to identify putative biomarkers of osteoarthritis (OA), articular cartilage degradation and synovial inflammation, using both proteomic and transcriptomic datasets. Results and discussion: Our RGIFE heuristic increased the classification accuracies achieved on all datasets compared with using no feature selection, and performed well in a comparison with other feature selection methods. Using this method the datasets were reduced to a smaller number of genes or proteins, including those known to be relevant to OA, cartilage degradation and joint inflammation. The results show the RGIFE feature reduction method to be suitable for analysing both proteomic and transcriptomic data. Methods that generate large ‘omics’ datasets are increasingly being used in the area of rheumatology. Conclusions: Feature reduction methods are advantageous for the analysis of omics data in the field of rheumatology, as the application of such techniques is likely to result in improvements in diagnosis, treatment and drug discovery.
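    The iterative elimination idea can be sketched in a few lines. This is a hypothetical simplification, not the authors' RGIFE implementation: a leave-one-out 1-nearest-neighbour rule stands in for the rule-based learner, and the `rank` function is an assumed, externally supplied feature ranking.

```python
def accuracy(data, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour rule restricted
    to the given feature subset (a stand-in for the rule-based learner)."""
    correct = 0
    for i in range(len(data)):
        best, best_d = None, float("inf")
        for j in range(len(data)):
            if i == j:
                continue
            d = sum((data[i][f] - data[j][f]) ** 2 for f in features)
            if d < best_d:
                best, best_d = j, d
        correct += labels[best] == labels[i]
    return correct / len(data)

def iterative_elimination(data, labels, rank):
    """Repeatedly drop the lowest-ranked feature while accuracy does not
    fall below the accuracy obtained with the full feature set."""
    features = sorted(range(len(data[0])), key=rank, reverse=True)
    baseline = accuracy(data, labels, features)
    while len(features) > 1:
        trial = features[:-1]                 # drop the weakest feature
        if accuracy(data, labels, trial) >= baseline:
            features = trial                  # keep the smaller panel
        else:
            break                             # accuracy dropped: stop
    return features
```

    On a toy dataset where only the first attribute separates the classes, the loop shrinks the panel to that single feature, which is the behaviour the heuristic aims for on omics data.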

    Markov blanket: efficient strategy for feature subset selection method for high dimensionality microarray cancer datasets

    Currently, feature subset selection methods are very important, especially for applications in which datasets with tens or hundreds of thousands of variables (genes) are available. Feature subset selection methods help us select a small number of variables out of thousands of genes in microarray datasets for a more accurate and balanced classification. Efficient gene selection eases the computational load of the subsequent classification task and can yield a subset of genes without loss of classification performance. In classifying microarray data, the main objective of gene selection is to search for the genes that retain the maximum amount of relevant information about the class while minimising classification error. In this paper, we explain the importance of feature subset selection methods in the machine learning and data mining fields. Microarray expression analysis was then used to check whether global biological differences underlie common pathological features in different types of cancer datasets, and to identify genes that might anticipate the clinical behaviour of the disease. Gene expression data contain large amounts of raw measurements that need to be analysed to obtain useful information for specific biological and medical applications. One way of finding relevant (and removing redundant) genes is to use a Bayesian network based on the Markov blanket [1]. We present and compare the performance of different approaches to feature (gene) subset selection based on wrapper and Markov blanket models for five microarray cancer datasets. The first approach uses memetic algorithms (MAs) for feature selection.
The second approach uses MRMR (Minimum Redundancy Maximum Relevance) for feature subset selection, hybridised with genetic search optimisation, and afterwards compares the Markov blanket model’s performance with the most common classical classification algorithms on the selected set of features. For the memetic algorithm, we present a comparison between two embedded approaches to feature subset selection: the wrapper filter feature selection algorithm (WFFSA) and the Markov Blanket Embedded Genetic Algorithm (MBEGA). The memetic algorithm relies on genetic operators (crossover, mutation) and a dedicated local search procedure. For comparisons, we use two evaluation techniques for training and testing: 10-fold cross-validation and bootstrapping (30 resamples). The results for the memetic algorithms clearly show that MBEGA often outperforms WFFSA by yielding more significant differentiation among the different microarray cancer datasets. In the second part of this paper, we focus mainly on MRMR feature subset selection and the Bayesian network based on the Markov blanket (MB) model, which are useful for building a good predictor and defying the curse of dimensionality to improve prediction performance. These methods cover a wide range of concerns: providing a better definition of the objective function, feature construction, feature ranking, efficient search methods, and feature validity assessment, as well as defining the relationships among attributes used to make predictions. We present performance measures for some common (classical) classification algorithms (Naive Bayes, support vector machine [LibSVM], k-nearest neighbour, and AdaBoostM1 ensemble) before and after using the MRMR method. We compare the performance of the Bayesian network classification algorithm based on the Markov blanket model with the performance of these common classification algorithms.
The Bayesian network classifier based on the Markov blanket model achieves higher accuracy rates than the classical classification algorithms on the cancer microarray datasets. Bayesian networks explicitly rely on relationships among attributes to make predictions. In the Markov blanket (MB) classification method, the Markov blanket of the class variable provides all the information necessary for predicting its value. In this paper, we recommend the Bayesian network based on the Markov blanket for learning and classification, as it is highly effective and efficient by feature subset selection measures.
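    The MRMR step can be illustrated with a minimal greedy sketch. Absolute Pearson correlation stands in here for mutual information, and there is no genetic-search hybridisation; both are simplifying assumptions relative to the paper's actual pipeline.

```python
def pearson(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return abs(cov / (vx * vy)) if vx and vy else 0.0

def mrmr(columns, labels, k):
    """Greedy minimum-redundancy maximum-relevance selection: at each
    step pick the feature maximising
    relevance(feature, class) - mean redundancy(feature, selected)."""
    selected = []
    remaining = list(range(len(columns)))
    while remaining and len(selected) < k:
        def score(f):
            relevance = pearson(columns[f], labels)
            redundancy = (sum(pearson(columns[f], columns[s]) for s in selected)
                          / len(selected)) if selected else 0.0
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

    Given two identical informative columns and one noisy one, the criterion picks an informative column first and then prefers the noisy column over the redundant duplicate, which is exactly the redundancy-penalising behaviour MRMR is designed for.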

    Proceedings of the 2nd Int'l Workshop on Enterprise Modelling and Information Systems Architectures - Concepts and Applications (EMISA'07)

    The 2nd International Workshop on “Enterprise Modelling and Information Systems Architectures – Concepts and Applications” (EMISA’07) addresses all aspects relevant to enterprise modelling, as well as to designing enterprise architectures in general and information systems architectures in particular. It was jointly organized by the GI Special Interest Group on Modelling Business Information Systems (GI-SIG MoBIS) and the GI Special Interest Group on Design Methods for Information Systems (GI-SIG EMISA). These proceedings feature a selection of 15 high-quality contributions from academia and practice on enterprise architecture models, business process management, information systems engineering, and other important issues in enterprise modelling and information systems architectures.

    Robust and efficient approach to feature selection with machine learning

    Most statistical analyses or modelling studies must deal with the discrepancy between the measured aspects of the analysed phenomena and their true nature. Hence, they are often preceded by a step that alters the data representation into a form better suited to the methods that follow. This thesis deals with feature selection, a narrow yet important subset of representation-altering methodologies. Feature selection is applied to an information system, i.e., data existing in tabular form as a group of objects characterised by the values of some set of attributes (also called features or variables), and is defined as the process of finding a strict subset of them that fulfils some criterion. There are two essential classes of feature selection methods: minimal optimal, which aim to find the smallest subset of features that optimises the accuracy of certain modelling methods, and all relevant, which aim to find the entire set of features potentially usable for modelling. The first class is mostly used in practice, as it reduces to a well-known optimisation problem and has a direct connection to the final model's performance. However, I argue that there exists a wide and significant class of applications in which only all relevant approaches can yield usable results, while minimal optimal methods are not only ineffective but can even lead to wrong conclusions. Moreover, the all relevant class substantially overlaps with the set of actual research problems in which feature selection is an important result on its own, sometimes even more important than the resulting black-box model.
In particular this applies to p>>n problems, i.e., those for which the number of attributes is large and substantially exceeds the number of objects; for instance, such data are produced by high-throughput biological experiments, which currently serve as the most powerful tool of molecular biology and a foundation of the emerging individualised medicine. In the main part of the thesis I present Boruta, a heuristic, all relevant feature selection method. It is based on the concept of shadows: by-design random attributes incorporated into the information system as a reference for the relevance of the original features in the context of the whole structure of the analysed data. Variable importance itself is assessed using the Random Forest method, a popular ensemble classifier. As the performance of the Boruta method turns out to be unsatisfactory for some important applications, the following chapters of the thesis are devoted to Random Ferns, an ensemble classifier with a structure similar to Random Forest but of substantially higher computational efficiency. In the thesis, I propose a substantial generalisation of this method, capable of training on generic data and of calculating feature importance scores. Finally, I assess both the Boruta method and its Random Ferns-based derivative on a series of p>>n problems of biological origin. In particular, I focus on the stability of feature selection; I propose a novel assessment methodology based on bootstrap and self-consistency.
The results I obtain empirically confirm the validity of the aforementioned effects characteristic of minimal optimal selection, as well as the efficiency of the proposed heuristics for all relevant selection. The thesis is completed with a study of the applicability of Random Ferns to music information retrieval (recognising musical instruments in audio recordings), showing the usefulness of the method in other contexts and proposing its generalisation to multi-label classification problems.
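    The shadow-attribute mechanism at the heart of Boruta can be sketched as a single selection round. This is a toy version: real Boruta uses Random Forest importance and repeated statistical testing of hit counts, whereas here a plain difference of class means stands in for the importance score, and the permutation is supplied as a function (e.g. `random.Random(0).shuffle`).

```python
def importance(column, labels):
    """Toy importance score: absolute difference between the two class
    means (Boruta proper uses Random Forest variable importance)."""
    g0 = [v for v, y in zip(column, labels) if y == 0]
    g1 = [v for v, y in zip(column, labels) if y == 1]
    return abs(sum(g0) / len(g0) - sum(g1) / len(g1))

def boruta_step(columns, labels, shuffle):
    """One Boruta-style round: build a shadow (a permuted copy, which
    destroys any relation to the class) of every feature, then keep only
    the features whose importance exceeds that of the best shadow."""
    shadows = []
    for col in columns:
        shadow = col[:]
        shuffle(shadow)
        shadows.append(shadow)
    threshold = max(importance(s, labels) for s in shadows)
    return [i for i, col in enumerate(columns)
            if importance(col, labels) > threshold]
```

    Because the shadows are randomised, a single round can keep or drop a feature by chance; this is why the full method repeats the round many times and accepts or rejects features based on how often they beat the best shadow.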

    Stable Feature Selection for Biomarker Discovery

    Feature selection techniques have long been the workhorse of biomarker discovery applications. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered; only recently has this issue received more and more attention. In this article, we review existing stable feature selection methods for biomarker discovery using a generic hierarchical framework. We have two objectives: (1) to provide an overview of this new yet fast-growing topic for convenient reference; (2) to categorise existing methods under an expandable framework for future research and development.
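    Selection stability in the sense discussed here can be quantified by rerunning a selector on bootstrap resamples of the data and comparing the returned subsets, for example with the average pairwise Jaccard index. The sketch below is a generic illustration of that protocol, not a specific method from the reviewed literature.

```python
import random
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two feature subsets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def selection_stability(rows, labels, selector, n_boot=20, seed=0):
    """Average pairwise Jaccard index of the feature subsets a selector
    returns over bootstrap resamples: 1.0 means perfectly stable
    selection; values near 0 mean the panel changes with every
    sampling variation."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_boot):
        idx = [rng.randrange(len(rows)) for _ in rows]   # bootstrap resample
        subsets.append(frozenset(selector([rows[i] for i in idx],
                                          [labels[i] for i in idx])))
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

    A selector that returns the same panel regardless of the sample scores exactly 1.0; unstable selectors score lower, which is the signal the reviewed stable-selection methods try to improve.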