    A new boosting design of Support Vector Machine classifiers

    Boosting algorithms attend to the particular structure of the training data by iteratively emphasizing training samples according to how difficult they are to classify correctly. If standard kernel Support Vector Machines (SVMs) are used as base learners in a Real AdaBoost ensemble, the resulting ensemble can easily be compacted into a monolithic architecture by simply combining the weights that correspond to the same kernels when they appear in different learners, so this potential advantage does not increase the computational effort at operation time. In this way, the performance advantage that boosting provides can be obtained for monolithic SVMs, i.e., without paying a classification-time computational cost for using many learners. However, SVMs are both stable and strong, and using them for boosting requires destabilizing and weakening them; previous attempts in this direction have shown only moderate success. In this paper, we propose combining a new, appropriately designed subsampling process with an SVM algorithm that permits sparsity control, in order to overcome the difficulties in boosting SVMs and obtain improved designs. Experimental results support the effectiveness of the approach, not only in performance but also in the compactness of the resulting classifiers, and show that combining both design ideas is needed to arrive at these advantageous designs. This work was supported in part by the Spanish MICINN under Grants TEC 2011-22480 and TIN 2011-24533.
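
    As a rough illustration of the idea rather than the authors' exact algorithm, the sketch below (assuming scikit-learn, labels in {-1, +1}, a shared RBF kernel width and discrete AdaBoost-style learner weights) boosts weakened SVMs trained on emphasis-weighted subsamples and compacts every round into a single kernel expansion by summing the dual coefficients attached to the same training points.

        import numpy as np
        from sklearn.svm import SVC

        def boosted_svm(X, y, rounds=10, sub_frac=0.3, C=1.0, gamma=0.5, seed=0):
            """Boost subsampled RBF-SVMs and fold them into one kernel expansion."""
            rng = np.random.default_rng(seed)
            n = len(y)
            w = np.full(n, 1.0 / n)             # emphasis weights over training samples
            coef = np.zeros(n)                  # combined per-sample kernel coefficients
            bias = 0.0
            for _ in range(rounds):
                idx = rng.choice(n, size=int(sub_frac * n), replace=False, p=w / w.sum())
                svm = SVC(C=C, kernel="rbf", gamma=gamma).fit(X[idx], y[idx])
                pred = svm.predict(X)
                err = np.clip(np.average(pred != y, weights=w), 1e-10, 1 - 1e-10)
                alpha = 0.5 * np.log((1 - err) / err)        # learner weight
                # All learners share one kernel, so their dual coefficients can be
                # merged into a single expansion over the original training points.
                coef[idx[svm.support_]] += alpha * svm.dual_coef_.ravel()
                bias += alpha * svm.intercept_[0]
                w *= np.exp(-alpha * y * pred)               # re-emphasize hard samples
                w /= w.sum()
            return coef, bias

        def monolithic_predict(coef, bias, X_train, X_new, gamma=0.5):
            # Single decision function: sign(sum_i coef_i * k(x_i, x) + bias).
            d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
            return np.sign(np.exp(-gamma * d2) @ coef + bias)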

    A Taxonomy of Big Data for Optimal Predictive Machine Learning and Data Mining

    Big data comes in various ways, types, shapes, forms and sizes. Indeed, almost all areas of science, technology, medicine, public health, economics, business, linguistics and social science are bombarded by ever-increasing flows of data begging to be analyzed efficiently and effectively. In this paper, we propose a rough idea of a possible taxonomy of big data, along with some of the most commonly used tools for handling each particular category of bigness. The dimensionality p of the input space and the sample size n are usually the main ingredients in the characterization of data bigness. The specific statistical machine learning technique used to handle a particular big data set will depend on where it falls within the bigness taxonomy. Large p, small n data sets, for instance, require a different set of tools from the large n, small p variety. Among other tools, we discuss Preprocessing, Standardization, Imputation, Projection, Regularization, Penalization, Compression, Reduction, Selection, Kernelization, Hybridization, Parallelization, Aggregation, Randomization, Replication and Sequentialization. It is important to emphasize right away that the so-called no free lunch theorem applies here, in the sense that there is no universally superior method that outperforms all other methods on all categories of bigness. It is also important to stress that simplicity, in the sense of Ockham's razor and its non-plurality principle of parsimony, tends to reign supreme when it comes to massive data. We conclude with a comparison of the predictive performance of some of the most commonly used methods on a few data sets. Comment: 18 pages, 2 figures, 3 tables.
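
    As a toy illustration of the taxonomy point (assuming a recent scikit-learn; the thresholds and estimator choices are invented for the example), the dispatch below standardizes and applies heavy L1 penalization when p exceeds n, and switches to an incremental stochastic-gradient learner when n is very large.

        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.impute import SimpleImputer
        from sklearn.linear_model import LogisticRegression, SGDClassifier

        def pick_pipeline(n, p):
            """Crude dispatch on data bigness: penalize when p >> n, stream when n is huge."""
            if p > n:                  # large p, small n: impute, standardize, penalize hard
                return make_pipeline(SimpleImputer(), StandardScaler(),
                                     LogisticRegression(penalty="l1", C=0.1,
                                                        solver="saga", max_iter=5000))
            if n > 1_000_000:          # large n, small p: incremental stochastic updates
                return make_pipeline(SimpleImputer(), StandardScaler(),
                                     SGDClassifier(loss="log_loss", alpha=1e-4))
            return make_pipeline(SimpleImputer(), StandardScaler(),
                                 LogisticRegression(max_iter=1000))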

    Visual diagnosis of tree boosting methods

    Tree boosting, which combines weak learners (typically decision trees) to generate a strong learner, is a highly effective and widely used machine learning method. However, developing a high-performance tree boosting model is a time-consuming process that requires numerous trial-and-error experiments. To tackle this issue, we have developed a visual diagnosis tool, BOOSTVis, to help experts quickly analyze and diagnose the training process of tree boosting. In particular, we have designed a temporal confusion matrix visualization and combined it with a t-SNE projection and a tree visualization. These visualization components work together to provide a comprehensive overview of a tree boosting model and enable effective diagnosis of an unsatisfactory training process. Two case studies conducted on the Otto Group Product Classification Challenge dataset demonstrate that BOOSTVis can provide informative feedback and guidance to improve understanding and diagnosis of tree boosting algorithms.
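
    BOOSTVis itself is a visual tool; the fragment below only sketches, with scikit-learn, the kind of data such views could be built from: a confusion matrix for every boosting iteration (the temporal confusion matrix) and a t-SNE embedding of the validation samples.

        import numpy as np
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.metrics import confusion_matrix
        from sklearn.manifold import TSNE

        def temporal_confusion_matrices(X_train, y_train, X_val, y_val, n_estimators=100):
            """One confusion matrix per boosting iteration, plus a 2-D sample embedding."""
            model = GradientBoostingClassifier(n_estimators=n_estimators).fit(X_train, y_train)
            classes = np.unique(y_train)
            matrices = [confusion_matrix(y_val, pred, labels=classes)
                        for pred in model.staged_predict(X_val)]    # evolves over iterations
            embedding = TSNE(n_components=2).fit_transform(X_val)   # layout for the class view
            return np.stack(matrices), embedding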

    Robust machine learning system to predict failure in a virtualized environment

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. Includes bibliographical references (leaves 49-50). The research in this work addresses the need for a warning system to predict future application failures. PreCog, the predictive and regressional error correlating guide system, aims to aid administrators by providing a robust future-failure warning system statistically induced from past system behavior. In this work, we show that with machine learning techniques such as Adaptive Boosting and Correlation-based Feature Selection, PreCog, without any prior knowledge of its target, can be accurately and reliably trained within a virtual environment using past system metrics to predict future application failures in a variety of domains. By Adam Rogal. M.Eng.
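
    A minimal sketch of the named ingredients rather than PreCog itself: univariate feature scoring (a simple stand-in for Correlation-based Feature Selection) followed by Adaptive Boosting over the selected system metrics; scikit-learn, the horizon-based failure label and k=20 are assumptions of the example.

        from sklearn.ensemble import AdaBoostClassifier
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.pipeline import make_pipeline

        # X: windows of system metrics (CPU, memory, I/O counters, ...);
        # y: 1 if an application failure follows within the prediction horizon, else 0.
        model = make_pipeline(
            SelectKBest(score_func=f_classif, k=20),  # keep metrics most associated with failure
            AdaBoostClassifier(n_estimators=200),     # Adaptive Boosting over selected metrics
        )
        # model.fit(X_train, y_train); model.predict(X_live) flags windows likely to fail.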

    Predicting Pulsars from Imbalanced Dataset with Hybrid Resampling Approach

    Pulsar stars, usually neutron stars, are spherical, compact objects containing a large quantity of mass. Each pulsar possesses a magnetic field and emits a slightly different pattern of electromagnetic radiation, which is used to identify potential candidates for a real pulsar star. Pulsars are considered an important cosmic phenomenon, and scientists use them to study nuclear physics, gravitational waves, and collisions between black holes. Automating the detection of pulsar stars can accelerate their study. This study presents an accurate and efficient approach to true pulsar detection using supervised machine learning. The experiments use the High Time Resolution Universe survey (HTRU2) dataset. To resolve the data imbalance problem and avoid model overfitting, a hybrid resampling approach is presented. Experiments are performed on the imbalanced and the balanced datasets using well-known machine learning algorithms. The results demonstrate that the proposed hybrid resampling approach is highly effective in avoiding model overfitting and increasing prediction accuracy. With the proposed hybrid resampling approach, the extra trees classifier achieves a 0.993 accuracy score for true pulsar star prediction.
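
    The exact hybrid combination is not spelled out above; as one plausible reading, the sketch below (assuming imbalanced-learn and scikit-learn) pairs SMOTE oversampling of the pulsar class with Tomek-link cleaning before an extra trees classifier.

        from imblearn.combine import SMOTETomek
        from imblearn.pipeline import make_pipeline
        from sklearn.ensemble import ExtraTreesClassifier

        # X: the eight HTRU2 candidate statistics; y: 1 for confirmed pulsars, 0 for noise.
        model = make_pipeline(
            SMOTETomek(random_state=0),                             # oversample, then clean Tomek links
            ExtraTreesClassifier(n_estimators=300, random_state=0),
        )
        # model.fit(X_train, y_train) resamples only the training folds, avoiding leakage.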

    Essays on economic forecasting using machine learning

    This thesis studies the additional value introduced by different machine learning methods to economic forecasting. Flexible machine learning methods can discover various complex relationships in data and are well suited for analysing so-called big data and the potential problems therein. Several new extensions to existing machine learning methods are proposed from the viewpoint of economic forecasting. In Chapter 2, the main objective is to predict U.S. economic recession periods with a high-dimensional dataset. A cost-sensitive extension to the gradient boosting machine learning algorithm is proposed, which takes into account the scarcity of recession periods. The results show how the cost-sensitive extension outperforms the traditional gradient boosting model and leads to more accurate recession forecasts. Chapter 3 considers a variety of different machine learning methods for predicting daily returns of the S&P 500 stock market index. A new multinomial approach is suggested, which allows us to focus on predicting the large absolute returns instead of the noisy variation around zero return. In terms of both the statistical and the economic evaluation criteria, gradient boosting turns out to be the best-performing machine learning method. In Chapter 4, the asset allocation decisions between risky and risk-free assets are determined using a flexible utility-maximization-based approach. Instead of the commonly used two-step approach, where portfolio weights are based on excess return predictions obtained with statistical predictive regressions, the optimal weights are found directly by incorporating a custom objective function into the gradient boosting algorithm. The empirical results using monthly U.S. market returns show that the utility-based approach leads to substantial and quantitatively meaningful economic value over the previous approaches.
    This dissertation examines the additional value that machine learning methods can bring to economic forecasting applications. Flexible machine learning methods can model complex functional forms and are well suited to analysing big data, i.e., very large datasets. The dissertation extends machine learning methods specifically from the standpoint of economic forecasting applications. Chapter 2 predicts U.S. recession periods using a very large set of predictors. The gradient boosting machine learning method is extended to account for a key characteristic of the data, namely that recession periods occur rather rarely, since the economy is in expansion most of the time. The results show that the extended gradient boosting method predicts future recession months considerably more accurately than traditional methods. Chapter 3 employs several different machine learning methods to predict daily returns of the S&P 500 stock market index. In contrast to earlier approaches, the returns are categorized into three classes in order to focus on predicting the more informative large positive and negative returns. Based on the results, gradient boosting turns out to be the best method according to both statistical and economic forecasting criteria. Chapter 4 examines how, instead of the traditional two-step approach based on return forecasts, the allocation decision between a risky and a risk-free asset can be formed directly from the utility experienced by the investor. The utility maximization uses the gradient boosting method and the custom objective function it allows. Empirical results based on U.S. data show that the utility-based portfolio allocation leads to more profitable allocation decisions than the traditional two-step approach.
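
    A minimal sketch of the Chapter 4 idea, not the thesis code: a custom objective for gradient boosting (here XGBoost) that maximizes a per-period mean-variance utility, mapping the boosting score to a risky-asset weight through a sigmoid; the utility form, the weight mapping and the risk-aversion value gamma are illustrative assumptions.

        import numpy as np
        import xgboost as xgb

        gamma = 5.0  # assumed relative risk aversion

        def utility_objective(preds, dtrain):
            """Gradient/Hessian of the negative mean-variance utility of the portfolio return."""
            r = dtrain.get_label()               # realized excess returns
            w = 1.0 / (1.0 + np.exp(-preds))     # map raw score to a weight in (0, 1)
            p = w * r                            # portfolio excess return
            dw = w * (1.0 - w)                   # d weight / d score
            grad = -(1.0 - gamma * p) * r * dw   # d(-utility) / d score
            hess = gamma * (r * dw) ** 2 - (1.0 - gamma * p) * r * (1.0 - 2.0 * w) * dw
            return grad, np.maximum(hess, 1e-6)  # keep the Hessian positive for stability

        # X: predictor variables; r: next-period excess market returns.
        # dtrain = xgb.DMatrix(X, label=r)
        # booster = xgb.train({"max_depth": 2, "eta": 0.05}, dtrain,
        #                     num_boost_round=200, obj=utility_objective)
        # weights = 1.0 / (1.0 + np.exp(-booster.predict(dtrain)))  # risky-asset allocation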

    Bankruptcy Prediction by Deep Learning and Machine Learning Methods

    Bankruptcy prediction plays a crucial role in helping today's businesses survive in a competitive world. To avoid the risk of bankruptcy, researchers have conducted significant research in the field of artificial intelligence for predicting bankruptcy. However, the performance of deep learning methods in this setting is not well understood. To address this research gap, we make the following main contributions. We applied deep learning methods to the Polish bankruptcy datasets in addition to traditional machine learning techniques. We applied several versions of convolutional neural networks and artificial neural networks to several datasets created from the available data: specifically, we created five extra datasets, one for each year, in addition to the entire dataset covering five years. We incorporated techniques to balance the datasets and measured the impact these techniques have on performance measures. This step is important because the datasets are imbalanced, i.e., the proportion of firms experiencing bankruptcy is much lower than that of firms that did not go bankrupt. For the deep learning techniques, we also explored preprocessing approaches and measured their impact on the results. Specifically, we used validation on the same datasets as studies in the literature and compared our results with those reported in the literature on the same test bed. Our results shed light on the impact of preprocessing and balancing techniques in deep learning, as well as on different architectures for deep learning methods. We observed improvements over the literature in terms of accuracy and provide insights on the value of different deep learning architectures and of preprocessing on the sensitivity of the results.
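
    The paper's architectures are not reproduced here; the sketch below, assuming TensorFlow/Keras, shows one way to pair a small fully connected network on the bankruptcy indicators with class weighting as the balancing technique; the layer sizes and training settings are placeholders.

        from tensorflow import keras

        def build_model(n_features):
            """Small fully connected network for the tabular bankruptcy indicators."""
            model = keras.Sequential([
                keras.Input(shape=(n_features,)),
                keras.layers.Dense(64, activation="relu"),
                keras.layers.Dropout(0.3),
                keras.layers.Dense(32, activation="relu"),
                keras.layers.Dense(1, activation="sigmoid"),   # probability of bankruptcy
            ])
            model.compile(optimizer="adam", loss="binary_crossentropy",
                          metrics=[keras.metrics.AUC()])
            return model

        # Class weights counteract the imbalance (few bankrupt firms) instead of resampling:
        # n_pos, n_neg = y_train.sum(), len(y_train) - y_train.sum()
        # model = build_model(X_train.shape[1])
        # model.fit(X_train, y_train, epochs=30, batch_size=128,
        #           class_weight={0: 1.0, 1: n_neg / max(n_pos, 1)})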

    Machine learning and computational methods to identify molecular and clinical markers for complex diseases – case studies in cancer and obesity

    In biomedical research, applied machine learning and bioinformatics are the essential disciplines heavily involved in translating data-driven findings into medical practice. This task is accomplished especially by developing computational tools and algorithms that assist in detecting and clarifying the underlying causes of disease. The continuous advancements in high-throughput technologies, coupled with recently promoted data-sharing policies, have contributed to a massive wealth of data with remarkable potential to improve human health care. In concordance with this massive boost in data production, innovative data analysis tools and methods are required to meet the growing demand. The data analyzed by bioinformaticians and computational biology experts can be broadly divided into molecular and conventional clinical data categories. The aim of this thesis was to develop novel statistical and machine learning tools, and to incorporate existing state-of-the-art methods, to analyze bio-clinical data with medical applications. The findings of the studies demonstrate the impact of computational approaches on clinical decision making by improving patients' risk stratification and the prediction of disease outcomes. This thesis comprises five studies covering method development for 1) genomic data, 2) conventional clinical data and 3) integration of genomic and clinical data. With genomic data, the main focus is the detection of differentially expressed genes, the most common task in transcriptome profiling projects. In addition to reviewing available differential expression tools, a data-adaptive statistical method called Reproducibility Optimized Test Statistic (ROTS) is proposed for detecting differential expression in RNA-sequencing studies. To demonstrate the efficacy of ROTS in real biomedical applications, the method is used to identify prognostic markers in clear cell renal cell carcinoma (ccRCC). In addition to previously known markers, novel genes with a potential prognostic and therapeutic role in ccRCC are detected. For conventional clinical data, ensemble-based predictive models are developed to provide clinical decision support in the treatment of patients with metastatic castration-resistant prostate cancer (mCRPC). The proposed predictive models cover treatment and survival stratification tasks for both trial-based and real-world patient cohorts. Finally, genomic and conventional clinical data are integrated to demonstrate the importance of including genomic data in the predictive ability of clinical models. Again utilizing ensemble-based learners, a novel model is proposed to predict adulthood obesity using both genetic and social-environmental factors. Overall, the ultimate objective of this work is to demonstrate the importance of clinical bioinformatics and machine learning for bio-clinical marker discovery in complex diseases with high heterogeneity. In the case of cancer, the interpretability of clinical models strongly depends on predictive markers with high reproducibility supported by validation data. The discovery of such markers would increase the chance of early detection and improve prognosis assessment and treatment choice.
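
    As a loose illustration of the integration theme rather than the thesis pipeline, the sketch below uses scikit-learn to preprocess hypothetical clinical and genomic feature blocks separately and feed them to an ensemble learner; the column names are invented and X is assumed to be a pandas DataFrame.

        from sklearn.compose import ColumnTransformer
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        clinical_cols = ["age", "bmi", "psa"]            # hypothetical clinical variables
        genomic_cols = [f"gene_{i}" for i in range(50)]  # hypothetical expression features

        preprocess = ColumnTransformer([
            ("clinical", StandardScaler(), clinical_cols),
            ("genomic", StandardScaler(), genomic_cols),
        ])
        model = make_pipeline(preprocess, RandomForestClassifier(n_estimators=500, random_state=0))
        # model.fit(X, y) with y encoding the outcome (e.g. survival group or adulthood obesity).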