2,479 research outputs found

    A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics

    The combination of multiple classifiers using ensemble methods is increasingly important for making progress in a variety of difficult prediction problems. We present a comparative analysis of several ensemble methods through two case studies in genomics, namely the prediction of genetic interactions and protein functions, to demonstrate their efficacy on real-world datasets and draw useful conclusions about their behavior. These methods include simple aggregation, meta-learning, cluster-based meta-learning, and ensemble selection using heterogeneous classifiers trained on resampled data to improve the diversity of their predictions. We present a detailed analysis of these methods across four genomics datasets and find that the best of them offer statistically significant improvements over the state of the art in their respective domains. In addition, we establish a novel connection between ensemble selection and meta-learning, demonstrating how both of these disparate methods strike a balance between ensemble diversity and performance.
    Comment: 10 pages, 3 figures, 8 tables; to appear in Proceedings of the 2013 International Conference on Data Mining
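    As an illustration of the ensemble selection idea this abstract describes, here is a minimal sketch in Python, assuming scikit-learn: a library of heterogeneous classifiers, each trained on a bootstrap resample, is greedily combined to maximize validation AUC. The model families, library size, and selection metric are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Build a diverse library: several model families, each fit on a bootstrap resample.
rng = np.random.default_rng(0)
library = []
for make_model in (lambda: LogisticRegression(max_iter=1000),
                   lambda: DecisionTreeClassifier(max_depth=5),
                   lambda: GaussianNB()):
    for _ in range(5):
        idx = rng.integers(0, len(X_train), len(X_train))  # bootstrap sample
        library.append(make_model().fit(X_train[idx], y_train[idx]))

# Greedy forward selection with replacement: repeatedly add whichever model
# most improves the validation AUC of the averaged ensemble prediction.
val_preds = [m.predict_proba(X_val)[:, 1] for m in library]
ensemble_sum, n_picked = np.zeros(len(y_val)), 0
for _ in range(10):
    scores = [roc_auc_score(y_val, (ensemble_sum + p) / (n_picked + 1))
              for p in val_preds]
    best = int(np.argmax(scores))
    ensemble_sum += val_preds[best]
    n_picked += 1
print("validation AUC:", roc_auc_score(y_val, ensemble_sum / n_picked))
```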

    Statistics in the Big Data era

    It is estimated that about 90% of the currently available data have been produced over the last two years. Of these, only 0.5% is effectively analysed and used. Yet these data can be a source of great wealth, the oil of the 21st century, when analysed with the right approach. In this article, we illustrate some specific features of these data and the great interest they can represent in many fields. We then consider some of the challenges their analysis poses for statistics and suggest some strategies.

    Variable selection with Random Forests for missing data

    Variable selection has been suggested for Random Forests to improve the efficiency of prediction and interpretation. However, its basic ingredient, the variable importance measure, cannot be computed straightforwardly when data are missing. An extensive simulation study was therefore conducted to explore possible solutions, namely multiple imputation, complete case analysis, and a newly suggested importance measure, under several missing-data generating processes. The ability of these procedures to distinguish relevant from non-relevant variables was investigated in combination with two popular variable selection methods. Findings and recommendations: complete case analysis should not be applied, as it led to inaccurate variable selection and to the models with the worst prediction accuracy. Multiple imputation is a good means to select variables that would be relevant in fully observed data, and it produced the best prediction accuracy. By contrast, the new importance measure selects variables that reflect the actual data situation, i.e. it takes the occurrence of missing values into account; its prediction error was only negligibly worse than that of imputation.
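    A minimal sketch of the comparison this abstract describes, assuming scikit-learn: Random Forest variable importances are computed under complete case analysis and under multiple imputation on synthetically masked data. The paper's newly proposed importance measure is specific to that work and is not reproduced here; the missingness rate and data are illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.1] = np.nan  # MCAR missingness

# Complete case analysis: drop every row containing any missing value.
mask = ~np.isnan(X).any(axis=1)
rf_cc = RandomForestClassifier(random_state=0).fit(X[mask], y[mask])
imp_cc = permutation_importance(rf_cc, X[mask], y[mask], random_state=0)

# Multiple imputation: average importances over several imputed datasets.
imps = []
for seed in range(5):
    X_imp = IterativeImputer(random_state=seed,
                             sample_posterior=True).fit_transform(X)
    rf = RandomForestClassifier(random_state=0).fit(X_imp, y)
    imps.append(permutation_importance(rf, X_imp, y,
                                       random_state=0).importances_mean)

print("complete-case ranking:", np.argsort(imp_cc.importances_mean)[::-1])
print("multiple-imputation ranking:", np.argsort(np.mean(imps, axis=0))[::-1])
```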

    Bagging model with cost sensitive analysis on diabetes data

    Diabetes patients may suffer from an unhealthy life, long-term treatment, and chronic complications. Decreasing the hospitalization rate is a crucial goal for health care centers. This study combines the bagging method, with decision trees as base classifiers, and cost-sensitive analysis to classify diabetes patients. Real patient data collected from a regional hospital in Thailand were analyzed. Relevant factors were selected and used to construct decision tree base classifiers that distinguish diabetic from non-diabetic patients. The bagging method was then applied to improve accuracy. Finally, asymmetric classification cost matrices were used to provide alternative models for the analysis of the diabetes data.
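    A minimal sketch of the pipeline this abstract outlines, assuming scikit-learn: bagged decision trees whose class probabilities are turned into decisions by minimizing expected cost under an asymmetric cost matrix. The cost values and synthetic data are hypothetical stand-ins for the hospital data.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data standing in for the patient records.
X, y = make_classification(n_samples=1500, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Asymmetric costs: cost[i][j] = cost of predicting class j when truth is i.
# Here missing a diabetic case (false negative) is penalized 5x a false alarm.
cost = np.array([[0.0, 1.0],   # true non-diabetic: false positive costs 1
                 [5.0, 0.0]])  # true diabetic:     false negative costs 5
proba = bag.predict_proba(X_te)        # shape (n_samples, 2)
expected_cost = proba @ cost           # expected cost of each decision
y_pred = expected_cost.argmin(axis=1)  # pick the cheaper decision per sample
print("recall on diabetic class:", (y_pred[y_te == 1] == 1).mean())
```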

    Visual Sensing and Defect Detection of Gas Tungsten Arc Welding

    Weld imperfections or defects such as incomplete penetration and lack of fusion (LOF) are critical issues that affect the integrity of welded components. The molten weld pool geometry is the major source of information related to the formation of these defects. In this dissertation, a new visual sensing system has been designed and set up to obtain weld pool images during GTAW. The weld pool's dynamic behavior can be monitored using both active and passive vision methods, with the interference of arc light in the image significantly reduced through a narrow band-pass filter and a laser-based auxiliary light source.
    Computer vision algorithms based on passive vision images were developed to measure the 3D weld pool surface geometry in real time. Specifically, a new method based on the reversed electrode image (REI) was developed to calculate the weld pool surface height in real time, while the 2D weld pool boundary was extracted with landmark detection algorithms. The method was verified with bead-on-plate and butt-joint welding experiments.
    Supervised machine learning was used to develop the capability to predict, in real time, incomplete penetration on thin SS304 plates from key features extracted from the weld pool images. An integrated self-adaptive closed-loop control system, consisting of the non-contact visual sensor, the machine-learning-based defect predictor, and the welding power source, was developed for real-time penetration control in bead-on-plate welding. Moreover, data-driven methods were applied for the first time to detect incomplete penetration and LOF in multi-pass U-groove welding; new features extracted from the reversed electrode image played the most important role in predicting these defects. Finally, real-time welding experiments were conducted to verify the feasibility of the developed models.
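    A minimal sketch of the supervised learning step described above: a classifier predicts incomplete penetration from per-frame weld pool features. The feature set (pool width, pool length, REI-based surface height) follows the abstract, but the synthetic data, labels, and model choice are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
# Hypothetical features extracted from weld pool images, one row per frame:
# [pool width (mm), pool length (mm), REI-based surface height (mm)].
features = rng.normal(loc=[5.0, 8.0, 0.3], scale=[0.8, 1.2, 0.1], size=(n, 3))
# Hypothetical labels: frames with a narrow, shallow pool are taken to be
# incompletely penetrated (1 = incomplete penetration).
labels = ((features[:, 0] < 4.8) & (features[:, 2] < 0.3)).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("cross-validated accuracy:",
      cross_val_score(clf, features, labels, cv=5).mean())
```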