    Segmentation of sales for a mobile phone service through CART classification tree algorithm

    This work applies the CRISP-DM method to identify the groups of customers most likely to migrate from a prepaid to a postpaid plan, in order to formulate an improvement plan for call management by sorting the customer database. Classification models, specifically the CART classification tree algorithm, were applied to analyze the characteristics generated by the purchase of the different services. As a result, groups differentiated by probability of sales success (migration from a prepaid to a postpaid plan) were found. These segments reflect particular needs and characteristics that can be used to design marketing actions aimed at increasing the effectiveness rate, contact information, and sales.
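As a rough illustration of how a CART-style tree can split customers into segments with different migration probabilities, here is a minimal pure-Python sketch using Gini impurity. The two features (monthly top-up amount and calls per month), the eight customer records, and all function names are hypothetical and are not taken from the paper.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, threshold) pair that most reduces weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score - 1e-12:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=2):
    """Grow a binary tree; each leaf stores the share of positives (migrations)."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return {"p": sum(labels) / len(labels), "n": len(labels)}
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def predict_proba(tree, row):
    """Walk the tree and return the leaf's estimated migration probability."""
    while "p" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["p"]

# Hypothetical records: (monthly top-up amount, calls per month); 1 = migrated to postpaid.
rows = [(5, 30), (8, 48), (6, 40), (7, 55), (20, 15), (25, 50), (30, 45), (22, 60)]
labels = [0, 0, 1, 0, 1, 1, 0, 1]

tree = build_tree(rows, labels, max_depth=1)
low = predict_proba(tree, (6, 40))
high = predict_proba(tree, (25, 50))
print(tree["feature"], tree["threshold"], low, high)
```

Each leaf corresponds to one customer segment, and its stored probability is what a sales team could use to prioritize contacts. A real study would use many more records and variables, plus pruning or cross-validation.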

    Principal variable selection to explain grain yield variation in winter wheat from features extracted from UAV imagery

    Background: Automated phenotyping technologies are continually advancing the breeding process. However, collecting various secondary traits throughout the growing season and processing massive amounts of data still take great effort and time. Selecting a minimum number of secondary traits that have the maximum predictive power has the potential to reduce phenotyping efforts. The objective of this study was to select the principal features extracted from UAV imagery and the critical growth stages that contributed the most to explaining winter wheat grain yield. Five dates of multispectral images and seven dates of RGB images were collected by a UAV system during the spring growing season in 2018. Two classes of features (variables), totaling 172 variables, were extracted for each plot from the vegetation index and plant height maps, including pixel statistics and dynamic growth rates. A parametric algorithm, LASSO regression (the least absolute shrinkage and selection operator), and a non-parametric algorithm, random forest, were applied for variable selection. The regression coefficients estimated by LASSO and the permutation importance scores provided by random forest were used to determine the ten most important variables influencing grain yield from each algorithm. Results: Both selection algorithms assigned the highest importance score to the variables related to plant height around the grain filling stage. Some vegetation-index-related variables were also selected by the algorithms, mainly at early to mid growth stages and during senescence. Compared with yield prediction using all 172 variables derived from measured phenotypes, prediction using the selected variables performed comparably or even better. We also noticed that the prediction accuracy on the adapted NE lines (r = 0.58–0.81) was higher than on the other lines (r = 0.21–0.59) included in this study, which had different genetic backgrounds.
Conclusions: With the ultra-high resolution plot imagery obtained by UAS-based phenotyping, we are now able to derive more features, such as the variation of plant height or vegetation indices within a plot rather than just an averaged number, that are potentially very useful for breeding purposes. However, too many features or variables can be derived in this way. The promising results from this study suggest that a selected subset of those variables can achieve grain yield prediction accuracies comparable to those of the full set, while possibly allowing a better allocation of effort and resources in phenotypic data collection and processing.
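To make the LASSO-based selection step more concrete, the sketch below fits LASSO by coordinate descent in pure Python and keeps the variables with nonzero coefficients. The three feature names, the four-plot design matrix, the yield values, and the penalty `alpha` are all hypothetical and chosen only so the result is easy to follow. The random forest permutation importance used in the study is not reproduced here.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_cd(X, y, alpha, n_iter=100):
    """Minimize (1/(2n))*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent.

    Assumes centered y and centered, non-constant columns in X (no intercept).
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0
            z = 0.0
            for i in range(n):
                # Prediction using every feature except j (partial residual).
                pred_wo_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_wo_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n)
    return w

# Hypothetical, already centered features for four plots.
names = ["plant_height_grain_fill", "ndvi_early", "ndvi_senescence"]
X = [
    [1.0, 1.0, 1.0],
    [1.0, -1.0, -1.0],
    [-1.0, 1.0, -1.0],
    [-1.0, -1.0, 1.0],
]
# Hypothetical centered yield: driven mostly by the first feature.
y = [3.1, 2.9, -2.9, -3.1]

w = lasso_cd(X, y, alpha=0.5)
# Rank the surviving variables by absolute coefficient size.
selected = sorted(
    (name for name, coef in zip(names, w) if coef != 0.0),
    key=lambda name: -abs(w[names.index(name)]),
)
print(w, selected)
```

With this toy data the weak and irrelevant features are shrunk exactly to zero, which is the property that makes LASSO usable as a variable selector. In practice the penalty would be tuned by cross-validation rather than fixed by hand.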

    A methodology for exploring biomarker – phenotype associations: application to flow cytometry data and systemic sclerosis clinical manifestations

    BACKGROUND: This work seeks to develop a methodology for identifying reliable biomarkers of disease activity, progression and outcome through the identification of significant associations between high-throughput flow cytometry (FC) data and interstitial lung disease (ILD), a systemic sclerosis (SSc, or scleroderma) clinical phenotype which is the leading cause of morbidity and mortality in SSc. A specific aim of the work involves developing a clinically useful screening tool that could yield accurate assessments of disease state, such as the risk or presence of SSc-ILD, the activity of lung involvement and the likelihood of responding to therapeutic intervention. Ultimately, this instrument could facilitate a refined stratification of SSc patients into clinically relevant subsets at the time of diagnosis and subsequently during the course of the disease, and thus help in preventing bad outcomes from disease progression or unnecessary treatment side effects. The methods utilized in the work involve: (1) clinical and peripheral blood flow cytometry data (Immune Response In Scleroderma, IRIS) from consented patients followed at the Johns Hopkins Scleroderma Center; (2) machine learning (Conditional Random Forests - CRF) coupled with Gene Set Enrichment Analysis (GSEA) to identify subsets of FC variables that are highly effective in classifying ILD patients; and (3) stochastic simulation to design, train and validate ILD risk screening tools. RESULTS: Our hybrid analysis approach (CRF-GSEA) predicted SSc patient ILD status with high accuracy (>82 % correct classification in validation; 79 patients in the training data set, 40 patients in the validation data set). CONCLUSIONS: IRIS flow cytometry data provides useful information in assessing the ILD status of SSc patients. 
Our new approach combining Conditional Random Forests and Gene Set Enrichment Analysis was successful in identifying a subset of flow cytometry variables to create a screening tool that proved effective in correctly identifying ILD patients in the training and validation data sets. From a somewhat broader perspective, the identification of subsets of flow cytometry variables that exhibit coordinated movement (i.e., multi-variable up or down regulation) may lead to insights into possible effector pathways and thereby improve the state of knowledge of systemic sclerosis pathogenesis. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0722-x) contains supplementary material, which is available to authorized users

    Nitrogen mineralized in anaerobiosis as indicator of soil aggregate stability

    Monitoring soil health status is imperative to pursue sustainable agriculture. Aggregate stability (AS) is fundamental to define several soil functions and, therefore, physical soil health. The objectives of this work were to (i) evaluate the effect of contrasting cropping systems on AS, soil (SOC) and particulate (POC) organic carbon, and anaerobic nitrogen (AN) both in bulk soil and in macroaggregates (MA), and (ii) assess the relationship between AS and AN both in bulk soil and in MA to facilitate soil physical health monitoring. Aggregate stability, AN, SOC and POC were evaluated at three depths (0–5, 5–20, and 0–20 cm) in a Mollisol of the Southeastern Argentinean Pampas under a long-term experiment of cropping systems (crop-pasture rotations under conventional tillage [CT] and no-tillage [NT]). Bulk-soil SOC and POC contents and AN showed the effect of cropping systems, especially the effect of crop-pasture rotation and at 0–5 cm depth. However, NT did not lead to SOC sequestration except at 0–5 cm depth. In turn, pastures in the rotation and NT improved AS. Bulk-soil AN explained 75, 41, and 71% of AS at 0–5, 5–20, and 0–20 cm depths, respectively, and provides an indication of AS status. In contrast, AN in MA did not explain bulk-soil AS changes as much as bulk-soil AN, except at 0–5 cm depth. Therefore, it is not worth determining AN in MA. However, routine bulk-soil AN determination at 0–20 cm depth by producers to diagnose nitrogen soil fertility would also provide an additional valuable indication of AS status.
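The "percentage of AS explained by AN" figures read like coefficients of determination. The sketch below shows how such a value could be computed, assuming a simple linear regression of AS on AN; the abstract does not state the model form. The four data points are made up for illustration and are not measurements from the study.

```python
def r_squared(x, y):
    """Fit y = a + b*x by ordinary least squares and return (a, b, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return a, b, 1.0 - ss_res / ss_tot

# Made-up values standing in for anaerobic nitrogen (x) and aggregate stability (y).
an = [1.0, 2.0, 3.0, 4.0]
agg_stability = [2.0, 4.0, 5.0, 9.0]

intercept, slope, r2 = r_squared(an, agg_stability)
print(round(intercept, 3), round(slope, 3), round(r2, 3))
```

An R² of 0.75, for example, would mean that the fitted line accounts for 75% of the variance in aggregate stability across the sampled plots.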