25 research outputs found

    Quantifying the diversity of news around stock market moves

    The dynamics of news are such that some days are dominated by a single story while others see news outlets reporting on a range of different events. While these large-scale features of news are familiar to many, they are often ignored in settings where they may be important in understanding complex decision-making processes, such as in financial markets. In this paper, we use a topic-modeling approach to quantify the changing attentions of a major news outlet, the Financial Times, to issues of interest. Our analysis reveals that the diversity of financial news, as quantified by our method, can improve forecasts of trading volume. We also find evidence which suggests that, while attention in financial news tends to be concentrated on a smaller number of topics following stock market falls, there is a "healthy diversity" of news following upward market movements. We conclude that the diversity of financial news can be a useful forecasting tool, offering early warning signals of increased activity in financial markets.
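    The abstract does not spell out the diversity measure, but a standard way to quantify how concentrated a day's coverage is on few versus many topics is the Shannon entropy of the day's topic distribution (e.g. topic proportions from a topic model such as LDA). A minimal sketch under that assumption:

```python
import math

def topic_diversity(topic_weights):
    """Shannon entropy of a day's topic distribution: low when a single
    story dominates the news, high when attention is spread evenly
    across many topics."""
    total = sum(topic_weights)
    probs = [w / total for w in topic_weights if w > 0]
    return -sum(p * math.log(p) for p in probs)

# A day dominated by one story vs. a day with evenly spread coverage.
concentrated = topic_diversity([0.97, 0.01, 0.01, 0.01])
diverse = topic_diversity([0.25, 0.25, 0.25, 0.25])
```

    The diverse day attains the maximum entropy for four topics (log 4), while the concentrated day scores far lower; tracking this value over time gives the kind of diversity series the paper relates to trading volume.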

    New algorithms in machine learning with applications in personalized medicine

    Thesis: Ph.D., Massachusetts Institute of Technology, Sloan School of Management, Operations Research Center, 2018. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (pages 165-173). Recent advances in machine learning and optimization hold much promise for influencing real-world decision making, especially in areas such as health care where abundant data are increasingly being collected. However, imperfections in the data pose a major challenge to realizing their full potential: missing values, noisy observations, and unobserved counterfactuals all impact the performance of data-driven methods. In this thesis, with a fresh perspective from optimization, I revisit some of the well-known problems in statistics and machine learning, and develop new methods for prescriptive analytics. I show examples of how common machine learning tasks, such as missing data imputation in Chapter 2 and classification in Chapter 3, can benefit from the added edge of rigorous optimization formulations and solution techniques. In particular, the proposed opt.impute algorithm improves imputation quality by 13.7% over state-of-the-art methods, as averaged over 95 real data sets, which leads to further performance gains in downstream tasks. The power of prescriptive analytics is shown in Chapter 4 by our approach to personalized diabetes management, which identifies response patterns using machine learning and individualizes treatments via optimization. These newly developed machine learning algorithms not only demonstrate improved performance in large-scale experiments, but are also applied to solve the problems in health care that motivated them. Our simulated trial for diabetic patients in Chapter 4 demonstrates a clinically relevant reduction in average hemoglobin A1c levels compared to current practice.
    Finally, when predicting mortality for cancer patients in Chapter 5, applying opt.impute to missing data along with the cutting-edge Optimal Classification Tree algorithm on a rich data set prepared from electronic medical records, we are able to accurately risk-stratify patients, providing physicians with interpretable insights and valuable risk estimates at the time of treatment decisions and end-of-life planning. by Ying Daisy Zhuo. Ph.D.
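    The "predict, then optimize" pattern behind the personalized diabetes management work (identify response patterns with machine learning, then individualize treatment via optimization) can be sketched in its simplest form: score each candidate treatment with a fitted predictive model and prescribe the one with the best predicted outcome. The treatment names and the `predict_hba1c` model below are hypothetical stand-ins, not the thesis's actual model:

```python
def prescribe(patient, candidate_treatments, predict_hba1c):
    """Prescriptive step: evaluate a predictive model for each candidate
    treatment and prescribe the one with the lowest predicted HbA1c
    for this specific patient."""
    return min(candidate_treatments, key=lambda t: predict_hba1c(patient, t))

# Toy stand-in model: each treatment shifts predicted HbA1c differently.
effects = {"metformin": -0.6, "insulin": -1.1, "lifestyle": -0.3}
model = lambda patient, t: patient["baseline_hba1c"] + effects[t]

best = prescribe({"baseline_hba1c": 8.4}, list(effects), model)
```

    In the thesis the predictive step is learned from observed response patterns and the prescriptive step is a formal optimization; this sketch only illustrates the division of labor between the two.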

    From predictive methods to missing data imputation: An optimization approach

    Missing data is a common problem in real-world settings and for this reason has attracted significant attention in the statistical literature. We propose a flexible framework based on formal optimization to impute missing data with mixed continuous and categorical variables. This framework can readily incorporate various predictive models including K-nearest neighbors, support vector machines, and decision tree based methods, and can be adapted for multiple imputation. We derive fast first-order methods that obtain high quality solutions in seconds following a general imputation algorithm opt.impute presented in this paper. We demonstrate that our proposed method improves out-of-sample accuracy in large-scale computational experiments across a sample of 84 data sets taken from the UCI Machine Learning Repository. In all scenarios of missing-at-random mechanisms and various missing percentages, opt.impute produces the best overall imputation in most data sets benchmarked against five other methods: mean impute, K-nearest neighbors, iterative KNN, Bayesian PCA, and predictive-mean matching, with an average reduction in mean absolute error of 8.3% against the best cross-validated benchmark method. Moreover, opt.impute leads to improved out-of-sample performance of learning algorithms trained using the imputed data, demonstrated by computational experiments on 10 downstream tasks. For models trained using opt.impute single imputations with 50% data missing, the average out-of-sample R² is 0.339 in the regression tasks and the average out-of-sample accuracy is 86.1% in the classification tasks, compared to 0.315 and 84.4% for the best cross-validated benchmark method. In the multiple imputation setting, downstream models trained using opt.impute obtain a statistically significant improvement over models trained using multivariate imputation by chained equations (mice) in 8/10 missing data scenarios considered.
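    opt.impute itself solves a joint optimization over the missing entries with fast first-order methods; as a rough illustration of the underlying idea with a KNN predictive model, one can alternate between filling the missing entries and recomputing distances on the filled data, so imputations and neighborhoods are improved together rather than in one pass. This simplified sketch is not the paper's algorithm:

```python
import numpy as np

def impute_knn_iterative(X, k=2, n_iter=10):
    """Alternating KNN imputation: start from column means, then
    repeatedly refill each missing entry with the mean of its k
    nearest neighbours, where distances are computed on the
    currently imputed data."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)
    for _ in range(n_iter):
        for i in np.argwhere(missing.any(axis=1)).ravel():
            d = np.linalg.norm(filled - filled[i], axis=1)
            d[i] = np.inf  # a row is never its own neighbour
            neighbours = np.argsort(d)[:k]
            filled[i, missing[i]] = filled[neighbours][:, missing[i]].mean(axis=0)
    return filled

# Row 1 is close to row 0 on its observed features, so its missing
# third value is pulled toward row 0's value of 3.0.
X = [[1.0, 2.0, 3.0], [1.1, 2.1, np.nan], [9.0, 8.0, 7.0], [9.2, 8.1, 6.9]]
X_imp = impute_knn_iterative(X, k=1)
```

    The alternating structure is what distinguishes this family of methods from one-shot mean imputation: each pass refines the distances used to choose neighbours.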

    Interpretable predictive maintenance for hard drives

    Existing machine learning approaches for data-driven predictive maintenance are usually black boxes that claim high predictive power yet cannot be understood by humans. This limits the ability of humans to use these models to derive insights and understanding of the underlying failure mechanisms, and also limits the degree of confidence that can be placed in such a system to perform well on future data. We consider the task of predicting hard drive failure in a data center using recent algorithms for interpretable machine learning. We demonstrate that these methods provide meaningful insights about short- and long-term drive health, while also maintaining high predictive performance. We also show that these analyses still deliver useful insights even when limited historical data is available, enabling their use in situations where data collection has only recently begun.
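    What "interpretable" buys over a black box is that the fitted model can be read directly as a rule. The paper uses recent interpretable-ML algorithms; the simplest member of that family is a one-split decision rule (a decision stump), sketched here on made-up SMART-style attributes:

```python
def fit_stump(X, y, feature_names):
    """Learn a one-split decision rule: try every feature/threshold pair
    and keep the split with the fewest misclassifications. Unlike a
    black-box model, the result is a human-readable rule."""
    best = None
    for j, name in enumerate(feature_names):
        for thr in sorted({row[j] for row in X}):
            preds = [1 if row[j] > thr else 0 for row in X]
            errors = sum(p != t for p, t in zip(preds, y))
            if best is None or errors < best[0]:
                best = (errors, name, thr)
    errors, name, thr = best
    return f"predict FAIL if {name} > {thr}"

# Toy drive data: (reallocated_sectors, temperature), label 1 = failed.
X = [(0, 35), (2, 40), (120, 38), (300, 36)]
y = [0, 0, 1, 1]
rule = fit_stump(X, y, ["reallocated_sectors", "temperature"])
```

    On this toy data the learned rule implicates reallocated sectors rather than temperature, which is exactly the kind of failure-mechanism insight a black box cannot surface.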

    Benchmarking in Congenital Heart Surgery Using Machine Learning-Derived Optimal Classification Trees

    Get PDF
    Background: We have previously shown that the machine learning methodology of optimal classification trees (OCTs) can accurately predict risk after congenital heart surgery (CHS). We have now applied this methodology to define benchmarking standards after CHS, permitting case-adjusted hospital-specific performance evaluation. Methods: The European Congenital Heart Surgeons Association Congenital Database data subset (31 792 patients) who had undergone any of the 10 “benchmark procedure group” primary procedures were analyzed. OCT models were built predicting hospital mortality (HM), and prolonged postoperative mechanical ventilatory support time (MVST) or length of hospital stay (LOS), thereby establishing case-adjusted benchmarking standards reflecting the overall performance of all participating hospitals, designated as the “virtual hospital.” These models were then used to predict individual hospitals’ expected outcomes (both aggregate and, importantly, for risk-matched patient cohorts) for their own specific cases and case-mix, based on OCT analysis of aggregate data from the “virtual hospital.” Results: The raw average rates were HM = 4.4%, MVST = 15.3%, and LOS = 15.5%. Of 64 participating centers, in comparison with each hospital's specific case-adjusted benchmark, 17.0% statistically (under 90% confidence intervals) overperformed and 26.4% underperformed with respect to the predicted outcomes for their own specific cases and case-mix. For MVST and LOS, overperformers were 34.0% and 26.4%, and underperformers were 28.3% and 43.4%, respectively. OCT analyses reveal hospital-specific patient cohorts of either overperformance or underperformance. Conclusions: OCT benchmarking analysis can assess hospital-specific case-adjusted performance after CHS, both overall and patient cohort-specific, serving as a tool for hospital self-assessment and quality improvement.
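    The core of case-adjusted benchmarking is comparing a hospital's observed outcomes to the count expected under the "virtual hospital" model for that hospital's own case-mix. The abstract does not give the exact interval construction, so the sketch below uses a simple normal approximation to the sum of per-case Bernoulli risks, with a 90% interval as in the abstract:

```python
import math

def benchmark(predicted_risks, observed_deaths, z=1.645):
    """Case-adjusted benchmarking sketch: sum each case's model-predicted
    mortality risk to get the expected death count for the hospital's
    case-mix, then flag the hospital if its observed count falls outside
    a ~90% normal-approximation interval around that expectation."""
    expected = sum(predicted_risks)
    var = sum(p * (1 - p) for p in predicted_risks)
    half = z * math.sqrt(var)
    if observed_deaths < expected - half:
        return "overperforms"
    if observed_deaths > expected + half:
        return "underperforms"
    return "as expected"

# Hospital with 100 cases at 5% predicted risk each: expected = 5 deaths.
risks = [0.05] * 100
status = benchmark(risks, observed_deaths=1)
```

    Because the expectation is built from the hospital's own cases, a center treating higher-risk patients is judged against a correspondingly higher expected count, which is what makes the comparison case-adjusted.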

    Adverse Outcomes Prediction for Congenital Heart Surgery: A Machine Learning Approach

    Get PDF
    Objective: Risk assessment tools typically used in congenital heart surgery (CHS) assume that various possible risk factors interact in a linear and additive fashion, an assumption that may not reflect reality. Using artificial intelligence techniques, we sought to develop nonlinear models for predicting outcomes in CHS. Methods: We built machine learning (ML) models to predict mortality, postoperative mechanical ventilatory support time (MVST), and hospital length of stay (LOS) for patients who underwent CHS, based on data of more than 235,000 patients and 295,000 operations provided by the European Congenital Heart Surgeons Association Congenital Database. We used the optimal classification trees (OCTs) methodology for its interpretability and accuracy, and compared it with logistic regression and state-of-the-art ML methods (Random Forests, Gradient Boosting), reporting their area under the curve (AUC or c-statistic) for both training and testing data sets. Results: Optimal classification trees achieve outstanding performance across all three models (mortality AUC = 0.86, prolonged MVST AUC = 0.85, prolonged LOS AUC = 0.82), while being intuitively interpretable. The most significant predictors of mortality are procedure, age, and weight, followed by days since previous admission and any general preoperative patient risk factors. Conclusions: The nonlinear ML-based models of OCTs are intuitively interpretable and provide superior predictive power. The associated risk calculator allows easy, accurate, and understandable estimation of individual patient risks, in the theoretical framework of the average performance of all centers represented in the database. This methodology has the potential to facilitate decision-making and resource optimization in CHS, enabling total quality management and precise benchmarking initiatives.
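    The AUC (c-statistic) used to compare the models above has a direct probabilistic reading: the chance that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. A minimal sketch of that rank-based (Mann-Whitney) computation on made-up scores:

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank formulation: the fraction
    of positive/negative pairs where the positive case is scored
    higher, counting ties as half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model scores for six operations, label 1 = mortality.
scores = [0.9, 0.3, 0.8, 0.4, 0.2, 0.7]
labels = [1, 1, 0, 0, 0, 1]
value = auc(scores, labels)
```

    An AUC of 0.86, as reported for the mortality model, means the model ranks a death above a survival in 86% of such pairs; 0.5 would be chance-level ranking.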

    Safety and efficacy of insulin glargine 300 U/mL compared with other basal insulin therapies in patients with type 2 diabetes mellitus: a network meta-analysis

    Objective: To compare the efficacy and safety of a concentrated formulation of insulin glargine (Gla-300) with other basal insulin therapies in patients with type 2 diabetes mellitus (T2DM). Design: This was a network meta-analysis (NMA) of randomised clinical trials of basal insulin therapy in T2DM identified via a systematic literature review of Cochrane library databases, MEDLINE and MEDLINE In-Process, EMBASE and PsycINFO. Outcome measures: Changes in HbA1c (%) and body weight, and rates of nocturnal and documented symptomatic hypoglycaemia were assessed. Results: 41 studies were included; 25 studies comprised the main analysis population: patients on basal insulin-supported oral therapy (BOT). Change in glycated haemoglobin (HbA1c) was comparable between Gla-300 and detemir (difference: -0.08; 95% credible interval (CrI): -0.40 to 0.24), neutral protamine Hagedorn (NPH; 0.01; -0.28 to 0.32), degludec (-0.12; -0.42 to 0.20) and premixed insulin (0.26; -0.04 to 0.58). Change in body weight was comparable between Gla-300 and detemir (0.69; -0.31 to 1.71), NPH (-0.76; -1.75 to 0.21) and degludec (-0.63; -1.63 to 0.35), but significantly lower compared with premixed insulin (-1.83; -2.85 to -0.75). Gla-300 was associated with a significantly lower nocturnal hypoglycaemia rate versus NPH (risk ratio: 0.18; 95% CrI: 0.05 to 0.55) and premixed insulin (0.36; 0.14 to 0.94); no significant differences were noted in Gla-300 versus detemir (0.52; 0.19 to 1.36) and degludec (0.66; 0.28 to 1.50). Differences in documented symptomatic hypoglycaemia rates of Gla-300 versus detemir (0.63; 0.19 to 2.00), NPH (0.66; 0.27 to 1.49) and degludec (0.55; 0.23 to 1.34) were not significant. Extensive sensitivity analyses supported the robustness of these findings. Conclusions: NMA comparisons are useful in the absence of direct randomised controlled data.
    This NMA suggests that Gla-300 is also associated with a significantly lower risk of nocturnal hypoglycaemia compared with NPH and premixed insulin, with glycaemic control comparable to available basal insulin comparators.
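    Network meta-analysis lets two treatments that were never trialled head-to-head be compared through a common comparator. The paper uses a Bayesian NMA with credible intervals; the simplest frequentist analogue of the underlying indirect comparison is the Bucher method, where risk ratios divide and log-scale variances add. The numbers in the usage example are hypothetical, not the paper's results:

```python
import math

def indirect_risk_ratio(rr_a_vs_c, ci_a_vs_c, rr_b_vs_c, ci_b_vs_c, z=1.96):
    """Bucher-style indirect comparison: given A-vs-C and B-vs-C risk
    ratios with 95% CIs, estimate the A-vs-B risk ratio. Standard
    errors are recovered from the CI widths on the log scale and
    combined by adding variances."""
    log_rr = math.log(rr_a_vs_c) - math.log(rr_b_vs_c)
    se_a = (math.log(ci_a_vs_c[1]) - math.log(ci_a_vs_c[0])) / (2 * z)
    se_b = (math.log(ci_b_vs_c[1]) - math.log(ci_b_vs_c[0])) / (2 * z)
    se = math.sqrt(se_a ** 2 + se_b ** 2)
    return math.exp(log_rr), (math.exp(log_rr - z * se), math.exp(log_rr + z * se))

# Hypothetical inputs: treatment A vs comparator C, treatment B vs C.
rr, ci = indirect_risk_ratio(0.5, (0.3, 0.8), 0.7, (0.45, 1.1))
```

    Note how the indirect interval is wider than either direct one: uncertainty from both trial comparisons is inherited, which is why such comparisons are described as useful only "in the absence of direct randomised controlled data".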

    Targeted Workup after Initial Febrile Urinary Tract Infection: Using a Novel Machine Learning Model to Identify Children Most Likely to Benefit from Voiding Cystourethrogram

    Purpose: Significant debate persists regarding the appropriate workup in children with an initial urinary tract infection. Greatly preferable to the all-or-none approaches in the current guideline would be a model to identify children at highest risk for a recurrent urinary tract infection plus vesicoureteral reflux to allow for targeted voiding cystourethrogram, while children at low risk could be observed. We sought to develop a model to predict the probability of recurrent urinary tract infection associated vesicoureteral reflux in children after an initial urinary tract infection. Materials and Methods: We included subjects from the RIVUR (Randomized Intervention for Children with Vesico-Ureteral Reflux) and CUTIE (Careful Urinary Tract Infection Evaluation) trials in our study, excluding the prophylaxis treatment arm of the RIVUR. The main outcome was defined as recurrent urinary tract infection associated vesicoureteral reflux. Missing data were imputed using optimal tree imputation. Data were split into training, validation and testing sets. Machine learning algorithm hyperparameters were tuned by the validation set with fivefold cross-validation. Results: A total of 500 subjects, including 305 from the RIVUR and 195 from the CUTIE trials, were included in the study. Of the subjects 90% were female and mean ± SD age was 21 ± 19 months. A recurrent urinary tract infection developed in 72 patients, of whom 53 also had vesicoureteral reflux (10.6% of the total). The final model included age, sex, race, weight, the systolic blood pressure percentile, dysuria, the urine albumin-to-creatinine ratio, prior antibiotic exposure and current medication.
    The model predicted recurrent urinary tract infection associated vesicoureteral reflux with an AUC of 0.761 (95% CI 0.714-0.808) in the testing set. Conclusions: Our predictive model using a novel machine learning algorithm provided promising performance to facilitate individualized treatment of children with an initial urinary tract infection and identify those most likely to benefit from voiding cystourethrogram after the initial urinary tract infection. This would allow for more selective application of this test, increasing the yield while also minimizing overuse.
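    The tuning protocol described above (hyperparameters chosen by fivefold cross-validation, with a held-out testing set for the final AUC) can be sketched generically: split the sample indices into five folds, score each candidate hyperparameter by its mean validation performance across folds, and keep the best. The `fit_score` callback below is a hypothetical stand-in for training and scoring any model:

```python
def kfold_indices(n, k=5):
    """Split n sample indices into k near-equal contiguous folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def tune(params, fit_score, n, k=5):
    """Pick the hyperparameter whose mean validation score across
    k folds is highest; fit_score(p, train_idx, val_idx) trains with
    hyperparameter p on train_idx and scores on val_idx."""
    def cv_score(p):
        folds = kfold_indices(n, k)
        scores = []
        for i, val in enumerate(folds):
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            scores.append(fit_score(p, train, val))
        return sum(scores) / k
    return max(params, key=cv_score)

# Toy fit_score whose validation performance peaks at parameter 3.
best = tune([1, 2, 3, 4, 5], lambda p, tr, va: -(p - 3) ** 2, n=50)
```

    Keeping the testing set out of this loop entirely is what makes the reported AUC of 0.761 an honest estimate of out-of-sample performance.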