
    Machine Learning Morphisms: A Framework for Designing and Analyzing Machine Learning Workflows, Applied to Separability, Error Bounds, and 30-Day Hospital Readmissions

    A machine learning workflow is the sequence of tasks necessary to implement a machine learning application, including data collection, preprocessing, feature engineering, exploratory analysis, and model training/selection. In this dissertation we propose the Machine Learning Morphism (MLM) as a mathematical framework to describe the tasks in a workflow. The MLM is a tuple consisting of: Input Space, Output Space, Learning Morphism, Parameter Prior, and Empirical Risk Function. This contains the information necessary to learn the parameters of the learning morphism, which represents a workflow task. In Chapter 1, we give a short review of typical tasks present in a workflow, as well as motivation for and innovations in the MLM framework.

    In Chapter 2, we first define data as realizations of an unknown probability space. Then, after a brief introduction to statistical learning, the MLM is formally defined. Examples of MLMs are presented, including linear regression, standardization, and the Naive Bayes classifier. Asymptotic equality between MLMs is defined by analyzing the parameters in the limit of infinite training data. Two definitions of composition are proposed: output and structural. Output composition is a sequential optimization of MLMs, for example standardization followed by regression. Structural composition is a joint optimization inspired by backpropagation in neural nets. While structural compositions yield better overall performance, output compositions are easier to compute and interpret.

    In Chapter 3, we define the property of separability, where an MLM can be optimized by solving lower-dimensional subproblems. A separable MLM represents a divide-and-conquer strategy for learning without sacrificing optimality. We show three cases of separable MLMs for mean-squared error, with increasing complexity. First, if the input space consists of centered, independent random variables, OLS linear regression is separable. This is extended to linear combinations of uncorrelated ensembles, and to ensembles of non-linear, uncorrelated learning morphisms. The example of principal component regression is explored thoroughly as a separable workflow, and the choice between equivalent linear regressions is discussed. These separability results apply to a wide variety of problems via asymptotic equality. Functions which can be represented as power series can be learned via polynomial regression. Further, independent and centered power series can be generated using an orthogonal extension of principal component analysis (PCA).

    In Chapter 4, we explore the connection between generalization error and lower bounds used in estimation. We start by defining the "Bayes MLM," the best possible MLM for a given problem. When the loss function is mean-squared error, Cramér-Rao lower bounds exist for an MLM which depend on the bias of the MLM and the underlying probability distribution. This can be used as a design tool when selecting candidate MLMs, or as a tool for sensitivity analysis to examine the error of an MLM across a variety of parameterizations. A lower bound on the composition of MLMs is constructed by applying a nonlinear filtering framework to the composition. Examples are presented for centering, PCA, ordinary least-squares linear regression, and the composition of these MLMs.

    In Chapter 5, we apply the MLM framework to design a workflow that predicts 30-day hospital readmissions. A hospital readmission occurs when a patient is admitted less than 30 days after a previous hospital stay. We examine readmissions for a group of Medicare/Medicaid patients with the four most common diagnoses at Barnes-Jewish Hospital. Using MLMs, we incorporate the Mapper algorithm from topological data analysis into the predictive workflow in a novel ensemble. This ensemble first performs fuzzy clustering on the training set, and then trains models independently on each cluster. We compare an assortment of workflows predicting readmissions, and workflows featuring Mapper outperform other standard models as well as the risk prediction tools currently used at Barnes-Jewish. Finally, we examine the separability of this workflow. Mapper workflows incorporating AdaBoost and logistic regression create node models with low correlation. When PCA is applied to each node, random forest node models also become decorrelated. Support vector machine node models are highly correlated and do not converge when PCA is applied, consistent with their worse performance. In Chapter 6 we provide final comments and future work.
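    To make the tuple concrete, the sketch below shows one way the MLM and its output composition could be rendered in code. The field names, the `output_compose` helper, and the standardization/OLS toy instances are our illustration of the abstract's description, not the dissertation's own notation.

```python
from dataclasses import dataclass
from typing import Any, Callable

import numpy as np


@dataclass
class MLM:
    """A Machine Learning Morphism as the five-element tuple from the
    abstract: input space, output space, learning morphism, parameter
    prior, and empirical risk (folded here into a `fit` routine that
    minimizes it)."""
    input_dim: int                                      # stand-in for the input space
    output_dim: int                                     # stand-in for the output space
    morphism: Callable[[np.ndarray, Any], np.ndarray]   # f(x; theta)
    prior: Any                                          # parameter prior / initialization
    fit: Callable[[np.ndarray, np.ndarray], Any]        # empirical-risk minimizer


def output_compose(first: MLM, second: MLM, X: np.ndarray, y: np.ndarray):
    """Output composition: optimize the first MLM, push the data through
    it, then optimize the second MLM on the transformed data (sequential,
    in contrast to the jointly optimized structural composition)."""
    theta1 = first.fit(X, y)
    Z = first.morphism(X, theta1)
    theta2 = second.fit(Z, y)
    return theta1, theta2


# Example from the abstract: standardization followed by linear regression.
standardize = MLM(
    input_dim=3, output_dim=3,
    morphism=lambda X, th: (X - th[0]) / th[1],
    prior=None,
    fit=lambda X, y: (X.mean(axis=0), X.std(axis=0)),
)
ols = MLM(
    input_dim=3, output_dim=1,
    morphism=lambda X, th: X @ th,
    prior=None,
    fit=lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0],
)
# theta_std, beta = output_compose(standardize, ols, X, y)
```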

    Generalized and Transferable Patient Language Representation for Phenotyping with Limited Data

    The paradigm of representation learning through transfer learning has the potential to greatly enhance clinical natural language processing. In this work, we propose a multi-task pre-training and fine-tuning approach for learning generalized and transferable patient representations from medical language. The model is first pre-trained on different but related high-prevalence phenotypes and then fine-tuned on downstream target tasks. Our main contribution focuses on the impact this technique can have on low-prevalence phenotypes, a challenging task due to the dearth of data. We validate the representation from pre-training and fine-tune the multi-task pre-trained models on low-prevalence phenotypes, including 38 circulatory diseases, 23 respiratory diseases, and 17 genitourinary diseases. We find that multi-task pre-training increases learning efficiency and achieves consistently high performance across the majority of phenotypes. Most importantly, the multi-task pre-trained model is almost always either the best-performing model or performs tolerably close to it, a property we refer to as robustness. These results lead us to conclude that this multi-task transfer learning architecture is a robust approach for developing generalized and transferable patient language representations for numerous phenotypes. (Journal of Biomedical Informatics, in press.)
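    The sketch below illustrates the general shape of such a multi-task pre-train/fine-tune pipeline: a shared encoder with one head per high-prevalence pre-training phenotype, then a fresh head for a low-prevalence target. The encoder, layer sizes, and training loop are assumptions for illustration; the paper's actual architecture is not specified in this listing.

```python
import torch
import torch.nn as nn


class MultiTaskPhenotyper(nn.Module):
    """Shared encoder with one binary head per high-prevalence phenotype
    used in pre-training. The bag-of-words MLP encoder is a placeholder."""

    def __init__(self, vocab_size: int, hidden: int, n_pretrain_tasks: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            nn.Linear(hidden, 1) for _ in range(n_pretrain_tasks)
        )

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # Pre-training: every task shares the encoder, so gradients from
        # all phenotypes shape one patient representation.
        return self.heads[task_id](self.encoder(x))


def fine_tune(model: MultiTaskPhenotyper, x, y, epochs: int = 5, lr: float = 1e-4):
    """Fine-tuning on a low-prevalence target: reuse the pre-trained
    encoder, attach a fresh head, and train end to end."""
    head = nn.Linear(model.encoder[-2].out_features, 1)
    opt = torch.optim.Adam(
        list(model.encoder.parameters()) + list(head.parameters()), lr=lr
    )
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(head(model.encoder(x)).squeeze(-1), y)
        loss.backward()
        opt.step()
    return model, head
```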

    Time-Series Embedded Feature Selection Using Deep Learning: Data Mining Electronic Health Records for Novel Biomarkers

    As health information technologies continue to advance, the routine collection and digitisation of patient health records in the form of electronic health records presents an ideal opportunity for data mining and exploratory analysis of biomarkers and risk factors indicative of a potentially diverse domain of patient outcomes. Patient records have continually become more widely available through various initiatives enabling open access whilst maintaining critical patient privacy. In spite of such progress, health records remain not widely adopted within the current clinical statistical analysis domain due to the challenging issues inherent in such “big data”.

    Deep-learning-based temporal modelling approaches present an ideal solution to these challenges through automated self-optimisation of representation learning, manageably composing the high-dimensional domain of patient records into representations able to model complex data associations. Such representations can serve to condense and reduce dimensionality, emphasising feature sparsity and importance through novel embedded feature selection approaches. Applied to patient records, this enables complex modelling and analysis of the full domain of clinical features to select biomarkers of predictive relevance.

    Firstly, we propose a novel entropy-regularised neural network ensemble able to highlight risk factors associated with the hospitalisation risk of individuals with dementia. Its application reduced a large domain of unique medical events to a small set of relevant risk factors while maintaining hospitalisation discrimination.

    Following on, we continue our work on ensemble architectures with a novel cascading LSTM ensemble that predicts severe sepsis onset in patients in an ICU critical care centre. We demonstrate state-of-the-art performance, outperforming that of current related literature.

    Finally, we propose a novel embedded feature selection application dubbed 1D convolution feature selection using sparsity regularisation. The methodology was evaluated on both the dementia and sepsis prediction objectives to highlight model capability and generalisability. We further report a selection of potential biomarkers for the aforementioned case studies, highlighting their clinical relevance and potential novelty for future clinical analysis.

    Accordingly, we demonstrate the effective capability of embedded feature selection approaches, through temporal deep learning architectures, in the discovery of effective biomarkers across a variety of challenging clinical applications.
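    As a rough illustration of that final contribution, the sketch below implements one plausible form of 1D-convolution feature selection with sparsity regularisation: a learnable per-feature gate placed in front of a 1D convolution, with an L1 penalty on the gate. The architecture and hyperparameters are assumptions, not the thesis's exact design.

```python
import torch
import torch.nn as nn


class Conv1DFeatureSelector(nn.Module):
    """Embedded feature selection for multivariate clinical time series:
    a learnable per-feature gate is applied before a 1D convolution, and
    an L1 penalty on the gate drives irrelevant features toward zero."""

    def __init__(self, n_features: int, n_channels: int, kernel_size: int = 3):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(n_features))  # one weight per clinical feature
        self.conv = nn.Conv1d(n_features, n_channels, kernel_size, padding=1)
        self.head = nn.Linear(n_channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features, time)
        x = x * self.gate.view(1, -1, 1)           # gate each feature channel
        h = torch.relu(self.conv(x)).mean(dim=2)   # pool over the time axis
        return self.head(h).squeeze(-1)

    def sparsity_penalty(self, lam: float = 1e-3) -> torch.Tensor:
        return lam * self.gate.abs().sum()         # L1 sparsity regularisation


# Training would add model.sparsity_penalty() to the task loss; features whose
# gate magnitude stays near zero are treated as unselected.
```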

    Predictive Learning from Real-World Medical Data: Overcoming Quality Challenges

    Randomized controlled trials (RCTs) are pivotal in medical research, notably as the gold standard, but face challenges, especially with specific groups like pregnant women and newborns. Real-world data (RWD), from sources like electronic medical records and insurance claims, complements RCTs in areas like disease risk prediction and diagnosis. However, RWD's retrospective nature leads to issues such as missing values and data imbalance, requiring intensive data preprocessing. To enhance the quality of RWD for predictive modeling, this thesis introduces a suite of algorithms that automatically resolve these low-quality issues. The AMI-Net method is introduced first: it innovatively treats samples as bags of feature-value pairs and unifies them in an embedding space using a multi-instance neural network. It excels at handling incomplete datasets, a frequent issue in real-world scenarios, and shows resilience to noise and class imbalance. AMI-Net's capability to discern informative instances minimizes the effects of low-quality data. The enhanced version, AMI-Net+, improves instance selection, boosting performance and generalization. The AMI-Net series initially processed only binary input features, a constraint overcome by AMI-Net3, which supports binary, nominal, ordinal, and continuous features. Despite these advancements, challenges like missing values, data inconsistencies, and labeling errors persist in real-world data. The AMI-Net series also shows promise for regression and multi-task learning, potentially mitigating further low-quality data issues. Tested on various hospital datasets, these methods prove effective, though risks of overfitting and bias remain, necessitating further research. Overall, while promising for clinical studies and other applications, ensuring data quality and reliability remains crucial to these methods' success.
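    The bag-of-feature-value-pairs idea can be sketched as follows: each patient is a variable-length bag of (feature id, value) instances that are embedded and attention-pooled, so missing features are simply absent instances rather than imputed values. All layer choices here are assumptions; AMI-Net's actual architecture may differ.

```python
import torch
import torch.nn as nn


class BagOfPairsNet(nn.Module):
    """Sketch of a multi-instance network over bags of (feature, value)
    pairs: embed each present feature, scale by its value, weight the
    instances by attention, and classify the pooled bag representation."""

    def __init__(self, n_features: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(n_features, dim)
        self.attn = nn.Linear(dim, 1)
        self.clf = nn.Linear(dim, 1)

    def forward(self, feat_ids, values, mask):
        # feat_ids, values: (batch, max_bag); mask is True where an
        # instance is real, so padded / missing slots are ignored.
        inst = self.embed(feat_ids) * values.unsqueeze(-1)
        scores = self.attn(inst).squeeze(-1).masked_fill(~mask, -1e9)
        w = torch.softmax(scores, dim=1)            # informative-instance weights
        bag = (w.unsqueeze(-1) * inst).sum(dim=1)   # weighted bag representation
        return self.clf(bag).squeeze(-1)
```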

    Counting what counts: time-driven activity-based costing in health care

    Introduction: Patients with multiple chronic conditions consume over 40% of health care resources. The siloed nature of the health care system exacerbates these costs, and integrated care solutions are required to adequately meet their needs. However, such integrated multidisciplinary approaches are seen as costly. Costing care for patients with multiple chronic conditions therefore becomes important in helping health care professionals, management, and policy makers understand the true financial impact of integrated multidisciplinary care.

    Aim: The aim of this thesis is to explore how Time-Driven Activity-Based Costing (TDABC) can be applied to capture and compare the cost of integrated multidisciplinary versus traditional siloed care processes for patients with multiple chronic conditions.

    Method: This thesis comprises four studies. Study I was a systematic review performed according to the PRISMA statement, using qualitative methods to analyze data through content analysis. Studies II to IV were based on a randomized controlled trial, CareHND (NCT03362983). Study II used descriptive statistics to describe patient diagnostic data and Charlson Comorbidity Index scores, and compared care utilization patterns between integrated multidisciplinary care and traditional care. Study III adopted a mixed-methods approach to perform a TDABC analysis of integrated multidisciplinary care. Study IV expanded on Study III to compare the costs of integrated multidisciplinary care to those of traditional siloed care.

    Findings: Study I found that TDABC is an efficient and accurate tool for costing processes in health care, but has not been demonstrated to effectively cost care across the care continuum. Study II found that patients with multiple chronic conditions experience care characterized by high volume and high variation, and no difference in care utilization was detected when comparing integrated multidisciplinary care to traditional siloed care. The TDABC cost analysis in Study III successfully estimated the outpatient care costs for patients with multiple chronic conditions. Study IV found that the integrated multidisciplinary care center saved a hospital an average of 5,098.00 € per patient per year.

    Discussion: This thesis demonstrates how TDABC can be applied to capture and compare the costs of care processes for patients with multiple chronic conditions. More broadly, it demonstrates how to conceptualize and evaluate real-world care pathways for these patients in order to inform actionable changes to clinical management within hospitals. It lays the groundwork for empowering hospitals and other providers to incorporate financial analyses into their evidence development, quality improvement, and decision making, and to contribute to the wider financial and economic systems in health care.

    Conclusion: This thesis demonstrates that a hospital-based integrated multidisciplinary care approach to a complex medical condition makes economic sense for the hospital and the system. The TDABC approach developed in this thesis project brought to light a set of core capacities which can be prioritized in future quality improvement efforts. Through these core capacities, clinical organizations will hopefully become empowered to make wise, value-driven decisions that serve as a new incentive for organizational improvement. Information that demonstrates value delivery will make financial needs clear to managers and policy makers, who in turn should understand that evidence-based investment in care facilities and services will ultimately demonstrate a return, benefiting not only IMD-Care patients but also the larger populations they serve.
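    For readers unfamiliar with TDABC, its mechanics reduce to two equations: a capacity cost rate per resource (cost of capacity supplied divided by practical capacity in minutes) and a pathway cost summing rate times time over each process step. The toy figures below are invented for illustration and are not drawn from the thesis.

```python
# Minimal TDABC illustration: capacity cost rates, then a per-visit pathway cost.

resources = {                      # cost per period (€), practical capacity (min)
    "nurse":     (60_000.0, 90_000),
    "physician": (180_000.0, 80_000),
}
# Capacity cost rate = cost of capacity supplied / practical capacity in minutes.
rate = {r: cost / cap_min for r, (cost, cap_min) in resources.items()}

care_pathway = [                   # (resource, minutes consumed per patient visit)
    ("nurse", 20), ("physician", 15), ("nurse", 10),
]
# Pathway cost = sum of (rate * time) over every process step.
visit_cost = sum(rate[r] * minutes for r, minutes in care_pathway)
print(f"cost per visit: {visit_cost:.2f} €")   # 53.75 € with these toy figures
```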

    Development of a hospital readmission reduction program for patients discharged to skilled nursing facilities: An application of artificial intelligence and machine learning techniques

    Background: Hospital readmissions within 30 days after discharge have drawn national policy attention because they reflect suboptimal patient care. Readmissions are costly, accounting for more than $17 billion in potentially avoidable Medicare expenditures; nearly 78% of readmissions may be avoidable. Rich electronic data from medical records, growing computing capacity, and open-source machine learning algorithms offer new opportunities to predict which patients are at high risk for readmission and to prevent readmission through focused interventions. Prediction models might also provide a more nuanced picture of the patient characteristics that drive variation in readmission rates. Furthermore, transitional care between hospitals and skilled nursing facilities is a critical component of readmission prevention. Successful transitional care must include the development of a comprehensive care plan and the availability of experienced health practitioners who are given relevant medical information on patients’ readmission risk.

    Methods: Predictive models were developed using statistical and machine learning algorithms to identify patients at risk for readmission overall, as well as for readmissions associated with pneumonia, sepsis, and urinary tract infections, after discharge to skilled nursing facilities. Over 3,000 features associated with patients discharged to skilled nursing facilities were extracted from NYU Langone Health’s electronic health record system and analyzed using logistic regression, gradient boosting trees, support vector machine, and neural network algorithms. A time split-sample approach partitioned the data into training, validation, and test sets by year: 2012-2017 data for training (n = 9,725), 2018 data for validation (n = 3,878), and 2019 data for testing (n = 4,342). The most accurate model was selected based on discrimination and calibration performance. The selected model for overall readmission risk was compared to previously published index score models on discrimination and calibration. A variable importance algorithm was used to determine the important features of the selected models for overall readmission and for infection-associated readmissions. Lastly, using the risk estimates from the models for the four readmission outcomes, a notification and reporting system for key stakeholders was created, including a standardized readmission ratio comparing the observed to the expected number of readmissions by discharging provider and skilled nursing facility.

    Results: A gradient boosting model was selected as the best model to predict overall readmission risk using only real-time data. Discrimination performance was better than or similar to previously published index score models that rely on coded data, and calibration was superior. Gradient boosting models were also used to classify readmission risk associated with sepsis, pneumonia, and urinary tract infections. Risk estimates from the models were successfully used to calculate a Readmission Risk Ratio metric, which was incorporated into an email notifying key stakeholders and into risk-adjusted reports.

    Conclusions: Hospitals can leverage the rich data found in electronic health records to generate readmission prediction models optimized for their patient population. This study builds several prediction models, develops an artificial intelligence notification tool, and explores potential interventions as part of a broader program. It does not, however, assess the effectiveness of the tool or the interventions’ effect on readmission rates. Validated models can be deployed to target resources toward patients at high risk for readmission with proven interventional programs and to facilitate collaboration among transitional care teams.
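    A minimal sketch of two of the methods described above, the time split-sample partition and the observed-to-expected standardized readmission ratio, might look as follows. The column names, the O/E construction, and the scikit-learn model stand in for details the abstract does not give.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# `df` is assumed to hold one row per skilled-nursing-facility discharge with
# a discharge year, model features, a binary 30-day readmission label, and a
# facility identifier; all names here are placeholders.

def time_split(df: pd.DataFrame):
    """The study's time split-sample partition by discharge year."""
    return (df[df["year"] <= 2017],   # training: 2012-2017
            df[df["year"] == 2018],   # validation: 2018
            df[df["year"] == 2019])   # test: 2019

def standardized_readmission_ratio(model, df, features, group="facility"):
    """Observed-to-expected readmissions per facility, taking 'expected'
    as the sum of model risk estimates (a common O/E construction; the
    study's exact definition may differ)."""
    df = df.assign(risk=model.predict_proba(df[features])[:, 1])
    g = df.groupby(group)
    return g["readmit30"].sum() / g["risk"].sum()

# train, valid, test = time_split(df)
# model = GradientBoostingClassifier().fit(train[features], train["readmit30"])
# srr = standardized_readmission_ratio(model, test, features)
```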

    Preface
