
    Machine Learning and Integrative Analysis of Biomedical Big Data

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., the genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability.
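
    As a concrete illustration of two of these challenges, the sketch below shows a common "early integration" baseline: two hypothetical omics matrices for the same patients are concatenated, projected to a lower dimension to ease the curse of dimensionality, and fed to a class-weighted classifier to counter class imbalance. The data, dimensions, and pipeline choices are illustrative assumptions, not methods prescribed by the review.

        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(0)
        n_patients = 100
        expr = rng.normal(size=(n_patients, 5000))    # transcriptome features (synthetic)
        meth = rng.normal(size=(n_patients, 20000))   # epigenome features (synthetic)
        y = rng.binomial(1, 0.2, size=n_patients)     # imbalanced clinical outcome

        X = np.hstack([expr, meth])   # early integration: concatenate modalities
        model = make_pipeline(
            StandardScaler(),                  # put heterogeneous modalities on one scale
            PCA(n_components=20),              # reduce dimensionality before fitting
            LogisticRegression(class_weight="balanced"),  # reweight to counter class imbalance
        )
        model.fit(X, y)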

    A reduced-rank approach to predicting multiple binary responses through machine learning

    This paper investigates the problem of simultaneously predicting multiple binary responses from a shared set of covariates. Our approach incorporates machine learning techniques for binary classification without making assumptions about the underlying observations. Instead, we focus on a group of predictors and aim to identify the one that minimizes prediction error. Unlike previous studies that primarily address estimation error, we directly analyze the prediction error of our method using PAC-Bayesian bound techniques. We introduce a pseudo-Bayesian approach capable of handling incomplete response data, and our strategy is efficiently implemented using the Langevin Monte Carlo method. Through simulation studies and a practical application using real data, we demonstrate the effectiveness of our proposed method, producing results comparable, and sometimes superior, to the current state-of-the-art method.
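
    The sketch below conveys the general flavor of such a procedure rather than the paper's actual algorithm: Langevin Monte Carlo draws from a pseudo-posterior over low-rank factors U, V of the coefficient matrix B = U V^T, under an assumed logistic loss and Gaussian prior. The temperature lam, prior scale tau, step size, and rank are all illustrative assumptions.

        import numpy as np

        rng = np.random.default_rng(1)
        n, p, q, r = 200, 10, 5, 2
        X = rng.normal(size=(n, p))
        Y = rng.binomial(1, 0.5, size=(n, q)).astype(float)   # multiple binary responses

        def grad_log_post(U, V, lam=1.0, tau=10.0):
            """Gradient of the log pseudo-posterior w.r.t. the low-rank factors."""
            P = 1.0 / (1.0 + np.exp(-X @ U @ V.T))   # predicted probabilities
            G = lam * X.T @ (Y - P)                  # gradient of the log quasi-likelihood w.r.t. B
            return G @ V - U / tau**2, G.T @ U - V / tau**2   # chain rule plus Gaussian prior terms

        U, V = rng.normal(size=(p, r)), rng.normal(size=(q, r))
        step = 1e-4
        for _ in range(5000):   # Langevin updates: gradient drift plus Gaussian noise
            gU, gV = grad_log_post(U, V)
            U = U + step * gU + np.sqrt(2 * step) * rng.normal(size=U.shape)
            V = V + step * gV + np.sqrt(2 * step) * rng.normal(size=V.shape)
        B_hat = U @ V.T   # an approximate draw from the pseudo-posterior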

    A method for comparing multiple imputation techniques: A case study on the U.S. National COVID Cohort Collaborative

    Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful for assessing associations between patients’ predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases, whose removal may introduce severe bias. Several multiple imputation algorithms have been proposed to attempt to recover the missing information under an assumed missingness mechanism. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, the selection of each algorithm’s parameters and data-related modeling choices are both crucial and challenging.
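
    One common way to benchmark imputation algorithms of this kind, sketched below under illustrative assumptions, is to mask a fraction of genuinely observed values, run each imputer, and score how well the hidden values are recovered. The masking rate, the MCAR masking mechanism, and the two candidate imputers are placeholders, not the study's protocol.

        import numpy as np
        from sklearn.experimental import enable_iterative_imputer  # noqa: F401
        from sklearn.impute import IterativeImputer, SimpleImputer

        rng = np.random.default_rng(2)
        X_true = rng.normal(size=(500, 8))
        X_obs = X_true.copy()
        mask = rng.random(X_true.shape) < 0.2   # hide 20% of entries (MCAR here)
        X_obs[mask] = np.nan

        # Impute with each candidate method and score recovery of the hidden values.
        for imputer in (SimpleImputer(strategy="mean"), IterativeImputer(max_iter=10)):
            X_imp = imputer.fit_transform(X_obs)
            rmse = np.sqrt(np.mean((X_imp[mask] - X_true[mask]) ** 2))
            print(type(imputer).__name__, round(rmse, 3))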

    From bilinear regression to inductive matrix completion: a quasi-Bayesian analysis

    In this paper we study the problem of bilinear regression, and we further address the case when the response matrix contains missing data, a setting referred to as inductive matrix completion. We first propose a quasi-Bayesian approach to bilinear regression in which a quasi-likelihood is employed. We then adapt this approach to the context of inductive matrix completion. Under a low-rankness assumption and leveraging the PAC-Bayes bound technique, we provide statistical properties for our proposed estimators and for the quasi-posteriors. We propose a Langevin Monte Carlo method to approximately compute the proposed estimators. Some numerical studies are conducted to demonstrate our methods. Comment: arXiv admin note: substantial text overlap with arXiv:2206.0861
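
    For intuition, the sketch below fits the inductive matrix completion model Y ≈ X1 M X2^T with a low-rank M = A B^T by plain gradient descent over the observed entries only. This is a point-estimate illustration, not the paper's quasi-Bayesian procedure; the dimensions, rank, observation rate, and step size are all assumptions.

        import numpy as np

        rng = np.random.default_rng(3)
        n1, n2, p1, p2, r = 80, 60, 12, 9, 3
        X1, X2 = rng.normal(size=(n1, p1)), rng.normal(size=(n2, p2))
        M_true = rng.normal(size=(p1, r)) @ rng.normal(size=(r, p2))
        Y = X1 @ M_true @ X2.T + 0.1 * rng.normal(size=(n1, n2))
        obs = rng.random(Y.shape) < 0.3        # only 30% of responses observed

        A, B = 0.1 * rng.normal(size=(p1, r)), 0.1 * rng.normal(size=(p2, r))
        lr, nobs = 0.01, obs.sum()
        for _ in range(5000):
            # Mean residual over the observed entries only; missing entries contribute nothing.
            R = np.where(obs, X1 @ A @ B.T @ X2.T - Y, 0.0) / nobs
            A = A - lr * X1.T @ R @ X2 @ B     # gradient step on the left factor
            B = B - lr * X2.T @ R.T @ X1 @ A   # gradient step on the right factor
        M_hat = A @ B.T                        # low-rank estimate of M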

    Efficient Baseline Utilization In Crossover Clinical Trials Through Linear Combinations Of Baselines: Parametric, Nonparametric, And Model Selection Approaches

    In a crossover clinical trial, including period-specific baselines as covariates in a regression model is known to increase the precision of the estimated treatment effect. The potential efficiency gain depends, in part, on the true model, the distribution and covariance matrix of the vector of baselines and outcomes, and the model chosen for analysis. We examine improvements in power that can be achieved by incorporating an optimal linear combination of baselines (LCB). For a known distribution, the optimal LCB minimizes the conditional variance corresponding to a treatment effect. The use of a single metric to capture the information in the baseline measurements is appealing for crossover designs: because of their efficiency, crossover trials tend to have small sample sizes, so the number of covariates in a model can significantly impact the degrees of freedom in the analysis. We start by examining optimal LCB models under a normality assumption for uniform and incomplete block designs. For uniform designs, such as the AB/BA design, estimation is entirely through within-subject contrasts (and thus ordinary least squares [OLS]), and the optimal LCB minimizes the conditional variance corresponding to the treatment effect. However, since the optimal LCB is a function of the unknown covariance matrix, we propose an adaptive method that uses the LCB covariate corresponding to the most plausible covariance structure suggested by the data. For incomplete block designs, data are commonly analyzed using a mixed effects model, and treatment effect estimates from this analysis are complex functions of both within-subject and between-subject treatment contrasts. To improve efficiency, we propose incorporating period-specific optimal LCBs, which minimize the conditional variance of the period-specific outcomes. A simpler fixed effects analysis of covariance involving only within-subject contrasts is also described for small sample situations, where hypothesis tests based on the mixed effects analyses exhibit inflated type I error rates even when the Kenward-Roger approach is used to adjust the degrees of freedom. Lastly, we extend this work to the more general setting where the optimal LCB depends on the distribution of the response vector. In practice, the distribution is unknown and the optimal LCB is estimated under some loss function. To handle both normal and non-normal response data, OLS and a rank-based nonparametric regression model (R-estimation) are considered. A data-driven approach is then proposed which adaptively chooses the best fitting model among a set of models that work well under a range of conditions. Relative to commonly used methods, such as change-from-baseline analyses without covariates, our methods using functions of baselines as period-specific or period-invariant covariates consistently demonstrate improved power across a number of crossover designs, covariance structures, and response distributions.
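
    Under normality the optimal LCB has a closed form: the weights b = Var(x)^{-1} Cov(x, d) minimize the conditional variance of the within-subject contrast d given the baseline vector x. The sketch below estimates these weights from simulated AB/BA data and uses the resulting single covariate in an OLS analysis; all parameter values and the data-generating model are illustrative assumptions.

        import numpy as np

        rng = np.random.default_rng(4)
        n = 40                                    # small trial, typical for crossovers
        seq = np.repeat([0, 1], n // 2)           # 0 = AB sequence, 1 = BA sequence
        subj = rng.normal(size=n)                 # between-subject variability
        x = subj[:, None] + rng.normal(scale=0.5, size=(n, 2))   # period-specific baselines
        d = 1.5 * (1 - 2 * seq) + 0.8 * (x[:, 0] - x[:, 1]) + rng.normal(scale=0.5, size=n)  # y1 - y2

        b = np.linalg.solve(np.cov(x.T), np.cov(x.T, d)[:2, 2])  # optimal LCB weights Var(x)^-1 Cov(x, d)
        lcb = x @ b                               # one covariate summarizing both baselines
        Z = np.column_stack([np.ones(n), 1 - 2 * seq, lcb])      # intercept, sequence contrast, LCB
        beta = np.linalg.lstsq(Z, d, rcond=None)[0]
        print("treatment effect estimate:", beta[1])             # true value is 1.5 here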

    Participatory Mapping to Address Neighborhood Level Data Deficiencies for food Security Assessment in Southeastern Virginia, USA

    Background: Food is not equitably available. Deficiencies and generalizations limit national datasets, food security assessments, and interventions. Additional neighborhood-level studies are needed to develop a scalable and transferable process that complements national and internationally comparative datasets with timely, granular, nuanced data. Participatory geographic information systems (PGIS) offer a means to address these issues by digitizing local knowledge.
    Methods: The objectives of this study were two-fold: (i) identify granular locations missing from food source and risk datasets and (ii) examine the relation between the spatial, socio-economic, and agency contributors to food security. Twenty-nine subject matter experts from three cities in Southeastern Virginia with backgrounds in food distribution, nutrition management, human services, and associated research engaged in a participatory mapping process.
    Results: Results show that publicly available and other national datasets do not include non-traditional food sources and are not updated frequently enough to reflect changes associated with closures, expansion, or new programs. Almost 6 percent of food sources were missing from publicly available and national datasets. Food pantries, community gardens and fridges, farmers markets, child and adult care programs, and meals served in community centers and homeless shelters were not well represented. Over 24 km² of participant-identified need was outside United States Department of Agriculture low-income, low-access areas. Economic, physical, and social barriers to food security were interconnected with transportation limitations. Recommendations address an international call from development agencies, countries, and world regions for intervention methods that address systemic and generational issues with poverty, incorporate non-traditional spaces into food distribution systems, incentivize or regulate healthy food options in stores, improve educational opportunities, and increase data sharing.
    Conclusions: Leveraging city and regional agency to capitalize upon synergistic activities was seen as critical to achieving these goals, particularly for non-traditional partnership building. To address neighborhood-scale food security needs in Southeastern Virginia, data collection and assessment should address both environment and utilization issues from consumer and producer perspectives, including availability, proximity, accessibility, awareness, affordability, cooking capacity, and preference. The PGIS process, which facilitated information sharing about neighborhood-level contributors to food insecurity and translated those contributors into intervention strategies through discussion with local subject matter experts and contextualization within larger-scale food systems dynamics, is transferable.
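
    As a rough sketch of the kind of overlay analysis behind a figure like the 24 km² of unmet need, the snippet below differences participant-identified need polygons against USDA low-income, low-access (LILA) tracts and totals the remaining area. The file names are hypothetical placeholders, and the equal-area projection is an assumption appropriate for the continental U.S.

        import geopandas as gpd

        # Hypothetical input layers: participant-drawn need polygons and USDA LILA tracts.
        need = gpd.read_file("participant_need_areas.geojson")
        lila = gpd.read_file("usda_lila_tracts.geojson")

        # Project to an equal-area CRS (CONUS Albers) so areas come out in square meters.
        need = need.to_crs(epsg=5070)
        lila = lila.to_crs(epsg=5070)

        outside = need.overlay(lila, how="difference")   # need not covered by any LILA tract
        print("km^2 of need outside LILA areas:", outside.area.sum() / 1e6)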

    Persuasion, Political Warfare, and Deterrence: Behavioral and Behaviorally Robust Models

    This dissertation examines game theory models in the context of persuasion and competition wherein decision-makers are not completely rational, considering two complementary threads of research. The first thread pertains to offensive and preemptively defensive behavioral models. Research in this thread makes three notable contributions. First, an offensive modeling framework is created to identify how an entity optimally influences a populace to take a desired course of action. Second, a defensive modeling framework is defined wherein a regulating entity takes action to bound the behavior of multiple adversaries simultaneously attempting to persuade a group of decision-makers. Third, an offensive influence modeling framework under conditions of ambiguity is developed in accordance with historical information limitations, and we demonstrate how it can be used to select a robust course of action in a specific, data-driven use case. The second thread pertains to behavioral and behaviorally robust approaches to deterrence. Research in this thread makes two notable contributions. First, we demonstrate the alternative insights behavioral game theory generates for the analysis of classic deterrence games and explicate the rich analysis generated from its combined use with standard equilibrium models. Second, we define behaviorally robust models for an agent to use in a normal form game under varying forms of uncertainty in order to inform deterrence policy decisions.
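
    As a small taste of the behavioral tools involved, the sketch below computes a logit quantal response equilibrium (QRE), one standard behavioral alternative to Nash equilibrium, for a stylized 2x2 deterrence game via damped fixed-point iteration. The payoffs and the rationality parameter lam are illustrative assumptions, not the dissertation's models.

        import numpy as np

        # Stylized payoffs: row player is a challenger (challenge / refrain),
        # column player is a defender (resist / concede).
        A = np.array([[-2.0, 3.0],
                      [ 0.0, 0.0]])   # challenger payoffs
        B = np.array([[-1.0, -2.0],
                      [ 1.0,  1.0]])  # defender payoffs

        lam = 2.0        # rationality parameter: larger means closer to best response
        p = q = 0.5      # Pr(challenge), Pr(resist)
        for _ in range(2000):   # damped fixed-point iteration on logit responses
            u_row = A @ np.array([q, 1 - q])     # expected payoff of each row action
            u_col = np.array([p, 1 - p]) @ B     # expected payoff of each column action
            p_new = 1 / (1 + np.exp(-lam * (u_row[0] - u_row[1])))
            q_new = 1 / (1 + np.exp(-lam * (u_col[0] - u_col[1])))
            p, q = 0.5 * (p + p_new), 0.5 * (q + q_new)
        print("QRE: Pr(challenge) =", p, " Pr(resist) =", q)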