5,432 research outputs found

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., the genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to perform integrative analysis of biomedical data acquired from diverse modalities effectively and efficiently. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
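    Two of these challenges can be illustrated with a short scikit-learn sketch: per-block standardization and PCA shrink each high-dimensional omics block before a simple concatenation-based integration, and class weighting is one basic remedy for an imbalanced outcome. The synthetic data, feature counts, and helper name reduce_block are made up for illustration and are not taken from the review.

        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.linear_model import LogisticRegression
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(0)
        n = 200
        # Two hypothetical omics blocks with many more features than samples.
        gene_expr = rng.normal(size=(n, 5000))      # e.g., transcriptome
        methylation = rng.normal(size=(n, 2000))    # e.g., epigenome
        y = rng.binomial(1, 0.1, size=n)            # imbalanced clinical outcome

        def reduce_block(block, n_components=20):
            """Standardize one omics block and compress it with PCA
            to mitigate the curse of dimensionality."""
            z = StandardScaler().fit_transform(block)
            return PCA(n_components=n_components).fit_transform(z)

        # Concatenation-based integration of the per-block representations.
        X = np.hstack([reduce_block(gene_expr), reduce_block(methylation)])

        # Class weighting is one simple remedy for class imbalance.
        clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
        print(clf.score(X, y))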

    A Methodology to Develop a Decision Model Using a Large Categorical Database with Application to Identifying Critical Variables during a Transport-Related Hazardous Materials Release

    An important problem in the use of large categorical databases is extracting information to support decisions, including the identification of critical variables. Because a dataset with many records, variables, and categories is complex, a methodology for simplification and for measuring associations is needed to build a decision model. To this end, the proposed methodology uses existing methods for categorical exploratory analysis. Specifically, latent class analysis and loglinear modeling, which together constitute a three-step, non-simultaneous approach, were used to simplify the variables and to measure their associations, respectively. This methodology has not previously been used to extract data-driven decision models from large categorical databases. A case in point is a large categorical database at the U.S. Department of Transportation (DOT) on hazardous materials (hazmat) releases during transportation. This dataset is important because of the risk posed by an unintentional release. However, in the absence of a data-congruent decision model of a hazmat release, decision making at DOT's Office of Hazardous Materials, including the identification of critical variables, is limited. This gap in modeling a release is paralleled by a similar gap in the hazmat transportation literature, which has an operations research and quantitative risk assessment focus in which the models consist of simple risk equations or more complex theoretical equations. Based on these critical opportunities at DOT and on gaps in the literature, the proposed methodology was demonstrated on the hazmat release database. The methodology can also be applied to other categorical databases, such as those at the National Center for Health Statistics, to extract decision models. A key goal of the decision model, a Bayesian network, was identification of the most influential variables relative to two consequences, or measures of risk, of a hazmat release: dollar loss and release quantity. The most influential variables for dollar loss were related to container failure, specifically the causing object and the item-area of failure on the container. For release quantity, the container failure variables were again the most influential, specifically the contributing action and the failure mode. In addition, potential changes in these variables for reducing consequences were identified.
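    The latent class step can be illustrated with a minimal EM routine for integer-coded categorical records, sketched below. This is a generic textbook version of latent class analysis, not the study's implementation; the function name, data layout, class count, and toy data are illustrative only.

        import numpy as np

        def latent_class_em(X, n_categories, n_classes=3, n_iter=200, seed=0):
            """Fit a latent class model to integer-coded categorical data via EM.

            X            : (n_records, n_vars) array with X[i, j] in {0, ..., n_categories[j] - 1}
            n_categories : number of categories for each variable
            Returns class proportions, per-class category probabilities, and posteriors.
            """
            rng = np.random.default_rng(seed)
            n, p = X.shape
            pi = np.full(n_classes, 1.0 / n_classes)       # class proportions
            theta = [rng.dirichlet(np.ones(k), size=n_classes) for k in n_categories]

            for _ in range(n_iter):
                # E-step: posterior probability of each latent class for each record.
                log_post = np.tile(np.log(pi), (n, 1))
                for j in range(p):
                    log_post += np.log(theta[j][:, X[:, j]]).T
                log_post -= log_post.max(axis=1, keepdims=True)
                post = np.exp(log_post)
                post /= post.sum(axis=1, keepdims=True)

                # M-step: update proportions and per-class category probabilities.
                pi = post.mean(axis=0)
                for j in range(p):
                    counts = np.column_stack(
                        [post[X[:, j] == c].sum(axis=0) for c in range(n_categories[j])]
                    )
                    theta[j] = counts / counts.sum(axis=1, keepdims=True)

            return pi, theta, post

        # Toy usage: 500 synthetic records of 4 categorical variables with 3 categories each.
        rng = np.random.default_rng(1)
        X = rng.integers(0, 3, size=(500, 4))
        pi, theta, post = latent_class_em(X, n_categories=[3, 3, 3, 3])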

    Do opening hours and unobserved heterogeneity affect economies of scale and scope in postal outlets?

    The purpose of this study is to analyze the cost structure of Swiss Post’s postal outlets. In particular, the idea is to assess economies of scale and scope in post offices and franchised postal agencies. Information on their optimal size and production structure is important from the policy-makers’ point of view because this hypothetical situation may serve as a basis for calculating reimbursements for providing the universal service. Two important novelties are introduced in this study. First, the latent class model accounts for postal outlets with different underlying production technologies arising from unobserved factors. Second, the cost model includes standby time as an indicator of public service, because regulated accessibility and negotiated opening hours that enhance public service frequently lead to opening hours that exceed the time necessary to serve demand. Overall, the analysis confirms the existence of unexploited economies of scale and scope in the Swiss Post office network, which increase as output falls. Furthermore, the results for the latent class model point to the existence of unobserved heterogeneity in the industry.
    Keywords: economies of scale, economies of scope, postal outlet network, unobserved heterogeneity, latent class model, opening hours, standby time
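    For intuition about the two measures, the sketch below evaluates them for an assumed outlet cost function: overall economies of scale as cost divided by output-weighted marginal costs, and economies of scope as the relative saving of joint over stand-alone production. The cost function, output values, and function names are hypothetical and are not taken from the study.

        import numpy as np

        def scale_economies(cost, y, eps=1e-6):
            """Overall economies of scale: C(y) / sum_i y_i * dC/dy_i.
            Values above 1 indicate unexploited (increasing) economies of scale."""
            y = np.asarray(y, dtype=float)
            grad = np.array([(cost(y + eps * e) - cost(y - eps * e)) / (2 * eps)
                             for e in np.eye(len(y))])
            return cost(y) / np.dot(y, grad)

        def scope_economies(cost, y):
            """Economies of scope: relative cost saving of joint production over
            producing each output in a separate, specialized outlet (positive = savings)."""
            y = np.asarray(y, dtype=float)
            standalone = sum(cost(y * e) for e in np.eye(len(y)))
            return (standalone - cost(y)) / cost(y)

        # Hypothetical outlet cost function: a fixed (standby) cost plus linear
        # variable costs for two outputs, e.g. counter transactions and parcels.
        def C(y):
            return 100.0 + 2.0 * y[0] + 3.0 * y[1]

        print(scale_economies(C, [100, 60]))   # about 1.26: unexploited scale economies
        print(scope_economies(C, [100, 60]))   # about 0.21: joint production saves cost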

    Measuring the quality of life and the construction of social indicators


    Statistical Tools for Network Data: Prediction and Resampling

    Advances in data collection and social media have led to more and more network data appearing in diverse areas, such as the social sciences, the internet, transportation, and biology. This thesis develops new principled statistical tools for network analysis, with emphasis on both appealing statistical properties and computational efficiency. Our first project focuses on building prediction models for network-linked data. Prediction algorithms typically assume the training data are independent samples, but in many modern applications samples come from individuals connected by a network. For example, in adolescent health studies of risk-taking behaviors, information on the subjects' social network is often available and plays an important role through network cohesion, the empirically observed phenomenon of friends behaving similarly. Taking cohesion into account in prediction models should allow us to improve their performance. We propose a network-based penalty on individual node effects to encourage similarity between predictions for linked nodes, and show that incorporating it into prediction leads to improvement over traditional models both theoretically and empirically when network cohesion is present. The penalty can be used with many loss-based prediction methods, such as regression, generalized linear models, and Cox's proportional hazards model. Applications to predicting levels of recreational activity and marijuana usage among teenagers from the AddHealth study, based on both demographic covariates and friendship networks, are discussed in detail. Our approach to taking friendships into account can significantly improve predictions of behavior while providing interpretable estimates of covariate effects.

    Resampling, data splitting, and cross-validation are powerful general strategies in statistical inference, but resampling from a network remains a challenging problem. Many statistical models and methods for networks need model selection and tuning parameters, which could be done by cross-validation if we had a good method for splitting network data; however, splitting network nodes into groups requires deleting edges and destroys some of the structure. Here we propose a new network cross-validation strategy based on splitting edges rather than nodes, which avoids losing information and is applicable to a wide range of network models. We provide a theoretical justification for our method in a general setting and demonstrate how our method can be used in a number of specific model selection and parameter tuning tasks, with extensive numerical results on simulated networks. We also apply the method to analysis of a citation network of statisticians and obtain meaningful research communities.

    Finally, we consider the problem of community detection on partially observed networks. In practice, network data are often collected through sampling mechanisms, such as survey questionnaires, rather than by direct observation. The noise and bias introduced by such sampling mechanisms can obscure the community structure and invalidate the assumptions of standard community detection methods. We propose a model that incorporates neighborhood sampling, reflective of survey designs, into community detection for directed networks, since friendship networks obtained from surveys are naturally directed. We model the edge sampling probabilities as a function of both individual preferences and community parameters, and fit the model by a combination of spectral clustering and the method of moments. The algorithm is computationally efficient and comes with a theoretical guarantee of consistency. We evaluate the proposed model in extensive simulation studies and apply it to a faculty hiring dataset, discovering a meaningful hierarchy of communities among US business schools.
    PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/145894/1/tianxili_1.pd
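    The first project's penalty can be pictured with a small sketch: a linear model with one effect per node, penalized by a quadratic form in the graph Laplacian so that linked nodes receive similar effects. This is a generic illustration assuming that specific penalty form; the function name, toy graph, and regularization value are made up and this is not the thesis code.

        import numpy as np

        def fit_network_cohesion(X, y, A, lam=1.0):
            """Linear regression with a network-cohesion penalty (generic sketch).

            Minimizes ||y - alpha - X beta||^2 + lam * alpha' L alpha, where alpha has
            one entry per node and L is the graph Laplacian of the adjacency matrix A,
            so connected nodes are pulled toward similar individual effects.
            Do not include an intercept column in X; the node effects absorb it.
            """
            n, p = X.shape
            L = np.diag(A.sum(axis=1)) - A                   # graph Laplacian
            M = np.hstack([np.eye(n), X])                    # design for (alpha, beta)
            P = np.zeros((n + p, n + p))
            P[:n, :n] = L                                    # penalize node effects only
            lhs = M.T @ M + lam * P + 1e-8 * np.eye(n + p)   # tiny ridge for stability
            theta = np.linalg.solve(lhs, M.T @ y)
            return theta[:n], theta[n:]                      # node effects, coefficients

        # Toy usage on a small random graph.
        rng = np.random.default_rng(0)
        n, p = 60, 3
        A = rng.binomial(1, 0.1, size=(n, n))
        A = np.triu(A, 1); A = A + A.T                       # symmetric, no self-loops
        X = rng.normal(size=(n, p))
        y = rng.normal(size=n) + X @ np.array([1.0, -2.0, 0.5])
        alpha_hat, beta_hat = fit_network_cohesion(X, y, A, lam=5.0)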

    Application of modern statistical methods in worldwide health insurance

    With the increasing availability of internal and external data in the (health) insurance industry, the demand for new data insights from analytical methods is growing. This dissertation presents four examples of the application of advanced regression-based prediction techniques for claims and network management in health insurance: patient segmentation for and economic evaluation of disease management programs, fraud and abuse detection, and medical quality assessment. Based on different health insurance datasets, it is shown that tailored models and newly developed algorithms, such as Bayesian latent variable models, can optimize the business steering of health insurance companies. By incorporating and structuring medical and insurance knowledge, these tailored regression approaches can at least compete with machine learning and artificial intelligence methods while remaining more transparent and interpretable for business users. In all four examples, the methodology and outcomes of the applied approaches are discussed extensively from an academic perspective. Various comparisons with analytical and market best-practice methods also make it possible to judge the added value of the applied approaches from an economic perspective.
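    The transparency argument can be made concrete with a plain Poisson regression on simulated claim counts, sketched below with statsmodels: each coefficient is a log rate ratio that a business user can read off directly. This is a generic illustration with made-up covariates (standardized age, a chronic-condition flag), not one of the dissertation's Bayesian latent variable models.

        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(1)
        n = 1000
        # Hypothetical member-level covariates: standardized age and a chronic-disease flag.
        age = rng.normal(size=n)
        chronic = rng.binomial(1, 0.2, size=n)
        X = sm.add_constant(np.column_stack([age, chronic]))

        # Simulated annual claim counts.
        claims = rng.poisson(np.exp(-0.5 + 0.3 * age + 0.8 * chronic))

        # Plain Poisson GLM: every coefficient is a log rate ratio, so the effect of each
        # covariate on expected claim frequency can be read off and explained directly.
        result = sm.GLM(claims, X, family=sm.families.Poisson()).fit()
        print(result.params)           # effects on the log scale
        print(np.exp(result.params))   # multiplicative effects on claim frequency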