
    Calibrated imputation for multivariate categorical data

    Non-response is a major problem for anyone collecting and processing data. A commonly used technique to deal with missing data is imputation, where missing values are estimated and filled into the dataset. Imputation can become challenging if the variable to be imputed has to comply with a known total. Even more challenging is the case where several variables in the same dataset need to be imputed and, in addition to known totals, logical restrictions between variables have to be satisfied. In our paper, we develop an approach for a broad class of imputation methods for multivariate categorical data such that previously published totals are preserved while logical restrictions on the data are satisfied. The developed approach can be used in combination with any imputation model that estimates imputation probabilities, i.e. the probability that imputing a certain category for a variable in a certain unit leads to the correct value for this variable and unit.
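
    The following Python sketch is not the paper's algorithm, just a loose illustration of the idea: it imputes a missing categorical value for each unit from model-estimated imputation probabilities while forcing the imputed counts to match per-category quotas derived from a known total. All inputs are hypothetical, and the greedy allocation is a simple stand-in for a proper calibration method.

        import numpy as np

        # Hypothetical inputs (illustration only): five units with a missing
        # categorical value, model-estimated imputation probabilities over
        # three categories, and the number of imputations each category must
        # receive so that a previously published total is preserved.
        probs = np.array([
            [0.7, 0.2, 0.1],
            [0.1, 0.8, 0.1],
            [0.4, 0.4, 0.2],
            [0.3, 0.3, 0.4],
            [0.6, 0.3, 0.1],
        ])
        quota = np.array([2, 2, 1])  # required imputed counts per category

        # Greedy allocation: repeatedly take the (unit, category) pair with
        # the highest probability whose category still has quota remaining.
        assignment = [-1] * len(probs)
        remaining = quota.copy()
        pairs = np.column_stack(
            np.unravel_index(np.argsort(-probs, axis=None), probs.shape))
        for unit, cat in pairs:
            if assignment[unit] == -1 and remaining[cat] > 0:
                assignment[unit] = cat
                remaining[cat] -= 1

        print(assignment)  # imputed categories; totals preserved by construction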

    Efficient Benchmarking of Algorithm Configuration Procedures via Model-Based Surrogates

    The optimization of algorithm (hyper-)parameters is crucial for achieving peak performance across a wide range of domains, ranging from deep neural networks to solvers for hard combinatorial problems. The resulting algorithm configuration (AC) problem has attracted much attention from the machine learning community. However, the proper evaluation of new AC procedures is hindered by two key hurdles. First, AC benchmarks are hard to set up. Second, and even more significantly, they are computationally expensive: a single run of an AC procedure involves many costly runs of the target algorithm whose performance is to be optimized in a given AC benchmark scenario. One common workaround is to optimize cheap-to-evaluate artificial benchmark functions (e.g., Branin) instead of actual algorithms; however, these have different properties than realistic AC problems. Here, we propose an alternative benchmarking approach that is similarly cheap to evaluate but much closer to the original AC problem: replacing expensive benchmarks by surrogate benchmarks constructed from AC benchmarks. These surrogate benchmarks approximate the response surface corresponding to true target algorithm performance using a regression model, and the original and surrogate benchmarks share the same (hyper-)parameter space. In our experiments, we construct and evaluate surrogate benchmarks for hyperparameter optimization as well as for AC problems that involve performance optimization of solvers for hard combinatorial problems, drawing training data from the runs of existing AC procedures. We show that our surrogate benchmarks capture important characteristics of the AC scenarios from which they were derived, such as high- and low-performing regions, while being much easier to use and orders of magnitude cheaper to evaluate.
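
    As a rough illustration of the surrogate idea (a minimal sketch, not the authors' implementation), the snippet below fits a regression model to hypothetical (configuration, runtime) pairs harvested from past AC runs and then exposes its predictions as a cheap stand-in benchmark over the same parameter space; the data and the choice of a random forest are assumptions made for this sketch.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(0)

        # Hypothetical training data harvested from past AC runs: rows are
        # configurations in a 4-dimensional parameter space, targets are
        # measured runtimes of the target algorithm.
        X_train = rng.uniform(0, 1, size=(500, 4))
        y_train = np.log(1 + 10 * (X_train ** 2).sum(axis=1))  # stand-in runtimes

        # The surrogate: a regression model over the same parameter space.
        surrogate = RandomForestRegressor(n_estimators=100, random_state=0)
        surrogate.fit(X_train, y_train)

        def surrogate_benchmark(config):
            """Cheap stand-in for an expensive target-algorithm run."""
            return surrogate.predict(np.asarray(config).reshape(1, -1))[0]

        # An AC procedure can now be evaluated against surrogate_benchmark
        # at a tiny fraction of the cost of real target-algorithm runs.
        print(surrogate_benchmark([0.3, 0.7, 0.1, 0.9]))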

    Mathematical modelling and survival prediction in cancer

    Cancer is one of the leading causes of death, thus opening a vast need for extensive research and insights. The survival prospects, along with treatment benefits and costs (economic or health-related), can be predicted with tools from mathematical modelling and regression analysis. Promising results have been obtained in suggesting mono- and combination therapies that could potentially improve treatment strategies. Furthermore, multiple biological features have been recognized as important predictors of treatment outcome. However, since cancer remains a challenging and unpredictable enemy, more effective and personalized predictions and treatment suggestions are still needed. In this dissertation, various modelling approaches were used to predict cancer behavior, treatment outcomes and patient survival. An ordinary differential equation model was developed to investigate the changes in cancer cell density as different treatment regimens were applied. In addition, we included the immune system along with immunotherapy, since the immune response is an important part of cancer development and has the potential to eradicate tumors. It was noted that adaptive treatment resulted in a lower cancer burden and less time in treatment. In addition, combination treatments (immunotherapy with either chemo- or targeted therapy) generally resulted in a smaller cancer burden than monotherapies; however, the potential additional side effects of combining two therapies have to be considered. A metapopulation model of cancer development was developed, focusing on the emergence of angiogenesis and cancer cell emigration. We investigated under which conditions cancer cells would become angiogenic with or without treatment (anti-angiogenic, cytotoxic or combination). In general, contributing to angiogenesis was a desirable quality for cancer cells if no anti-angiogenic treatment was administered. With anti-angiogenic treatment, angiogenesis diminished; however, the risk of resistance to the anti-angiogenic treatment also increased. Two new regression methods were developed with a focus on survival prediction. A greedy budget-constrained Cox regression (Greedy Cox) utilizes an L2 penalty and considers the cost of the selected parameters; it was also compared to LASSO (L1) selection. The Optimal Subset CArdinality Regression (OSCAR) method was developed with an L0-pseudonorm penalty to provide sparse models. The costs of measuring the selected model features were also weighed against prediction accuracy. The methods were validated on clinical prostate cancer data, and it was noted that a comparable level of prediction accuracy was already reached with a few parameters, resulting in relatively low costs. All of the investigated methods also selected reasonable, cancer-related parameters such as prostate-specific antigen (PSA). Taken together, this dissertation provides comprehensive research on novel tools for modelling and predicting cancer behavior and patient survival. Important hallmarks of cancer development, such as the immune response and the angiogenic switch, have been included along with corresponding treatments that have the potential to change traditional treatment regimens.
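
    As a minimal illustration of the first ingredient, an ordinary differential equation model of cancer cell density under an on/off treatment schedule, the sketch below uses logistic growth plus a treatment kill term with hypothetical parameters; it is not the dissertation's actual model.

        from scipy.integrate import solve_ivp

        # Hypothetical parameters for illustration only.
        r, K = 0.3, 1.0   # logistic growth rate and carrying capacity
        kill = 0.5        # extra death rate while therapy is switched on

        def treatment_on(t):
            """Adaptive-style schedule: therapy in alternating 10-day blocks."""
            return 1.0 if int(t // 10) % 2 == 0 else 0.0

        def tumor(t, y):
            n = y[0]  # cancer cell density
            return [r * n * (1 - n / K) - kill * treatment_on(t) * n]

        # max_step keeps the solver from stepping over treatment switches.
        sol = solve_ivp(tumor, (0, 100), [0.1], max_step=1.0)
        print(sol.y[0, -1])  # cancer cell density at the end of the horizon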

    Financial risk management in shipping investment, a machine learning approach

    There has been a plethora of research into company credit risk and financial default prediction from both academics and financial professionals alike. However, only a limited volume of the literature has focused on international shipping company financial distress prediction, with previous research concentrating largely on classic linear modelling techniques. The gaps identified in this research demonstrate the need for increased effort to address the inherent nonlinear nature of shipping operations, as well as the noisy and incomplete composition of shipping company financial statement data. Furthermore, the gaps illustrate the need for a workable definition of financial distress, which to date has too often been classed only by the ultimate state of bankruptcy/insolvency. This definition prohibits the practical application of methodologies which should be aimed at the timely identification of financial distress, thereby allowing for remedial measures to be implemented to avoid ultimate financial collapse. This research contributes to the field by addressing these gaps through i) the creation of a machine learning based financial distress forecasting methodology and ii) utilising this as the foundation for the development of a software toolkit for financial distress prediction. This toolkit enables the financial risk principles embedded within the methodology to be readily integrated into an enterprise/corporate risk management system. The methodology and software were tested through the application of a bulk shipping company case study utilising 5000 bulk shipping company-year accounting observations for the period 2000-2018, in combination with market and macroeconomic data. The results demonstrate that the methodology improves the capture of distress correlations that traditional financial distress models have struggled to achieve. The methodology's capacity to adequately treat the problem of missing data in company financial statements was also validated. Finally, the results also highlight the successful application of the software toolkit for the development of a multi-model, real-time system which can enhance the financial monitoring of shipping companies by acting as a practical "early warning system" for financial distress.
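
    A minimal sketch of the kind of machine learning classifier the abstract describes, trained on a hypothetical company-year panel with missing financial statement fields; the column names, labels and data are invented for illustration, and gradient-boosted trees are used here because they tolerate missing inputs natively rather than because the thesis prescribes them.

        import numpy as np
        import pandas as pd
        from sklearn.ensemble import HistGradientBoostingClassifier
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(1)

        # Hypothetical company-year panel: financial ratios plus a distress
        # label defined short of outright bankruptcy.
        n = 5000
        X = pd.DataFrame({
            "leverage": rng.normal(0.5, 0.2, n),
            "liquidity": rng.normal(1.2, 0.4, n),
            "roa": rng.normal(0.05, 0.08, n),
        })
        y = ((X["leverage"] > 0.6) & (X["roa"] < 0)).astype(int)
        X = X.mask(rng.random(X.shape) < 0.15)  # incomplete statements stay NaN

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                                  random_state=1)

        # Gradient-boosted trees handle NaN inputs natively, sidestepping
        # explicit imputation of the missing statement fields.
        clf = HistGradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)
        print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))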

    Uncovering Hidden Diversity in Plants

    One of the greatest challenges to human civilization in the 21st century will be to provide global food security to a growing population while reducing the environmental footprint of agriculture. Despite increasing demand, the fundamental issue of limited genetic diversity in domesticated crops provides windows of opportunity for emerging pandemics and leaves modern crops insufficiently able to respond to a changing global environment. The wild relatives of crop plants, with large reservoirs of untapped genetic diversity, offer great potential to improve the resilience of elite cultivars. Utilizing this diversity requires advanced technologies to comprehensively identify genetic diversity and understand the genetic architecture of beneficial traits. The primary focus of this dissertation is developing computational tools to facilitate variant discovery and trait mapping for plant genomics. In Chapter 1, I benchmarked the performance of variant discovery algorithms on simulated and diverse plant datasets. The comparison of sequence aligners found that BWA-MEM consistently aligned the most plant reads with high accuracy, whereas Bowtie2 had a slightly higher overall accuracy. Variant callers, such as GATK HaplotypeCaller and SAMtools mpileup, were shown to differ significantly in their ability to minimize the frequency of false negatives and maximize the discovery of true positives. A cross-reference experiment with the Solanum lycopersicum and Solanum pennellii reference genomes revealed significant limitations of using a single reference genome for variant discovery. Next, I demonstrated that a machine-learning-based variant filtering strategy outperformed the traditional hard-cutoff filtering strategy, resulting in a significantly higher number of true-positive and fewer false-positive variants. Finally, I developed a 2-step imputation method that resulted in up to 60% higher accuracy than direct LD-based imputation methods. In Chapter 2, I focused on developing a trait mapping algorithm tailored for plants, considering the high levels of diversity found in plant datasets. This novel trait mapping framework, HapFM, can incorporate biological priors into the mapping model to identify causal haplotypes for traits of interest. Compared to conventional GWAS analyses, the haplotype-based approach significantly reduced the number of variables while aggregating small-effect SNPs to increase mapping power. HapFM can account for LD between haplotype segments to infer the causal haplotypes directly. Furthermore, HapFM can systemically incorporate biological priors into the probability function during the mapping process, resulting in greater mapping resolution. Overall, HapFM achieves a balance between power, interpretability, and verifiability. In Chapter 3, I developed a computational algorithm to select a pan-genome cohort that maximizes the haplotype representativeness of the cohort. Increasing evidence suggests that a single reference genome is often inadequate for plant diversity studies due to the extensive sequence and structural rearrangements found in many plant genomes. HapPS was developed to utilize local haplotype information to select the reference cohort. HapPS comprises three steps: genome-wide block partitioning, representative haplotype identification, and a genetic algorithm for reference cohort selection. The comparison of HapPS with global-distance-based selection showed that HapPS resulted in significantly higher block coverage in the highly diverse genic regions. The GO-term enrichment analysis of the highly diverse genic regions identified by HapPS showed enrichment for genes involved in defense pathways and abiotic stress, which might identify genomic regions involved in local adaptation. In summary, HapPS provides a systemic and objective solution to pan-genome cohort selection.
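
    As a toy illustration of the cohort selection problem described for HapPS, the sketch below picks accessions that maximize the number of block-haplotype pairs represented in the cohort. It uses a greedy heuristic as a simple stand-in for the genetic algorithm the abstract mentions, and all data are invented.

        import numpy as np

        rng = np.random.default_rng(2)

        # Hypothetical data: 50 candidate accessions x 200 genome blocks,
        # coded by which haplotype (0..4) each accession carries per block.
        haplotypes = rng.integers(0, 5, size=(50, 200))
        n_acc, n_blocks = haplotypes.shape
        cohort_size = 8

        covered = [set() for _ in range(n_blocks)]  # haplotypes seen per block
        cohort = []

        def gain(acc):
            """Block-haplotype pairs this accession would newly cover."""
            return sum(haplotypes[acc, b] not in covered[b]
                       for b in range(n_blocks))

        # Greedy stand-in for the cohort search: at each step add the
        # accession with the largest marginal haplotype coverage.
        for _ in range(cohort_size):
            best = max((a for a in range(n_acc) if a not in cohort), key=gain)
            cohort.append(best)
            for b in range(n_blocks):
                covered[b].add(haplotypes[best, b])

        coverage = np.mean([len(covered[b]) / len(set(haplotypes[:, b]))
                            for b in range(n_blocks)])
        print(cohort, round(float(coverage), 3))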

    Genomics and metabonomics in severe alcoholic hepatitis

    Severe alcoholic hepatitis is a florid presentation of alcohol-related liver disease and is associated with very high short-term mortality, in excess of 20% within 28 days. Severe alcoholic hepatitis occurs in a minority of patients who develop alcohol-related liver disease. A combination of genetic and environmental factors is likely to predispose to severe alcoholic hepatitis. To date, the clinical phenotype has not been extensively examined in candidate gene studies and has been the subject of a single, small genome-wide association study. A genome-wide association study of severe alcoholic hepatitis identified two loci potentially associated with the risk of developing severe alcoholic hepatitis: i) a strong association with PNPLA3, a well-recognised risk locus for alcohol-related liver disease, and ii) a novel but weaker association with SLC38A4, an amino acid transporter. The primary genetic variant at each locus was evaluated to determine whether it influenced disease phenotype or outcome. The primary variant in PNPLA3, rs738409, is a missense variant. Analyses indicated a deleterious effect of homozygosity on medium-term survival, in addition to more severe disease on baseline histology and a slower recovery in liver function over the short term, consistent with the established literature in alcohol-related cirrhosis. In contrast, the primary variant in SLC38A4, rs11183620, is intronic, with no clear evidence for an effect on gene expression or function. Analyses did not indicate an influence on histology, clinical phenotypes or outcomes. In light of the locus’ novelty, further work was undertaken to determine any potential contribution to disease pathogenesis. SLC38A4 was down-regulated in whole liver tissue in severe alcoholic hepatitis. Experiments with cell lines in culture suggested the pro-inflammatory cytokine IL-1 as a potential driver. SLC38A4 knockdown resulted in upregulation of some cellular responses associated with nutrient deprivation. There was no influence of the variant on serum amino acid profiles. The functional significance of SLC38A4 down-regulation remains the subject of ongoing work.

    The prediction of HLA genotypes from next generation sequencing and genome scan data

    Genome-wide association studies have very successfully found highly significant disease associations with single nucleotide polymorphisms (SNPs) in the Major Histocompatibility Complex for adverse drug reactions, autoimmune diseases and infectious diseases. However, the extensive linkage disequilibrium in the region has made it difficult to unravel the HLA alleles underlying these diseases. Here I present two methods to comprehensively predict 4-digit HLA types from the two types of experimental genome data widely available. The Virtual SNP Imputation approach was developed for genome scan data and demonstrated high precision and recall (96% and 97%, respectively) for the prediction of HLA genotypes. A reanalysis of 6 genome-wide association studies using the HLA imputation method identified 18 significant HLA allele associations for 6 autoimmune diseases: 2 in ankylosing spondylitis, 2 in autoimmune thyroid disease, 2 in Crohn's disease, 3 in multiple sclerosis, 2 in psoriasis and 7 in rheumatoid arthritis. The EPIGEN consortium also used the Virtual SNP Imputation approach to detect a novel association of HLA-A*31:01 with adverse reactions to carbamazepine. For the prediction of HLA genotypes from next generation sequencing data, I developed a novel approach using a naïve Bayes algorithm called HLA-Genotyper. The validation covered whole genome, whole exome and RNA-Seq experimental designs in the European and Yoruba population samples available from the 1000 Genomes Project. The RNA-Seq data gave the best results, with an overall precision and recall near 0.99 for Europeans and 0.98 for the Yoruba population. I then successfully used the method on targeted sequencing data to detect significant associations of idiopathic membranous nephropathy with HLA-DRB1*03:01 and HLA-DQA1*05:01, using the 1000 Genomes European subjects as controls. Using the results reported here, researchers may now readily unravel the association of HLA alleles with many diseases from genome scans and next generation sequencing experiments without the expensive and laborious HLA typing of thousands of subjects. Both algorithms enable the analysis of diverse populations to help researchers pinpoint HLA loci with biological roles in infection, inflammation, autoimmunity, aging, mental illness and adverse drug reactions.
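
    A toy sketch of naïve Bayes genotype calling in the spirit of the description above (not HLA-Genotyper's actual model or data): each candidate genotype is scored by a prior times the product over reads of the probability of the read, assuming the read came from one of the genotype's two alleles. All alleles, read likelihoods and priors below are invented.

        import math
        from itertools import combinations_with_replacement

        # Hypothetical per-read match probabilities against each candidate
        # HLA allele, for four sequencing reads.
        alleles = ["A*01:01", "A*02:01", "A*31:01"]
        read_like = {
            "A*01:01": [0.9, 0.1, 0.8, 0.2],
            "A*02:01": [0.1, 0.9, 0.2, 0.9],
            "A*31:01": [0.05, 0.1, 0.1, 0.1],
        }
        prior = {a: 1 / len(alleles) for a in alleles}  # flat allele prior

        def log_posterior(a1, a2):
            """Naive Bayes score: each read is drawn from either allele."""
            lp = math.log(prior[a1]) + math.log(prior[a2])
            for p1, p2 in zip(read_like[a1], read_like[a2]):
                lp += math.log(0.5 * p1 + 0.5 * p2)  # read from either copy
            return lp

        best = max(combinations_with_replacement(alleles, 2),
                   key=lambda g: log_posterior(*g))
        print(best)  # most probable genotype under the toy model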