
    Empirical Bayes conditional density estimation

    The problem of nonparametric estimation of the conditional density of a response, given a vector of explanatory variables, is classical and of prominent importance in many prediction problems, since the conditional density provides a more comprehensive description of the association between the response and the predictor than, for instance, the regression function does. The problem has applications across fields such as economics, actuarial science, and medicine. We investigate empirical Bayes estimation of conditional densities, establishing that an automatic data-driven selection of the prior hyper-parameters in infinite mixtures of Gaussian kernels, with predictor-dependent mixing weights, can lead to estimators whose performance is on par with that of frequentist estimators: they are minimax-optimal (up to logarithmic factors) rate adaptive over classes of locally Hölder smooth conditional densities, and they perform an adaptive dimension reduction if the response is independent of (some of) the explanatory variables, which, containing no information about the response, are irrelevant to the purpose of estimating its conditional density.
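    As a rough illustration (not the authors' exact prior specification), the model class described above can be written as a predictor-dependent mixture of Gaussian kernels; the particular construction of the weights is an assumption for exposition only.

```latex
% Schematic predictor-dependent infinite Gaussian mixture for the conditional density.
% The construction of the weights w_j(x) (e.g. a stick-breaking process with
% predictor-dependent sticks) is assumed here, not taken from the abstract.
\[
  f(y \mid x) \;=\; \sum_{j=1}^{\infty} w_j(x)\,
      \mathcal{N}\!\left(y \mid \mu_j, \sigma_j^2\right),
  \qquad \sum_{j=1}^{\infty} w_j(x) = 1 \quad \text{for every } x.
\]
% Empirical Bayes: the prior hyper-parameters governing the kernels and the
% weights are selected from the data rather than fixed in advance.
```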

    A model-based approach to selection of tag SNPs

    BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are the most common type of polymorphism found in the human genome. Effective genetic association studies require the identification of sets of tag SNPs that capture as much haplotype information as possible. Tag SNP selection is analogous to the problem of data compression in information theory. According to Shannon's framework, the optimal tag set maximizes the entropy of the tag SNPs subject to constraints on the number of SNPs. This approach requires an appropriate probabilistic model. Compared to simple measures of Linkage Disequilibrium (LD), a good model of haplotype sequences can more accurately account for LD structure. It also provides machinery for predicting tagged SNPs, and thereby for assessing the performance of tag sets through their ability to predict larger SNP sets. RESULTS: Here, we compute the description code-lengths of SNP data for an array of models and develop tag SNP selection methods based on these models and the strategy of entropy maximization. Using data sets from the HapMap and ENCODE projects, we show that the hidden Markov model introduced by Li and Stephens outperforms the other models in several aspects: description code-length of SNP data, information content of tag sets, and prediction of tagged SNPs. This is the first use of this model in the context of tag SNP selection. CONCLUSION: Our study provides strong evidence that the tag sets selected by our best method, based on the Li and Stephens model, outperform those chosen by several existing methods. The results also suggest that information content evaluated with a good model is a more sensitive measure of the quality of a tagging set than the correct prediction rate of tagged SNPs. We also show that haplotype phase uncertainty has an almost negligible impact on the ability of good tag sets to predict tagged SNPs. This justifies the selection of tag SNPs on the basis of haplotype informativeness, even though genotyping studies do not directly assess haplotypes. Software that implements our approach is available.
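    A minimal sketch of the entropy-maximization idea, under simplifying assumptions: the joint entropy is computed empirically from haplotype counts (rather than with the Li and Stephens hidden Markov model), and the search is greedy rather than exact. Function names and the toy haplotypes are illustrative only.

```python
from collections import Counter
from math import log2

def joint_entropy(haplotypes, snp_indices):
    """Empirical joint entropy (in bits) of the alleles at the chosen SNP positions."""
    counts = Counter(tuple(h[i] for i in snp_indices) for h in haplotypes)
    n = len(haplotypes)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def greedy_tag_selection(haplotypes, n_tags):
    """Greedily add the SNP whose inclusion most increases the tag set's joint entropy."""
    n_snps = len(haplotypes[0])
    tags = []
    for _ in range(n_tags):
        candidates = [j for j in range(n_snps) if j not in tags]
        best = max(candidates, key=lambda j: joint_entropy(haplotypes, tags + [j]))
        tags.append(best)
    return tags

# Toy haplotypes: strings of 0/1 alleles at 6 SNP positions.
haps = ["010110", "010110", "101001", "101001", "011110", "100001"]
print(greedy_tag_selection(haps, n_tags=2))
```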

    Comparison of tagging single-nucleotide polymorphism methods in association analyses

    Several methods to identify tagging single-nucleotide polymorphisms (SNPs) are in common use for genetic epidemiologic studies; however, there may be loss of information when using only a subset of SNPs. We sought to compare the ability of commonly used pairwise, multimarker, and haplotype-based tagging SNP selection methods to detect known associations with quantitative expression phenotypes. Using data from HapMap release 21 on unrelated Utah residents with ancestors from northern and western Europe (CEPH-Utah, CEU), we selected tagging SNPs in five chromosomal regions using ldSelect, Tagger, and TagSNPs. We found that the SNP subsets did not substantially overlap and that the use of trio data did not greatly impact SNP selection. We then tested associations between HapMap genotypes and expression phenotypes on 28 CEU individuals as part of Genetic Analysis Workshop 15. Relative to the use of all SNPs (n = 210 SNPs across all regions), most subset methods were able to detect single-SNP and haplotype associations. Generally, the pairwise selection approaches worked extremely well relative to the use of all SNPs, with marked reductions in the number of SNPs required. Haplotype-based approaches, which identified smaller SNP subsets, missed associations in some regions. We conclude that the optimal tagging SNP method depends on the true model of the genetic association (i.e., whether a SNP or a haplotype is responsible); unfortunately, this is often unknown at the time of SNP selection. Additional evaluations using empirical and simulated data are needed.
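    For concreteness, a sketch of pairwise tagging in the spirit of ldSelect: greedy r² binning in which a tag must exceed an r² threshold with every SNP in its bin. This is a simplified illustration, not the actual ldSelect, Tagger, or TagSNPs implementations.

```python
import numpy as np

def r2(a, b):
    """Squared Pearson correlation between two 0/1 allele vectors."""
    if a.std() == 0 or b.std() == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1] ** 2)

def pairwise_tag_bins(haps, threshold=0.8):
    """Greedy binning: repeatedly pick the SNP that tags the most remaining SNPs
    at r^2 >= threshold and remove its bin (in the spirit of ldSelect)."""
    untagged = set(range(haps.shape[1]))
    bins = []
    while untagged:
        best_tag, best_bin = None, set()
        for i in untagged:
            covered = {i} | {j for j in untagged
                             if r2(haps[:, i], haps[:, j]) >= threshold}
            if len(covered) > len(best_bin):
                best_tag, best_bin = i, covered
        bins.append((best_tag, best_bin))
        untagged -= best_bin
    return bins

# Toy haplotype matrix: rows are haplotypes, columns are SNPs (0/1 alleles).
haps = np.array([[0, 0, 1, 1, 0],
                 [0, 0, 1, 1, 0],
                 [1, 1, 0, 0, 1],
                 [1, 1, 0, 0, 1],
                 [0, 1, 1, 0, 0]])
for tag, members in pairwise_tag_bins(haps):
    print("tag SNP", tag, "covers SNPs", sorted(members))
```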

    Precision medicine in type 2 diabetes: Using individualised prediction models to optimise selection of treatment

    This is the author accepted manuscript. The final version is available from the American Diabetes Association via the DOI in this record. Despite the known heterogeneity of type 2 diabetes and variable response to glucose-lowering medications, current evidence on optimal treatment is predominantly based on average effects in clinical trials rather than individual-level characteristics. A precision medicine approach based on treatment response would aim to improve on this by identifying predictors of differential drug response for people based on their characteristics and then using this information to select optimal treatment. Recent research has demonstrated robust and clinically relevant differential drug response with all noninsulin treatments after metformin (sulfonylureas, thiazolidinediones, dipeptidyl peptidase 4 [DPP-4] inhibitors, glucagon-like peptide-1 [GLP-1] receptor agonists, and sodium–glucose cotransporter 2 [SGLT2] inhibitors) using routinely available clinical features. This Perspective reviews the current evidence and discusses how differences in drug response could inform selection of optimal type 2 diabetes treatment in the near future. It presents a novel framework for developing and testing precision medicine–based strategies to optimize treatment, harnessing existing routine clinical and trial data sources. This framework was recently applied to demonstrate that "subtype" approaches, in which people are classified into subgroups based on features reflecting underlying pathophysiology, are likely to have less clinical utility than approaches that combine the same features as continuous measures in probabilistic "individualized prediction" models. Funding: Research England; Medical Research Council (MRC).
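    One way to picture an "individualized prediction" model of differential drug response is a regression with treatment-by-feature interactions, so the predicted benefit varies continuously with patient characteristics rather than by discrete subtype. The sketch below uses simulated values and generic feature names; it is not the framework or data described in the article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500

# Hypothetical routine clinical features (e.g. standardised BMI, baseline HbA1c)
# and a binary indicator for which of two drugs was received; simulated, not real data.
X = rng.normal(size=(n, 2))
drug = rng.integers(0, 2, size=n)
hba1c_change = -0.5 - 0.3 * drug + 0.4 * drug * X[:, 0] + rng.normal(scale=0.5, size=n)

# Main effects plus drug-by-feature interactions: the predicted advantage of
# drug 1 over drug 0 depends on the individual's characteristics.
design = np.column_stack([X, drug[:, None], drug[:, None] * X])
model = LinearRegression().fit(design, hba1c_change)

# Predicted differential response per patient: outcome on drug 1 minus drug 0
# (more negative = greater predicted HbA1c lowering on drug 1).
d1 = np.column_stack([X, np.ones((n, 1)), X])
d0 = np.column_stack([X, np.zeros((n, 1)), np.zeros_like(X)])
benefit = model.predict(d1) - model.predict(d0)
print("patients predicted to do better on drug 1:", int((benefit < 0).sum()))
```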

    The Systemic Imprint of Growth and Its Uses in Ecological (Meta)Genomics

    Microbial minimal generation times range from a few minutes to several weeks. They are evolutionarily determined by variables such as environment stability, nutrient availability, and community diversity. Selection for fast growth adaptively imprints genomes, resulting in gene amplification, adapted chromosomal organization, and biased codon usage. We found that these growth-related traits in 214 species of bacteria and archaea are highly correlated, suggesting that they all result from growth optimization. While modeling their association with maximal growth rates in view of synthetic biology applications, we observed that codon usage biases are better correlates of growth rates than any other trait, including rRNA copy number. Systematic deviations from our model reveal two distinct evolutionary processes. First, genome organization shows more evolutionary inertia than growth rates. This results in over-representation of growth-related traits in fast-degrading genomes. Second, selection for these traits depends on optimal growth temperature: for similar generation times, purifying selection is strongest in psychrophiles, intermediate in mesophiles, and weakest in thermophiles. Using this information, we created a predictor of maximal growth rate adapted to small genome fragments. We applied it to three metagenomic environmental samples to show that a transiently rich environment, such as the human gut, selects for fast growers; that a toxic environment, such as the acid mine biofilm, selects for low growth rates; and that a diverse environment, such as soil, shows the full range of growth rates. We also demonstrate that microbial colonizers of the infant gut grow faster than the stabilized gut communities of human adults. In conclusion, we show that maximal growth rates can be predicted from sequence data alone, and we propose that such information can be used to facilitate the manipulation of generation times. Our predictor allows growth rates to be inferred for the vast majority of uncultivable prokaryotes and paves the way to an understanding of community dynamics from metagenomic data.
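    A toy version of the codon-usage idea: score a gene by how far its codon frequencies deviate from background usage, the kind of signal a growth-rate predictor could exploit on small genome fragments. The scoring function and sequences below are illustrative assumptions, not the authors' published index.

```python
from collections import Counter
from math import log

def codon_counts(seq):
    """Count codons in an in-frame coding sequence (length assumed divisible by 3)."""
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))

def codon_bias(gene_seq, reference_freqs):
    """Kullback-Leibler-style divergence of a gene's codon frequencies from
    genome-wide reference frequencies; higher values indicate stronger enrichment
    of preferred codons (as expected in highly expressed genes of fast growers)."""
    counts = codon_counts(gene_seq)
    total = sum(counts.values())
    score = 0.0
    for codon, c in counts.items():
        p_gene = c / total
        p_ref = reference_freqs.get(codon, 1e-6)
        score += p_gene * log(p_gene / p_ref)
    return score

# Toy usage: compare a (hypothetical) ribosomal-protein gene against background usage.
background = {"AAA": 0.25, "AAG": 0.25, "GAA": 0.25, "GAG": 0.25}
ribosomal_gene = "AAAAAAAAGAAA"  # heavily skewed toward AAA
print(codon_bias(ribosomal_gene, background))
```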

    Information and optimisation in investment and risk measurement

    The thesis explores applications of optimisation in investment management and risk measurement. In investment management, the information issues are largely concerned with generating optimal forecasts. It is difficult to obtain inputs that have the properties they are supposed to have, so optimisation is prone to 'garbage in, garbage out', which leads to substantial biases in portfolio selection unless forecasts are suitably adjusted for estimation error. We consider three case studies in which we investigate the impact of forecast error on portfolio performance and examine ways of adjusting for the resulting bias. Treynor and Black (1973) first tried to make the best possible use of the information provided by security analysis, building on Markowitz (1952) portfolio selection. They established a relationship between the correlation of forecasts, the number of independent securities available, and the Sharpe ratio that can be obtained. Their analysis was based on the assumption that the correlation between forecasts and outcomes is known precisely. In practice, given the low levels of correlation possible, investors may believe themselves to have a different degree of correlation from the one they actually have. Using two different metrics, we explore how portfolio performance depends on both the anticipated and the realised correlation when these differ. One measure, the Sharpe ratio, captures the efficiency loss attributable to the change in reward for risk. The other, the Generalised Sharpe Ratio (GSR), introduced by Hodges (1997), quantifies the reduction in the welfare of a particular investor due to adopting an inappropriate risk profile. We show that these two metrics, the Sharpe ratio and the GSR, complement each other and in combination provide a fair ranking of existing investment opportunities. Bayesian adjustment is a popular way of dealing with estimation error in portfolio selection. In a Bayesian implementation, we study how to use non-sample information to infer the optimal scaling of unknown forecasts of asset returns in the presence of uncertainty about the quality of our information, and how the efficient use of information affects portfolio decisions. Optimal portfolios derived under full use of information differ strikingly from those derived from the sample information only; the latter, unlike the former, are highly affected by estimation error and favour holdings several (up to ten) times larger. The impact of estimation error in a dynamic setting is particularly severe because of the complexity of a setting in which time-varying forecasts are required. We take the structure of Brennan, Schwartz and Lagnado (1997) as a specific illustration of a generic problem and investigate the bias in long-term portfolio selection models that comes from optimisation with (unadjusted) parameters estimated from historical data. Using a Monte Carlo simulation analysis, we quantify the degree of bias in the optimisation approach of Brennan, Schwartz and Lagnado. We find that estimated parameters make an investor believe in investment opportunities five times larger than they actually are. A mild real time-variation in opportunities is also wildly inflated when measured with estimated parameters.

    In the latter part of the thesis we look at slightly less straightforward optimisation applications in risk measurement, which arise in reporting risk. We ask: what is the most efficient way of complying with the rules? In other words, we investigate how to report the smallest exposure within a rule. For this purpose we develop two efficient algorithms that calculate the minimal position risk requirement needed to cover a firm's open positions and obligations, as specified by the respective rules in the FSA (Financial Services Authority) Handbook. Both algorithms lead to interesting generalisations.
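    A minimal Monte Carlo sketch of the 'garbage in, garbage out' effect discussed above: plugging noisy estimated means and covariances into a mean-variance optimiser makes the perceived (in-sample) Sharpe ratio much larger than the true maximal one. Parameter values are hypothetical and unrelated to the thesis data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_assets, n_obs, n_trials = 10, 60, 2000

# Hypothetical true world: i.i.d. normal monthly returns with identical modest means.
true_mean, vol = 0.002, 0.05
true_mu = np.full(n_assets, true_mean)
true_cov = np.eye(n_assets) * vol**2
# Maximum achievable (annualised) Sharpe ratio under the true parameters.
true_sharpe = np.sqrt(true_mu @ np.linalg.solve(true_cov, true_mu) * 12)

perceived = []
for _ in range(n_trials):
    r = rng.multivariate_normal(true_mu, true_cov, size=n_obs)
    mu_hat = r.mean(axis=0)               # noisy estimated means
    cov_hat = np.cov(r, rowvar=False)     # estimated covariance
    # In-sample Sharpe ratio of the mean-variance optimal portfolio built from
    # the *estimated* parameters -- what the optimiser believes it can earn.
    perceived.append(np.sqrt(mu_hat @ np.linalg.solve(cov_hat, mu_hat) * 12))

print(f"true maximal Sharpe ratio     : {true_sharpe:.2f}")
print(f"median perceived Sharpe ratio : {np.median(perceived):.2f}")
```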

    A machine learning pipeline for quantitative phenotype prediction from genotype data

    Background: Quantitative phenotypes emerge everywhere in systems biology and biomedicine, owing to a direct interest in quantitative traits or to high individual variability that makes it hard or impossible to classify samples into distinct categories, as is often the case with complex common diseases. Machine learning approaches to genotype-phenotype mapping may significantly improve Genome-Wide Association Studies (GWAS) results by explicitly focusing on predictivity and optimal feature selection in a multivariate setting. It is, however, essential that stringent and well-documented Data Analysis Protocols (DAPs) are used to control sources of variability and ensure reproducibility of results. We present a genome-to-phenotype pipeline of machine learning modules for quantitative phenotype prediction. The pipeline can be applied for the direct use of whole-genome information in functional studies. As a realistic example, we consider the problem of fitting complex phenotypic traits in heterogeneous stock mice from single nucleotide polymorphisms (SNPs). Methods: The core element of the pipeline is the L1L2 regularization method based on the naïve elastic net. The method provides, at the same time, a regression model and a dimensionality reduction procedure suitable for correlated features. Model and SNP markers are selected through a DAP originally developed in the MAQC-II collaborative initiative of the U.S. FDA for the identification of clinical biomarkers from microarray data. The L1L2 approach is compared with standard Support Vector Regression (SVR) and with Reversible Jump Markov Chain Monte Carlo (MCMC). Algebraic indicators of the stability of partial lists are used for model selection; the final panel of markers is obtained by a procedure at the chromosome scale, termed 'saturation', to recover SNPs in Linkage Disequilibrium with those selected. Results: Accuracies comparable to both MCMC and SVR are obtained by the L1L2 pipeline. Good agreement is also found between the SNPs selected by the L1L2 algorithms and candidate loci previously identified by a standard GWAS. The combination of L1L2-based feature selection with the saturation procedure tackles the issue of neglecting highly correlated features that affects many feature selection algorithms. Conclusions: The L1L2 pipeline has proven effective in terms of marker selection and prediction accuracy. This study indicates that machine learning techniques may support quantitative phenotype prediction, provided that adequate DAPs are employed to control bias in model selection.
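    The abstract describes L1L2 regularization based on the naïve elastic net; as a stand-in sketch (not the authors' pipeline or DAP), scikit-learn's ElasticNetCV can illustrate joint sparse marker selection and phenotype prediction on simulated SNP data.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_mice, n_snps = 300, 1000

# Toy genotypes coded 0/1/2 (minor allele counts) and a quantitative phenotype
# driven by a handful of causal SNPs; purely simulated, not heterogeneous stock data.
X = rng.integers(0, 3, size=(n_mice, n_snps)).astype(float)
causal = rng.choice(n_snps, size=10, replace=False)
y = X[:, causal] @ rng.normal(0.5, 0.1, size=10) + rng.normal(size=n_mice)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The elastic net combines L1 (sparsity) and L2 (grouping of correlated SNPs)
# penalties; alpha and l1_ratio are chosen by internal cross-validation.
model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5).fit(X_tr, y_tr)

selected = np.flatnonzero(model.coef_)
print(f"test R^2: {model.score(X_te, y_te):.2f}")
print(f"{len(selected)} SNPs selected; "
      f"{len(set(selected) & set(causal))} of the 10 causal SNPs recovered")
```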

    Causes of variability in latent phenotypes of childhood wheeze

    Background: Latent class analysis (LCA) has been used extensively to identify (latent) phenotypes of childhood wheezing. However, the number and trajectory of discovered phenotypes have differed substantially between studies. Objective: We sought to investigate sources of variability affecting the classification of phenotypes, identify key time points for data collection to understand wheeze heterogeneity, and ascertain the association of childhood wheeze phenotypes with asthma and lung function in adulthood. Methods: We used LCA to derive wheeze phenotypes among 3167 participants in the ALSPAC cohort who had complete information on current wheeze recorded at 14 time points from birth to age 16½ years. We examined the effects of sample size and of the age and spacing of data collection on the results, and identified the most informative time points. We examined the associations of the derived phenotypes with asthma and lung function at age 23 to 24 years. Results: A relatively large sample size (>2000) underestimated the number of phenotypes under some conditions (e.g., fewer than 11 time points). Increasing the number of data points resulted in an increase in the optimal number of phenotypes, but an identical number of randomly selected follow-up points led to different solutions. A variable selection algorithm identified 8 informative time points (months 18, 42, 57, 81, 91, 140, 157, and 166). The proportion of asthmatic patients at age 23 to 24 years differed between phenotypes, and lung function was lower among persistent wheezers. Conclusions: Sample size and the frequency and timing of data collection have a major influence on the number and type of wheeze phenotypes identified by LCA in longitudinal data.
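    The LCA step can be illustrated with a small Bernoulli-mixture EM and BIC-based choice of the number of classes; this generic sketch runs on simulated wheeze indicators and is not the ALSPAC analysis or the software used in the study.

```python
import numpy as np

def fit_latent_classes(Y, k, n_iter=200, seed=0):
    """EM for a latent class model: binary wheeze indicators at each time point,
    conditionally independent given class membership (a Bernoulli mixture)."""
    rng = np.random.default_rng(seed)
    n, t = Y.shape
    pi = np.full(k, 1.0 / k)                       # class proportions
    theta = rng.uniform(0.25, 0.75, size=(k, t))   # P(wheeze at time j | class)
    for _ in range(n_iter):
        # E-step: responsibilities from the log-likelihood of each profile under each class.
        log_lik = (Y @ np.log(theta).T) + ((1 - Y) @ np.log(1 - theta).T) + np.log(pi)
        log_lik -= log_lik.max(axis=1, keepdims=True)
        resp = np.exp(log_lik)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update class proportions and per-class wheeze probabilities.
        pi = resp.mean(axis=0)
        theta = np.clip((resp.T @ Y) / resp.sum(axis=0)[:, None], 1e-4, 1 - 1e-4)
    ll = np.log(np.exp((Y @ np.log(theta).T) + ((1 - Y) @ np.log(1 - theta).T)) @ pi).sum()
    n_params = (k - 1) + k * t
    return -2 * ll + n_params * np.log(n)          # BIC: lower is better

# Toy longitudinal data: 0/1 wheeze at 14 time points for 500 children (simulated).
rng = np.random.default_rng(1)
Y = (rng.random((500, 14)) < np.where(rng.random((500, 1)) < 0.3, 0.7, 0.1)).astype(float)
for k in range(2, 6):
    print(k, "classes -> BIC", round(fit_latent_classes(Y, k), 1))
```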