15 research outputs found
Semimartingale decomposition of convex functions of continuous semimartingales by Brownian perturbation
In this note we prove that the local martingale part of a convex function f
of a d-dimensional semimartingale X = M + A can be written in terms of an Itô
stochastic integral ∫H(X)dM, where H(x) is some particular measurable
choice of subgradient of f at x, and M is the martingale part of X. This result
was first proved by Bouleau in [2]. Here we present a new treatment of the
problem. We first prove the result for X' = X + ϵB, ϵ > 0, where B is a
standard Brownian motion, and then pass to the limit as ϵ tends to 0, using
results in [1] and [4]. Comment: 16 pages. Re-submitted to ESAIMPS December, 201
Multi-tissue transcriptome-wide association studies.
A transcriptome-wide association study (TWAS) attempts to identify disease associated genes by imputing gene expression into a genome-wide association study (GWAS) using an expression quantitative trait loci (eQTL) data set and then testing for associations with a trait of interest. Regulatory processes may be shared across related tissues and one natural extension of TWAS is harnessing cross-tissue correlation in gene expression to improve prediction accuracy. Here, we studied multi-tissue extensions of lasso regression and random forests (RF), joint lasso and RF-MTL (multi-task learning RF), respectively. We found that, on our chosen eQTL data set, multi-tissue methods were generally more accurate than their single-tissue counterparts, with RF-MTL performing the best. Simulations showed that these benefits generally translated into more associated genes identified, although they also highlighted that joint lasso had a tendency to erroneously identify genes in one tissue if there existed an eQTL signal for that gene in another. Applying the four methods to a type 1 diabetes GWAS, we found that multi-tissue methods found more unique associated genes for most of the tissues considered. We conclude that multi-tissue methods are competitive and, for some cell types, superior to single-tissue approaches and hold much promise for TWAS studies.
An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat
Abstract: In phenotype prediction the physical characteristics of an organism are predicted from knowledge of its genotype and environment. Such studies, often called genome-wide association studies, are of the highest societal importance, as they are central to medicine, crop-breeding, etc. We investigated three phenotype prediction problems: one simple and clean (yeast), and the other two complex and real-world (rice and wheat). We compared standard machine learning methods (elastic net, ridge regression, lasso regression, random forest, gradient boosting machines (GBM), and support vector machines (SVM)) with two state-of-the-art classical statistical genetics methods: genomic BLUP and a two-step sequential method based on linear regression. Additionally, using the clean yeast data, we investigated how performance varied with the complexity of the biological mechanism, the amount of observational noise, the number of examples, the amount of missing data, and the use of different data representations. We found that for almost all the phenotypes considered, standard machine learning methods outperformed the methods from classical statistical genetics. On the yeast problem, the most successful method was GBM, followed by lasso regression, and the two statistical genetics methods; with greater mechanistic complexity GBM was best, while in simpler cases lasso was superior. In the wheat and rice studies the best two methods were SVM and BLUP. The most robust method in the presence of noise, missing data, etc. was random forests. The classical statistical genetics method of genomic BLUP was found to perform well on problems where there was population structure. This suggests that standard machine learning methods need to be refined to include population structure information when this is present. We conclude that the application of machine learning methods to phenotype prediction problems holds great promise, but that determining which methods are likely to perform well on any given problem is elusive and non-trivial.
Functional effects of variation in transcription factor binding highlight long-range gene regulation by epromoters.
Identifying DNA cis-regulatory modules (CRMs) that control the expression of specific genes is crucial for deciphering the logic of transcriptional control. Natural genetic variation can point to the possible gene regulatory function of specific sequences through their allelic associations with gene expression. However, comprehensive identification of causal regulatory sequences in brute-force association testing without incorporating prior knowledge is challenging due to limited statistical power and effects of linkage disequilibrium. Sequence variants affecting transcription factor (TF) binding at CRMs have a strong potential to influence gene regulatory function, which provides a motivation for prioritizing such variants in association testing. Here, we generate an atlas of CRMs showing predicted allelic variation in TF binding affinity in human lymphoblastoid cell lines and test their association with the expression of their putative target genes inferred from Promoter Capture Hi-C and immediate linear proximity. We reveal >1300 CRM TF-binding variants associated with target gene expression, the majority of them undetected with standard association testing. A large proportion of CRMs showing associations with the expression of genes they contact in 3D localize to the promoter regions of other genes, supporting the notion of 'epromoters': dual-action CRMs with promoter and distal enhancer activity
Stochastic search and joint fine-mapping increases accuracy and identifies previously unreported associations in immune-mediated diseases
Abstract: Thousands of genetic variants are associated with human disease risk, but linkage disequilibrium (LD) hinders fine-mapping the causal variants. Both lack of power, and joint tagging of two or more distinct causal variants by a single non-causal SNP, lead to inaccuracies in fine-mapping, with stochastic search more robust than stepwise. We develop a computationally efficient multinomial fine-mapping (MFM) approach that borrows information between diseases in a Bayesian framework. We show that MFM has greater accuracy than single disease analysis when shared causal variants exist, and negligible loss of precision otherwise. MFM analysis of six immune-mediated diseases reveals causal variants undetected in individual disease analysis, including in IL2RA where we confirm functional effects of multiple causal variants using allele-specific expression in sorted CD4+ T cells from genotype-selected individuals. MFM has the potential to increase fine-mapping resolution in related diseases enabling the identification of associated cellular and molecular phenotypes
Genetic dissection of the tissue‐specific roles of type III effectors and phytotoxins in the pathogenicity of Pseudomonas syringae pv. syringae to cherry
When compared with other phylogroups (PGs) of the Pseudomonas syringae species complex, P. syringae pv. syringae (Pss) strains within PG2 have a reduced repertoire of type III effectors (T3Es) but produce several phytotoxins. Effectors within the cherry pathogen Pss 9644 were grouped based on their frequency in strains from Prunus as the conserved effector locus (CEL) common to most P. syringae pathogens; a core of effectors common to PG2; a set of PRUNUS effectors common to cherry pathogens; and a FLEXIBLE set of T3Es. Pss 9644 also contains gene clusters for biosynthesis of toxins syringomycin, syringopeptin and syringolin A. After confirmation of virulence gene expression, mutants with a sequential series of T3E and toxin deletions were pathogenicity tested on wood, leaves and fruits of sweet cherry (Prunus avium) and leaves of ornamental cherry (Prunus incisa). The toxins had a key role in disease development in fruits but were less important in leaves and wood. An effectorless mutant retained some pathogenicity to fruit but not wood or leaves. Striking redundancy was observed amongst effector groups. The CEL effectors have important roles during the early stages of leaf infection and possibly acted synergistically with toxins in all tissues. Deletion of separate groups of T3Es had more effect in P. incisa than in P. avium. Mixed inocula were used to complement the toxin mutations in trans and indicated that strain mixtures may be important in the field. Our results highlight the niche‐specific role of toxins in P. avium tissues and the complexity of effector redundancy in the pathogen Pss 9644
Implementation of genomic prediction in Lolium perenne (L.) breeding populations
Perennial ryegrass (Lolium perenne L.) is one of the most widely grown forage grasses in temperate agriculture. In order to maintain and increase its usage as forage in livestock agriculture, there is a continued need for improvement in biomass yield, quality, disease resistance and seed yield. Genetic gain for traits such as biomass yield has been relatively modest. This has been attributed to its long breeding cycle, and the necessity to use population based breeding methods. Thanks to recent advances in genotyping techniques there is increasing interest in genomic selection from which genomically estimated breeding values (GEBV) are derived. In this paper we compare the classical RRBLUP model with state-of-the-art machine learning (ML) techniques that should yield themselves easily to use in GS and demonstrate their application to predicting quantitative traits in a breeding population of L. perenne. Prediction accuracies varied from 0 to 0.59 depending on trait, prediction model and composition of the training population. The BLUP model produced the highest prediction accuracies for most traits and training populations. Forage quality traits had the highest accuracies compared to yield related traits. There appeared to be no clear pattern to the effect of the training population composition on the prediction accuracies. The heritability of the forage quality traits was generally higher than for the yield related traits, and could partly explain the difference in accuracy. Some population structure was evident in the breeding populations, and probably contributed to the varying effects of training population on the predictions. The average linkage disequilibrium (LD) between adjacent markers ranged from 0.121 to 0.215. Higher marker density and larger training population closely related with the test population are likely to improve the prediction accuracy
Probabilistic classification of anti-SARS-CoV-2 antibody responses improves seroprevalence estimates.
OBJECTIVES: Population-level measures of seropositivity are critical for understanding the epidemiology of an emerging pathogen, yet most antibody tests apply a strict cutoff for seropositivity that is not learnt in a data-driven manner, leading to uncertainty when classifying low-titer responses. To improve upon this, we evaluated cutoff-independent methods for their ability to assign likelihood of SARS-CoV-2 seropositivity to individual samples. METHODS: Using robust ELISAs based on SARS-CoV-2 spike (S) and the receptor-binding domain (RBD), we profiled antibody responses in a group of SARS-CoV-2 PCR+ individuals (n = 138). Using these data, we trained probabilistic learners to assign likelihood of seropositivity to test samples of unknown serostatus (n = 5100), identifying a support vector machines-linear discriminant analysis learner (SVM-LDA) suited for this purpose. RESULTS: In the training data from confirmed ancestral SARS-CoV-2 infections, 99% of participants had detectable anti-S and -RBD IgG in the circulation, with titers differing > 1000-fold between persons. In data of otherwise healthy individuals, 7.2% (n = 367) of samples were of uncertain serostatus, with values in the range of 3-6SD from the mean of pre-pandemic negative controls (n = 595). In contrast, SVM-LDA classified 6.4% (n = 328) of test samples as having a high likelihood (> 99% chance) of past infection, 4.5% (n = 230) to have a 50-99% likelihood, and 4.0% (n = 203) to have a 10-49% likelihood. As different probabilistic approaches were more consistent with each other than conventional SD-based methods, such tools allow for more statistically-sound seropositivity estimates in large cohorts. CONCLUSION: Probabilistic antibody testing frameworks can improve seropositivity estimates in populations with large titer variability
Semimartingale decomposition of convex functions of continuous semimartingales by Brownian perturbation
In this note we prove that the local martingale part of a convex function f
of a d-dimensional semimartingale X = M + A can be written in terms of an Itô
stochastic integral ∫H(X)dM, where H(x) is some particular measurable choice
of subgradient of f at x, and M is the martingale part of X. This result was
first proved by Bouleau in [N. Bouleau, C. R. Acad. Sci. Paris Sér. I Math.
292 (1981) 87–90]. Here we present a new treatment of the problem. We first
prove the result for X' = X + ϵB, ϵ > 0, where B is a standard Brownian
motion, and then pass to the limit as ϵ → 0, using results in [M.T. Barlow
and P. Protter, On convergence of semimartingales. In Séminaire de
Probabilités, XXIV, 1988/89, Lect. Notes Math., vol. 1426. Springer, Berlin
(1990) 188–193; E. Carlen and P. Protter, Illinois J. Math. 36 (1992)
420–427]. The former paper concerns convergence of semimartingale
decompositions of semimartingales, while the latter studies a special case of
converging convex functions of semimartingales.
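For orientation, the decomposition described in this abstract can be written out as a short sketch. The symbol V for the finite-variation remainder is a name assumed here, not taken from the abstract, and the second display is only the classical Itô formula for the special case of a twice continuously differentiable f, where H can be taken to be the gradient of f.

```latex
% General convex f: the local martingale part of f(X) is the integral of
% a measurable subgradient selection H against M. V (assumed name) is the
% remaining finite-variation part of f(X).
f(X_t) = f(X_0) + \int_0^t H(X_s) \cdot dM_s + V_t .

% Special case f \in C^2 with H = \nabla f (classical It\^o formula):
V_t = \int_0^t \nabla f(X_s) \cdot dA_s
      + \frac{1}{2} \sum_{i,j=1}^{d} \int_0^t
        \partial_{ij} f(X_s) \, d[M^i, M^j]_s .
```

For a general convex f the second derivatives need not exist as functions, which is why the abstract characterises only the local martingale part explicitly.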