Fused kernel-spline smoothing for repeatedly measured outcomes in a generalized partially linear model with functional single index
We propose a generalized partially linear functional single index risk score
model for repeatedly measured outcomes where the index itself is a function of
time. We fuse the nonparametric kernel method and regression spline method, and
modify the generalized estimating equation to facilitate estimation and
inference. We use a local smoothing kernel to estimate the unspecified
coefficient functions of time, and use B-splines to estimate the unspecified
function of the single index component. The covariance structure is taken into
account via a working model, which provides valid estimation and inference
procedures whether or not it captures the true covariance. The estimation method
is applicable to both continuous and discrete outcomes. We derive large sample
properties of the estimation procedure and show a different convergence rate
for each component of the model. The asymptotic properties when the kernel and
regression spline methods are combined in a nested fashion have not been studied
prior to this work, even in the independent data case.
Comment: Published at http://dx.doi.org/10.1214/15-AOS1330 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
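The fused estimator couples a local kernel smoother (for the time-varying coefficient functions) with B-splines (for the unknown link of the single-index component). As a rough illustration of the spline half only, here is a minimal sketch assuming independent data, a known index direction, and arbitrary knot placement; the simulated data and the helper bspline_basis are hypothetical and this is not the paper's fused procedure.

```python
# Illustrative only: approximate an unknown link g(.) of a single index u = x'beta
# with a B-spline basis, treating beta as known. The kernel-smoothing step for the
# time-varying coefficients and the GEE machinery of the paper are omitted.
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(u, knots, degree=3):
    """Evaluate each B-spline basis function at u via unit coefficient vectors."""
    n_basis = len(knots) - degree - 1
    B = np.empty((len(u), n_basis))
    for j in range(n_basis):
        coefs = np.zeros(n_basis)
        coefs[j] = 1.0
        B[:, j] = BSpline(knots, coefs, degree)(u)
    return B

rng = np.random.default_rng(0)
n, beta = 500, np.array([0.6, 0.8])                     # ||beta|| = 1 for identifiability
X = rng.uniform(-1, 1, size=(n, 2))
u = X @ beta                                            # single-index values
y = np.sin(np.pi * u) + 0.2 * rng.standard_normal(n)    # true link g(u) = sin(pi*u)

interior = np.quantile(u, np.linspace(0.1, 0.9, 5))
knots = np.concatenate([[u.min()] * 4, interior, [u.max()] * 4])  # clamped cubic knots
B = bspline_basis(u, knots)
gamma, *_ = np.linalg.lstsq(B, y, rcond=None)           # least-squares spline coefficients

grid = np.linspace(u.min(), u.max(), 100)
g_hat = bspline_basis(grid, knots) @ gamma              # estimated link on a grid
```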
Transcription activity hot spot, is it real or an artifact?
Transcription activity 'hot spots', defined as chromosome regions that contain more expression quantitative trait loci than would be expected by chance, have been frequently detected both in humans and in model organisms. It has been common to consider the existence of hot spots as evidence for master regulation of gene expression. However, hot spots could also simply be due to highly correlated gene expressions or linkage disequilibrium and may not truly represent master regulators. A recent simulation study using real human gene expression data but simulated random single-nucleotide polymorphism genotypes showed patterns of clustering of expression quantitative trait loci that resemble those in actual studies [Perez-Enciso, Genetics 2004, 166:547-554]. In this study, to assess the credibility of transcription activity hot spots, we conducted genetic analyses on gene expression data provided by Genetic Analysis Workshop 15 Problem 1.
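To make the "more eQTLs than expected by chance" definition concrete, here is a minimal sketch of one common sanity check: bin the genome, count eQTL hits per bin, and compare the fullest bin against a permutation null in which hit positions are scattered uniformly. The genome length, bin width, and hit positions are all hypothetical; this is not the workshop analysis.

```python
# Hypothetical illustration: is the fullest genomic bin "hotter" than chance?
import numpy as np

rng = np.random.default_rng(1)
genome_length, n_bins = 1_000, 50                    # arbitrary units
hits = np.concatenate([rng.integers(0, genome_length, 180),
                       rng.integers(400, 420, 20)])  # eQTL positions with a planted cluster

observed_max = np.histogram(hits, bins=n_bins, range=(0, genome_length))[0].max()

# Permutation null: the same number of hits scattered uniformly over the genome.
null_max = np.array([
    np.histogram(rng.integers(0, genome_length, len(hits)),
                 bins=n_bins, range=(0, genome_length))[0].max()
    for _ in range(2000)
])
p_value = (1 + np.sum(null_max >= observed_max)) / (1 + len(null_max))
print(f"max bin count = {observed_max}, permutation p = {p_value:.4f}")
```

Note that the abstract's caveat applies directly to this kind of check: if many correlated transcripts map to the same locus, or if linkage disequilibrium clusters the hits, a uniform-scatter null overstates how surprising the fullest bin is.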
Analysis of genome-wide association data by large-scale Bayesian logistic regression
Single-locus analysis is often used to analyze genome-wide association (GWA) data, but such analysis is subject to severe multiple-comparisons adjustment. Multivariate logistic regression has been proposed to fit a multi-locus model for case-control data. However, when the sample size is much smaller than the number of single-nucleotide polymorphisms (SNPs) or when correlation among SNPs is high, traditional multivariate logistic regression breaks down. To accommodate the scale of data from a GWA study while controlling for collinearity and overfitting in a high-dimensional predictor space, we propose a variable selection procedure using Bayesian logistic regression. We explore a connection between Bayesian regression with certain priors and L1- and L2-penalized logistic regression. After analyzing a large number of SNPs simultaneously in a Bayesian regression, we select important SNPs for further consideration. With far fewer SNPs of interest, the problems of multiple comparisons and collinearity are less severe. We conducted simulation studies to examine the probability of correctly selecting disease-contributing SNPs and applied the developed methods to analyze the Genetic Analysis Workshop 16 North American Rheumatoid Arthritis Consortium data.
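The connection mentioned above is the standard one: a Laplace prior on the coefficients corresponds to L1 (lasso) penalized logistic regression, and a Gaussian prior corresponds to L2 (ridge). As a hedged illustration of SNP screening in that spirit (not the workshop analysis itself), the sketch below fits an L1-penalized logistic regression to simulated genotype data and keeps the SNPs with nonzero coefficients.

```python
# Illustrative only: lasso-penalized logistic regression as a stand-in for a
# Laplace-prior Bayesian logistic regression; the data and tuning are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, p = 400, 2000                                      # n << p, as in GWA data
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)   # SNP minor-allele counts
beta = np.zeros(p)
beta[:5] = 0.8                                        # five truly associated SNPs
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta - 1.5))))   # case-control status

# L1 penalty ~ Laplace prior; smaller C means stronger shrinkage (tighter prior).
fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.05, max_iter=5000)
fit.fit(X, y)
selected = np.flatnonzero(fit.coef_.ravel())
print("SNPs retained for follow-up:", selected)
```

In practice the penalty strength would be chosen by cross-validation, and the retained SNPs could then be refit in an ordinary multivariate model, where the multiple-comparison and collinearity problems are much milder, as the abstract describes.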
Armitage's trend test for genome-wide association analysis: one-sided or two-sided?
The importance of considering confounding due to population stratification in genome-wide association analysis using case-control designs has been a source of debate. Armitage's trend test, together with some other methods developed from it, can correct for population stratification to some extent. However, there is a question of whether the one-sided or the two-sided alternative hypothesis is appropriate, or, to put it another way, whether examining both the one-sided and the two-sided alternative hypotheses can give more information. The dataset for Problem 1 of Genetic Analysis Workshop 16 provides us with a chance to address this question. Because it is part of a combined sample from the North American Rheumatoid Arthritis Consortium (NARAC) and the Swedish Epidemiological Investigation of Rheumatoid Arthritis (EIRA), the results from the combined sample can be used as a reference. To this end, the last 10,000 single-nucleotide polymorphisms (SNPs) on chromosome 9, which contain the common genetic variant at the TRAF1-C5 locus, were examined by conducting Armitage's trend tests. Examining the two-sided alternative hypothesis shows that SNPs rs12380341 (p = 9.7 × 10^-11) and rs872863 (p = 1.7 × 10^-15), along with six SNPs across the TRAF1-C5 locus, rs1953126, rs10985073, rs881375, rs3761847, rs10760130, and rs2900180 (p ≈ 1 × 10^-7), are significantly associated with anti-cyclic citrullinated peptide-positive rheumatoid arthritis. But examining the one-sided alternative hypothesis that the minor allele is positively associated with the disease shows that only those six SNPs across the TRAF1-C5 locus are significantly associated with the disease (p ≈ 1 × 10^-8), which is consistent with the results from the combined sample of the NARAC and the EIRA.
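For readers who want the one-sided versus two-sided distinction made explicit, here is a minimal sketch of Armitage's (Cochran-Armitage) trend test with additive genotype scores 0/1/2. It uses the identity that the trend chi-square equals N times the squared correlation between case status and minor-allele count, so the signed square root yields both one- and two-sided p-values. The genotype data are simulated, not from the NARAC/EIRA samples.

```python
# Cochran-Armitage trend test with scores (0, 1, 2): the chi-square statistic
# equals N * corr(case status, allele count)^2, so we work with the signed z.
import numpy as np
from scipy.stats import norm

def armitage_trend(geno, case):
    """geno: minor-allele counts in {0, 1, 2}; case: 0/1 disease status."""
    geno, case = np.asarray(geno, float), np.asarray(case, float)
    z = np.sqrt(len(case)) * np.corrcoef(case, geno)[0, 1]
    return {
        "z": z,
        "two_sided_p": 2 * norm.sf(abs(z)),
        # one-sided alternative: the minor allele is positively associated with disease
        "one_sided_p": norm.sf(z),
    }

rng = np.random.default_rng(3)
geno = rng.binomial(2, 0.25, size=2000)
risk = 1 / (1 + np.exp(-(0.4 * geno - 1.0)))   # simulated risk rises with allele count
case = rng.binomial(1, risk)
print(armitage_trend(geno, case))
```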
Support vector machine for dynamic survival prediction with time-dependent covariates
Predicting time-to-event outcomes using time-dependent covariates is a
challenging problem. Many machine learning approaches, such as tree-based
methods and support vector regression, predominantly utilize only baseline
covariates. Only a few methods can incorporate time-dependent covariates,
but they often lack theoretical justification. In this paper we present a new
framework for event time prediction, leveraging support vector machines
to forecast the associated counting processes. Utilizing the kernel trick, we
accommodate nonlinear functions in both time and covariate spaces. Subsequently,
we use a chain algorithm to predict future events. Theoretical analysis
proves that our method is equivalent to comparing time-varying hazard
rates among at-risk subjects, and we obtain the convergence rate of the resulting
prediction loss. Through simulation studies and a case study on Huntington's
disease, we demonstrate the superior performance of our approach
compared to alternative methods based on machine learning, deep learning,
and statistical models.
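As a loose illustration of the "classify at-risk subjects, then chain" idea, the sketch below is a simplified discrete-time analogue, not the authors' algorithm: one SVM is fit per time-grid point on the subjects still at risk, and the predicted interval event probabilities are chained into a survival curve. Time-dependent covariates are reduced to static values here, and all data and names are hypothetical.

```python
# Simplified discrete-time sketch: fit one SVM per grid time on at-risk subjects,
# then chain the predicted interval event probabilities into a survival curve.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
n, grid, width = 300, np.array([0.5, 1.0, 1.5, 2.0]), 0.5
X = rng.normal(size=(n, 3))                   # covariates (static here; in practice,
                                              # last observed values at each grid time)
event_time = rng.exponential(scale=np.exp(-X[:, 0]))   # hazard increases with X[:, 0]

models = []
for t in grid:
    at_risk = event_time >= t                               # still at risk just before t
    label = (event_time[at_risk] < t + width).astype(int)   # event within [t, t + width)
    models.append(SVC(kernel="rbf", probability=True).fit(X[at_risk], label))

def chained_survival(x_new):
    """Chain per-interval event probabilities into S(t) = prod_j (1 - h_j)."""
    surv, curve = 1.0, []
    for clf in models:
        h = clf.predict_proba(x_new.reshape(1, -1))[0, 1]    # predicted interval hazard
        surv *= 1.0 - h
        curve.append(surv)
    return np.array(curve)

print(chained_survival(np.array([-1.0, 0.0, 0.0])))          # a low-risk covariate profile
```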
Fusing Individualized Treatment Rules Using Secondary Outcomes
An individualized treatment rule (ITR) is a decision rule that recommends
treatments for patients based on their individual feature variables. In many
practical settings, the ideal ITR for the primary outcome is also expected to cause
minimal harm to other secondary outcomes. Therefore, our objective is to learn
an ITR that not only maximizes the value function for the primary outcome, but
also approximates the optimal rule for the secondary outcomes as closely as
possible. To achieve this goal, we introduce a fusion penalty to encourage the
ITRs based on different outcomes to yield similar recommendations. Two
algorithms are proposed to estimate the ITR using surrogate loss functions. We
prove that the agreement rate between the estimated ITR of the primary outcome
and the optimal ITRs of the secondary outcomes converges to the true agreement
rate faster than if the secondary outcomes are not taken into consideration.
Furthermore, we derive the non-asymptotic properties of the value function and
misclassification rate for the proposed method. Finally, simulation studies and
a real data example are used to demonstrate the finite-sample performance of
the proposed method.
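To make the fusion idea concrete, here is a minimal linear sketch under assumed simplifications (an outcome-weighted logistic surrogate loss, randomized treatments, nonnegativity by shifting the rewards, and a squared-difference fusion penalty); it is not the paper's algorithm. Two decision rules, one per outcome, are fit jointly with a penalty that pulls their coefficients together, so the primary-outcome rule is nudged toward agreeing with the secondary-outcome rule.

```python
# Minimal sketch of fusion-penalized ITR learning with a logistic surrogate loss.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n, p, lam = 400, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])    # intercept + features
A = rng.choice([-1, 1], size=n)                                    # randomized treatment
R1 = 1 + A * (X @ np.array([0.2, 1.0, -0.5])) + rng.normal(0, 0.5, n)   # primary outcome
R2 = 1 + A * (X @ np.array([0.2, 0.8, -0.3])) + rng.normal(0, 0.5, n)   # secondary outcome
w1, w2 = R1 - R1.min(), R2 - R2.min()        # nonnegative outcome weights (simplification)

def objective(theta):
    b1, b2 = theta[:p], theta[p:]
    loss1 = np.mean(w1 * np.logaddexp(0.0, -A * (X @ b1)))   # surrogate loss, primary
    loss2 = np.mean(w2 * np.logaddexp(0.0, -A * (X @ b2)))   # surrogate loss, secondary
    return loss1 + loss2 + lam * np.sum((b1 - b2) ** 2)      # fusion penalty ties the rules

theta_hat = minimize(objective, np.zeros(2 * p), method="BFGS").x
beta_primary, beta_secondary = theta_hat[:p], theta_hat[p:]
agreement = np.mean(np.sign(X @ beta_primary) == np.sign(X @ beta_secondary))
print("agreement rate between the two estimated rules:", agreement)
```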
Support Vector Hazards Machine: A Counting Process Framework for Learning Risk Scores for Censored Outcomes
Learning risk scores to predict dichotomous or continuous outcomes using machine learning approaches has been studied extensively. However, how to learn risk scores for time-to-event outcomes subject to right censoring has received little attention until recently. Existing approaches rely on inverse probability weighting or rank-based regression, which may be inefficient. In this paper, we develop a new support vector hazards machine (SVHM) approach to predict censored outcomes. Our method is based on predicting the counting process associated with the time-to-event outcomes among subjects at risk via a series of support vector machines. Introducing counting processes to represent time-to-event data leads to a connection between support vector machines in supervised learning and hazards regression in standard survival analysis. To account for the different at-risk populations at the observed event times, a time-varying offset is used in estimating risk scores. The resulting optimization is a convex quadratic programming problem that can easily incorporate non-linearity using the kernel trick. We demonstrate an interesting link from the profiled empirical risk function of SVHM to the Cox partial likelihood. We then formally show that SVHM is optimal in discriminating the covariate-specific hazard function from the population-average hazard function, and we establish the consistency and learning rate of the predicted risk using the estimated risk scores. Simulation studies show improved prediction accuracy of the event times using SVHM compared to existing machine learning methods and conventional approaches. Finally, we analyze data from two real-world biomedical studies in which we use clinical markers and neuroimaging biomarkers to predict the age-at-onset of a disease, and we demonstrate the superiority of SVHM in distinguishing high-risk from low-risk subjects.
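The counting-process link can be seen in how the training data are laid out: at each observed event time, every subject still at risk contributes one row, labeled by whether it is the subject who fails then. The sketch below shows only that risk-set expansion followed by a plain SVM fit; the time-varying offset and the actual SVHM optimization from the paper are not reproduced, and the data are simulated.

```python
# Risk-set ("long format") expansion that links SVMs to hazards regression:
# one row per (at-risk subject, observed event time), labeled 1 for the failing subject.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
n = 100
X = rng.normal(size=(n, 2))                           # baseline covariates
time = rng.exponential(scale=np.exp(-0.7 * X[:, 0]))  # observed follow-up times
delta = rng.binomial(1, 0.8, size=n)                  # 1 = event, 0 = censored (simplified)

rows, labels = [], []
for j in np.flatnonzero(delta):                       # loop over observed event times
    t_j = time[j]
    for i in np.flatnonzero(time >= t_j):             # subjects still at risk at t_j
        rows.append(np.r_[X[i], t_j])                 # covariates plus the event time
        labels.append(1 if i == j else 0)

Z, y = np.array(rows), np.array(labels)
clf = SVC(kernel="rbf").fit(Z, y)                     # larger decision value ~ higher risk
print(clf.decision_function(Z[:5]))                   # relative risk scores for a few rows
```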
Joint Rare Variant Association Test of the Average and Individual Effects for Sequencing Studies
For many complex traits, single nucleotide polymorphisms (SNPs) identified from genome-wide association studies (GWAS) only explain a small percentage of heritability. Next-generation sequencing technology makes it possible to explore unexplained heritability by identifying rare variants (RVs). Existing tests designed for RVs look for optimal strategies to combine information across multiple variants. Many of the tests have good power when the true underlying associations are either in the same direction or in opposite directions. We propose three tests for examining the association between a phenotype and RVs, where two of them jointly consider the common association across RVs and the individual deviations from the common effect. On one hand, similar to some of the best existing methods, the individual deviations are modeled as random effects to borrow information across multiple RVs. On the other hand, unlike the existing methods, which pool individual effects towards zero, we pool them towards a possibly non-zero common effect by adding a pooled variant to the model. The common effect and the individual effects are jointly tested. We show through extensive simulations that at least one of the three tests proposed here is the most powerful or very close to being the most powerful in various settings of true models. This is appealing in practice because the direction and size of the true effects of the associated RVs are unknown. Researchers can apply the developed tests to improve power under a wide range of true models.
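For context on what is being combined, here is a minimal sketch of the two standard ingredients the abstract builds on: a burden-style statistic for a common (average) effect and a variance-component statistic for variant-specific deviations. It assumes a continuous phenotype, flat weights, and no covariates, and it does not reproduce the paper's joint tests or their null distributions (the variance-component statistic's null is a mixture of chi-squares, omitted here).

```python
# Two building blocks for rare-variant tests: a burden statistic (common effect) and a
# SKAT-style variance-component statistic (individual deviations). Illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, m = 1000, 20
G = rng.binomial(2, 0.01, size=(n, m)).astype(float)   # rare-variant genotypes
y = 0.5 * G[:, :5].sum(axis=1) + rng.normal(size=n)    # 5 causal variants, same direction
resid = y - y.mean()                                    # null-model residuals (no covariates)

# Burden test: regress the phenotype on the pooled (summed) genotype.
burden = G.sum(axis=1)
slope, _, r, burden_p, _ = stats.linregress(burden, y)
print(f"burden p-value: {burden_p:.3g}")

# Variance-component statistic for variant-specific deviations from the common effect.
Q = resid @ G @ G.T @ resid
print(f"variance-component statistic Q = {Q:.1f} (null is a chi-square mixture)")
```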