330 research outputs found
Beyond Zipf's Law: The Lavalette Rank Function and its Properties
Although Zipf's law is widespread in natural and social data, one often
encounters situations where one or both ends of the ranked data deviate from
the power-law function. Previously we proposed the Beta rank function to
improve the fitting of data that do not follow a perfect Zipf's law. Here we
show that when the two parameters in the Beta rank function have the same
value, which yields the Lavalette rank function, the probability density
function can be derived analytically. We also show both computationally and
analytically that the Lavalette distribution is approximately equal, though not
identical, to the lognormal distribution. We illustrate the utility of the
Lavalette rank function in several datasets. We also address three analysis
issues: the statistical testing of the Lavalette fitting function, the
comparison between Zipf's law and the lognormal distribution through the
Lavalette function, and the comparison between the lognormal distribution and
the Lavalette distribution. Comment: 15 pages, 4 figures
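The relationship between the two rank functions can be sketched in a few lines of Python. The abstract does not state the formulas, so the parameterization below (Beta rank function C(N+1-r)^b / r^a, with the Lavalette case a = b) is an assumption, as are the function names and the synthetic data. The fit is a plain least-squares regression in log space, not necessarily the procedure used in the paper.

```python
import math

def beta_rank(r, n, c, a, b):
    """Assumed Beta rank function: C * (N + 1 - r)**b / r**a, rank r in 1..N."""
    return c * (n + 1 - r) ** b / r ** a

def lavalette_rank(r, n, c, b):
    """Lavalette rank function: the Beta rank function with a == b."""
    return beta_rank(r, n, c, b, b)

def fit_lavalette(values):
    """Least-squares fit of log(value) = log(C) + b * log((N + 1 - r) / r).

    Values must be positive. They are sorted in descending order so that
    rank 1 is the largest value. Returns the estimated (C, b).
    """
    ranked = sorted(values, reverse=True)
    n = len(ranked)
    xs = [math.log((n + 1 - r) / r) for r in range(1, n + 1)]
    ys = [math.log(v) for v in ranked]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    c = math.exp(y_mean - b * x_mean)
    return c, b

# Synthetic, noise-free data generated from the Lavalette function itself.
n_items = 50
data = [lavalette_rank(r, n_items, c=3.0, b=0.8) for r in range(1, n_items + 1)]
c_hat, b_hat = fit_lavalette(data)
print(c_hat, b_hat)
```

On noise-free data the regression recovers the generating parameters up to floating-point error; on real data the residuals at both ends of the ranking show how well the two-sided bend is captured.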
Correcting for cryptic relatedness by a regression-based genomic control method
Background: The genomic control (GC) method is a useful tool to correct for cryptic relatedness in population-based association studies. It was originally proposed for correcting the variance inflation of the Cochran-Armitage additive trend test by using information from unlinked null markers, and was later generalized to other tests with the additional requirement that the null markers are matched with the candidate marker in allele frequencies. However, matching allele frequencies limits the number of available null markers and thus limits the applicability of the GC method. On the other hand, errors in genotype/allele frequencies may cause further bias and variance inflation and thereby aggravate the effect of GC correction. Results: In this paper, we propose a regression-based GC method using null markers that are not necessarily matched in allele frequencies with the candidate marker. Variation of allele frequencies of the null markers is adjusted by a regression method. Conclusion: The proposed method can be readily applied to Cochran-Armitage trend tests other than the additive trend test, to Pearson's chi-square test and to other robust efficiency tests. Simulation results show that the proposed method is effective in controlling type I error in the presence of population substructure.
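For orientation, the classical genomic control correction that this abstract builds on can be sketched as follows. This is the standard median-based GC, not the regression-based method the paper proposes; the function names and the null-marker statistics are made up for illustration.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.454936

def inflation_factor(null_stats):
    """Estimate the variance inflation factor lambda from null-marker
    1-df chi-square statistics; values below 1 are truncated to 1."""
    lam = statistics.median(null_stats) / CHI2_1DF_MEDIAN
    return max(1.0, lam)

def gc_correct(candidate_stat, null_stats):
    """Divide the candidate marker's statistic by the inflation factor."""
    return candidate_stat / inflation_factor(null_stats)

# Hypothetical trend-test statistics at five unlinked null markers.
null_stats = [0.2, 0.6, 0.9, 1.4, 0.05]
lam = inflation_factor(null_stats)
corrected = gc_correct(6.0, null_stats)
print(lam, corrected)
```

The corrected statistic is then compared against the usual 1-df chi-square reference distribution.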
Statistical significance for hierarchical clustering in genetic association and microarray expression studies
BACKGROUND: With the increasing amount of data generated in molecular genetics laboratories, it is often difficult to make sense of results because of the vast number of different outcomes or variables studied. Examples include expression levels for large numbers of genes and haplotypes at large numbers of loci. It is then natural to group observations into smaller numbers of classes that allow for an easier overview and interpretation of the data. This grouping is often carried out in multiple steps with the aid of hierarchical cluster analysis, each step leading to a smaller number of classes by combining similar observations or classes. At each step, either implicitly or explicitly, researchers tend to interpret results and eventually focus on that set of classes providing the "best" (most significant) result. While this approach makes sense, the overall statistical significance of the experiment must include the clustering process, which modifies the grouping structure of the data and often removes variation. RESULTS: For hierarchically clustered data, we propose considering the strongest result or, equivalently, the smallest p-value as the experiment-wise statistic of interest and evaluating its significance level for a global assessment of statistical significance. We apply our approach to datasets from haplotype association and microarray expression studies where hierarchical clustering has been used. CONCLUSION: In all of the cases we examine, we find that relying on one set of classes in the course of clustering leads to significance levels that are too small when compared with the significance level associated with an overall statistic that incorporates the process of clustering. In other words, relying on one step of clustering may furnish a formally significant result while the overall experiment is not significant
Partial correlation analysis indicates causal relationships between GC-content, exon density and recombination rate in the human genome
Background: Several features are known to correlate with the GC-content
in the human genome, including recombination rate, gene density and distance to
telomere. However, by testing for pairwise correlation only, it is impossible
to distinguish direct associations from indirect ones and to distinguish
between causes and effects. Results: We use partial correlations to
construct partially directed graphs for the following four variables:
GC-content, recombination rate, exon density and distance-to-telomere.
Recombination rate and exon density are unconditionally uncorrelated, but
become inversely correlated by conditioning on GC-content. This pattern
indicates a model where recombination rate and exon density are two independent
causes of GC-content variation. Conclusions: Causal inference and
graphical models are useful methods to understand genome evolution and the
mechanisms of isochore evolution in the human genome
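The pattern described above (two variables that are unconditionally uncorrelated but become inversely correlated once their common effect is conditioned on) can be reproduced on simulated data. The sketch below uses the standard first-order partial correlation formula; the variable names, coefficients and sample size are illustrative assumptions and are not taken from the genome data.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y given z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(0)
n = 5000
# Two independent "causes" and one common "effect" (a collider).
cause_a = [rng.gauss(0, 1) for _ in range(n)]
cause_b = [rng.gauss(0, 1) for _ in range(n)]
effect = [a + b + 0.5 * rng.gauss(0, 1) for a, b in zip(cause_a, cause_b)]

r_ab = pearson(cause_a, cause_b)
r_ab_given_effect = partial_corr(
    r_ab, pearson(cause_a, effect), pearson(cause_b, effect)
)
print(r_ab, r_ab_given_effect)
```

The marginal correlation of the two causes is close to zero, while the partial correlation given the effect is clearly negative, which is the signature the abstract interprets as two independent causes of GC-content variation.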
Effective Sample Size: Quick Estimation of the Effect of Related Samples in Genetic Case-Control Association Analyses
Correlated samples have been frequently avoided in case-control genetic
association studies in part because the methods for handling them are either not
easily implemented or not widely known. We
advocate one method for case-control association analysis of correlated
samples -- the effective sample size method -- as a simple and
accessible approach that does not require specialized computer programs.
The effective sample size method captures the variance inflation
of allele frequency estimation exactly, and can be used to modify the
chi-square test statistic, p-value, and 95% confidence interval of
odds-ratio simply by replacing the apparent number of allele counts with the
effective ones. For genotype frequency estimation, although a single
effective sample size is unable to completely characterize the variance inflation,
an averaged one can satisfactorily approximate the simulated result.
The effective sample size method is applied to the rheumatoid arthritis
siblings data collected from the North American Rheumatoid Arthritis Consortium (NARAC)
to establish a significant association with the interferon-induced
helicase 1 gene (IFIH1), previously identified as a type 1 diabetes
susceptibility locus. Connections between the effective sample size
method and other methods, such as generalized estimating equations,
variance of eigenvalues for correlation matrices, and genomic controls,
are also discussed.
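A minimal sketch of the idea, assuming the textbook result that the mean of n equal-variance observations with pairwise correlations rho_ij has the same variance as the mean of n^2 / sum_ij(rho_ij) independent observations. The abstract does not give the paper's exact formula, so this expression, the function names and the sibling-pair correlation of 0.5 are assumptions for illustration.

```python
def effective_sample_size(corr):
    """n_eff = n**2 / (sum of all entries of the n x n correlation matrix)."""
    n = len(corr)
    total = sum(sum(row) for row in corr)
    return n * n / total

def sib_pair_corr(n_pairs, rho):
    """Block-diagonal correlation matrix for independent pairs of relatives
    whose within-pair correlation is rho."""
    n = 2 * n_pairs
    corr = [[0.0] * n for _ in range(n)]
    for i in range(n):
        corr[i][i] = 1.0
    for p in range(n_pairs):
        corr[2 * p][2 * p + 1] = rho
        corr[2 * p + 1][2 * p] = rho
    return corr

# 100 hypothetical sibling pairs with an assumed within-pair correlation of 0.5.
n_eff = effective_sample_size(sib_pair_corr(100, 0.5))
print(n_eff)  # 200 / 1.5, about 133.3 effective individuals out of 200
```

Under this assumption, the effective count replaces the apparent count when computing the chi-square statistic, p-value and odds-ratio confidence interval, as the abstract describes.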

Likelihood ratio tests in random graph models with increasing dimensions
We explore the Wilks phenomena in two random graph models: the $\beta$-model
and the Bradley-Terry model. For two increasing dimensional null hypotheses,
including a specified null $H_0: \beta_i = \beta_i^0$ for $i = 1, \ldots, r$ and a
homogenous null $H_0: \beta_1 = \cdots = \beta_r$, we reveal high dimensional
Wilks' phenomena that the normalized log-likelihood ratio statistic,
$[2\{\ell(\widehat{\beta}) - \ell(\widehat{\beta}^0)\} - r]/(2r)^{1/2}$, converges in distribution to the standard normal distribution
as $r$ goes to infinity. Here, $\ell(\beta)$ is the log-likelihood
function on the model parameter $\beta = (\beta_1, \ldots, \beta_n)^\top$, $\widehat{\beta}$ is its maximum likelihood estimator
(MLE) under the full parameter space, and $\widehat{\beta}^0$ is the
restricted MLE under the null parameter space. For the homogenous null with a
fixed $r$, we establish Wilks-type theorems that $2\{\ell(\widehat{\beta}) - \ell(\widehat{\beta}^0)\}$
converges in distribution to a chi-square distribution with $r-1$ degrees of
freedom, as the total number of parameters, $n$, goes to infinity. When testing
the fixed dimensional specified null, we find that its asymptotic null
distribution is a chi-square distribution in the $\beta$-model. However,
unexpectedly, this is not true in the Bradley-Terry model. By developing
several novel technical methods for asymptotic expansion, we explore Wilks type
results in a principled manner; these principled methods should be applicable
to a class of random graph models beyond the $\beta$-model and the
Bradley-Terry model. Simulation studies and real network data applications
further demonstrate the theoretical results. Comment: This paper supersedes arxiv article arXiv:2211.10055 titled "Wilks'
theorems in the $\beta$-model" by T. Yan, Y. Zhang, J. Xu, Y. Yang and J. Zh
The spatiotemporal response of soil moisture to precipitation and temperature changes in an arid region, China
Soil moisture plays a crucial role in the hydrological cycle and climate system. The reliable estimation of soil moisture in space and time is important to monitor and even predict hydrological and meteorological disasters. Here we studied the spatiotemporal variations of soil moisture and explored the effects of precipitation and temperature on soil moisture in different land cover types within the Tarim River Basin from 2001 to 2015, based on high-spatial-resolution soil moisture data downscaled from the European Space Agency's (ESA) Climate Change Initiative (CCI) soil moisture data. The results show that the spatial average soil moisture increased slightly from 2001 to 2015, and the soil moisture variation in summer contributed most to regional soil moisture change. Among land cover types, the highest soil moisture occurred in forest and the lowest in bare land, and soil moisture showed significant increasing trends in grassland and bare land during 2001-2015. Both partial correlation analysis and multiple linear regression analysis demonstrate that in the study area precipitation had positive effects on soil moisture, while temperature had negative effects, and precipitation made greater contributions to soil moisture variations than temperature. The results of this study can be used for decision making for water management and allocation.
Carbonaceous material fractions in sediments and their effect on the sorption and persistence of organic pollutants in small urban watersheds
U.S. Department of the Interior; U.S. Geological Survey; Ope