Detecting mutations in mixed sample sequencing data using empirical Bayes
We develop statistically based methods to detect single nucleotide DNA
mutations in next generation sequencing data. Sequencing generates counts of
the number of times each base was observed at hundreds of thousands to billions
of genome positions in each sample. Using these counts to detect mutations is
challenging because mutations may have very low prevalence and sequencing error
rates vary dramatically by genome position. The discreteness of sequencing data
also creates a difficult multiple testing problem: current false discovery rate
methods are designed for continuous data, and work poorly, if at all, on
discrete data. We show that a simple randomization technique lets us use
continuous false discovery rate methods on discrete data. Our approach is a
useful way to estimate false discovery rates for any collection of discrete
test statistics, and is hence not limited to sequencing data. We then use an
empirical Bayes model to capture different sources of variation in sequencing
error rates. The resulting method outperforms existing detection approaches on
example data sets.
Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/),
http://dx.doi.org/10.1214/12-AOAS538, by the Institute of
Mathematical Statistics (http://www.imstat.org
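The abstract above does not spell out its randomization technique. One common way to make discrete test statistics usable with continuous false discovery rate methods is the randomized p-value, which is exactly uniform under the null. The following is a minimal sketch under that assumption, not the authors' method: the binomial error model, the function names, and the use of the Benjamini-Hochberg procedure are all illustrative choices.

```python
import math
import random


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def randomized_pvalue(k, n, p, rng):
    """Upper-tail randomized p-value: P(X > k) + U * P(X = k), U ~ Uniform(0, 1).

    Assumed null model (hypothetical): k error reads out of n, error rate p.
    """
    tail = sum(binom_pmf(j, n, p) for j in range(k + 1, n + 1))
    return tail + rng.random() * binom_pmf(k, n, p)


def benjamini_hochberg(pvalues, alpha):
    """Indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank  # largest rank whose p-value passes its threshold
    return sorted(order[:cutoff])
```

In use, one would compute a randomized p-value per genome position from its non-reference count and depth, then pass the whole list to `benjamini_hochberg`.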
Summon, EBSCO Discovery Service, and Google Scholar: Comparing Search Performance Using User Queries
When the NCSU Libraries initially subscribed to the Summon Discovery Service in 2009, there were few other competitors on the market and none offered an API that could be used to populate the “Articles” portion of our QuickSearch application (http://search.lib.ncsu.edu/). Since then, EBSCO Discovery Service (EDS) has emerged as a viable competitor. Using a random sample of actual user searches and bootstrap randomization tests (also referred to as permutation tests), the NCSU Libraries' Web-Scale Discovery Product Team conducted a study to compare the search performance of Summon, EDS, and Google Scholar.
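The abstract above gives no details of the study's test procedure. As a rough illustration of a permutation test for comparing two search systems, the sketch below shuffles per-query scores between two groups and checks how often the shuffled difference in means is at least as large as the observed one. The score lists, the two-sample design, and the function name are assumptions made for illustration only.

```python
import random


def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided permutation p-value for a difference in mean scores.

    a, b: hypothetical per-query relevance scores for two search systems.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign scores to the two systems at random
        pa, pb = pooled[: len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

If the same queries are run against every system, a paired variant (randomly flipping the sign of each per-query difference) would usually be the more appropriate design.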
Multiple Hypothesis Testing in Pattern Discovery
The problem of multiple hypothesis testing arises when more than
one hypothesis is tested simultaneously for statistical significance. This
is a very common situation in many data mining applications. For instance,
assessing simultaneously the significance of all frequent itemsets of a single
dataset entails a host of hypotheses, one for each itemset. A multiple
hypothesis testing method is needed to control the number of false positives
(Type I error). Our contribution in this paper is to extend the multiple
hypothesis framework to be used with a generic data mining algorithm. We
provide a method that provably controls the family-wise error rate (FWER, the
probability of at least one false positive) in the strong sense. We evaluate
the performance of our solution on both real and generated data. The results
show that our method controls the FWER while maintaining the power of the test.
Comment: 28 page
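To make the FWER guarantee mentioned above concrete, here is a minimal sketch of Holm's step-down procedure, a standard method that controls the family-wise error rate in the strong sense. It is a generic textbook procedure shown for illustration, not the method proposed in the paper, and the function name is hypothetical.

```python
def holm(pvalues, alpha):
    """Indices rejected by Holm's step-down procedure at FWER level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = []
    for step, i in enumerate(order):
        # The (step + 1)-th smallest p-value is compared with alpha / (m - step).
        if pvalues[i] <= alpha / (m - step):
            rejected.append(i)
        else:
            break  # stop at the first failure; all larger p-values are retained
    return sorted(rejected)
```

In a pattern-mining setting, each entry of `pvalues` would correspond to one candidate pattern, such as one frequent itemset.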
Explicit probabilistic models for databases and networks
Recent work in data mining and related areas has highlighted the importance
of the statistical assessment of data mining results. Crucial to this endeavour
is the choice of a non-trivial null model for the data, to which the found
patterns can be contrasted. The most influential null models proposed so far
are defined in terms of invariants of the null distribution. Such null models
can be used by computation intensive randomization approaches in estimating the
statistical significance of data mining results.
Here, we introduce a methodology to construct non-trivial probabilistic
models based on the maximum entropy (MaxEnt) principle. We show how MaxEnt
models allow for the natural incorporation of prior information. Furthermore,
they satisfy a number of desirable properties of previously introduced
randomization approaches. Lastly, they also have the benefit that they can be
represented explicitly. We argue that our approach can be used for a variety of
data types. However, for concreteness, we have chosen to demonstrate it in
particular for databases and networks.
Comment: Submitte
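The abstract above does not state which constraints its MaxEnt models use. As an illustrative assumption, suppose the prior information is the expected row and column sums of a binary database. The maximum entropy distribution under such constraints factorizes into independent Bernoulli entries with P(x_ij = 1) = 1 / (1 + exp(-(lam_i + mu_j))). The sketch below fits the parameters by plain gradient ascent on the dual. The function name, step size, and iteration count are hypothetical choices suited to small examples.

```python
import math


def fit_maxent(row_sums, col_sums, step=0.5, iters=2000):
    """Fit entry probabilities whose expected margins match the given sums.

    Assumes sum(row_sums) == sum(col_sums) and margins strictly inside
    their feasible range, so finite parameters exist.
    """
    n, m = len(row_sums), len(col_sums)
    lam = [0.0] * n  # one parameter per row
    mu = [0.0] * m   # one parameter per column

    def probabilities():
        return [
            [1.0 / (1.0 + math.exp(-(lam[i] + mu[j]))) for j in range(m)]
            for i in range(n)
        ]

    for _ in range(iters):
        probs = probabilities()
        # Dual gradient: target margin minus the model's expected margin.
        for i in range(n):
            lam[i] += step * (row_sums[i] - sum(probs[i]))
        for j in range(m):
            mu[j] += step * (col_sums[j] - sum(probs[i][j] for i in range(n)))
    return probabilities()
```

Because the fitted model is explicit, the probability of any pattern can be computed directly from the returned matrix rather than estimated by repeated randomization of the data.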