Binary data are used across a broad range of the biological sciences. Using binary
presence-absence data, we can evaluate species co-occurrences that help
elucidate relationships among organisms and environments. To summarize
similarity between occurrences of species, we routinely use the
Jaccard/Tanimoto coefficient, which is the ratio of their intersection to their
union. It is natural, then, to identify statistically significant
Jaccard/Tanimoto coefficients, which suggest non-random co-occurrences of
species. However, statistical hypothesis testing using this similarity
coefficient has seldom been used or studied.
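
As a minimal illustration of the coefficient described above (intersection over union of two presence-absence vectors), the following Python sketch may help. The function name `jaccard`, the example vectors, and the convention of returning 0 when neither species occurs anywhere are assumptions made for this illustration; they are not taken from the jaccard R package.

```python
def jaccard(x, y):
    """Jaccard/Tanimoto coefficient of two equal-length 0/1 sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Sites where both species are present (intersection).
    both = sum(1 for a, b in zip(x, y) if a and b)
    # Sites where at least one species is present (union).
    either = sum(1 for a, b in zip(x, y) if a or b)
    # Assumed convention: return 0.0 when neither species occurs anywhere.
    return both / either if either else 0.0


# Hypothetical presence-absence records for two species across eight sites.
species_a = [1, 1, 0, 1, 0, 0, 1, 0]
species_b = [1, 0, 0, 1, 1, 0, 1, 0]
print(jaccard(species_a, species_b))  # 3 shared sites / 5 occupied sites
```
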
We introduce a hypothesis test of similarity for biological presence-absence
data, based on the Jaccard/Tanimoto coefficient. Several key improvements are
presented, including unbiased estimation of the expectation and centered
Jaccard/Tanimoto coefficients that account for occurrence probabilities. We
derive exact and asymptotic solutions and develop bootstrap and
measurement concentration algorithms to compute the statistical significance of
binary similarity. Comprehensive simulation studies demonstrate that our
proposed methods produce accurate p-values and false discovery rates. The
proposed estimation methods are orders of magnitude faster than the exact
solution. The proposed methods are implemented in an open-source R package
called jaccard (https://cran.r-project.org/package=jaccard).
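
The idea of centering the coefficient and assessing its significance by resampling can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's implementation: the plug-in expectation `px * py / (px + py - px * py)` under independence is assumed here, a simple permutation null stands in for the bootstrap procedure, and all function names are hypothetical.

```python
import random


def jaccard(x, y):
    """Jaccard/Tanimoto coefficient of two equal-length 0/1 sequences."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0


def expected_jaccard(px, py):
    """Assumed plug-in expectation under independent occurrence."""
    denom = px + py - px * py
    return (px * py) / denom if denom else 0.0


def centered_jaccard(x, y):
    """Observed coefficient minus its assumed expectation under independence."""
    n = len(x)
    px, py = sum(x) / n, sum(y) / n  # empirical occurrence probabilities
    return jaccard(x, y) - expected_jaccard(px, py)


def resampling_pvalue(x, y, n_resamples=2000, seed=0):
    """Two-sided p-value from a permutation null (a stand-in for the bootstrap)."""
    rng = random.Random(seed)
    observed = abs(centered_jaccard(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(y_shuffled)  # breaks any association between x and y
        if abs(centered_jaccard(x, y_shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical data: two species occupying exactly the same 10 of 40 sites.
x = [1] * 10 + [0] * 30
y = list(x)
p_value = resampling_pvalue(x, y)
print(centered_jaccard(x, y), p_value)
```

Shuffling one vector keeps both occurrence frequencies fixed while removing any association, which is why it serves as a simple null here; the exact, asymptotic, and measurement concentration approaches named in the text are not reproduced in this sketch.
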
We introduce a suite of statistical methods for the Jaccard/Tanimoto
similarity coefficient that enable straightforward incorporation of
probabilistic measures into the analysis of species co-occurrences. Due to their
generality, the proposed methods and implementations are applicable to a wide
range of binary data arising from genomics, biochemistry, and other areas of
science.