    Evaluation of Jackknife and Bootstrap for Defining Confidence Intervals for Pairwise Agreement Measures

    Several research fields frequently deal with the analysis of diverse classification results of the same entities. This should imply an objective detection of overlaps and divergences between the formed clusters. The congruence between classifications can be quantified by clustering agreement measures, including pairwise agreement measures. Several measures have been proposed, and the importance of obtaining confidence intervals for the point estimate when comparing these measures has been highlighted. A broad range of methods can be used to estimate confidence intervals. However, evidence is lacking about which methods are appropriate for calculating confidence intervals for most clustering agreement measures. Here we evaluate the resampling techniques of bootstrap and jackknife for the calculation of confidence intervals for clustering agreement measures. Contrary to what has been shown for some statistics, simulations showed that the jackknife performs better than the bootstrap at accurately estimating confidence intervals for pairwise agreement measures, especially when the agreement between partitions is low. The coverage of the jackknife confidence interval is robust to changes in cluster number and cluster size distribution.
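
To make the jackknife idea in the abstract above concrete, here is a minimal sketch, not the authors' implementation. It assumes the plain Rand index as the pairwise agreement measure and a normal-approximation interval built from jackknife pseudo-values; the abstract names neither choice, and the function names `rand_index` and `jackknife_ci` are hypothetical.

```python
import math
from itertools import combinations


def rand_index(a, b):
    """Fraction of item pairs on which two partitions agree.

    A pair agrees when both partitions put the two items together,
    or both put them apart. Labels are given as equal-length lists.
    """
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def jackknife_ci(a, b, stat=rand_index, z=1.96):
    """Leave-one-out jackknife interval for an agreement statistic.

    Assumption: a normal approximation (z = 1.96 for roughly 95%)
    applied to the pseudo-values; needs at least 3 items.
    """
    n = len(a)
    full = stat(a, b)
    # Recompute the statistic with each item removed in turn.
    loo = [stat(a[:i] + a[i + 1:], b[:i] + b[i + 1:]) for i in range(n)]
    # Pseudo-values combine the full estimate with each leave-one-out estimate.
    pseudo = [n * full - (n - 1) * v for v in loo]
    mean = sum(pseudo) / n
    var = sum((p - mean) ** 2 for p in pseudo) / (n - 1)
    se = math.sqrt(var / n)
    return mean - z * se, mean + z * se


if __name__ == "__main__":
    typing_a = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
    typing_b = [0, 0, 1, 1, 1, 1, 2, 2, 0, 2]
    print(rand_index(typing_a, typing_b))
    print(jackknife_ci(typing_a, typing_b))
```

A bootstrap counterpart would resample items with replacement instead of deleting one at a time; the jackknife is shown here because it is deterministic and easy to check by hand.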

    Non-parametric class completeness estimators for collaborative knowledge graphs — the case of wikidata

    Collaborative Knowledge Graph platforms allow humans and automated scripts to collaborate in creating, updating and interlinking entities and facts. To ensure both the completeness of the data as well as a uniform coverage of the different topics, it is crucial to identify underrepresented classes in the Knowledge Graph. In this paper, we tackle this problem by developing statistical techniques for class cardinality estimation in collaborative Knowledge Graph platforms. Our method is able to estimate the completeness of a class—as defined by a schema or ontology—hence can be used to answer questions such as “Does the knowledge base have a complete list of all {Beer Brands—Volcanos—Video Game Consoles}?” As a use-case, we focus on Wikidata, which poses unique challenges in terms of the size of its ontology, the number of users actively populating its graph, and its extremely dynamic nature. Our techniques are derived from species estimation and data-management methodologies, and are applied to the case of graphs and collaborative editing. In our empirical evaluation, we observe that i) the number and frequency of unique class instances drastically influence the performance of an estimator, ii) bursts of inserts cause some estimators to overestimate the true size of the class if they are not properly handled, and iii) one can effectively measure the convergence of a class towards its true size by considering the stability of an estimator against the number of available instances
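
As a rough illustration of how a species-estimation technique can be turned into a class completeness estimate, the sketch below uses the bias-corrected Chao1 richness estimator. This is an assumption made for illustration: the abstract does not say which estimators the paper uses, and the function names `chao1` and `completeness` as well as the toy observation list are hypothetical.

```python
from collections import Counter


def chao1(observations):
    """Bias-corrected Chao1 estimate of the total number of distinct instances.

    `observations` is a list of instance identifiers, one entry per time
    an instance was seen (for example, once per edit mentioning it).
    """
    counts = Counter(observations)      # how often each instance was seen
    freq = Counter(counts.values())     # how many instances were seen k times
    s_obs = len(counts)                 # distinct instances observed so far
    f1, f2 = freq[1], freq[2]           # singletons and doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def completeness(observations):
    """Observed distinct instances as a share of the estimated class size."""
    return len(set(observations)) / chao1(observations)


if __name__ == "__main__":
    # Hypothetical observations of instances of one class.
    seen = ["a", "a", "b", "c", "c", "d"]
    print(chao1(seen))         # estimated class size
    print(completeness(seen))  # estimated completeness between 0 and 1
```

Many singletons relative to doubletons push the estimate well above the observed count, which matches the abstract's point that the frequency of unique instances strongly affects estimator behaviour.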

    Assessing Conservation Values: Biodiversity and Endemicity in Tropical Land Use Systems

    Despite an increasing amount of data on the effects of tropical land use on continental forest fauna and flora, it is debatable whether the choice of the indicator variables allows for a proper evaluation of the role of modified habitats in mitigating the global biodiversity crisis. While many single-taxon studies have highlighted that species with narrow geographic ranges especially suffer from habitat modification, there is no multi-taxa study available which consistently focuses on geographic range composition of the studied indicator groups. We compiled geographic range data for 180 bird, 119 butterfly, 204 tree and 219 understorey plant species sampled along a gradient of habitat modification ranging from near-primary forest through young secondary forest and agroforestry systems to annual crops in the southwestern lowlands of Cameroon. We found very similar patterns of declining species richness with increasing habitat modification between taxon-specific groups of similar geographic range categories. At the 8 km² spatial level, estimated richness of endemic species declined in all groups by 21% (birds) to 91% (trees) from forests to annual crops, while estimated richness of widespread species increased by 101% (trees) to 275% (understorey plants), or remained stable (-2%, butterflies). Even traditional agroforestry systems lost 18% (birds) to 90% (understorey plants) of their estimated endemic species richness. Endemic species richness of one taxon explained between 37% and 57% of others (positive correlations) and taxon-specific richness in widespread species explained up to 76% of variation in richness of endemic species (negative correlations). The key implication of this study is that the range size aspect is fundamental in assessments of conservation value via species inventory data from modified habitats. The study also suggests that even ecologically friendly agricultural matrices may be of much lower value for tropical conservation than indicated by mere biodiversity value.

    Estimating Animal Abundance in Ground Beef Batches Assayed with Molecular Markers

    Estimating animal abundance in industrial scale batches of ground meat is important for mapping meat products through the manufacturing process and for effectively tracing the finished product during a food safety recall. The processing of ground beef involves a potentially large number of animals from diverse sources in a single product batch, which produces a high heterogeneity in capture probability. In order to estimate animal abundance through DNA profiling of ground beef constituents, two parameter-based statistical models were developed for incidence data. Simulations were applied to evaluate the maximum likelihood estimate (MLE) of a joint likelihood function from multiple surveys, showing superiority in the presence of high capture heterogeneity with small sample sizes, or comparable estimation in the presence of low capture heterogeneity with a large sample size when compared to other existing models. Our model employs the full information on the pattern of the capture-recapture frequencies from multiple samples. We applied the proposed models to estimate animal abundance in six manufacturing beef batches, genotyped using 30 single nucleotide polymorphism (SNP) markers, from a large scale beef grinding facility. Results show that between 411 and 1367 animals were present in the six manufacturing beef batches. These estimates are informative as a reference for improving recall processes and tracing finished meat products back to source.

    Monitoring Nekton as a Bioindicator in Shallow Estuarine Habitats

    Long-term monitoring of estuarine nekton has many practical and ecological benefits but efforts are hampered by a lack of standardized sampling procedures. This study provides a rationale for monitoring nekton in shallow (< 1 m), temperate, estuarine habitats and addresses some important issues that arise when developing monitoring protocols. Sampling in seagrass and salt marsh habitats is emphasized due to the susceptibility of each habitat to anthropogenic stress and to the abundant and rich nekton assemblages that each habitat supports. Extensive sampling with quantitative enclosure traps that estimate nekton density is suggested. These gears have a high capture efficiency in most habitats and are small enough (e.g., 1 m²) to permit sampling in specific microhabitats. Other aspects of nekton monitoring are discussed, including spatial and temporal sampling considerations, station selection, sample size estimation, and data collection and analysis. Developing and initiating long-term nekton monitoring programs will help evaluate natural and human-induced changes in estuarine nekton over time and advance our understanding of the interactions between nekton and the dynamic estuarine environment.