2 research outputs found
A statistical significance testing approach for measuring term burstiness with applications to domain-specific terminology extraction
A term in a corpus is said to be ``bursty'' (or overdispersed) when its
occurrences are concentrated in few out of many documents. In this paper, we
propose Residual Inverse Collection Frequency (RICF), a statistical
significance test inspired heuristic for quantifying term burstiness. The
chi-squared test is, to our knowledge, the sole test of statistical
significance among existing term burstiness measures. Chi-squared test term
burstiness scores are computed from the collection frequency statistic (i.e.,
the proportion that a specified term constitutes in relation to all terms
within a corpus). However, the document frequency of a term (i.e., the
proportion of documents within a corpus in which a specific term occurs) is
exploited by certain other widely used term burstiness measures. RICF addresses
this shortcoming of the chi-squared test by virtue of its term burstiness
scores systematically incorporating both the collection frequency and document
frequency statistics. We evaluate the RICF measure on a domain-specific
technical terminology extraction task using the GENIA Term corpus benchmark,
which comprises 2,000 annotated biomedical article abstracts. RICF generally
outperformed the chi-squared test in terms of precision at k score with percent
improvements of 0.00% (P@10), 6.38% (P@50), 6.38% (P@100), 2.27% (P@500), 2.61%
(P@1000), and 1.90% (P@5000). Furthermore, RICF performance was competitive
with the performances of other well-established measures of term burstiness.
Based on these findings, we consider our contributions in this paper as a
promising starting point for future exploration in leveraging statistical
significance testing in text analysis.Comment: 19 pages, 1 figure, 6 table
Can We Quantify Domainhood? : Exploring Measures to Assess Domain-Specificity in Web Corpora
Web corpora are a cornerstone of modern Language Technology. Corpora built from the web are convenient because their creation is fast and inexpensive. Several studies have been carried out to assess the representativeness of general-purpose web corpora by comparing them to traditional corpora. Less attention has been paid to assess the representativeness of specialized or domain-specific web corpora. In this paper, we focus on the assessment of domain representativeness of web corpora and we claim that it is possible to assess the degree of domain-specificity, or domainhood, of web corpora. We present a case study where we explore the effectiveness of different measures - namely the Mann-Withney-Wilcoxon Test, Kendall correlation coefficient, Kullback–Leibler divergence, log-likelihood and burstiness - to gauge domainhood. Our findings indicate that burstiness is the most suitable measure to single out domain-specific words from a specialized corpus and to allow for the quantification of domainhood.Funding agencies: E-care@home, a "SIDUS - Strong Distributed Research Environment" project - Swedish Knowledge Foundation</p