Combining Thesaurus Knowledge and Probabilistic Topic Models
In this paper we present an approach for introducing thesaurus knowledge into
probabilistic topic models. The main idea of the approach is that the
frequencies of semantically related words and phrases that co-occur in the
same texts should be enhanced, which increases their contribution to the
topics found in those texts. We conducted experiments with several thesauri
and found that domain-specific knowledge is the most useful for improving
topic models. If a general thesaurus, such as WordNet, is used, a
thesaurus-based improvement of topic models can still be achieved by
excluding hyponymy relations from the combined topic models.
Comment: Accepted to AIST-2017 conference (http://aistconf.ru/). The final
publication will be available at link.springer.co
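The core idea above can be sketched in a few lines. The paper does not specify the exact boosting rule, so the following is a minimal illustration under an assumed multiplicative boost, with a toy hypothetical thesaurus; only the principle (enhance counts of thesaurus-related terms that co-occur in a document) comes from the abstract.

```python
from collections import Counter

# Hypothetical toy thesaurus: maps a term to its semantically related terms.
THESAURUS = {
    "car": {"automobile", "vehicle"},
    "automobile": {"car", "vehicle"},
    "vehicle": {"car", "automobile"},
}

def boosted_counts(doc_tokens, boost=2.0):
    """Enhance the count of a term when a thesaurus-related term
    co-occurs in the same document (the abstract's core idea;
    the multiplicative boost factor is an assumption)."""
    counts = Counter(doc_tokens)
    vocab = set(counts)
    boosted = {}
    for term, freq in counts.items():
        related = THESAURUS.get(term, set())
        if related & vocab:            # a related term co-occurs in this doc
            boosted[term] = freq * boost
        else:
            boosted[term] = float(freq)
    return boosted

doc = ["car", "automobile", "engine", "road"]
print(boosted_counts(doc))
```

The boosted counts would then replace raw frequencies as input to the topic model, raising the joint contribution of related terms to the topics of the documents where they co-occur.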
A unifying representation for a class of dependent random measures
We present a general construction for dependent random measures based on
thinning Poisson processes on an augmented space. The framework is not
restricted to dependent versions of a specific nonparametric model, but can be
applied to all models that can be represented using completely random measures.
Several existing dependent random measures can be seen as specific cases of
this framework. Interesting properties of the resulting measures are derived
and the efficacy of the framework is demonstrated by constructing a
covariate-dependent latent feature model and topic model that obtain superior
predictive performance.
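The thinning construction described above can be illustrated concretely. The sketch below draws one Poisson process of candidate atoms and keeps each atom, per covariate value, with a covariate-dependent probability; the exponential kernel `keep_prob` is an illustrative assumption, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def thinned_measures(rate, covariates, keep_prob):
    """Dependent random measures via thinning (sketch): a single
    Poisson process of candidate atoms on [0, 1] is thinned
    independently at each covariate value, so nearby covariates
    retain overlapping sets of atoms and the measures are dependent."""
    n = rng.poisson(rate)                  # number of candidate atoms
    atoms = rng.uniform(0.0, 1.0, size=n)  # atom locations on [0, 1]
    measures = {}
    for x in covariates:
        keep = rng.random(n) < keep_prob(atoms, x)
        measures[x] = atoms[keep]          # atoms retained at covariate x
    return atoms, measures

# Assumed illustrative kernel: atoms near the covariate survive thinning.
atoms, measures = thinned_measures(
    rate=50,
    covariates=[0.2, 0.8],
    keep_prob=lambda a, x: np.exp(-5.0 * np.abs(a - x)),
)
```

Because every covariate-indexed measure is a thinning of the same underlying Poisson process, dependence across covariates falls out of shared atoms rather than being imposed model by model.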
A statistical significance testing approach for measuring term burstiness with applications to domain-specific terminology extraction
A term in a corpus is said to be "bursty" (or overdispersed) when its
occurrences are concentrated in a few out of many documents. In this paper, we
propose Residual Inverse Collection Frequency (RICF), a heuristic for
quantifying term burstiness inspired by statistical significance testing. The
chi-squared test is, to our knowledge, the sole test of statistical
significance among existing term burstiness measures. The chi-squared test
computes burstiness scores from the collection frequency statistic alone
(i.e., the proportion of all terms in a corpus accounted for by a specified
term), whereas certain other widely used burstiness measures instead exploit a
term's document frequency (i.e., the proportion of documents within a corpus
in which the term occurs). RICF addresses this shortcoming of the chi-squared
test by systematically incorporating both the collection frequency and
document frequency statistics into its burstiness scores. We evaluate the
RICF measure on a domain-specific
technical terminology extraction task using the GENIA Term corpus benchmark,
which comprises 2,000 annotated biomedical article abstracts. RICF generally
outperformed the chi-squared test in terms of precision at k score with percent
improvements of 0.00% (P@10), 6.38% (P@50), 6.38% (P@100), 2.27% (P@500), 2.61%
(P@1000), and 1.90% (P@5000). Furthermore, RICF performance was competitive
with the performances of other well-established measures of term burstiness.
Based on these findings, we consider the contributions of this paper a
promising starting point for future work on leveraging statistical
significance testing in text analysis.
Comment: 19 pages, 1 figure, 6 tables
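The two statistics the abstract contrasts are easy to compute directly. The sketch below stops at those inputs, since the abstract does not give RICF's formula; the corpus is a toy example.

```python
def burstiness_statistics(term, docs):
    """Compute the two statistics contrasted in the abstract:
    collection frequency (the term's share of all tokens in the
    corpus) and document frequency (the share of documents that
    contain the term). RICF combines both; its exact formula is
    not given in the abstract, so this sketch ends at its inputs."""
    total_tokens = sum(len(d) for d in docs)
    term_tokens = sum(d.count(term) for d in docs)
    cf = term_tokens / total_tokens                 # collection frequency
    df = sum(term in d for d in docs) / len(docs)   # document frequency
    return cf, df

docs = [["gene", "gene", "cell"], ["cell"], ["protein"]]
cf, df = burstiness_statistics("gene", docs)
# A bursty term pairs a high collection frequency with a low
# document frequency: many occurrences packed into few documents.
```

A chi-squared-style score built from `cf` alone cannot distinguish a term spread evenly across documents from one concentrated in a handful, which is the gap a measure using both statistics aims to close.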
Computer-Aided Geometry Modeling
Techniques in computer-aided geometry modeling and their applications are addressed. Mathematical modeling, solid geometry models, management of geometric data, development of geometry standards, and interactive and graphic procedures are discussed. The applications include aeronautical and aerospace structures design, fluid flow modeling, and gas turbine design.