Multi-mode partitioning for text clustering to reduce dimensionality and noises
Co-clustering in text mining has been proposed to partition words and documents simultaneously. Although the
main advantage of this approach is that it may improve the interpretation of clusters in the data, there are still few proposals
for such methods, while one-way partitioning is still widely used for information retrieval. In contrast to
structured information, textual data suffer from high dimensionality and sparse matrices, so texts must be
pre-processed before clustering techniques can be applied. In this paper, we propose a new procedure to reduce the high
dimensionality of corpora and to remove noise from the unstructured data. We test two different processes
for treating the data by applying two co-clustering algorithms; based on the results, we present the procedure that provides
the best interpretation of the data.
Using text analysis to quantify the similarity and evolution of scientific disciplines
We use an information-theoretic measure of linguistic similarity to
investigate the organization and evolution of scientific fields. An analysis of
almost 20M papers from the past three decades reveals that the linguistic
similarity is related to, but different from, expert and citation-based
classifications, leading to an improved view of the organization of science. A
temporal analysis of the similarity of fields shows that some fields (e.g.,
computer science) are becoming increasingly central, but that on average the
similarity between pairs has not changed in the last decades. This suggests
that tendencies of convergence (e.g., multi-disciplinarity) and divergence
(e.g., specialization) of disciplines are in balance.
Comment: 9 pages, 4 figures
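The abstract does not say which information-theoretic measure is used. As a minimal sketch, assuming the Jensen-Shannon divergence between word-frequency distributions as a stand-in, the similarity of two fields could be estimated as follows. The function names `word_distribution` and `jensen_shannon_divergence` and the whitespace tokenization are illustrative assumptions, not the authors' method.

```python
import math
from collections import Counter


def word_distribution(text):
    """Relative frequency of each lowercased, whitespace-separated token."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits (assumed measure, see lead-in).

    Returns 0 for identical distributions and 1 for distributions
    that share no words.
    """
    jsd = 0.0
    for word in set(p) | set(q):
        pw = p.get(word, 0.0)
        qw = q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture distribution
        if pw > 0:
            jsd += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            jsd += 0.5 * qw * math.log2(qw / mw)
    return jsd


# Hypothetical toy corpora standing in for two fields.
field_a = word_distribution("graph algorithm network graph complexity")
field_b = word_distribution("network protein cell protein graph")
divergence = jensen_shannon_divergence(field_a, field_b)
```

A lower divergence means the two fields use more similar vocabulary; tracking it over publication years would give a temporal view like the one the abstract describes.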
Design and enhanced evaluation of a robust anaphor resolution algorithm
Syntactic coindexing restrictions are by now known to be of central importance to practical anaphor resolution approaches. Since, in particular due to structural ambiguity, the assumption of the availability of a unique syntactic reading proves to be unrealistic, robust anaphor resolution relies on techniques to overcome this deficiency.
This paper describes the ROSANA approach, which generalizes the verification of coindexing restrictions in order to make it applicable to the deficient syntactic descriptions that are provided by a robust state-of-the-art parser. By a formal evaluation on two corpora that differ with respect to text genre and domain, it is shown that ROSANA achieves high-quality robust coreference resolution. Moreover, by an in-depth analysis, it is proven that the robust implementation of syntactic disjoint reference is nearly optimal. The study reveals that, compared with approaches that rely on shallow preprocessing, the largely nonheuristic disjoint reference algorithmization opens up the possibility for a slight improvement. Furthermore, it is shown that more significant gains are to be expected elsewhere, particularly from a text-genre-specific choice of preference strategies.
The performance study of the ROSANA system crucially rests on an enhanced evaluation methodology for coreference resolution systems, the development of which constitutes the second major contribution of the paper. As a supplement to the model-theoretic scoring scheme that was developed for the Message Understanding Conference (MUC) evaluations, additional evaluation measures are defined that, on one hand, support the developer of anaphor resolution systems, and, on the other hand, shed light on application aspects of pronoun interpretation.
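For orientation, the MUC model-theoretic scoring scheme mentioned above can be sketched as below. This is a minimal illustration of the standard link-based recall and precision, assuming that coreference chains are given as sets of mention identifiers and that response chains do not overlap. The function names are hypothetical, and the additional measures defined in the paper are not described in the abstract, so they are not shown.

```python
def muc_recall(key_chains, response_chains):
    """Link-based MUC recall of response chains against key chains.

    For each key chain S, count the parts p(S) it is split into by the
    response (mentions missing from the response count as singletons).
    Recall = sum(|S| - |p(S)|) / sum(|S| - 1).
    """
    numerator = 0
    denominator = 0
    for chain in key_chains:
        chain = set(chain)
        covered = set()
        parts = 0
        for response in response_chains:
            overlap = chain & set(response)
            if overlap:
                parts += 1
                covered |= overlap
        parts += len(chain - covered)  # unresolved mentions are singletons
        numerator += len(chain) - parts
        denominator += len(chain) - 1
    return numerator / denominator if denominator else 0.0


def muc_scores(key_chains, response_chains):
    """Return (recall, precision, F1); precision swaps the two roles."""
    recall = muc_recall(key_chains, response_chains)
    precision = muc_recall(response_chains, key_chains)
    total = recall + precision
    f1 = 2 * recall * precision / total if total else 0.0
    return recall, precision, f1


# Toy example: one key chain split into two response chains.
key = [{"A", "B", "C", "D"}]
response = [{"A", "B"}, {"C", "D"}]
recall, precision, f1 = muc_scores(key, response)
```

In the toy example the response misses one of the three links needed to connect the key chain, so recall is 2/3, while every link the response proposes is correct, so precision is 1.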