4 research outputs found
Iterative Optimization and Simplification of Hierarchical Clusterings
Clustering is often used for discovering structure in data. Clustering
systems differ in the objective function used to evaluate clustering quality
and the control strategy used to search the space of clusterings. Ideally, the
search strategy should consistently construct clusterings of high quality, but
be computationally inexpensive as well. In general, we cannot have it both
ways, but we can partition the search so that a system inexpensively constructs
a 'tentative' clustering for initial examination, followed by iterative
optimization, which continues to search in the background for improved clusterings.
Given this motivation, we evaluate an inexpensive strategy for creating initial
clusterings, coupled with several control strategies for iterative
optimization, each of which repeatedly modifies an initial clustering in search
of a better one. One of these methods appears novel as an iterative
optimization strategy in clustering contexts. Once a clustering has been
constructed it is judged by analysts -- often according to task-specific
criteria. Several authors have abstracted these criteria and posited a generic
performance task akin to pattern completion, where the error rate over
completed patterns is used to 'externally' judge clustering utility. Given this
performance task, we adapt resampling-based pruning strategies used by
supervised learning systems to the task of simplifying hierarchical
clusterings, thus promising to ease post-clustering analysis. Finally, we
propose a number of objective functions, based on attribute-selection measures
for decision-tree induction, that might perform well on the error rate and
simplicity dimensions.
Comment: See http://www.jair.org/ for any accompanying file
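As a rough illustration of the "inexpensive initial clustering followed by iterative optimization" idea in the abstract above, the following is a minimal sketch and not the paper's method. It assumes a flat partition rather than a hierarchy, and it uses within-cluster squared error as a stand-in objective, whereas the abstract's objective functions are based on attribute-selection measures. The names `objective` and `redistribute` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def objective(points: List[Point], labels: List[int], k: int) -> float:
    """Stand-in objective (assumption): total within-cluster squared error."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if members:
            ctr = centroid(members)
            total += sum(sq_dist(p, ctr) for p in members)
    return total


def redistribute(points: List[Point], labels: List[int], k: int,
                 max_passes: int = 20) -> Tuple[List[int], float]:
    """Repeatedly move single points between clusters while the objective drops.

    Starts from the given (cheap) initial labels and stops when a full pass
    yields no improvement or after max_passes passes.
    """
    labels = list(labels)  # do not mutate the caller's initial clustering
    best = objective(points, labels, k)
    for _ in range(max_passes):
        improved = False
        for i in range(len(points)):
            for c in range(k):
                if c == labels[i]:
                    continue
                previous = labels[i]
                labels[i] = c  # tentatively move point i to cluster c
                score = objective(points, labels, k)
                if score < best - 1e-12:
                    best = score  # keep the move
                    improved = True
                else:
                    labels[i] = previous  # undo the move
        if not improved:
            break
    return labels, best


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    initial = [0, 1, 0, 1]  # deliberately poor round-robin starting point
    final_labels, final_score = redistribute(data, initial, k=2)
    print(final_labels, final_score)
```

Recomputing the objective from scratch for every tentative move keeps the sketch short but is wasteful. A practical system would update cluster statistics incrementally.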
Incremental and Scalable Computation of Dynamic Topography Information Landscapes
Dynamic topography information landscapes are capable of visualizing longitudinal changes in large document repositories. Resembling tectonic processes in the natural world, dynamic rendering reflects both long-term trends and short-term fluctuations in such repositories. To visualize the rise and decay of topics, the mapping algorithm elevates and lowers related sets of concentric contour lines. Acknowledging the growing number of documents to be processed by state-of-the-art Web intelligence applications, we present a scalable, incremental approach for generating such landscapes. The processing pipeline includes a number of sequential tasks, from crawling, filtering and pre-processing Web content to projecting, labeling and rendering the aggregated information. Processing steps central to incremental processing are found in the projection stage, which consists of document clustering, cluster force-directed placement, and fast document positioning. We introduce two different positioning methods and compare them in an incremental setting using two different quality measures. The evaluation is performed on a set of approximately 5000 documents taken from the environmental blog sample of the Media Watch on Climate Change (www.ecoresearch.net/climate), a Web content aggregator about climate change and related environmental issues that serves static versions of the information landscapes presented in this paper as part of a multiple coordinated view representation.