4 research outputs found
Enhanced classification through exploitation of hierarchical structures
Humans often organize information by encoding it in structures that link
together entities such as concepts, objects, properties etc. Among the various
structures possible, hierarchies are commonly used. For instance, taxonomies
of categories commonly employ hierarchies to indicate that one category “is a”
type of another. The Yahoo! Web Directory and the Open Directory Project
are two examples of large taxonomies where topics are hierarchically arranged.
Hierarchies are also used to recursively decompose composite objects into their
constituent parts. Examples of this are webpages that can be parsed and then
represented as DOM-trees, where the DOM nodes correspond to sections of
the webpages.
In this thesis we argue that these hierarchical relationships between entities can be exploited to facilitate common data mining tasks defined upon
them, like automated classification. Specifically, we show that the information
encoded in these hierarchies can be reduced to constraints on class membership scores that can then be enforced as a post-processing step to enhance the accuracy of classification. We demonstrate our ideas and algorithms on three
real-world tasks.
First, we tackle the problem of classification into hierarchical taxonomies.
We show how different taxonomy structures can be translated into constraints
on the outputs of classifiers learned at the nodes of the hierarchy. In addition,
we give algorithms to optimally enforce these constraints and show that this
results in improved classification accuracy. In cases where the taxonomies
are not available, we give an approach to automatically derive hierarchical
relationships amongst a flat set of categories. Next, we work on the problem
of detecting noisy (templated) parts of webpages. We give algorithms that
rate each section of a webpage in terms of how templated it is. Then we show
that smoothing the output of these template classifiers over the DOM-tree
hierarchy improves the template detection performance of our system. Finally,
we investigate the task of segmenting websites into topically cohesive regions.
We define a framework and within it a set of measures that characterize good
segmentations, and give an efficient algorithm to find the best segmentation
within this framework.
We formalize the problem of enforcing constraints on the outputs of classifiers as regularized isotonic or unimodal regression on rooted trees; these are
generalizations of the classic isotonic regression problem. The nature of the
constraints as well as the cost functions is different in each of the applications
mentioned above. For all these formulations we give efficient algorithms to optimally smooth the classifier outputs. These novel formulations and algorithms
might be of interest independent of the applications in this thesis.
Electrical and Computer Engineering
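To make the regression formulation in the abstract above concrete, here is a minimal sketch of the classic isotonic regression problem that the thesis generalizes. It covers only the simplest case, a chain of scores that must be non-decreasing, solved with the standard pool-adjacent-violators procedure under squared error. It is not the thesis's tree algorithm, and the function name `isotonic_regression` and its parameters are illustrative assumptions.

```python
def isotonic_regression(scores, weights=None):
    """Weighted least-squares fit of a non-decreasing sequence to `scores`.

    Classic chain-ordered case only (pool adjacent violators); the
    rooted-tree and unimodal variants named in the abstract are not covered.
    """
    if weights is None:
        weights = [1.0] * len(scores)
    # Each block stores [weighted mean, total weight, number of points].
    blocks = []
    for y, w in zip(scores, weights):
        blocks.append([float(y), float(w), 1])
        # Merge backwards while two adjacent blocks violate the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, n1 + n2])
    # Expand each block back into one fitted value per input point.
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


# Hypothetical classifier scores that should not decrease along a chain.
print(isotonic_regression([1, 3, 2, 4]))  # [1.0, 2.5, 2.5, 4.0]
```

The out-of-order pair (3, 2) is pooled into its mean, which is the smallest squared-error change that restores the ordering. On a rooted tree the constraint would hold along every root-to-leaf path rather than along a single chain.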
Soft cluster ensembles
Cluster Ensembles is a framework for combining multiple partitionings obtained from separate clustering runs into a final consensus clustering without accessing the original features of the data or the algorithms that determined these partitions. This framework was first proposed by Strehl and Ghosh [31], who also provided three techniques to solve the problem. Since then there have been numerous attempts to solve cluster ensembles using approaches such as Maximum Likelihood using EM, Bipartite Graph Partitioning, Genetic algorithms, and Voting-Merging. Most of this work has focused on devising approaches that accept hard clusterings as input. Also, there has been no comparison of combining accuracy on soft vs. hard cluster ensembles. In this thesis we will show experimentally as well as intuitively that using soft clusterings as input does offer significant advantages, especially when dealing with vertically partitioned data. We modify many of the above-mentioned algorithms to accept soft clusterings and experiment over multiple real-life datasets.
Electrical and Computer Engineering
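As a rough illustration of what soft clusterings add as input, the sketch below builds a pairwise co-association matrix from per-run membership probabilities. This is a generic construction chosen for illustration, not one of the algorithms the abstract says were modified. The function name `soft_coassociation` and the input layout (one list of membership vectors per clustering run) are assumptions.

```python
def soft_coassociation(soft_clusterings):
    """Average, over runs, the probability that two points share a cluster.

    `soft_clusterings` is a list of runs; each run is a list with one
    membership vector per data point (entries sum to 1). A hard clustering
    is the special case where every vector is one-hot.
    """
    n = len(soft_clusterings[0])
    co = [[0.0] * n for _ in range(n)]
    for memberships in soft_clusterings:
        for i in range(n):
            for j in range(n):
                # Dot product of membership vectors: chance that points
                # i and j fall in the same cluster in this run.
                co[i][j] += sum(a * b for a, b in zip(memberships[i], memberships[j]))
    runs = len(soft_clusterings)
    return [[value / runs for value in row] for row in co]


# Two hypothetical runs over three points; cluster labels need not match across runs.
run_a = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
run_b = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
matrix = soft_coassociation([run_a, run_b])
print(matrix[0][1], matrix[0][2])  # 1.0 0.5
```

Because only pairwise agreement is used, the construction needs neither the original features nor a matching of cluster labels between runs, which fits the framework's stated constraints. With soft inputs, an uncertain point contributes fractional agreement instead of an all-or-nothing vote.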
Enhanced Classification through Exploitation of Hierarchical Structures
Dedicated to my parents, Cdr. Vinod Punera and Shashi Punera