QProber: A System for Automatic Classification of Hidden-Web Resources

Author: Gravano Luis
Ipeirotis Panagiotis G.
Sahami Mehran
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2001

The contents of many valuable web-accessible databases are only available through search interfaces and are hence invisible to traditional web "crawlers." Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. Here, we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.

Columbia University Academic Commons
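
The probe-counting idea in the QProber abstract above can be illustrated with a minimal Python sketch. It is not the paper's decision rule: the function `classify_by_probes`, the `min_share` threshold, the probe lists, and the toy match counts are all assumptions made up for illustration. The only property taken from the abstract is that the database is classified from the number of matches each probe returns, without retrieving any documents.

```python
from typing import Callable, Dict, List


def classify_by_probes(
    count_matches: Callable[[str], int],
    probes: Dict[str, List[str]],
    min_share: float = 0.4,  # hypothetical threshold, not from the paper
) -> List[str]:
    """Return categories whose probes attract at least min_share of all matches."""
    # Sum the match counts reported by the search interface for each category.
    totals = {cat: sum(count_matches(q) for q in queries)
              for cat, queries in probes.items()}
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # no probe matched anything, so no category is assigned
    return sorted(cat for cat, n in totals.items()
                  if n / grand_total >= min_share)


# Toy stand-in for a hidden-web database that only reports match counts.
FAKE_COUNTS = {"cancer": 900, "diabetes": 600, "soccer": 10, "baseball": 5}


def fake_count(query: str) -> int:
    return FAKE_COUNTS.get(query, 0)


PROBES = {"Health": ["cancer", "diabetes"], "Sports": ["soccer", "baseball"]}
result = classify_by_probes(fake_count, PROBES)
print(result)
```

In this toy setting almost all matches come from the health probes, so only that category passes the threshold.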

Generating Neural Networks through the Induction of Threshold Logic Unit Trees (Extended Abstract)

Author: Mehran Sahami
Publication date: 01/01/1995

Mehran Sahami, Computer Science Department, Stanford University, Stanford, CA 94305, USA. Abstract. We investigate the generation of neural networks through the induction of binary trees of threshold logic units (TLUs). Initially, we describe the framework for our tree construction algorithm and how such trees can be transformed into an isomorphic neural network topology. Several methods for learning the linear discriminant functions at each node of the tree structure are examined and shown to produce accuracy results that are comparable to classical information theoretic methods for constructing decision trees (which use single feature tests at each node). Our TLU trees, however, are smaller and thus easier to understand. Moreover, we show that it is possible to simultaneously learn both the topology and weight settings of a neural network simply using the training data set that we are given. 1 Introduction We present a non-incremental algorithm that..

CiteSeerX

Learning Non-Linearly Separable Boolean Functions With Linear Threshold Unit Trees and Madaline-Style Networks

Author: Mehran Sahami
Publication venue: AAAI Press

This paper investigates an algorithm for the construction of decision trees composed of linear threshold units and also presents a novel algorithm for the learning of non-linearly separable Boolean functions using Madaline-style networks that are isomorphic to decision trees. The construction of such networks is discussed, and their performance in learning is compared with standard backpropagation on a sample problem in which many irrelevant attributes are introduced. Littlestone's Winnow algorithm is also explored within this architecture as a means of learning in the presence of many irrelevant attributes. The learning ability of this Madaline-style architecture on non-optimal (larger than necessary) networks is also explored. Introduction We initially examine a non-incremental algorithm that learns binary classification tasks by producing decision trees of linear threshold units (LTU trees). This decision tree bears some similarity to the decision trees produced by ID3 (Quinlan 19..

CiteSeerX
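
As background for the Winnow algorithm mentioned in the abstract above, here is a minimal Python sketch of a single Winnow unit using multiplicative promotion and demotion (the variant usually called Winnow2). It does not reproduce the paper's tree or Madaline-style architecture. The function names, the threshold choice (equal to the number of features), the update factor of 2, and the toy target function are all assumptions chosen for illustration.

```python
from itertools import product


def winnow_predict(weights, theta, x):
    """Fire (return 1) when the summed weight of the active inputs reaches theta."""
    return 1 if sum(w for w, xi in zip(weights, x) if xi) >= theta else 0


def winnow_train(examples, n_features, alpha=2.0, max_epochs=50):
    """Mistake-driven training with multiplicative updates."""
    weights = [1.0] * n_features
    theta = float(n_features)  # assumed threshold; other choices are possible
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:
            if winnow_predict(weights, theta, x) == y:
                continue
            mistakes += 1
            # Promote active weights on a missed positive, demote on a false alarm.
            factor = alpha if y == 1 else 1.0 / alpha
            weights = [w * factor if xi else w for w, xi in zip(weights, x)]
        if mistakes == 0:
            break  # a full pass without errors: every example is classified correctly
    return weights, theta


# Toy target: x0 OR x2, with four irrelevant attributes.
N = 6
DATA = [(x, 1 if (x[0] or x[2]) else 0) for x in product([0, 1], repeat=N)]
weights, theta = winnow_train(DATA, N)
print(weights)
```

Only the weights of inputs that were active on a mistaken example change, which is why irrelevant attributes tend to end up with small weights.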

Learning Classification Rules Using Lattices

Author: Mehran Sahami

This paper presents a novel induction algorithm, Rulearner, which induces classification rules using a Galois lattice as an explicit map through the search space of rules. The construction of lattices from data is initially discussed and the use of these structures in inducing classification rules is examined. The Rulearner system is shown to compare favorably with commonly used symbolic learning methods which use heuristics rather than an explicit map to guide their search through the rule space. Furthermore, our learning system is shown to be robust in the presence of noisy data. The Rulearner system is also capable of learning both decision lists and unordered rule sets and thus allows for comparisons of these different learning paradigms within the same algorithmic framework. Research Area: inductive learning Keywords: lattices, decision lists, rule induction 1 Introduction Research in rule induction by means of search [Michalski, 1969; Mitchell, 1982; Clark & Niblett, 198..

CiteSeerX

Generating Neural Networks Through the Induction of Threshold Logic Unit Trees

Author: Mehran Sahami

This paper investigates the generation of neural networks through the induction of binary trees of threshold logic units (TLUs). Initially, we describe the framework for our tree construction algorithm and show how it helps to bridge the gap between pure connectionist (neural network) and symbolic (decision tree) paradigms. We also show how the trees of threshold units that we induce can be transformed into an isomorphic neural network topology. Several methods for learning the linear discriminant functions at each node of the tree structure are examined and shown to produce accuracy results that are comparable to classical information theoretic methods for constructing decision trees (which use single feature tests at each node), but produce trees that are smaller and thus easier to understand. Moreover, our results also show that it is possible to simultaneously learn both the topology and weight settings of a neural network simply using the training data set that we are initially gi..

CiteSeerX

Learning Classification Rules Using Lattices (Extended Abstract)

Author: Mehran Sahami

Abstract. This paper presents a novel induction algorithm, Rulearner, which induces classification rules using a Galois lattice as an explicit map through the search space of rules. The Rulearner system is shown to compare favorably with commonly used symbolic learning methods which use heuristics rather than an explicit map to guide their search through the rule space. Furthermore, our learning system is shown to be robust in the presence of noisy data. The Rulearner system is also capable of learning both decision lists and unordered rule sets, allowing for comparisons of these different learning paradigms within the same algorithmic framework.

CiteSeerX

Algorithms, Experimentation

Author: Mehran Sahami

Determining the similarity of short text snippets, such as search queries, works poorly with traditional document similarity measures (e.g., cosine), since there are often few, if any, terms in common between two short text snippets. We address this problem by introducing a novel method for measuring the similarity between short text snippets (even those without any overlapping terms) by leveraging web search results to provide greater context for the short texts. In this paper, we define such a similarity kernel function, mathematically analyze some of its properties, and provide examples of its efficacy. We also show the use of this kernel function in a large-scale system for suggesting related queries to search engine users.

CiteSeerX
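
The general idea in the abstract above, comparing two short texts through the documents a search engine returns for them, can be sketched in Python as follows. This is a simplified illustration and not the paper's kernel definition: it uses raw term counts rather than any particular weighting scheme, and the `toy_search` function, its canned results, and the names `expand` and `kernel` are all hypothetical stand-ins.

```python
import math
from collections import Counter


def expand(text, search):
    """Unit-length term vector built from the documents retrieved for text."""
    centroid = Counter()
    for doc in search(text):
        counts = Counter(doc.lower().split())
        norm = math.sqrt(sum(c * c for c in counts.values()))
        for term, c in counts.items():
            centroid[term] += c / norm  # each document contributes a unit vector
    length = math.sqrt(sum(v * v for v in centroid.values()))
    return {t: v / length for t, v in centroid.items()} if length else {}


def kernel(a, b, search):
    """Inner product of the two expanded, normalized vectors."""
    va, vb = expand(a, search), expand(b, search)
    return sum(w * vb.get(t, 0.0) for t, w in va.items())


# Hypothetical canned results standing in for a real web search engine.
TOY_RESULTS = {
    "ai": ["artificial intelligence research machine learning"],
    "artificial intelligence": ["artificial intelligence machine learning systems"],
    "pizza": ["pizza restaurant delivery menu"],
}


def toy_search(query):
    return TOY_RESULTS.get(query, [])


print(kernel("ai", "artificial intelligence", toy_search))
print(kernel("ai", "pizza", toy_search))
```

The snippets "ai" and "artificial intelligence" share no terms directly, yet their expanded vectors overlap, while "ai" and "pizza" remain unrelated after expansion.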

Learning Limited Dependence Bayesian Classifiers

Author: Mehran Sahami
Publication venue: AAAI Press

We present a framework for characterizing Bayesian classification methods. This framework can be thought of as a spectrum of allowable dependence in a given probabilistic model with the Naive Bayes algorithm at the most restrictive end and the learning of full Bayesian networks at the most general extreme. While much work has been carried out at the two ends of this spectrum, there has been surprisingly little done in the middle. We analyze the assumptions made as one moves along this spectrum and show the tradeoffs between model accuracy and learning speed which become critical to consider in a variety of data mining domains. We then present a general induction algorithm that allows for traversal of this spectrum depending on the available computational power for carrying out induction and show its application in a number of domains with different properties. Introduction Recently, work in Bayesian methods for classification has grown enormously (Cooper & Herskovits 1992) (Buntin..

CiteSeerX
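
To make the most restrictive end of the spectrum described above concrete, the following Python sketch implements a plain Naive Bayes classifier over categorical features, in which every feature is assumed independent of the others given the class. It does not implement the paper's general induction algorithm. The function names, the add-one smoothing, and the toy data are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Collect the counts Naive Bayes needs: class priors and per-class feature values."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values = defaultdict(set)           # feature index -> values seen in training
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            feat_counts[(y, i)][v] += 1
            values[i].add(v)
    return class_counts, feat_counts, values


def predict_nb(model, row):
    """Pick the class with the highest log posterior under feature independence."""
    class_counts, feat_counts, values = model
    total = sum(class_counts.values())

    def score(c):
        s = math.log(class_counts[c] / total)
        for i, v in enumerate(row):
            # Add-one smoothing (an assumption) avoids zero probabilities.
            s += math.log((feat_counts[(c, i)][v] + 1)
                          / (class_counts[c] + len(values[i])))
        return s

    return max(class_counts, key=score)


# Toy data: (outlook, temperature) -> play?
ROWS = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
LABELS = ["no", "no", "yes", "yes"]
model = train_nb(ROWS, LABELS)
print(predict_nb(model, ("sunny", "hot")))
```

Moving away from this end of the spectrum would mean letting each feature depend on other features as well as the class, at the cost of estimating larger conditional probability tables.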