
    Decision tree design from a communication theory standpoint

    A communication theory approach to decision tree design based on a top-down mutual information algorithm is presented. It is shown that this algorithm is equivalent to a form of Shannon-Fano prefix coding, and several fundamental bounds relating decision-tree parameters are derived. The bounds are used in conjunction with a rate-distortion interpretation of tree design to explain several phenomena previously observed in practical decision-tree design. A termination rule for the algorithm, called the delta-entropy rule, is proposed that improves its robustness in the presence of noise. Simulation results are presented, showing that the tree classifiers derived by the algorithm compare favourably to the single nearest neighbour classifier.
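The core of a top-down mutual information algorithm can be sketched as a greedy split search: at each node, choose the test whose outcome shares the most information with the class label. The following is a minimal illustration, not the paper's actual procedure; the feature representation (boolean tests keyed by name) is an assumption made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label multiset."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(samples, labels, features):
    """Greedily pick the boolean feature test that maximizes mutual
    information between test outcome and class label (equivalently,
    minimizes the expected entropy of the child nodes)."""
    base = entropy(labels)
    best = None
    for f in features:
        left = [y for x, y in zip(samples, labels) if x[f]]
        right = [y for x, y in zip(samples, labels) if not x[f]]
        if not left or not right:
            continue  # degenerate split carries no information
        n = len(labels)
        cond = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        gain = base - cond  # mutual information I(test; class)
        if best is None or gain > best[1]:
            best = (f, gain)
    return best
```

Recursing on each child until a stopping criterion fires (the paper's delta-entropy rule would terminate when the entropy reduction falls below a threshold) yields the full tree.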

    Query Learning with Exponential Query Costs

    In query learning, the goal is to identify an unknown object while minimizing the number of "yes" or "no" questions (queries) posed about that object. A well-studied algorithm for query learning is known as generalized binary search (GBS). We show that GBS is a greedy algorithm to optimize the expected number of queries needed to identify the unknown object. We also generalize GBS in two ways. First, we consider the case where the cost of querying grows exponentially in the number of queries and the goal is to minimize the expected exponential cost. Then, we consider the case where the objects are partitioned into groups, and the objective is to identify only the group to which the object belongs. We derive algorithms to address these issues in a common, information-theoretic framework. In particular, we present an exact formula for the objective function in each case involving Shannon or Rényi entropy, and develop a greedy algorithm for minimizing it. Our algorithms are demonstrated on two applications of query learning, active learning and emergency response.
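The greedy step in generalized binary search can be sketched as follows: among the available queries, pick the one whose yes/no answer splits the remaining probability mass as evenly as possible. This is an illustrative sketch of the standard GBS selection rule, not the paper's generalized (exponential-cost or group-identification) variants.

```python
def gbs_query(hypotheses, queries):
    """Greedy GBS step: choose the query whose answer most evenly
    bisects the prior probability mass over the remaining hypotheses.

    hypotheses: dict mapping candidate object -> prior probability
    queries: list of predicates, each mapping object -> bool
    """
    total = sum(hypotheses.values())

    def imbalance(q):
        yes = sum(p for h, p in hypotheses.items() if q(h))
        return abs(2 * yes - total)  # 0 means a perfect halving

    return min(queries, key=imbalance)
```

Repeatedly asking the selected query, discarding inconsistent hypotheses, and reselecting drives the expected number of queries toward the entropy lower bound.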

    Speeding up rendering of hybrid surface and volume models

    Hybrid rendering of volume and polygonal models is an interesting feature of visualization systems, since it helps users to better understand the relationships between internal structures of the volume and fitted surfaces as well as external surfaces. Most of the existing bibliography focuses on the problem of correctly integrating both types of information in depth. The rendering method proposed in this paper is built on these previous results. It is aimed at solving a different problem: how to efficiently access selected information of a hybrid model. We propose to construct a decision tree (the Rendering Decision Tree), which together with an auxiliary run-length representation of the model avoids visiting unselected surfaces and internal regions during a traversal of the model.
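The benefit of the auxiliary run-length representation can be illustrated with a small sketch: consecutive unselected cells collapse into a single run that the traversal skips in one step instead of cell by cell. The data layout below (length, selected-flag, payload triples) is an assumption made for the example, not the paper's actual structure.

```python
def traverse_selected(runs, visit):
    """Walk a run-length encoded model, visiting only selected runs.

    runs: list of (length, selected, payload) triples covering the model
    visit: callback receiving (offset, length, payload) for selected runs

    Unselected runs advance the offset in O(1) rather than being
    visited element by element.
    """
    offset = 0
    for length, selected, payload in runs:
        if selected:
            visit(offset, length, payload)
        offset += length
```

In the paper's setting, the Rendering Decision Tree would play the role of deciding which regions are selected before such a skip-traversal is performed.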

    Search Through Systematic Set Enumeration

    In many problem domains, solutions take the form of unordered sets. We present the Set-Enumeration (SE) tree, a vehicle for representing sets and/or enumerating them in a best-first fashion. We demonstrate its usefulness as the basis for a unifying search-based framework for domains where minimal (maximal) elements of a power set are targeted, where minimal (maximal) partial instantiations of a set of variables are sought, or where a composite decision is not dependent on the order in which its primitive component-decisions are taken. Particular instantiations of SE-tree-based algorithms for some AI problem domains are used to demonstrate the general features of the approach. These algorithms are compared theoretically and empirically with current algorithms.
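The key property of an SE-tree is that each node extends its set only with items that come after the node's last element in a fixed total order, so every subset appears exactly once. A minimal depth-first sketch of that enumeration scheme (best-first variants would order the expansion queue by a heuristic instead):

```python
def se_tree(items):
    """Enumerate all subsets of an ordered item list via the SE-tree.

    Each node is extended only with items of strictly larger index than
    its last element, so the tree contains every subset exactly once and
    no duplicate work is done."""
    def expand(node, start):
        yield node
        for i in range(start, len(items)):
            yield from expand(node + [items[i]], i + 1)
    yield from expand([], 0)
```

The same skeleton underlies many set-oriented searches (e.g. frequent-itemset mining), where pruning a node cuts off its entire subtree of supersets.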

    On the Qualitative Behavior of Impurity-Based Splitting Rules I: The Minima-Free Property

    We show that all strictly concave (convex-∩) impurity measures lead to splits at boundary points, and furthermore show that certain rational splitting rules, notably the information gain ratio, also have this property. A slightly weaker result is shown to hold for impurity measures that are only concave, such as Inaccuracy.
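The distinction the abstract draws can be illustrated numerically: the Gini index is strictly concave in the class proportion, while the misclassification rate (Inaccuracy) is only concave, being piecewise linear. This is a small sketch of that property, not the paper's boundary-point argument itself.

```python
def gini(p):
    """Gini impurity for binary class proportion p: strictly concave."""
    return 2 * p * (1 - p)

def misclassification(p):
    """Inaccuracy / misclassification rate: concave but not strictly,
    since it is linear on each half of [0, 1]."""
    return min(p, 1 - p)

# Strict concavity: the midpoint value strictly exceeds the chord.
strict = gini(0.3) > (gini(0.2) + gini(0.4)) / 2
# Mere concavity: on a linear segment, midpoint and chord coincide.
flat = misclassification(0.3) == 0.3
```

It is this strict curvature that forces impurity-minimizing splits onto boundary points between class regions.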

    Projecting land use changes using parcel-level data: model development and application to Hunterdon County, New Jersey

    Get PDF
    This dissertation develops a parcel-based spatial land use change prediction model by coupling machine learning and interpretation algorithms, namely cellular automata (CA) and decision trees (DT). A CA is a collection of cells that evolves through a number of discrete time steps according to a set of transition rules based on the state of each cell and the characteristics of its neighboring cells. A DT is a data mining and machine learning tool that extracts the patterns of a decision process from observed cell behaviors and the factors affecting them. In this dissertation, CA is used to predict the future land use status of cadastral parcels based on a set of transition rules derived, using DT, from a set of identified land use change driving factors. Although CA and DT have been applied separately in various land use change models in the literature, no previous studies have attempted to integrate them. The DT-based CA model developed in this dissertation represents the first such integration in land use change modeling. The coupled model can handle a large set of driving factors and also avoids subjective bias when deriving the transition rules. It uses the cadastral parcel as the unit of analysis, which has practical policy implications because the response of land use change to policy usually takes place at the parcel level. Since parcels vary in size and shape, their use as a unit of analysis makes it difficult to apply CA, which was originally designed to handle regular grid cells. This dissertation improves the treatment of irregular cells in CA-based land use change models by defining a cell's neighborhood as a fixed-distance buffer along the parcel boundary. The DT-based CA model was developed and validated in Hunterdon County, New Jersey.
The data on historical land uses and various land use change driving factors for Hunterdon County were collected and processed using a Geographic Information System (GIS). Specifically, the county land uses in 1986, 1995, and 2002 were overlaid with a parcel map to create parcel-based land use maps. The single land use assigned to each parcel is based on a classification scheme developed through literature review and empirical testing in the study area. The possible land use status considered for each parcel is agriculture, barren land, forest, urban, water, or wetlands, following the land use/land cover classification of the New Jersey Department of Environmental Protection. The identified driving factors for the future status of a parcel include the present land use type, the number of soil restrictions to urban development, the size of the parcel, the amount of wetlands within the parcel, the distribution of land uses in the neighborhood of the parcel, and the distances to the nearest streams, urban centers, and major roads. A set of transition rules describing the land use change processes during the period 1986-1995 was developed using the DT software J48 classifier. The derived transition rules were applied to the 1995 land use data in the CA model Agent Analyst/RePast (Recursive Porous Agent Simulation Toolkit) to predict the spatial land use pattern in 2004, which was then validated against the actual land use map of 2002. The DT-based CA model had an overall accuracy of 84.46 percent in terms of the number of parcels and 80.92 percent in terms of the total acreage in predicting land use changes. The model shows much higher capacity in predicting quantitative changes than locational changes in land use. The validated model was applied to simulate the 2011 land use patterns in Hunterdon County based on its actual land uses in 2002 under both business-as-usual and policy scenarios.
The simulation results show that successfully implementing current land use policies such as down zoning, open space, and farmland preservation would prevent a total of 7,053 acres (741 acres of wetlands, 3,034 acres of agricultural lands, 250 acres of barren land, and 3,028 acres of forest) from future urban development in Hunterdon County during the period 2002-2011. The neighborhood of a parcel was defined by a 475-foot buffer along the parcel boundary in the study. The results of sensitivity analyses using two additional neighborhoods (237- and 712-foot buffers) indicate that the neighborhood size has an insignificant impact on the model outputs in this application.
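The coupling the dissertation describes can be sketched in miniature: a synchronous CA update where each parcel's next land use is produced by a transition rule applied to its current state and the land-use mix in its neighborhood buffer. The toy rule below is a hypothetical stand-in for the J48-derived rules, and the graph-style neighborhood is an assumption made for the example.

```python
def ca_step(parcels, neighbors, rule):
    """One synchronous CA update over irregular parcels.

    parcels: dict parcel_id -> current land use label
    neighbors: dict parcel_id -> list of parcel ids inside its buffer
    rule: function (state, neighborhood_mix) -> next state,
          standing in for the decision-tree-derived transition rules
    """
    nxt = {}
    for p, state in parcels.items():
        mix = {}
        for q in neighbors[p]:
            mix[parcels[q]] = mix.get(parcels[q], 0) + 1
        nxt[p] = rule(state, mix)
    return nxt

# Hypothetical toy rule: agriculture urbanizes when a majority of the
# parcels in its neighborhood buffer are already urban.
def toy_rule(state, mix):
    if state == "agriculture" and mix.get("urban", 0) > sum(mix.values()) / 2:
        return "urban"
    return state
```

Iterating `ca_step` over successive time steps, with rules learned from one historical period, is the essence of projecting land use into a later period.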