62 research outputs found
Automated Construction of Relational Attributes ACORA: A Progress Report
Data mining research has not only developed a large number of algorithms, but also
enhanced our knowledge and understanding of their applicability and performance.
However, the application of data mining technology in business environments is still not
very common, despite the fact that organizations have access to large amounts of data
and make decisions that could profit from data mining on a daily basis. One of the
reasons is the mismatch between data representation for data storage and data analysis.
Data are most commonly stored in multi-table relational databases whereas data mining
methods require that the data be represented as a simple feature vector. This work
presents a general framework for feature construction from multiple relational tables for
data mining applications. The second part describes our prototype implementation
ACORA (Automated Construction of Relational Attributes).
Information Systems Working Papers Series
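The table-to-feature-vector mismatch described above can be illustrated with a small sketch. The tables and column names below are hypothetical and illustrative only, not ACORA's actual implementation:

```python
import pandas as pd

# Hypothetical target table (customers) and a one-to-many related
# transaction table; names are illustrative, not from ACORA.
customers = pd.DataFrame({"cust_id": [1, 2], "churned": [0, 1]})
transactions = pd.DataFrame({
    "cust_id": [1, 1, 2, 2, 2],
    "amount":  [10.0, 30.0, 5.0, 5.0, 20.0],
})

# Aggregate each bag of related rows into simple summary features
# (COUNT, MEAN, SUM), producing one flat feature vector per customer.
feats = transactions.groupby("cust_id")["amount"].agg(
    n_tx="count", mean_amount="mean", total_amount="sum"
).reset_index()

flat = customers.merge(feats, on="cust_id", how="left")
```

The resulting `flat` table is the "simple feature vector" representation that standard data mining methods expect, one row per target entity.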
ACORA: Distribution-Based Aggregation for Relational Learning from Identifier Attributes
Feature construction through aggregation plays an essential role in modeling relational
domains with one-to-many relationships between tables. One-to-many relationships
lead to bags (multisets) of related entities, from which predictive information
must be captured. This paper focuses on aggregation from categorical attributes
that can take many values (e.g., object identifiers). We present a novel aggregation
method, as part of the relational learning system ACORA, that combines the use of
vector distances and meta-data about the class-conditional distributions of attribute
values. We provide a theoretical foundation for this approach by deriving a "relational
fixed-effect" model within a Bayesian framework, and discuss the implications of
identifier aggregation on the expressive power of the induced model. One advantage
of using identifier attributes is the circumvention of limitations caused either by
missing/unobserved object properties or by independence assumptions. Finally, we
show empirically that the novel aggregators can generalize in the presence of
identifier (and other high-dimensional) attributes, and also explore the limitations of
the applicability of the methods.
Aggregation-Based Feature Invention and Relational Concept Classes
Due to interest in social and economic networks, relational modeling is
attracting increasing attention. The field of relational data
mining/learning, which traditionally was dominated by logic-based
approaches, has recently been extended by adapting learning methods such
as naive Bayes, Bayesian networks and decision trees to relational tasks.
One aspect inherent to all methods of model induction from relational
data is the construction of features through the aggregation of sets.
The theoretical part of this work (1) presents an ontology of relational
concepts of increasing complexity, (2) derives classes of aggregation
operators that are needed to learn these concepts, and (3) classifies
relational domains based on relational schema characteristics such as
cardinality. We then present a new class of aggregation functions, ones
that are particularly well suited for relational classification and
class probability estimation. The empirical part of this paper
demonstrates, on a real domain, the effects of different aggregation
methods on system performance across different relational concepts. The
results suggest that more complex aggregation methods can significantly
increase generalization performance and that, in particular,
task-specific aggregation can simplify relational prediction tasks into
well-understood propositional learning problems.
Distribution-based aggregation for relational learning with identifier attributes
Identifier attributes—very high-dimensional categorical attributes such as particular
product ids or people’s names—are rarely incorporated into statistical modeling. However,
they can play an important role in relational modeling: it may be informative to have communicated
with a particular set of people or to have purchased a particular set of products. A
key limitation of existing relational modeling techniques is how they aggregate bags (multisets)
of values from related entities. The aggregations used by existing methods are simple
summaries of the distributions of features of related entities: e.g., MEAN, MODE, SUM,
or COUNT. This paper’s main contribution is the introduction of aggregation operators that
capture more information about the value distributions, by storing meta-data about value
distributions and referencing this meta-data when aggregating—for example by computing
class-conditional distributional distances. Such aggregations are particularly important for
aggregating values from high-dimensional categorical attributes, for which the simple aggregates
provide little information. In the first half of the paper we provide general guidelines
for designing aggregation operators, introduce the new aggregators in the context of the
relational learning system ACORA (Automated Construction of Relational Attributes), and
provide theoretical justification. We also conjecture special properties of identifier attributes,
e.g., they proxy for unobserved attributes and for information deeper in the relationship
network. In the second half of the paper we provide extensive empirical evidence that the
distribution-based aggregators indeed do facilitate modeling with high-dimensional categorical
attributes, and in support of the aforementioned conjectures.
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research
Aggregation-Based Feature Invention and Relational Concept Classes
Model induction from relational data requires aggregation of values of attributes of related entities. This paper makes three contributions to the study of relational learning. (1) It presents a hierarchy of relational concepts of increasing complexity, using relational schema characteristics such as cardinality, and derives classes of aggregation operators that are needed to learn these concepts. (2) Expanding one level of the hierarchy, it introduces new aggregation operators that model the distribution of the values to be aggregated and (for classification problems) the differences in these distributions by class. (3) It demonstrates empirically on a noisy business domain that more-complex aggregation methods can increase generalization performance. Constructing features using target-dependent aggregations can transform relational prediction tasks so that well-understood feature-vector-based modeling algorithms can be applied successfully.
Predicting citation rates for physics papers: Constructing features for an ordered probit model
Gehrke et al. introduce the citation prediction task in their paper "Overview of the KDD Cup 2003" (in this issue). The objective was to predict the change in the number of citations a paper will receive, not the absolute number of citations. There are obvious factors affecting the number of citations, including the quality and the topic of the paper, and the reputation of the authors. However, it is not clear which factors might influence the change in citations between quarters, rendering the construction of predictive features a challenging task. A high-quality and timely paper will be cited more often than a lower-quality paper, but that does not suggest the change in citation counts. The selection of training data was critical, as the evaluation would only be on papers that received more than 5 citations in the quarter following the submission of results. After considering several modeling approaches, we used a modified version of an ordered probit model. We describe each of these steps in turn.
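An ordered probit model maps a latent score into ordered outcome categories via cutpoints. A minimal sketch of the probability computation follows; the cutpoints, latent score, and three-category setup are hypothetical illustrations, not the authors' fitted model:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical cutpoints c1 < c2 partitioning the latent scale into
# three ordered categories (e.g., citation change: down / flat / up).
cutpoints = np.array([-0.5, 0.8])

def ordered_probit_probs(xb):
    """P(y = j) = Phi(c_j - xb) - Phi(c_{j-1} - xb),
    with boundary cutpoints c_0 = -inf and c_K = +inf."""
    edges = np.concatenate(([-np.inf], cutpoints, [np.inf]))
    return np.diff(norm.cdf(edges - xb))

probs = ordered_probit_probs(0.3)  # hypothetical latent score for one paper
```

Fitting chooses the coefficients behind the latent score and the cutpoints jointly by maximum likelihood; the sketch above only shows how a fitted model turns one latent score into category probabilities.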
Tree Induction vs. Logistic Regression: A Learning-Curve Analysis
Tree induction and logistic regression are two standard, off-the-shelf methods
for building models for classification. We present a large-scale experimental
comparison of logistic regression and tree induction, assessing classification accuracy
and the quality of rankings based on class-membership probabilities. We
use a learning-curve analysis to examine the relationship of these measures to
the size of the training set. The results of the study show several remarkable
things. (1) Contrary to prior observations, logistic regression does not generally
outperform tree induction. (2) More specifically, and not surprisingly, logistic
regression is better for smaller training sets and tree induction for larger data
sets. Importantly, this often holds for training sets drawn from the same domain
(i.e., the learning curves cross), so conclusions about induction-algorithm
superiority on a given domain must be based on an analysis of the learning
curves. (3) Contrary to conventional wisdom, tree induction is effective at producing
probability-based rankings, although apparently comparatively less so
for a given training-set size than at making classifications. Finally, (4) the domains
on which tree induction and logistic regression are ultimately preferable
can be characterized surprisingly well by a simple measure of signal-to-noise
ratio.
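The learning-curve methodology can be sketched with off-the-shelf components. The synthetic data, training-set sizes, and default hyperparameters below are assumptions for illustration, not the paper's experimental setup:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary task: a fixed held-out test set, growing training sets.
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1000, random_state=0)

sizes = [50, 200, 800, 2000]
curves = {"logistic": [], "tree": []}
for n in sizes:
    lr = LogisticRegression(max_iter=1000).fit(X_tr[:n], y_tr[:n])
    dt = DecisionTreeClassifier(random_state=0).fit(X_tr[:n], y_tr[:n])
    curves["logistic"].append(lr.score(X_te, y_te))
    curves["tree"].append(dt.score(X_te, y_te))

# Plotting each accuracy list against `sizes` yields the learning curves;
# where the curves cross shows which learner wins at which data size.
```

Evaluating both learners on the same fixed test set across increasing training sizes is what makes the curves comparable, and is the setup under which the crossing behavior described above can be observed.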
- …