8 research outputs found
A Semantic Similarity Measure for Expressive Description Logics
A totally semantic measure is presented which is able to calculate a
similarity value between concept descriptions and also between concept
description and individual or between individuals expressed in an expressive
description logic. It is applicable on symbolic descriptions although it uses a
numeric approach for the calculus. Considering that Description Logics stand as
the theoretic framework for the ontological knowledge representation and
reasoning, the proposed measure can be effectively used for agglomerative and
divisional clustering task applied to the semantic web domain.Comment: 13 pages, Appeared at CILC 2005, Convegno Italiano di Logica
Computazionale also available at
http://www.disp.uniroma2.it/CILC2005/downloads/papers/15.dAmato_CILC05.pd
ACORA: Distribution-Based Aggregation for Relational Learning from Identifier Attributes
Feature construction through aggregation plays an essential role in modeling relational
domains with one-to-many relationships between tables. One-to-many relationships
lead to bags (multisets) of related entities, from which predictive information
must be captured. This paper focuses on aggregation from categorical attributes
that can take many values (e.g., object identifiers). We present a novel aggregation
method as part of a relational learning system ACORA, that combines the use of
vector distance and meta-data about the class-conditional distributions of attribute
values. We provide a theoretical foundation for this approach deriving a "relational
fixed-effect" model within a Bayesian framework, and discuss the implications of
identifier aggregation on the expressive power of the induced model. One advantage
of using identifier attributes is the circumvention of limitations caused either by
missing/unobserved object properties or by independence assumptions. Finally, we
show empirically that the novel aggregators can generalize in the presence of identi-
fier (and other high-dimensional) attributes, and also explore the limitations of the
applicability of the methods.Information Systems Working Papers Serie
Distribution-based aggregation for relational learning with identifier attributes
Identifier attributes—very high-dimensional categorical attributes such as particular
product ids or people’s names—rarely are incorporated in statistical modeling. However,
they can play an important role in relational modeling: it may be informative to have communicated
with a particular set of people or to have purchased a particular set of products. A
key limitation of existing relational modeling techniques is how they aggregate bags (multisets)
of values from related entities. The aggregations used by existing methods are simple
summaries of the distributions of features of related entities: e.g., MEAN, MODE, SUM,
or COUNT. This paper’s main contribution is the introduction of aggregation operators that
capture more information about the value distributions, by storing meta-data about value
distributions and referencing this meta-data when aggregating—for example by computing
class-conditional distributional distances. Such aggregations are particularly important for
aggregating values from high-dimensional categorical attributes, for which the simple aggregates
provide little information. In the first half of the paper we provide general guidelines
for designing aggregation operators, introduce the new aggregators in the context of the
relational learning system ACORA (Automated Construction of Relational Attributes), and
provide theoretical justification.We also conjecture special properties of identifier attributes,
e.g., they proxy for unobserved attributes and for information deeper in the relationship
network. In the second half of the paper we provide extensive empirical evidence that the
distribution-based aggregators indeed do facilitate modeling with high-dimensional categorical
attributes, and in support of the aforementioned conjectures.NYU, Stern School of Business, IOMS Department, Center for Digital Economy Researc
ACORA: Distribution-Based Aggregation for Relational Learning from Identifier Attributes
Feature construction through aggregation plays an essential role in modeling relational
domains with one-to-many relationships between tables. One-to-many relationships
lead to bags (multisets) of related entities, from which predictive information
must be captured. This paper focuses on aggregation from categorical attributes
that can take many values (e.g., object identifiers). We present a novel aggregation
method as part of a relational learning system ACORA, that combines the use of
vector distance and meta-data about the class-conditional distributions of attribute
values. We provide a theoretical foundation for this approach deriving a "relational
fixed-effect" model within a Bayesian framework, and discuss the implications of
identifier aggregation on the expressive power of the induced model. One advantage
of using identifier attributes is the circumvention of limitations caused either by
missing/unobserved object properties or by independence assumptions. Finally, we
show empirically that the novel aggregators can generalize in the presence of identi-
fier (and other high-dimensional) attributes, and also explore the limitations of the
applicability of the methods.Information Systems Working Papers Serie
A mathematical theory of making hard decisions: model selection and robustness of matrix factorization with binary constraints
One of the first and most fundamental tasks in machine learning is to group observations within a dataset. Given a notion of similarity, finding those instances which are outstandingly similar to each other has manifold applications. Recommender systems and topic analysis in text data are examples which are most intuitive to grasp. The interpretation of the groups, called clusters, is facilitated if the assignment of samples is definite. Especially in high-dimensional data, denoting a degree to which an observation belongs to a specified cluster requires a subsequent processing of the model to filter the most important information. We argue that a good summary of the data provides hard decisions on the following question: how many groups are there, and which observations belong to which clusters? In this work, we contribute to the theoretical and practical background of clustering tasks, addressing one or both aspects of this question. Our overview of state-of-the-art clustering approaches details the challenges of our ambition to provide hard decisions. Based on this overview, we develop new methodologies for two branches of clustering: the one concerns the derivation of nonconvex clusters, known as spectral clustering; the other addresses the identification of biclusters, a set of samples together with similarity defining features, via Boolean matrix factorization. One of the main challenges in both considered settings is the robustness to noise. Assuming that the issue of robustness is controllable by means of theoretical insights, we have a closer look at those aspects of established clustering methods which lack a theoretical foundation. In the scope of Boolean matrix factorization, we propose a versatile framework for the optimization of matrix factorizations subject to binary constraints. Especially Boolean factorizations have been computed by intuitive methods so far, implementing greedy heuristics which lack quality guarantees of obtained solutions. In contrast, we propose to build upon recent advances in nonconvex optimization theory. This enables us to provide convergence guarantees to local optima of a relaxed objective, requiring only approximately binary factor matrices. By means of this new optimization scheme PAL-Tiling, we propose two approaches to automatically determine the number of clusters. The one is based on information theory, employing the minimum description length principle, and the other is a novel statistical approach, controlling the false discovery rate. The flexibility of our framework PAL-Tiling enables the optimization of novel factorization schemes. In a different context, where every data point belongs to a pre-defined class, a characterization of the classes may be obtained by Boolean factorizations. However, there are cases where this traditional factorization scheme is not sufficient. Therefore, we propose the integration of another factor matrix, reflecting class-specific differences within a cluster. Our theoretical considerations are complemented by empirical evaluations, showing how our methods combine theoretical soundness with practical advantages