Subgroup Discovery in Unstructured Data
Subgroup discovery is a descriptive and exploratory data mining technique to
identify subgroups in a population that exhibit interesting behavior with
respect to a variable of interest. Subgroup discovery has numerous applications
in knowledge discovery and hypothesis generation, yet it remains inapplicable
for unstructured, high-dimensional data such as images. This is because
subgroup discovery algorithms rely on defining descriptive rules based on
(attribute, value) pairs; in unstructured data, however, an attribute is not
well defined. Even in cases where the notion of an attribute intuitively exists in
the data, such as a pixel in an image, due to the high dimensionality of the
data, these attributes are not informative enough to be used in a rule. In this
paper, we introduce the subgroup-aware variational autoencoder, a novel
variational autoencoder that learns a representation of unstructured data which
leads to higher-quality subgroups. Our experimental results demonstrate
the effectiveness of the method at learning high-quality subgroups while
supporting the interpretability of the concepts.
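The (attribute, value) rules the abstract refers to can be made concrete with a small sketch. Below is a minimal, classical subgroup-discovery scorer using Weighted Relative Accuracy (WRAcc) on tabular data; it illustrates the setting the paper contrasts against, not its autoencoder-based method, and the records and attribute names are hypothetical.

```python
# Minimal sketch of classical subgroup discovery over tabular data,
# scoring (attribute, value) rules with Weighted Relative Accuracy.
# Records and attribute names below are hypothetical toy data.

def wracc(rows, target, rule):
    """WRAcc = coverage * (target rate in subgroup - overall target rate)."""
    covered = [r for r in rows if all(r[a] == v for a, v in rule.items())]
    if not covered:
        return 0.0
    coverage = len(covered) / len(rows)
    p_sub = sum(r[target] for r in covered) / len(covered)
    p_all = sum(r[target] for r in rows) / len(rows)
    return coverage * (p_sub - p_all)

rows = [
    {"smoker": 1, "age_group": "old",   "disease": 1},
    {"smoker": 1, "age_group": "old",   "disease": 1},
    {"smoker": 0, "age_group": "young", "disease": 0},
    {"smoker": 0, "age_group": "old",   "disease": 1},
    {"smoker": 1, "age_group": "young", "disease": 0},
    {"smoker": 0, "age_group": "young", "disease": 0},
]

# Rank single-attribute candidate rules by WRAcc.
candidates = [{"smoker": 1}, {"smoker": 0},
              {"age_group": "old"}, {"age_group": "young"}]
best = max(candidates, key=lambda r: wracc(rows, "disease", r))
print(best)  # {'age_group': 'old'}: the most interesting subgroup here
```

With images, no such informative (attribute, value) pairs exist per pixel, which is the gap the proposed representation learning addresses.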
Process performance measurement support: a critical analysis
Design development processes within engineering disciplines lack the necessary mechanisms for identifying the specific areas where improved design development performance may be obtained. In addition, they lack the means to consider and align the goals and respective performance levels of related development activities with an organisation's overall goals and performance levels. Current research in organisational performance behaviour, formalised through performance frameworks and methodologies, has attempted to identify and focus upon those critical factors which impinge upon a wealth-creation system while simultaneously attempting to remain representative of organisational functions, processes, people, decisions and goals. Effective process improvements remain conditional upon: the ability to measure the potential performance gains which may result from an improvement initiative; the ability to understand existing process dynamics and, in turn, the subsequent impact of a change to a system or process; and the ability to identify potential areas for improvement. The objective of this paper is to discuss some of the management techniques that are purported to support various process performance concerns and perspectives, and to present the major factors that remain unsupported in identifying, measuring and understanding design process performance.
Discounting in LTL
In recent years, there has been a growing need for, and interest in,
formalizing and reasoning about the quality of software and hardware systems. As opposed to
traditional verification, where one asks whether or not a system
satisfies a given specification, reasoning about quality addresses the
question of \emph{how well} the system satisfies the specification. One
direction in this effort is to refine the "eventually" operators of temporal
logic to {\em discounting operators}: the satisfaction value of a specification
is a value in $[0,1]$, where the longer it takes to fulfill eventuality
requirements, the smaller the satisfaction value is.
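A common way to give semantics to such a discounting operator (a sketch of the standard form in this line of work; exact definitions vary by paper) is:

```latex
\[
  [\![\pi \models F_\eta\, \varphi]\!]
    \;=\; \sup_{i \ge 0}\; \eta(i) \cdot [\![\pi^i \models \varphi]\!],
\]
```

where $\pi^i$ is the suffix of the computation $\pi$ from position $i$, and $\eta:\mathbb{N}\to[0,1]$ is a decreasing discounting function, e.g. $\eta(i)=\lambda^i$ for some $0<\lambda<1$: the later $\varphi$ is fulfilled, the smaller the satisfaction value.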
In this paper we introduce an augmentation by discounting of Linear Temporal
Logic (LTL), and study it, as well as its combination with propositional
quality operators. We show that one can augment LTL with an arbitrary set of
discounting functions, while preserving the decidability of the model-checking
problem. Further augmenting the logic with unary propositional quality
operators preserves decidability, whereas adding an average-operator makes some
problems undecidable. We also discuss the complexity of the problem, as well as
various extensions.
Defining and Evaluating Network Communities based on Ground-truth
Nodes in real-world networks organize into densely linked communities where
edges appear with high concentration among the members of the community.
Identifying such communities of nodes has proven to be a challenging task
mainly due to a plethora of definitions of a community, intractability of
algorithms, issues with evaluation and the lack of a reliable gold-standard
ground-truth.
In this paper we study a set of 230 large real-world social, collaboration
and information networks where nodes explicitly state their group memberships.
For example, in social networks nodes explicitly join various interest-based
social groups. We use such groups to define a reliable and robust notion of
ground-truth communities. We then propose a methodology which allows us to
compare and quantitatively evaluate how different structural definitions of
network communities correspond to ground-truth communities. We choose 13
commonly used structural definitions of network communities and examine their
sensitivity, robustness and performance in identifying the ground-truth. We
show that the 13 structural definitions are heavily correlated and naturally
group into four classes. We find that two of these definitions, Conductance and
Triad-participation-ratio, consistently give the best performance in
identifying ground-truth communities. We also investigate a task of detecting
communities given a single seed node. We extend the local spectral clustering
algorithm into a heuristic parameter-free community detection method that
easily scales to networks with more than a hundred million nodes. The proposed
method achieves a 30% relative improvement over current local clustering methods.
Comment: Proceedings of the 2012 IEEE International Conference on Data Mining (ICDM), 2012
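Conductance, one of the two structural definitions the study finds most effective, is simple to state: the fraction of edge endpoints of a candidate community that point outside it. A minimal pure-Python sketch on a hypothetical toy graph:

```python
# Sketch of conductance, one of the community-scoring functions the
# paper finds best at recovering ground truth: the cut leaving a set S,
# normalized by the smaller of the two edge volumes. Toy graph below
# is hypothetical, not from the paper's 230-network corpus.

def conductance(edges, S):
    S = set(S)
    cut = sum(1 for u, v in edges if (u in S) != (v in S))
    vol_S = sum((u in S) + (v in S) for u, v in edges)
    vol_rest = 2 * len(edges) - vol_S
    return cut / min(vol_S, vol_rest)

# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(conductance(edges, {0, 1, 2}))  # 1/7: one cut edge over volume 7
```

A good community has low conductance: most of its edge volume stays internal.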
Mining Heterogeneous Multivariate Time-Series for Learning Meaningful Patterns: Application to Home Health Telecare
In recent years, time-series mining has become a challenging issue for
researchers. An important application lies in monitoring, which requires
analyzing large sets of time-series in order to learn usual patterns. Any
deviation from this learned profile is then considered as an unexpected
situation. Moreover, complex applications may involve the temporal study of
several heterogeneous parameters. In this paper, we propose a method for mining
heterogeneous multivariate time-series for learning meaningful patterns. The
proposed approach allows for mixed time-series -- containing both pattern and
non-pattern data -- as well as for imprecise matches, outliers, and stretching
and global translation of pattern instances in time. We present early results
of our approach in the context of monitoring the health status of a person at
home. The purpose is to build a behavioral profile of a person by analyzing the
time variations of several quantitative or qualitative parameters recorded
through a set of sensors installed in the home.
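The tolerance to stretching and imprecise matches that the abstract describes is in the spirit of dynamic time warping (DTW). The sketch below is a generic DTW distance on hypothetical series, not necessarily the paper's exact matching algorithm:

```python
# Sketch: dynamic time warping (DTW) distance, a standard way to match
# time-series patterns under local stretching -- in the spirit of the
# flexible matching the abstract describes, not necessarily the paper's
# exact algorithm. The series below are hypothetical.

def dtw(a, b):
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j]: best cost of aligning a[:i] with b[:j].
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

pattern = [0, 1, 2, 1, 0]
stretched = [0, 0, 1, 1, 2, 2, 1, 0]   # same shape, locally stretched
print(dtw(pattern, stretched))          # 0.0: warping absorbs the stretch
```

Because the warping path may repeat indices, a pattern instance that unfolds more slowly still matches with zero (or small) cost, which is exactly the stretching invariance described above.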
The Bayesian Case Model: A Generative Approach for Case-Based Reasoning and Prototype Classification
We present the Bayesian Case Model (BCM), a general framework for Bayesian
case-based reasoning (CBR) and prototype classification and clustering. BCM
brings the intuitive power of CBR to a Bayesian generative framework. The BCM
learns prototypes, the "quintessential" observations that best represent
clusters in a dataset, by performing joint inference on cluster labels,
prototypes and important features. Simultaneously, BCM pursues sparsity by
learning subspaces, the sets of features that play important roles in the
characterization of the prototypes. The prototype and subspace representation
provides quantitative benefits in interpretability while preserving
classification accuracy. Human subject experiments verify statistically
significant improvements to participants' understanding when using explanations
produced by BCM, compared to those given by prior art.
Comment: Published in Neural Information Processing Systems (NIPS), 2014
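The prototype-and-subspace representation can be illustrated without the Bayesian machinery: assign a new observation to the prototype it agrees with best on that cluster's important features. Everything below (clusters, features, data) is a hypothetical sketch of the representation, not BCM's inference:

```python
# Sketch of prototype-plus-subspace classification in the spirit of
# BCM's learned representation (the Bayesian inference itself is
# omitted). Each cluster is a prototype observation plus a subspace of
# important features; a new point joins the prototype it matches best
# on that subspace. Clusters, features and data here are hypothetical.

clusters = {
    "citrus": ({"color": "orange", "shape": "round", "size": "small"},
               {"color", "shape"}),
    "banana": ({"color": "yellow", "shape": "long", "size": "small"},
               {"shape"}),
}

def score(x, prototype, subspace):
    # Agreements on the features this cluster's subspace deems important.
    return sum(x[f] == prototype[f] for f in subspace)

def classify(x):
    return max(clusters, key=lambda c: score(x, *clusters[c]))

print(classify({"color": "orange", "shape": "round", "size": "big"}))  # citrus
```

The interpretability benefit follows from the representation itself: an assignment is explained by pointing at a concrete prototype and the few features that mattered.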
Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale
Notions of community quality underlie network clustering. While studies
surrounding network clustering are increasingly common, a precise understanding
of the relationship between different cluster quality metrics is lacking. In
this paper, we examine the relationship between stand-alone cluster quality
metrics and information recovery metrics through a rigorous analysis of four
widely-used network clustering algorithms -- Louvain, Infomap, label
propagation, and smart local moving. We consider the stand-alone quality
metrics of modularity, conductance, and coverage, and we consider the
information recovery metrics of adjusted Rand score, normalized mutual
information, and a variant of normalized mutual information used in previous
work. Our study includes both synthetic graphs and empirical data sets of sizes
varying from 1,000 to 1,000,000 nodes.
We find significant differences among the results of the different cluster
quality metrics. For example, clustering algorithms can return a value of 0.4
out of 1 on modularity but score 0 out of 1 on information recovery. We find
conductance, though imperfect, to be the stand-alone quality metric that best
indicates performance on information recovery metrics. Our study shows that the
variant of normalized mutual information used in previous work cannot be
assumed to differ only slightly from traditional normalized mutual information.
Smart local moving is the best performing algorithm in our study, but
discrepancies between cluster evaluation metrics prevent us from declaring it
absolutely superior. Louvain performed better than Infomap in nearly all the
tests in our study, contradicting the results of previous work in which Infomap
was superior to Louvain. We find that although label propagation performs
poorly when clusters are less clearly defined, it scales efficiently and
accurately to large graphs with well-defined clusters.
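The gap the paper reports between stand-alone metrics and information-recovery metrics can be made concrete by computing one of each. The sketch below implements modularity and the adjusted Rand index (ARI) in pure Python on a hypothetical toy graph; it illustrates the two metric families, not the paper's experiments:

```python
# Sketch: one stand-alone quality metric (modularity) and one
# information-recovery metric (adjusted Rand index) side by side.
# The toy graph and labelings below are hypothetical.
from math import comb
from collections import Counter

def modularity(edges, labels):
    """Newman modularity: sum over communities of L_c/m - (d_c/(2m))^2."""
    m = len(edges)
    deg, internal = Counter(), Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
        if labels[u] == labels[v]:
            internal[labels[u]] += 1
    d = Counter()
    for node, k in deg.items():
        d[labels[node]] += k
    return sum(internal[c] / m - (d[c] / (2 * m)) ** 2 for c in d)

def ari(truth, pred):
    """Adjusted Rand index from the contingency table of two labelings."""
    n = len(truth)
    pairs = Counter(zip(truth, pred))
    a, b = Counter(truth), Counter(pred)
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    return (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)

# Two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
truth = [0, 0, 0, 1, 1, 1]        # ground-truth community per node
good = dict(enumerate(truth))     # a clustering that recovers it exactly
print(modularity(edges, good))    # ~0.357
print(ari(truth, [good[i] for i in range(6)]))  # 1.0
```

Modularity scores a clustering against the graph alone, while ARI scores it against the ground truth; as the study shows, a clustering can do well on the former and poorly on the latter.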