Understanding Concept Identification as Consistent Data Clustering Across Multiple Feature Spaces
Identifying meaningful concepts in large data sets can provide valuable
insights into engineering design problems. Concept identification aims at
identifying non-overlapping groups of design instances that are similar in a
joint space of all features, but which are also similar when considering only
subsets of features. These subsets usually comprise features that characterize
a design with respect to one specific context, for example, constructive design
parameters, performance values, or operation modes. It is desirable to evaluate
the quality of design concepts by considering several of these feature subsets
in isolation. In particular, meaningful concepts should not only identify
dense, well separated groups of data instances, but also provide
non-overlapping groups of data that persist when considering pre-defined
feature subsets separately. In this work, we propose to view concept
identification as a special form of clustering algorithm with a broad range of
potential applications beyond engineering design. To illustrate the differences
between concept identification and classical clustering algorithms, we apply a
recently proposed concept identification algorithm to two synthetic data sets
and show the differences in identified solutions. In addition, we introduce the
mutual information measure as a metric to evaluate whether solutions return
consistent clusters across relevant subsets. To support the novel understanding
of concept identification, we consider a simulated data set from a
decision-making problem in the energy management domain and show that the
identified clusters are more interpretable with respect to relevant feature
subsets than clusters found by common clustering algorithms and are thus more
suitable to support a decision maker.
Comment: 10 pages, 6 figures, to be published in proceedings of the 2022 IEEE
International Conference on Data Mining Workshops (ICDMW)
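The consistency criterion above can be made concrete with a small sketch: cluster the same design instances on two different feature subsets, then compute the mutual information between the two label assignments; high values mean the groupings persist across subsets. The function below is an illustrative stand-alone implementation, not the authors' code.

```python
import math
from collections import Counter

def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two cluster label assignments.

    High values mean that knowing a point's cluster under one feature
    subset largely determines its cluster under the other, i.e. the
    clusterings are consistent across the two subsets."""
    n = len(labels_a)
    pa = Counter(labels_a)
    pb = Counter(labels_b)
    pab = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (x, y), n_xy in pab.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) )
        mi += (n_xy / n) * math.log(n_xy * n / (pa[x] * pb[y]))
    return mi
```

Identical labelings score the entropy of the partition; independent labelings score zero, so the value can be normalized by the labelings' entropies to compare concept-identification results against ordinary clusterings.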
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.
Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
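The covering-hypersphere idea behind entropy-scaling search can be illustrated with a toy two-level range search: greedily cover the dataset with radius-r balls, then answer a query by scanning only the balls whose centers lie within eps + r of the query (triangle inequality). This is a minimal sketch of the general principle only, not the Ammolite, MICA, or esFragBag implementations.

```python
import numpy as np

def build_cover(points, r):
    """Greedy metric cover: assign each point to the first center within r,
    creating a new center when none is close enough. The number of centers
    approximates the metric entropy of the dataset at scale r."""
    centers, members = [], []
    for i, p in enumerate(points):
        for c, mem in zip(centers, members):
            if np.linalg.norm(p - points[c]) <= r:
                mem.append(i)
                break
        else:
            centers.append(i)
            members.append([i])
    return centers, members

def range_search(points, centers, members, q, eps, r):
    """All indices within eps of query q. A ball centered at c with radius r
    can contain a hit only if dist(q, c) <= eps + r (triangle inequality),
    so every other ball is pruned without touching its members."""
    hits = []
    for c, mem in zip(centers, members):
        if np.linalg.norm(q - points[c]) <= eps + r:
            hits.extend(i for i in mem if np.linalg.norm(q - points[i]) <= eps)
    return hits
```

When the fractal dimension is low, few balls survive the pruning test, which is the source of the claimed time scaling with metric entropy.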
Bayesian Sparse Factor Analysis of Genetic Covariance Matrices
Quantitative genetic studies that model complex, multivariate phenotypes are
important for both evolutionary prediction and artificial selection. For
example, changes in gene expression can provide insight into developmental and
physiological mechanisms that link genotype and phenotype. However, classical
analytical techniques are poorly suited to quantitative genetic studies of gene
expression where the number of traits assayed per individual can reach many
thousand. Here, we derive a Bayesian genetic sparse factor model for estimating
the genetic covariance matrix (G-matrix) of high-dimensional traits, such as
gene expression, in a mixed effects model. The key idea of our model is that we
need only consider G-matrices that are biologically plausible. An organism's
entire phenotype is the result of processes that are modular and have limited
complexity. This implies that the G-matrix will be highly structured. In
particular, we assume that a limited number of intermediate traits (or factors,
e.g., variations in development or physiology) control the variation in the
high-dimensional phenotype, and that each of these intermediate traits is
sparse -- affecting only a few observed traits. The advantages of this approach
are two-fold. First, sparse factors are interpretable and provide biological
insight into mechanisms underlying the genetic architecture. Second, enforcing
sparsity helps prevent sampling errors from swamping out the true signal in
high-dimensional data. We demonstrate the advantages of our model on simulated
data and in an analysis of a published Drosophila melanogaster gene expression
data set.
Comment: 35 pages, 7 figures
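The structural assumption, a few sparse latent factors generating the G-matrix, can be sketched numerically: a loading matrix whose columns each touch only a handful of traits implies a low-rank, block-structured covariance. Trait counts, factor supports, and loading values below are invented for illustration; the actual model places Bayesian priors over these quantities rather than fixing them.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 12, 2                          # 12 observed traits, 2 latent factors

# Sparse loading matrix: each factor affects only a few observed traits.
Lambda = np.zeros((p, k))
Lambda[:4, 0] = rng.normal(size=4)    # factor 1 loads on traits 0-3 only
Lambda[6:9, 1] = rng.normal(size=3)   # factor 2 loads on traits 6-8 only

# Genetic covariance implied by the factor structure: rank at most k,
# with zero covariance between traits that share no factor.
G = Lambda @ Lambda.T
```

Because each factor's support is interpretable as a module (a developmental or physiological pathway), the nonzero pattern of Lambda carries the biological insight the abstract refers to.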
A Short Survey on Data Clustering Algorithms
With rapidly increasing data, clustering algorithms are important tools for
data analytics in modern research. They have been successfully applied to a
wide range of domains; for instance, bioinformatics, speech recognition, and
financial analysis. Formally speaking, given a set of data instances, a
clustering algorithm is expected to divide the set of data instances into the
subsets which maximize the intra-subset similarity and inter-subset
dissimilarity, where a similarity measure is defined beforehand. In this work,
the state-of-the-art clustering algorithms are reviewed from design concept to
methodology: different clustering paradigms are discussed, followed by advanced
clustering algorithms. After that, the existing clustering evaluation metrics
are reviewed. A summary with future insights is provided at the end.
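The formal objective stated in this abstract, maximizing intra-subset similarity and inter-subset dissimilarity under a predefined similarity measure, is what Lloyd's k-means heuristic locally optimizes when the measure is squared Euclidean distance. A minimal self-contained sketch (not taken from the survey itself):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate assigning points to the nearest center
    (maximizing intra-cluster similarity) and recomputing each center as
    its cluster mean, until the assignment stabilizes."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Pairwise distances from every point to every center.
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

Most paradigms the survey covers (density-based, hierarchical, spectral) replace either the similarity measure or the alternating update while keeping this same objective in view.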
An agent-driven semantical identifier using radial basis neural networks and reinforcement learning
Due to the huge availability of documents in digital form, and to the
possibilities for deception inherent in the nature of digital documents and the
way they are spread, the authorship attribution problem has steadily grown in
relevance. Nowadays, authorship attribution, for both information retrieval and
analysis, has gained great importance in the context of security, trust and
copyright preservation. This work proposes an innovative multi-agent driven
machine learning technique that has been developed for authorship attribution.
By means of a preprocessing for word-grouping and time-period related analysis
of the common lexicon, we determine a bias reference level for the recurrence
frequency of the words within analysed texts, and then train a Radial Basis
Probabilistic Neural Network (RBPNN)-based classifier to identify the correct
author. The
main advantage of the proposed approach lies in the generality of the semantic
analysis, which can be applied to different contexts and lexical domains,
without requiring any modification. Moreover, the proposed system is able to
incorporate an external input, meant to tune the classifier, and then
self-adjust by means of continuous reinforcement learning.
Comment: Published on: Proceedings of the XV Workshop "Dagli Oggetti agli
Agenti" (WOA 2014), Catania, Italy, September 25-26, 2014
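The word-frequency "bias reference level" can be illustrated as a deviation vector: for each vocabulary word, record how far a text's relative frequency sits from a corpus-wide baseline, and feed the resulting vectors to the classifier. The function, vocabulary, and baseline values below are hypothetical stand-ins for the paper's preprocessing, not its actual implementation.

```python
from collections import Counter

def frequency_profile(text, baseline, vocab):
    """Deviation of each vocabulary word's relative frequency in `text`
    from its corpus-wide baseline frequency. Each returned vector is the
    kind of fixed-length feature a radial-basis classifier could consume."""
    words = text.lower().split()
    n = len(words)
    counts = Counter(words)
    return [counts[w] / n - baseline.get(w, 0.0) for w in vocab]
```

Because the profile is relative to a shared baseline rather than raw counts, it transfers across contexts and lexical domains, which is the generality the abstract claims for the semantic analysis.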
Towards outlier detection for high-dimensional data streams using projected outlier analysis strategy
Outlier detection is an important research problem in data mining that aims to discover useful abnormal and irregular patterns hidden in large data sets. Most existing outlier detection methods only deal with static data of relatively low dimensionality.
Recently, outlier detection for high-dimensional stream data has emerged as a new research problem. A key observation that motivates this research is that outliers
in high-dimensional data are projected outliers, i.e., they are embedded in lower-dimensional subspaces. Detecting projected outliers from high-dimensional stream
data is a very challenging task for several reasons. First, detecting projected outliers is difficult even for high-dimensional static data: the exhaustive search for the outlying subspaces where projected outliers are embedded is an NP-hard problem. Second, the algorithms for handling data streams are constrained to take only one pass over the streaming data, under conditions of space limitation and time criticality. The currently existing methods for outlier detection are found to be ineffective for detecting projected outliers in high-dimensional data streams.
In this thesis, we present a new technique, called the Stream Project Outlier deTector (SPOT), which attempts to detect projected outliers in high-dimensional
data streams. SPOT employs an innovative window-based time model in capturing dynamic statistics from stream data, and a novel data structure containing a set of
top sparse subspaces to detect projected outliers effectively. SPOT also employs a multi-objective genetic algorithm as an effective search method for finding the
outlying subspaces where most projected outliers are embedded. The experimental results demonstrate that SPOT is efficient and effective in detecting projected outliers
for high-dimensional data streams. The main contribution of this thesis is that it provides a backbone for tackling the challenging problem of outlier detection for high-dimensional data streams. SPOT can facilitate the discovery of useful abnormal patterns and can potentially be applied to a variety of high-demand applications, such as sensor network data monitoring, online transaction protection, etc.
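The core mechanism, window-based statistics over candidate subspaces, can be sketched as follows. This is an illustrative toy, not the SPOT algorithm itself: SPOT's richer time model, top-sparse-subspace structure, and multi-objective genetic subspace search are replaced here by a fixed subspace list, grid-cell density counts, and a simple sliding window.

```python
import numpy as np
from collections import deque, defaultdict

class WindowedSubspaceDetector:
    """Flag a point as a projected outlier if, in any monitored
    low-dimensional subspace, its grid cell is sparsely populated
    within the current sliding window."""

    def __init__(self, subspaces, window=100, cell=1.0, min_count=3):
        self.subspaces = subspaces        # e.g. [(0, 1), (2,)]: dims per subspace
        self.window, self.cell, self.min_count = window, cell, min_count
        self.buffer = deque()             # per-point cell keys, oldest first
        self.counts = defaultdict(int)    # (subspace, cell) -> points in window

    def _key(self, si, x):
        dims = self.subspaces[si]
        return (si,) + tuple(int(np.floor(x[d] / self.cell)) for d in dims)

    def update(self, x):
        """Process one streaming point in a single pass; return outlier flag."""
        keys = [self._key(si, x) for si in range(len(self.subspaces))]
        outlier = any(self.counts[k] < self.min_count for k in keys)
        for k in keys:
            self.counts[k] += 1
        self.buffer.append(keys)
        if len(self.buffer) > self.window:  # expire the oldest point
            for k in self.buffer.popleft():
                self.counts[k] -= 1
        return outlier
```

Each point is touched once and only window-local counts are stored, matching the one-pass, bounded-space constraints the thesis describes for stream processing.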