A hybrid supervised/unsupervised machine learning approach to solar flare prediction
We introduce a hybrid approach to solar flare prediction, whereby a
supervised regularization method is used to realize feature importance and an
unsupervised clustering method is used to realize the binary flare/no-flare
decision. The approach is validated against NOAA SWPC data.
Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs
Laplacian mixture models identify overlapping regions of influence in
unlabeled graph and network data in a scalable and computationally efficient
way, yielding useful low-dimensional representations. By combining Laplacian
eigenspace and finite mixture modeling methods, they provide probabilistic or
fuzzy dimensionality reductions or domain decompositions for a variety of input
data types, including mixture distributions, feature vectors, and graphs or
networks. Provable optimal recovery using the algorithm is analytically shown
for a nontrivial class of cluster graphs. Heuristic approximations for scalable
high-performance implementations are described and empirically tested.
Connections to PageRank and community detection in network analysis demonstrate
the wide applicability of this approach. The origins of fuzzy spectral methods,
beginning with generalized heat or diffusion equations in physics, are reviewed
and summarized. Comparisons to other dimensionality reduction and clustering
methods for challenging unsupervised machine learning problems are also
discussed. Comment: 13 figures, 35 references
Fuzzy adaptive resonance theory: Applications and extensions
Adaptive Resonance Theory, ART, is a powerful clustering tool for learning arbitrary patterns in a self-organizing manner. In this research, two papers are presented that examine the extensibility and applications of ART. The first paper examines a means to boost ART performance by assigning each cluster a vigilance value, instead of a single value for the whole ART module. A Particle Swarm Optimization technique is used to search for desirable vigilance values. In the second paper, it is shown how ART, and clustering in general, can be a useful tool in preprocessing time series data. Clustering quantization attempts to meaningfully group data for preprocessing purposes, and improves results over the absence of quantization with statistical significance. --Abstract, page iv
A cosine based validation measure for Document Clustering
Document Clustering is the application of cluster analysis methods to very large document databases. It aims to organize a large quantity of unlabelled documents into a smaller number of meaningful and coherent clusters whose members are similar in content. One of the main unsolved problems in the clustering literature is the lack of a reliable methodology for evaluating results, although a wide variety of validation measures have been proposed. These measures are often unsatisfactory on numerical databases, and they perform even worse in Document Clustering. This paper proposes a new validation measure. After introducing the most common approaches to Document Clustering, we focus on Spherical K-means, due to its strict connection with the Vector Space Model typical of Information Retrieval. Since Spherical K-means adopts a cosine-based similarity measure, we propose a validation measure based on the same criterion. The effectiveness of the new measure is shown in a comparative study involving 13 different corpora (commonly used in the literature for comparing different proposals) and 15 validation measures.
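The cosine criterion that Spherical K-means relies on can be illustrated with a short sketch. This is not the validation measure proposed in the paper, which the abstract does not define; the function names (`normalize`, `assign`, `mean_cohesion`) are hypothetical, and the cohesion score is only one simple cosine-based quantity that a measure of this kind might build on.

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine(u, v):
    """For unit-length vectors, cosine similarity reduces to the dot product."""
    return sum(a * b for a, b in zip(u, v))

def assign(docs, centroids):
    """Assign each document to the centroid with the highest cosine similarity."""
    unit_docs = [normalize(d) for d in docs]
    unit_centroids = [normalize(c) for c in centroids]
    return [
        max(range(len(unit_centroids)), key=lambda k: cosine(d, unit_centroids[k]))
        for d in unit_docs
    ]

def mean_cohesion(docs, centroids, labels):
    """Average cosine similarity between each document and its assigned centroid.

    Illustrative only: a hypothetical stand-in, not the paper's measure.
    """
    unit_docs = [normalize(d) for d in docs]
    unit_centroids = [normalize(c) for c in centroids]
    total = sum(cosine(d, unit_centroids[l]) for d, l in zip(unit_docs, labels))
    return total / len(unit_docs)

# Toy term-frequency vectors over a two-word vocabulary.
docs = [[1.0, 0.0], [2.0, 0.1], [0.0, 3.0]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
labels = assign(docs, centroids)
```

Because every vector is normalized first, document length does not affect the assignment; only the direction of the term vector matters, which is the property that ties Spherical K-means to the Vector Space Model.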
A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets
The term "outlier" can generally be defined as an observation that is significantly different from
the other values in a data set. The outliers may be instances of error or indicate events. The
task of outlier detection aims at identifying such outliers in order to improve the analysis of
data and further discover interesting and useful knowledge about unusual events within numerous
application domains. In this paper, we report on contemporary unsupervised outlier detection
techniques for multiple types of data sets and provide a comprehensive taxonomy framework and
two decision trees to select the most suitable technique based on the data set. Furthermore, we
highlight the advantages, disadvantages and performance issues of each class of outlier detection
techniques under this taxonomy framework.
An Ontology-Based Recommender System with an Application to the Star Trek Television Franchise
Collaborative filtering based recommender systems have proven to be extremely
successful in settings where user preference data on items is abundant.
However, collaborative filtering algorithms are hindered by their weakness
against the item cold-start problem and general lack of interpretability.
Ontology-based recommender systems exploit hierarchical organizations of users
and items to enhance browsing, recommendation, and profile construction. While
ontology-based approaches address the shortcomings of their collaborative
filtering counterparts, ontological organizations of items can be difficult to
obtain for items that mostly belong to the same category (e.g., television
series episodes). In this paper, we present an ontology-based recommender
system that integrates the knowledge represented in a large ontology of
literary themes to produce fiction content recommendations. The main novelty of
this work is an ontology-based method for computing similarities between items
and its integration with the classical Item-KNN (K-nearest neighbors)
algorithm. As a study case, we evaluated the proposed method against other
approaches by performing the classical rating prediction task on a collection
of Star Trek television series episodes in an item cold-start scenario. This
transverse evaluation provides insights into the utility of different
information resources and methods for the initial stages of recommender system
development. We found our proposed method to be a convenient alternative to
collaborative filtering approaches for collections of mostly similar items,
particularly when other content-based approaches are not applicable or
otherwise unavailable. Aside from the new methods, this paper contributes a
testbed for future research and an online framework to collaboratively extend
the ontology of literary themes to cover other narrative content. Comment: 25 pages, 6 figures, 5 tables, minor revision
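The Item-KNN rating prediction step mentioned in the abstract above can be sketched as follows. The abstract does not specify the ontology-based similarity, so this sketch swaps in a plain Jaccard overlap of theme sets as a hypothetical stand-in; the names `jaccard`, `predict_rating`, and the toy `themes` data are assumptions made for illustration.

```python
def jaccard(a, b):
    """Overlap of two theme sets: |a & b| / |a | b| (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def predict_rating(user_ratings, target, themes, k=2):
    """Predict a rating for `target` as the similarity-weighted mean of the
    user's ratings on the k most similar items they have already rated.

    Returns None when no rated item shares any theme with the target.
    """
    scored = [
        (jaccard(themes[target], themes[item]), rating)
        for item, rating in user_ratings.items()
        if item != target
    ]
    neighbours = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return None
    return sum(sim * rating for sim, rating in neighbours) / total

# Hypothetical episodes annotated with made-up theme labels.
themes = {
    "ep_a": {"first contact", "diplomacy"},
    "ep_b": {"first contact", "diplomacy", "time travel"},
    "ep_c": {"courtroom drama"},
    "ep_new": {"first contact", "diplomacy"},
}
user_ratings = {"ep_a": 5, "ep_b": 3, "ep_c": 1}
prediction = predict_rating(user_ratings, "ep_new", themes, k=2)
```

Since the similarity depends only on item annotations and not on other users' ratings, a prediction can be made for an item that nobody has rated yet, which is the item cold-start setting the abstract describes.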
Towards an Architecture for Efficient Distributed Search of Multimodal Information
The creation of very large-scale multimedia search engines, with more than one billion
images and videos, is a pressing need of digital societies where data is generated by multiple connected devices. Distributing search indexes in cloud environments is the inevitable solution to deal with the increasing scale of image and video collections. The distribution of such indexes in this setting raises multiple challenges, such as the even partitioning of the data space, load balancing across index nodes, and the fusion of the results computed over multiple nodes. The main question behind this thesis is how to reduce and distribute the computational complexity of multimedia retrieval.
This thesis studies the extension of sparse hash inverted indexing to distributed settings.
The main goal is to ensure that indexes are uniformly distributed across computing nodes while keeping similar documents on the same nodes. Load balancing is performed at both node and index level, to guarantee that the retrieval process is not delayed by nodes that have to inspect larger subsets of the index.
Multimodal search requires the combination of the search results from individual modalities and document features. This thesis studies rank fusion techniques focused on reducing complexity by automatically selecting only the features that improve retrieval effectiveness.
The achievements of this thesis span both distributed indexing and rank fusion research.
Experiments across multiple datasets show that sparse hashes can be used to distribute documents and queries across index entries in a balanced and redundant manner across nodes. Rank fusion results show that it is possible to reduce retrieval complexity and improve efficiency by searching only a subset of the feature indexes.
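The idea of fusing ranked lists from only a subset of feature indexes can be illustrated with a minimal sketch. The abstract does not name the fusion method used in the thesis, so reciprocal rank fusion is swapped in here as a well-known stand-in; the function name, the feature names, and the constant `k=60` are assumptions made for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, selected=None, k=60):
    """Fuse ranked lists with reciprocal rank fusion (RRF).

    rankings: dict mapping a feature name to a list of document ids, best first.
    selected: optional iterable of feature names to search; others are skipped.
    k: smoothing constant that damps the influence of top ranks.
    """
    features = list(rankings) if selected is None else list(selected)
    scores = defaultdict(float)
    for feature in features:
        for rank, doc in enumerate(rankings[feature], start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))

# Hypothetical per-feature result lists for one query.
rankings = {
    "color": ["a", "b", "c"],
    "texture": ["b", "c", "a"],
    "text": ["b", "a", "c"],
}
fused_all = reciprocal_rank_fusion(rankings)
fused_subset = reciprocal_rank_fusion(rankings, selected=["color"])
```

Passing a smaller `selected` list means fewer indexes are queried, which is where the reduction in retrieval cost would come from; how the thesis chooses that subset automatically is not described in the abstract and is not modeled here.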
SUBIC: A Supervised Bi-Clustering Approach for Precision Medicine
Traditional medicine typically applies one-size-fits-all treatment for the
entire patient population whereas precision medicine develops tailored
treatment schemes for different patient subgroups. The fact that some factors
may be more significant for a specific patient subgroup motivates clinicians
and medical researchers to develop new approaches to subgroup detection and
analysis, which is an effective strategy to personalize treatment. In this
study, we propose a novel patient subgroup detection method, called Supervised
Biclustering (SUBIC), using convex optimization and apply our approach to detect
patient subgroups and prioritize risk factors for hypertension (HTN) in a
vulnerable demographic subgroup (African-American). Our approach not only finds
patient subgroups with guidance of a clinically relevant target variable but
also identifies and prioritizes risk factors by pursuing sparsity of the input
variables and encouraging similarity among the input variables and between the
input and target variable.