496 research outputs found
Analysis of FMRI Exams Through Unsupervised Learning and Evaluation Index
In the last few years, time series clustering has grown significantly and has proven effective in
providing useful information in various application domains. This growing interest is the
result of the scientific community's efforts in temporal data mining.
For these reasons, the first phase of the thesis focused on the study of the data obtained from fMRI exams
carried out in task-based and resting-state modes, using and comparing different clustering algorithms: the Self-Organizing Map (SOM), Growing Neural Gas (GNG), and Neural Gas (NG), which are crisp-type
algorithms, and a fuzzy algorithm, Fuzzy C-Means (FCM). The evaluation of the results
obtained by using clustering algorithms was carried out using the Davies Bouldin evaluation index (DBI or
DB index).
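As a rough illustration of the Davies Bouldin index mentioned above, the following minimal Python sketch computes it for a crisp partition. It assumes Euclidean distance and uses the mean distance of cluster members to their centroid as the scatter measure (the common textbook form); the thesis may use a different distance for fMRI time series, and the function names are hypothetical. Lower values indicate more compact, better separated clusters.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def davies_bouldin(clusters):
    """Davies Bouldin index of a crisp partition.

    `clusters` is a list of at least two clusters; each cluster is a
    non-empty list of vectors (for example, voxel time series).
    Assumes all centroids are distinct, otherwise a division by zero occurs.
    """
    centers = [centroid(c) for c in clusters]
    # Scatter of each cluster: mean Euclidean distance to its centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity between cluster i and any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Two toy partitions: the same cluster shapes, far apart versus close together.
well_separated = [[[0, 0], [0, 2]], [[10, 0], [10, 2]]]
poorly_separated = [[[0, 0], [0, 2]], [[4, 0], [4, 2]]]
```

Calling `davies_bouldin` on the two toy partitions gives a smaller value for `well_separated` than for `poorly_separated`, which is the behaviour the index is meant to capture.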
Clustering evaluation is the second topic of this thesis. Specific techniques exist for evaluating the validity of a
clustering, but none of them is yet well established for the study of fMRI exams. Furthermore,
the assessment of the evaluation techniques themselves is still an open research field. Eight clustering validation indexes
(CVIs) applied to fMRI data clustering will be analysed. The validation indices that have been used are
Pakhira Bandyopadhyay Maulik Index (crisp and fuzzy), Fukuyama Sugeno Index, Rezaee Lelieveldt Reiber
Index, Wang Sun Jiang Index, Xie Beni Index, Davies Bouldin Index, Soft Davies Bouldin Index. Furthermore,
an evaluation of the evaluation indices will be carried out, which will take into account the sub-optimal
performance obtained by the indices, through the introduction of new metrics. Finally, a new methodology
for the evaluation of CVIs will be introduced, which will use an ANFIS model.
Customer Segmentation with Subscription-based Online Media Customers
In the modern era, using personalization when reaching out to potential or current customers is essential for businesses to compete in their area of business. With large customer bases, this personalization becomes more difficult, thus segmenting entire customer bases into smaller groups helps businesses focus better on personalization and targeted business decisions. These groups can be straightforward, like segmenting solely based on age, or more complex, like taking into account geographic, demographic, behavioral, and psychographic differences among the customers. In the latter case, customer segmentation should be performed with Machine Learning, which can help find more hidden patterns within the data.
Often, the number of features in the customer data set is so large that some form of dimensionality reduction is needed. That is also the case in this thesis, where the data include 12802 unique article tags to be used in the segmentation. A form of dimensionality reduction called feature hashing is selected for the tags because it can accommodate new tags introduced in the future.
Using hashed features in customer segmentation is a balancing act. With more hashed features, the evaluation metrics might give better results and the hashed features resemble the unhashed article tag data more closely, but with fewer hashed features the clustering process is faster and more memory-efficient, and the resulting clusters are more interpretable to the business. Three clustering algorithms, K-means, DBSCAN, and BIRCH, are each tested with eight feature hashing bin sizes, with promising results for K-means and BIRCH.
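To make the idea of feature hashing concrete, here is a minimal Python sketch, not the thesis implementation. It maps an arbitrary set of article tags to a fixed-length count vector by hashing each tag to a bin index. The function name `hash_tags`, the choice of MD5 as a stable hash, and the example tags are all assumptions made for illustration. Library implementations often also apply a sign to each entry to reduce collision bias; that refinement is omitted here.

```python
import hashlib


def hash_tags(tags, n_bins):
    """Map a list of tag strings to a count vector of length `n_bins`.

    A stable hash (MD5) is used instead of Python's built-in `hash`,
    which is salted per process and would give different bins on each run.
    """
    vector = [0] * n_bins
    for tag in tags:
        digest = hashlib.md5(tag.encode("utf-8")).digest()
        # Interpret the first 8 bytes as an integer and fold it into a bin.
        index = int.from_bytes(digest[:8], "big") % n_bins
        vector[index] += 1
    return vector


# Hypothetical customers described by the tags of the articles they read.
customer_a = hash_tags(["sports", "politics", "sports"], n_bins=8)
# A tag never seen before still fits into the same 8 bins.
customer_b = hash_tags(["sports", "a-brand-new-tag"], n_bins=8)
```

The vector length depends only on `n_bins`, never on how many distinct tags exist, which is why new tags can be added later without refitting. The trade-off is that a small `n_bins` makes unrelated tags collide in the same bin.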
An approach to validity indices for clustering techniques in Big Data
Clustering analysis is one of the most used
Machine Learning techniques to discover groups among data
objects. Some clustering methods require the number of clusters into which the data is going to be partitioned. There exist
several cluster validity indices that help us to approximate
the optimal number of clusters of the dataset. However, such
indices are not suitable to deal with Big Data due to their size
limitations and runtime costs. This paper presents two clustering validity indices that handle large amounts of data in low
computational time. Our indices are based on redefinitions
of traditional indices by simplifying the intra-cluster distance
calculation. Two types of tests have been carried out over 28
synthetic datasets to analyze the performance of the proposed
indices. First, we test the indices with small and medium size
datasets to verify that our indices have a similar effectiveness
to the traditional ones. Subsequently, tests on datasets of up
to 11 million records and 20 features have been executed to
check their efficiency. The results show that both indices can
handle Big Data in a very low computational time with an
effectiveness similar to the traditional indices using the Apache
Spark framework.
Ministerio de Economía y Competitividad TIN2014-55894-C2-1-
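One way an intra-cluster distance calculation can be simplified, shown here only as a hedged illustration and not as the indices proposed in the paper, is to replace a sum over all pairs of points with a sum over distances to the centroid. For squared Euclidean distance the two are related exactly: the sum of squared pairwise distances equals n times the sum of squared distances to the centroid, so a quadratic-time computation becomes linear in the cluster size. The function names below are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pairwise_scatter(points):
    """Sum of squared distances over all unordered pairs: O(n^2)."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += squared_distance(points[i], points[j])
    return total


def centroid_scatter(points):
    """Same quantity computed through the centroid: O(n)."""
    n = len(points)
    dim = len(points[0])
    center = [sum(p[d] for p in points) / n for d in range(dim)]
    return n * sum(squared_distance(p, center) for p in points)


# A toy cluster: the four corners of a square with side 2.
cluster = [[0, 0], [2, 0], [0, 2], [2, 2]]
```

Both functions return the same value for `cluster`, but only the centroid-based version scales to clusters with millions of records, since it needs a single pass to compute the centroid and a second pass to accumulate distances.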
Visual and semantic interpretability of projections of high dimensional data for classification tasks
A number of visual quality measures have been introduced in visual analytics
literature in order to automatically select the best views of high dimensional
data from a large number of candidate data projections. These methods generally
concentrate on the interpretability of the visualization and pay little
attention to the interpretability of the projection axes. In this paper, we
argue that interpretability of the visualizations and the feature
transformation functions are both crucial for visual exploration of high
dimensional labeled data. We present a two-part user study to examine these two
related but orthogonal aspects of interpretability. We first study how humans
judge the quality of 2D scatterplots of various datasets with varying numbers of
classes and provide comparisons with ten automated measures, including a number
of visual quality measures and related measures from various machine learning
fields. We then investigate how user perception of the interpretability of
mathematical expressions relates to various automated measures of complexity
that can be used to characterize data projection functions. We conclude with a
discussion of how automated measures of visual and semantic interpretability of
data projections can be used together for exploratory analysis in
classification tasks.
Comment: Longer version of the VAST 2011 poster.
http://dx.doi.org/10.1109/VAST.2011.610247