Robust EM algorithm for model-based curve clustering
Model-based clustering approaches concern the paradigm of exploratory data
analysis relying on the finite mixture model to automatically find a latent
structure governing observed data. They are one of the most popular and
successful approaches in cluster analysis. The mixture density estimation is
generally performed by maximizing the observed-data log-likelihood by using the
expectation-maximization (EM) algorithm. However, it is well-known that the EM
algorithm initialization is crucial. In addition, the standard EM algorithm
requires the number of clusters to be known a priori. Some solutions have been
provided in [31, 12] for model-based clustering with Gaussian mixture models
for multivariate data. In this paper we focus on model-based curve clustering
approaches, when the data are curves rather than vectorial data, based on
regression mixtures. We propose a new robust EM algorithm for clustering
curves. We extend the model-based clustering approach presented in [31] for
Gaussian mixture models, to the case of curve clustering by regression
mixtures, including polynomial regression mixtures as well as spline or
B-spline regression mixtures. Our approach handles both the problem of
initialization and that of choosing the optimal number of clusters as the EM
learning proceeds, rather than in a two-fold scheme. This is achieved by
optimizing a penalized log-likelihood criterion. A simulation study confirms
the potential benefit of the proposed algorithm in terms of robustness
regarding initialization and finding the actual number of clusters. Comment: In Proceedings of the 2013 International Joint Conference on Neural
Networks (IJCNN), 2013, Dallas, TX, US
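For orientation, a regression mixture for curves is usually written in the form below. This is a generic sketch, not the paper's exact formulation: the symbols are assumptions introduced here, with $\mathbf{y}_i$ the $i$-th curve observed at $m$ points, $\mathbf{X}_i$ its design matrix (polynomial, spline, or B-spline basis), $\pi_k$ the mixing proportions, $\boldsymbol{\beta}_k$ the regression coefficients, and $\sigma_k^2$ the noise variance of cluster $k$. The abstract's penalized criterion adds a penalty term to $\mathcal{L}(\boldsymbol{\Psi})$; its exact form is not given in the abstract and is not reproduced here.

```latex
\[
f(\mathbf{y}_i \mid \mathbf{x}_i; \boldsymbol{\Psi})
  = \sum_{k=1}^{K} \pi_k \,
    \mathcal{N}\!\left(\mathbf{y}_i;\ \mathbf{X}_i \boldsymbol{\beta}_k,\ \sigma_k^2 \mathbf{I}_m\right),
\qquad
\mathcal{L}(\boldsymbol{\Psi})
  = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \,
    \mathcal{N}\!\left(\mathbf{y}_i;\ \mathbf{X}_i \boldsymbol{\beta}_k,\ \sigma_k^2 \mathbf{I}_m\right)
\]
```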
TWO-DIMENSIONAL GMM-BASED CLUSTERING IN THE PRESENCE OF QUANTIZATION NOISE
Clustering is commonly studied under the assumption that data attributes are represented accurately. This paper instead investigates how successfully clustering can be performed when data attributes are represented with lower accuracy, i.e. with a small number of bits. In particular, the effect of quantizing data attributes on two-dimensional, two-component Gaussian mixture model (GMM)-based clustering using the expectation–maximization (EM) algorithm is analyzed. Independent quantization of the data attributes by uniform quantizers, with the support limits adjusted to the minimal and maximal attribute values, is assumed. The analysis makes it possible to determine the number of bits for data representation that provides accurate clustering. These findings can be useful in clustering scenarios where, before being grouped, the data have to be represented with a small finite number of bits because they are transmitted through a bandwidth-limited channel.
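The quantization step described in this abstract can be illustrated with a minimal sketch. It assumes a mid-rise uniform quantizer that maps each value to the midpoint of its cell; the abstract only states that the quantizers are uniform and that their support runs from the minimal to the maximal attribute value, so the reconstruction rule and the function names `uniform_quantize` and `quantize_points` are assumptions made for illustration.

```python
def uniform_quantize(values, bits):
    """Quantize one attribute with 2**bits uniform cells spanning [min, max].

    Assumption: each value is replaced by the midpoint of its cell.
    """
    lo, hi = min(values), max(values)
    levels = 2 ** bits
    if hi == lo:
        # Degenerate attribute: nothing to quantize.
        return list(values)
    step = (hi - lo) / levels
    quantized = []
    for v in values:
        # Cell index; the maximum value is clamped into the last cell.
        idx = min(int((v - lo) / step), levels - 1)
        quantized.append(lo + (idx + 0.5) * step)
    return quantized


def quantize_points(points, bits):
    """Quantize the two attributes of 2-D points independently."""
    xs, ys = zip(*points)
    return list(zip(uniform_quantize(xs, bits), uniform_quantize(ys, bits)))
```

The quantized points could then be passed to any GMM fitting routine to compare clustering results across different values of `bits`.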
A Learning-Based EM Clustering for Circular Data with Unknown Number of Clusters
Clustering is a method for analyzing grouped data. Circular data arise in various applications, such as wind directions and the departure directions of migrating birds or animals. The expectation-maximization (EM) algorithm on mixtures of von Mises distributions is widely used for clustering circular data. In general, the EM algorithm is sensitive to initial values, is not robust to outliers, and requires the number of clusters to be given a priori. In this paper, we consider a learning-based schema for EM and propose a learning-based EM algorithm on mixtures of von Mises distributions for clustering grouped circular data. The proposed clustering method requires no initialization, is robust to outliers, and automatically finds the number of clusters. Numerical and real data sets are used to compare the proposed algorithm with existing methods. The experimental results and comparisons demonstrate the effectiveness and superiority of the proposed learning-based EM algorithm.
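As background for the von Mises mixture mentioned above, the sketch below shows the standard (not the learning-based) building blocks: the von Mises density, the E-step responsibilities, and the mean-direction update. It assumes a fixed number of components and moderate concentration values; the function names are hypothetical, and the concentration update and the paper's learning scheme are deliberately left out.

```python
import math


def bessel_i0(x, terms=50):
    """Modified Bessel function I0 via its power series (moderate x only)."""
    total, term = 0.0, 1.0
    q = (x * x) / 4.0
    for k in range(terms):
        total += term
        term *= q / ((k + 1) ** 2)  # next term: q**(k+1) / ((k+1)!)**2
    return total


def von_mises_pdf(theta, mu, kappa):
    """Density of a von Mises distribution with mean direction mu."""
    return math.exp(kappa * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(kappa))


def e_step(angles, weights, mus, kappas):
    """Posterior probability of each component for each angle."""
    resp = []
    for t in angles:
        dens = [w * von_mises_pdf(t, m, k) for w, m, k in zip(weights, mus, kappas)]
        total = sum(dens)
        resp.append([d / total for d in dens])
    return resp


def update_mean_directions(angles, resp):
    """Weighted circular mean of the angles for each component."""
    mus = []
    for k in range(len(resp[0])):
        s = sum(r[k] * math.sin(t) for t, r in zip(angles, resp))
        c = sum(r[k] * math.cos(t) for t, r in zip(angles, resp))
        mus.append(math.atan2(s, c))
    return mus
```

Using `atan2` on the weighted sine and cosine sums keeps the mean direction well defined across the wrap-around at ±π, which an arithmetic mean of angles would not.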
Improved Correction of Atmospheric Pressure Data Obtained by Smartphones through Machine Learning
A correction method using machine learning aims to improve on the conventional linear regression (LR) based method for correcting atmospheric pressure data obtained by smartphones. The method proposed in this study conducts clustering and regression analysis with time domain classification. Data obtained using smartphones in Gyeonggi-do, one of the most populous provinces in South Korea, which surrounds Seoul and has an area of 10,000 km², from July 2014 through December 2014 were classified with respect to time of day (daytime or nighttime), day of the week (weekday or weekend), and the user's mobility, prior to the expectation-maximization (EM) clustering. Subsequently, the results were analyzed for comparison by applying machine learning methods such as multilayer perceptron (MLP) and support vector regression (SVR). The results showed a mean absolute error (MAE) 26% lower on average when regression analysis was performed with EM clustering than without EM clustering. For machine learning methods, the MAE for SVR was around 31% lower than for LR and about 19% lower than for MLP. It is concluded that pressure data from smartphones are as good as those from the national automatic weather station (AWS) network.
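The percentages quoted in this abstract are relative reductions in MAE. A minimal sketch of both quantities follows; the function names and the numbers in the usage note are hypothetical and are not taken from the study.

```python
def mean_absolute_error(observed, reference):
    """Average absolute difference between paired observations."""
    if not observed or len(observed) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - r) for o, r in zip(observed, reference)) / len(observed)


def relative_reduction(baseline_mae, improved_mae):
    """Fraction by which the improved MAE is lower than the baseline MAE."""
    return (baseline_mae - improved_mae) / baseline_mae
```

For example, with made-up values, `relative_reduction(1.0, 0.74)` is approximately 0.26, which corresponds to an MAE that is 26% lower than the baseline.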
visClust: A visual clustering algorithm based on orthogonal projections
We present a novel clustering algorithm, visClust, that is based on lower
dimensional data representations and visual interpretation. Thereto, we design
a transformation that allows the data to be represented by a binary integer
array enabling the further use of image processing methods to select a
partition. Qualitative and quantitative analyses show that the algorithm
obtains high accuracy (measured with an adjusted one-sided Rand-Index) and
requires low runtime and RAM. We compare the results to 6 state-of-the-art
algorithms; visClust outperforms them in most experiments, which confirms its
quality. Moreover, the algorithm requires just one mandatory input
parameter while allowing optimization via optional parameters. The code is made
available on GitHub. Comment: 23 pages
Estimation of 5G Core and RAN End-to-End Delay through Gaussian Mixture Models
Funding Information: This research was funded by Fundação para a Ciência e Tecnologia (FCT) under the projects 2022.08786.PTDC and UIDB/50008/2020. Publisher Copyright: © 2022 by the authors. Network analytics provide a comprehensive picture of the network's Quality of Service (QoS), including the End-to-End (E2E) delay. In this paper, we characterize the Core and the Radio Access Network (RAN) E2E delay of 5G networks with the Standalone (SA) and Non-Standalone (NSA) topologies when a single known Probability Density Function (PDF) is not suitable to model its distribution. To this end, multiple PDFs, referred to as components, are combined in a Gaussian Mixture Model (GMM) to represent the distribution of the E2E delay. The accuracy and computation time of the GMM are evaluated for different numbers of components and samples. The results presented in the paper are based on a dataset of E2E delay values sampled from both SA and NSA 5G networks. Finally, we show that the GMM can be adopted to estimate the wide diversity of E2E delay patterns found in 5G networks and that its computation time can be adequate for a large range of applications.
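The idea of combining several Gaussian components to model a delay distribution can be sketched with a plain one-dimensional EM fit. This is a generic illustration, not the authors' implementation: the number of components is fixed by the initial means supplied by the caller, and the function names, the variance floor, and the iteration count are assumptions.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(samples, initial_means, n_iter=100, var_floor=1e-6):
    """Fit a 1-D GMM with standard EM, one component per initial mean."""
    n, k = len(samples), len(initial_means)
    overall_mean = sum(samples) / n
    overall_var = sum((x - overall_mean) ** 2 for x in samples) / n
    weights = [1.0 / k] * k
    means = list(initial_means)
    variances = [max(overall_var, var_floor)] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in samples:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj <= 0.0:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, samples)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, samples)) / nj
            variances[j] = max(var, var_floor)
    return weights, means, variances


def gmm_pdf(x, weights, means, variances):
    """Mixture density: weighted sum of the component densities."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
```

Repeating the fit with different numbers of initial means and timing each run would mirror, in spirit, the accuracy versus computation time comparison the abstract describes.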
RECOMED: A Comprehensive Pharmaceutical Recommendation System
A comprehensive pharmaceutical recommendation system was designed based on
patient and drug features extracted from Drugs.com and Druglib.com.
First, data from these databases were combined, and a dataset of patients and
drug information was built. Secondly, the patients and drugs were clustered,
and then the recommendation was performed using different ratings provided by
patients, and importantly by the knowledge obtained from patients and drug
specifications, and considering drug interactions. To the best of our
knowledge, we are the first group to consider patients' conditions and history
in selecting a specific medicine appropriate for a
particular user. Our approach applies artificial intelligence (AI) models for
the implementation. Sentiment analysis using natural language processing
approaches is employed in pre-processing along with neural network-based
methods and recommender system algorithms for modeling the system. In our work,
patients' conditions and drug features are used to build two models based on
matrix factorization. Then we used drug interactions to filter drugs with severe
or mild interactions with other drugs. We developed a deep learning model for
recommending drugs by using data from 2304 patients as a training set, and then
we used data from 660 patients as our validation set. After that, we used
knowledge from critical information about drugs and combined the outcome of the
model into a knowledge-based system with the rules obtained from constraints on
taking medicine. Comment: 39 pages, 14 figures, 13 tables
NetCluster: A clustering-based framework to analyze internet passive measurements data
Internet measurement data collected via passive measurement are analyzed to obtain localization information on nodes by clustering (i.e., grouping together) nodes that exhibit similar network path properties. Since traditional clustering algorithms fail to correctly identify clusters of homogeneous nodes, we propose NetCluster, a novel framework suited to analyzing Internet measurement datasets. We show that the proposed framework correctly analyzes synthetically generated traces. Finally, we apply it to real traces collected at the access link of the Politecnico di Torino campus LAN and discuss the network characteristics as seen from the vantage point.
Unsupervised online clustering and detection algorithms using crowdsourced data for malaria diagnosis
This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/ Crowdsourced data in science might be severely error-prone due to the inexperience of annotators participating in the project. In this work, we present a procedure to detect specific structures in an image given tags provided by multiple annotators and collected through a crowdsourcing methodology. The procedure consists of two stages based on the Expectation–Maximization (EM) algorithm, one for clustering and the other one for detection, and it gracefully combines data coming from annotators with unknown reliability in an unsupervised manner. An online implementation of the approach is also presented that is well suited to crowdsourced streaming data. Comprehensive experimental results with real data from the MalariaSpot project are also included.