Classifying sequences by the optimized dissimilarity space embedding approach: a case study on the solubility analysis of the E. coli proteome
We evaluate a version of the recently-proposed classification system named
Optimized Dissimilarity Space Embedding (ODSE) that operates in the input space
of sequences of generic objects. The ODSE system was originally presented
as a classification system for patterns represented as labeled graphs. However,
since ODSE is founded on the dissimilarity space representation of the input
data, the classifier can be easily adapted to any input domain where it is
possible to define a meaningful dissimilarity measure. Here we demonstrate the
effectiveness of the ODSE classifier for sequences by considering an
application dealing with the recognition of the solubility degree of the
Escherichia coli proteome. Solubility, or analogously aggregation propensity,
is an important property of protein molecules, which is intimately related to
the mechanisms underlying the chemico-physical process of folding. Each protein
in our dataset is initially associated with a solubility degree and is
represented as a sequence of symbols, denoting the 20 amino acid residues. The
computational results obtained here, which we stress were achieved
with no context-dependent tuning of the ODSE system, confirm the validity and
generality of the ODSE-based approach for structured data classification.
Comment: 10 pages, 49 references
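The dissimilarity space idea in the abstract above can be illustrated with a minimal sketch: each sequence is mapped to the vector of its dissimilarities to a few prototype sequences, and a standard classifier then works on those vectors. This is not the ODSE system itself, which the abstract does not specify in detail. The choice of edit distance as the dissimilarity measure, the 1-nearest-neighbor classifier, the function names, and the toy sequences and labels are all assumptions made for illustration.

```python
from math import dist


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def embed(seq, prototypes):
    """Represent a sequence by its dissimilarity to each prototype."""
    return [levenshtein(seq, p) for p in prototypes]


def classify(seq, prototypes, train):
    """1-nearest-neighbor decision taken in the dissimilarity space."""
    x = embed(seq, prototypes)
    nearest = min(train, key=lambda item: dist(x, embed(item[0], prototypes)))
    return nearest[1]


# Hypothetical prototypes and labeled training sequences (not real proteins).
prototypes = ["AAAA", "KKKK"]
train = [("AAAL", "soluble"), ("KKKR", "insoluble")]
label = classify("AALA", prototypes, train)
```

Because only `levenshtein` touches the raw sequences, swapping in another dissimilarity measure leaves the rest of the pipeline unchanged, which is the adaptability the abstract points to.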
Learning to Predict with Highly Granular Temporal Data: Estimating individual behavioral profiles with smart meter data
Big spatio-temporal datasets, available through both open and administrative
data sources, offer significant potential for social science research. The
magnitude of the data allows for increased resolution and analysis at
the individual level. While there are recent advances in forecasting techniques for
highly granular temporal data, little attention is given to segmenting the time
series and finding homogeneous patterns. This paper proposes to
estimate behavioral profiles of individuals' activities over time using
Gaussian Process-based models. In particular, the aim is to investigate how
individuals or groups may be clustered according to the model parameters. Such
a Bayesian non-parametric method is then tested by looking at the
predictability of the segments using a combination of models to fit different
parts of the temporal profiles. Model validity is then tested on a set of
holdout data. The dataset consists of half-hourly energy consumption records
from smart meters in more than 100,000 households in the UK and covers the
period from 2015 to 2016. The methodological approach developed in the paper
may be easily applied to datasets of similar structure and granularity, for
example social media data, and may lead to improved accuracy in the prediction
of social dynamics and behavior.
Quality, Frequency and Similarity Based Fuzzy Nearest Neighbor Classification
This paper proposes an approach based on fuzzy rough set theory to improve nearest-neighbor-based classification. Six measures are introduced to evaluate the quality of the nearest neighbors. This quality is combined with the frequency at which classes occur among the nearest neighbors and the similarity with respect to the nearest neighbor to decide which class to pick among the neighbors' classes. The importance of each aspect is set by optimized weights. An experimental study shows that our method, Quality, Frequency and Similarity based Fuzzy Nearest Neighbor (QFSNN), outperforms state-of-the-art nearest neighbor classifiers.
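A rough sketch of how quality, frequency, and similarity might be combined into a single class decision is given here. It is not the QFSNN method: the abstract names neither the six quality measures nor the combination rule, so the per-sample `quality` values, the weighted sum, the similarity formula, and the function name `qfs_style_vote` are all assumptions made for illustration.

```python
from collections import defaultdict
from math import dist


def qfs_style_vote(train, x, k=3, weights=(1.0, 1.0, 1.0)):
    """Pick a class by a weighted sum of quality, frequency, and similarity.

    train holds (features, label, quality) triples with quality in [0, 1].
    The quality values are hypothetical inputs, not the paper's measures.
    """
    w_quality, w_frequency, w_similarity = weights
    nearest = sorted(train, key=lambda item: dist(item[0], x))[:k]

    by_class = defaultdict(list)
    for features, label, quality in nearest:
        similarity = 1.0 / (1.0 + dist(features, x))  # assumed similarity
        by_class[label].append((quality, similarity))

    scores = {}
    for label, members in by_class.items():
        scores[label] = (
            w_quality * max(q for q, _ in members)         # best neighbor quality
            + w_frequency * len(members) / len(nearest)    # class frequency
            + w_similarity * max(s for _, s in members)    # closest class member
        )
    return max(scores, key=scores.get)


# Made-up samples: (features, label, quality).
train = [((0.0, 0.0), "a", 0.9), ((1.0, 0.0), "b", 0.2),
         ((1.1, 0.0), "b", 0.3), ((5.0, 5.0), "a", 0.5)]
choice = qfs_style_vote(train, (0.4, 0.0), k=3)
```

With equal weights the single high-quality, close neighbor of class "a" wins. Setting `weights=(0.0, 1.0, 0.0)` reduces the rule to a plain majority vote, which picks "b" for the same query.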
Chronic liver disease staging classification based on ultrasound, clinical and laboratorial data
This work addresses the identification and diagnosis of the various stages of chronic liver disease. The classification results of a support vector machine, a decision tree and a k-nearest neighbor classifier are compared. Ultrasound image intensity and textural features are used jointly with clinical and laboratory data in the staging process. The classifiers are trained on a population of 97 patients at six different stages of chronic liver disease using a leave-one-out cross-validation strategy. The best results are obtained with the support vector machine with a radial-basis kernel, which reaches an overall accuracy of 73.20%. The good performance of the method is a promising indicator that it can be used, noninvasively, to provide reliable information about chronic liver disease staging.
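The leave-one-out strategy mentioned above can be sketched as follows: each sample is held out once, the classifier is fitted on the remaining samples, and the fraction of correctly predicted held-out samples is reported. The sketch uses a plain k-nearest neighbor classifier for brevity. The two-dimensional features, the stage labels, and the function names are made up for illustration and are not the study's data or pipeline.

```python
from collections import Counter
from math import dist


def knn_predict(train, x, k=3):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda item: dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(samples, k=3):
    """Hold out each sample once, train on the rest, and average the hits."""
    hits = 0
    for i, (x, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]  # every sample except the i-th
        hits += knn_predict(rest, x, k) == label
    return hits / len(samples)


# Made-up feature vectors with two hypothetical stage labels.
samples = [
    ((0.0, 0.1), "early"), ((0.2, 0.0), "early"), ((0.1, 0.3), "early"),
    ((2.0, 2.1), "late"), ((2.2, 1.9), "late"), ((1.9, 2.3), "late"),
]
accuracy = leave_one_out_accuracy(samples, k=1)
```

With n samples the procedure fits the classifier n times, so each patient counts for 1/n of the reported accuracy. For n = 97, a figure of 73.20% is consistent with 71 correct predictions, since 71/97 ≈ 0.7320.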