On clustering financial time series: a need for distances between dependent random variables
The following working document summarizes our work on the clustering of
financial time series. It was written for a workshop on information geometry
and its applications to image and signal processing. This workshop brought
several experts in pure and applied mathematics together with applied
researchers from medical imaging, radar signal processing and finance. The
authors belong to the latter group. This document was written as a long
introduction to further development of geometric tools in financial
applications such as risk or portfolio analysis. Indeed, risk and portfolio
analysis essentially rely on covariance matrices. Beyond the fact that the Gaussian assumption is known to be inaccurate, covariance matrices are difficult to estimate from empirical data. To filter noise from the empirical estimate,
Mantegna proposed using hierarchical clustering. In this work, we first show
that this procedure is statistically consistent. Then, we propose to use
clustering for a much broader range of applications than the filtering of empirical covariance matrices from the estimated correlation coefficients. To be able to
do that, we need to obtain distances between the financial time series that
incorporate all the available information in these cross-dependent random
processes.
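As a rough illustration of the kind of procedure the abstract refers to, the sketch below applies Mantegna's correlation distance d_ij = sqrt(2(1 - rho_ij)) and single-linkage hierarchical clustering to a matrix of asset returns; the data, cluster count, and linkage choice are placeholders, not the authors' exact setup.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def correlation_distance(returns):
    """Map a (T, N) matrix of asset returns to Mantegna's distance
    d_ij = sqrt(2 * (1 - rho_ij)), a proper metric on correlations."""
    rho = np.corrcoef(returns, rowvar=False)      # (N, N) correlation matrix
    return np.sqrt(2.0 * (1.0 - rho))

def cluster_assets(returns, n_clusters=5):
    d = correlation_distance(returns)
    np.fill_diagonal(d, 0.0)                      # remove numerical noise
    condensed = squareform(d, checks=False)       # condensed form for scipy
    z = linkage(condensed, method="single")       # single linkage, as in Mantegna's work
    return fcluster(z, t=n_clusters, criterion="maxclust")

# Toy usage: 500 days of returns for 20 assets.
labels = cluster_assets(np.random.randn(500, 20))
```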
Explicit approximation of stochastic optimal feedback control for combined therapy of cancer
In this paper, a tractable methodology is proposed to approximate stochastic
optimal feedback treatment in the context of mixed immuno-chemo therapy of
cancer. The method uses a fixed-point value iteration that approximately solves
a stochastic dynamic programming-like equation. In particular, it is shown that introducing a variance-related penalty in the latter yields better results, coping with the consequences of softening the health-safety constraints in the cost function. The convergence of the value-function iteration is revisited in the presence of the variance-related term. The implementation involves machine-learning tools to represent the optimal function and to reduce complexity through clustering.
A quantitative illustration is given using a commonly used model of combined therapy involving twelve highly uncertain parameters.
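The abstract does not spell out the iteration, so the following is only a schematic fixed-point value iteration on a toy discretized problem, with a penalty on the variance of the next-state value; the dynamics, costs, and penalty weight are all hypothetical stand-ins rather than the paper's model.

```python
import numpy as np

# Hypothetical discretized setting: S states, A actions, stochastic
# transition kernels P[a] (S x S), stage costs c[a] (S,), discount gamma.
rng = np.random.default_rng(0)
S, A, gamma, lam = 50, 3, 0.95, 0.1           # lam weights the variance penalty

P = rng.dirichlet(np.ones(S), size=(A, S))    # P[a, s, :] sums to 1
c = rng.uniform(0.0, 1.0, size=(A, S))        # c[a, s] stage costs

def value_iteration(P, c, gamma, lam, tol=1e-8, max_iter=10_000):
    """Fixed-point iteration on a DP-like equation with a penalty on the
    variance of the next-state value (a common risk-sensitive surrogate).
    NOTE: the variance term can break the usual contraction argument,
    which is why convergence has to be revisited in its presence."""
    V = np.zeros(S)
    for _ in range(max_iter):
        EV = P @ V                            # E[V(s')] for each (a, s)
        VarV = P @ V**2 - EV**2               # Var[V(s')] for each (a, s)
        Q = c + gamma * EV + lam * VarV       # penalized Q-values
        V_new = Q.min(axis=0)                 # greedy (minimizing) policy
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmin(axis=0)

V, policy = value_iteration(P, c, gamma, lam)
```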
Visual Analytics of Image-Centric Cohort Studies in Epidemiology
Epidemiology characterizes the influence of causes on disease and health conditions of defined populations. Cohort studies are population-based studies, usually involving large numbers of randomly selected individuals and comprising
numerous attributes, ranging from self-reported interview data to results from
various medical examinations, e.g., blood and urine samples. Recently,
medical imaging has been used as an additional instrument to assess risk
factors and potential prognostic information. In this chapter, we discuss such
studies and how the evaluation may benefit from visual analytics. Cluster
analysis to define groups, reliable image analysis of organs in medical imaging
data and shape space exploration to characterize anatomical shapes are among
the visual analytics tools that may enable epidemiologists to fully exploit the
potential of their huge and complex data. To gain acceptance, visual analytics
tools need to complement more classical epidemiologic tools, primarily
hypothesis-driven statistical analysis.
AntiPlag: Plagiarism Detection on Electronic Submissions of Text Based Assignments
Plagiarism is one of the growing issues in academia and is always a concern in universities and other academic institutions. The situation is becoming even worse with the availability of ample resources on the web. This paper focuses on creating an effective and fast tool for plagiarism detection in text-based electronic assignments. Our plagiarism detection tool, named AntiPlag, is developed using a tri-gram sequence matching technique. Three sets of text-based assignments were tested with AntiPlag and the results were compared against an existing commercial plagiarism detection tool. AntiPlag showed better results in terms of false positives than the commercial tool, owing to the pre-processing steps performed in AntiPlag. In addition, to improve detection latency, AntiPlag applies a data clustering technique, making it four times faster than the commercial tool considered. AntiPlag can be used to easily isolate plagiarized text-based assignments from non-plagiarized ones. We therefore present AntiPlag as a fast and effective tool for plagiarism detection on text-based electronic assignments.
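The abstract names tri-gram sequence matching without fixing the details; below is a minimal sketch, assuming word-level tri-grams and Jaccard overlap as the similarity measure (both assumptions, not AntiPlag's documented design).

```python
def trigrams(text):
    """Set of word-level tri-grams of a lower-cased document."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def trigram_similarity(doc_a, doc_b):
    """Jaccard overlap of tri-gram sets; higher means more shared phrasing."""
    a, b = trigrams(doc_a), trigrams(doc_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Toy usage: two short submissions with partially shared phrasing.
print(trigram_similarity(
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox jumps over a sleeping dog"))
```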
Mining the Workload of Real Grid Computing Systems
Since the mid-1990s, grid computing systems have emerged from the analogy of making computing power as pervasive and easily accessible as an electric power grid. Since then, grid computing systems have been shown to provide very large amounts of storage and computing power, mainly to support scientific and engineering research on a wide geographic scale. Understanding the characteristics of the workload arriving at such systems is a milestone for the design and tuning of effective resource management strategies. This is accomplished through workload characterization, in which workload characteristics are analyzed and a realistic model for them is obtained. In this paper, we study the workload of some real grid systems by
using a data mining approach to build a workload model for job interarrival
time and runtime, and a Bayesian approach to capture user correlations and
usage patterns. The final model is then validated against the workload coming
from a real grid system.
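As a generic illustration of modelling job interarrival times (not the paper's specific data-mining or Bayesian model), one can fit candidate parametric distributions to a trace and compare goodness of fit; the trace below is synthetic.

```python
import numpy as np
from scipy import stats

# Hypothetical job-submission timestamps (seconds); a real trace would
# come from grid accounting logs.
rng = np.random.default_rng(1)
timestamps = np.cumsum(rng.lognormal(mean=2.0, sigma=1.0, size=5000))
interarrival = np.diff(timestamps)

# Fit candidate distributions to the interarrival times and compare fits
# with a Kolmogorov-Smirnov test (smaller statistic = closer fit).
for dist in (stats.lognorm, stats.weibull_min, stats.expon):
    params = dist.fit(interarrival)
    ks = stats.kstest(interarrival, dist.name, args=params)
    print(f"{dist.name:12s} KS statistic = {ks.statistic:.4f}")
```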
Discriminating Topology in Galaxy Distributions using Network Analysis
(abridged) The large-scale distribution of galaxies is generally analyzed
using the two-point correlation function. However, this statistic does not
capture the topology of the distribution, and it is necessary to resort to
higher order correlations to break degeneracies. We demonstrate that an
alternate approach using network analysis can discriminate between
topologically different distributions that have similar two-point correlations.
We investigate two galaxy point distributions, one produced by a cosmological simulation and the other by a Lévy walk. For the cosmological simulation, we adopt a redshift slice from Illustris (Vogelsberger et al. 2014) and select galaxies above a stellar-mass cut. The two-point correlation function of these simulated galaxies follows a single power law. We then generate Lévy walks matching the correlation function and abundance of the simulated galaxies. We find that, while the two galaxy point distributions have the same abundance and two-point correlation function, their spatial distributions are very different; most prominently, the filamentary structures of the simulation are absent in Lévy fractals.
quantify these missing topologies, we adopt network analysis tools and measure
diameter, giant component, and transitivity from networks built by a
conventional friends-of-friends recipe with various linking lengths. Unlike the
abundance and two-point correlation function, these network quantities reveal a clear separation between the two simulated distributions; the galaxy distribution simulated by Illustris is therefore quantitatively not a Lévy fractal. We
find that the described network quantities offer an efficient tool for
discriminating topologies and for comparing observed and theoretical
distributions.
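A minimal sketch of the friends-of-friends network construction and the three quoted statistics, using scipy and networkx on a toy point set (the linking length and the catalogue itself are placeholders, not the paper's data).

```python
import numpy as np
import networkx as nx
from scipy.spatial import cKDTree

def fof_network(points, linking_length):
    """Friends-of-friends graph: link every pair of points closer than
    the linking length, then read off network statistics."""
    tree = cKDTree(points)
    graph = nx.Graph()
    graph.add_nodes_from(range(len(points)))
    graph.add_edges_from(tree.query_pairs(r=linking_length))
    return graph

# Toy point set standing in for a galaxy catalogue.
points = np.random.default_rng(2).uniform(0, 100, size=(2000, 3))
g = fof_network(points, linking_length=8.0)

giant = g.subgraph(max(nx.connected_components(g), key=len))
print("giant component fraction:", giant.number_of_nodes() / g.number_of_nodes())
print("transitivity:", nx.transitivity(g))
print("diameter of giant component:", nx.diameter(giant))
```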
Attribute Identification and Predictive Customisation Using Fuzzy Clustering and Genetic Search for Industry 4.0 Environments
Today's factories involve more services and customisation. A paradigm shift towards “Industry 4.0” (i4) aims at realising mass customisation at a mass-production cost. However, there is a lack of tools for customer informatics. This paper addresses this issue and develops a predictive analytics framework integrating big-data analysis and business informatics, using Computational Intelligence (CI). In particular, fuzzy c-means is used for pattern recognition, as well as for managing relevant big data that capture potential customer needs and wants, improving productivity at the design stage of customised mass production. The selection of patterns from big data is performed using a genetic algorithm with fuzzy c-means, which helps with clustering and with the selection of optimal attributes. The case study shows that fuzzy c-means is able to assign new clusters as knowledge of customer needs and wants grows. The dataset has three types of entities: the specification of a car in terms of various characteristics, its assigned insurance risk rating, and its normalised losses in use compared with other cars. The fuzzy c-means tool offers a number of features suitable for smart design in an i4 environment.
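For concreteness, a plain-numpy sketch of the fuzzy c-means core described above; the genetic attribute selection is omitted, and the data are hypothetical pre-normalised attribute vectors rather than the paper's automotive dataset.

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, tol=1e-6, max_iter=300, seed=0):
    """Fuzzy c-means: returns cluster centres and the soft membership
    matrix U (n points x c clusters), rows summing to 1."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))   # random initial memberships
    for _ in range(max_iter):
        W = U ** m                               # fuzzified weights
        centres = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                    # guard against zero distance
        ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
        U_new = 1.0 / ratio.sum(axis=2)          # standard FCM membership update
        if np.max(np.abs(U_new - U)) < tol:
            break
        U = U_new
    return centres, U

# Toy usage on hypothetical, pre-normalised attribute vectors.
X = np.random.default_rng(3).normal(size=(200, 5))
centres, U = fuzzy_c_means(X, c=4)
print(U.sum(axis=1)[:3])                         # memberships sum to 1 per point
```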
Grid-based Approaches for Distributed Data Mining Applications
The data mining field is an important source of large-scale applications and datasets, which are becoming more and more common. In this paper, we present
grid-based approaches for two basic data mining applications, and a performance
evaluation on an experimental grid environment that provides interesting
monitoring capabilities and configuration tools. We propose a new distributed clustering approach and a distributed frequent-itemset generation approach, both well adapted to grid environments. Performance evaluation is done using the Condor system
and its workflow manager DAGMan. We also compare this performance analysis to a
simple analytical model to evaluate the overheads related to the workflow
engine and the underlying grid system. This specifically shows that realistic performance expectations are currently difficult to achieve on the grid.
Processing Tweets for Cybersecurity Threat Awareness
Receiving timely and relevant security information is crucial for maintaining
a high security level on an IT infrastructure. This information can be
extracted from Open Source Intelligence published daily by users, security
organisations, and researchers. In particular, Twitter has become an
information hub for obtaining cutting-edge information about many subjects,
including cybersecurity. This work proposes SYNAPSE, a Twitter-based streaming
threat monitor that generates a continuously updated summary of the threat
landscape related to a monitored infrastructure. Its tweet-processing pipeline
is composed of filtering, feature extraction, binary classification, an
innovative clustering strategy, and generation of Indicators of Compromise
(IoCs). A quantitative evaluation considering all tweets from 80 accounts over more than 8 months (over 195,000 tweets) shows that our approach successfully finds, in a timely manner, the majority of security-related tweets concerning an example IT infrastructure (true positive rate above 90%), incorrectly selects only a small number of tweets as relevant (false positive rate under 10%), and summarises the results into very few IoCs per day. A qualitative evaluation of
the IoCs generated by SYNAPSE demonstrates their relevance (based on the CVSS
score and the availability of patches or exploits), and timeliness (based on
threat disclosure dates from NVD).
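The pipeline's exact features and classifier are not given in this summary; a minimal stand-in for its filtering, feature-extraction, and binary-classification stages (with hypothetical tweets and product names) might look like the following.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical labelled tweets: 1 = security-relevant, 0 = noise.
tweets = ["RCE vulnerability reported in ExampleServer 2.1, patch available",
          "great coffee this morning",
          "new exploit targets ExampleDB authentication bypass",
          "weekend hiking photos"]
labels = [1, 0, 1, 0]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("svm", LinearSVC()),                 # binary relevance classifier
])
clf.fit(tweets, labels)
print(clf.predict(["authentication bypass found in ExampleServer"]))
```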
Using qualia information to identify lexical semantic classes in an unsupervised clustering task
Acquiring lexical information is a complex problem, typically approached by
relying on a number of contexts to contribute information for classification.
One of the first issues to address in this domain is the determination of such
contexts. The work presented here proposes the use of automatically obtained
FORMAL role descriptors as features used to draw nouns from the same lexical
semantic class together in an unsupervised clustering task. We have dealt with
three lexical semantic classes (HUMAN, LOCATION and EVENT) in English. The
results obtained show that it is possible to discriminate between elements from
different lexical semantic classes using only FORMAL role information, hence
validating our initial hypothesis. Also, iterating our method accurately
accounts for fine-grained distinctions within lexical classes, namely
distinctions involving ambiguous expressions. Moreover, a filtering and
bootstrapping strategy employed in extracting FORMAL role descriptors proved to
minimize the effects of sparse data and noise in our task.
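As an illustration only: clustering nouns by bag-of-words features over FORMAL role descriptors, with hypothetical descriptors and k-means standing in for whatever clustering algorithm the authors actually used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Hypothetical nouns paired with automatically extracted FORMAL-role
# descriptors (real descriptors would come from corpus extraction).
formal_roles = {
    "teacher":  "person professional adult",
    "surgeon":  "person professional specialist",
    "village":  "place settlement area",
    "harbour":  "place area coastal",
    "ceremony": "event occasion gathering",
    "festival": "event celebration gathering",
}

nouns = list(formal_roles)
X = CountVectorizer().fit_transform(formal_roles.values())
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

for noun, label in zip(nouns, labels):
    print(f"{noun:9s} -> cluster {label}")
```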