Iterative Optimization and Simplification of Hierarchical Clusterings
Clustering is often used for discovering structure in data. Clustering
systems differ in the objective function used to evaluate clustering quality
and the control strategy used to search the space of clusterings. Ideally, the
search strategy should consistently construct clusterings of high quality, but
be computationally inexpensive as well. In general, we cannot have it both
ways, but we can partition the search so that a system inexpensively constructs
a `tentative' clustering for initial examination, followed by iterative
optimization, which continues to search in background for improved clusterings.
Given this motivation, we evaluate an inexpensive strategy for creating initial
clusterings, coupled with several control strategies for iterative
optimization, each of which repeatedly modifies an initial clustering in search
of a better one. One of these methods appears novel as an iterative
optimization strategy in clustering contexts. Once a clustering has been
constructed it is judged by analysts -- often according to task-specific
criteria. Several authors have abstracted these criteria and posited a generic
performance task akin to pattern completion, where the error rate over
completed patterns is used to `externally' judge clustering utility. Given this
performance task, we adapt resampling-based pruning strategies used by
supervised learning systems to the task of simplifying hierarchical
clusterings, thus promising to ease post-clustering analysis. Finally, we
propose a number of objective functions, based on attribute-selection measures
for decision-tree induction, that might perform well on the error rate and
simplicity dimensions.
Comment: See http://www.jair.org/ for any accompanying file
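The search regime this abstract describes — an inexpensive initial clustering followed by background iterative optimization that repeatedly modifies it — can be sketched as a simple hill-climbing loop. This is only an illustration of the general idea, not the paper's actual system; the sum-of-squared-error objective, the 1-D data and all names below are assumptions.

```python
import random

def sse(clusters):
    """Sum of squared distances of each point to its cluster mean (1-D points)."""
    total = 0.0
    for pts in clusters:
        if not pts:
            continue
        mean = sum(pts) / len(pts)
        total += sum((p - mean) ** 2 for p in pts)
    return total

def iterative_optimize(clusters, iters=200, seed=0):
    """Hill-climbing: repeatedly try moving one point to another cluster,
    keeping the move only if it lowers the objective."""
    rng = random.Random(seed)
    best = sse(clusters)
    for _ in range(iters):
        src = rng.randrange(len(clusters))
        dst = rng.randrange(len(clusters))
        if src == dst or not clusters[src]:
            continue
        p = clusters[src].pop(rng.randrange(len(clusters[src])))
        clusters[dst].append(p)
        new = sse(clusters)
        if new < best:
            best = new            # keep the improving move
        else:
            clusters[dst].pop()   # undo the move
            clusters[src].append(p)
    return clusters, best

# a deliberately poor initial ("tentative") clustering of two 1-D groups
init = [[1.0, 2.0, 10.0], [11.0, 12.0, 3.0]]
final, score = iterative_optimize(init)
```

Because only improving moves are kept, the clustering available to the analyst never gets worse while the optimization runs in the background.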
A Short Survey on Data Clustering Algorithms
With rapidly increasing data, clustering algorithms are important tools for
data analytics in modern research. They have been successfully applied to a
wide range of domains; for instance, bioinformatics, speech recognition, and
financial analysis. Formally speaking, given a set of data instances, a
clustering algorithm is expected to divide the set into subsets that maximize
intra-subset similarity and inter-subset dissimilarity, where a similarity
measure is defined beforehand. In this work, state-of-the-art clustering
algorithms are reviewed from design concept to methodology, and the different
clustering paradigms are discussed, along with more advanced algorithms. After
that, existing clustering evaluation metrics are reviewed. A summary with
future insights is provided at the end.
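The intra-subset similarity / inter-subset dissimilarity objective stated above is exactly what the classic k-means algorithm (Lloyd's iteration) optimizes; a minimal pure-Python sketch, with all data and names illustrative:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on 2-D points: assign each point to the
    nearest centroid, then recompute each centroid as its cluster mean.
    This greedily reduces within-cluster squared distance."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        for i, pts in enumerate(clusters):
            if pts:  # keep the old centroid if a cluster empties out
                centroids[i] = (sum(x for x, _ in pts) / len(pts),
                                sum(y for _, y in pts) / len(pts))
    return centroids, clusters

# two well-separated toy groups
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(pts, 2)
```

On this toy input the centroids converge to the means of the two groups, (1/3, 1/3) and (28/3, 28/3).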
Data Stream Clustering: Challenges and Issues
Very large databases are required to store massive amounts of data that are
continuously inserted and queried. Analyzing huge data sets and extracting
valuable patterns from them is of great interest to researchers in many
applications. We can identify two main groups of techniques for mining huge
databases. One group treats the data as a stream and applies mining techniques
to it, whereas the second group attempts to solve the problem directly with
efficient algorithms. Recently, many researchers have focused on data streams
as an efficient strategy for mining huge databases, instead of mining the
entire database. The main problem in data stream mining is that evolving data
is more difficult to detect with these techniques; therefore, unsupervised
methods should be applied. However,
clustering techniques can lead us to discover hidden information. In this
survey, we try to clarify: first, the different problem definitions related to
data stream clustering in general; second, the specific difficulties
encountered in this field of research; third, the varying assumptions,
heuristics, and intuitions forming the basis of different approaches; and how
several prominent solutions tackle different problems.
Index Terms: Data Stream, Clustering, K-Means, Concept Drift
Comment: IMECS201
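Because a stream cannot be stored in full, stream clustering algorithms update compact summaries as points arrive. A minimal sketch in that spirit is sequential (online) k-means, where each point moves only its nearest centroid and is then discarded; this is far simpler than the methods the survey covers, and the data and names below are illustrative assumptions:

```python
def stream_cluster(stream, centroids):
    """Sequential (online) k-means on 1-D values: each arriving point
    updates only the nearest centroid, so the stream is never stored.
    counts[i] starts at 1 so the initial centroid acts as one observation."""
    counts = [1] * len(centroids)
    for x in stream:
        i = min(range(len(centroids)), key=lambda c: (x - centroids[c]) ** 2)
        counts[i] += 1
        # move the winning centroid a fraction 1/counts[i] toward the point
        centroids[i] += (x - centroids[i]) / counts[i]
    return centroids

# a 1-D stream drawn from two regimes (illustrative values)
stream = [1.0, 1.2, 0.8, 10.0, 9.8, 10.2, 1.1, 9.9]
final = stream_cluster(stream, centroids=[0.0, 5.0])
print(final)
```

The shrinking step size 1/counts[i] makes each centroid the running mean of the points assigned to it; a fixed step size would instead track drifting concepts at the cost of never fully converging.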
Comparison and validation of community structures in complex networks
The issue of partitioning a network into communities has attracted a great
deal of attention recently. Most authors seem to equate this issue with the one
of finding the maximum value of the modularity, as defined by Newman. Since the
problem formulated this way is NP-hard, most effort has gone into the
construction of search algorithms, and less to the question of other measures
of community structures, similarities between various partitionings and the
validation with respect to external information. Here we concentrate on a class
of computer-generated networks and on three well-studied real networks which
constitute a benchmark for network studies: the karate club, the US college
football teams and a gene network of yeast. We utilize some standard ways of
clustering data (originally not designed for finding community structures in
networks) and show that these classical methods sometimes outperform the newer
ones. We discuss various measures of the strength of the modular structure, and
illustrate their features and drawbacks with examples. Further, we compare different
partitions by applying some graph-theoretic concepts of distance, which
indicate that one of the quality measures of the degree of modularity
corresponds quite well with the distance from the true partition. Finally, we
introduce a way to validate the partitionings with respect to external data
when the nodes are classified but the network structure is unknown. This is
possible here since we know everything about the computer-generated networks, as
well as the historical answer to how the karate club and the football teams are
partitioned in reality. The partitioning of the gene network is validated by
use of the Gene Ontology database, where we show that a community in general
corresponds to a biological process.
Comment: To appear in Physica A; 25 pages
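Newman's modularity, which the abstract says most authors maximize, can be computed directly from the edge list using the per-community form Q = sum over communities c of [e_c/m - (d_c/2m)^2], where e_c is the number of edges inside c, d_c the total degree of its nodes, and m the number of edges. The toy graph and labels below are illustrative assumptions:

```python
def modularity(edges, communities):
    """Newman modularity Q of an undirected graph.
    edges: list of (u, v) pairs; communities: dict node -> community label."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for c in set(communities.values()):
        # edges with both endpoints inside community c
        inside = sum(1 for u, v in edges
                     if communities[u] == c and communities[v] == c)
        # total degree of the community's nodes
        deg_sum = sum(d for n, d in degree.items() if communities[n] == c)
        q += inside / m - (deg_sum / (2 * m)) ** 2
    return q

# two triangles joined by a single bridge edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part = {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'}
q = modularity(edges, part)
print(q)
```

For this graph the natural two-triangle partition scores Q = 5/14, roughly 0.357; partitions that cut through a triangle score lower, which is what modularity-maximizing search exploits.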
Optimal metric for condition rating of existing buildings: is five the right number?
This is an Accepted Manuscript of an article published by Taylor & Francis in Structure and Infrastructure Engineering in January 2019, available online: http://www.tandfonline.com/10.1080/15732479.2018.1557702
In the context of the built environment, the concept of maintenance has in recent years changed from corrective to preventive maintenance. There is evidence that preventive maintenance is much more efficient than corrective maintenance, since severe deteriorations that may represent a danger to people are avoided, and money is also saved. Periodic inspections of buildings are useful to quantify the extent to which deteriorations are severe or not, in order to facilitate decision making and prioritise interventions. To this purpose, many scales have been and are being used to assess the severity of damage and degradation of building components. But it appears evident that there is no consensus among users, and these scales differ from one another, with different numbers of degrees and metrics for the measurement of the condition state. The main goal of this paper is to calculate the optimal metric (the optimal number of degrees) of a severity scale for damage in buildings, so that the corresponding scale could come into widespread and common use among professionals, avoiding the problems of comparison between different evaluators. The proposed methodology for calculating the optimal metric of a scale can also be extended to other scopes.
Peer Reviewed
Postprint (author's final draft)
Discovering usage patterns from web server logs
As the amount of information available on the World Wide Web (WWW) increases rapidly, the number of sites that hold particular information also increases. In order to gain some insight into site usage, system administrators need tools that can aid in analyzing the usage of their sites. To achieve this goal, the use of web mining tools is necessary to discover the usage patterns of a particular site. For the purpose of this study, server logs from an educational portal were retrieved, pre-processed and analyzed. The information collected by web servers is kept in the server logs and used as the main source of data for analyzing users' navigation patterns. Once the server logs have been preprocessed and sessions have been obtained, there are several kinds of access pattern mining that can be performed, depending on the needs of the analyst. In this study, the data mining technique known as the Generalized Association Rule was used in order to get some insights into website usage patterns. The findings from this study provide an overview of the usage pattern of a particular educational portal. The study also demonstrates how the Generalized Association Rule can be used in site usage analysis. Such a technique enables the discovery of hidden information within web server logs using data mining techniques.
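The support/confidence machinery behind association-rule mining can be illustrated on toy session data. Note that this sketch mines only plain pairwise rules, not the generalized, taxonomy-aware variant the study uses, and the page names and thresholds are invented:

```python
from itertools import combinations

def association_rules(sessions, min_support=0.4, min_conf=0.6):
    """Mine one-to-one rules (A -> B) from page-visit sessions.
    support(A,B) = fraction of sessions containing both pages;
    confidence(A -> B) = support(A,B) / support(A)."""
    n = len(sessions)
    count = {}
    for s in sessions:
        items = set(s)  # ignore repeat visits within a session
        for it in items:
            count[frozenset([it])] = count.get(frozenset([it]), 0) + 1
        for pair in combinations(sorted(items), 2):
            count[frozenset(pair)] = count.get(frozenset(pair), 0) + 1
    rules = []
    for key, c in count.items():
        if len(key) != 2 or c / n < min_support:
            continue
        a, b = tuple(key)
        for ante, cons in ((a, b), (b, a)):
            conf = c / count[frozenset([ante])]
            if conf >= min_conf:
                rules.append((ante, cons, c / n, conf))
    return rules

# toy sessions from a hypothetical educational portal
sessions = [["home", "courses", "exams"],
            ["home", "courses"],
            ["home", "forum"],
            ["courses", "exams"]]
for rule in association_rules(sessions):
    print(rule)
```

On these sessions the rule exams -> courses reaches confidence 1.0: every session that visited the exams page also visited the courses page.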
Application of ART neural networks and texture analysis to classifying the weathering state of mineral aggregates
A new approach to identifying the weathering state of mineral aggregates intended for civil construction works is presented. Such identification is of fundamental importance to avoid failures and the occurrence of premature defects in construction works, which can be attributed to the quality of the aggregate used with respect to its weathering state. Image processing techniques are employed to acquire the histograms of the images' color channels, followed by the computation of the histograms' entropy, which provides the main features for classification. Finally, an incremental knowledge-acquisition and classification model employing ART (Adaptive Resonance Theory) neural networks is built to automate the classification process. The classification model is organized in two stages. In the first stage, aggregates are classified as weathered or unweathered; in the second stage, the group of weathered aggregates is classified by degree of weathering. The proposed model yields better classification results than those obtained with other classification algorithms.
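The feature-extraction step described above — per-channel color histograms followed by their Shannon entropy — can be sketched as follows. The pixel data, function names and 8-bit RGB assumption are illustrative, not taken from the paper:

```python
import math

def channel_entropy(pixels, channel, bins=256):
    """Shannon entropy (in bits) of one color channel's histogram.
    pixels: list of (r, g, b) tuples with values in 0..255."""
    hist = [0] * bins
    for px in pixels:
        hist[px[channel]] += 1
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in hist if c > 0)

def entropy_features(pixels):
    """Three-element feature vector: entropy of the R, G and B histograms."""
    return [channel_entropy(pixels, ch) for ch in range(3)]

# a uniform patch has zero entropy; a two-valued channel has entropy 1 bit
flat = [(10, 20, 30)] * 8
mixed = [(0, 20, 30)] * 4 + [(255, 20, 30)] * 4
print(entropy_features(flat), entropy_features(mixed))
```

Low entropy indicates a uniform texture while high entropy indicates a varied one, which is why the histogram entropies can serve as compact inputs to the ART classifier.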