6,706 research outputs found
State of the Art, Evaluation and Recommendations regarding "Document Processing and Visualization Techniques"
Several Networks of Excellence have been set up in the framework of the
European FP5 research program. Among these Networks of Excellence, the NEMIS
project focuses on the field of Text Mining.
Within this field, document processing and visualization was identified as
one of the key topics and the WG1 working group was created in the NEMIS
project, to carry out a detailed survey of techniques associated with the text
mining process and to identify the relevant research topics in related research
areas.
In this document we present the results of this comprehensive survey. The
report includes a description of the current state-of-the-art and practice, a
roadmap for follow-up research in the identified areas, and recommendations for
anticipated technological development in the domain of text mining.Comment: 54 pages, Report of Working Group 1 for the European Network of
Excellence (NoE) in Text Mining and its Applications in Statistics (NEMIS
Adaptive Evolutionary Clustering
In many practical applications of clustering, the objects to be clustered
evolve over time, and a clustering result is desired at each time step. In such
applications, evolutionary clustering typically outperforms traditional static
clustering by producing clustering results that reflect long-term trends while
being robust to short-term variations. Several evolutionary clustering
algorithms have recently been proposed, often by adding a temporal smoothness
penalty to the cost function of a static clustering method. In this paper, we
introduce a different approach to evolutionary clustering by accurately
tracking the time-varying proximities between objects followed by static
clustering. We present an evolutionary clustering framework that adaptively
estimates the optimal smoothing parameter using shrinkage estimation, a
statistical approach that improves a naive estimate using additional
information. The proposed framework can be used to extend a variety of static
clustering algorithms, including hierarchical, k-means, and spectral
clustering, into evolutionary clustering algorithms. Experiments on synthetic
and real data sets indicate that the proposed framework outperforms static
clustering and existing evolutionary clustering algorithms in many scenarios.Comment: To appear in Data Mining and Knowledge Discovery, MATLAB toolbox
available at http://tbayes.eecs.umich.edu/xukevin/affec
Survey of state-of-the-art mixed data clustering algorithms
Mixed data comprises both numeric and categorical features, and mixed
datasets occur frequently in many domains, such as health, finance, and
marketing. Clustering is often applied to mixed datasets to find structures and
to group similar objects for further analysis. However, clustering mixed data
is challenging because it is difficult to directly apply mathematical
operations, such as summation or averaging, to the feature values of these
datasets. In this paper, we present a taxonomy for the study of mixed data
clustering algorithms by identifying five major research themes. We then
present a state-of-the-art review of the research works within each research
theme. We analyze the strengths and weaknesses of these methods with pointers
for future research directions. Lastly, we present an in-depth analysis of the
overall challenges in this field, highlight open research questions and discuss
guidelines to make progress in the field.Comment: 20 Pages, 2 columns, 6 Tables, 209 Reference
Distributed dual vigilance fuzzy adaptive resonance theory learns online, retrieves arbitrarily-shaped clusters, and mitigates order dependence
This paper presents a novel adaptive resonance theory (ART)-based modular
architecture for unsupervised learning, namely the distributed dual vigilance
fuzzy ART (DDVFA). DDVFA consists of a global ART system whose nodes are local
fuzzy ART modules. It is equipped with the distinctive features of distributed
higher-order activation and match functions, using dual vigilance parameters
responsible for cluster similarity and data quantization. Together, these allow
DDVFA to perform unsupervised modularization, create multi-prototype clustering
representations, retrieve arbitrarily-shaped clusters, and control its
compactness. Another important contribution is the reduction of
order-dependence, an issue that affects any agglomerative clustering method.
This paper demonstrates two approaches for mitigating order-dependence:
preprocessing using visual assessment of cluster tendency (VAT) or
postprocessing using a novel Merge ART module. The former is suitable for batch
processing, whereas the latter can be used in online learning. Experimental
results in the online learning mode carried out on 30 benchmark data sets show
that DDVFA cascaded with Merge ART statistically outperformed the best other
ART-based systems when samples were randomly presented. Conversely, they were
found to be statistically equivalent in the offline mode when samples were
pre-processed using VAT. Remarkably, performance comparisons to non-ART-based
clustering algorithms show that DDVFA (which learns incrementally) was also
statistically equivalent to the non-incremental (offline) methods of DBSCAN,
single linkage hierarchical agglomerative clustering (HAC), and k-means, while
retaining the appealing properties of ART. Links to the source code and data
are provided. Considering the algorithm's simplicity, online learning
capability, and performance, it is an ideal choice for many agglomerative
clustering applications
Online Machine Learning in Big Data Streams
The area of online machine learning in big data streams covers algorithms
that are (1) distributed and (2) work from data streams with only a limited
possibility to store past data. The first requirement mostly concerns software
architectures and efficient algorithms. The second one also imposes nontrivial
theoretical restrictions on the modeling methods: In the data stream model,
older data is no longer available to revise earlier suboptimal modeling
decisions as the fresh data arrives.
In this article, we provide an overview of distributed software architectures
and libraries as well as machine learning models for online learning. We
highlight the most important ideas for classification, regression,
recommendation, and unsupervised modeling from streaming data, and we show how
they are implemented in various distributed data stream processing systems.
This article is a reference material and not a survey. We do not attempt to
be comprehensive in describing all existing methods and solutions; rather, we
give pointers to the most important resources in the field. All related
sub-fields, online algorithms, online learning, and distributed data processing
are hugely dominant in current research and development with conceptually new
research results and software components emerging at the time of writing. In
this article, we refer to several survey results, both for distributed data
processing and for online machine learning. Compared to past surveys, our
article is different because we discuss recommender systems in extended detail
Survey of data mining approaches to user modeling for adaptive hypermedia
The ability of an adaptive hypermedia system to create tailored environments depends mainly on the amount and accuracy of information stored in each user model. Some of the difficulties that user modeling faces are the amount of data available to create user models, the adequacy of the data, the noise within that data, and the necessity of capturing the imprecise nature of human behavior. Data mining and machine learning techniques have the ability to handle large amounts of data and to process uncertainty. These characteristics make these techniques suitable for automatic generation of user models that simulate human decision making. This paper surveys different data mining techniques that can be used to efficiently and accurately capture user behavior. The paper also presents guidelines that show which techniques may be used more efficiently according to the task implemented by the applicatio
ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"
This paper documents the release of the ELKI data mining framework, version
0.7.5.
ELKI is an open source (AGPLv3) data mining software written in Java. The
focus of ELKI is research in algorithms, with an emphasis on unsupervised
methods in cluster analysis and outlier detection. In order to achieve high
performance and scalability, ELKI offers data index structures such as the
R*-tree that can provide major performance gains. ELKI is designed to be easy
to extend for researchers and students in this domain, and welcomes
contributions of additional methods. ELKI aims at providing a large collection
of highly parameterizable algorithms, in order to allow easy and fair
evaluation and benchmarking of algorithms.
We will first outline the motivation for this release, the plans for the
future, and then give a brief overview over the new functionality in this
version. We also include an appendix presenting an overview on the overall
implemented functionality
Parallel and Distributed Collaborative Filtering: A Survey
Collaborative filtering is amongst the most preferred techniques when
implementing recommender systems. Recently, great interest has turned towards
parallel and distributed implementations of collaborative filtering algorithms.
This work is a survey of the parallel and distributed collaborative filtering
implementations, aiming not only to provide a comprehensive presentation of the
field's development, but also to offer future research orientation by
highlighting the issues that need to be further developed.Comment: 46 page
Scaling Bayesian network discovery through incremental recovery
Bayesian networks are a type of graphical models that, e.g., allow one to analyze the interaction among the variables in a database. A well-known problem with the discovery of such models from a database is the ``problem of high-dimensionality''. That is, the discovery of a network from a database with a moderate to large number of variables quickly becomes intractable. Most solutions towards this problem have relied on prior knowledge on the structure of the network, e.g., through the definition of an order on the variables. With a growing number of variables, however, this becomes a considerable burden on the data miner. Moreover, mistakes in such prior knowledge have large effects on the final network. Another approach is rather than asking the expert insight in the structure of the final network, asking the database. Our work fits in this approach. More in particular, before we start recovering the network, we first cluster the variables based on a chi-squared measure of association. Then we use an incremental algorithm to discover the network. This algorithm uses the small networks discovered for the individual clusters of variables as its starting point. We illustrate the feasibility of our approach with some experiments. More in particular, we show that in the case where one knows the network, and thus the order, our algorithm yields almost the same network which is, moreover, still an I-map
Coping With New Challengens for Density-Based Clustering
Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. The core step of the KDD process is the application of a Data Mining algorithm in order to produce a particular enumeration of patterns and relationships in large databases. Clustering is one of the major data mining tasks and aims at grouping the data objects into meaningful classes (clusters) such that the similarity of objects within clusters is maximized, and the similarity of objects from different clusters is minimized. Beside many others, the density-based clustering notion underlying the algorithm DBSCAN and its hierarchical extension OPTICS has been proposed recently, being one of the most successful approaches to clustering.
In this thesis, our aim is to advance the state-of-the-art clustering, especially density-based clustering by identifying novel challenges for density-based clustering and proposing innovative and solid solutions for these challenges.
We describe the development of the industrial prototype BOSS (Browsing OPTICS plots for Similarity Search) which is a first step towards developing a comprehensive, scalable and distributed computing solution designed to make the efficiency and analytical capabilities of OPTICS available to a broader audience. For the development of BOSS, several key enhancements of OPTICS are required which are addressed in this thesis. We develop incremental algorithms of OPTICS to efficiently reconstruct the hierarchical clustering structure in frequently updated databases, in particular, when a set of objects is inserted in or deleted from the database. We empirically show that these incremental algorithms yield significant speed-up factors over the original OPTICS algorithm. Furthermore, we propose a novel algorithm for automatic extraction of clusters from hierarchical clustering representations that outperforms comparative methods, and introduce two novel approaches for selecting meaningful representatives, using the density-based concepts of OPTICS and producing better results than the related medoid approach.
Another major challenge for density-based clustering is to cope with high dimensional data. Many today's real-world data sets contain a large number of measurements (or features) for a single data object. Usually, global feature reduction techniques cannot be applied to these data sets. Thus, the task of feature selection must be combined with and incooperated into the clustering process. In this thesis, we present original extensions and enhancements of the density-based clustering notion to cope with high dimensional data. In particular, we propose an algorithm called SUBCLU (density based SUBspace CLUstering) that extends DBSCAN to the problem of subspace clustering. SUBCLU efficiently computes all clusters that would have been found if DBSCAN is applied to all possible subspaces of the feature space. An experimental evaluation on real-world data sets illustrates that SUBCLU is more effective than existing subspace clustering algorithms because it is able to find clusters of arbitrary size and shape, and produces determine results. A semi-hierarchical extension of SUBCLU called RIS (Ranking Interesting Subspaces) is proposed that does not compute the subspace clusters directly, but generates a list of subspaces ranked by their clustering characteristics. A hierarchical clustering algorithm can be applied to these interesting subspaces in order to compute a hierarchical (subspace) clustering. A comparative evaluation of RIS and SUBCLU shows that RIS in combination with OPTICS can achieve an information gain over SUBCLU. In addition, we propose the algorithm 4C (Computing Correlation Connected Clusters) that extends the concepts of DBSCAN to compute density-based correlation clusters. 4C benefits from an innovative, well-defined and effective clustering model, outperforming related approaches in terms of clustering quality on real-world data sets.Knowledge Discovery in Databases (KDD) ist der Prozess der (semi-)automatischen Extraktion von Wissen aus Datenbanken, das gültig, bisher unbekannt und potentiell nützlich für eine gegebene Anwendung ist. Der zentrale Schritt des KDD-Prozesses ist das Data Mining. Eine der wichtigsten Aufgaben des Data Mining ist Clustering. Dabei sollen die Objekte einer Datenbank in Gruppen (Cluster) partitioniert werden, so dass Objekte eines Clusters möglichst ähnlich und Objekte verschiedener Cluster möglichst unähnlich zu einander sind. Das dichtebasierte Clustermodell und die darauf aufbauenden Algorithmen DBSCAN und OPTICS sind unter einer Vielzahl anderer Clustering-Ansätze eine der erfolgreichsten Methoden zum Clustering.
Im Rahmen dieser Dissertation wollen wir den aktuellen Stand der Technik im Bereich Clustering und speziell im Bereich dichtebasiertes Clustering voranbringen. Dazu erarbeiten wir neue Herausforderungen für das dichtebasierte Clustermodell und schlagen dazu innovative Lösungen vor.
Zunächst steht die Entwicklung des industriellen Prototyps BOSS (Browsing OPTICS plots for Similarity Search) im Mittelpunkt dieser Arbeit. BOSS ist ein erster Beitrag zu einer umfassenden, skalierbaren und verteilten Softwarelösung, die eine Nutzung der Effizienzvorteile und die analytischen Möglichkeiten des dichtebasierten, hierarchischen Clustering-Algorithmus OPTICS für ein breites Publikum ermöglichen. Zur Entwicklung von BOSS werden drei entscheidende Erweiterungen von OPTICS benötigt: Wir entwickeln eine inkrementelle Version von OPTICS um nach einem Update der Datenbank (Einfügen/Löschen einer Menge von Objekten) die hierarchische Clustering Struktur effizient zu reorganisieren. Anhand von Experimenten mit synthetischen und realen Daten zeigen wir, dass die vorgeschlagenen, inkrementellen Algorithmen deutliche Beschleunigungsfaktoren gegenüber dem originalen OPTICS-Algorithmus erzielen. Desweiteren schlagen wir einen neuen Algorithmus zur automatischen Clusterextraktion aus hierarchischen Repräsentationen und zwei innovative Methoden zur automatischen Auswahl geeigneter Clusterrepräsentaten vor. Unsere neuen Techniken erzielen bei Tests auf mehreren realen Datenbanken im Vergleich zu den konkurrierenden Verfahren bessere Ergebnisse.
Eine weitere Herausforderung für Clustering-Verfahren stellen hochdimensionale Featureräume dar. Reale Datensätze beinhalten dank moderner Verfahren zur Datenerhebung häufig sehr viele Merkmale. Teile dieser Merkmale unterliegen oft Rauschen oder Abhängigkeiten und können meist nicht im Vorfeld ausgesiebt werden, da diese Effekte jeweils in Teilen der Datenbank unterschiedlich ausgeprägt sind. Daher muss die Wahl der Features mit dem Data-Mining-Verfahren verknüpft werden. Im Rahmen dieser Arbeit stellen wir innovative Erweiterungen des dichtebasierten Clustermodells für hochdimensionale Daten vor. Wir entwickeln SUBCLU (dichtebasiertes SUBspace CLUstering), ein auf DBSCAN basierender Subspace Clustering Algorithmus. SUBCLU erzeugt effizient alle Cluster, die gefunden werden, wenn man DBSCAN auf alle möglichen Teilräume des Datensatzes anwendet. Experimente auf realen Daten zeigen, dass SUBCLU effektiver als vergleichbare Algorithmen ist. RIS (Ranking Interesting Subspaces), eine semi-hierarchische Erweiterung von SUBCLU, wird vorgeschlagen, das nicht mehr direkt die Teilraumcluster berechnet, sondern eine Liste von Teilräumen geordnet anhand ihrer Clustering-Qualität erzeugt. Dadurch können hierarchische Partitionierungen auf ausgewählten Teilräumen erzeugt werden. Experimente belegen, dass RIS in Kombination mit OPTICS ein Informationsgewinn gegenüber SUBCLU erreicht. Außerdem stellen wir den neuartigen Korrelationscluster Algorithmus 4C (Computing Correlation Connected Clusters) vor. 4C basiert auf einem innovativen und wohldefinierten Clustermodell und erzielt in unseren Experimenten mit realen Daten bessere Ergebnisse als vergleichbare Clustering-Ansätze
- …