Extracting predictive models from marked-up free-text documents at the Royal Botanic Gardens, Kew, London
In this paper we explore the combination of text mining, unsupervised and supervised learning to extract predictive models from a corpus of digitised historical floras. These documents deal with the nomenclature, geographical distribution, ecology and comparative morphology of the species of a region. Here we exploit the fact that portions of text in the floras are marked up as different types of trait and habitat. We infer models from these different texts that can predict different habitat types based upon the traits of plant species. We also integrate plant taxonomy data in order to assist in the validation of our models. We have shown that by clustering text describing the habitat of different floras we can identify a number of important and distinct habitats that are associated with particular families of species, along with statistical significance scores. We have also shown that by using these discovered habitat types as labels for supervised learning we can predict them based upon a subset of traits, identified using wrapper feature selection.
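The wrapper feature selection mentioned above can be sketched as a greedy forward search that keeps adding the trait which most improves a classifier's accuracy. The sketch below assumes a simple leave-one-out nearest-centroid classifier and toy binary trait data; neither is from the paper:

```python
import numpy as np

def nearest_centroid_acc(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        cents = {c: X[mask][y[mask] == c].mean(axis=0) for c in np.unique(y[mask])}
        pred = min(cents, key=lambda c: np.linalg.norm(X[i] - cents[c]))
        correct += pred == y[i]
    return correct / len(y)

def wrapper_forward_select(X, y):
    """Greedy wrapper selection: add the trait that most improves the
    wrapped classifier's accuracy; stop when no remaining trait helps."""
    selected, best = [], 0.0
    while True:
        scores = {j: nearest_centroid_acc(X[:, selected + [j]], y)
                  for j in range(X.shape[1]) if j not in selected}
        if not scores:
            return selected, best
        j, s = max(scores.items(), key=lambda kv: kv[1])
        if s <= best:
            return selected, best
        selected.append(j)
        best = s
```

On data where only one trait tracks the habitat label, the search selects that trait and stops.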
Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology
This paper presents some experiments in clustering homogeneous XML documents to validate an existing classification or, more generally, an organisational structure. Our approach integrates techniques for extracting knowledge from documents with unsupervised classification (clustering) of documents. We focus on the feature selection used for representing documents and its impact on the emerging classification. We mix the selection of structured features with fine textual selection based on syntactic characteristics. We illustrate and evaluate this approach with a collection of Inria activity reports for the year 2003. The objective is to cluster projects into larger groups (Themes), based on the keywords or different chapters of these activity reports. We then compare the results of clustering using different feature selections with the official theme structure used by Inria.
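The mix of structured and textual features described above can be sketched by extracting both tag-path features and word features from an XML document. The `path:`/`word:` feature naming below is an illustrative assumption, not the authors' actual scheme:

```python
import xml.etree.ElementTree as ET
from collections import Counter

def xml_features(xml_text):
    """Combine structured features (tag paths) with textual features
    (word tokens) into a single bag-of-features representation."""
    root = ET.fromstring(xml_text)
    feats = Counter()

    def walk(node, path):
        path = path + "/" + node.tag
        feats["path:" + path] += 1                 # structured feature
        for tok in (node.text or "").split():
            feats["word:" + tok.lower()] += 1      # textual feature
        for child in node:
            walk(child, path)

    walk(root, "")
    return feats

doc = "<report><keywords>clustering XML</keywords><theme>data mining</theme></report>"
f = xml_features(doc)
```

The resulting feature counter can then be fed to any standard clustering algorithm.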
Positive Data Clustering based on Generalized Inverted Dirichlet Mixture Model
Recent advances in the processing and networking capabilities of computers have caused an accumulation of immense amounts of multimodal multimedia data (image, text, video). These data are generally presented as high-dimensional feature vectors. The availability of these high-dimensional data sets has provided the input to a large variety of statistical learning applications, including clustering, classification, feature selection, outlier detection and density estimation. In this context, a finite mixture offers a formal approach to clustering and a powerful tool to tackle the problem of data modeling. A mixture model assumes that the data are generated by a set of parametric probability distributions. The main learning process of a mixture model consists of two parts: parameter estimation and model selection (estimating the number of components). In addition, other issues may be considered during the learning process of mixture models, such as a) feature selection and b) outlier detection. The main objective of this thesis is to work with different kinds of estimation criteria and to incorporate these challenges into a single framework.
The first contribution of this thesis is to propose a statistical framework which can tackle the problem
of parameter estimation, model selection, feature selection, and outlier rejection in a unified
model. We propose to use feature saliency and introduce an expectation-maximization (EM) algorithm
for the estimation of the Generalized Inverted Dirichlet (GID) mixture model. By using
the Minimum Message Length (MML), we can identify how much each feature contributes to
our model as well as determine the number of components. The presence of outliers is an added
challenge and is handled by incorporating an auxiliary outlier component, to which we associate a uniform density. Experimental results on synthetic data, as well as real world applications involving
visual scenes and object classification, indicate that the proposed approach is promising, even when a low-dimensional representation of the data is applied. In addition, they show the importance of embedding an outlier component in the proposed model. EM learning suffers
from significant drawbacks. To overcome them, our second contribution is a learning approach based on a Bayesian framework, which estimates the posterior distributions of the parameters while incorporating prior knowledge about them.
Calculation of the posterior distribution of each parameter in the model is done by using Markov
chain Monte Carlo (MCMC) simulation methods - namely, the Gibbs sampling and the Metropolis-
Hastings methods. The Bayesian Information Criterion (BIC) was used for model selection. The
proposed model was validated on object classification and forgery detection applications. For the
first two contributions, we developed a finite GID mixture. However, in the third contribution,
we propose an infinite GID mixture model. The proposed model simultaneously tackles the clustering
and feature selection problems. The proposed learning model is based on Gibbs sampling.
The effectiveness of the proposed method is shown on an image categorization application. Our
last contribution in this thesis is another fully Bayesian approach for a finite GID mixture learning
model using the Reversible Jump Markov Chain Monte Carlo (RJMCMC) technique. The
proposed algorithm allows for the simultaneous handling of model selection and parameter estimation for high-dimensional data. The merits of this approach are investigated using synthetic data and data from a challenging application, namely object detection.
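The EM parameter estimation the thesis builds on alternates an E-step (computing responsibilities) with an M-step (re-estimating parameters). As a simplified stand-in for the GID mixture, here is EM for a two-component one-dimensional Gaussian mixture; the GID density itself is more involved, so this only illustrates the structure of the algorithm:

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture (illustrative stand-in
    for the Generalized Inverted Dirichlet mixture described above)."""
    mu = np.array([x.min(), x.max()])      # symmetry-breaking initialisation
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = (pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2)
                / (sigma * np.sqrt(2 * np.pi)))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights, means and variances
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma, r
```

On well-separated data the two means converge to the two cluster centres, and each row of responsibilities sums to one.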
Transforming Graph Representations for Statistical Relational Learning
Relational data representations have become an increasingly important topic
due to the recent proliferation of network datasets (e.g., social, biological,
information networks) and a corresponding increase in the application of
statistical relational learning (SRL) algorithms to these domains. In this
article, we examine a range of representation issues for graph-based relational
data. Since the choice of relational data representation for the nodes, links,
and features can dramatically affect the capabilities of SRL algorithms, we
survey approaches and opportunities for relational representation
transformation designed to improve the performance of these algorithms. This
leads us to introduce an intuitive taxonomy for data representation
transformations in relational domains that incorporates link transformation and
node transformation as symmetric representation tasks. In particular, the
transformation tasks for both nodes and links include (i) predicting their
existence, (ii) predicting their label or type, (iii) estimating their weight
or importance, and (iv) systematically constructing their relevant features. We
motivate our taxonomy through detailed examples and use it to survey and
compare competing approaches for each of these tasks. We also discuss general
conditions for transforming links, nodes, and features. Finally, we highlight
challenges that remain to be addressed.
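Task (i) above, predicting link existence, is often approached by scoring non-adjacent node pairs. A minimal sketch using the common-neighbours heuristic (one of many possible scores, not one specific to this survey) is:

```python
def common_neighbor_scores(edges):
    """Score each non-adjacent node pair by the number of shared
    neighbours; higher scores suggest a more likely missing link."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = sorted(adj)
    scores = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v not in adj[u]:                     # only unlinked pairs
                scores[(u, v)] = len(adj[u] & adj[v])
    return scores
```

For a 4-cycle a-b-c-d-a, the two diagonal pairs each share two neighbours and are the top candidates for link prediction.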
CEAI: CCM based Email Authorship Identification Model
In this paper we present a model for email authorship identification (EAI) by
employing a Cluster-based Classification Model (CCM). Traditionally,
stylometric features have been successfully employed in various authorship
analysis tasks; we extend the traditional feature-set to include some more
interesting and effective features for email authorship identification (e.g.
the last punctuation mark used in an email, the tendency of an author to use
capitalization at the start of an email, or the punctuation after a greeting or
farewell). We also include content features selected using Info Gain feature selection.
It is observed that the use of such features in the authorship identification
process has a positive impact on the accuracy of the authorship identification
task. We performed experiments to justify our arguments and compared the
results with other baseline models. Experimental results reveal that the
proposed CCM-based email authorship identification model, along with the
proposed feature set, outperforms the state-of-the-art support vector machine
(SVM)-based models, as well as the models proposed by Iqbal et al. [1, 2]. The
proposed model attains an accuracy of 94% for 10 authors, 89% for 25 authors, and 81% for 50 authors on the Enron dataset, while 89.5% accuracy is achieved on a real email dataset constructed by the authors. The results on the Enron dataset are achieved on a considerably larger number of authors than in the models proposed by Iqbal et al. [1, 2].
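The extended stylometric features described in the abstract (last punctuation mark used, capitalization at the start of an email) can be sketched as follows; the exact feature set and names here are illustrative, not the authors':

```python
import re

def email_style_features(text):
    """Extract a few illustrative stylometric features from an email body,
    in the spirit of the extended feature set described above."""
    punct = re.findall(r"[.!?,;:]", text)
    words = text.split()
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return {
        # last punctuation mark appearing anywhere in the email
        "last_punct": punct[-1] if punct else "",
        # does the author capitalize the first character of the email?
        "starts_capitalized": bool(lines) and lines[0].strip()[0].isupper(),
        # a classic stylometric baseline feature
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
    }
```

Such per-email feature dictionaries would then be vectorized and passed to the classifier alongside the content features.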
A niching memetic algorithm for simultaneous clustering and feature selection
Clustering is inherently a difficult task, and is made even more difficult when the selection of relevant features is also an issue. In this paper we propose an approach for simultaneous clustering and feature selection using a niching memetic algorithm. Our approach (which we call NMA_CFS) makes feature selection an integral part of the global clustering search procedure and attempts to avoid settling on less promising locally optimal solutions in both clustering and feature selection, without making any a priori assumption about the number of clusters. Within the NMA_CFS procedure, a variable composite representation is devised to encode both feature selection and cluster centers with different numbers of clusters. Further, local search operations are introduced to refine feature selection and cluster centers encoded in the chromosomes. Finally, a niching method is integrated to preserve the population diversity and prevent premature convergence. In an experimental evaluation we demonstrate the effectiveness of the proposed approach and compare it with other related approaches, using both synthetic and real data.
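A variable composite representation of this kind could, for example, encode a binary feature mask followed by flattened cluster centres over the selected features. The decoding below is an assumption about one possible layout, not the paper's exact encoding:

```python
import numpy as np

def decode_chromosome(chrom, n_features):
    """Decode a chromosome laid out as [feature mask | flattened centers].

    The first n_features entries form a binary feature-selection mask;
    the remainder is reshaped into cluster centers over the selected
    features, so the number of clusters can vary between chromosomes.
    """
    mask = np.asarray(chrom[:n_features], dtype=bool)
    d = int(mask.sum())                       # number of selected features
    centers = np.asarray(chrom[n_features:], dtype=float).reshape(-1, d)
    return mask, centers
```

A chromosome with a longer tail simply decodes to more cluster centres, which is what lets the search vary the number of clusters without a priori assumptions.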
Simulated evaluation of faceted browsing based on feature selection
In this paper we explore the limitations of facet-based browsing, which uses sub-needs of an information need for querying and organising the search process in video retrieval. The underlying assumption of this approach is that search effectiveness will be enhanced if such an approach is employed for interactive video retrieval using textual and visual features. We explore the performance bounds of a faceted system by carrying out a simulated user evaluation on TRECVid data sets, and also on the logs of a prior user experiment with the system. We first present a methodology to reduce the dimensionality of features by selecting the most important ones. Then, we discuss the simulated evaluation strategies employed in our evaluation and their effect on the use of both textual and visual features. Facets created by users are simulated by clustering video shots using textual and visual features. The experimental results of our study demonstrate that the faceted browser can potentially improve search effectiveness.
Taming Wild High Dimensional Text Data with a Fuzzy Lash
The bag-of-words (BOW) model represents a corpus as a matrix whose elements are word frequencies. However, each row of this matrix is a very high-dimensional, sparse vector. Dimension reduction (DR) is a popular way to address the sparsity and high-dimensionality issues. Among the different strategies for developing DR methods, Unsupervised Feature Transformation (UFT) is a popular one that maps all words onto a new basis to represent the BOW. The recent growth of text data and its challenges imply that the DR area still needs new perspectives. Although a wide range of methods based on the UFT strategy has been developed, the fuzzy approach has not been considered for DR based on this strategy. This research investigates the application of fuzzy clustering as a DR method based on the UFT strategy to collapse the BOW matrix into a lower-dimensional representation of the documents, rather than the words, in a corpus. The quantitative evaluation shows that fuzzy clustering produces performance and features superior to those of Principal Component Analysis (PCA) and Singular Value Decomposition (SVD), two popular DR methods based on the UFT strategy.
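The idea of using fuzzy clustering for DR can be sketched with a plain fuzzy c-means implementation whose (documents × clusters) membership matrix serves as the lower-dimensional representation. The parameter choices below (fuzzifier m = 2, random initialisation, fixed iteration count) are illustrative, not the paper's:

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, n_iter=100, seed=0):
    """Fuzzy c-means on a BOW matrix X (n_docs x n_words).

    Returns (U, V): the (n_docs, c) membership matrix U, usable as a
    c-dimensional document representation, and the cluster centers V.
    """
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)          # rows sum to 1
    for _ in range(n_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]            # weighted centers
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-9
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)            # membership update
    return U, V
```

Documents with similar word profiles end up with similar membership rows, so U can replace the sparse BOW rows as input to downstream tasks.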