Random Indexing K-tree
Random Indexing (RI) K-tree is the combination of two algorithms for
clustering. Many large scale problems exist in document clustering. RI K-tree
scales well with large inputs due to its low complexity. It also exhibits
features that are useful for managing a changing collection. Furthermore, it
solves previous issues with sparse document vectors when using K-tree. The
algorithms and data structures are defined, explained and motivated. Specific
modifications to K-tree are made for use with RI. Experiments have been
executed to measure quality. The results indicate that RI K-tree improves
document cluster quality over the original K-tree algorithm.
Comment: 8 pages, ADCS 2009; Hyperref and cleveref LaTeX packages conflicted.
Removed clevere
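To illustrate the Random Indexing half of the abstract above, here is a minimal sketch, not the authors' implementation. It assumes the common formulation in which each term gets a sparse ternary index vector and a document vector is the sum of the index vectors of its terms. The names `index_vector` and `document_vector`, and the parameters `dim`, `nonzero` and `seed`, are hypothetical.

```python
import random


def index_vector(term, dim=64, nonzero=4, seed=0):
    """Sparse ternary index vector: `nonzero` entries are +1 or -1, the rest 0.

    Seeding the generator with the term makes the vector reproducible,
    so no term-to-vector table has to be stored (an assumption of this sketch).
    """
    rng = random.Random(f"{seed}:{term}")
    vec = [0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec


def document_vector(tokens, dim=64, nonzero=4, seed=0):
    """Dense, fixed-length document vector: the sum of its terms' index vectors."""
    doc = [0] * dim
    for tok in tokens:
        for i, v in enumerate(index_vector(tok, dim, nonzero, seed)):
            doc[i] += v
    return doc
```

The resulting vectors have a fixed, low dimension regardless of vocabulary size, which is the property the abstract relies on when it mentions sparse document vectors.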
k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples)
Perhaps the most straightforward classifier in the arsenal of machine
learning techniques is the Nearest Neighbour Classifier -- classification is
achieved by identifying the nearest neighbours to a query example and using
those neighbours to determine the class of the query. This approach to
classification is of particular importance because issues of poor run-time
performance are not such a problem these days with the computational power that
is available. This paper presents an overview of techniques for Nearest
Neighbour classification focusing on: mechanisms for assessing similarity
(distance), computational issues in identifying nearest neighbours and
mechanisms for reducing the dimension of the data.
This paper is the second edition of a paper previously published as a
technical report. Sections on similarity measures for time-series, retrieval
speed-up and intrinsic dimensionality have been added. An Appendix is included
providing access to Python code for the key methods.
Comment: 22 pages, 15 figures: An updated edition of an older tutorial on kN
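The classification step described in this abstract can be sketched as follows. This is a minimal illustration, not the code from the paper's appendix. It assumes Euclidean distance and unweighted majority voting; `knn_classify` and its parameters are hypothetical names.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(query, examples, k=3, distance=euclidean):
    """Return the majority class among the k examples nearest to `query`.

    `examples` is a list of (feature_vector, label) pairs.
    """
    neighbours = sorted(examples, key=lambda ex: distance(query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Passing a different `distance` function is one way to experiment with the similarity measures the abstract mentions. This brute-force search compares the query against every example, which is the run-time cost the abstract refers to.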
Seeding hESCs to achieve optimal colony clonality
Human embryonic stem cells (hESCs) and induced pluripotent stem cells (iPSCs) have promising clinical applications which often rely on clonally-homogeneous cell populations. To achieve this, it is important to ensure that each colony originates from a single founding cell and to avoid subsequent merging of colonies during their growth. Clonal homogeneity can be obtained with low seeding densities; however, this leads to low yield and viability. It is therefore important to quantitatively assess how seeding density affects clonality loss so that experimental protocols can be optimised to meet the required standards. Here we develop a quantitative framework for modelling the growth of hESC colonies from a given seeding density based on stochastic exponential growth. This allows us to identify the timescales for colony merges and over which colony size no longer predicts the number of founding cells. We demonstrate the success of our model by applying it to our own experiments of hESC colony growth; while this is based on a particular experimental set-up, the model can be applied more generally to other cell lines and experimental conditions to predict these important timescales
Anisotropic diffusion limited aggregation in three dimensions : universality and nonuniversality
We explore the macroscopic consequences of lattice anisotropy for diffusion limited aggregation (DLA) in three dimensions. Simple cubic and bcc lattice growths are shown to approach universal asymptotic states in a coherent fashion, and the approach is accelerated by the use of noise reduction. These states are strikingly anisotropic dendrites with a rich hierarchy of structure. For growth on an fcc lattice, our data suggest at least two stable fixed points of anisotropy, one matching the bcc case. Hexagonal growths, favoring six planar and two polar directions, appear to approach a line of asymptotic states with continuously tunable polar anisotropy. The more planar of these growths visually resembles real snowflake morphologies. Our simulations use a new and dimension-independent implementation of the DLA model. The algorithm maintains a hierarchy of sphere coverings of the growth, supporting efficient random walks onto the growth by spherical moves. Anisotropy was introduced by restricting growth to certain preferred directions
Visualising the structure of document search results: A comparison of graph theoretic approaches
This is the post-print of the article - Copyright @ 2010 Sage Publications
Previous work has shown that distance-similarity visualisation or ‘spatialisation’ can provide a potentially useful context in which to browse the results of a query search, enabling the user to adopt a simple local foraging or ‘cluster growing’ strategy to navigate through the retrieved document set. However, faithfully mapping feature-space models to visual space can be problematic owing to their inherent high dimensionality and non-linearity. Conventional linear approaches to dimension reduction tend to fail at this kind of task, sacrificing local structural detail in order to preserve a globally optimal mapping. In this paper the clustering performance of a recently proposed algorithm called isometric feature mapping (Isomap), which deals with non-linearity by transforming dissimilarities into geodesic distances, is compared to that of non-metric multidimensional scaling (MDS). Various graph pruning methods, for geodesic distance estimation, are also compared. Results show that Isomap is significantly better at preserving local structural detail than MDS, suggesting it is better suited to cluster growing and other semantic navigation tasks. Moreover, it is shown that applying a minimum-cost graph pruning criterion can provide a parameter-free alternative to the traditional K-neighbour method, resulting in spatial clustering that is equivalent to or better than that achieved using an optimal-K criterion
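The geodesic distance step that this abstract attributes to Isomap can be sketched as below, under stated assumptions. The sketch uses the traditional K-neighbour pruning (not the minimum-cost criterion the article proposes), Euclidean dissimilarities, and Floyd-Warshall shortest paths. The function name `geodesic_distances` is hypothetical.

```python
import math


def geodesic_distances(points, k=2):
    """Approximate geodesic distances over a K-nearest-neighbour graph.

    Each point is linked to its k nearest neighbours, then all-pairs
    shortest paths are computed with Floyd-Warshall (O(n^3), small inputs only).
    Unreachable pairs keep the distance math.inf.
    """
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    g = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        others = (j for j in range(n) if j != i)
        for j in sorted(others, key=lambda j: d[i][j])[:k]:
            g[i][j] = g[j][i] = d[i][j]  # undirected edge
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if g[i][m] + g[m][j] < g[i][j]:
                    g[i][j] = g[i][m] + g[m][j]
    return g
```

For points sampled along a curve, the graph distance between the two ends follows the curve and exceeds the straight-line distance, which is the non-linearity the abstract describes. The resulting matrix would then be passed to an embedding step, which is omitted here.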
Using distributional similarity to organise biomedical terminology
We investigate an application of distributional similarity techniques to the problem of structural organisation of biomedical terminology. Our application domain is the relatively small GENIA corpus. Using terms that have been accurately marked-up by hand within the corpus, we consider the problem of automatically determining semantic proximity. Terminological units are defined for our purposes as normalised classes of individual terms. Syntactic analysis of the corpus data is carried out using the Pro3Gres parser and provides the data required to calculate distributional similarity using a variety of different measures. Evaluation is performed against a hand-crafted gold standard for this domain in the form of the GENIA ontology. We show that distributional similarity can be used to predict semantic type with a good degree of accuracy
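As an illustration of distributional similarity, the sketch below compares terms by the cosine of their context-count vectors. Cosine is only one possible measure and is an assumption here, since the abstract does not list the measures used. The (relation, word) features and all counts are invented toy data, not drawn from the GENIA corpus or from Pro3Gres output.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (feature -> count)."""
    dot = sum(u[f] * v[f] for f in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


# Invented toy counts: each term maps syntactic contexts to frequencies.
contexts = {
    "IL-2": Counter({("obj_of", "express"): 3, ("subj_of", "activate"): 2}),
    "IL-4": Counter({("obj_of", "express"): 2, ("subj_of", "activate"): 1}),
    "T cell": Counter({("subj_of", "express"): 4, ("obj_of", "activate"): 2}),
}

similar = cosine(contexts["IL-2"], contexts["IL-4"])
dissimilar = cosine(contexts["IL-2"], contexts["T cell"])
```

In this toy data the two terms that occur in the same syntactic contexts score higher than the pair that shares none, which is the intuition behind using such scores to predict semantic type.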