On Randomly Projected Hierarchical Clustering with Guarantees
Hierarchical clustering (HC) algorithms are generally limited to small data
instances due to their runtime costs. Here we mitigate this shortcoming and
explore fast HC algorithms based on random projections for single (SLC) and
average (ALC) linkage clustering as well as for the minimum spanning tree
problem (MST). We present a thorough adaptive analysis of our algorithms that
improve prior work from by up to a factor of for a
dataset of points in Euclidean space. The algorithms maintain, with
arbitrarily high probability, the outcome of hierarchical clustering as well as
the worst-case running-time guarantees. We also present parameter-free
instances of our algorithms.
Comment: This version contains the conference paper "On Randomly Projected
Hierarchical Clustering with Guarantees", SIAM International Conference on
Data Mining (SDM), 2014 and, additionally, proofs omitted in the conference
version.
Approximate Matrix Multiplication with Application to Linear Embeddings
In this paper, we study the problem of approximately computing the product of
two real matrices. In particular, we analyze a dimensionality-reduction-based
approximation algorithm due to Sarlos [1], introducing the notion of nuclear
rank as the ratio of the nuclear norm over the spectral norm. The presented
bound has improved dependence with respect to the approximation error (as
compared to previous approaches), whereas the subspace -- on which we project
the input matrices -- has dimensions proportional to the maximum of their
nuclear rank and it is independent of the input dimensions. In addition, we
provide an application of this result to linear low-dimensional embeddings.
Namely, we show that any Euclidean point-set with bounded nuclear rank is
amenable to projection onto a number of dimensions that is independent of the
input dimensionality, while achieving additive error guarantees.
Comment: 8 pages, International Symposium on Information Theor
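The general idea of dimensionality-reduction-based matrix multiplication can be illustrated with a minimal sketch. It assumes a dense Gaussian projection, which is only one possible choice and is not claimed to be the construction or the bound analyzed in the paper; the names `gaussian_sketch` and `approx_product` are hypothetical. Both inputs are projected onto k rows, and the product of the projections is used as an estimate of A^T B.

```python
import math
import random

def gaussian_sketch(k, n, rng):
    """Return a k x n matrix with i.i.d. N(0, 1/k) entries (assumed projection)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(k)]

def matmul(X, Y):
    """Plain product of X (p x q) and Y (q x r) stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def frobenius(X):
    return math.sqrt(sum(v * v for row in X for v in row))

def approx_product(A, B, k, rng):
    """Estimate A^T B by (S A)^T (S B), where S is a k-row Gaussian sketch."""
    S = gaussian_sketch(k, len(A), rng)
    SA, SB = matmul(S, A), matmul(S, B)
    return matmul(transpose(SA), SB)

# Illustrative data: two 50 x 3 matrices, projected down to k = 400 rows
# (k > n here only to keep the toy example accurate; in practice k << n).
rng = random.Random(0)
n, d, k = 50, 3, 400
A = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
B = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

exact = matmul(transpose(A), B)
approx = approx_product(A, B, k, rng)
diff = [[e - a for e, a in zip(re, ra)] for re, ra in zip(exact, approx)]
relative_error = frobenius(diff) / (frobenius(A) * frobenius(B))
print(f"relative Frobenius error: {relative_error:.3f}")
```

The error is reported relative to the product of the Frobenius norms of the inputs, which is the scale at which additive guarantees for sketched products are usually stated.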
Compressive Mining: Fast and Optimal Data Mining in the Compressed Domain
Real-world data typically contain repeated and periodic patterns. This
suggests that they can be effectively represented and compressed using only a
few coefficients of an appropriate basis (e.g., Fourier, Wavelets, etc.).
However, distance estimation when the data are represented using different sets
of coefficients is still a largely unexplored area. This work studies the
optimization problems related to obtaining the tightest lower/upper
bound on Euclidean distances when each data object is potentially compressed
using a different set of orthonormal coefficients. Our technique leads to
tighter distance estimates, which translates into more accurate search,
learning and mining operations directly in the compressed domain.
We formulate the problem of estimating lower/upper distance bounds as an
optimization problem. We establish the properties of optimal solutions, and
leverage the theoretical analysis to develop a fast algorithm to obtain an
exact solution to the problem. The suggested solution provides the
tightest estimation of the L2-norm or the correlation. We show that typical
data-analysis operations, such as k-NN search or k-Means clustering, can
operate more accurately using the proposed compression and distance
reconstruction technique. We compare it with many other prevalent compression
and reconstruction techniques, including random projections and PCA-based
techniques. We highlight a surprising result, namely that when the data are
highly sparse in some basis, our technique may even outperform PCA-based
compression.
The contributions of this work are generic as our methodology is applicable
to any sequential or high-dimensional data as well as to any orthogonal data
transformation used for the underlying data compression scheme.
Comment: 25 pages, 20 figures, accepted in VLD
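To make the notion of lower/upper distance bounds in the compressed domain concrete, here is a minimal sketch of a simpler baseline, not the optimal algorithm of the paper. It assumes that both objects keep coefficients at the same positions of an orthonormal basis and that each stores the energy of its omitted coefficients; the function names are hypothetical. Because the basis is orthonormal, distances between coefficient vectors equal distances between the original objects, and the triangle inequality on the omitted parts gives the bounds.

```python
import math

def compress(coeffs, keep):
    """Keep the coefficients at positions `keep`; store the norm of the rest."""
    kept = {i: coeffs[i] for i in keep}
    residual = math.sqrt(sum(c * c for i, c in enumerate(coeffs) if i not in kept))
    return kept, residual

def distance_bounds(cx, cy):
    """Lower and upper bounds on the Euclidean distance of two compressed objects.

    Assumes both objects kept the same coefficient positions (a simplification).
    """
    (kx, ex), (ky, ey) = cx, cy
    if kx.keys() != ky.keys():
        raise ValueError("this sketch requires identical kept positions")
    known = sum((kx[i] - ky[i]) ** 2 for i in kx)
    # The omitted parts have norms ex and ey, so their distance lies in
    # the interval [|ex - ey|, ex + ey].
    lower = math.sqrt(known + (ex - ey) ** 2)
    upper = math.sqrt(known + (ex + ey) ** 2)
    return lower, upper

# Illustrative coefficient vectors in some orthonormal basis.
x = [4.0, 3.0, 0.5, -0.2, 0.1]
y = [3.5, -1.0, 0.3, 0.4, -0.1]
true_distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
lower, upper = distance_bounds(compress(x, {0, 1}), compress(y, {0, 1}))
print(lower, true_distance, upper)
```

The harder case addressed in the abstract, where each object keeps a different set of coefficients, is what requires solving an optimization problem to get the tightest bounds.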
Scalable and interpretable product recommendations via overlapping co-clustering
We consider the problem of generating interpretable recommendations by
identifying overlapping co-clusters of clients and products, based only on
positive or implicit feedback. Our approach is applicable to very large
datasets because it exhibits almost linear complexity in the input examples and
the number of co-clusters. We show, both on real industrial data and on
publicly available datasets, that the recommendation accuracy of our algorithm
is competitive with that of state-of-the-art matrix factorization techniques. In
addition, our technique has the advantage of offering recommendations that are
textually and visually interpretable. Finally, we examine how to implement our
technique efficiently on Graphical Processing Units (GPUs).
Comment: In IEEE International Conference on Data Engineering (ICDE) 201
Adaptive coarse-grained Monte Carlo simulation of reaction and diffusion dynamics in heterogeneous plasma membranes
Background: An adaptive coarse-grained (kinetic) Monte Carlo (ACGMC) simulation framework is applied to reaction and diffusion dynamics in inhomogeneous domains. The presented model is relevant to the diffusion and dimerization dynamics of epidermal growth factor receptor (EGFR) in the presence of plasma membrane heterogeneity and specifically receptor clustering. We perform simulations representing EGFR cluster dissipation in heterogeneous plasma membranes consisting of higher density clusters of receptors surrounded by low population areas using the ACGMC method. We further investigate the effect of key parameters on the cluster lifetime.
Results: Coarse-graining of dimerization, rather than of diffusion, may lead to computational error. It is shown that the ACGMC method is an effective technique to minimize error in diffusion-reaction processes and is superior to the microscopic kinetic Monte Carlo simulation in terms of computational cost while retaining accuracy. The low computational cost enables sensitivity analysis calculations. Sensitivity analysis indicates that it may be possible to retain clusters of receptors over the time scale of minutes under suitable conditions and the cluster lifetime may depend on both receptor density and cluster size.
Conclusions: The ACGMC method is an ideal platform to resolve large length and time scales in heterogeneous biological systems well beyond the plasma membrane and the EGFR system studied here. Our results demonstrate that cluster size must be considered in conjunction with receptor density, as they synergistically affect EGFR cluster lifetime. Further, the cluster lifetime being of the order of several seconds suggests that any mechanisms responsible for EGFR aggregation must operate on shorter timescales (at most a fraction of a second), to overcome dissipation and produce stable clusters observed experimentally. © 2010 Collins et al; licensee BioMed Central Ltd
Recurrent Urinary Tract Infections due to Asymptomatic Colonic Diverticulitis
Colovesical fistula is a common complication of diverticulitis. Pneumaturia, fecaluria, urinary tract infections, abdominal pain, and dysuria are commonly reported. The authors report a case of colovesical fistula due to asymptomatic diverticulitis, and they emphasize the importance of thoroughly investigating recurrent urinary tract infections even in the absence of bowel symptoms. They also briefly review the literature.
A study on implementing a multithreaded version of the SIRENE detector simulation software for high energy neutrinos
The primary objective of SIRENE is to simulate the response to neutrino
events of any type of high energy neutrino telescope. Additionally, it
implements different geometries for a neutrino detector and different
configurations and characteristics of photo-multiplier tubes (PMTs) inside the
optical modules of the detector through a library of C++ classes. This could
be considered a massive statistical analysis of photo-electrons. The aim of this
work is the development of a multithreaded version of the SIRENE detector
simulation software for high energy neutrinos. This approach allows utilization
of multiple CPU cores leading to a potentially significant decrease in the
required execution time compared to the sequential code. We are making use of
the OpenMP framework for the production of multithreaded code running on the
CPU. Finally, we analyze the feasibility of a GPU-accelerated implementation
Discovering similar multidimensional trajectories
We investigate techniques for analysis and retrieval of object trajectories in a two- or three-dimensional space. Such data usually contain a great amount of noise, which makes all previously used metrics fail. Therefore, here we formalize non-metric similarity functions based on the Longest Common Subsequence (LCSS), which are very robust to noise and furthermore provide an intuitive notion of similarity between trajectories by giving more weight to the similar portions of the sequences. Stretching of sequences in time is allowed, as well as global translation of the sequences in space. Efficient approximate algorithms that compute these similarity measures are also provided. We compare these new methods to the widely used Euclidean and Time Warping distance functions (for real and synthetic data) and show the superiority of our approach, especially under the strong presence of noise. We prove a weaker version of the triangle inequality and employ it in an indexing structure to answer nearest neighbor queries. Finally, we present experimental results that validate the accuracy and efficiency of our approach.
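An LCSS-based similarity between two trajectories can be sketched with a small dynamic program. This is a minimal illustration under stated assumptions, not the authors' implementation: the spatial tolerance `eps` and the time window `delta` are parameters introduced here for illustration, and the spatial translation and the approximate algorithms mentioned in the abstract are left out. Two points match when every coordinate differs by at most `eps` and their indices differ by at most `delta`; points that match nothing, such as noisy outliers, are simply skipped.

```python
def lcss_length(a, b, eps, delta):
    """Length of the longest common subsequence of trajectories a and b."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Skipping a point of either trajectory is always allowed.
            best = max(table[i - 1][j], table[i][j - 1])
            close = all(abs(p - q) <= eps for p, q in zip(a[i - 1], b[j - 1]))
            if close and abs(i - j) <= delta:
                best = max(best, table[i - 1][j - 1] + 1)
            table[i][j] = best
    return table[n][m]

def lcss_similarity(a, b, eps, delta):
    """LCSS length normalized by the shorter trajectory, in [0, 1]."""
    if not a or not b:
        return 0.0
    return lcss_length(a, b, eps, delta) / min(len(a), len(b))

# Illustrative 2-D trajectories; the third point of `noisy` is an outlier.
clean = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
noisy = [(0.0, 0.1), (1.0, 0.9), (50.0, 50.0), (3.0, 3.1)]
similarity = lcss_similarity(clean, noisy, eps=0.5, delta=1)
print(similarity)
```

The outlier costs one unmatched point but does not otherwise distort the score, whereas a Euclidean distance on the same pair would be dominated by it.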
Identifying Similarities, Periodicities and Bursts for Online Search Queries
We present several methods for mining knowledge from the query logs of the MSN search engine. Using the query logs, we build a time series for each query word or phrase (e.g., ‘Thanksgiving’ or ‘Christmas gifts’) where the elements of the time series are the number of times that a query is issued on a day. All of the methods we describe use sequences of this form and can be applied to time series data generally. Our primary goal is the discovery of semantically similar queries and we do so by identifying queries with similar demand patterns. Utilizing the best Fourier coefficients and the energy of the omitted components, we improve upon the state-of-the-art in time-series similarity matching. The extracted sequence features are then organized in an efficient metric tree index structure. We also demonstrate how to efficiently and accurately discover the important periods in a time series. Finally, we propose a simple but effective method for identification of bursts (long- or short-term). Using the burst information extracted from a sequence, we are able to efficiently perform ‘query-by-burst’ on the database of time series. We conclude the presentation with the description of a tool that uses the described methods, and serves as an interactive exploratory data discovery tool for the MSN query database.
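Burst identification on a daily query-count series can be illustrated with one simple approach, assumed here for illustration and not necessarily the method of the paper: smooth the series with a trailing moving average and flag the days on which the smoothed value exceeds the mean by a chosen number of standard deviations. The names `find_bursts`, `window` and `num_std` are hypothetical.

```python
from statistics import mean, pstdev

def moving_average(series, window):
    """Trailing moving average; early points average over what is available."""
    smooth = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        smooth.append(sum(chunk) / len(chunk))
    return smooth

def find_bursts(series, window=3, num_std=1.5):
    """Return (start_day, end_day) pairs where the smoothed series is unusually high."""
    smooth = moving_average(series, window)
    cutoff = mean(smooth) + num_std * pstdev(smooth)
    bursts, start = [], None
    for day, value in enumerate(smooth):
        if value > cutoff and start is None:
            start = day                      # a burst begins
        elif value <= cutoff and start is not None:
            bursts.append((start, day - 1))  # the burst ended on the previous day
            start = None
    if start is not None:                    # burst still open at the end
        bursts.append((start, len(smooth) - 1))
    return bursts

# Illustrative series: flat demand with a three-day spike.
counts = [10] * 20 + [80, 90, 85] + [10] * 20
bursts = find_bursts(counts)
print(bursts)
```

Storing each burst as a compact (start, end) pair is what makes a query such as "find sequences that burst at the same time" cheap: it reduces to an interval-overlap test instead of a comparison of full sequences.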