31 research outputs found
Improved data visualisation through nonlinear dissimilarity modelling
Inherent to state-of-the-art dimension reduction algorithms is the assumption that global distances between observations are Euclidean, despite the potential for altogether non-Euclidean data manifolds. We demonstrate that a non-Euclidean manifold chart can be approximated by applying a universal approximator over a dictionary of dissimilarity measures, building on recent developments in the field. The approach transfers across domains: observations may be vectors, distributions, graphs, or time series, for instance. Our novel dissimilarity learning method is illustrated on four standard visualisation datasets, showing its benefits over the linear dissimilarity learning approach.
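The idea of embedding from a combined dissimilarity rather than raw Euclidean distance can be sketched as follows. This is a minimal illustration, not the paper's method: the learned nonlinear combination is replaced by a fixed average of two dissimilarity measures, and the data are synthetic.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))  # illustrative data, 50 observations

# A small "dictionary" of dissimilarity measures; the paper learns a
# nonlinear combination, here we simply average two normalised measures.
D_euc = pairwise_distances(X, metric="euclidean")
D_cos = pairwise_distances(X, metric="cosine")
D = 0.5 * (D_euc / D_euc.max()) + 0.5 * (D_cos / D_cos.max())

# Embed the combined dissimilarity matrix in 2-D for visualisation.
emb = MDS(n_components=2, dissimilarity="precomputed",
          random_state=0).fit_transform(D)
print(emb.shape)  # (50, 2)
```

Because MDS accepts a precomputed dissimilarity matrix, any measure in the dictionary (graph, time-series, or distribution dissimilarities alike) can be plugged in the same way.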
The Missing Mass Problem
We give tight lower and upper bounds on the expected missing mass for distributions over finite and countably infinite spaces. An essential characterization of the extremal distributions is given. We also provide an extension to totally bounded metric spaces that may be of independent interest.
Comment: 15 pages
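The missing mass is the total probability of symbols that do not appear in an i.i.d. sample; for a sample of size n from a distribution p, its expectation is the standard quantity E[M_n] = sum_x p_x (1 - p_x)^n. A quick numerical check (the uniform distribution and the sizes below are purely illustrative):

```python
import numpy as np

def expected_missing_mass(p, n):
    """E[M_n] = sum_x p_x * (1 - p_x)^n for an i.i.d. sample of size n."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p) ** n))

# Uniform distribution over k symbols: E[M_n] simplifies to (1 - 1/k)^n.
k, n = 10, 20
val = expected_missing_mass(np.full(k, 1.0 / k), n)
print(round(val, 6))  # (1 - 1/10)**20 ~ 0.121577
```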
Greedy MAXCUT Algorithms and their Information Content
MAXCUT defines a classical NP-hard problem for graph partitioning and serves as a typical case of the symmetric non-monotone Unconstrained Submodular Maximization (USM) problem. Applications of MAXCUT are abundant in machine learning, computer vision and statistical physics. Greedy algorithms that approximately solve MAXCUT rely on greedy vertex labelling or on an edge contraction strategy. These algorithms have been studied through their worst-case approximation ratios, but little is known about their robustness to noise contamination of the input data in the average case. Adapting the framework of Approximation Set Coding, we present a method to exactly measure the cardinality of the algorithmic approximation sets of five greedy MAXCUT algorithms. Their information content is explored for graph instances generated by two different noise models: the edge reversal model and the Gaussian edge weights model. The results provide insights into the robustness of different greedy heuristics and techniques for MAXCUT, which can inform algorithm design for general USM problems.
Comment: This is a longer version of the paper published in the 2015 IEEE Information Theory Workshop (ITW).
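One of the two greedy strategies mentioned, greedy vertex labelling, can be sketched as follows. This is a generic version for illustration, not one of the paper's five specific algorithms: each vertex is placed, in turn, on the side that maximises the weight of cut edges to already-placed vertices.

```python
def greedy_maxcut(n, weights):
    """Greedy vertex labelling for MAXCUT.

    Vertices 0..n-1 are placed one at a time on the side that maximises
    the weight of cut edges to already-placed vertices.
    weights: dict mapping frozenset({u, v}) -> edge weight.
    """
    def w(u, v):
        return weights.get(frozenset((u, v)), 0.0)

    side = {}
    for v in range(n):
        # Weight gained by putting v on side 0 (cut edges to side 1)
        # versus side 1 (cut edges to side 0).
        gain0 = sum(w(v, u) for u, s in side.items() if s == 1)
        gain1 = sum(w(v, u) for u, s in side.items() if s == 0)
        side[v] = 0 if gain0 >= gain1 else 1
    cut = sum(wt for e, wt in weights.items()
              if len({side[u] for u in e}) == 2)
    return side, cut

# Triangle with unit weights: the greedy labelling cuts 2 of the 3 edges.
edges = {frozenset((0, 1)): 1.0, frozenset((1, 2)): 1.0, frozenset((0, 2)): 1.0}
labels, cut_value = greedy_maxcut(3, edges)
print(cut_value)  # 2.0
```

Different tie-breaking and vertex-ordering rules yield different members of the greedy family, which is exactly the kind of variation whose information content the abstract's framework measures.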
Multiple Kernel-Based Multimedia Fusion for Automated Event Detection from Tweets
A method for detecting hot events such as wildfires is proposed. It uses visual and textual information jointly to improve detection. Starting from tweets that contain both text and images, it preprocesses the data to eliminate unwanted items, transforms the unstructured data into structured form, and then extracts features. Text features include term frequency-inverse document frequency (TF-IDF); image features include the histogram of oriented gradients (HOG), the gray-level co-occurrence matrix (GLCM), the color histogram, and the scale-invariant feature transform (SIFT). The features are then fed to multiple kernel learning (MKL), which automatically fuses both feature types to achieve the best performance, before the final event detection step. The method was tested on the 2014 Brisbane hailstorm and the 2017 California wildfires and compared against methods using text only or images only. On the Brisbane hailstorm data, the proposed method achieved the best performance, with a fusion accuracy of 0.93, compared with 0.89 for text only and 0.85 for images only. A similar pattern was observed on the California wildfires data. The results demonstrate that combining multiple feature types enhances event detection on Twitter, yielding an accurate and effective method for spreading awareness and organizing responses, and ultimately better disaster management.
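The fusion step can be sketched with a fixed combination of kernels over the two feature types. This is a simplified stand-in, not the paper's pipeline: MKL learns the kernel weights from data, whereas below a uniform combination is hard-coded, and the "text" and "image" features are random placeholders for TF-IDF and HOG/GLCM/SIFT descriptors.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, linear_kernel

rng = np.random.default_rng(0)
# Hypothetical stand-ins for text (TF-IDF) and image (HOG/SIFT) features.
X_text = rng.normal(size=(100, 20))
X_img = rng.normal(size=(100, 30))
y = (X_text[:, 0] + X_img[:, 0] > 0).astype(int)  # synthetic labels

# MKL would learn the kernel weights; a fixed 0.5/0.5 fusion stands in.
K = 0.5 * linear_kernel(X_text) + 0.5 * rbf_kernel(X_img, gamma=0.05)

# An SVM accepts the fused Gram matrix directly as a precomputed kernel.
clf = SVC(kernel="precomputed").fit(K, y)
acc = clf.score(K, y)  # training accuracy on the fused kernel
print(acc)
```

A convex combination of valid kernels is itself a valid kernel, which is what lets MKL search over fusion weights while keeping the downstream classifier unchanged.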