194 research outputs found
Analysis of Crowdsourced Sampling Strategies for HodgeRank with Sparse Random Graphs
Crowdsourcing platforms are now extensively used for conducting subjective
pairwise comparison studies. In this setting, a pairwise comparison dataset is
typically gathered via random sampling, either \emph{with} or \emph{without}
replacement. In this paper, we use tools from random graph theory to analyze
these two random sampling methods for the HodgeRank estimator. Using the
Fiedler value of the graph as a measurement for estimator stability
(informativeness), we provide a new estimate of the Fiedler value for these two
random graph models. In the asymptotic limit as the number of vertices tends to
infinity, we prove the validity of the estimate. Based on our findings, for a
small number of items to be compared, we recommend a two-stage sampling
strategy where a greedy sampling method is used initially and random sampling
\emph{without} replacement is used in the second stage. When a large number of
items is to be compared, we recommend random sampling with replacement as this
is computationally inexpensive and trivially parallelizable. Experiments on
synthetic and real-world datasets support our analysis
The Global Patch Collider
Abstract This paper proposes a novel extremely efficient, fullyparallelizable, task-specific algorithm for the computation of global point-wise correspondences in images and videos. Our algorithm, the Global Patch Collider, is based on detecting unique collisions between image points using a collection of learned tree structures that act as conditional hash functions. In contrast to conventional approaches that rely on pairwise distance computation, our algorithm isolates distinctive pixel pairs that hit the same leaf during traversal through multiple learned tree structures. The split functions stored at the intermediate nodes of the trees are trained to ensure that only visually similar patches or their geometric or photometric transformed versions fall into the same leaf node. The matching process involves passing all pixel positions in the images under analysis through the tree structures. We then compute matches by isolating points that uniquely collide with each other ie. fell in the same empty leaf in multiple trees. Our algorithm is linear in the number of pixels but can be made constant time on a parallel computation architecture as the tree traversal for individual image points is decoupled. We demonstrate the efficacy of our method by using it to perform optical flow matching and stereo matching on some challenging benchmarks. Experimental results show that not only is our method extremely computationally efficient, but it is also able to match or outperform state of the art methods that are much more complex
Designing labeled graph classifiers by exploiting the R\'enyi entropy of the dissimilarity representation
Representing patterns as labeled graphs is becoming increasingly common in
the broad field of computational intelligence. Accordingly, a wide repertoire
of pattern recognition tools, such as classifiers and knowledge discovery
procedures, are nowadays available and tested for various datasets of labeled
graphs. However, the design of effective learning procedures operating in the
space of labeled graphs is still a challenging problem, especially from the
computational complexity viewpoint. In this paper, we present a major
improvement of a general-purpose classifier for graphs, which is conceived on
an interplay between dissimilarity representation, clustering,
information-theoretic techniques, and evolutionary optimization algorithms. The
improvement focuses on a specific key subroutine devised to compress the input
data. We prove different theorems which are fundamental to the setting of the
parameters controlling such a compression operation. We demonstrate the
effectiveness of the resulting classifier by benchmarking the developed
variants on well-known datasets of labeled graphs, considering as distinct
performance indicators the classification accuracy, computing time, and
parsimony in terms of structural complexity of the synthesized classification
models. The results show state-of-the-art standards in terms of test set
accuracy and a considerable speed-up for what concerns the computing time.Comment: Revised versio
Metrics for Graph Comparison: A Practitioner's Guide
Comparison of graph structure is a ubiquitous task in data analysis and
machine learning, with diverse applications in fields such as neuroscience,
cyber security, social network analysis, and bioinformatics, among others.
Discovery and comparison of structures such as modular communities, rich clubs,
hubs, and trees in data in these fields yields insight into the generative
mechanisms and functional properties of the graph.
Often, two graphs are compared via a pairwise distance measure, with a small
distance indicating structural similarity and vice versa. Common choices
include spectral distances (also known as distances) and distances
based on node affinities. However, there has of yet been no comparative study
of the efficacy of these distance measures in discerning between common graph
topologies and different structural scales.
In this work, we compare commonly used graph metrics and distance measures,
and demonstrate their ability to discern between common topological features
found in both random graph models and empirical datasets. We put forward a
multi-scale picture of graph structure, in which the effect of global and local
structure upon the distance measures is considered. We make recommendations on
the applicability of different distance measures to empirical graph data
problem based on this multi-scale view. Finally, we introduce the Python
library NetComp which implements the graph distances used in this work
Wavelet-based Image Denoising
Käesolev töö uurib laineteisenduste kasutust piltide kvaliteedi parandamise eesmärgil, neilt müra eemaldades. See annab ülevaate erinevates müra tüüpidest ning müraeemaldusmeetoditest. Edasi keskendub töö laineteisenduspõhistele müraeemaldusskeemidele. Samuti uurib töö laineteisenduspõhise müraeemaldusmeetodi ning kokkupakkimise kombineerimise kasulikkust ja pakub välja uue lävendamise tüübi ning muudatuse eksisteerivale BayesShrink meetodile. Pakutud meetod implementeeritakse C# keeles ning selle implementatsiooni tulemusi, jõudlust ning optimaalseid parameetreid analüüsitakse eksperimentaalsete tulemuste abil.This thesis studies the use of wavelets for the purpose of improving the quality of images by removing noise from them. It presents an overview of different types of noise and denoising methods. The thesis then focuses on the wavelet transform based denoising schemes. It explores the potential of combining wavelet-based denoising and compression, and presents a new thresholding type and a modification to the existing BayesShrink method. The proposed method is implemented in the C# language and its performance and optimal parameters are analyzed through experimental results
Evaluating and Mapping Internet Connectivity in the United States
We evaluated Internet connectivity in the United States, drawn from different definitions of connectivity and different methods of analysis. Using DNS cache manipulation, traceroutes, and a crowdsourced “site ping” method we identify patterns in connectivity that correspond to higher population or coastal regions of the US. We analyze the data for quality strengths and shortcomings, establish connectivity heatmaps, state rankings, and statistical measures of the data. We give comparative analyses of the three methods and present suggestions for future work building off this report
TweeProfiles4: a weighted multidimensional stream clustering algorithm
O aparecimento das redes sociais abriu aos utilizadores a possibilidade de facilmente partilharem as suas ideias a respeito de diferentes temas, o que constitui uma fonte de informação enriquecedora para diversos campos. As plataformas de microblogging sofreram um grande crescimento e de forma constante nos últimos anos. O Twitter é o site de microblogging mais popular, tornando-se uma fonte de dados interessante para extração de conhecimento. Um dos principais desafios na análise de dados provenientes de redes sociais é o seu fluxo, o que dificulta a aplicação de processos tradicionais de data mining. Neste sentido, a extração de conhecimento sobre fluxos de dados tem recebido um foco significativo recentemente. O TweeProfiles é a uma ferramenta de data mining para análise e visualização de dados do Twitter sobre quatro dimensões: espacial (a localização geográfica do tweet), temporal (a data de publicação do tweet), de conteúdo (o texto do tweet) e social (o grafo dos relacionamentos). Este é um projeto em desenvolvimento que ainda possui muitos aspetos que podem ser melhorados. Uma das recentes melhorias inclui a substituição do algoritmo de clustering original, o qual não suportava o fluxo contínuo dos dados, por um método de streaming. O objetivo desta dissertação passa pela continuação do desenvolvimento do TweeProfiles. Em primeiro lugar, será proposto um novo algoritmo de clustering para fluxos de dados com o objetivo de melhorar o existente. Para esse efeito será desenvolvido um algoritmo incremental com suporte para fluxos de dados multi-dimensionais. Esta abordagem deve permitir ao utilizador alterar dinamicamente a importância relativa de cada dimensão do processo de clustering. Adicionalmente, a avaliação empírica dos resultados será alvo de melhoramento através da identificação e implementação de medidas adequadas de avaliação dos padrões extraídos. O estudo empírico será realizado através de tweets georreferenciados obtidos pelo SocialBus.The emergence of social media made it possible for users to easily share their thoughts on different topics, which constitutes a rich source of information for many fields. Microblogging platforms experienced a large and steady growth over the last few years. Twitter is the most popular microblogging site, making it an interesting source of data for pattern extraction. One of the main challenges of analyzing social media data is its continuous nature, which makes it hard to use traditional data mining. Therefore, mining stream data has also received a lot of attention recently.TweeProfiles is a data mining tool for analyzing and visualizing Twitter data over four dimensions: spatial (the location of the tweet), temporal (the timestamp of the tweet), content (the text of the tweet) and social (relationship graph). This is an ongoing project which still has many aspects that can be improved. For instance, it was recently improved by replacing the original clustering algorithm which could not handle the continuous flow of data with a streaming method. The goal of this dissertation is to continue the development of TweeProfiles. First, the stream clustering process will be improved by proposing a new algorithm. This will be achieved by developing an incremental algorithm with support for multi-dimensional streaming data. Moreover, it should make it possible for the user to dynamically change the relative importance of each dimension in the clustering. Additionally, the empirical evaluation of the results will also be improved.Suitable measures to evaluate the extracted patterns will be identified and implemented. An empirical study will be done using data consisting of georeferenced tweets from SocialBus
Generalized Low Rank Models
Principal components analysis (PCA) is a well-known technique for
approximating a tabular data set by a low rank matrix. Here, we extend the idea
of PCA to handle arbitrary data sets consisting of numerical, Boolean,
categorical, ordinal, and other data types. This framework encompasses many
well known techniques in data analysis, such as nonnegative matrix
factorization, matrix completion, sparse and robust PCA, -means, -SVD,
and maximum margin matrix factorization. The method handles heterogeneous data
sets, and leads to coherent schemes for compressing, denoising, and imputing
missing entries across all data types simultaneously. It also admits a number
of interesting interpretations of the low rank factors, which allow clustering
of examples or of features. We propose several parallel algorithms for fitting
generalized low rank models, and describe implementations and numerical
results.Comment: 84 pages, 19 figure
- …