When Hashes Met Wedges: A Distributed Algorithm for Finding High Similarity Vectors
Finding similar user pairs is a fundamental task in social networks, with
numerous applications in ranking and personalization tasks such as link
prediction and tie strength detection. A common manifestation of user
similarity is based upon network structure: each user is represented by a
vector encoding the user's network connections, and pairwise cosine
similarity among these vectors defines user similarity. The predominant task
for user similarity applications is to discover all similar pairs that have a
pairwise cosine similarity value larger than a given threshold. In contrast
to previous work, where the threshold is assumed to be quite close to 1, we
focus on recommendation applications where the threshold is small, but still
meaningful. The all-pairs cosine similarity problem is computationally
challenging on networks with billions of edges, and especially so for settings
with a small threshold. To the best of our knowledge, there is no practical
solution for computing all user pairs at such small similarity thresholds on
large social networks, even using the power of distributed algorithms.
Our work directly addresses this challenge by introducing a new algorithm ---
WHIMP --- that solves this problem efficiently in the MapReduce model. The key
insight in WHIMP is to combine the "wedge-sampling" approach of Cohen-Lewis for
approximate matrix multiplication with the SimHash random projection techniques
of Charikar. We provide a theoretical analysis of WHIMP, proving that it has
near optimal communication costs while maintaining computation cost comparable
with the state of the art. We also empirically demonstrate WHIMP's scalability
by computing all highly similar pairs on four massive data sets, and show that
it accurately finds high similarity pairs. In particular, we note that WHIMP
successfully processes the entire Twitter network, which has tens of billions
of edges.
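
To make the SimHash ingredient concrete, the following is a minimal sketch of Charikar's random-projection idea in Python: signatures are the signs of projections onto random hyperplanes, and the fraction of agreeing bits estimates the cosine similarity. This is a generic illustration on toy vectors, not the WHIMP implementation or its wedge-sampling component; the vector dimensions and bit counts are arbitrary assumptions.

import numpy as np

def simhash_signatures(vectors, n_bits=256, seed=0):
    # Project each row onto n_bits random hyperplanes and keep only the signs.
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[1], n_bits))
    return (vectors @ planes) >= 0  # boolean signature matrix

def estimated_cosine(sig_a, sig_b):
    # P(bits agree) = 1 - theta/pi, so cos(theta) is recoverable from the agreement rate.
    agreement = np.mean(sig_a == sig_b)
    return np.cos(np.pi * (1.0 - agreement))

# Toy adjacency-style vectors for two hypothetical users.
u = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)
v = np.array([1, 0, 1, 0, 0, 1, 1, 0], dtype=float)
sigs = simhash_signatures(np.vstack([u, v]))
print(estimated_cosine(sigs[0], sigs[1]))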
COSMOS-7: Video-oriented MPEG-7 scheme for modelling and filtering of semantic content
MPEG-7 prescribes a format for semantic content models for multimedia to ensure interoperability across a multitude of platforms and application domains. However, the standard leaves open how the models should be used and how their content should be filtered. Filtering is a technique used to retrieve only content relevant to user requirements, thereby reducing the content-sifting effort required of the user. This paper proposes an MPEG-7 scheme that can be deployed for semantic content modelling and filtering of digital video. The proposed scheme, COSMOS-7, produces rich and multi-faceted semantic content models and supports a content-based filtering approach that only analyses content relating directly to the preferred content requirements of the user.
An MPEG-7 scheme for semantic content modelling and filtering of digital video
Part 5 of the MPEG-7 standard specifies Multimedia Description Schemes (MDS); that is, the format multimedia content models should conform to in order to ensure interoperability across multiple platforms and applications. However, the standard does not specify how the content or the associated model may be filtered. This paper proposes an MPEG-7 scheme which can be deployed for digital video content modelling and filtering. The proposed scheme, COSMOS-7, produces rich and multi-faceted semantic content models and supports a content-based filtering approach that only analyses content relating directly to the preferred content requirements of the user. We present details of the scheme, the front-end systems used for content modelling and filtering, and experiences with a number of users.
Selection Bias in News Coverage: Learning it, Fighting it
News entities must select and filter the coverage they broadcast through
their respective channels since the set of world events is too large to be
treated exhaustively. The subjective nature of this filtering induces biases
due to, among other things, resource constraints, editorial guidelines,
ideological affinities, or even the fragmented nature of the information at a
journalist's disposal. The magnitude and direction of these biases are,
however, widely unknown. The absence of ground truth, the sheer size of the
event space, or the lack of an exhaustive set of absolute features to measure
make it difficult to observe the bias directly, to characterize the nature of
the leaning, and to factor it out to ensure neutral coverage of the news. In this
work, we introduce a methodology to capture the latent structure of the
media's decision process on a large scale. Our contribution is threefold.
First, we show that media coverage is predictable using personalization
techniques, and
evaluate our approach on a large set of events collected from the GDELT
database. We then show that a personalized and parametrized approach not only
exhibits higher accuracy in coverage prediction, but also provides an
interpretable representation of the selection bias. Last, we propose a method
for selecting a set of sources by leveraging the latent representation. The
selected sources provide more diverse and egalitarian coverage while still
retaining the most actively covered events.
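
One way to picture the "personalized and parametrized" coverage model is as a source-by-event matrix completion problem. The sketch below fits a simple logistic matrix factorization to a synthetic binary coverage matrix; it is only an illustrative stand-in and assumes nothing about the paper's actual features, model, or GDELT pipeline.

import numpy as np

rng = np.random.default_rng(0)
n_sources, n_events, k = 20, 100, 5          # toy sizes, purely hypothetical
coverage = (rng.random((n_sources, n_events)) < 0.2).astype(float)  # 1 = event covered

# One latent vector per source (its "selection profile") and per event.
S = 0.1 * rng.standard_normal((n_sources, k))
E = 0.1 * rng.standard_normal((n_events, k))
lr, reg = 0.5, 1e-3

for _ in range(300):
    probs = 1.0 / (1.0 + np.exp(-(S @ E.T)))   # predicted coverage probabilities
    grad = probs - coverage                     # gradient of the logistic loss
    S -= lr * (grad @ E / n_events + reg * S)
    E -= lr * (grad.T @ S / n_sources + reg * E)

# The learned source factors give a low-dimensional, inspectable representation
# of each outlet's selection tendencies.
print(S.round(2)[:3])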
Region-Based Watermarking of Biometric Images: Case Study in Fingerprint Images
In this paper, a novel scheme for watermarking biometric
images is proposed. It exploits the fact that biometric
images normally contain a single region of interest, which
holds the information processed by most biometric-based
identification/authentication systems. The proposed scheme
embeds the watermark into the region of interest only, thus
preserving the hidden data from the segmentation process
that removes the useless background and keeps the region of
interest unaltered, a process an attacker could exploit as
a cropping attack. The scheme also provides greater
robustness and better imperceptibility of the embedded
watermark. It is incorporated into the optimum watermark
detector in order to improve its performance, and is
applied to fingerprint images, one of the most widely used
and studied types of biometric data. The watermarking is
assessed in two well-known transform domains: the discrete
wavelet transform (DWT) and the discrete Fourier transform
(DFT). The results clearly show significant improvements
over the standard technique, which operates on the whole
image. They also reveal that the segmentation (cropping)
attack does not affect the performance of the proposed
technique, which is also more robust against other common
attacks.
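
As a rough illustration of region-restricted embedding in the DWT domain, the following sketch adds a spread-spectrum sequence only to detail coefficients that fall inside the region of interest, using PyWavelets. The mask handling, the embedding strength alpha, and the additive rule are placeholder choices, not the paper's optimum-detector scheme.

import numpy as np
import pywt

def embed_roi_watermark(image, roi_mask, alpha=2.0, seed=0):
    # Additively embed a pseudo-random sequence in the horizontal detail
    # coefficients of the region of interest only (illustrative scheme).
    rng = np.random.default_rng(seed)
    cA, (cH, cV, cD) = pywt.dwt2(image.astype(float), "haar")

    # Down-sample the ROI mask to the level-1 coefficient grid.
    mask = roi_mask[::2, ::2][: cH.shape[0], : cH.shape[1]].astype(bool)

    # Spread-spectrum payload: +/-1 per coefficient, applied inside the ROI only.
    sequence = rng.choice([-1.0, 1.0], size=cH.shape)
    cH_marked = cH + np.where(mask, alpha * sequence, 0.0)

    return pywt.idwt2((cA, (cH_marked, cV, cD)), "haar")

# Toy example: a synthetic image with a centred region of interest.
img = np.random.default_rng(1).random((128, 128)) * 255
roi = np.zeros((128, 128), dtype=bool)
roi[32:96, 32:96] = True
marked = embed_roi_watermark(img, roi)
print(float(np.abs(marked - img).max()))  # maximum pixel change introduced by the watermark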
Flooding through the lens of mobile phone activity
Natural disasters affect hundreds of millions of people worldwide every year.
Emergency response efforts depend upon the availability of timely information,
such as information concerning the movements of affected populations. The
analysis of aggregated and anonymized Call Detail Records (CDR) captured from
the mobile phone infrastructure provides new possibilities to characterize
human behavior during critical events. In this work, we investigate the
viability of using CDR data combined with other sources of information to
characterize the floods that occurred in Tabasco, Mexico in 2009. An impact map
has been reconstructed using Landsat-7 images to identify the floods. Within
this frame, the underlying communication activity signals in the CDR data have
been analyzed and compared against rainfall levels extracted from data of the
NASA-TRMM project. The variations in the number of active phones connected to
each cell tower reveal abnormal activity patterns in the most affected
locations during and after the floods that could be used as signatures of the
floods - both in terms of infrastructure impact assessment and population
information awareness. The representativeness of the analysis has been assessed
using census data and civil protection records. While a more extensive
validation is required, these early results suggest high potential in using
cell tower activity information to improve early warning and emergency
management mechanisms.
Comment: Submitted to IEEE Global Humanitarian Technologies Conference (GHTC) 201
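
The "abnormal activity patterns" described above can be pictured with a simple per-tower baseline comparison. The sketch below flags tower-hours whose active-phone counts deviate sharply from each tower's own history; it uses synthetic counts and a plain z-score rule, purely as an illustration rather than the study's actual methodology.

import numpy as np

rng = np.random.default_rng(0)
n_towers, n_hours = 50, 24 * 30            # one month of hourly counts (toy data)
activity = rng.poisson(lam=200, size=(n_towers, n_hours)).astype(float)

# Inject a synthetic "event": a burst of activity at a few towers.
activity[:5, 400:430] *= 3.0

# Flag tower-hours whose activity departs strongly from that tower's own history.
mean = activity.mean(axis=1, keepdims=True)
std = activity.std(axis=1, keepdims=True)
zscores = (activity - mean) / std
anomalous = np.argwhere(np.abs(zscores) > 4.0)

print(f"{len(anomalous)} anomalous tower-hours, e.g. {anomalous[:3].tolist()}")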
Computational methods to predict and enhance decision-making with biomedical data.
The proposed research applies machine learning techniques to healthcare applications. The core idea is to use intelligent techniques to develop automatic methods for analyzing healthcare data. Different classification and feature extraction techniques are applied to various clinical datasets. The datasets include brain MR images, breathing curves measured over time from vessels around tumor cells, breathing curves extracted from patients with successful or rejected lung transplants, and records of lung cancer patients diagnosed in the US from 2004 to 2009, extracted from the SEER database. The novel idea in brain MR image segmentation is a multi-scale technique to separate blood vessel tissue from similar tissues in the brain. By analyzing the vascularization of the cancer tissue over time and the behavior of the vessels (arteries and veins observed over time), a new feature extraction technique was developed and classification techniques were used to rank the vascularization of each tumor type. Lung transplantation is a critical surgery for which predicting the acceptance or rejection of the transplant would be very important. Classification techniques were reviewed and applied to the SEER database to analyze the survival rates of lung cancer patients, and the best feature vector for identifying the most similar patients was analyzed.
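
The survival-prediction component can be pictured as a standard supervised-learning pipeline. The sketch below trains a classifier on synthetic features standing in for SEER-style records; the feature set, labels, and model choice are illustrative assumptions, not the methods used in the thesis.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for SEER-style records: a handful of numeric attributes.
rng = np.random.default_rng(0)
X = rng.random((1000, 6))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.2, 1000) > 0.9).astype(int)  # toy survival label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Feature importances hint at which attributes drive the prediction,
# loosely analogous to selecting the "best feature vector" described above.
print("importances:", model.feature_importances_.round(2))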