
    When Hashes Met Wedges: A Distributed Algorithm for Finding High Similarity Vectors

    Finding similar user pairs is a fundamental task in social networks, with numerous applications in ranking and personalization tasks such as link prediction and tie strength detection. A common manifestation of user similarity is based upon network structure: each user is represented by a vector that represents the user's network connections, where pairwise cosine similarity among these vectors defines user similarity. The predominant task for user similarity applications is to discover all similar pairs that have a pairwise cosine similarity value larger than a given threshold τ. In contrast to previous work where τ is assumed to be quite close to 1, we focus on recommendation applications where τ is small, but still meaningful. The all-pairs cosine similarity problem is computationally challenging on networks with billions of edges, and especially so for settings with small τ. To the best of our knowledge, there is no practical solution for computing all user pairs with, say, τ = 0.2 on large social networks, even using the power of distributed algorithms. Our work directly addresses this challenge by introducing a new algorithm, WHIMP, that solves this problem efficiently in the MapReduce model. The key insight in WHIMP is to combine the "wedge-sampling" approach of Cohen-Lewis for approximate matrix multiplication with the SimHash random projection techniques of Charikar. We provide a theoretical analysis of WHIMP, proving that it has near-optimal communication costs while maintaining computation cost comparable with the state of the art. We also empirically demonstrate WHIMP's scalability by computing all highly similar pairs on four massive data sets, and show that it accurately finds high similarity pairs. In particular, we note that WHIMP successfully processes the entire Twitter network, which has tens of billions of edges.
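    The SimHash ingredient named in the abstract admits a compact illustration. The sketch below (plain NumPy; the function names and toy vectors are illustrative assumptions, not the WHIMP implementation) shows how random-hyperplane signatures let one estimate cosine similarity from bit agreement, the quantity that is compared against the threshold τ.

```python
import numpy as np

def simhash_signatures(vectors, n_bits=256, seed=0):
    """Project each row vector onto n_bits random hyperplanes (SimHash)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[1], n_bits))
    return (vectors @ planes) >= 0  # one boolean signature per row

def estimated_cosine(sig_a, sig_b):
    """Estimate cos(theta) from the fraction of agreeing signature bits."""
    agree = np.mean(sig_a == sig_b)
    return np.cos(np.pi * (1.0 - agree))

# Toy usage: two user adjacency vectors with overlapping neighbourhoods.
u = np.array([1, 1, 1, 0, 0, 0, 1, 0], dtype=float)
v = np.array([1, 1, 0, 0, 1, 0, 1, 0], dtype=float)
sig_u, sig_v = simhash_signatures(np.vstack([u, v]))
print(estimated_cosine(sig_u, sig_v))                   # approx. 0.75
print(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))  # exact 0.75
```

    With b signature bits, the estimation error shrinks roughly as 1/sqrt(b), which is why a few hundred bits per user suffice for moderate thresholds such as τ = 0.2.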

    COSMOS-7: Video-oriented MPEG-7 scheme for modelling and filtering of semantic content

    MPEG-7 prescribes a format for semantic content models for multimedia to ensure interoperability across a multitude of platforms and application domains. However, the standard leaves it open as to how the models should be used and how their content should be filtered. Filtering is a technique used to retrieve only content relevant to user requirements, thereby reducing the necessary content-sifting effort of the user. This paper proposes an MPEG-7 scheme that can be deployed for semantic content modelling and filtering of digital video. The proposed scheme, COSMOS-7, produces rich and multi-faceted semantic content models and supports a content-based filtering approach that only analyses content relating directly to the preferred content requirements of the user

    An MPEG-7 scheme for semantic content modelling and filtering of digital video

    Part 5 of the MPEG-7 standard specifies Multimedia Description Schemes (MDS); that is, the format multimedia content models should conform to in order to ensure interoperability across multiple platforms and applications. However, the standard does not specify how the content or the associated model may be filtered. This paper proposes an MPEG-7 scheme which can be deployed for digital video content modelling and filtering. The proposed scheme, COSMOS-7, produces rich and multi-faceted semantic content models and supports a content-based filtering approach that only analyses content relating directly to the preferred content requirements of the user. We present details of the scheme, the front-end systems used for content modelling and filtering, and our experiences with a number of users.
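    As a concrete picture of content-based filtering over semantic annotations of this kind, the sketch below (Python; the Segment structure and its attribute names are simplifying assumptions, not the actual COSMOS-7/MPEG-7 schema) keeps only the video segments whose annotated objects or events intersect the user's stated preferences.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    start_s: float
    end_s: float
    objects: set = field(default_factory=set)   # e.g. {"goalkeeper"}
    events: set = field(default_factory=set)    # e.g. {"penalty"}

def filter_segments(segments, preferred_objects=(), preferred_events=()):
    """Return only segments whose annotations intersect the user's preferences."""
    want_obj, want_evt = set(preferred_objects), set(preferred_events)
    return [s for s in segments
            if (s.objects & want_obj) or (s.events & want_evt)]

# Toy usage: a two-segment "video" and a user who only wants goals.
video = [
    Segment(0, 30, objects={"crowd"}, events={"kick-off"}),
    Segment(30, 45, objects={"goalkeeper"}, events={"penalty", "goal"}),
]
print(filter_segments(video, preferred_events=["goal"]))  # only the 30-45 s segment
```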

    Selection Bias in News Coverage: Learning it, Fighting it

    News entities must select and filter the coverage they broadcast through their respective channels, since the set of world events is too large to be treated exhaustively. The subjective nature of this filtering induces biases due to, among other things, resource constraints, editorial guidelines, ideological affinities, or even the fragmented nature of the information at a journalist's disposal. The magnitude and direction of these biases are, however, largely unknown. The absence of ground truth, the sheer size of the event space, and the lack of an exhaustive set of absolute features to measure make it difficult to observe the bias directly, to characterize the nature of the leaning, and to factor it out to ensure a neutral coverage of the news. In this work, we introduce a methodology to capture the latent structure of the media's decision process on a large scale. Our contribution is threefold. First, we show media coverage to be predictable using personalization techniques, and we evaluate our approach on a large set of events collected from the GDELT database. We then show that a personalized and parametrized approach not only exhibits higher accuracy in coverage prediction, but also provides an interpretable representation of the selection bias. Last, we propose a method able to select a set of sources by leveraging the latent representation. These selected sources provide a more diverse and egalitarian coverage, all while retaining the most actively covered events.
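    The coverage-prediction idea can be pictured with a small latent-factor model. The sketch below (NumPy; the toy coverage matrix, the dimensionality, and the learning rate are assumptions for illustration, not the paper's model or the GDELT data) fits source and event embeddings by logistic matrix factorization, so that each source's embedding acts as an interpretable summary of what it tends to cover.

```python
import numpy as np

# Toy coverage matrix: rows = news sources, columns = events, 1 = covered.
rng = np.random.default_rng(0)
coverage = (rng.random((20, 50)) < 0.3).astype(float)

k, lr, lam = 5, 0.05, 0.01
S = 0.1 * rng.standard_normal((coverage.shape[0], k))   # source embeddings
E = 0.1 * rng.standard_normal((coverage.shape[1], k))   # event embeddings

for _ in range(500):
    pred = 1.0 / (1.0 + np.exp(-(S @ E.T)))             # P(source covers event)
    grad = pred - coverage                               # logistic-loss gradient
    S, E = S - lr * (grad @ E + lam * S), E - lr * (grad.T @ S + lam * E)

pred = 1.0 / (1.0 + np.exp(-(S @ E.T)))
print("training accuracy:", np.mean((pred > 0.5) == (coverage > 0.5)))
```

    A diversity-seeking source selection could then favour sources whose embeddings are spread out in this latent space rather than clustered together.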

    Region-Based Watermarking of Biometric Images: Case Study in Fingerprint Images

    In this paper, a novel scheme to watermark biometric images is proposed. It exploits the fact that biometric images normally have one region of interest, which contains the relevant information processed by most biometric-based identification/authentication systems. The proposed scheme embeds the watermark into the region of interest only, thus preserving the hidden data from the segmentation process that removes the useless background and keeps the region of interest unaltered, a process which can be used by an attacker as a cropping attack. It also provides more robustness and better imperceptibility of the embedded watermark. The proposed scheme is combined with optimum watermark detection in order to improve its performance. It is applied to fingerprint images, one of the most widely used and studied types of biometric data. The watermarking is assessed in two well-known transform domains: the discrete wavelet transform (DWT) and the discrete Fourier transform (DFT). The results obtained are very attractive and clearly show significant improvements when compared to the standard technique, which operates on the whole image. The results also reveal that the segmentation (cropping) attack does not affect the performance of the proposed technique, which also shows more robustness against other common attacks.
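    A minimal sketch of the region-of-interest idea in the DWT domain is given below (Python with PyWavelets; the additive embedding, the rectangular ROI, and the parameter names are generic stand-ins, not the paper's exact embedding and detection rules).

```python
import numpy as np
import pywt  # PyWavelets

def embed_roi_watermark(image, roi, strength=2.0, seed=0):
    """Additively embed a pseudo-random watermark in the DWT approximation
    band of the region of interest only; the background is left untouched."""
    r0, r1, c0, c1 = roi
    region = image[r0:r1, c0:c1].astype(float)
    cA, (cH, cV, cD) = pywt.dwt2(region, "haar")
    rng = np.random.default_rng(seed)
    watermark = rng.choice([-1.0, 1.0], size=cA.shape)
    cA_marked = cA + strength * watermark
    marked = pywt.idwt2((cA_marked, (cH, cV, cD)), "haar")
    out = image.astype(float).copy()
    out[r0:r1, c0:c1] = marked[: r1 - r0, : c1 - c0]
    return out, watermark

# Toy usage on a synthetic image with a central region of interest.
img = np.random.default_rng(1).integers(0, 256, size=(128, 128))
marked, wm = embed_roi_watermark(img, roi=(32, 96, 32, 96))
```

    Detection would correlate the ROI's approximation band of a test image against the known watermark; because the background carries no payload, cropping it away would leave that correlation unchanged.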

    Flooding through the lens of mobile phone activity

    Natural disasters affect hundreds of millions of people worldwide every year. Emergency response efforts depend upon the availability of timely information, such as information concerning the movements of affected populations. The analysis of aggregated and anonymized Call Detail Records (CDR) captured from the mobile phone infrastructure provides new possibilities to characterize human behavior during critical events. In this work, we investigate the viability of using CDR data combined with other sources of information to characterize the floods that occurred in Tabasco, Mexico in 2009. An impact map has been reconstructed using Landsat-7 images to identify the floods. Within this frame, the underlying communication activity signals in the CDR data have been analyzed and compared against rainfall levels extracted from data of the NASA TRMM project. The variations in the number of active phones connected to each cell tower reveal abnormal activity patterns in the most affected locations during and after the floods that could be used as signatures of the floods - both in terms of infrastructure impact assessment and population information awareness. The representativeness of the analysis has been assessed using census data and civil protection records. While a more extensive validation is required, these early results suggest high potential in using cell tower activity information to improve early warning and emergency management mechanisms.
    Comment: Submitted to IEEE Global Humanitarian Technology Conference (GHTC) 201
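    The "abnormal activity pattern" signal can be approximated with a very simple per-tower detector. The sketch below (NumPy; the baseline window, threshold, and toy counts are assumptions, not the study's actual processing of the CDR data) flags days whose active-phone counts deviate strongly from a pre-event baseline.

```python
import numpy as np

def activity_anomaly(counts, baseline_days=14, threshold=3.0):
    """Flag days whose active-phone count deviates from the pre-event baseline
    mean by more than `threshold` standard deviations."""
    counts = np.asarray(counts, dtype=float)
    mu = counts[:baseline_days].mean()
    sigma = counts[:baseline_days].std(ddof=1) or 1.0  # guard against a flat baseline
    z = (counts - mu) / sigma
    return np.flatnonzero(np.abs(z) > threshold), z

# Toy tower series: a stable two-week baseline, then a surge during the flood days.
series = [500] * 14 + [520, 900, 1400, 1300, 600]
flagged, z = activity_anomaly(series)
print(flagged)  # indices of the anomalous days
```

    Aggregating such per-tower flags over space is one simple way to turn raw CDR activity into the kind of flood signature discussed above.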

    Computational methods to predict and enhance decision-making with biomedical data.

    The proposed research applies machine learning techniques to healthcare applications. The core idea is to use intelligent techniques to develop automatic methods for analyzing healthcare data. Different classification and feature extraction techniques are applied to various clinical datasets. The datasets include: brain MR images, breathing curves from vessels around tumor cells over time, breathing curves extracted from patients with successful or rejected lung transplants, and lung cancer patients diagnosed in the US from 2004 to 2009, extracted from the SEER database. The novel idea for brain MR image segmentation is to develop a multi-scale technique to segment blood vessel tissue from similar tissues in the brain. By analyzing the vascularization of the cancer tissue over time and the behavior of the vessels (arteries and veins), a new feature extraction technique was developed, and classification techniques were used to rank the vascularization of each tumor type. Lung transplantation is a critical surgery for which predicting the acceptance or rejection of the transplant would be very important. A review of classification techniques on the SEER database was developed to analyze the survival rates of lung cancer patients, and the feature vectors best suited to identifying the most similar patients were analyzed.
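    The classification workflow described above follows a standard pattern that can be sketched briefly (Python with scikit-learn; the synthetic features and the random-forest choice are placeholders for the clinical datasets and the specific classifiers compared in the work).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for extracted clinical features; the real work uses
# curated datasets (MR-derived features, SEER attributes), not generated data.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           random_state=0)

# Scale the features, fit a classifier, and estimate accuracy by cross-validation.
model = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))
scores = cross_val_score(model, X, y, cv=5)
print("5-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```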