93 research outputs found

    When Hashes Met Wedges: A Distributed Algorithm for Finding High Similarity Vectors

    Finding similar user pairs is a fundamental task in social networks, with numerous applications in ranking and personalization tasks such as link prediction and tie strength detection. A common manifestation of user similarity is based upon network structure: each user is represented by a vector that represents the user's network connections, where pairwise cosine similarity among these vectors defines user similarity. The predominant task for user similarity applications is to discover all similar pairs that have a pairwise cosine similarity value larger than a given threshold τ. In contrast to previous work where τ is assumed to be quite close to 1, we focus on recommendation applications where τ is small, but still meaningful. The all-pairs cosine similarity problem is computationally challenging on networks with billions of edges, and especially so for settings with small τ. To the best of our knowledge, there is no practical solution for computing all user pairs with, say, τ = 0.2 on large social networks, even using the power of distributed algorithms. Our work directly addresses this challenge by introducing a new algorithm, WHIMP, that solves this problem efficiently in the MapReduce model. The key insight in WHIMP is to combine the "wedge-sampling" approach of Cohen-Lewis for approximate matrix multiplication with the SimHash random projection techniques of Charikar. We provide a theoretical analysis of WHIMP, proving that it has near-optimal communication costs while maintaining computation cost comparable with the state of the art. We also empirically demonstrate WHIMP's scalability by computing all highly similar pairs on four massive data sets, and show that it accurately finds high similarity pairs. In particular, we note that WHIMP successfully processes the entire Twitter network, which has tens of billions of edges.
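    To make the SimHash ingredient concrete, here is a minimal sketch of Charikar's random-hyperplane signatures, which WHIMP combines with wedge sampling. It is illustrative only, not the WHIMP pipeline itself (no wedge sampling or MapReduce distribution), and all names and parameters are ours. Two signatures agree on a bit with probability 1 − θ/π for angle θ, which lets us estimate cosine similarity from bit agreement.

```python
import numpy as np

def simhash_signatures(vectors, num_bits=256, seed=0):
    """Random-hyperplane signatures: each bit is the sign of a random
    projection, so two rows agree on a bit with probability 1 - angle/pi."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[1], num_bits))
    return (vectors @ planes) >= 0  # boolean matrix: one signature per row

def estimated_cosine(sig_u, sig_v):
    """Estimate cos(angle) from the fraction of agreeing signature bits."""
    agree = np.mean(sig_u == sig_v)
    return np.cos(np.pi * (1.0 - agree))

# Toy usage: two correlated vectors should score well above tau = 0.2.
u = np.random.default_rng(1).standard_normal(1000)
v = u + 0.5 * np.random.default_rng(2).standard_normal(1000)
sigs = simhash_signatures(np.vstack([u, v]))
print(estimated_cosine(sigs[0], sigs[1]))  # close to the true cosine, ~0.89
```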

    Zero-Shot Hashing via Transferring Supervised Knowledge

    Hashing has shown its efficiency and effectiveness in facilitating large-scale multimedia applications. Supervised knowledge (e.g., semantic labels or pair-wise relationships) associated with data can significantly improve the quality of hash codes and hash functions. However, confronted with the rapid growth of newly-emerging concepts and multimedia data on the Web, existing supervised hashing approaches may easily suffer from the scarcity and validity of supervised information due to the expensive cost of manual labelling. In this paper, we propose a novel hashing scheme, termed zero-shot hashing (ZSH), which compresses images of "unseen" categories to binary codes with hash functions learned from limited training data of "seen" categories. Specifically, we project independent data labels (i.e., 0/1-form label vectors) into a semantic embedding space, where semantic relationships among all the labels can be precisely characterized and thus seen supervised knowledge can be transferred to unseen classes. Moreover, in order to cope with the semantic shift problem, we rotate the embedded space to better align the embedded semantics with the low-level visual feature space, thereby alleviating the influence of the semantic gap. In the meantime, to exert positive effects on learning high-quality hash functions, we further propose to preserve local structural properties and the discrete nature of binary codes. In addition, we develop an efficient alternating algorithm to solve the ZSH model. Extensive experiments conducted on various real-life datasets show the superior zero-shot image retrieval performance of ZSH as compared to several state-of-the-art hashing methods.
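    The rotation step can be pictured with an orthogonal Procrustes fit, a standard way to align two embedding spaces. The sketch below is our illustration under that assumption; the paper's actual objective, constraints, and variable names may differ.

```python
import numpy as np

def procrustes_rotation(semantic, visual):
    """Orthogonal Procrustes: find the orthogonal R minimizing
    ||semantic @ R - visual||_F, one standard way to align a semantic
    embedding space with a visual feature space (illustrative only)."""
    u, _, vt = np.linalg.svd(semantic.T @ visual)
    return u @ vt

# Toy usage: recover a known orthogonal transform from paired embeddings.
rng = np.random.default_rng(0)
true_rot = np.linalg.qr(rng.standard_normal((16, 16)))[0]
sem = rng.standard_normal((500, 16))
vis = sem @ true_rot
print(np.allclose(procrustes_rotation(sem, vis), true_rot))  # True
```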

    Non-surgical spinal decompression therapy: does the scientific literature support efficacy claims made in the advertising media?

    Background: Traction therapy has been utilized in the treatment of low back pain for decades. The most recent incarnation of traction therapy is non-surgical spinal decompression therapy, which can cost over $100,000. This form of therapy has been heavily marketed to manual therapy professions and subsequently to the consumer. The purpose of this paper is to initiate a debate pertaining to the relationship between marketing claims and the scientific literature on non-surgical spinal decompression.
    Discussion: Only one small randomized controlled trial and several lower-level efficacy studies have been performed on spinal decompression therapy. In general the quality of these studies is questionable. Many of the studies were performed using the VAX-D® unit, which places the patient in a prone position. Companies often cite this research in their marketing even though their own units place the patient in the supine position.
    Summary: Only limited evidence is available to warrant the routine use of non-surgical spinal decompression, particularly when many other well-investigated, less expensive alternatives are available.

    Trioctahedral entities in palygorskite: Near-infrared evidence for sepiolite-palygorskite polysomatism

    The mixed dioctahedral-trioctahedral character of Mg-rich palygorskite has been previously described by the formula y[Mg5Si8O20(OH)2(OH2)4] · (1–y)[xMg2Fe2(1–x)Mg2Al2]Si8O20(OH)2(OH2)4, where y is the trioctahedral fraction of this two-chain ribbon mineral, with an experimentally determined upper limit of y = 0.5, and x is the Fe(III) content in the M2 sites of the dioctahedral component. Ideal trioctahedral (y = 1) palygorskite is elusive, although sepiolite, Mg8Si12O30(OH)4(OH2)4, with a similar composition, a three-chain ribbon structure and a distinct XRD pattern, is common. A set of 22 samples identified by XRD as palygorskite and with variable composition (0 < x < 0.7, 0 < y < 0.5) were studied to extrapolate the structure of an ideal trioctahedral (y = 1) palygorskite and to compare this structure to sepiolite. Near-infrared spectroscopy was used to study the influence of octahedral composition on the structure of the TOT ribbons, H2O in the tunnels and surface silanols of palygorskite, as well as their response to loss of zeolitic H2O. All spectroscopic evidence suggests that palygorskite consists of discrete dioctahedral and trioctahedral entities. The dioctahedral entities have variable structure determined solely by x = Fe(III)/(Al + Fe(III)), and their content is proportional to (1–y). In contrast, the trioctahedral entities have fixed octahedral composition and ribbon structure and are spectroscopically identical to sepiolite. The value of d200 in palygorskite follows the regression d200 (Å) = 6.362 + 0.129x(1–y) + 0.305y, with R² = 0.96 and σ = 0.013 Å. When extrapolated to y = 1, d200 is identical to that of sepiolite. Based on this analysis, we propose that palygorskite samples with non-zero trioctahedral character should be considered as members of a polysomatic series of sepiolite and (dioctahedral) palygorskite, described by the new formula y'[Mg8Si12O30(OH)4(OH2)4] · (1–y')[x'Mg2Fe2(1–x')Mg2Al2]Si8O20(OH)2(OH2)4, with 0 < x' = x < 0.7 and 0 < y' = y/(2–y) < 0.33.
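    To see what the reported regression implies, the snippet below simply evaluates the abstract's formula for d200; the example inputs are ours, chosen to show a typical dioctahedral sample and the trioctahedral extrapolation the authors identify with sepiolite.

```python
def d200_angstrom(x, y):
    """Regression from the abstract: d200 (Å) = 6.362 + 0.129*x*(1 - y) + 0.305*y,
    where x is the Fe(III) fraction and y the trioctahedral fraction."""
    return 6.362 + 0.129 * x * (1 - y) + 0.305 * y

# A typical dioctahedral sample (x = 0.3, y = 0) sits near 6.401 Å;
# extrapolating to the ideal trioctahedral end-member (y = 1) gives
# 6.667 Å, the value the authors report as matching sepiolite.
print(d200_angstrom(0.3, 0.0))  # 6.4007
print(d200_angstrom(0.0, 1.0))  # 6.667
```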

    De-identifying a public use microdata file from the Canadian national discharge abstract database

    Background: The Canadian Institute for Health Information (CIHI) collects hospital discharge abstract data (DAD) from Canadian provinces and territories. There are many demands for the disclosure of these data for research and analysis to inform policy making. To expedite the disclosure of data for some of these purposes, the construction of a DAD public use microdata file (PUMF) was considered. Such purposes include: confirming some published results, providing broader feedback to CIHI to improve data quality, training students and fellows, providing an easily accessible data set for researchers to prepare for analyses on the full DAD data set, and serving as a large health data set for computer scientists and statisticians to evaluate analysis and data mining techniques. The objective of this study was to measure the probability of re-identification for records in a PUMF, and to de-identify a national DAD PUMF consisting of 10% of records.
    Methods: Plausible attacks on a PUMF were evaluated. Based on these attacks, the 2008-2009 national DAD was de-identified. A new algorithm was developed to minimize the amount of suppression while maximizing the precision of the data. The acceptable threshold for the probability of correct re-identification of a record was set at between 0.04 and 0.05. Information loss was measured in terms of the extent of suppression and entropy.
    Results: Two different PUMF files were produced, one with geographic information, and one with no geographic information but more clinical information. At a threshold of 0.05, the maximum proportion of records with the diagnosis code suppressed was 20%, but these suppressions represented only 8-9% of all values in the DAD. Our suppression algorithm incurs less information loss than a more traditional approach to suppression. Smaller regions, patients with longer stays, and age groups that are infrequently admitted to hospitals tend to be the ones with the highest rates of suppression.
    Conclusions: The strategies we used to maximize data utility and minimize information loss can result in a PUMF that would be useful for the specific purposes noted earlier. However, to create a more detailed file with less information loss, suitable for more complex health services research, the risk would need to be mitigated by requiring the data recipient to commit to a data sharing agreement.
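    As a rough picture of threshold-based suppression, the toy sketch below treats a record's re-identification risk as the reciprocal of its equivalence-class size on the quasi-identifiers and suppresses those values when the risk exceeds the threshold. This is our simplified stand-in, not CIHI's actual algorithm, and the field names are hypothetical.

```python
from collections import Counter

def suppress_to_threshold(records, quasi_ids, threshold=0.05):
    """Toy de-identification sketch: approximate a record's risk of correct
    re-identification as 1 / (size of its equivalence class on quasi_ids)
    and suppress the quasi-identifier values where risk > threshold."""
    key = lambda r: tuple(r[q] for q in quasi_ids)
    sizes = Counter(key(r) for r in records)
    out = []
    for r in records:
        if 1.0 / sizes[key(r)] > threshold:
            r = {**r, **{q: None for q in quasi_ids}}  # suppress the values
        out.append(r)
    return out

# Toy usage: at threshold 0.05, classes smaller than 20 records get suppressed.
recs = [{"region": "A", "age_group": "80+", "diag": "I21"} for _ in range(3)]
print(suppress_to_threshold(recs, ["region", "age_group"]))
```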