5 research outputs found
Deep Hashing for Image Similarity Search
Hashing for similarity search is one of the most widely used methods to solve the approximate nearest neighbor search problem. In this method, one first maps data items from a real valued high-dimensional space to a suitable low dimensional binary code space and then performs the approximate nearest neighbor search in this code space instead. This is beneficial because the search in the code space can be solved more efficiently in terms of runtime complexity and storage consumption. Obviously, for this method to succeed, it is necessary that similar data items be mapped to binary code words that have small Hamming distance. For real-world data such as images, one usually proceeds as follows. For each data item, a pre-processing algorithm removes noise and insignificant information and extracts important discriminating information to generate a feature vector that captures the important semantic content. Next, a vector hash function maps this real valued feature vector to a binary code word. It is also possible to use the raw feature vectors afterwards to further process the search result candidates produced by binary hash codes. In this dissertation we focus on the following. First, developing a learning based counterpart for the MinHash hashing algorithm. Second, presenting a new unsupervised hashing method UmapHash to map the neighborhood relations of data items from the feature vector space to the binary hash code space. Finally, an application of the aforementioned hashing methods for rapid face image recognition
Stability is Stable: Connections between Replicability, Privacy, and Adaptive Generalization
The notion of replicable algorithms was introduced in Impagliazzo et al.
[STOC '22] to describe randomized algorithms that are stable under the
resampling of their inputs. More precisely, a replicable algorithm gives the
same output with high probability when its randomness is fixed and it is run on
a new i.i.d. sample drawn from the same distribution. Using replicable
algorithms for data analysis can facilitate the verification of published
results by ensuring that the results of an analysis will be the same with high
probability, even when that analysis is performed on a new data set.
In this work, we establish new connections and separations between
replicability and standard notions of algorithmic stability. In particular, we
give sample-efficient algorithmic reductions between perfect generalization,
approximate differential privacy, and replicability for a broad class of
statistical problems. Conversely, we show any such equivalence must break down
computationally: there exist statistical problems that are easy under
differential privacy, but that cannot be solved replicably without breaking
public-key cryptography. Furthermore, these results are tight: our reductions
are statistically optimal, and we show that any computational separation
between DP and replicability must imply the existence of one-way functions.
Our statistical reductions give a new algorithmic framework for translating
between notions of stability, which we instantiate to answer several open
questions in replicability and privacy. This includes giving sample-efficient
replicable algorithms for various PAC learning, distribution estimation, and
distribution testing problems, algorithmic amplification of in
approximate DP, conversions from item-level to user-level privacy, and the
existence of private agnostic-to-realizable learning reductions under
structured distributions.Comment: STOC 2023, minor typos fixe
A Framework for Verifying the Fixity of Archived Web Resources
The number of public and private web archives has increased, and we implicitly trust content delivered by these archives. Fixity is checked to ensure that an archived resource has remained unaltered (i.e., fixed) since the time it was captured. Currently, end users do not have the ability to easily verify the fixity of content preserved in web archives. For instance, if a web page is archived in 1999 and replayed in 2019, how do we know that it has not been tampered with during those 20 years? In order for the users of web archives to verify that archived web resources have not been altered, they should have access to fixity information associated with these resources. However, most web archives do not allow accessing fixity information and, more importantly, even if fixity information is available, it is provided by the same archive delivering the resource, not by an independent archive or service.
In this research, we present a framework for establishing and checking the fixity on the playback of archived resources, or mementos. The framework defines an archive-aware hashing function that consists of several guidelines for generating repeatable fixity information on the playback of mementos. These guidelines are results of our 14-month study for identifying and quantifying changes in replayed mementos over time that affect generating repeatable fixity information. Changes on the playback of mementos may be caused by JavaScript, transient errors, inconsistency in the availability of mementos over time, and archive-specific resources. Changes are also caused by transformations in the content of archived resources applied by web archives to appropriately replay these resources in a user’s browser. The study also shows that only 11.55% of mementos always produce the same fixity information after each replay, while about 16.06% of mementos always produce different fixity information after each replay. The remaining 72.39% of mementos produce multiple unique fixity information. We also find that mementos may disappear when web archives move to different domains or archives.
In addition to defining multiple guidelines for generating fixity information, the framework introduces two approaches, Atomic and Block, that can be used to disseminate fixity information to web archives. The main difference between the two approaches is that, in the Atomic approach, the fixity information of each archived web page is stored in a separate file before being disseminated to several on-demand web archives, while in the Block approach, we batch together fixity information of multiple archived pages to a single binary-searchable file before being disseminated to archives. The framework defines the structure of URLs used to publish fixity information on the web and retrieve archived fixity information from web archives. Our framework does not require changes in the current web archiving infrastructure, and it is built based on well-known web archiving standards, such as the Memento protocol. The proposed framework will allow users to generate fixity information on any archived page at any time, preserve the fixity information independently from the archive delivering the archived page, and verify the fixity of the archived page at any time in the future