110 research outputs found
From Frequency to Meaning: Vector Space Models of Semantics
Computers understand very little of the meaning of human language. This
profoundly limits our ability to give instructions to computers, the ability of
computers to explain their actions to us, and the ability of computers to
analyse and process text. Vector space models (VSMs) of semantics are beginning
to address these limits. This paper surveys the use of VSMs for semantic
processing of text. We organize the literature on VSMs according to the
structure of the matrix in a VSM. There are currently three broad classes of
VSMs, based on term-document, word-context, and pair-pattern matrices, yielding
three classes of applications. We survey a broad range of applications in these
three categories and we take a detailed look at a specific open source project
in each category. Our goal in this survey is to show the breadth of
applications of VSMs for semantics, to provide a new perspective on VSMs for
those who are already familiar with the area, and to provide pointers into the
literature for those who are less familiar with the field
Learning compact hashing codes with complex objectives from multiple sources for large scale similarity search
Similarity search is a key problem in many real world applications including image and text retrieval, content reuse detection and collaborative filtering. The purpose of similarity search is to identify similar data examples given a query example. Due to the explosive growth of the Internet, a huge amount of data such as texts, images and videos has been generated, which indicates that efficient large scale similarity search becomes more important.^ Hashing methods have become popular for large scale similarity search due to their computational and memory efficiency. These hashing methods design compact binary codes to represent data examples so that similar examples are mapped into similar codes. This dissertation addresses five major problems for utilizing supervised information from multiple sources in hashing with respect to different objectives. Firstly, we address the problem of incorporating semantic tags by modeling the latent correlations between tags and data examples. More precisely, the hashing codes are learned in a unified semi-supervised framework by simultaneously preserving the similarities between data examples and ensuring the tag consistency via a latent factor model. Secondly, we solve the missing data problem by latent subspace learning from multiple sources. The hashing codes are learned by enforcing the data consistency among different sources. Thirdly, we address the problem of hashing on structured data by graph learning. A weighted graph is constructed based on the structured knowledge from the data. The hashing codes are then learned by preserving the graph similarities. Fourthly, we address the problem of learning high ranking quality hashing codes by utilizing the relevance judgments from users. The hashing code/function is learned via optimizing a commonly used non-smooth non-convex ranking measure, NDCG. Finally, we deal with the problem of insufficient supervision by active learning. We propose to actively select the most informative data examples and tags in a joint manner based on the selection criteria that both the data examples and tags should be most uncertain and dissimilar with each other.^ Extensive experiments on several large scale datasets demonstrate the superior performance of the proposed approaches over several state-of-the-art hashing methods from different perspectives
Recommended from our members
Hardware and software fingerprinting of mobile devices
This dissertation presents novel and practical algorithms to identify the software and hardware components on mobile devices. In particular, we make significant contributions in two challenging areas: library fingerprinting, to identify third-party software libraries, and device fingerprinting, to identify individual hardware components. Our work has significant implications for the privacy and security of mobile platforms.
Software-based library fingerprinting can be used to detect vulnerable libraries and uncover large-scale data collection activities. We develop a novel Android library finger-printing tool, LibID, to reliably identify specific versions of in-app third-party libraries. LibID is more effective against code obfuscation than prior art. When comparing LibID with other tools in identifying the correct library version using obfuscated F-Droid apps, LibID achieves an F1 score of more than 0.5 in all cases while prior work is below 0.25. We also demonstrate the utility of LibID by detecting the use of a vulnerable version of the OkHttp library in nearly 10% of the 3 958 popular apps on the Google Play Store.
Hardware-based device fingerprinting allows apps and websites to invade user privacy by tracking user activity online as the user moves between apps or websites. In particular, we present a new type of device fingerprinting attack, the factory calibration fingerprinting attack, that recovers embedded per-device factory calibration data from motion sensors in a smartphone. We investigate the calibration behaviour of each sensor and show that the calibration fingerprint is fast to generate, does not change over time or after a factory reset, and can be obtained without any special user permissions.
We estimate the entropy of the calibration fingerprint and find the fingerprint is very likely to be globally unique for iOS devices (~67 bits of entropy for iPhone 6S) and recent Google Pixel devices (~57 bits of entropy for Pixel 4/4 XL). By comparison, the fingerprint generated by previous work has at most 13 bits of entropy. Following our disclosures, Apple deployed a fix in iOS 12.2 and Google in Android 11.
Both code obfuscation and factory calibration help to hide software and hardware idiosyncrasies from third-parties, but this dissertation demonstrates that reliable software and hardware fingerprints can still be generated given sufficient knowledge and a suitable approach. Our work has significant practical implications and can be used to improve platform security and protect user privacy.China Scholarship Council
The Boeing Company
Microsoft Researc
Learning in the Real World: Constraints on Cost, Space, and Privacy
The sheer demand for machine learning in fields as varied as: healthcare, web-search ranking, factory automation, collision prediction, spam filtering, and many others, frequently outpaces the intended use-case of machine learning models. In fact, a growing number of companies hire machine learning researchers to rectify this very problem: to tailor and/or design new state-of-the-art models to the setting at hand.
However, we can generalize a large set of the machine learning problems encountered in practical settings into three categories: cost, space, and privacy. The first category (cost) considers problems that need to balance the accuracy of a machine learning model with the cost required to evaluate it. These include problems in web-search, where results need to be delivered to a user in under a second and be as accurate as possible. The second category (space) collects problems that require running machine learning algorithms on low-memory computing devices. For instance, in search-and-rescue operations we may opt to use many small unmanned aerial vehicles (UAVs) equipped with machine learning algorithms for object detection to find a desired search target. These algorithms should be small to fit within the physical memory limits of the UAV (and be energy efficient) while reliably detecting objects. The third category (privacy) considers problems where one wishes to run machine learning algorithms on sensitive data. It has been shown that seemingly innocuous analyses on such data can be exploited to reveal data individuals would prefer to keep private. Thus, nearly any algorithm that runs on patient or economic data falls under this set of problems.
We devise solutions for each of these problem categories including (i) a fast tree-based model for explicitly trading off accuracy and model evaluation time, (ii) a compression method for the k-nearest neighbor classifier, and (iii) a private causal inference algorithm that protects sensitive data
- …