You are your Metadata: Identification and Obfuscation of Social Media Users using Metadata Information
Metadata are associated with most of the information we produce in our daily
interactions and communication in the digital world. Yet, surprisingly,
metadata are often still categorized as non-sensitive. Indeed, in the past,
researchers and practitioners have mainly focused on the problem of
identifying a user from the content of a message.
In this paper, we use Twitter as a case study to quantify the uniqueness of
the association between metadata and user identity and to understand the
effectiveness of potential obfuscation strategies. More specifically, we
analyze atomic fields in the metadata and systematically combine them in an
effort to classify new tweets as belonging to an account using different
machine learning algorithms of increasing complexity. We demonstrate that
through the application of a supervised learning algorithm, we are able to
identify any user in a group of 10,000 with approximately 96.7% accuracy.
Moreover, if we broaden the scope of our search and consider the 10 most likely
candidates, we increase the accuracy of the model to 99.22%. We also found that
data obfuscation is hard and ineffective for this type of data: even after
perturbing 60% of the training data, it is still possible to classify users
with an accuracy higher than 95%. These results have strong implications in
terms of the design of metadata obfuscation strategies, for example for data
set release, not only for Twitter, but, more generally, for most social media
platforms.
Comment: 11 pages, 13 figures. Published in the Proceedings of the 12th
International AAAI Conference on Web and Social Media (ICWSM 2018). June
2018. Stanford, CA, US
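The classification setup the abstract describes can be sketched as a multi-class problem over metadata fields. The sketch below is a hedged illustration only: the field names, user profiles, and data are all invented, and a Random Forest stands in for the paper's "machine learning algorithms of increasing complexity". It also shows the top-k evaluation that mirrors the "10 most likely candidates" metric.

```python
# Hypothetical sketch: classify tweets to accounts from metadata alone.
# All field names and data are synthetic, not the paper's dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_users, tweets_per_user = 20, 30

# Toy metadata fields: follower count, friend count, account age, tweet length.
# Each user gets a characteristic profile plus per-tweet noise.
profiles = rng.uniform([10, 10, 30, 20], [5000, 2000, 3000, 280],
                       size=(n_users, 4))
X = np.repeat(profiles, tweets_per_user, axis=0)
X = X * rng.normal(1.0, 0.05, size=X.shape)
y = np.repeat(np.arange(n_users), tweets_per_user)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Top-k evaluation: is the true account among the k most likely candidates?
proba = clf.predict_proba(X)
top3 = np.argsort(proba, axis=1)[:, -3:]
top3_acc = np.mean([y[i] in top3[i] for i in range(len(y))])
print(f"top-3 accuracy: {top3_acc:.2f}")
```

Because each user's metadata profile is nearly constant across their tweets, even a simple classifier separates the accounts; the paper's point is that real Twitter metadata behaves similarly at much larger scale.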
Using Contexts and Constraints for Improved Geotagging of Human Trafficking Webpages
Extracting geographical tags from webpages is a well-motivated application in
many domains. In illicit domains with unusual language models, like human
trafficking, extracting geotags with both high precision and recall is a
challenging problem. In this paper, we describe a geotag extraction framework
in which context, constraints and the openly available Geonames knowledge base
work in tandem in an Integer Linear Programming (ILP) model to achieve good
performance. In preliminary empirical investigations, the framework improves
precision by 28.57% and F-measure by 36.9% on a difficult human trafficking
geotagging task compared to a machine learning-based baseline. The method is
already being integrated into an existing knowledge base construction system
widely used by US law enforcement agencies to combat human trafficking.Comment: 6 pages, GeoRich 2017 workshop at ACM SIGMOD conferenc
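The core idea of combining per-mention candidate scores with global consistency constraints can be illustrated with a toy example. This is not the paper's ILP formulation (which a solver such as an off-the-shelf ILP package would handle); it is a brute-force sketch with invented place names and scores, where the constraint forces all selected candidates to share one admin region.

```python
# Toy illustration of constrained geotag selection (not the paper's ILP):
# each mention has candidate locations with context scores; pick one
# candidate per mention, subject to a geographic-consistency constraint.
from itertools import product

candidates = {
    "Springfield": [("Springfield", "IL", 0.6), ("Springfield", "MA", 0.5)],
    "Chicago":     [("Chicago", "IL", 0.9)],
}

best_score, best_assignment = float("-inf"), None
for picks in product(*candidates.values()):
    if len({region for _, region, _ in picks}) > 1:
        continue  # constraint: all picks must share one admin region
    score = sum(s for *_, s in picks)
    if score > best_score:
        best_score, best_assignment = score, picks

print(best_assignment)
```

Here "Springfield, IL" wins even though its local score barely beats "Springfield, MA", because only the IL reading is consistent with the unambiguous "Chicago" mention; an ILP expresses the same trade-off declaratively and scales beyond brute force.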
Fidelity-Weighted Learning
Training deep neural networks requires many training samples, but in practice
training labels are expensive to obtain and may be of varying quality, as some
may be from trusted expert labelers while others might be from heuristics or
other sources of weak supervision such as crowd-sourcing. This creates a
fundamental quality-versus-quantity trade-off in the learning process. Do we
learn from the small amount of high-quality data or the potentially large
amount of weakly-labeled data? We argue that if the learner could somehow know
and take the label-quality into account when learning the data representation,
we could get the best of both worlds. To this end, we propose
"fidelity-weighted learning" (FWL), a semi-supervised student-teacher approach
for training deep neural networks using weakly-labeled data. FWL modulates the
parameter updates to a student network (trained on the task we care about) on a
per-sample basis according to the posterior confidence of its label-quality
estimated by a teacher (who has access to the high-quality labels). Both
student and teacher are learned from the data. We evaluate FWL on two tasks in
information retrieval and natural language processing where we outperform
state-of-the-art alternative semi-supervised methods, indicating that our
approach makes better use of strong and weak labels, and leads to better
task-dependent data representations.
Comment: Published as a conference paper at ICLR 201
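The fidelity-weighting idea of modulating per-sample updates by label confidence can be sketched in a few lines of numpy. This is a heavily simplified stand-in, not the paper's student-teacher architecture: the "teacher confidence" is hard-coded rather than estimated by a model trained on clean labels, and the student is plain linear regression.

```python
# Minimal numpy sketch of fidelity weighting: each sample's gradient
# contribution is scaled by a (here hard-coded) teacher confidence in
# that sample's label. Data and confidences are synthetic.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_clean = X @ true_w

# Weak labels: half the samples receive heavy label noise.
noisy = rng.random(200) < 0.5
y_weak = y_clean + noisy * rng.normal(0, 5.0, 200)

# Stand-in teacher confidence: high for clean samples, low for noisy ones.
confidence = np.where(noisy, 0.1, 1.0)

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = (X @ w - y_weak)[:, None] * X              # per-sample gradients
    w -= lr * (confidence[:, None] * grad).mean(0)    # fidelity-weighted step

print(np.round(w, 2))  # lands near true_w despite the noisy half
```

Down-weighting the noisy half lets the student recover parameters close to the clean solution; with uniform weights the noisy labels would pull the estimate much further off.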
Semantic Interaction in Web-based Retrieval Systems: Adopting Semantic Web Technologies and Social Networking Paradigms for Interacting with Semi-structured Web Data
Existing web retrieval models for exploration and interaction with web data do not take into account semantic information, nor do they allow for new forms of interaction by employing meaningful interaction and navigation metaphors in 2D/3D. This thesis investigates means of introducing a semantic dimension into the search and exploration process of web content to enable a significantly positive user experience. Therefore, an inherently dynamic view beyond single concepts and models from semantic information processing, information extraction and human-machine interaction is adopted. Essential tasks for semantic interaction such as semantic annotation, semantic mediation and semantic human-computer interaction were identified and elaborated for two general application scenarios in web retrieval: Web-based Question Answering in a knowledge-based dialogue system and semantic exploration of information spaces in 2D/3D.
Machine learning from crowds: a systematic review of its applications
Crowdsourcing opens the door to solving a wide variety of problems that previously were unfeasible in the field of machine learning, allowing us to obtain relatively low-cost labeled data in a small amount of time. However, due to the uncertain quality of labelers, the data to deal with are sometimes unreliable, forcing practitioners to collect information redundantly, which poses new challenges in the field. Despite these difficulties, many applications of machine learning using crowdsourced data have recently been published that achieved state-of-the-art results in relevant problems. We have analyzed these applications following a systematic methodology, classifying them into different fields of study, highlighting several of their characteristics and showing the recent interest in the use of crowdsourcing for machine learning. We also identify several exciting research lines based on the problems that remain unsolved to foster future research in this field.
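The redundant collection the abstract mentions is usually resolved by label aggregation. A common starting point is plain majority voting, sketched below with invented items and votes; more refined methods in this literature additionally model per-labeler quality (the classic Dawid-Skene approach).

```python
# Toy label aggregation by majority vote over redundant crowd labels.
# Items and votes are invented for illustration.
from collections import Counter

crowd_labels = {
    "item1": ["cat", "cat", "dog"],
    "item2": ["dog", "dog", "dog"],
    "item3": ["cat", "dog", "cat", "cat"],
}

aggregated = {item: Counter(votes).most_common(1)[0][0]
              for item, votes in crowd_labels.items()}
print(aggregated)  # → {'item1': 'cat', 'item2': 'dog', 'item3': 'cat'}
```

Majority voting treats all labelers as equally reliable; weighting votes by estimated labeler quality is exactly where the machine-learning-from-crowds literature surveyed here takes over.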
Empirically-Grounded Construction of Bug Prediction and Detection Tools
There is an increasing demand for high-quality software, as software bugs have an economic impact not only on software projects, but also on national economies in general. Software quality is achieved via the main quality assurance activities of testing and code reviewing. However, these activities are expensive, so they need to be carried out efficiently.
Auxiliary software quality tools such as bug detection and bug prediction tools help developers focus their testing and reviewing activities on the parts of software that more likely contain bugs. However, these tools are far from adoption as mainstream development tools. Previous research points to their inability to adapt to the peculiarities of projects and their high rate of false positives as the main obstacles to their adoption.
We propose empirically-grounded analysis to improve the adaptability and efficiency of bug detection and prediction tools. For a bug detector to be efficient, it needs to detect bugs that are conspicuous, frequent, and specific to a software project. We empirically show that the null-related bugs fulfill these criteria and are worth building detectors for. We analyze the null dereferencing problem and find that its root cause lies in methods that return null. We propose an empirical solution to this problem that depends on the wisdom of the crowd. For each API method, we extract the nullability measure that expresses how often the return value of this method is checked against null in the ecosystem of the API. We use nullability to annotate API methods with nullness annotation and warn developers about missing and excessive null checks.
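The nullability measure described above can be sketched as a simple ratio per API method: the fraction of observed call sites whose return value is checked against null. The call-site records below are invented; a real system would mine them from an ecosystem's source code.

```python
# Sketch of the nullability measure: per API method, the fraction of
# call sites that check the returned value against null.
# Call-site data here is synthetic.
from collections import defaultdict

# (method, return_value_checked_against_null) per observed call site
call_sites = [
    ("Map.get", True), ("Map.get", True), ("Map.get", False),
    ("List.size", False), ("List.size", False),
]

counts = defaultdict(lambda: [0, 0])  # method -> [checked, total]
for method, checked in call_sites:
    counts[method][0] += checked
    counts[method][1] += 1

nullability = {m: c / t for m, (c, t) in counts.items()}
print(nullability)  # Map.get ≈ 0.67, List.size = 0.0
```

A high nullability score suggests the ecosystem treats the method as nullable, so a missing null check at a new call site is worth a warning; a score near zero makes an existing null check look excessive.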
For a bug predictor to be efficient, it needs to be optimized as both a machine learning model and a software quality tool. We empirically show how feature selection and hyperparameter optimizations improve prediction accuracy. Then we optimize bug prediction to locate the maximum number of bugs in the minimum amount of code by finding the most cost-effective combination of bug prediction configurations, i.e., dependent variables, machine learning model, and response variable. We show that using both source code and change metrics as dependent variables, applying feature selection on them, then using an optimized Random Forest to predict the number of bugs results in the most cost-effective bug predictor.
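The pipeline described above, combining metric families, applying feature selection, then regressing bug counts with an optimized Random Forest, can be outlined as follows. All metrics and data are synthetic, and the hyperparameters are placeholders, not the optimized configuration the thesis arrives at.

```python
# Hedged sketch of the described pipeline: combined metrics -> feature
# selection -> Random Forest predicting bug counts, then a ranking of
# files by predicted bugs to focus review effort. Data is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_files = 300
# Synthetic per-file metrics: source-code and change metrics plus noise columns.
X = rng.normal(size=(n_files, 8))
bugs = np.maximum(
    0, 1.5 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(0, 0.5, n_files)
).round()

model = make_pipeline(
    SelectKBest(f_regression, k=4),         # feature selection step
    RandomForestRegressor(n_estimators=100, random_state=0),
)
model.fit(X, bugs)

# Rank files by predicted bug count: inspect the top of the list first
# to find the maximum number of bugs in the minimum amount of code.
ranking = np.argsort(model.predict(X))[::-1]
print("top-5 files to inspect:", ranking[:5])
```

The cost-effectiveness framing corresponds to evaluating how many bugs are covered per line of code inspected when walking down this ranking, which is what the configuration search in the thesis optimizes.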
Throughout this thesis, we show how empirically-grounded analysis helps us achieve efficient bug prediction and detection tools and adapt them to the characteristics of each software project.