Neural Text Sanitization with Privacy Risk Indicators: An Empirical Analysis
Text sanitization is the task of redacting a document to mask all occurrences
of (direct or indirect) personal identifiers, with the goal of concealing the
identity of the individual(s) referred to in it. In this paper, we consider a
two-step approach to text sanitization and provide a detailed analysis of its
empirical performance on two recently published datasets: the Text
Anonymization Benchmark (Pilán et al., 2022) and a collection of Wikipedia
biographies (Papadopoulou et al., 2022). The text sanitization process starts
with a privacy-oriented entity recognizer that seeks to determine the text
spans expressing identifiable personal information. This privacy-oriented
entity recognizer is trained by combining a standard named entity recognition
model with a gazetteer populated by person-related terms extracted from
Wikidata. The second step of the text sanitization process consists of
assessing the privacy risk associated with each detected text span, either
isolated or in combination with other text spans. We present five distinct
indicators of the re-identification risk, respectively based on language model
probabilities, text span classification, sequence labelling, perturbations, and
web search. We provide a contrastive analysis of each privacy indicator and
highlight their benefits and limitations, notably in relation to the available
labeled data.
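The two-step process described above can be sketched in a minimal, self-contained way. The gazetteer entries, the token-level recognizer, and the risk function below are all toy stand-ins: the paper's recognizer combines a trained NER model with Wikidata-derived terms, and its five risk indicators use language-model probabilities, classification, sequence labelling, perturbations, and web search, none of which are reproduced here.

```python
import re

# Hypothetical gazetteer of person-related terms (the paper populates
# this from Wikidata; these entries are illustrative only).
GAZETTEER = {"nurse", "engineer", "widow"}

def detect_spans(text):
    """Step 1: privacy-oriented entity recognition (a toy stand-in for
    the combined NER + gazetteer model described in the abstract)."""
    spans = []
    for m in re.finditer(r"\b\w+\b", text):
        tok = m.group(0)
        if tok[0].isupper() or tok.lower() in GAZETTEER:
            spans.append((m.start(), m.end(), tok))
    return spans

def risk(token):
    """Step 2: a crude re-identification risk indicator. Capitalised
    tokens (likely names/places) score above gazetteer matches."""
    return 1.0 if token[0].isupper() else 0.5

def sanitize(text, threshold=0.5):
    """Mask every detected span whose risk meets the threshold."""
    out = list(text)
    for start, end, tok in detect_spans(text):
        if risk(tok) >= threshold:
            out[start:end] = "*" * (end - start)
    return "".join(out)
```

Lowering `threshold` trades residual privacy risk against utility of the redacted text, which is the tension the paper's five indicators are designed to quantify.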
Matchability of heterogeneous networks pairs
We consider the problem of graph matchability in non-identically distributed networks. In a general class of edge-independent networks, we demonstrate that graph matchability is almost surely lost when matching the networks directly, and is almost perfectly recovered when first centering the networks using Universal Singular Value Thresholding before matching. These theoretical results are then demonstrated in both real-data and synthetic simulation settings. We also recover analogous core-matchability results in a very general core-junk network model, wherein some vertices do not correspond between the graph pair.
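The centering step named in the abstract can be sketched as follows. Universal Singular Value Thresholding (USVT, Chatterjee 2015) estimates the edge-probability matrix underlying a noisy adjacency matrix by discarding singular values below a universal threshold on the order of sqrt(n); the network is then centered by subtracting this estimate before matching. The threshold constant and the function names here are a simplified illustration, not the paper's exact procedure.

```python
import numpy as np

def usvt(A, eta=0.01):
    """Universal Singular Value Thresholding: estimate the
    edge-probability matrix P behind adjacency matrix A by keeping
    only singular values above (2 + eta) * sqrt(n)."""
    n = A.shape[0]
    U, s, Vt = np.linalg.svd(A)
    keep = s >= (2 + eta) * np.sqrt(n)
    P_hat = (U[:, keep] * s[keep]) @ Vt[keep]
    # Probabilities must lie in [0, 1].
    return np.clip(P_hat, 0.0, 1.0)

def center(A, eta=0.01):
    """Centre a network before matching, as in the abstract:
    subtract the USVT estimate of its edge-probability matrix."""
    return A - usvt(A, eta)
```

On a dense Erdős–Rényi-style matrix, `usvt` recovers the constant edge probability almost exactly, so `center` removes the shared low-rank structure that would otherwise dominate and derail direct matching of heterogeneous pairs.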
Methods of Disambiguating and De-anonymizing Authorship in Large Scale Operational Data
Operational data from software development, social networks and other domains are often contaminated with incorrect or missing values. Examples include misspelled or changed names, multiple emails belonging to the same person and user profiles that vary across different systems. Such digital traces are extensively used in research and practice to study collaborating communities of various kinds. To achieve a realistic representation of the networks that represent these communities, accurate identities are essential. In this work, we aim to identify, model, and correct identity errors in data from open-source software repositories, which include more than 23M developer IDs and nearly 1B Git commits (developer activity records). Our investigation into the nature and prevalence of identity errors in software activity data reveals that they differ from, and occur at much higher rates than, those in other domains. Existing techniques relying on string comparisons can only disambiguate Synonyms, but not Homonyms, which are common in software activity traces. Therefore, we introduce measures of behavioral fingerprinting to improve the accuracy of Synonym resolution and to disambiguate Homonyms. Fingerprints are constructed from the traces of developers’ activities, such as the style of writing in commit messages, the patterns in files modified and projects participated in by developers, and the patterns related to the timing of the developers’ activity. Furthermore, to address the lack of training data necessary for the supervised learning approaches used in disambiguation, we design a specific active learning procedure that minimizes the manual effort necessary to create training data in the domain of developer identity matching. We extensively evaluate the proposed approach against commercial and recent research approaches, using over 16,000 OpenStack developers in 1200 projects, and further against recent research on a much larger sample of over 2,000,000 IDs.
Results demonstrate that our method significantly outperforms both the recent research and commercial methods. We also conduct experiments to demonstrate that such erroneous data have a significant impact on developer networks. We hope that the proposed approach will expedite research progress in the domain of software engineering, especially in applications for which graphs of social networks are critical.
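The behavioral-fingerprinting idea above can be illustrated with a minimal sketch: build a count vector over activity features (here, commit hour-of-day and file types, two of the signal families the abstract names) and compare identities by cosine similarity. The commit representation, feature names, and scoring are hypothetical simplifications, not the authors' actual model.

```python
from collections import Counter
import math

def fingerprint(commits):
    """Build a behavioural fingerprint from an identity's activity
    trace. Each commit is a (hour_of_day, files) tuple; the
    fingerprint counts timing and file-type features."""
    fp = Counter()
    for hour, files in commits:
        fp[f"hour:{hour}"] += 1
        for f in files:
            fp["ext:" + f.rsplit(".", 1)[-1]] += 1
    return fp

def similarity(fp_a, fp_b):
    """Cosine similarity between two fingerprints; a high score
    suggests two IDs (e.g. a Homonym pair sharing a generic name)
    belong to the same developer."""
    dot = sum(fp_a[k] * fp_b[k] for k in fp_a.keys() & fp_b.keys())
    na = math.sqrt(sum(v * v for v in fp_a.values()))
    nb = math.sqrt(sum(v * v for v in fp_b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Unlike string comparison of names or emails, this comparison still works when two IDs share an identical generic name (a Homonym) but exhibit different working hours and file-type habits.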
“You’re trolling because…” – A Corpus-based Study of Perceived Trolling and Motive Attribution in the Comment Threads of Three British Political Blogs
This paper investigates the linguistically marked motives that participants attribute to those they call trolls in 991 comment threads of three British political blogs. The study is concerned with how these motives affect the discursive construction of trolling and trolls. Another goal of the paper is to examine whether the mainly emotional motives ascribed to trolls in the academic literature correspond with those that the participants attribute to the alleged trolls in the analysed threads. The paper identifies five broad motives ascribed to trolls: emotional/mental health-related/social reasons, financial gain, political beliefs, being employed by a political body, and unspecified political affiliation. It also points out that, depending on these motives, trolling and trolls are constructed in various ways. Finally, the study argues that participants attribute motives to trolls not only to explain their behaviour but also to insult them.