3 research outputs found

    Using Data Mining Methods to Predict Personally Identifiable Information in Emails

    Private information management and compliance are now important issues for most organizations. As a major communication tool for organizations, email is one of many potential sources of privacy leaks. Information extraction methods have been applied to detect private information in text files. However, since email messages usually consist of low-quality text, information extraction methods for private information detection may not achieve good performance. In this paper, we address the problem of predicting the presence of private information in email using data mining and text mining methods. Two prediction models are proposed. The first model is based on association rules that predict one type of private information based on other types of private information identified in emails. The second model is based on classification models that predict private information according to the content of the emails. Experiments on the Enron email dataset show promising results.
    NRC publication: Ye
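    The association-rule idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `rule_confidences`, the `min_support` threshold, the restriction to single-item rules, and the toy data are all assumptions made for illustration. Each email is represented only by the set of private-information types already detected in it, and the sketch estimates how often one type appears given another.

```python
from itertools import permutations

def rule_confidences(emails, min_support=2):
    """Return {(antecedent, consequent): confidence} for single-item rules.

    `emails` is a list of sets; each set holds the private-information
    types found in one email. A rule A -> B is kept only if A and B
    co-occur in at least `min_support` emails (hypothetical threshold).
    """
    counts = {}       # number of emails containing each type
    pair_counts = {}  # number of emails containing both types of an ordered pair
    for types in emails:
        for t in types:
            counts[t] = counts.get(t, 0) + 1
        for a, b in permutations(sorted(types), 2):
            pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1
    # confidence(A -> B) = support(A and B) / support(A)
    return {
        (a, b): n / counts[a]
        for (a, b), n in pair_counts.items()
        if n >= min_support
    }

# Toy data (invented for illustration, not taken from the Enron dataset).
emails = [
    {"name", "phone"},
    {"name", "phone", "address"},
    {"name"},
    {"phone", "address"},
]
rules = rule_confidences(emails)
for (antecedent, consequent), conf in sorted(rules.items()):
    print(f"{antecedent} -> {consequent}: confidence {conf:.2f}")
```

    In this toy data every email that contains an address also contains a phone number, so the rule address -> phone has confidence 1.0, while name -> address is dropped because the pair co-occurs only once.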

    Investigating Obfuscation as a Tool to Enhance Photo Privacy on Social Networks Sites

    Photos which contain rich visual information can be a source of privacy issues. Some privacy issues associated with photos include identification of people, inference attacks, location disclosure, and sensitive information leakage. However, photo privacy is often hard to achieve because the content in the photos is both what makes them valuable to viewers, and what causes privacy concerns. Photo sharing often occurs via Social Network Sites (SNSs). Photo privacy is difficult to achieve via SNSs due to two main reasons: first, SNSs seldom notify users of the sensitive content in their photos that might cause privacy leakage; second, the recipient control tools available on SNSs are not effective. The only solution that existing SNSs (e.g., Facebook, Flickr) provide is control over who receives a photo. This solution allows users to withhold the entire photo from certain viewers while sharing it with other viewers. The idea is that if viewers cannot see a photo, then privacy risk is minimized. However, withholding or self-censoring photos is not always the solution people want. In some cases, people want to be able to share photos, or parts of photos, even when they have privacy concerns about the photo. To provide better online photo privacy protection options for users, we leverage a behavioral theory of privacy that identifies and focuses on two key elements that influence privacy -- information content and information recipient. This theory provides a vocabulary for discussing key aspects of privacy and helps us organize our research to focus on the two key parameters through a series of studies. In my thesis, I describe five studies I have conducted. First, I focus on the content parameter to identify what portions of an image are considered sensitive and therefore are candidates to be obscured to increase privacy. 
I provide a taxonomy of content sensitivity that can help designers of photo-privacy mechanisms understand what categories of content users consider sensitive. Then, focusing on the recipient parameter, I describe how elements of the taxonomy are associated with users' sharing preferences for different categories of recipients (e.g., colleagues vs. family members). Second, focusing on controlling photo content disclosure, I invented privacy-enhancing obfuscations, evaluated their effectiveness against human recognition, and studied how they affect the viewing experience. Third, after discovering that avatar and inpainting are two promising obfuscation methods, I studied whether they were robust when de-identifying both familiar and unfamiliar people, since viewers are likely to know the people in OSN photos. Additionally, I quantified the prevalence of self-reported photo self-censorship and discovered that privacy-preserving obfuscations might be useful for combating photo self-censorship. Building on the knowledge gained from the studies above, I proposed a privacy-enhanced photo-sharing interface that helps users identify potentially sensitive content and provides obfuscation options. To evaluate the interface, I compared the proposed obfuscation approach with two other approaches: a control condition that mimics the current Facebook photo-sharing interface, and an interface that provides a privacy warning about potentially sensitive content. The results show that our proposed system performs better than the other two in terms of reducing perceived privacy risks, increasing willingness to share, and enhancing usability. Overall, our research will benefit privacy researchers, online social network designers, policymakers, computer vision researchers, and anyone who has shared or wants to share photos online.

    Stylistics versus Statistics: A corpus linguistic approach to combining techniques in forensic authorship analysis using Enron emails

    This thesis empirically investigates how a corpus linguistic approach can address the main theoretical and methodological challenges facing the field of forensic authorship analysis. Linguists approach the problem of questioned authorship from the theoretical position that each person has their own distinctive idiolect (Coulthard 2004: 431). However, the notion of idiolect has come under scrutiny in forensic linguistics over recent years for being too abstract to be of practical use (Grant 2010; Turell 2010). At the same time, two competing methodologies have developed in authorship analysis. On the one hand, there are qualitative stylistic approaches, and on the other there are statistical ‘stylometric’ techniques. This study uses a corpus of over 60,000 emails and 2.5 million words written by 176 employees of the former American company Enron to tackle these issues in the contexts of both authorship attribution (identifying authors using linguistic evidence) and author profiling (predicting authors’ social characteristics using linguistic evidence). Analyses reveal that even in shared communicative contexts, and when using very common lexical items, individual Enron employees produce distinctive collocation patterns and lexical co-selections. In turn, these idiolectal elements of linguistic output can be captured and quantified by word n-grams (strings of n words). An attribution experiment is performed using word n-grams to identify the authors of anonymised email samples. Results of the experiment are encouraging, and it is argued that the approach developed here offers a means by which stylistic and statistical techniques can complement each other. Finally, quantitative and qualitative analyses are combined in the sociolinguistic profiling of Enron employees by gender and occupation. Current author profiling research is exclusively statistical in nature. 
However, the findings here demonstrate that when statistical results are augmented by qualitative evidence, the complex relationship between language use and author identity can be more accurately observed.
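    As a rough illustration of how word n-grams might be used to attribute an anonymised sample, the sketch below compares sets of word bigrams using Jaccard similarity. The similarity measure, the whitespace tokenisation, the function names, and the toy texts are all assumptions for illustration; the abstract does not state which measure or preprocessing the thesis uses.

```python
def word_ngrams(text, n=2):
    """Return the set of word n-grams (as tuples) in a lower-cased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def attribute(sample, known_texts, n=2):
    """Return (best_author, scores) for a sample against known writings.

    `known_texts` maps a hypothetical author label to that author's text.
    """
    sample_grams = word_ngrams(sample, n)
    scores = {
        author: jaccard(sample_grams, word_ngrams(text, n))
        for author, text in known_texts.items()
    }
    return max(scores, key=scores.get), scores

# Toy texts (invented for illustration, not drawn from the Enron corpus).
known = {
    "author_a": "please see the attached file and let me know",
    "author_b": "thanks for the update talk to you soon",
}
sample = "please see the attached and let me know your thoughts"
best, scores = attribute(sample, known)
print(best, scores)
```

    Here the sample shares six bigrams with author_a and none with author_b, so it is attributed to author_a. A realistic experiment would need far more text per author and a held-out evaluation.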