
    Online Social Attacker Detection Based on Ego-Network Analysis

    Doctoral thesis, Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, February 2020. 김종권.
    In the last decade we have witnessed the explosive growth of online social networking services (SNSs) such as Facebook, Twitter, Weibo and LinkedIn. While SNSs provide diverse benefits, for example fostering interpersonal relationships, community formation and news propagation, they have also attracted unwelcome nuisances. Spammers abuse SNSs as vehicles to spread spam rapidly and widely. Spam, meaning unsolicited or inappropriate messages, significantly impairs the credibility and reliability of services. Detecting spammers has therefore become an urgent and critical issue in SNSs. This thesis deals with spamming on Twitter and Weibo. Instead of spreading annoying messages to the public, a spammer of this kind follows (subscribes to) many normal users and aims to elicit follow-backs from them. Sometimes a spammer builds a link farm to raise a target account's follower count and explicit influence. Based on the assumption that the online relationships of spammers differ from those of normal users, I propose classification schemes that detect online social attackers, including spammers. I first focus on social relations within the ego-network and devise two kinds of features: structural features based on the Triad Significance Profile (TSP), and relational semantic features based on hierarchical homophily in an ego-network. Experiments on real Twitter and Weibo datasets demonstrate that the proposed approach is very practical. The proposed features are scalable because they inspect user-centered ego-networks instead of analyzing the whole network. My performance study shows that the proposed methods yield significantly better performance than prior schemes in terms of true positives and false positives.
    Contents:
    1 Introduction
    2 Related Work
      2.1 OSN Spammer Detection Approaches
        2.1.1 Contents-based Approach
        2.1.2 Social Network-based Approach
        2.1.3 Subnetwork-based Approach
        2.1.4 Behavior-based Approach
      2.2 Link Spam Detection
      2.3 Data Mining Schemes for Spammer Detection
      2.4 Sybil Detection
    3 Triad Significance Profile Analysis
      3.1 Motivation
      3.2 Twitter Dataset
      3.3 Indegree and Outdegree of Dataset
      3.4 Twitter Spammer Detection with TSP
      3.5 TSP-Filtering
      3.6 Performance Evaluation of TSP-Filtering
    4 Hierarchical Homophily Analysis
      4.1 Motivation
      4.2 Hierarchical Homophily in OSN
        4.2.1 Basic Analysis of Datasets
        4.2.2 Status Gap Distribution and Assortativity
        4.2.3 Hierarchical Gap Distribution
      4.3 Performance Evaluation of HH-Filtering
    5 Overall Performance Evaluation
    6 Conclusion
    Bibliography

    Personalized large scale classification of public tenders on hadoop

    This project was completed as part of an innovation partnership between Fujitsu Canada and Université Laval. The needs and objectives of the project were centered on a business problem defined jointly with Fujitsu. The project aimed to classify a corpus of electronic public tenders using state-of-the-art Hadoop big data technology. The objective was to identify, with high recall, the public tenders relevant to the IT services business of Fujitsu Canada. A small-scale prototype based on the BNS (Bi-Normal Separation) algorithm was shown empirically to classify the public tender corpus with high recall (93%). The prototype was then re-implemented on a full-scale Hadoop cluster, using Apache Pig for the data preparation pipeline and Apache Mahout for classification. Our experiments show that the large-scale system not only maintains high recall (91%) on the classification task, but can also readily take advantage of the massive scalability gains made possible by Hadoop's distributed architecture.
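For readers unfamiliar with Bi-Normal Separation, the metric scores a term by how far apart its true-positive and false-positive rates lie under the inverse standard normal CDF: BNS = |Φ⁻¹(tpr) − Φ⁻¹(fpr)|. The sketch below is only an illustration of that formula, not the project's Pig/Mahout implementation; the function name `bns_score`, the count arguments and the clipping bound `eps` are assumptions made for this example.

```python
from statistics import NormalDist

# Standard normal distribution, used for its inverse CDF (Phi^-1).
_STD_NORMAL = NormalDist()


def bns_score(tp, fn, fp, tn, eps=0.0005):
    """Bi-Normal Separation of one term.

    tp/fn: relevant documents that do / do not contain the term.
    fp/tn: irrelevant documents that do / do not contain the term.
    eps is an assumed clipping bound that keeps the inverse CDF finite.
    """
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Clip both rates away from 0 and 1, where Phi^-1 diverges.
    tpr = min(max(tpr, eps), 1 - eps)
    fpr = min(max(fpr, eps), 1 - eps)
    return abs(_STD_NORMAL.inv_cdf(tpr) - _STD_NORMAL.inv_cdf(fpr))
```

A term that appears equally often in relevant and irrelevant tenders scores 0, while a term concentrated in one class scores higher, which is what makes the value usable as a feature weight.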

    What Users Ask a Search Engine: Analyzing One Billion Russian Question Queries

    We analyze the question queries submitted to a large commercial web search engine to gain insight into what people ask and to better tailor the search results to users' needs. Based on a dataset of about one billion question queries submitted during 2012, we investigate askers' querying behavior with the support of automatic query categorization. While the importance of question queries is likely to increase, at present they make up only 3–4% of total search traffic. Since questions are such a small part of the query stream and are more likely to be unique than shorter queries, clickthrough information is typically rather sparse. Thus, query categorization methods based on the categories of clicked web documents do not work well for questions. As an alternative, we propose a robust question query classification method that uses the labeled questions from a large community question answering (CQA) platform as a training set. The resulting classifier is then transferred to the web search questions. Even though questions on CQA platforms tend to differ from web search questions, our categorization method proves competitive with strong baselines in terms of classification accuracy. To show the scalability of the proposed method, we apply the classifiers to about one billion question queries and discuss the trade-offs between performance and accuracy that different classification models offer. Our findings reveal what people ask a search engine and how this contrasts with behavior on a CQA platform.

    Email classification via intention-based segmentation

    Email is the most popular means of personal and official communication among people and organizations. Because the virtual environment is untrusted, email systems face frequent attacks such as malware, spamming and social engineering. Spamming is the most common malicious activity: unsolicited emails are sent in bulk, and these spam emails can carry malware and waste resources, and hence degrade productivity. In spam filter development, the most important challenge is to find the correlation between the nature of spam and the interests of users, because those interests are dynamic. This paper proposes a novel dynamic spam filter model that accounts for changes in users' interests over time while handling spam activity. It uses intention-based segmentation to compare different segments of text documents instead of comparing the documents as a whole. The proposed spam filter is a multi-tier approach. In the first stage, the email content is divided into segments with the help of part-of-speech (POS) tagging based on voices and tenses. Next, the segments are clustered using hierarchical clustering and compared using the vector space model. In the third stage, concept drift is detected in the clusters to identify changes in the user's interests. In the last stage, ham emails are classified into various categories. The Enron dataset is used for the experiments, and the results obtained are promising.
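The segment comparison step can be pictured with a small vector space model example. This is a minimal sketch under stated assumptions, not the paper's pipeline: it uses raw term counts and cosine similarity, skips POS-based segmentation and clustering entirely, and the helper names `term_vector` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector for one segment (assumed: lowercase, whitespace split)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty segment is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Two hypothetical segments that share most of their vocabulary.
seg_a = term_vector("please review the attached contract")
seg_b = term_vector("review the attached invoice today")
similarity = cosine_similarity(seg_a, seg_b)
```

Segments with identical wording score close to 1 and segments with no shared terms score 0, so the value can serve as the distance input to a clustering step.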

    Assessing and augmenting SCADA cyber security: a survey of techniques

    SCADA systems monitor and control critical infrastructures of national importance such as power generation and distribution, water supply, transportation networks, and manufacturing facilities. The pervasiveness, miniaturisation and declining cost of internet connectivity have transformed these systems from strictly isolated networks into highly interconnected ones. Connectivity provides immense benefits such as reliability, scalability and remote access, but at the same time exposes an otherwise isolated and secure system to global cyber security threats. This inevitable transformation to highly connected systems thus requires effective security safeguards, as any compromise or downtime of SCADA systems can have severe economic, safety and security ramifications. One way to protect vital assets is to adopt an attacker's viewpoint to determine weaknesses and loopholes in defences. Such a mindset helps to identify and fix potential breaches before they are exploited. This paper surveys tools and techniques for uncovering SCADA system vulnerabilities. A comprehensive review of the selected approaches is provided, along with their applicability.