
    Addressing the new generation of spam (Spam 2.0) through Web usage models

    New Internet collaborative media introduce new ways of communicating that are not immune to abuse. A fake eye-catching profile on a social networking website, a promotional review, a response to a thread in an online forum with unsolicited content, or a manipulated Wiki page are examples of the new generation of spam on the web, referred to as Web 2.0 Spam or Spam 2.0. Spam 2.0 is defined as the propagation of unsolicited, anonymous, mass content to infiltrate legitimate Web 2.0 applications. The current literature does not address Spam 2.0 in depth and the outcome of efforts to date is inadequate. The aim of this research is to formalise a definition for Spam 2.0 and provide Spam 2.0 filtering solutions. Early detection, extendibility, robustness and adaptability are key factors in the design of the proposed method. This dissertation provides a comprehensive survey of the state-of-the-art web spam and Spam 2.0 filtering methods to highlight the unresolved issues and open problems, while at the same time effectively capturing the knowledge in the domain of spam filtering. This dissertation proposes three solutions in the area of Spam 2.0 filtering: (1) characterising and profiling Spam 2.0, (2) an Early-Detection based Spam 2.0 Filtering (EDSF) approach, and (3) an On-the-Fly Spam 2.0 Filtering (OFSF) approach. All the proposed solutions are tested against real-world datasets and their performance is compared with that of existing Spam 2.0 filtering methods. This work has coined the term ‘Spam 2.0’, provided insight into the nature of Spam 2.0, and proposed filtering mechanisms to address this new and rapidly evolving problem.

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
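
    The idea of pairing RSS feeds with HTML pages can be illustrated with a minimal sketch. It is not the report's method: it assumes that a post title from the feed appears verbatim in the post's HTML page, and it uses simple voting over tag paths as a stand-in for the unsupervised learning step. All names (`TextPathCollector`, `infer_title_path`, `extract_with_path`) and the sample data are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # never closed in HTML


class TextPathCollector(HTMLParser):
    """Record the tag path (e.g. 'html/body/h1.entry-title') of every text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.texts = []  # list of (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.texts.append(("/".join(self.stack), text))


def collect(html):
    collector = TextPathCollector()
    collector.feed(html)
    return collector.texts


def infer_title_path(rss_xml, html_pages):
    """Return the tag path that most often holds a feed title verbatim, or None."""
    root = ET.fromstring(rss_xml)
    titles = {" ".join(item.findtext("title", "").split()) for item in root.iter("item")}
    votes = Counter(path for html in html_pages
                    for path, text in collect(html) if text in titles)
    return votes.most_common(1)[0][0] if votes else None


def extract_with_path(html, path):
    """Apply an inferred path to a page, e.g. one that is no longer in the feed."""
    return [text for p, text in collect(html) if p == path]


def make_page(title):
    # Hypothetical blog template used only to build sample data.
    return (f"<html><head><title>{title} | My Blog</title></head><body>"
            f"<div class='post'><h1 class='entry-title'>{title}</h1>"
            f"<p>Hello<br>world</p></div></body></html>")


RSS_SAMPLE = ("<rss><channel><item><title>First post</title></item>"
              "<item><title>Second post</title></item></channel></rss>")
PAGES = [make_page("First post"), make_page("Second post")]

if __name__ == "__main__":
    path = infer_title_path(RSS_SAMPLE, PAGES)
    print(path)
    print(extract_with_path(make_page("Third post"), path))
```

    The feed acts as a source of automatically obtained labels, so no manual annotation is needed; the inferred path can then be applied to older posts that have dropped out of the feed.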

    Supporting exploratory browsing with visualization of social interaction history

    This thesis is concerned with the design, development, and evaluation of information visualization tools for supporting exploratory browsing. Information retrieval (IR) systems currently do not support browsing well. Responding to user queries, IR systems typically compute relevance scores of documents and then present the document surrogates to users in order of relevance. Other systems such as email clients and discussion forums simply arrange messages in reverse chronological order. Using these systems, people cannot gain an overview of a collection easily, nor do they receive adequate support for finding potentially useful items in the collection. This thesis explores the feasibility of using social interaction history to improve exploratory browsing. Social interaction history refers to traces of interaction among users in an information space, such as discussions that happen in the blogosphere or online newspapers through the commenting facility. The basic hypothesis of this work is that social interaction history can serve as a good indicator of the potential value of information items. Therefore, visualization of social interaction history would offer navigational cues for finding potentially valuable information items in a collection. To test this basic hypothesis, I conducted three studies. First, I ran a statistical analysis of a social media data set. The results showed that there were positive relationships between traces of social interaction and the degree of interestingness of web articles. Second, I conducted a feasibility study to collect initial feedback about the potential of social interaction history to support information exploration. Comments from the participants were in line with the research hypothesis. Finally, I conducted a summative evaluation to measure how well visualization of social interaction history can improve exploratory browsing. The results showed that visualization of social interaction history was able to help users find interesting articles, to reduce wasted effort, and to increase user satisfaction with the visualization tool.

    Understanding and Detecting Malicious Cyber Infrastructures

    Malware (e.g., trojans, bots, and spyware) is still a pervasive threat on the Internet. It is able to infect computer systems to further launch a variety of malicious activities such as sending spam, stealing sensitive information and launching distributed denial-of-service (DDoS) attacks. In order to continue malevolent activities without being detected and to improve the efficiency of malicious activities, cyber-criminals tend to build malicious cyber infrastructures to communicate with their malware and to exploit benign users. In these infrastructures, multiple servers are set up to be efficient and anonymous in (i) malware distribution (using redirectors and exploit servers), (ii) control (using C&C servers), (iii) monetization (using payment servers), and (iv) robustness against server takedowns (using multiple backups for each type of server). The most straightforward way to counteract the malware threat is to detect malware directly on infected hosts. However, this is difficult because packing and obfuscation techniques are frequently used by malware to evade state-of-the-art anti-virus tools. Therefore, an alternate solution is to detect and disrupt the malicious cyber infrastructures used by malware. In this dissertation, we take an important step in this direction and focus on identifying malicious servers behind those malicious cyber infrastructures. We present a comprehensive inference framework to infer servers involved in malicious cyber infrastructures based on the three roles of those servers: compromised server, malicious server accessed through redirection, and malicious server accessed through direct connection. We characterize these three roles from four novel perspectives and demonstrate our detection technologies in four systems: PoisonAmplifier, SMASH, VisHunter and NeighbourWatcher. PoisonAmplifier focuses on compromised servers. It exploits the fact that cybercriminals tend to use compromised servers to trick benign users during the attacking process. Therefore, it is designed to proactively find more compromised servers. SMASH focuses on malicious servers accessed through direct connection. It exploits the fact that multiple backups are usually used in malicious cyber infrastructures to avoid server takedowns. Therefore, it leverages the correlation among malicious servers to infer a group of malicious servers. VisHunter focuses on the redirections from compromised servers to malicious servers. It exploits the fact that cybercriminals usually conceal their core malicious servers. Therefore, it is designed to detect those “invisible” malicious servers. NeighbourWatcher focuses on all general malicious servers promoted by spammers. It exploits the observation that spammers intend to promote some servers (e.g., phishing servers) on special websites (e.g., forums and wikis) to trick benign users and to improve the reputation of their malicious servers. In short, we build a comprehensive inference framework to identify servers involved in malicious cyber infrastructures from four novel perspectives and implement different inference techniques in different systems that complement each other. Our inference framework has been evaluated in live networks and/or real-world network traces. The evaluation results show that it can accurately detect malicious servers involved in malicious cyber infrastructures with a very low false positive rate. We found that the three roles of malicious servers we proposed can characterize most of the servers involved in malicious cyber infrastructures, and that the four principles we developed for detection are invariant across different malicious cyber infrastructures. We believe our experience and lessons are of great benefit to the future study and detection of malicious cyber infrastructures.

    Detection of Hate Speech in Videos Using Machine Learning

    With the progression of the internet and social media, people are given multiple platforms to share their thoughts and opinions about various subject matters freely. However, this freedom of speech is misused to direct hate towards individuals or groups of people because of their race, religion, gender, etc. The rise of hate speech has led to conflicts and cases of cyber bullying, causing many organizations to look for optimal solutions to this problem. Developments in the field of machine learning and deep learning have piqued the interest of researchers, leading them to research and implement solutions to the problem of hate speech. Currently, machine learning techniques are applied to textual data to detect hate speech. With the ample use of video sharing sites, there is a need to find a way to detect hate speech in videos. This project deals with classification of videos into normal or hateful categories based on the spoken content of the videos. The video dataset is built using a crawler to search and download videos based on offensive words that are specified as keywords. The audio is extracted from the videos and is converted into textual format using a speech-to-text converter to obtain a transcript of the videos. Experiments are conducted by training four models with three different feature sets extracted from the dataset. The models are evaluated by computing the specified evaluation metrics. The evaluated metrics indicate that the random forest classifier delivers the best results in classifying videos.

    A survey on opinion summarization techniques for social media

    The volume of data on social media is huge and keeps increasing. The need for efficient processing of this extensive information has resulted in increasing research interest in knowledge engineering tasks such as Opinion Summarization. This survey presents the current opinion summarization challenges for social media, then the necessary pre-summarization steps such as preprocessing, feature extraction, noise elimination, and handling of synonym features. Next, it covers the various approaches used in opinion summarization, such as Visualization, Abstractive, Aspect based, Query-focused, Real Time, and Update Summarization, and highlights other Opinion Summarization approaches such as Contrastive, Concept-based, Community Detection, Domain Specific, Bilingual, Social Bookmarking, and Social Media Sampling. It covers the different datasets used in opinion summarization and the future work suggested for each technique. Finally, it provides different ways of evaluating opinion summarization.

    ISP/PhD Comprehensive Examination
