
    Addressing the new generation of spam (Spam 2.0) through Web usage models

    New Internet collaborative media introduce new ways of communicating that are not immune to abuse. A fake eye-catching profile on a social networking website, a promotional review, a response to a thread in an online forum with unsolicited content, or a manipulated Wiki page are examples of the new generation of spam on the web, referred to as Web 2.0 Spam or Spam 2.0. Spam 2.0 is defined as the propagation of unsolicited, anonymous, mass content to infiltrate legitimate Web 2.0 applications. The current literature does not address Spam 2.0 in depth, and the outcomes of efforts to date are inadequate. The aim of this research is to formalise a definition of Spam 2.0 and provide Spam 2.0 filtering solutions. Early detection, extendibility, robustness and adaptability are key factors in the design of the proposed method. This dissertation provides a comprehensive survey of state-of-the-art web spam and Spam 2.0 filtering methods to highlight the unresolved issues and open problems, while at the same time effectively capturing the knowledge in the domain of spam filtering. This dissertation proposes three solutions in the area of Spam 2.0 filtering: (1) characterising and profiling Spam 2.0, (2) an Early-Detection based Spam 2.0 Filtering (EDSF) approach, and (3) an On-the-Fly Spam 2.0 Filtering (OFSF) approach. All the proposed solutions are tested against real-world datasets and their performance is compared with that of existing Spam 2.0 filtering methods. This work has coined the term ‘Spam 2.0’, provided insight into the nature of Spam 2.0, and proposed filtering mechanisms to address this new and rapidly evolving problem.

    Modeling User Expertise in Folksonomies by Fusing Multi-type Features

    A folksonomy is an online collaborative tagging system that offers a new, open platform for content annotation with an uncontrolled vocabulary. As folksonomies gain in popularity, expert search and spammer detection in folksonomies are attracting more and more attention. However, most previous work is limited to a few folksonomy features. In this paper, we introduce a generic and flexible user expertise model for expert search and spammer detection. We first investigate a comprehensive set of expertise evidence related to users, objects and tags in folksonomies. We then discuss the rich interactions between them and propose a unified Continuous CRF model to integrate these features and interactions. The model's applications to expert recommendation and spammer detection are also explored. Extensive experiments conducted on a real tagging dataset demonstrate the model's advantages over previous methods, both in performance and in coverage.
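    The following minimal Python sketch illustrates the general idea of fusing multi-type expertise evidence into a per-user score. The feature names, weights and toy users are hypothetical, and the paper's actual unified Continuous CRF model (which also captures interactions between users, objects and tags) is considerably richer than this simple linear stand-in.

```python
# Simplified sketch of multi-type feature fusion for user expertise scoring in a
# folksonomy. All feature names, weights and users below are assumptions for
# illustration, not values from the paper; features are assumed normalized to [0, 1].

def expertise_score(user_features: dict, weights: dict) -> float:
    """Linear fusion of heterogeneous expertise evidence for one user."""
    return sum(weights[name] * user_features.get(name, 0.0) for name in weights)

# Hypothetical evidence drawn from users, objects and tags.
weights = {
    "tag_vocabulary_breadth": 0.2,   # breadth of the user's tag vocabulary
    "avg_object_quality": 0.5,       # quality of the objects the user annotates
    "tag_agreement_ratio": 0.3,      # overlap with the community's tag choices
}

users = {
    "alice":    {"tag_vocabulary_breadth": 0.6, "avg_object_quality": 0.8, "tag_agreement_ratio": 0.9},
    "spam_bot": {"tag_vocabulary_breadth": 1.0, "avg_object_quality": 0.1, "tag_agreement_ratio": 0.05},
}

for name, feats in users.items():
    print(name, round(expertise_score(feats, weights), 3))  # low score flags a likely spammer
```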

    Web Spambot Detection Based on Web Navigation Behaviour

    Web robots have been widely used for various beneficial and malicious activities. Web spambots are a type of web robot that spreads spam content throughout the web, typically by targeting Web 2.0 applications. They are intelligently designed to replicate human behaviour in order to bypass system checks. Spam content not only wastes valuable resources but can also mislead users to unsolicited websites and award undeserved search engine rankings to spammers' campaign websites. While most of the research in anti-spam filtering focuses on the identification of spam content on the web, only a few studies have investigated the origin of spam content; hence, the identification and detection of web spambots remains an open area of research. In this paper, we describe an automated supervised machine learning solution which utilises web navigation behaviour to detect web spambots. We propose a new feature set (referred to as an action set) as a representation of user behaviour to differentiate web spambots from human users. Our experimental results show that our solution achieves 96.24% accuracy in classifying web spambots.
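    As a rough illustration of behaviour-based spambot classification, the sketch below encodes each session as a binary action-set vector and trains a standard classifier on it. The action names, toy sessions and the choice of a linear SVM are assumptions for demonstration only and do not reproduce the paper's exact feature encoding or learner.

```python
# Illustrative action-set classification of web sessions (human vs. spambot).
# Feature names, sessions and classifier are assumptions, not the paper's setup.
from sklearn.svm import SVC

ACTIONS = ["view_thread", "post_reply", "create_account", "post_new_thread", "edit_profile"]

def action_set_vector(session_actions):
    """Map a session's observed actions onto a fixed-length binary action-set vector."""
    observed = set(session_actions)
    return [1 if a in observed else 0 for a in ACTIONS]

# Toy training sessions: humans tend to browse before posting,
# while spambots register and post immediately.
sessions = [
    (["view_thread", "view_thread", "post_reply"], 0),              # human
    (["view_thread", "edit_profile", "post_reply"], 0),             # human
    (["create_account", "post_new_thread"], 1),                     # spambot
    (["create_account", "post_new_thread", "post_new_thread"], 1),  # spambot
]
X = [action_set_vector(s) for s, _ in sessions]
y = [label for _, label in sessions]

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([action_set_vector(["create_account", "post_new_thread"])]))  # -> [1]
```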

    Identifying experts and authoritative documents in social bookmarking systems

    Social bookmarking systems allow people to create pointers to Web resources in a shared, Web-based environment. These services allow users to add free-text labels, or “tags”, to their bookmarks as a way to organize resources for later recall. Ease-of-use, low cognitive barriers, and a lack of controlled vocabulary have allowed social bookmarking systems to grow exponentially over time. However, these same characteristics also raise concerns. Tags lack the formality of traditional classificatory metadata and suffer from the same vocabulary problems as full-text search engines. It is unclear how many valuable resources are untagged or tagged with noisy, irrelevant tags. With few restrictions on entry, annotation spamming adds noise to public social bookmarking systems. Furthermore, many algorithms for discovering semantic relations among tags do not scale to the Web. Recognizing these problems, we develop a novel graph-based Expert and Authoritative Resource Location (EARL) algorithm to find the most authoritative documents and expert users on a given topic in a social bookmarking system. In EARL’s first phase, we reduce noise in a Delicious dataset by isolating a smaller sub-network of “candidate experts”, users whose tagging behavior shows potential domain and classification expertise. In the second phase, a HITS-based graph analysis is performed on the candidate experts’ data to rank the top experts and authoritative documents by topic. To identify topics of interest in Delicious, we develop a distributed method to find subsets of frequently co-occurring tags shared by many candidate experts. We evaluated EARL’s ability to locate authoritative resources and domain experts in Delicious by conducting two independent experiments. The first experiment relied on human judges’ n-point scale ratings of resources suggested by three graph-based algorithms and Google. The second experiment evaluated the proposed approach’s ability to identify classification expertise through human judges’ n-point scale ratings of classification terms versus expert-generated data.
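    Since EARL's second phase is HITS-based, the following self-contained sketch runs a plain HITS iteration on a toy user-document bookmarking graph, with hub scores standing in for expert users and authority scores for authoritative documents. The toy edges are illustrative; EARL's actual graph construction and topic handling are not reproduced here.

```python
# Minimal HITS iteration on a bipartite user -> document bookmarking graph.
# Hubs approximate expert users, authorities approximate authoritative documents.

def hits(edges, iters=50):
    users = {u for u, _ in edges}
    docs = {d for _, d in edges}
    hub = {u: 1.0 for u in users}
    auth = {d: 1.0 for d in docs}
    for _ in range(iters):
        # authority score: sum of hub scores of users who bookmarked the document
        auth = {d: sum(hub[u] for u, dd in edges if dd == d) for d in docs}
        # hub score: sum of authority scores of documents the user bookmarked
        hub = {u: sum(auth[d] for uu, d in edges if uu == u) for u in users}
        # normalize so scores stay bounded
        a_norm = sum(v * v for v in auth.values()) ** 0.5
        h_norm = sum(v * v for v in hub.values()) ** 0.5
        auth = {d: v / a_norm for d, v in auth.items()}
        hub = {u: v / h_norm for u, v in hub.items()}
    return hub, auth

edges = [("u1", "d1"), ("u1", "d2"), ("u2", "d1"), ("u3", "d3")]
hub, auth = hits(edges)
print(sorted(hub.items(), key=lambda kv: -kv[1]))   # candidate experts, best first
print(sorted(auth.items(), key=lambda kv: -kv[1]))  # authoritative documents, best first
```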

    Information quality in online social media and big data collection: an example of Twitter spam detection

    The popularity of OSM is mainly conditioned by the integrity and quality of UGC as well as the protection of users' privacy. Based on the definition of information quality as fitness for use, the high usability and accessibility of OSM have exposed many information quality (IQ) problems which consequently decrease the performance of OSM-dependent applications. Such problems are caused by ill-intentioned individuals who misuse OSM services to spread different kinds of noisy information, including fake information, illegal commercial content, drug sales, malware downloads, and phishing links. The propagation and spreading of noisy information cause enormous drawbacks related to resource consumption, decreased quality of service of OSM-based applications, and wasted human effort. The majority of popular social networks (e.g., Facebook, Twitter) on the Web 2.0 are attacked daily by an enormous number of ill-intentioned users. However, these popular social networks are ineffective in handling the noisy information, requiring several weeks or months to detect it. Moreover, several challenges stand in the way of building complete OSM-based noisy information filtering methods that can overcome the shortcomings of existing OSM information filters. These challenges can be summarized as: (i) big data; (ii) privacy and security; (iii) structure heterogeneity; (iv) UGC format diversity; (v) subjectivity and objectivity; and (vi) service limitations. In this thesis, we focus on increasing the quality of social UGC that is published and publicly accessible in the form of posts and profiles on OSNs by addressing the stated challenges in depth. As social spam is the most common IQ problem appearing on OSM, we introduce two generic approaches for detecting and filtering out spam content. The first approach detects spam posts (e.g., spam tweets) in a real-time stream, while the other is dedicated to handling a big data collection of social profiles (e.g., Twitter accounts). For filtering spam content in real time, we introduce an unsupervised collective-based framework that automatically adapts a supervised spam tweet classification function in order to maintain an updated real-time classifier without requiring manually annotated datasets. In the second approach, we handle big data collections by minimizing the search space of profiles that needs advanced analysis, instead of processing every user profile in the collection. Each profile falling in the reduced search space is then further analyzed to produce an accurate decision using a binary classification model. The experiments conducted on the Twitter online social network have shown that the unsupervised collective-based framework is able to produce an updated and effective real-time binary tweet classification function that adapts to the rapid evolution of social spammers' strategies on Twitter, outperforming two existing real-time spam detection methods.
On the other hand, the results of the second approach have demonstrated that performing a preprocessing step to extract spammy metadata values and leveraging them in the retrieval process is a feasible solution for handling large collections of Twitter profiles, as an alternative to processing all profiles in the input data collection. The introduced approaches open up opportunities for information science researchers to leverage our solutions in other information filtering problems and applications. Our long-term perspective consists of (i) developing a generic platform covering the most common OSM for instantly checking the quality of a given piece of information, where the input information could take the form of profiles, website links, posts, or plain text; and (ii) transforming and adapting our methods to handle additional IQ problems such as rumors and information overload.
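    To give a concrete flavour of keeping a spam-tweet classifier up to date without manual labels, the sketch below pseudo-labels each incoming batch with simple heuristic rules and retrains a classifier on those pseudo-labels. The heuristics, features and learner are assumptions for illustration; the thesis's unsupervised collective classification framework is more involved than this.

```python
# Hedged sketch: per-batch retraining of a spam-tweet classifier from heuristic
# pseudo-labels, so the model can follow drifting spam strategies without a
# manually annotated dataset. Heuristics and data below are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def pseudo_label(tweet: str) -> int:
    """Very rough heuristic pseudo-labelling (illustrative, not the thesis's rules)."""
    spammy = tweet.count("http") >= 2 or "free followers" in tweet.lower()
    return 1 if spammy else 0

def retrain_on_batch(batch):
    """Refresh the classifier on the latest batch using its pseudo-labels."""
    labels = [pseudo_label(t) for t in batch]
    model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    return model.fit(batch, labels)

batch = [
    "Check out my blog post about information quality",
    "free followers http://spam.example http://spam.example",
    "Great paper on Twitter spam detection",
    "free followers here http://a.example http://b.example",
]
model = retrain_on_batch(batch)
print(model.predict(["win free followers http://x.example http://y.example"]))  # likely [1]
```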

    Community Detection in Hypergraphen

    Many datasets can be interpreted as graphs, i.e. as elements (nodes) and binary relations between them (edges). Under the label of complex network analysis, a vast array of graph-based methods allows the exploration of datasets purely based on such structural properties. Community detection, as a subfield of network analysis, aims to identify well-connected subparts of graphs. While the grouping of related elements is useful in itself, these groups can furthermore be collapsed into single nodes, creating a new graph of reduced complexity which may better reveal the original graph's macrostructure. Therefore, advances in community detection improve the understanding of complex networks in general. However, not every dataset can be modelled properly with binary relations - higher-order relations give rise to so-called hypergraphs. This thesis explores the generalization of community detection approaches to hypergraphs. The focus is on social bookmarking datasets, created by users of online bookmarking services who assign freely chosen keywords, so-called "tags", to documents. This "tagging" creates, for each tag assignment, a ternary connection between the user, the document, and the tag, inducing particular structures called 3-partite, 3-uniform hypergraphs (henceforth called 3,3- or, more generally, k,k-hypergraphs). The question pursued here is how to decompose these structures in a formally adequate manner, and how this improves the understanding of these rich datasets. First, a generalization of connected components to k,k-hypergraphs is proposed. The standard definition of connected components here rather uninformatively assigns almost all elements to a single giant component. The generalized, so-called hyperincident connected components, however, show a characteristic size distribution on the social bookmarking datasets that is disrupted by, e.g., spamming activity - demonstrating a link between behavioural patterns and structural features that is explored further in the following. Next, the general topic of community detection in k,k-hypergraphs is introduced. Three challenges are posited that are not met by the naive application of standard techniques, and three families of synthetic hypergraphs are introduced containing increasingly complex community setups that a successful detection approach must be able to identify. The main methodological contribution of this thesis is the development, described in the following, of a multi-partite (i.e. suitable for k,k-hypergraphs) community detection algorithm.
It is based on modularity optimization, a well-established approach for detecting communities in non-partite, i.e. "normal", graphs. Starting from the simplest approach possible, the method is successively refined to meet the previously defined as well as empirically encountered challenges, culminating in the definition of the "balanced multi-partite modularity". Finally, an interactive tool for exploring the obtained community assignments is introduced. Using this tool, the benefits of balanced multi-partite modularity can be shown: intricate patterns can be observed that are missed by the simpler approaches. These findings are confirmed by a more quantitative examination: unsupervised quality measures, taking compression among other criteria into account, document the advantages of this approach across a larger number of samples. To conclude, the contributions of this thesis are twofold. It provides practical tools for the analysis of social bookmarking data, complemented by theoretical contributions: the generalization of connected components and modularity from graphs to k,k-hypergraphs.
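    For reference, the quantity being generalized is standard Newman modularity for ordinary (non-partite) graphs; the sketch below computes it directly from its definition on a toy partitioned graph. The balanced multi-partite modularity itself is not reproduced here, and the toy graph and partition are assumptions for illustration.

```python
# Standard (non-partite) Newman modularity, computed directly from its definition:
# Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j), undirected graph.
from itertools import combinations

def modularity(edges, communities):
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    nodes = list(degree)
    adj = {frozenset(e) for e in edges}
    q = 0.0
    for i, j in combinations(nodes, 2):
        if communities[i] != communities[j]:
            continue
        a_ij = 1.0 if frozenset((i, j)) in adj else 0.0
        q += 2 * (a_ij - degree[i] * degree[j] / (2 * m))  # factor 2 counts (i, j) and (j, i)
    q += sum(-degree[i] ** 2 / (2 * m) for i in nodes)     # self-pairs: A_ii = 0
    return q / (2 * m)

# Two triangles joined by a single bridge edge, split into two communities.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(round(modularity(edges, partition), 3))  # ~0.357 for this partition
```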

    Proceedings of the 9th Dutch-Belgian Information Retrieval Workshop


    Mining, Modeling, and Leveraging Multidimensional Web Metrics to Support Scholarly Communities

    The significant proliferation of scholarly output and the emergence of multidisciplinary research areas are rendering the research environment increasingly complex. In addition, an increasing number of researchers are using academic social networks to discover and store scholarly content. The spread of scientific discourse and research activities across the web, especially on social media platforms, suggests that far-reaching changes are taking place in scholarly communication and the geography of science. This dissertation provides integrated techniques and methods designed to address the information overload problem facing scholarly environments and to enhance the research process. There are four main contributions in this dissertation. First, this study identifies, quantifies, and analyzes international researchers’ dynamic scholarly information behaviors, activities, and needs, especially after the emergence of social media platforms. The findings, based on qualitative and quantitative analysis, report new scholarly patterns and reveal differences between researchers according to academic status and discipline. Second, this study mines massive scholarly datasets, models diverse multidimensional non-traditional web-based indicators (altmetrics), and evaluates and predicts scholarly and societal impact at various levels. The results address some of the limitations of traditional citation-based metrics and broaden the understanding and utilization of altmetrics. Third, this study recommends scholarly venues semantically related to researchers’ current interests. The results provide important up-to-the-minute signals that reflect research interests more closely than post-publication usage-based metrics. Finally, this study develops a new scholarly framework that supports the construction of online scholarly communities and bibliographies via reputation-based social collaboration, introducing a collaborative, self-promoting system in which users advance their participation based on the quality, timeliness, and quantity of their contributions. The framework improves the precision and quality of social reference management systems. By analyzing and modeling digital footprints, this dissertation provides a basis for tracking and documenting the impact of scholarship using new models that are more akin to reading breaking news than to watching a historical documentary made several years after the events it describes.
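    As a simple illustration of the venue recommendation idea, the sketch below ranks venues by cosine similarity between a researcher's current interest profile and venue topic profiles. The bag-of-terms profiles, weights and venue names are hypothetical, and the dissertation's semantic matching is not reproduced here.

```python
# Illustrative interest-to-venue matching via cosine similarity over weighted term profiles.
# Profiles and weights are assumptions for demonstration only.
from math import sqrt

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

researcher = {"altmetrics": 0.8, "scholarly communication": 0.6, "social media": 0.4}
venues = {
    "Venue A": {"scholarly communication": 0.9, "altmetrics": 0.5},
    "Venue B": {"social media": 0.7, "web mining": 0.8},
}
ranked = sorted(venues, key=lambda v: cosine(researcher, venues[v]), reverse=True)
print(ranked)  # venues most related to the researcher's current interests first
```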