25 research outputs found
Addressing the new generation of spam (Spam 2.0) through Web usage models
New Internet collaborative media introduce new ways of communicating that are not immune to abuse. A fake eye-catching profile on a social networking website, a promotional review, a response to a thread in an online forum with unsolicited content, or a manipulated Wiki page are examples of the new generation of spam on the web, referred to as Web 2.0 Spam or Spam 2.0. Spam 2.0 is defined as the propagation of unsolicited, anonymous, mass content to infiltrate legitimate Web 2.0 applications. The current literature does not address Spam 2.0 in depth, and the outcome of efforts to date is inadequate. The aim of this research is to formalise a definition of Spam 2.0 and to provide Spam 2.0 filtering solutions. Early detection, extendibility, robustness and adaptability are key factors in the design of the proposed method. This dissertation provides a comprehensive survey of state-of-the-art web spam and Spam 2.0 filtering methods to highlight the unresolved issues and open problems, while at the same time effectively capturing the knowledge in the domain of spam filtering. It proposes three solutions in the area of Spam 2.0 filtering: (1) characterising and profiling Spam 2.0, (2) the Early-Detection based Spam 2.0 Filtering (EDSF) approach, and (3) the On-the-Fly Spam 2.0 Filtering (OFSF) approach. All the proposed solutions are tested against real-world datasets and their performance is compared with that of existing Spam 2.0 filtering methods. This work has coined the term "Spam 2.0", provided insight into the nature of Spam 2.0, and proposed filtering mechanisms to address this new and rapidly evolving problem.
Modeling User Expertise in Folksonomies by Fusing Multi-type Features
Abstract. A folksonomy is an online collaborative tagging system that offers an open platform for content annotation with an uncontrolled vocabulary. As folksonomies gain in popularity, expert search and spammer detection in folksonomies are attracting more and more attention. However, most previous work is limited to a subset of folksonomy features. In this paper, we introduce a generic and flexible user expertise model for expert search and spammer detection. We first investigate a comprehensive set of expertise evidences related to users, objects and tags in folksonomies. We then discuss the rich interactions between them and propose a unified Continuous CRF model to integrate these features and interactions. The model's applications to expert recommendation and spammer detection are also explored. Extensive experiments conducted on a real tagging dataset demonstrate the model's advantages over previous methods, both in performance and coverage.
Web Spambot Detection Based on Web Navigation Behaviour
Web robots have been widely used for various beneficial and malicious activities. Web spambots are a type of web robot that spreads spam content throughout the web, typically targeting Web 2.0 applications. They are intelligently designed to replicate human behaviour in order to bypass system checks. Spam content not only wastes valuable resources but can also mislead users to unsolicited websites and award undeserved search engine rankings to spammers' campaign websites. While most research in anti-spam filtering focuses on identifying spam content on the web, only a few studies have investigated the origin of spam content, so the identification and detection of web spambots remains an open area of research. In this paper, we describe an automated supervised machine learning solution which utilises web navigation behaviour to detect web spambots. We propose a new feature set (referred to as an action set) as a representation of user behaviour to differentiate web spambots from human users. Our experimental results show that our solution achieves 96.24% accuracy in classifying web spambots.
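The core idea, representing each session as a vector of navigation-action frequencies and classifying it with a supervised learner, can be sketched minimally as follows. The action names, sessions, and the nearest-centroid learner below are illustrative assumptions; the paper's actual action set and classifier are not specified here.

```python
# Sketch: map navigation sessions to action-frequency vectors and
# classify them with a simple nearest-centroid rule (hypothetical
# actions and training data, not from the paper's dataset).

ACTIONS = ["view_topic", "post_reply", "create_thread", "edit_profile"]

def to_vector(session):
    """Convert a list of navigation actions into a normalized frequency vector."""
    counts = [session.count(a) for a in ACTIONS]
    total = sum(counts) or 1
    return [c / total for c in counts]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def classify(session, human_centroid, bot_centroid):
    """Label a session by its closer class centroid (squared Euclidean distance)."""
    v = to_vector(session)
    dist = lambda c: sum((x - y) ** 2 for x, y in zip(v, c))
    return "spambot" if dist(bot_centroid) < dist(human_centroid) else "human"

# Hypothetical training data: spambots mostly post, humans mostly browse.
humans = [["view_topic", "view_topic", "post_reply"],
          ["view_topic", "edit_profile", "view_topic"]]
bots = [["create_thread", "post_reply", "post_reply"],
        ["post_reply", "create_thread", "create_thread"]]

h_c = centroid([to_vector(s) for s in humans])
b_c = centroid([to_vector(s) for s in bots])
label = classify(["create_thread", "post_reply"], h_c, b_c)
```

A post-heavy session lands nearer the bot centroid, a browse-heavy one nearer the human centroid; the paper's contribution lies in the richer action-set features, not in any particular learner.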
Identifying experts and authoritative documents in social bookmarking systems
Social bookmarking systems allow people to create pointers to Web resources in a shared, Web-based environment. These services allow users to add free-text labels, or "tags", to their bookmarks as a way to organize resources for later recall. Ease of use, low cognitive barriers, and a lack of controlled vocabulary have allowed social bookmarking systems to grow exponentially over time. However, these same characteristics also raise concerns. Tags lack the formality of traditional classificatory metadata and suffer from the same vocabulary problems as full-text search engines. It is unclear how many valuable resources are untagged or tagged with noisy, irrelevant tags. With few restrictions to entry, annotation spamming adds noise to public social bookmarking systems. Furthermore, many algorithms for discovering semantic relations among tags do not scale to the Web.
Recognizing these problems, we develop a novel graph-based Expert and Authoritative Resource Location (EARL) algorithm to find the most authoritative documents and expert users on a given topic in a social bookmarking system. In EARL's first phase, we reduce noise in a Delicious dataset by isolating a smaller sub-network of "candidate experts", users whose tagging behavior shows potential domain and classification expertise. In the second phase, a HITS-based graph analysis is performed on the candidate experts' data to rank the top experts and authoritative documents by topic. To identify topics of interest in Delicious, we develop a distributed method to find subsets of frequently co-occurring tags shared by many candidate experts.
We evaluated EARL's ability to locate authoritative resources and domain experts in Delicious by conducting two independent experiments. The first experiment relies on human judges' n-point scale ratings of resources suggested by three graph-based algorithms and Google. The second experiment evaluates the proposed approach's ability to identify classification expertise through human judges' n-point scale ratings of classification terms versus expert-generated data.
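The HITS-based second phase can be illustrated with a minimal sketch of the standard Kleinberg hubs-and-authorities iteration that EARL builds on. The toy user/document edges below are hypothetical; EARL's actual graph construction over candidate experts is not reproduced here.

```python
# Minimal sketch of the standard HITS iteration underlying EARL's
# second phase: users act as hubs, documents as authorities.
# The edge list below is illustrative only, not Delicious data.

def hits(edges, iterations=50):
    """edges: list of (hub, authority) pairs, e.g. (user, document)."""
    nodes = {n for e in edges for n in e}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of nodes pointing to it.
        new_auth = {n: 0.0 for n in nodes}
        for h, a in edges:
            new_auth[a] += hub[h]
        # Hub score: sum of (updated) authority scores it points to.
        new_hub = {n: 0.0 for n in nodes}
        for h, a in edges:
            new_hub[h] += new_auth[a]
        # L2-normalize both score vectors to keep values bounded.
        norm_a = sum(v * v for v in new_auth.values()) ** 0.5
        norm_h = sum(v * v for v in new_hub.values()) ** 0.5
        auth = {n: v / norm_a for n, v in new_auth.items()}
        hub = {n: v / norm_h for n, v in new_hub.items()}
    return hub, auth

# Three users bookmark d1, one also bookmarks d2.
edges = [("u1", "d1"), ("u1", "d2"), ("u2", "d1"), ("u3", "d1")]
hub, auth = hits(edges)
# d1, bookmarked by all three users, receives the top authority score.
top_doc = max(auth, key=auth.get)
```

Documents saved by many strong hubs accumulate authority, which is why restricting the graph to candidate experts first (EARL's phase one) suppresses spam-driven score inflation.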
Information quality in online social media and big data collection: an example of Twitter spam detection
The popularity of OSM is mainly conditioned by the integrity and the quality of UGC as well as the protection of users' privacy. Based on the definition of information quality
as fitness for use, the high usability and accessibility of OSM have exposed many information quality (IQ) problems which consequently decrease the performance of OSM-dependent applications. Such problems are caused by ill-intentioned individuals who misuse OSM services to spread different kinds of noisy information, including fake information, illegal commercial content, drug sales, malware downloads, and phishing links. The propagation and spreading of noisy information cause enormous drawbacks in terms of resource consumption, degraded quality of service in OSM-based applications, and wasted human effort.
The majority of popular social networks (e.g., Facebook and Twitter) are attacked daily by an enormous number of ill-intentioned users. However, these networks are ineffective at handling such noisy information, requiring several weeks or months to detect it. Moreover, several challenges stand in the way of building a complete OSM-based noisy-information filtering method that can overcome the shortcomings of existing OSM information filters. These challenges are: (i) big data; (ii) privacy and security; (iii) structure heterogeneity; (iv) UGC format diversity; (v) subjectivity and objectivity; and (vi) service limitations.
In this thesis, we focus on increasing the quality of social UGC published and publicly accessible in the form of posts and profiles on OSM by addressing the stated challenges in depth. As social spam is the most common IQ problem on OSM, we introduce two generic approaches for detecting and filtering out spam content. The first approach detects spam posts (e.g., spam tweets) in a real-time stream, while the other is dedicated to handling big data collections of social profiles (e.g., Twitter accounts). To filter spam content in real time, we introduce an unsupervised collective-based framework that automatically adapts a supervised spam-tweet classification function, yielding an up-to-date real-time classifier without requiring manually annotated datasets. In the second approach, we treat big data collections by minimizing the search space of profiles that need advanced analysis, instead of processing every user profile in the collection. Each profile falling within the reduced search space is then analyzed further to produce an accurate decision using a binary classification model.
The experiments conducted on the Twitter online social network have shown that the unsupervised collective-based framework is able to produce an updated and effective real-time binary tweet classification function that adapts to the rapid evolution of social spammers' strategies on Twitter, outperforming two existing real-time spam detection methods. The results of the second approach have demonstrated that a preprocessing step that extracts spammy metadata values and leverages them in the retrieval process is a feasible solution for handling large collections of Twitter profiles, as an alternative to processing all profiles in the input data collection.
The introduced approaches open up opportunities for information science researchers to leverage our solutions in other information filtering problems and applications. Our long-term perspective consists of (i) developing a generic platform covering the most common OSM for instantly checking the quality of a given piece of information, where the input may take the form of profiles, website links, posts, or plain text; and (ii) transforming and adapting our methods to handle additional IQ problems such as rumors and information overload.
Community Detection in Hypergraphs
Many datasets can be interpreted as graphs, i.e. as elements (nodes) and binary relations between them (edges). Under the label of complex network analysis, a vast array of graph-based methods allows the exploration of datasets purely based on such structural properties. Community detection, as a subfield of network analysis, aims to identify well-connected subparts of graphs. While the grouping of related elements is useful in itself, these groups can furthermore be collapsed into single nodes, creating a new graph of reduced complexity which may better reveal the original graph's macrostructure. Therefore, advances in community detection improve the understanding of complex networks in general. However, not every dataset can be modelled properly with binary relations - higher-order relations give rise to so-called hypergraphs. This thesis explores the generalization of community detection approaches to hypergraphs. In the focus of attention are social bookmarking datasets, created by users of online bookmarking services who assign freely chosen keywords, so-called "tags", to documents.
This "tagging" creates, for each tag assignment, a ternary connection between the user, the document, and the tag, inducing particular structures called 3-partite, 3-uniform hypergraphs (henceforth called 3,3- or more generally k,k-hypergraphs). The question pursued here is how to decompose these structures in a formally adequate manner, and how this improves the understanding of these rich datasets. First, a generalization of connected components to k,k-hypergraphs is proposed. The standard definition of connected components here rather uninformatively assigns almost all elements to a single giant component. The generalized so-called hyperincident connected components, however, show a characteristic size distribution on the social bookmarking datasets that is disrupted by, e.g., spamming activity - demonstrating a link between behavioural patterns and structural features that is further explored in the following. Next, the general topic of community detection in k,k-hypergraphs is introduced. Three challenges are posited that are not met by the naive application of standard techniques, and three families of synthetic hypergraphs are introduced containing increasingly complex community setups that a successful detection approach must be able to identify. The main methodological contribution of this thesis is the development of a multi-partite (i.e. suitable for k,k-hypergraphs) community detection algorithm. It is based on modularity optimization, a well-established method for detecting communities in non-partite, i.e. "normal" graphs. Starting from the simplest approach possible, the method is successively refined to meet the previously defined as well as empirically encountered challenges, culminating in the definition of the "balanced multi-partite modularity". Finally, an interactive tool for exploring the obtained community assignments is introduced.
Using this tool, the benefits of balanced multi-partite modularity can be shown: intricate patterns can be observed that are missed by the simpler approaches. These findings are confirmed by a more quantitative examination: unsupervised quality measures (considering, e.g., compression) document the advantages of this approach across a larger number of samples. To conclude, the contributions of this thesis are twofold. It provides practical tools for the analysis of social bookmarking data, complemented by theoretical contributions: the generalization of connected components and modularity from graphs to k,k-hypergraphs.
Mining, Modeling, and Leveraging Multidimensional Web Metrics to Support Scholarly Communities
The significant proliferation of scholarly output and the emergence of multidisciplinary research areas are rendering the research environment increasingly complex. In addition, an increasing number of researchers are using academic social networks to discover and store scholarly content. The spread of scientific discourse and research activities across the web, especially on social media platforms, suggests that far-reaching changes are taking place in scholarly communication and the geography of science.
This dissertation provides integrated techniques and methods designed to address the information overload problem facing scholarly environments and to enhance the research process. There are four main contributions in this dissertation. First, this study identifies, quantifies, and analyzes international researchers' dynamic scholarly information behaviors, activities, and needs, especially after the emergence of social media platforms. The findings, based on qualitative and quantitative analysis, report new scholarly patterns and reveal differences between researchers according to academic status and discipline.
Second, this study mines massive scholarly datasets, models diverse multidimensional non-traditional web-based indicators (altmetrics), and evaluates and predicts scholarly and societal impact at various levels. The results address some of the limitations of traditional citation-based metrics and broaden the understanding and utilization of altmetrics. Third, this study recommends scholarly venues semantically related to researchers' current interests. The results provide important up-to-the-minute signals that represent a closer reflection of research interests than post-publication usage-based metrics.
Finally, this study develops a new scholarly framework that supports the construction of online scholarly communities and bibliographies through reputation-based social collaboration, introducing a collaborative, self-promoting system in which users advance their participation through analysis of the quality, timeliness, and quantity of their contributions. The framework improves the precision and quality of social reference management systems.
By analyzing and modeling digital footprints, this dissertation provides a basis for tracking and documenting the impact of scholarship using new models that are more akin to reading breaking news than to watching a historical documentary made several years after the events it describes.