    RTbust: Exploiting Temporal Patterns for Botnet Detection on Twitter

    Within OSNs, many of our supposedly online friends may instead be fake accounts called social bots, part of large groups that purposely re-share targeted content. Here, we study retweeting behaviors on Twitter, with the ultimate goal of detecting retweeting social bots. We collect a dataset of 10M retweets. We design a novel visualization that we leverage to highlight benign and malicious patterns of retweeting activity. In this way, we uncover a 'normal' retweeting pattern that is peculiar to human-operated accounts, and 3 suspicious patterns related to bot activities. Then, we propose a bot detection technique that stems from the previous exploration of retweeting behaviors. Our technique, called Retweet-Buster (RTbust), leverages unsupervised feature extraction and clustering. An LSTM autoencoder converts the retweet time series into compact and informative latent feature vectors, which are then clustered with a hierarchical density-based algorithm. Accounts belonging to large clusters characterized by malicious retweeting patterns are labeled as bots. RTbust obtains excellent detection results, with F1 = 0.87, whereas competitors achieve F1 < 0.76. Finally, we apply RTbust to a large dataset of retweets, uncovering 2 previously unknown active botnets with hundreds of accounts.

    Applying Machine Learning to Cyber Security

    Intrusion Detection Systems (IDSs) are nowadays a very important part of a system. In recent years, many methods have been proposed to implement this kind of security measure against cyber attacks, including approaches based on Machine Learning and Data Mining. In this work we discuss in detail the family of anomaly-based IDSs, which are able to detect previously unseen attacks, paying particular attention to adherence to the FAIR principles. These principles include the Accessibility and the Reusability of software. Moreover, as the purpose of this work is to assess the state of the art, we have selected three approaches according to their reproducibility and compared their performance in a common experimental setting. Lastly, a real-world use case has been analyzed, resulting in the proposal of an unsupervised ML model for pre-processing and analyzing web server logs. The proposed solution uses clustering and outlier detection techniques to detect attacks in an unsupervised way.
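
The outlier-detection idea on web server logs can be illustrated with a minimal sketch. The abstract does not specify the features or algorithms actually used, so everything below is an assumption made for illustration: the sample log lines are invented, the two features (request path length and count of unusual characters) are hypothetical, and a simple median/MAD score stands in for the thesis's clustering and outlier detection techniques.

```python
import re
import statistics

# Invented sample requests in Common Log Format (illustrative data only).
LOG_LINES = [
    '10.0.0.1 - - [01/Jan/2020:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2020:10:00:05 +0000] "GET /about.html HTTP/1.1" 200 734',
    '10.0.0.3 - - [01/Jan/2020:10:00:09 +0000] "GET /contact.html HTTP/1.1" 200 611',
    '10.0.0.1 - - [01/Jan/2020:10:00:12 +0000] "GET /img/logo.png HTTP/1.1" 200 2048',
    '10.0.0.4 - - [01/Jan/2020:10:00:20 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.9 - - [01/Jan/2020:10:00:31 +0000] "GET /search?q=1%27%20OR%201=1'
    '%20UNION%20SELECT%20password%20FROM%20users HTTP/1.1" 200 90',
]

REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})')


def extract_features(line):
    """Return (path length, number of unusual characters), or None if unparsable."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None
    path = match.group("path")
    unusual = sum(1 for ch in path if not ch.isalnum() and ch not in "/._-")
    return (len(path), unusual)


def robust_scores(values):
    """Distance of each value from the median, in units of the median absolute deviation."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    scale = mad if mad > 0 else 1.0  # avoid division by zero when most values are identical
    return [abs(v - median) / scale for v in values]


def find_outliers(lines, threshold=5.0):
    """Flag lines whose score on either feature exceeds the (assumed) threshold."""
    parsed = [(line, extract_features(line)) for line in lines]
    parsed = [(line, feats) for line, feats in parsed if feats is not None]
    length_scores = robust_scores([feats[0] for _, feats in parsed])
    unusual_scores = robust_scores([feats[1] for _, feats in parsed])
    return [
        line
        for (line, _), ls, us in zip(parsed, length_scores, unusual_scores)
        if max(ls, us) > threshold
    ]


if __name__ == "__main__":
    for suspicious in find_outliers(LOG_LINES):
        print(suspicious)
```

No labels are needed: a request is flagged only because it deviates strongly from the bulk of the traffic, which is the property that lets anomaly-based approaches react to attacks they have never seen.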

    The Democratization of News - Analysis and Behavior Modeling of Users in the Context of Online News Consumption

    The invention of the Internet paved the way for the democratization of information. The fact that news became more accessible to the general public held important political promise, such as reaching previously uninformed and therefore often inactive citizens. Thanks to the Internet, these citizens could now keep up to date with political events and become politically engaged themselves. While many politicians and journalists were content with this development for a decade, the situation changed with the rise of online social networks (OSNs). These OSNs are now nearly ubiquitous: 67% of Americans get at least part of their news from social media. This trend has further lowered the cost of publishing content. What initially looked like a positive development has since become a serious problem for democracies. Instead of a virtually endless supply of easily accessible information making us wiser, the sheer volume of content becomes a burden. A balanced selection of news has to give way to a flood of posts and topics that are filtered through the user's digital social environment. This fosters political polarization and ideological segregation. Moreover, more than half of OSN users no longer trust the news they read (54% are worried about false news). Consistent with this picture, studies report that OSN users are more exposed to the populism of far-left and far-right political actors than people without access to social media. To mitigate the negative effects of this development, my work contributes to the understanding of the problem and addresses basic research in the area of behavior modeling. 
Finally, we address the danger of Internet users being influenced by social bots and present a solution based on behavior modeling. To better understand the news consumption of German-speaking users in OSNs, we analyzed their behavior on Twitter and compared the reactions to controversial (in part anti-constitutional) and non-controversial content. In addition, we investigated the existence of echo chambers and similar phenomena. With regard to user behavior, we focused on networks that allow for more complex user behavior. We developed probabilistic behavior modeling solutions for the clustering and segmentation of time series. Besides contributing to the understanding of the problem, we developed solutions for detecting automated accounts. These bots play an important role in the early phase of the spread of fake news. Our expert model, based on current deep learning solutions, identifies, for example, automated accounts by their behavior. My work raises awareness of this negative development and addresses basic research in the area of behavior modeling. It also discusses the danger of influence by social bots and presents a solution based on behavior modeling.


    Diabetes Mellitus (DM) is a self-managed metabolic disease: if the individual is unwilling, unmotivated, or unable to regularly self-manage their DM, the medical and psychosocial outcomes will be poor. Indeed, DM is more than a physical health condition: it has behavioural, physiological, psychological, and social impacts, and demands high levels of motivation in order to follow the clinical recommendations and adopt healthy behaviours. To this end, the American Association of Diabetes Educators (AADE) guidelines introduced the healthy coping construct to identify healthy coping strategies for reducing symptoms of depression, anxiety, stress, and diabetes-related emotional distress while also improving the well-being of adults with DM. Virtual Coaches (VCs) have recently become more prevalent in the support and management of common barriers in the context of adherence to healthy behaviours among adults with DM, in particular those regarding medical and physical behaviours. However, few VCs were found to be specifically aimed at providing psychosocial support to adults with DM. The main aim of the present thesis was, indeed, the development and implementation of a VC for the provision of psychosocial support to adults with Type 1 (T1DM) or Type 2 DM (T2DM). More specifically, this VC aimed at motivating adults with DM to reduce symptoms of depression, anxiety, and perceived stress as well as diabetes-related emotional distress, and to improve their well-being, by encouraging them to acquire and cultivate healthy psychosocial coping strategies. These coping skills referred to the AADE guidelines and thus to practicing meditation; in this study, Mindfulness-Based Cognitive Therapy was applied. The present thesis is articulated according to three studies. 
Study 1 aimed at providing meta-analytical evidence on the efficacy of eHealth interventions in supporting the psychosocial and medical well-being of adults with T1DM or T2DM. Study 2 aimed at testing the prototype of the simulated VC, namely Wizard of Oz (WOZ), via the WhatsApp messaging platform for 6 weeks, with two sessions per week. In particular, this study investigated the preliminary acceptability and the User Experience (UX) of the intervention protocol, which will be incorporated into the future VC. Indeed, the design method was two-fold. On the one hand, the WOZ method was applied, in which psychology students believed that they were interacting with a VC, whereas they were in fact communicating with a human being. On the other hand, the Obesity-Related Behavioural Intervention Trials (ORBIT) model was used, particularly its early phases, since it favours an iterative approach. Study 3, following the next phases of the ORBIT model, aimed at assessing the preliminary efficacy of the VC, called Motibot (short for Motivational bot), developed through a combination of Natural Language Processing (NLU) and hand-crafted rules. A total of 13 Italian adults with DM (Mage = 30.08, SD = 10.61) interacted with Motibot through the Telegram messaging application for 12 sessions, which each patient scheduled according to his/her needs, interacting with Motibot for one or two sessions per week. Motibot was perceived as motivating, encouraging and able to trigger self-reflection on one's own emotions: users and patients reported having a very positive experience with Motibot. Motibot, thus, can be a useful tool to provide psychosocial support to adults with DM; as such, it might be prescribed by the diabetologist as a preventive measure for the patient's well-being and/or when the patient presents mild and moderate psychosocial symptoms. 
The user-centred design approach and the concept of bidirectionality between psychosocial and medical factors are key points in the development of a personalised digital treatment.

    Building a semantic search engine with games and crowdsourcing

    Semantic search engines aim at improving conventional search with semantic information, or meta-data, on the data searched for and/or on the searchers. So far, approaches to semantic search exploit characteristics of the searchers like age, education, or spoken language for selecting and/or ranking search results. Such data make it possible to build a semantic search engine as an extension of a conventional search engine. The crawlers of well-established search engines like Google, Yahoo! or Bing can index documents but, so far, their capabilities to recognize the intentions of searchers are still rather limited. Indeed, taking into account characteristics of the searchers considerably extends both the quantity of data to analyse and the dimensionality of the search problem. Well-established search engines therefore still focus on general search, that is, "search for all", not on specialized search, that is, "search for a few". This thesis reports on techniques that have been adapted or conceived, deployed, and tested for building a semantic search engine for the very specific context of artworks. In contrast to, for example, the interpretation of X-ray images, the interpretation of artworks is far from being fully automatable. Therefore artwork interpretation has been based on Human Computation, that is, a software-based gathering of contributions by many humans. The approach reported in this thesis first relies on so-called Games With A Purpose, or GWAPs, for this gathering: casual games provide an incentive for a potentially unlimited community of humans to contribute their appreciations of artworks. Designing convenient incentives is less trivial than it might seem at first. An ecosystem of games is needed so as to collect the intended meta-data on artworks. One game generates the data that can serve as input of another game. This results in semantically rich meta-data that can be used for building up a successful semantic search engine. 
Thus, a first part of this thesis reports on a "game ecosystem" specifically designed from one known game and including several novel games belonging to the following game classes: (1) Description Games for collecting obvious and trivial meta-data, basically the well-known ESP (for extrasensory perception) game of Luis von Ahn, (2) the Dissemination Game Eligo, which generates translations, (3) the Diversification Game Karido, which aims at sharpening differences between the objects interpreted, that is, the artworks, and (4) the Integration Games Combino, Sentiment and TagATag, which generate structured meta-data. Secondly, the approach to building a semantic search engine reported in this thesis relies on Higher-Order Singular Value Decomposition (SVD). More precisely, the data and meta-data on artworks gathered with the aforementioned GWAPs are collected in a tensor, that is, a mathematical structure generalising matrices to more than two dimensions (columns and rows). The dimensions considered are the artwork descriptions, the players, and the artworks themselves. A Higher-Order SVD of this tensor is first used for noise reduction following the method of so-called Latent Semantic Analysis (LSA). This thesis also reports on deploying a Higher-Order LSA. The parallel Higher-Order SVD algorithm applied for the Higher-Order LSA and its implementation have been validated on an application related to, but independent from, the semantic search engine for artworks striven for: image compression. This thesis reports on the surprisingly good image compression which can be achieved with Higher-Order SVD. Whereas matrix-SVD-based compression methods compute one SVD for each color, the approach reported in this thesis relies on one single (higher-order) SVD of the whole tensor. This results in both better quality of the compressed image and a significant reduction of the memory space needed. Higher-Order SVD is extremely time-consuming, which calls for parallel computation. 
Thus, a step towards automating the construction of a semantic search engine for artworks was parallelizing the higher-order SVD method used and running the resulting parallel algorithm on a super-computer. This thesis reports on using Hestenes' method and R-SVD for parallelising the higher-order SVD. This method is an unconventional choice, which is explained and motivated. As for the super-computer needed, this thesis reports on turning the web browsers of the players or searchers into a distributed parallel computer. This is done by a novel specific system and a novel implementation of the MapReduce framework for data parallelism. Harnessing the web browsers of the players or searchers saves computational power on the server side. It also scales extremely well with the number of players or searchers because both playing with and searching for artworks require human reflection and therefore result in idle local processors that can be brought together into a distributed super-computer.
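
The MapReduce style of data parallelism mentioned in this abstract can be sketched in a few lines. This is only an illustration of the general pattern, not the browser-based system described in the thesis: the tag data is invented, the function names are hypothetical, and the workers are simulated by a sequential loop rather than by players' web browsers.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit one (tag, 1) pair for every tag in a chunk of the input."""
    return [(tag, 1) for tag in chunk]


def shuffle(pairs):
    """Group emitted values by key so that each key can be reduced independently."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values emitted for one key."""
    return key, sum(values)


def map_reduce(chunks):
    emitted = []
    for chunk in chunks:  # each iteration stands in for one independent worker
        emitted.extend(map_phase(chunk))
    grouped = shuffle(emitted)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Invented artwork tags, split into chunks as if distributed to three workers.
CHUNKS = [
    ["portrait", "oil", "woman"],
    ["landscape", "oil"],
    ["portrait", "oil"],
]

if __name__ == "__main__":
    print(map_reduce(CHUNKS))
```

Because the map step touches each chunk independently and the reduce step touches each key independently, both phases can be handed to whatever processors happen to be idle, which is the property the abstract relies on.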

    Modeling Human Group Behavior In Virtual Worlds

    Virtual worlds and massively-multiplayer online games are rich sources of information about large-scale teams and groups, offering the tantalizing possibility of harvesting data about group formation, social networks, and network evolution. They provide new outlets for human social interaction that differ from both face-to-face interactions and non-physically-embodied social networking tools such as Facebook and Twitter. We aim to study group dynamics in these virtual worlds by collecting and analyzing public conversational patterns of users grouped in close physical proximity. To do this, we created a set of tools for monitoring, partitioning, and analyzing unstructured conversations between changing groups of participants in Second Life, a massively multi-player online user-constructed environment that allows users to construct and inhabit their own 3D world. Although there are some cues in the dialog, determining social interactions from unstructured chat data alone is a difficult problem, since these environments lack many of the cues that facilitate natural language processing in other conversational settings and different types of social media. Public chat data often features players who speak simultaneously, use jargon and emoticons, and only erratically adhere to conversational norms. Humans are adept social animals capable of identifying friendship groups from a combination of linguistic cues and social network patterns. But which is more important: the content of what people say or their history of social interactions? Moreover, is it possible to identify whether people are part of a group with changing membership merely from general network properties, such as measures of centrality and latent communities? These are the questions that we aim to answer in this thesis. 
The contributions of this thesis include: 1) a link prediction algorithm for identifying friendship relationships from unstructured chat data and 2) a method for identifying social groups based on the results of community detection and topic analysis. The outputs of these two algorithms (links and group membership) are useful for studying a variety of research questions about human behavior in virtual worlds. To demonstrate this, we have performed a longitudinal analysis of human groups in different regions of the Second Life virtual world. We believe that studies performed with our tools in virtual worlds will be a useful stepping stone toward creating a rich computational model of human group dynamics.

    Chatbots for Modelling, Modelling of Chatbots

    Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Defense date: 28-03-202
