138 research outputs found

    Survey of review spam detection using machine learning techniques


    Community Detection in Hypergraphen

    Many datasets can be interpreted as graphs, i.e. as elements (nodes) and binary relations between them (edges). Under the label of complex network analysis, a vast array of graph-based methods allows the exploration of datasets purely on the basis of such structural properties. Community detection, a subfield of network analysis, aims to identify well-connected subparts of graphs. While the grouping of related elements is useful in itself, these groups can furthermore be collapsed into single nodes, creating a new graph of reduced complexity which may better reveal the original graph's macrostructure. Advances in community detection therefore improve the understanding of complex networks in general. However, not every dataset can be modelled properly with binary relations - higher-order relations give rise to so-called hypergraphs. This thesis explores the generalization of community detection approaches to hypergraphs. The focus is on social bookmarking datasets, created by users of online bookmarking services who assign freely chosen keywords, so-called "tags", to documents.
This "tagging" creates, for each tag assignment, a ternary connection between the user, the document, and the tag, inducing particular structures called 3-partite, 3-uniform hypergraphs (henceforth called 3,3- or, more generally, k,k-hypergraphs). The question pursued here is how to decompose these structures into communities in a formally adequate manner, and how this improves the understanding of these datasets, which are potentially very rich in latent information. First, a generalization of connected components to k,k-hypergraphs is proposed. On the datasets studied, the standard definition of connected components rather uninformatively assigns almost all elements to a single giant component. The generalized, so-called hyperincident connected components, however, show a characteristic size distribution on the social bookmarking datasets that is disrupted by, e.g., spamming activity - demonstrating a link between behavioural patterns and structural features that is explored further in the remainder of the thesis. Next, the general topic of community detection in k,k-hypergraphs is introduced. Three challenges are posited that are not met by the naive application of standard techniques, and three families of synthetic hypergraphs are introduced, containing increasingly complex community setups that a successful detection approach should be able to identify. The main methodological contribution of this thesis is the subsequent development of a multi-partite (i.e. suitable for k,k-hypergraphs) community detection algorithm. It is based on modularity optimization, a well-established technique for detecting communities in non-partite, i.e. "normal", graphs. Starting from the simplest possible approach, the method is successively refined to meet the previously defined challenges as well as new ones encountered in practice, culminating in the definition of the "balanced multi-partite modularity". Finally, an interactive tool for exploring the obtained community assignments is introduced.
Using this tool, the benefits of balanced multi-partite modularity can be shown: intricate patterns can be observed that are missed by the simpler approaches. These findings are confirmed by a more quantitative examination: unsupervised quality measures that consider, e.g., the degree of compression document the advantages of this approach over the other methods on a larger number of samples. To conclude, the contributions of this thesis are twofold. It provides practical tools for the analysis of social bookmarking data, complemented by theoretical contributions: the generalization of connected components and community detection from graphs to k,k-hypergraphs.

    Cyber Infrastructure Protection: Vol. III

    Despite leaps in technological advancements made in computing system hardware and software areas, we still hear about massive cyberattacks that result in enormous data losses. Cyberattacks in 2015 included sophisticated attacks that targeted Ashley Madison, the U.S. Office of Personnel Management (OPM), the White House, and Anthem; in 2014, cyberattacks were directed at Sony Pictures Entertainment, Home Depot, J.P. Morgan Chase, a German steel factory, a South Korean nuclear plant, eBay, and others. These attacks and many others highlight the continued vulnerability of various cyber infrastructures and the critical need for strong cyber infrastructure protection (CIP). This book addresses critical issues in cybersecurity. Topics discussed include: a cooperative international deterrence capability as an essential tool in cybersecurity; an estimation of the costs of cybercrime; the impact of prosecuting spammers on fraud and malware contained in email spam; cybersecurity and privacy in smart cities; smart cities demand smart security; and a smart grid vulnerability assessment using national testbed networks.

    Detection of illicit behaviours and mining for contrast patterns

    This thesis describes a set of novel algorithms and models designed to detect illicit behaviour. This includes development of domain specific solutions, focusing on anti-money laundering and detection of opinion spam. In addition, advancements are presented for the mining and application of contrast patterns, which are a useful tool for characterising illicit behaviour. For anti-money laundering, this thesis presents a novel approach for detection based on analysis of financial networks and supervised learning. This includes the development of a network model, features extracted from this model, and evaluation of classifiers trained using real financial data. Results indicate that this approach successfully identifies suspicious groups whose collaborative behaviour is indicative of money laundering. For the detection of opinion spam, this thesis presents a model of reviewer behaviour and a method for detection based on statistical anomaly detection. This method considers review ratings, and does not rely on text-based features. Evaluation using real data shows that spammers are successfully identified. Comparison with existing methods shows a small improvement in accuracy, but significant improvements in computational efficiency. This thesis also considers the application of contrast patterns to network analysis and presents a novel algorithm for mining contrast patterns in a distributed system. Contrast patterns may be used to characterise illicit behaviour by contrasting illicit and non-illicit behaviour and uncovering significant differences. However, existing mining algorithms are limited by serial processing making them unsuitable for large data sets. This thesis advances the current state-of-the-art, describing an algorithm for mining in parallel. This algorithm is evaluated using real data and is shown to achieve a high level of scalability, allowing mining of large, high-dimensional data sets. 
In addition, this thesis explores methods for mapping network features to an item-space suitable for analysis using contrast patterns. Experiments indicate that contrast patterns may become a valuable tool for network analysis.

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Based on the information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and a socio-economic perspective. The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines. From a socio-economic perspective, we inventory the impact and legal consequences of these technical advances and point out future directions of research.

    Optimization techniques for human computation-enabled data processing systems

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 119-124). Crowdsourced labor markets make it possible to recruit large numbers of people to complete small tasks that are difficult to automate on computers. These marketplaces are increasingly widely used, with projections of over $1 billion being transferred between crowd employers and crowd workers by the end of 2012. While crowdsourcing enables forms of computation that artificial intelligence has not yet achieved, it also presents crowd workflow designers with a series of challenges including describing tasks, pricing tasks, identifying and rewarding worker quality, dealing with incorrect responses, and integrating human computation into traditional programming frameworks. In this dissertation, we explore the systems-building, operator design, and optimization challenges involved in building a crowd-powered workflow management system. We describe a system called Qurk that utilizes techniques from databases such as declarative workflow definition, high-latency workflow execution, and query optimization to aid crowd-powered workflow developers. We study how crowdsourcing can enhance the capabilities of traditional databases by evaluating how to implement basic database operators such as sorts and joins on datasets that could not have been processed using traditional computation frameworks. Finally, we explore the symbiotic relationship between the crowd and query optimization, enlisting crowd workers to perform selectivity estimation, a key component in optimizing complex crowd-powered workflows. By Adam Marcus. Ph.D.

    A framework to extract biomedical knowledge from gluten-related tweets: the case of dietary concerns in digital era

    Journal pre-proof. The importance and potential of big data are becoming ever more relevant, driven by the explosive growth in the volume of information generated on the Internet in recent years. Many experts agree that social media networks are among the fastest-growing areas of the internet and are expected to grow further in the coming years. Similarly, social media sites are quickly becoming one of the most popular platforms to discuss health issues and exchange social support with others. In this context, this work presents a new methodology to process, classify, visualise and analyse the big data knowledge produced by the sociome on social media platforms. The methodology combines natural language processing techniques, ontology-based named entity recognition methods, machine learning algorithms and graph mining techniques to: (i) reduce irrelevant messages by identifying, and focusing the analysis only on, individuals and patient experiences from the public discussion; (ii) reduce, through the use of domain ontologies, the lexical noise produced by the different ways in which users express themselves; (iii) infer the demographic data of the individuals through the combined analysis of textual, geographical and visual profile information; (iv) perform community detection and evaluate the health topic under study by combining the semantic processing of the public discourse with knowledge graph representation techniques; and (v) gain information about the shared resources by combining social media statistics with the semantic analysis of the web contents. The practical relevance of the proposed methodology has been demonstrated in the study of 1.1 million unique messages from more than 400,000 distinct users related to one of the most popular dietary fads to have evolved into a multibillion-dollar industry, i.e., gluten-free food.
In addition, this work analysed one of the least-studied public health topics on Twitter (i.e., allergies and immunological diseases such as celiac disease), yielding a wide range of health-related conclusions. SING group thanks CITI (Centro de Investigacion, Transferencia e Innovacion) from the University of Vigo for hosting its IT infrastructure. This work was supported by: the Associate Laboratory for Green Chemistry-LAQV, which is financed by national funds from and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of [UIDB/50006/2020] and [UIDB/04469/2020] units, and BioTecNorte operation [NORTE010145FEDER000004] funded by the European Regional Development Fund under the scope of Norte2020 - Programa Operacional Regional do Norte, the Xunta de Galicia (Centro singular de investigacion de Galicia accreditation 2019-2022) and the European Union (European Regional Development Fund - ERDF) - Ref. [ED431G2019/06], and Conselleria de Educacion, Universidades e Formacion Profesional (Xunta de Galicia) under the scope of the strategic funding of [ED431C2018/55GRC] Competitive Reference Group. The authors also acknowledge the post-doctoral fellowship [ED481B2019032] of Martin PerezPerez, funded by the Xunta de Galicia. Funding for open access charge: Universidade de Vigo/CISUG.

    Recommender Systems

    The ongoing rapid expansion of the Internet greatly increases the need for effective recommender systems to filter the abundant information. Extensive research on recommender systems is conducted by a broad range of communities including social and computer scientists, physicists, and interdisciplinary researchers. Despite substantial theoretical and practical achievements, unification and comparison of different approaches are lacking, which impedes further advances. In this article, we review recent developments in recommender systems and discuss the major challenges. We compare and evaluate available algorithms and examine their roles in future developments. In addition to algorithms, physical aspects are described to illustrate the macroscopic behavior of recommender systems. Potential impacts and future directions are discussed. We emphasize that recommendation has great scientific depth and combines diverse research fields, which makes it of interest to physicists as well as interdisciplinary researchers. Comment: 97 pages, 20 figures (to appear in Physics Reports).