
    Support Efficient, Scalable, and Online Social Spam Detection in System

    The broad success of online social networks (OSNs) has created fertile soil for the emergence and fast spread of social spam. Fake news, malicious URL links, fraudulent advertisements, fake reviews, and biased propaganda are bringing serious consequences for both virtual social networks and human life in the real world. Effectively detecting social spam is a hot topic in both academia and industry. However, traditional social spam detection techniques are limited to centralized processing on top of one specific data source and ignore the social spam correlations across distributed data sources. Moreover, a few research efforts have been conducted on integrating stream systems (e.g., Storm, Spark) with large-scale social spam detection, but they typically ignore the specific details of managing and recovering interim states during social stream data processing. We observed that social spammers who aim to advertise their products or post links that victimize users spread malicious posts most frequently during very short periods of time. They are also quick to adapt to old models that were trained on historical records. This raises a question: how can we uncover and defend against these online spam activities in an online and scalable manner? In this dissertation, we present three systems that support scalable and online social spam detection from streaming social data: (1) the first part introduces Oases, a scalable system that can support large-scale online social spam detection; (2) the second part introduces SpamHunter, a novel system that supports efficient and scalable online spam detection in social networks and offers novel insights into guaranteeing the efficiency of modern stream applications by leveraging spam correlations at scale; and (3) the third part addresses state recovery during social spam detection, introducing a customizable state recovery framework that provides fast and scalable state recovery mechanisms for protecting large distributed states in social spam detection applications.
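As an illustration of the online detection setting described above, the sketch below shows a minimal streaming pipeline in which a spam classifier is updated incrementally on each micro-batch, so it can track spammers who adapt to previously trained models. This is a hedged sketch, not the actual Oases or SpamHunter implementation; the micro-batch interface and the source of delayed labels are assumptions.

```python
# Minimal sketch of online spam scoring with incremental model updates.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
model = SGDClassifier(loss="log_loss")   # logistic regression trained via SGD
classes = [0, 1]                         # 0 = legitimate post, 1 = spam (assumed labels)

def process_micro_batch(posts, labels=None):
    """Score one micro-batch of posts; update the model when labels arrive."""
    X = vectorizer.transform(posts)
    # Scores are only available once the model has seen at least one labeled batch.
    scores = model.predict_proba(X)[:, 1] if hasattr(model, "classes_") else None
    if labels is not None:               # e.g., delayed labels from moderators
        model.partial_fit(X, labels, classes=classes)
    return scores
```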

    Streaming Infrastructure and Natural Language Modeling with Application to Streaming Big Data

    Streaming data are produced with great velocity and in diverse variety. The vision of this research is to build an end-to-end system that handles the collection, curation, and analysis of streaming data. The streaming data used in this thesis contain both numeric data and text data. First, in the field of data collection, we design and evaluate a data delivery framework that handles the real-time nature of streaming data. In this component, we use streaming data from the automotive domain, since it is suitable for testing and evaluating our data delivery system. Secondly, in the field of data curation, we use a language model to analyze two online automotive forums as an example of streaming text data curation. Last but not least, we present our approach for automated query expansion on Twitter data as an example of streaming social media data analysis. This thesis provides a holistic view of the end-to-end system we have designed, built, and analyzed.

    To study streaming data in the automotive domain, a complex and massive amount of data is collected from on-board sensors of operational connected vehicles (CVs), infrastructure data sources such as roadway sensors and traffic signals, mobile data sources such as cell phones, social media sources such as Twitter, and news and weather data services. Unfortunately, these data create a bottleneck at data centers for the processing and retrieval of collected data, and require the deployment of additional message transfer infrastructure between data producers and consumers to support diverse CV applications. In the first part of this dissertation, we present a strategy for creating an efficient and low-latency distributed message delivery system for CV systems using a distributed message delivery platform. This strategy enables large-scale ingestion, curation, and transformation of unstructured data (roadway traffic-related and roadway non-traffic-related data) into labeled and customized topics for a large number of subscribers or consumers, such as CVs, mobile devices, and data centers. We evaluate the performance of this strategy by developing a prototype infrastructure using Apache Kafka, an open-source message delivery system, and compare its performance with the latency requirements of CV applications. We present experimental results of the message delivery infrastructure on two different distributed computing testbeds at Clemson University. Experiments were performed to measure the latency of the message delivery system for a variety of testing scenarios. These experiments reveal that measured latencies are less than the U.S. Department of Transportation's recommended latency requirements for CV applications, which provides evidence that the system is capable of managing CV-related data distribution tasks.

    Human-generated streaming data are large in volume and noisy in content. Direct acquisition of the full scope of human-generated data is often ineffective. In our research, we try to find an alternative resource to study such data. Common Crawl is a massive multi-petabyte dataset hosted by Amazon. It contains archived HTML web page data from 2008 to date. Common Crawl has been widely used for text mining purposes. Using data extracted from Common Crawl has several advantages over a direct crawl of web data, among which is removing the likelihood of a user's home IP address becoming blacklisted for accessing a given web site too frequently.
However, Common Crawl is a data sample, and so questions arise about the quality of Common Crawl as a representative sample of the original data. We perform systematic tests on the similarity of topics estimated from Common Crawl compared to topics estimated from the full data of online forums. Our target is online discussions from a user forum for car enthusiasts, but our research strategy can be applied to other domains and samples to evaluate the representativeness of topic models. We show that topic proportions estimated from Common Crawl are not significantly different from those estimated on the full data. We also show that topics are similar in terms of their word compositions, and no worse than topic similarity estimated under true random sampling, which we simulate through a series of experiments. Our research will be of interest to analysts who wish to use Common Crawl to study topics of interest in user forum data, and to analysts applying topic models to other data samples. Twitter data is another example of high-velocity streaming data. We use it as an example to study the query expansion application in streaming social media data analysis. Query expansion is the problem of gathering more relevant documents, from a given set, that cover a certain topic. In this thesis we outline a number of tools for a query expansion system that allow its user to gather more relevant documents (in this case, tweets from the Twitter social media system) while discriminating against irrelevant documents. These tools include a method for triggering a given query expansion using a Jaccard similarity threshold between keywords, and a query expansion method using archived news reports to create a vector space of novel keywords. Owing to the nature of streaming data, the Twitter stream contains emerging events that are constantly changing and therefore not predictable using static queries, since keywords used in a static query often mismatch the words used in topics around emerging events. To solve this problem, our proposed approach of automated query expansion first detects the emerging events and then combines local analysis and global analysis methods to generate queries that capture the emerging topics. Experimental results show that, by combining global analysis and local analysis methods, our approach can capture the semantic information in emerging events with high efficiency.
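The Jaccard-threshold trigger mentioned above can be illustrated with a short sketch. This is a minimal, assumed reading of the mechanism, not the thesis' implementation: when the vocabulary of recent tweets drifts far enough from the static query terms, expansion is triggered with the most frequent emerging terms; the threshold value and term selection are placeholders.

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity of two keyword collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def maybe_expand(query_terms, recent_tweets, threshold=0.2, top_k=5):
    """Expand the query when the vocabulary of recent tweets drifts away from it."""
    counts = Counter(word.lower() for tweet in recent_tweets for word in tweet.split())
    emerging = [w for w, _ in counts.most_common(50) if w not in query_terms]
    if jaccard(query_terms, counts.keys()) < threshold:   # low overlap: trigger expansion
        return list(query_terms) + emerging[:top_k]
    return list(query_terms)                              # overlap is fine: keep static query
```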

    Scalable and fault-tolerant data stream processing on multi-core architectures

    With increasing data volumes and velocity, many applications are shifting from the classical “process-after-store” paradigm to a stream processing model: data is produced and consumed as continuous streams. Stream processing captures latency-sensitive applications as diverse as credit card fraud detection and high-frequency trading. These applications are expressed as queries of algebraic operations (e.g., aggregation) over the most recent data using windows, i.e., finite evolving views over the input streams. To guarantee correct results, streaming applications require precise window semantics (e.g., temporal ordering) for operations that maintain state. While high processing throughput and low latency are performance desiderata for stateful streaming applications, achieving both poses challenges. Computing the state of overlapping windows causes redundant aggregation operations: incremental execution (i.e., reusing previous results) reduces latency but prevents parallelization; at the same time, parallelizing window execution for stateful operations with precise semantics demands ordering guarantees and state access coordination. Finally, streams and state must be recovered to produce consistent and repeatable results in the event of failures. Given the rise of shared-memory multi-core CPU architectures and high-speed networking, we argue that it is possible to address these challenges in a single node without compromising window semantics, performance, or fault-tolerance. In this thesis, we analyze, design, and implement stream processing engines (SPEs) that achieve high performance on multi-core architectures. To this end, we introduce new approaches for in-memory processing that address the previous challenges: (i) for overlapping windows, we provide a family of window aggregation techniques that enable computation sharing based on the algebraic properties of aggregation functions; (ii) for parallel window execution, we balance parallelism and incremental execution by developing abstractions for both and combining them into a novel design; and (iii) for reliable single-node execution, we enable strong fault-tolerance guarantees without sacrificing performance by reducing the required disk I/O bandwidth using a novel persistence model. We combine the above to implement an SPE that processes hundreds of millions of tuples per second with sub-second latencies. These results reveal the opportunity to reduce resource and maintenance footprint by replacing cluster-based SPEs with single-node deployments.
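To make the role of algebraic properties concrete, the sketch below shows generic incremental sliding-window aggregation for an invertible function (sum): each arriving tuple is combined in O(1) and each evicted tuple is retracted, avoiding re-aggregation of the whole window. This is a textbook-style illustration assuming count-based windows, not the specific window aggregation techniques developed in the thesis.

```python
from collections import deque

class SlidingSum:
    """Count-based sliding-window sum with O(1) updates per tuple."""
    def __init__(self, window_size):
        self.window_size = window_size
        self.buffer = deque()
        self.total = 0.0

    def insert(self, value):
        self.buffer.append(value)
        self.total += value                       # combine the new tuple
        if len(self.buffer) > self.window_size:
            self.total -= self.buffer.popleft()   # invert (retract) the evicted tuple
        return self.total                         # current window aggregate
```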

    Big Data Analytics in Static and Streaming Provenance

    Thesis (Ph.D.) - Indiana University, Informatics and Computing, 2016. With recent technological and computational advances, scientists increasingly integrate sensors and model simulations to understand spatial, temporal, social, and ecological relationships at unprecedented scale. Data provenance traces relationships of entities over time, thus providing a unique view on over-time behavior under study. However, provenance can be overwhelming in both volume and complexity; the potential of provenance for forecasting creates additional demands. This dissertation focuses on Big Data analytics of static and streaming provenance. It develops filters and a non-preprocessing slicing technique for in-situ querying of static provenance. It presents a stream processing framework for online processing of provenance data at high receiving rates. While the former is sufficient for answering queries that are given prior to the application start (forward queries), the latter deals with queries whose targets are unknown beforehand (backward queries). Finally, it explores data mining on large collections of provenance and proposes a temporal representation of provenance that can reduce the high dimensionality while effectively supporting mining tasks like clustering, classification, and association rule mining; the temporal representation can be further applied to streaming provenance as well. The proposed techniques are verified through software prototypes applied to Big Data provenance captured from computer network data, weather models, ocean models, remote (satellite) imagery data, and agent-based simulations of agricultural decision making.
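As a toy illustration of evaluating a forward query online, the sketch below maintains the set of entities transitively derived from seed entities of interest as provenance triples stream in. It is a hypothetical sketch: the triple layout, the `wasDerivedFrom` relation name, and the assumption that edges arrive after their ancestors are simplifications, not the dissertation's framework.

```python
def forward_trace(edge_stream, seeds):
    """edge_stream yields (source, relation, target) provenance triples,
    where `source wasDerivedFrom target` means target is an ancestor of source."""
    reached = set(seeds)                 # entities of interest, known before the run
    for source, relation, target in edge_stream:
        if relation == "wasDerivedFrom" and target in reached:
            reached.add(source)          # source descends from a reached entity
    return reached
```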

    Big Data decision support system

    Includes bibliographical references. 2022 Fall. Each day, the amount of data produced by sensors, social and digital media, and the Internet of Things is rapidly increasing. The volume of digital data is expected to double within the next three years. At some point, it might not be financially feasible to store all the data that is received. Hence, if data is not analyzed as it is received, the information collected could be lost forever. Actionable Intelligence is the next level of Big Data analysis, where data is used for decision making. This thesis document describes my scientific contribution to Big Data Actionable Intelligence generation. Chapter 1 presents my colleagues' and my contribution to a Big Data Actionable Intelligence architecture. The architecture has been demonstrated to support real-time actionable intelligence generation using disparate data sources (e.g., social media, satellite, newsfeeds). This work has been published in the Journal of Big Data. Chapter 2 presents my original method for real-time detection of moving targets using Remote Sensing Big Data. This work has also been published in the Journal of Big Data and has been issued a U.S. patent. As the Field-of-View (FOV) in remote sensing continues to expand, the number of targets observed by each sensor continues to increase. The ability to track large quantities of targets in real time poses a significant challenge. Chapter 3 describes my colleague's and my contribution to the multi-target tracking domain. We have demonstrated that we can overcome real-time tracking challenges when there are large numbers of targets. Our work was published in the Journal of Sensors.
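For orientation only, the sketch below shows a generic frame-differencing approach to flagging moving targets between two co-registered grayscale frames; connected changed regions are reported as candidate targets. This is not the patented method described in Chapter 2, and the threshold and connectivity choices are placeholder assumptions.

```python
import numpy as np
from scipy import ndimage

def detect_moving_targets(frame_prev, frame_curr, threshold=25):
    """frames: 2-D grayscale arrays of the same shape, already co-registered."""
    diff = np.abs(frame_curr.astype(np.int16) - frame_prev.astype(np.int16))
    mask = diff > threshold                              # pixels that changed noticeably
    labels, num_targets = ndimage.label(mask)            # group changed pixels into regions
    centroids = ndimage.center_of_mass(mask, labels, range(1, num_targets + 1))
    return centroids                                     # one (row, col) per candidate target
```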

    State Management for Efficient Event Pattern Detection

    Event stream processing (ESP) systems continuously evaluate queries over event streams to detect user-specified patterns with low latency. However, the challenge is that query processing is stateful and maintains partial matches that grow exponentially with the number of processed events. State management is complicated by the dynamicity of streams and the need to integrate remote data. First, heterogeneous event sources yield dynamic streams with unpredictable input rates, data distributions, and query selectivities. During peak times, exhaustive processing is unreasonable, and systems must resort to best-effort processing. Second, queries may require remote data to select a specific event for a pattern. Such dependencies are problematic: fetching the remote data interrupts the stream processing. Yet, without event selection based on remote data, the growth of partial matches is amplified. In this dissertation, I present strategies for optimised state management in event pattern detection. First, I enable best-effort processing with load shedding that discards both input events and partial matches. I carefully select the shedding elements to satisfy a latency bound while striving for a minimal loss in result quality. Second, to efficiently integrate remote data, I decouple the fetching of remote data from its use in query evaluation by a caching mechanism.
To this end, I hide the transmission latency by prefetching remote data based on anticipated use and by lazy evaluation that postpones the event selection based on remote data to avoid interruptions. A cost model is used to determine when to fetch which remote data items and how long to keep them in the cache. I evaluated the above techniques with queries over synthetic and real-world data. I show that the load shedding technique significantly improves the recall of pattern detection over baseline approaches, while the technique for remote data integration significantly reduces the pattern detection latency.
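The latency-bound load shedding idea can be sketched as follows: when observed latency exceeds the bound, input events and partial matches with the lowest estimated contribution to complete matches are dropped first. This is a simplified illustration under an assumed utility function and a fixed shedding fraction, not the dissertation's actual shedding strategy or cost model.

```python
def shed_load(input_events, partial_matches, latency_ms, bound_ms,
              utility, shed_fraction=0.2):
    """utility(x) estimates how likely x is to contribute to a complete match."""
    if latency_ms <= bound_ms:
        return input_events, partial_matches              # within the bound: shed nothing
    events = sorted(input_events, key=utility, reverse=True)
    matches = sorted(partial_matches, key=utility, reverse=True)
    keep_e = int(len(events) * (1 - shed_fraction))       # drop the lowest-utility tail
    keep_m = int(len(matches) * (1 - shed_fraction))
    return events[:keep_e], matches[:keep_m]
```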

    Analyzing audit trails in a distributed and hybrid intrusion detection platform

    Efforts have been made over the last decades to design and perfect Intrusion Detection Systems (IDS). In addition to the widespread use of Intrusion Prevention Systems (IPS) as perimeter defense devices in systems and networks, various IDS solutions are used together as elements of holistic approaches to cyber security incident detection and prevention, including Network Intrusion Detection Systems (NIDS) and Host Intrusion Detection Systems (HIDS). Nevertheless, specific IDS and IPS technologies face several effectiveness challenges in responding to the increasing scale and complexity of information systems and the sophistication of attacks. The use of isolated IDS components, focused on one-dimensional approaches, strongly limits a common analysis based on evidence correlation. Today, most organizations’ cyber-security operations centers still rely on conventional SIEM (Security Information and Event Management) technology. However, SIEM platforms also have significant drawbacks in dealing with heterogeneous and specialized security event sources, lacking support for flexible and uniform multi-level analysis of security audit trails involving distributed and heterogeneous systems. In this thesis, we propose an auditing solution that leverages different intrusion detection components and synergistically combines them in a Distributed and Hybrid IDS (DHIDS) platform, taking advantage of their benefits while overcoming the effectiveness drawbacks of each one. In this approach, security events are detected by multiple probes forming a pervasive, heterogeneous, and distributed monitoring environment spread over the network, integrating NIDS, HIDS, and specialized Honeypot probing systems. Events from those heterogeneous sources are converted to a canonical representation format and then conveyed through a Publish-Subscribe middleware to a dedicated logging and auditing system, built on top of an elastic and scalable document-oriented storage system. The aggregated events can then be queried and matched against suspicious attack signature patterns by means of a proposed declarative query language that provides event-correlation semantics.
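To illustrate the canonical representation step, the sketch below maps a probe-specific event to one uniform record before it would be published to the logging and auditing system. The field names and probe types are illustrative assumptions, not the platform's actual schema or middleware API.

```python
import json
from datetime import datetime, timezone

def to_canonical(raw_event, probe_type):
    """Map a probe-specific event dict onto one canonical audit-trail record."""
    return {
        "timestamp": raw_event.get("ts", datetime.now(timezone.utc).isoformat()),
        "probe": probe_type,                              # e.g., "nids", "hids", "honeypot"
        "source_ip": raw_event.get("src_ip"),
        "destination_ip": raw_event.get("dst_ip"),
        "signature": raw_event.get("sig") or raw_event.get("rule_name"),
        "severity": raw_event.get("severity", "unknown"),
        "raw": json.dumps(raw_event),                     # preserve the original for auditing
    }
```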