731 research outputs found

    Event Detection on Twitter

    Full text link
    Detecting events by using social media has been an active research problem. In this work, we investigate and compare the performance of two methods for event detection in Twitter by using Apache Storm as the stream processing infrastructure. The first event detection method is based on identifying uncommonly common words inside tweet blocks, and the second one is based on clustering tweets to detect a cluster as an event. Each of the methods has its own characteristics. Uncommonly common word based method relies on the burst of words and hence is not affected from concurrency problems in distributed environment. On the other hand, clustering based method includes a finer grained analysis, but it is sensitive to the concurrent processing. We investigate the effect of stream processing and concurrency handling support provided by Apace Storm on event detection by these methods

    Stream Processing Systems Benchmark: StreamBench

    Get PDF
    Batch processing technologies (Such as MapReduce, Hive, Pig) have matured and been widely used in the industry. These systems solved the issue processing big volumes of data successfully. However, first big amount of data need to be collected and stored in a database or file system. That is very time-consuming. Then it takes time to finish batch processing analysis jobs before get any results. While there are many cases that need analysed results from unbounded sequence of data in seconds or sub-seconds. To satisfy the increasing demand of processing such streaming data, several streaming processing systems are implemented and widely adopted, such as Apache Storm, Apache Spark, IBM InfoSphere Streams, and Apache Flink. They all support online stream processing, high scalability, and tasks monitoring. While how to evaluate stream processing systems before choosing one in production development is an open question. In this thesis, we introduce StreamBench, a benchmark framework to facilitate performance comparisons of stream processing systems. A common API component and a core set of workloads are defined. We implement the common API and run benchmarks for three widely used open source stream processing systems: Apache Storm, Flink, and Spark Streaming. A key feature of the StreamBench framework is that it is extensible -- it supports easy definition of new workloads, in addition to making it easy to benchmark new stream processing systems

    Comparative Evaluation for the Performance of Big Stream Processing Systems

    Get PDF
    Andmete hulk kasvab tänapäeval meeletu kiirusega ning seda andmete hulka tuleb korrektselt töödelda, et saavutada kontroll andmete üle. Antud olukord sunnib meid mõtlema andmevoo töötlemise peale. Enamasti nõuavad andmemahuline pettuse tuvastus-, kaubandus-, tootmis-, sõjanduse ja luure süsteemid pidevat andmete analüüsi (reaalajas). Sellist tüüpi süsteemid nõuavad kõrgetasemel ist mustrite sobitamist ja korrelatsioone. Aja jooksul on ilmnenud erinevaid andmevoo töötlemise võimalusi. Antud lõputöös tehakse jõudlustest Apache Flink, Apache Storm, Heron, Kafka ja Apache Spark andmevoo töötlemismootoritega ning tulemusi võrreldakse ja vastandatakse omavahel. Nendes rakendustes ja domeenides on väga oluline nõue koguda, menetleda ning analüüsida olulisi andmevooge, et eraldada sealt väärtusliku informatsiooni. Antud magistritöö eesmärk on läbi viia empiiriline hindamine ning võrdlemine kõrgtasemel andmevoo töötlemissüsteemide vahel.Nowadays data is growing with tremendous acceleration, and this growing data must be processed properly if we want to have control over it. It pushes us to think about data stream processing. Most of the time, a data-intensive fraud detecting, trading, manufacturing, military and intelligence systems require processing data immediately (real-time). These kinds of systems need considerably ssophisticated pattern matching and correlations. However, other uses of stream processing have also emerged over time. In this thesis, we will benchmark to compare and contrast Apache Flink, Apache Storm, Heron, Kafka an Apache Spark stream processing engines. In these applications and domains, there is a crucial requirement to collect, process, and analyze significant streams of data to extract valuable information. This thesis aims to conduct an empirical evaluation and benchmarking of the state-of-the-art of big stream processing systems

    Data Stream Clustering: A Review

    Full text link
    Number of connected devices is steadily increasing and these devices continuously generate data streams. Real-time processing of data streams is arousing interest despite many challenges. Clustering is one of the most suitable methods for real-time data stream processing, because it can be applied with less prior information about the data and it does not need labeled instances. However, data stream clustering differs from traditional clustering in many aspects and it has several challenging issues. Here, we provide information regarding the concepts and common characteristics of data streams, such as concept drift, data structures for data streams, time window models and outlier detection. We comprehensively review recent data stream clustering algorithms and analyze them in terms of the base clustering technique, computational complexity and clustering accuracy. A comparison of these algorithms is given along with still open problems. We indicate popular data stream repositories and datasets, stream processing tools and platforms. Open problems about data stream clustering are also discussed.Comment: Has been accepted for publication in Artificial Intelligence Revie

    ARM Wrestling with Big Data: A Study of Commodity ARM64 Server for Big Data Workloads

    Full text link
    ARM processors have dominated the mobile device market in the last decade due to their favorable computing to energy ratio. In this age of Cloud data centers and Big Data analytics, the focus is increasingly on power efficient processing, rather than just high throughput computing. ARM's first commodity server-grade processor is the recent AMD A1100-series processor, based on a 64-bit ARM Cortex A57 architecture. In this paper, we study the performance and energy efficiency of a server based on this ARM64 CPU, relative to a comparable server running an AMD Opteron 3300-series x64 CPU, for Big Data workloads. Specifically, we study these for Intel's HiBench suite of web, query and machine learning benchmarks on Apache Hadoop v2.7 in a pseudo-distributed setup, for data sizes up to 20GB20GB files, 5M5M web pages and 500M500M tuples. Our results show that the ARM64 server's runtime performance is comparable to the x64 server for integer-based workloads like Sort and Hive queries, and only lags behind for floating-point intensive benchmarks like PageRank, when they do not exploit data parallelism adequately. We also see that the ARM64 server takes 13rd\frac{1}{3}^{rd} the energy, and has an Energy Delay Product (EDP) that is 5071%50-71\% lower than the x64 server. These results hold promise for ARM64 data centers hosting Big Data workloads to reduce their operational costs, while opening up opportunities for further analysis.Comment: Accepted for publication in the Proceedings of the 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), 201
    corecore