4,082 research outputs found

    Random testing based on temporal logic for Apache Flink

    Facultad de Informática, Departamento de Sistemas Informáticos y Computación, Universidad Complutense de Madrid. Academic year 2018/2019. The application code is available at: https://github.com/Valcev/TFM
    Currently, there are very few alternatives for testing stream processing systems. Most consist of unit tests, which are not viable when a large number of streams or very long streams are required, because every input stream and its expected output stream must be defined explicitly. This motivated the idea of implementing a program that applies random testing, a testing technique based on properties rather than coverage, to one of these stream processing systems, and that also incorporates temporal logic properties. Furthermore, it has been shown empirically that random testing works as well as, or even better than, coverage-based techniques, which makes this technique even more attractive. In this Master's Thesis we present a property-based testing tool for Apache Flink, a stream processing system capable of processing data in real time; that is, Flink processes data as it is generated and as soon as it is received. The tool was developed in Scala and combines random testing with temporal logic. An environment with the same philosophy already exists for Spark Streaming: Sscheck. However, Spark processes data in batches and therefore does not operate in real time. The aim of this project is therefore to implement the same program adapted to Apache Flink, which poses its own problems, such as less flexibility in data processing.

    Ontology-Based Data Access to Big Data

    Recent approaches to ontology-based data access (OBDA) have extended the focus from relational database systems to other types of backends, such as cluster frameworks, in order to cope with the four Vs associated with big data: volume, veracity, variety and velocity (stream processing). The abstraction that an ontology provides is a benefit from the end-user point of view, but it poses a challenge for developers because high-level queries must be transformed into queries executable at the backend level. In this paper, we discuss and evaluate an OBDA system that uses STARQL (Streaming and Temporal ontology Access with a Reasoning-based Query Language) as a high-level query language to access data stored in a SPARK cluster framework. The development of the STARQL-SPARK engine shows that there is a need for a homogeneous interface to access static, temporal and streaming data, because cluster frameworks usually lack such an interface. The experimental evaluation shows that building a scalable OBDA system that runs with SPARK is more than plug-and-play, as one needs to know the data formats and the data organisation in the cluster framework quite well.

    Real-time big data processing for anomaly detection : a survey

    The advent of connected devices and the omnipresence of the Internet have paved the way for intruders to attack networks, leading to cyber-attacks, financial loss, information theft in healthcare, and cyber warfare. Hence, network security analytics has become an important area of concern and has of late gained intensive attention among researchers, specifically in the domain of network anomaly detection, which is considered crucial for network security. However, preliminary investigations have revealed that the existing approaches to detecting anomalies in networks are not effective enough, particularly for detecting them in real time. The inefficacy of current approaches is mainly due to the massive volumes of data accumulated through connected devices. Therefore, it is crucial to propose a framework that effectively handles real-time big data processing and detects anomalies in networks. In this regard, this paper attempts to address the issue of detecting anomalies in real time. Accordingly, it surveys the state-of-the-art real-time big data processing technologies related to anomaly detection and the vital characteristics of the associated machine learning algorithms. The paper begins with an explanation of the essential context and taxonomy of real-time big data processing, anomaly detection, and machine learning algorithms, followed by a review of big data processing technologies. Finally, the identified research challenges of real-time big data processing in anomaly detection are discussed. © 2018 Elsevier Lt

    Storage Solutions for Big Data Systems: A Qualitative Study and Comparison

    Big data systems development is full of challenges in view of the variety of application areas and domains that this technology promises to serve. Typically, the fundamental design decisions involved in big data systems design include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies into an optimized solution for a specific real-world problem, big data systems are no exception. As far as the storage aspect of any big data system is concerned, the primary facet is the storage infrastructure, and NoSQL seems to be the right technology to fulfill its requirements. However, every big data application has different data characteristics, and thus the corresponding data fits a different data model. This paper presents a feature and use case analysis and comparison of the four main data models, namely document-oriented, key-value, graph and wide-column. Moreover, a feature analysis of 80 NoSQL solutions is provided, elaborating on the criteria and points that a developer must consider when making a choice. Typically, big data storage needs to communicate with the execution engine and with other processing and visualization technologies to create a comprehensive solution. This brings the second facet of big data storage, big data file formats, into the picture. The second half of the paper compares the advantages, shortcomings and possible use cases of the available big data file formats for Hadoop, which is the foundation for most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage, and their challenges and future prospects are also discussed.