
    Streamability of nested word transductions

    We consider the problem of evaluating a nested word transduction in streaming (i.e., in a single left-to-right pass) with a limited amount of memory. A transduction T is said to be height-bounded memory (HBM) if it can be evaluated with a memory that depends only on the size of T and on the height of the input word. We show that it is decidable in coNPTime whether a nested word transduction defined by a visibly pushdown transducer (VPT) is HBM. In this case, the required amount of memory may depend exponentially on the height of the word. We exhibit a sufficient, decidable condition for a VPT to be evaluated with a memory that depends quadratically on the height of the word. This condition defines a class of transductions that strictly contains all determinizable VPTs.
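
    As a rough illustration of the streaming setting (not the paper's algorithm), the sketch below evaluates a toy nested-word transduction in a single left-to-right pass, keeping one stack frame per open call, so memory grows with the nesting height rather than with the input length; the event encoding and the relabeling rule are invented for this example.

```python
def stream_relabel(events):
    """Single left-to-right pass over a nested word given as (kind, label) events,
    where kind is 'call', 'ret', or 'int'. The transduction copies call and
    internal symbols and relabels each return with the label of its matching
    call. Only the stack of pending call labels is kept, so memory is
    proportional to the nesting height of the input, not to its length."""
    stack = []                           # one entry per currently open call
    for kind, label in events:
        if kind == 'call':
            stack.append(label)
            yield ('call', label)
        elif kind == 'ret':
            if not stack:
                raise ValueError('unmatched return symbol')
            yield ('ret', stack.pop())   # output depends on the matching call
        else:                            # internal symbol
            yield ('int', label)

# Nested word <x> b <y> c </y> </x>, with returns left unlabeled in the input.
events = [('call', 'x'), ('int', 'b'), ('call', 'y'), ('int', 'c'),
          ('ret', None), ('ret', None)]
print(list(stream_relabel(events)))
```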

    A Grammatical Inference Approach to Language-Based Anomaly Detection in XML

    False positives are a problem in anomaly-based intrusion detection systems. To counter this issue, we discuss anomaly detection for the eXtensible Markup Language (XML) from a language-theoretic viewpoint. We argue that many XML-based attacks target the syntactic level, i.e. the tree structure or element content, and that syntax validation of XML documents reduces the attack surface. XML offers so-called schemas for validation, but in the real world, schemas are often unavailable, ignored or too general. In this work-in-progress paper we describe a grammatical inference approach to learning an automaton from example XML documents for detecting documents with anomalous syntax. We discuss properties and expressiveness of XML to understand the limits of learnability. Our contributions are an XML Schema compatible lexical datatype system to abstract content in XML and an algorithm to learn visibly pushdown automata (VPA) directly from a set of examples. The proposed algorithm does not require the tree representation of XML, so it can process large documents or streams. The resulting deterministic VPA then allows stream validation of documents to recognize deviations in the underlying tree structure or datatypes.
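
    As a hedged sketch of the lexical step such an approach needs, the code below turns an XML document into the call/return/internal event stream a visibly pushdown automaton would consume, without ever building the document tree; the event names, the coarse NUMBER/TEXT datatypes and the use of Python's xml.sax are illustrative choices, not the paper's implementation.

```python
import io
import xml.sax

class VPAEventHandler(xml.sax.ContentHandler):
    """Emit a linear stream of VPA input symbols while parsing XML as a stream."""
    def __init__(self):
        super().__init__()
        self.events = []
    def startElement(self, name, attrs):
        self.events.append(('call', name))       # opening tag = push symbol
    def endElement(self, name):
        self.events.append(('return', name))     # closing tag = pop symbol
    def characters(self, content):
        text = content.strip()
        if text:
            # Abstract element content into a coarse lexical datatype.
            dtype = 'NUMBER' if text.isdigit() else 'TEXT'
            self.events.append(('internal', dtype))

handler = VPAEventHandler()
xml.sax.parse(io.StringIO('<order><id>42</id><note>rush</note></order>'), handler)
print(handler.events)
```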

    Streaming Property Testing of Visibly Pushdown Languages

    In the context of language recognition, we demonstrate the superiority of streaming property testers over standalone streaming algorithms and property testers. Initiated by Feigenbaum et al., a streaming property tester is a streaming algorithm recognizing a language under the property-testing approximation: it must distinguish inputs in the language from those that are $\varepsilon$-far from it, while using the smallest possible memory (rather than limiting its number of input queries). Our main result is a streaming $\varepsilon$-property tester for visibly pushdown languages (VPL) with one-sided error using memory space $\mathrm{poly}((\log n)/\varepsilon)$. This construction relies on a (non-streaming) property tester for weighted regular languages based on a previous tester by Alon et al. We provide a simple application of this tester for streaming testing of special cases of VPL instances that are already hard for both streaming algorithms and property testers. Our main algorithm first simulates visibly pushdown automata using a stack of small height whose items may be of linear size; in a second step, those items are replaced by small sketches. These sketches rely on a notion of suffix sampling that we introduce, and this sampling is the key idea connecting our streaming tester to property testers.
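
    To make the property-testing approximation concrete, here is a small brute-force illustration (not taken from the paper) of what it means for an input to be ε-far from a toy visibly pushdown language, the balanced-bracket words: an input of length n is ε-far if more than εn symbols must be edited to reach a word of the language.

```python
def distance_to_balanced(word):
    """Minimum number of single-symbol deletions needed to turn a string over
    {'(', ')'} into a balanced word -- a toy stand-in for the distance to a
    visibly pushdown language."""
    depth = 0              # currently unmatched opening brackets
    unmatched_close = 0    # closing brackets with no matching opening
    for c in word:
        if c == '(':
            depth += 1
        elif c == ')':
            if depth > 0:
                depth -= 1
            else:
                unmatched_close += 1
    return depth + unmatched_close

def is_epsilon_far(word, eps):
    """True if the word is eps-far from the balanced-bracket language."""
    return distance_to_balanced(word) > eps * len(word)

print(is_epsilon_far('(()))(((', 0.25))   # 4 edits needed on 8 symbols -> True
```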

    Social media analytics: a survey of techniques, tools and platforms

    This paper is written for (social science) researchers seeking to analyze the wealth of social media now available. It presents a comprehensive review of software tools for social networking media, wikis, really simple syndication feeds, blogs, newsgroups, chat and news feeds. For completeness, it also includes introductions to social media scraping, storage, data cleaning and sentiment analysis. Although principally a review, the paper also provides a methodology and a critique of social media tools. Analyzing social media, in particular Twitter feeds for sentiment analysis, has become a major research and business activity due to the availability of web-based application programming interfaces (APIs) provided by Twitter, Facebook and news services. This has led to an ‘explosion’ of data services, software tools for scraping and analysis, and social media analytics platforms. It is also a research area undergoing rapid change and evolution due to commercial pressures and the potential for using social media data for computational (social science) research. Using a simple taxonomy, this paper provides a review of leading software tools and how to use them to scrape, cleanse and analyze the spectrum of social media. In addition, it discusses the requirement for an experimental computational environment for social media research and presents, as an illustration, the system architecture of a social media (analytics) platform built by University College London. The principal contribution of this paper is to provide an overview (including code fragments) for scientists seeking to utilize social media scraping and analytics either in their research or business. The data retrieval techniques presented in this paper are valid at the time of writing (June 2014), but they are subject to change since social media data scraping APIs are rapidly changing.
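
    As a minimal sketch of the scrape/cleanse/analyze pipeline the survey covers (the endpoint URL, its JSON shape, and the tiny word-list sentiment scorer are assumptions made for illustration, not any real service's API):

```python
import re
import requests

POSITIVE = {'good', 'great', 'love'}
NEGATIVE = {'bad', 'poor', 'hate'}

def fetch_posts(endpoint, query, token):
    """Scrape: query a hypothetical JSON search API and return raw post texts."""
    resp = requests.get(endpoint,
                        params={'q': query},
                        headers={'Authorization': f'Bearer {token}'},
                        timeout=10)
    resp.raise_for_status()
    return [item['text'] for item in resp.json().get('results', [])]

def cleanse(text):
    """Cleanse: strip URLs, mentions and punctuation, then lowercase."""
    text = re.sub(r'https?://\S+|@\w+', '', text)
    return re.sub(r'[^\w\s]', '', text).lower()

def sentiment(text):
    """Analyze: naive word-list polarity score in [-1, 1]."""
    words = text.split()
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    return score / max(len(words), 1)

# Usage with a hypothetical endpoint and token:
# for post in fetch_posts('https://api.example.com/search', 'streaming', 'TOKEN'):
#     print(sentiment(cleanse(post)), post)
```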

    Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

    Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for new types of data sources, query languages, and approaches to query processing and optimization.
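
    The rule-driven optimizer design can be illustrated with a toy logical plan and a single rewrite rule; this is a conceptual sketch in Python, not Calcite's Java API, and the node and rule names are invented.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    predicate: str
    child: Any

@dataclass
class Project:
    columns: List[str]
    child: Any

def push_filter_below_project(node):
    """One rewrite rule, Filter(Project(x)) -> Project(Filter(x)), standing in
    for the kind of built-in transformation rules a modular optimizer applies
    repeatedly until the plan stops improving."""
    if isinstance(node, Filter) and isinstance(node.child, Project):
        proj = node.child
        return Project(proj.columns, Filter(node.predicate, proj.child))
    return node

plan = Filter('price > 10', Project(['price', 'name'], Scan('items')))
print(push_filter_below_project(plan))
```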

    Lower Bounds for Multi-Pass Processing of Multiple Data Streams

    This paper gives a brief overview of computation models for data stream processing, and it introduces a new model for multi-pass processing of multiple streams, the so-called mp2s-automata. Two algorithms for solving the set disjointness problem with these automata are presented. The main technical contribution of this paper is the proof of a lower bound on the size of memory and the number of heads that are required for solving the set disjointness problem with mp2s-automata.
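
    For orientation (this is not the paper's automaton construction), set disjointness asks whether two streamed sets share an element; the sketch below decides it with one left-to-right head per stream and only a constant number of stored elements, under the extra assumption that both streams arrive sorted. The general, unsorted case is the one whose memory and head requirements are analyzed in the paper.

```python
def disjoint_sorted(stream_a, stream_b):
    """Decide set disjointness with one left-to-right head per stream and O(1)
    stored elements, assuming both streams are sorted in ascending order."""
    it_a, it_b = iter(stream_a), iter(stream_b)
    a, b = next(it_a, None), next(it_b, None)
    while a is not None and b is not None:
        if a == b:
            return False              # common element found: not disjoint
        if a < b:
            a = next(it_a, None)      # advance the head on stream A
        else:
            b = next(it_b, None)      # advance the head on stream B
    return True

print(disjoint_sorted([1, 4, 7], [2, 4, 9]))   # False: both contain 4
print(disjoint_sorted([1, 3, 5], [2, 6, 8]))   # True: no common element
```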

    Digital television applications

    Studying the development of interactive services for digital television is a leading-edge area of work, as there is minimal research or precedent to guide their design. Published research is limited, and therefore this thesis aims at establishing a set of computing methods using Java and XML technology for future set-top box interactive services. The main issues include middleware architecture, a Java user interface for digital television, content representation and return channel communications. The middleware architecture used was made up of an Application Manager, an Application Programming Interface (API), a Java Virtual Machine, etc., which were arranged in a layered model to ensure interoperability. The application manager was designed to control the lifecycle of Xlets, manage set-top box resources and remote control keys, and adapt the graphical device environment. The architecture of both the application manager and the Xlet forms the basic framework for running multiple interactive services simultaneously in future set-top box designs. User interface development is more complex for this type of platform (when compared to that for a desktop computer) as many constraints are set on the look and feel (e.g., TV-like and limited buttons). Various aspects of Java user interfaces were studied, and my research in this area focused on creating a remote control event model and lightweight drawing components using the Java Abstract Window Toolkit (AWT) and Java Media Framework (JMF) together with the Extensible Markup Language (XML). Applications were designed aimed at studying the data structure and efficiency of the XML language to define interactive content. Content parsing was designed as a lightweight software module based around two parsers (i.e., SAX parsing and DOM parsing). Still content (i.e., text, images, and graphics) and dynamic content (i.e., hyperlinked text, animations, and forms) can then be modeled and processed efficiently. This thesis also studies interactivity methods using Java APIs via a return channel. Various communication models are also discussed that meet the interactivity requirements of different interactive services. They include URL, Socket, Datagram, and SOAP models, which applications can choose from in order to establish a connection with the service or broadcaster to transfer data. This thesis is presented in two parts: the first section gives a general summary of the research and acts as a complement to the second section, which contains a series of related publications.
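
    As a language-neutral illustration of the URL and Datagram return-channel models discussed above (the thesis works in Java; this sketch uses Python, and the endpoints are placeholders):

```python
import socket
import urllib.request

def send_via_url(endpoint, payload):
    """URL model: a reliable HTTP POST back to the broadcaster's server."""
    req = urllib.request.Request(endpoint, data=payload.encode(), method='POST')
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status

def send_via_datagram(host, port, payload):
    """Datagram model: a fire-and-forget UDP message, cheaper but unreliable."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode(), (host, port))

# An application picks the model that matches its interactivity requirements
# (reliability, latency, message size). Placeholder endpoints:
# send_via_url('https://broadcaster.example/vote', 'answer=B')
# send_via_datagram('broadcaster.example', 9999, 'answer=B')
```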
