29,021 research outputs found

    DataHub: Collaborative Data Science & Dataset Version Management at Scale

    Get PDF
    Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform, DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale.Comment: 7 page

    Building Wavelet Histograms on Large Data in MapReduce

    Full text link
    MapReduce is becoming the de facto framework for storing and processing massive data, due to its excellent scalability, reliability, and elasticity. In many MapReduce applications, obtaining a compact accurate summary of data is essential. Among various data summarization tools, histograms have proven to be particularly important and useful for summarizing data, and the wavelet histogram is one of the most widely used histograms. In this paper, we investigate the problem of building wavelet histograms efficiently on large datasets in MapReduce. We measure the efficiency of the algorithms by both end-to-end running time and communication cost. We demonstrate straightforward adaptations of existing exact and approximate methods for building wavelet histograms to MapReduce clusters are highly inefficient. To that end, we design new algorithms for computing exact and approximate wavelet histograms and discuss their implementation in MapReduce. We illustrate our techniques in Hadoop, and compare to baseline solutions with extensive experiments performed in a heterogeneous Hadoop cluster of 16 nodes, using large real and synthetic datasets, up to hundreds of gigabytes. The results suggest significant (often orders of magnitude) performance improvement achieved by our new algorithms.Comment: VLDB201

    Detecting exploit patterns from network packet streams

    Get PDF
    Network-based Intrusion Detection Systems (NIDS), e.g., Snort, Bro or NSM, try to detect malicious network activity such as Denial of Service (DoS) attacks and port scans by monitoring network traffic. Research from network traffic measurement has identified various patterns that exploits on today\u27s Internet typically exhibit. However, there has not been any significant attempt, so far, to design algorithms with provable guarantees for detecting exploit patterns from network traffic packets. In this work, we develop and apply data streaming algorithms to detect exploit patterns from network packet streams. In network intrusion detection, it is necessary to analyze large volumes of data in an online fashion. Our work addresses scalable analysis of data under the following situations. (1) Attack traffic can be stealthy in nature, which means detecting a few covert attackers might call for checking traffic logs of days or even months, (2) Traffic is multidimensional and correlations between multiple dimensions maybe important, and (3) Sometimes traffic from multiple sources may need to be analyzed in a combined manner. Our algorithms offer provable bounds on resource consumption and approximation error. Our theoretical results are supported by experiments over real network traces and synthetic datasets

    Astro-WISE: Chaining to the Universe

    Get PDF
    The recent explosion of recorded digital data and its processed derivatives threatens to overwhelm researchers when analysing their experimental data or when looking up data items in archives and file systems. While current hardware developments allow to acquire, process and store 100s of terabytes of data at the cost of a modern sports car, the software systems to handle these data are lagging behind. This general problem is recognized and addressed by various scientific communities, e.g., DATAGRID/EGEE federates compute and storage power over the high-energy physical community, while the astronomical community is building an Internet geared Virtual Observatory, connecting archival data. These large projects either focus on a specific distribution aspect or aim to connect many sub-communities and have a relatively long trajectory for setting standards and a common layer. Here, we report "first light" of a very different solution to the problem initiated by a smaller astronomical IT community. It provides the abstract "scientific information layer" which integrates distributed scientific analysis with distributed processing and federated archiving and publishing. By designing new abstractions and mixing in old ones, a Science Information System with fully scalable cornerstones has been achieved, transforming data systems into knowledge systems. This break-through is facilitated by the full end-to-end linking of all dependent data items, which allows full backward chaining from the observer/researcher to the experiment. Key is the notion that information is intrinsic in nature and thus is the data acquired by a scientific experiment. The new abstraction is that software systems guide the user to that intrinsic information by forcing full backward and forward chaining in the data modelling.Comment: To be published in ADASS XVI ASP Conference Series, 2006, R. Shaw, F. Hill and D. Bell, ed

    Early vocabulary development in deaf native signers: a British Sign Language adaptation of the communicative development inventories

    Get PDF
    Background: There is a dearth of assessments of sign language development in young deaf children. This study gathered age-related scores from a sample of deaf native signing children using an adapted version of the MacArthur-Bates CDI (Fenson et al., 1994). Method: Parental reports on children’s receptive and expressive signing were collected longitudinally on 29 deaf native British Sign Language (BSL) users, aged 8–36 months, yielding 146 datasets. Results: A smooth upward growth curve was obtained for early vocabulary development and percentile scores were derived. In the main, receptive scores were in advance of expressive scores. No gender bias was observed. Correlational analysis identified factors associated with vocabulary development, including parental education and mothers’ training in BSL. Individual children’s profiles showed a range of development and some evidence of a growth spurt. Clinical and research issues relating to the measure are discussed. Conclusions: The study has developed a valid, reliable measure of vocabulary development in BSL. Further research is needed to investigate the relationship between vocabulary acquisition in native and non-native signers

    Techniques for online analysis of large distributed data

    Get PDF
    With the advancement of technology, there has been an exponential growth in the volume of data that is continuously being generated by several applications in domains such as finance, networking, security. Examples of such continuously streaming data include internet traffic data, sensor readings, tweets, stock market data, telecommunication records. As a result, processing and analyzing data to derive useful insights from them in real time is becoming increasingly important. The goal of my research is to propose techniques to effectively find aggregates and patterns from massive distributed data stream in real time. In many real world applications, there may be specific user requirements for analyzing data. We consider three different user requirements for our work - Sliding window, Distributed data stream, and a Union of historical and streaming data. We aim to address the following problems in our research : First, we present a detailed experimental evaluation of streaming algorithms over sliding window for distinct counting, which is a fundamental aggregation problem widely applied in database query optimization and network monitoring. Next, we present the first communication-efficient distributed algorithm for tracking persistent items in a distributed data stream over both infinite and sliding window. We present theoretical analysis on communication cost and accuracy, and provide experimental results to validate the guarantees. Finally, we present the design and evaluation of a low cost algorithm that identifies quantiles from a union of historical and streaming data with improved accuracy
    • …
    corecore