29,021 research outputs found

DataHub: Collaborative Data Science & Dataset Version Management at Scale

Author: Bhardwaj Anant
Bhattacherjee Souvik
Chavan Amit
Deshpande Amol
Elmore Aaron J.
Madden Samuel
Parameswaran Aditya G.
Publication date: 02/09/2014

Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform, DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale. Comment: 7 pages

arXiv.org e-Print Archive

CiteSeerX

DSpace@MIT

Building Wavelet Histograms on Large Data in MapReduce

Author: Jestes Jeffrey
Li Feifei
Yi Ke
Publication date: 01/01/2011

MapReduce is becoming the de facto framework for storing and processing massive data, due to its excellent scalability, reliability, and elasticity. In many MapReduce applications, obtaining a compact, accurate summary of data is essential. Among various data summarization tools, histograms have proven to be particularly important and useful for summarizing data, and the wavelet histogram is one of the most widely used histograms. In this paper, we investigate the problem of building wavelet histograms efficiently on large datasets in MapReduce. We measure the efficiency of the algorithms by both end-to-end running time and communication cost. We demonstrate that straightforward adaptations of existing exact and approximate methods for building wavelet histograms to MapReduce clusters are highly inefficient. To that end, we design new algorithms for computing exact and approximate wavelet histograms and discuss their implementation in MapReduce. We illustrate our techniques in Hadoop, and compare them to baseline solutions with extensive experiments performed in a heterogeneous Hadoop cluster of 16 nodes, using large real and synthetic datasets of up to hundreds of gigabytes. The results suggest significant (often orders of magnitude) performance improvement achieved by our new algorithms. Comment: VLDB201
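For readers unfamiliar with the summary structure named in this abstract, the following is a minimal single-machine sketch of a wavelet histogram: an orthonormal Haar transform of a frequency vector, followed by keeping only the k largest-magnitude coefficients. It is not the paper's MapReduce algorithm, and it ignores distribution and communication cost entirely. The function names (`haar_forward`, `haar_inverse`, `top_k_histogram`, `reconstruct`) are hypothetical, and the input length is assumed to be a power of two.

```python
import math


def haar_forward(values):
    """Orthonormal Haar transform of a list whose length is a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(v) for v in values]
    s = math.sqrt(2.0)
    length = n
    while length > 1:
        half = length // 2
        # Pairwise scaled sums (averages) and differences (details).
        avg = [(out[2 * i] + out[2 * i + 1]) / s for i in range(half)]
        det = [(out[2 * i] - out[2 * i + 1]) / s for i in range(half)]
        out[:length] = avg + det
        length = half
    return out


def haar_inverse(coeffs):
    """Invert haar_forward."""
    out = list(coeffs)
    n = len(out)
    s = math.sqrt(2.0)
    length = 1
    while length < n:
        merged = []
        for a, d in zip(out[:length], out[length:2 * length]):
            merged.append((a + d) / s)
            merged.append((a - d) / s)
        out[:2 * length] = merged
        length *= 2
    return out


def top_k_histogram(freq, k):
    """Keep the k largest-magnitude coefficients as {index: value}."""
    coeffs = haar_forward(freq)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    return {i: coeffs[i] for i in order[:k]}


def reconstruct(sparse, n):
    """Approximate the original frequency vector from the kept coefficients."""
    dense = [0.0] * n
    for i, c in sparse.items():
        dense[i] = c
    return haar_inverse(dense)
```

Because the transform is orthonormal, dropping the smallest-magnitude coefficients is the choice that minimizes the squared reconstruction error for a given k, which is why the sketch sorts by absolute value.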

arXiv.org e-Print Archive

CiteSeerX

Hong Kong University of Science and Technology Institutional Repository

Detecting exploit patterns from network packet streams

Author: Lahiri Bibudh
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2012

Network-based Intrusion Detection Systems (NIDS), e.g., Snort, Bro or NSM, try to detect malicious network activity such as Denial of Service (DoS) attacks and port scans by monitoring network traffic. Research from network traffic measurement has identified various patterns that exploits on today's Internet typically exhibit. However, there has not been any significant attempt, so far, to design algorithms with provable guarantees for detecting exploit patterns from network traffic packets. In this work, we develop and apply data streaming algorithms to detect exploit patterns from network packet streams. In network intrusion detection, it is necessary to analyze large volumes of data in an online fashion. Our work addresses scalable analysis of data under the following situations: (1) attack traffic can be stealthy in nature, which means detecting a few covert attackers might call for checking traffic logs of days or even months; (2) traffic is multidimensional, and correlations between multiple dimensions may be important; and (3) sometimes traffic from multiple sources may need to be analyzed in a combined manner. Our algorithms offer provable bounds on resource consumption and approximation error. Our theoretical results are supported by experiments over real network traces and synthetic datasets.

Digital Repository @ Iowa State University (ISU)

Astro-WISE: Chaining to the Universe

Author: Begeman K. G.
Bell D. J.
Bender R.
Bertin E.
Boxhoorn D. R.
Capaccioli M.
Deul E.
Helmich E.
Heraudeau P.
Hill F.
Kuijken K.
Mc Farland John
Mellier Y.
Neeser M.
Rengelink R.
Saglia R.
Shaw R. A.
Silvotti R.
Snigula J.
Tempelaar M. J.
Valentijn E. A.
Verdoes Kleijn G.
Vermeij R.
Vriend W.-J.
Publication date: 01/01/2007

The recent explosion of recorded digital data and its processed derivatives threatens to overwhelm researchers when analysing their experimental data or when looking up data items in archives and file systems. While current hardware developments make it possible to acquire, process and store 100s of terabytes of data at the cost of a modern sports car, the software systems to handle these data are lagging behind. This general problem is recognized and addressed by various scientific communities, e.g., DATAGRID/EGEE federates compute and storage power over the high-energy physics community, while the astronomical community is building an Internet-geared Virtual Observatory, connecting archival data. These large projects either focus on a specific distribution aspect or aim to connect many sub-communities and have a relatively long trajectory for setting standards and a common layer. Here, we report "first light" of a very different solution to the problem initiated by a smaller astronomical IT community. It provides the abstract "scientific information layer" which integrates distributed scientific analysis with distributed processing and federated archiving and publishing. By designing new abstractions and mixing in old ones, a Science Information System with fully scalable cornerstones has been achieved, transforming data systems into knowledge systems. This breakthrough is facilitated by the full end-to-end linking of all dependent data items, which allows full backward chaining from the observer/researcher to the experiment. Key is the notion that information is intrinsic in nature and thus is the data acquired by a scientific experiment. The new abstraction is that software systems guide the user to that intrinsic information by forcing full backward and forward chaining in the data modelling. Comment: To be published in ADASS XVI ASP Conference Series, 2006, R. Shaw, F. Hill and D. Bell, ed

arXiv.org e-Print Archive

Archivio della ricerca - Università degli studi di Napoli Federico II

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

CERN Document Server

Dissertations of the University of Groningen

Early vocabulary development in deaf native signers: a British Sign Language adaptation of the communicative development inventories

Author: Anderson
Arriaga
Bauer
Dale
Eriksson
Feldman
Fenson
Fenson
Fenson
Goldfield
Hamilton
Haug
Heilmann
Herman
Herman
Herman
Kyle
Law
Maital
Mayberry
Miller
Mitchell
Morgan
Newport
Nikolopoulos
Ogura
Prezbindowski
Reese
Reilly
Rossetti
Roy
Rust
Schick
Schick
Seeff-Gabriel
Sutton-Spence
Tait
Tardif
Thal
Tomasello
Publication venue: Wiley
Publication date: 01/01/2010

Background: There is a dearth of assessments of sign language development in young deaf children. This study gathered age-related scores from a sample of deaf native signing children using an adapted version of the MacArthur-Bates CDI (Fenson et al., 1994). Method: Parental reports on children’s receptive and expressive signing were collected longitudinally on 29 deaf native British Sign Language (BSL) users, aged 8–36 months, yielding 146 datasets. Results: A smooth upward growth curve was obtained for early vocabulary development and percentile scores were derived. In the main, receptive scores were in advance of expressive scores. No gender bias was observed. Correlational analysis identified factors associated with vocabulary development, including parental education and mothers’ training in BSL. Individual children’s profiles showed a range of development and some evidence of a growth spurt. Clinical and research issues relating to the measure are discussed. Conclusions: The study has developed a valid, reliable measure of vocabulary development in BSL. Further research is needed to investigate the relationship between vocabulary acquisition in native and non-native signers.

CiteSeerX

City Research Online

Crossref

Techniques for online analysis of large distributed data

Author: Singh Sneha Aman
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2015

With the advancement of technology, there has been an exponential growth in the volume of data that is continuously being generated by applications in domains such as finance, networking, and security. Examples of such continuously streaming data include internet traffic data, sensor readings, tweets, stock market data, and telecommunication records. As a result, processing and analyzing data to derive useful insights from them in real time is becoming increasingly important. The goal of my research is to propose techniques to effectively find aggregates and patterns from massive distributed data streams in real time. In many real-world applications, there may be specific user requirements for analyzing data. We consider three different user requirements for our work: a sliding window, a distributed data stream, and a union of historical and streaming data. We aim to address the following problems in our research. First, we present a detailed experimental evaluation of streaming algorithms over sliding windows for distinct counting, which is a fundamental aggregation problem widely applied in database query optimization and network monitoring. Next, we present the first communication-efficient distributed algorithm for tracking persistent items in a distributed data stream over both infinite and sliding windows. We present theoretical analysis of communication cost and accuracy, and provide experimental results to validate the guarantees. Finally, we present the design and evaluation of a low-cost algorithm that identifies quantiles from a union of historical and streaming data with improved accuracy.
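To make the first problem in this abstract concrete, here is a minimal sketch of distinct counting over a time-based sliding window. It is an exact baseline that stores the latest timestamp of every live item, so its memory grows with the number of distinct items in the window. The streaming algorithms evaluated in the thesis are presumably approximate, small-space methods and are not reproduced here. The class name `SlidingDistinctCounter` and its methods are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import OrderedDict


class SlidingDistinctCounter:
    """Exact count of distinct items seen in the window (now - window, now]."""

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        # Maps item -> latest timestamp, ordered from oldest to newest.
        self.last_seen = OrderedDict()

    def add(self, item, timestamp):
        # Re-inserting moves the item to the newest end of the ordering.
        self.last_seen.pop(item, None)
        self.last_seen[item] = timestamp
        self._expire(timestamp)

    def _expire(self, now):
        cutoff = now - self.window
        while self.last_seen:
            _, oldest_ts = next(iter(self.last_seen.items()))
            if oldest_ts > cutoff:
                break
            self.last_seen.popitem(last=False)

    def count(self, now):
        self._expire(now)
        return len(self.last_seen)
```

Since only the most recent occurrence of an item decides whether it is still inside the window, keeping items ordered by last-seen time lets expiry stop at the first item that is still live.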

Digital Repository @ Iowa State University (ISU)