554 research outputs found

    A Survey on Array Storage, Query Languages, and Systems

    Full text link
    Since scientific investigation is one of the most important providers of massive amounts of ordered data, there is a renewed interest in array data processing in the context of Big Data. To the best of our knowledge, a unified resource that summarizes and analyzes array processing research over its long existence is currently missing. In this survey, we provide a guide for past, present, and future research in array processing. The survey is organized along three main topics. Array storage discusses all the aspects related to array partitioning into chunks. The identification of a reduced set of array operators to form the foundation for an array query language is analyzed across multiple such proposals. Lastly, we survey real systems for array processing. The result is a thorough survey on array data storage and processing that should be consulted by anyone interested in this research topic, independent of experience level. The survey is not complete though. We greatly appreciate pointers towards any work we might have forgotten to mention. (44 pages.)
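    The chunking ideas surveyed here can be illustrated with a small, hypothetical example. The sketch below uses NumPy to split a dense two-dimensional array into regular (aligned) chunks, one of the partitioning schemes commonly discussed in array storage work; the array shape, chunk shape, and function name are illustrative assumptions, not taken from the survey.

```python
import numpy as np

def regular_chunks(array: np.ndarray, chunk_shape: tuple[int, int]):
    """Partition a dense 2-D array into regular (aligned) chunks.

    Yields ((row_start, col_start), chunk) pairs; border chunks may be
    smaller when the array shape is not a multiple of the chunk shape.
    """
    rows, cols = array.shape
    chunk_rows, chunk_cols = chunk_shape
    for r in range(0, rows, chunk_rows):
        for c in range(0, cols, chunk_cols):
            yield (r, c), array[r:r + chunk_rows, c:c + chunk_cols]

# Example: a 1000 x 1000 array split into 256 x 256 chunks.
data = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)
chunks = dict(regular_chunks(data, (256, 256)))
print(len(chunks))  # 16 chunks: a 4 x 4 grid; border chunks are 232 wide/tall
```

    In a real array store, each yielded chunk would be compressed and written as an independent unit so that range queries only touch the chunks they overlap.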

    Estimating development effort in free/open source software projects by mining software repositories: A case study of OpenStack

    Get PDF
    Because of the distributed and collaborative nature of free/open source software (FOSS) projects, the development effort invested in a project is usually unknown, even after the software has been released. However, this information is becoming of major interest, especially, but not only, because of the growth in the number of companies for which FOSS has become relevant to their business strategy. In this paper we present a novel approach to estimating effort by considering data from source code management repositories. We apply our model to the OpenStack project, a FOSS project with more than 1,000 authors, in which several tens of companies cooperate. Based on data from its repositories, together with input from a survey answered by more than 100 developers, we show that the model offers a simple but sound way of obtaining software development estimations with bounded margins of error. Funding: Gregorio Robles, Carlos Cervigón, and Jesús M. González-Barahona, project SobreSale (TIN2011-28110); the work of Daniel Izquierdo has been funded in part by the Torres Quevedo program (PTQ-12-05577).
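    The abstract does not spell out the estimation model, so the following is only a hypothetical sketch of the general idea of deriving effort from commit activity: authors above an assumed monthly commit threshold count as full-time, and less active authors are scaled by their relative activity. The threshold, the scaling rule, and all names are illustrative assumptions, not the paper's calibrated model.

```python
from collections import Counter

def estimate_person_months(commits_per_author: Counter, months: int,
                           full_time_threshold: int = 25) -> float:
    """Hypothetical effort estimate (person-months) from commit counts.

    Authors whose average monthly commits reach `full_time_threshold` are
    counted as one full-time person per month of the observation window;
    less active authors contribute proportionally to their activity.
    """
    effort = 0.0
    for author, commits in commits_per_author.items():
        monthly = commits / months
        if monthly >= full_time_threshold:
            effort += months  # treated as a full-time contributor
        else:
            effort += months * monthly / full_time_threshold
    return effort

# Toy usage with made-up commit counts over a 6-month window.
activity = Counter({"alice": 300, "bob": 40, "carol": 12})
print(round(estimate_person_months(activity, months=6), 1))  # about 8.1
```

    A survey of developers, as described above, would be used to check and calibrate such a threshold against self-reported dedication.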

    Impliance: A Next Generation Information Management Appliance

    Full text link
    While database systems have been remarkably successful in building a large market and adapting to the changes of the last three decades, their impact on the broader market of information management is surprisingly limited. If we were to design an information management system from scratch, based upon today's requirements and hardware capabilities, would it look anything like today's database systems? In this paper, we introduce Impliance, a next-generation information management system consisting of hardware and software components integrated to form an easy-to-administer appliance that can store, retrieve, and analyze all types of structured, semi-structured, and unstructured information. We first summarize the trends that will shape information management for the foreseeable future. Those trends imply three major requirements for Impliance: (1) to be able to store, manage, and uniformly query all data, not just structured records; (2) to be able to scale out as the volume of this data grows; and (3) to be simple and robust in operation. We then describe four key ideas that are uniquely combined in Impliance to address these requirements: (a) integrating software and off-the-shelf hardware into a generic information appliance; (b) automatically discovering, organizing, and managing all data, unstructured as well as structured, in a uniform way; (c) achieving scale-out by exploiting simple, massive parallel processing; and (d) virtualizing compute and storage resources to unify, simplify, and streamline the management of Impliance. Impliance is an ambitious, long-term effort to define simpler, more robust, and more scalable information systems for tomorrow's enterprises. (Presented at CIDR 2007, the 3rd Biennial Conference on Innovative Data Systems Research, January 7-10, 2007, Asilomar, California, USA.)

    Consistent View-Based Management of Variability in Space and Time

    Get PDF
    Systems evolve rapidly and exist in different variants in order to satisfy different and changing requirements. This leads to successive revisions (variability in time) and concurrently existing product variants (variability in space). Redundancies and dependencies between different products across multiple revisions, as well as heterogeneous types of artifacts, quickly lead to inconsistencies during the evolution of a variable system. Mastering this complexity and managing both dimensions of variability in a uniform and consistent way are essential challenges for the successful development of large and long-lived systems. Variability in space is primarily considered in software product line engineering, while variability in time is studied in software configuration management. Consistency preservation between heterogeneous artifact types and view-based software development are central research topics in model-driven software engineering. The isolation of these three neighboring disciplines has led to a multitude of approaches and tools from the different areas, which hampers the definition of a common understanding and carries the risk of redundant research and development. Tools from the different disciplines are often insufficiently integrated, which results in a heterogeneous tool landscape and high manual effort during the evolution of a variable system, which in turn harms system quality and leads to higher maintenance costs.
    Based on the current state of research in the aforementioned disciplines, this dissertation presents three core contributions to support coping with the complexity during the evolution of variable systems. The unified conceptual model documents and unifies concepts and relations for handling variability in space and time simultaneously, based on a multitude of selected approaches and tools from software product line engineering and software configuration management. Beyond merely combining existing concepts, the unified conceptual model describes new ways of relating the two variability dimensions to each other. The unified operations use the unified conceptual model as their data structure and form the basis for the operational management of variability in space and time. The unified operations are designed based on an analysis of diverse approaches that follow different modalities and paradigms. While the unified operations cover the functionality of the analyzed tools, they enable handling both variability dimensions simultaneously. The unified approach builds on the preceding contributions and extends them with consistency preservation. To this end, types of variability-specific inconsistencies that can occur during the evolution of variable heterogeneous systems were identified. The unified approach enables automated consistency preservation for a selected subset of the identified inconsistency types. Each core contribution was evaluated empirically.
    To evaluate the unified conceptual model and the unified operations, expert surveys were conducted, metrics for assessing the appropriateness of a unification were defined and applied, and exemplary applications were demonstrated. The functional suitability of the unified approach was evaluated by means of two real-world case studies: the widely used ArgoUML-SPL, which is based on ArgoUML, a UML modeling tool, and MobileMedia, a mobile application for media management. The unified approach is implemented with the Eclipse Modeling Framework (EMF) and the Vitruvius approach. The core contributions of this work extend the existing body of knowledge regarding the uniform management of variability in space and time and combine it with automated consistency preservation for variable systems consisting of heterogeneous artifact types.
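    As a rough illustration of how the two variability dimensions can be related, the sketch below models product variants (variability in space) as feature configurations and revisions (variability in time) as an ordered history per variant; the class and attribute names are invented for illustration and do not reproduce the dissertation's unified conceptual model.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class FeatureConfiguration:
    """Variability in space: the set of selected features of one variant."""
    features: frozenset[str]

@dataclass
class Revision:
    """Variability in time: one state of the artifacts of a variant."""
    number: int
    artifacts: dict[str, str]  # artifact name -> content (simplified)

@dataclass
class VariableSystem:
    """Relates both dimensions: every variant has its own revision history."""
    variants: dict[FeatureConfiguration, list[Revision]] = field(default_factory=dict)

    def commit(self, config: FeatureConfiguration, artifacts: dict[str, str]) -> Revision:
        history = self.variants.setdefault(config, [])
        revision = Revision(number=len(history) + 1, artifacts=dict(artifacts))
        history.append(revision)
        return revision

# Toy usage: two variants of a media app, each evolving independently.
system = VariableSystem()
base = FeatureConfiguration(frozenset({"Photo"}))
extended = FeatureConfiguration(frozenset({"Photo", "Video"}))
system.commit(base, {"Player.java": "class Player {}"})
system.commit(extended, {"Player.java": "class Player { /* video */ }"})
print(len(system.variants))  # 2 variants, each with its own history
```

    Consistency preservation, as studied in the dissertation, would operate on top of such a structure, e.g. by checking that artifacts shared between variants do not drift apart across revisions.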

    Consistent View-Based Management of Variability in Space and Time

    Get PDF
    Developing variable systems faces many challenges. Dependencies between interrelated artifacts within a product variant, such as code or diagrams, across product variants, and across their revisions quickly lead to inconsistencies during evolution. This work provides a unification of common concepts and operations for variability management, identifies variability-related inconsistencies, and presents an approach for view-based consistency preservation of variable systems.

    Enabling parallelism and optimizations in data mining algorithms for power-law data

    Get PDF
    Today's data mining tasks aim to extract meaningful information from large amounts of data in a reasonable time, mainly by means of (a) algorithmic advances, such as fast approximate algorithms and efficient learning algorithms, and (b) architectural advances, such as machines with massive compute capacity involving distributed multi-core processors and high-throughput accelerators. For current and future generations of processors, parallel algorithms are critical for fully utilizing computing resources. Furthermore, exploiting data properties for performance gains becomes crucial for data mining applications. In this work, we focus our attention on power-law behavior, a common property found in a large class of data such as text data, internet traffic, and click-stream data. Specifically, we address the following questions in the context of power-law data: How well do the critical data mining algorithms of current interest fit today's parallel architectures? Which algorithmic and mapping opportunities can be leveraged to further improve performance? And what are the relative challenges and gains of such approaches?
    We first investigate the suitability of the "frequency estimation" problem for GPU-scale parallelism. Sketching algorithms are a popular choice for this task due to their desirable trade-off between estimation accuracy and space-time efficiency. However, most past work on sketch-based frequency estimation has focused on CPU implementations. We propose a novel approach for sketches that exploits the natural skewness of power-law data to efficiently utilize the massive parallelism of modern GPUs. Next, we explore the problem of "identifying top-K frequent elements" in distributed data streams on modern platforms with both multi-core and multi-node CPU parallelism. Sketch-based approaches, such as Count-Min Sketch (CMS) with a top-K heap, have an excellent update time but lack the important property of reducibility, which is needed for exploiting data parallelism. On the other end, the popular Frequent Algorithm (FA) produces reducible summaries, but its update costs are high. Our approach, Topkapi, gives the best of both worlds: it is reducible like FA and has an efficient update time similar to CMS. For power-law data, Topkapi possesses strong theoretical guarantees and yields significant performance gains relative to past work.
    Finally, we study Word2Vec, a popular word embedding method widely used in machine learning and natural language processing applications such as machine translation, sentiment analysis, and query answering. Here, we target Single Instruction Multiple Data (SIMD) parallelism. With the increasing vector lengths of commodity CPUs, such as AVX-512 with a vector length of 512 bits, efficient utilization of the vector processing unit becomes a major performance game-changer. By employing a static multi-version code generation strategy coupled with an algorithmic approximation based on the power-law frequency distribution of words, we achieve significant reductions in training time relative to the state of the art.
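    As a point of reference for the sketch-based baselines mentioned above, the following is a minimal Count-Min Sketch combined with a small top-K candidate set; the width, depth, and hashing scheme are illustrative choices, and none of the GPU, distributed, or Topkapi-specific optimizations from the thesis are reproduced here.

```python
import heapq
import random

class CountMinSketch:
    """Minimal Count-Min Sketch for approximate frequency estimation."""

    def __init__(self, width: int = 2048, depth: int = 4, seed: int = 7):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.tables = [[0] * width for _ in range(depth)]

    def update(self, item: str, count: int = 1) -> None:
        for salt, table in zip(self.salts, self.tables):
            table[hash((salt, item)) % self.width] += count

    def estimate(self, item: str) -> int:
        # Taking the minimum over rows bounds the over-estimation from collisions.
        return min(table[hash((salt, item)) % self.width]
                   for salt, table in zip(self.salts, self.tables))

def top_k(stream, k: int = 3):
    """Track the approximately k most frequent items with a CMS plus a candidate set."""
    cms = CountMinSketch()
    candidates: dict[str, int] = {}
    for item in stream:
        cms.update(item)
        candidates[item] = cms.estimate(item)
        if len(candidates) > k:
            # Keep only the k items with the largest current estimates.
            candidates = dict(heapq.nlargest(k, candidates.items(),
                                             key=lambda kv: kv[1]))
    return sorted(candidates.items(), key=lambda kv: -kv[1])

# Toy skewed (power-law-like) stream.
stream = ["a"] * 50 + ["b"] * 20 + ["c"] * 10 + ["d"] * 2 + ["e"]
print(top_k(stream))  # expected: [('a', 50), ('b', 20), ('c', 10)]
```

    The per-item update touches one counter per row, which is why CMS updates are cheap, while the candidate set is what fails to merge cleanly across workers; that missing reducibility is the gap the Topkapi approach described above is designed to close.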