
    Sampling Algorithms for Evolving Datasets

    Perhaps the most flexible synopsis of a database is a uniform random sample of the data; such samples are widely used to speed up the processing of analytic queries and data-mining tasks, to enhance query optimization, and to facilitate information integration. Most of the existing work on database sampling focuses on how to create or exploit a random sample of a static database, that is, a database that does not change over time. The assumption of a static database, however, severely limits the applicability of these techniques in practice, where data is often not static but continuously evolving. In order to maintain the statistical validity of the sample, any changes to the database have to be appropriately reflected in the sample. In this thesis, we study efficient methods for incrementally maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions, updates, and deletions. We consider instances of the maintenance problem that arise when sampling from an evolving set, from an evolving multiset, from the distinct items in an evolving multiset, or from a sliding window over a data stream. Our algorithms completely avoid any accesses to the base data and can be several orders of magnitude faster than algorithms that do rely on such expensive accesses. The improved efficiency of our algorithms comes at virtually no cost: the resulting samples are provably uniform and only a small amount of auxiliary information is associated with the sample. We show that the auxiliary information not only facilitates efficient maintenance, but it can also be exploited to derive unbiased, low-variance estimators for counts, sums, averages, and the number of distinct items in the underlying dataset. In addition to sample maintenance, we discuss methods that greatly improve the flexibility of random sampling from a system's point of view. More specifically, we initiate the study of algorithms that resize a random sample upwards or downwards. Our resizing algorithms can be exploited to dynamically control the size of the sample when the dataset grows or shrinks; they facilitate resource management and help to avoid under- or oversized samples. Furthermore, in large-scale databases with data being distributed across several remote locations, it is usually infeasible to reconstruct the entire dataset for the purpose of sampling. To address this problem, we provide efficient algorithms that directly combine the local samples maintained at each location into a sample of the global dataset. We also consider a more general problem, where the global dataset is defined as an arbitrary set or multiset expression involving the local datasets, and provide efficient solutions based on hashing.
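    The maintenance problem sketched above, keeping a bounded uniform sample valid under insertions and deletions without ever touching the base data, can be illustrated with the random-pairing idea published by the thesis author (Gemulla et al.). The sketch below assumes set semantics and uses invented identifiers; it is an illustration, not the thesis's full algorithm:

```cpp
// Minimal sketch of uniform sample maintenance under insertions and
// deletions, in the spirit of random pairing. Set semantics, no base-data
// accesses; all names are invented for illustration.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

class EvolvingSample {
public:
    explicit EvolvingSample(std::size_t capacity)
        : cap_(capacity), gen_(std::random_device{}()) {}

    void insert(int64_t item) {
        ++n_;  // current dataset size
        if (d_in_ + d_out_ > 0) {
            // Pair this insertion with an earlier uncompensated deletion:
            // it enters the sample with probability d_in_ / (d_in_ + d_out_).
            std::uniform_int_distribution<uint64_t> coin(1, d_in_ + d_out_);
            if (coin(gen_) <= d_in_) {
                sample_.push_back(item);
                --d_in_;
            } else {
                --d_out_;
            }
        } else if (sample_.size() < cap_) {
            sample_.push_back(item);  // sample not yet at capacity
        } else {
            // Classic reservoir step: include with probability cap_ / n_,
            // replacing a uniformly chosen victim.
            uint64_t j = std::uniform_int_distribution<uint64_t>(0, n_ - 1)(gen_);
            if (j < cap_) sample_[j] = item;
        }
    }

    void erase(int64_t item) {  // assumes `item` is currently in the dataset
        --n_;
        auto it = std::find(sample_.begin(), sample_.end(), item);
        if (it != sample_.end()) {
            *it = sample_.back();   // evict the deleted item from the sample...
            sample_.pop_back();
            ++d_in_;                // ...and remember to refill it later
        } else {
            ++d_out_;
        }
    }

    const std::vector<int64_t>& sample() const { return sample_; }

private:
    std::size_t cap_;
    uint64_t n_ = 0;      // current dataset size
    uint64_t d_in_ = 0;   // uncompensated deletions that hit the sample
    uint64_t d_out_ = 0;  // uncompensated deletions that missed it
    std::vector<int64_t> sample_;
    std::mt19937_64 gen_;
};

int main() {
    EvolvingSample s(100);
    for (int64_t i = 0; i < 10000; ++i) s.insert(i);
    for (int64_t i = 0; i < 5000; ++i) s.erase(i);  // delete the first half
    std::cout << "sample size: " << s.sample().size() << '\n';  // at most 100
}
```

    The key point is that a deletion is never repaired by reading the base data; it merely leaves a debt (d_in_ or d_out_) that a later insertion pays off, which is what keeps maintenance cheap while preserving uniformity.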

    QoS-aware Resource-utilisation Self-adaptive (QRS) Framework for Distributed Data Stream Management Systems

    The last decade witnessed a vast number of Big Data applications in science and industry alike. Such applications generate large amounts of streaming data and real-time, event-based information, which must be analysed under specific quality-of-service (QoS) constraints and within extremely low latencies. Many distributed data stream processing approaches are based on a best-effort QoS principle and lack the capability to adapt dynamically to fluctuations in data input rates. Most of the proposed solutions either drop some of the input data (load shedding) or degrade the level of QoS provided by the system. Another approach is to limit the data ingestion rate using techniques like backpressure heartbeats, which can stall worker nodes and delay output. Such approaches are unsuitable for certain types of mission-critical applications, such as critical infrastructure surveillance, monitoring and signalling, vital health care monitoring, and military command-and-control streaming applications. This research presents a novel QoS-aware, Resource-utilisation Self-adaptive (QRS) Framework for managing data stream processing systems. The framework proposes a comprehensive usage model that combines proactive operations with simultaneous prompt actions. The prompt actions instantly collect and analyse performance and QoS metrics along the running data streams, ensuring that the data does not lose its current value, whereas the proactive operations construct a prediction model that anticipates QoS violations and performance degradation in the system. The model triggers the essential decision processes for dynamically tuning resources or adopting a new scheduling strategy. A proof-of-concept model was built that accurately represents the working conditions of a distributed data stream management ecosystem, and the proposed framework was validated and verified. Several of the framework's components were fully implemented on Apache Storm, an emerging and prevalent distributed data stream processing system. The framework predicts the system's capacity to handle the data load and input rate with up to 81% accuracy, and incorporating anomaly detection techniques raises this accuracy to up to 100%. Moreover, the framework performs well compared with the default round-robin and resource-aware schedulers in Storm: it handles high data rates better by rebalancing the topology and rescheduling resources, based on the prediction models, well ahead of any congestion or QoS degradation.
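    As a rough illustration of the predict-then-act loop described above, a monitor could extrapolate recent latency measurements and trigger rescheduling before the QoS limit is crossed. This is a hypothetical sketch, not the QRS implementation: the class, the simple linear trend model, and the rebalance callback are all invented for illustration, and the actual framework builds far richer prediction models over Storm metrics:

```cpp
// Hypothetical sketch of a proactive QoS monitor: fit a linear trend over a
// sliding window of latency samples and act before the SLA is violated.
#include <cstddef>
#include <deque>
#include <functional>
#include <iostream>

class QosMonitor {
public:
    QosMonitor(double latency_sla_ms, std::size_t window,
               std::function<void()> rebalance)
        : sla_(latency_sla_ms), window_(window), rebalance_(std::move(rebalance)) {}

    // Prompt action: record the latest latency sample and check the forecast.
    void observe(double latency_ms) {
        samples_.push_back(latency_ms);
        if (samples_.size() > window_) samples_.pop_front();
        if (samples_.size() == window_ && predict(5) > sla_) {
            rebalance_();      // proactive action, before the SLA is breached
            samples_.clear();  // start fresh after re-scheduling
        }
    }

private:
    // Proactive operation: least-squares linear trend over the window,
    // extrapolated `steps` observations ahead.
    double predict(int steps) const {
        const double n = static_cast<double>(samples_.size());
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (std::size_t i = 0; i < samples_.size(); ++i) {
            sx += static_cast<double>(i);
            sy += samples_[i];
            sxx += static_cast<double>(i) * static_cast<double>(i);
            sxy += static_cast<double>(i) * samples_[i];
        }
        const double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        const double intercept = (sy - slope * sx) / n;
        return intercept + slope * (n - 1 + steps);
    }

    double sla_;
    std::size_t window_;
    std::function<void()> rebalance_;
    std::deque<double> samples_;
};

int main() {
    QosMonitor monitor(100.0, 8, [] { std::cout << "rebalance topology\n"; });
    for (int i = 0; i < 20; ++i)
        monitor.observe(40.0 + 4.0 * i);  // steadily rising latency
}
```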

    Proceedings Work-In-Progress Session of the 13th Real-Time and Embedded Technology and Applications Symposium

    The Work-In-Progress session of the 13th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS'07) presents papers describing contributions to both the state of the art and the state of the practice in the broad field of real-time and embedded systems. The 17 accepted papers were selected from 19 submissions. These proceedings are also available as Washington University in St. Louis Technical Report WUCSE-2007-17, at http://www.cse.seas.wustl.edu/Research/FileDownload.asp?733. Special thanks go to the General Chairs, Steve Goddard and Steve Liu, and the Program Chairs, Scott Brandt and Frank Mueller, for their support and guidance.

    Scalable and adaptable distributed stream processing

    Ph.D., Doctor of Philosophy

    Graph Processing in Main-Memory Column Stores

    More and more novel and traditional business applications leverage the advantages of a graph data model, such as schema flexibility and an explicit representation of relationships between entities. As a consequence, companies are confronted with the challenge of storing, manipulating, and querying terabytes of graph data for enterprise-critical applications. Although these business applications operate on graph-structured data, they still require direct access to the relational data and typically rely on an RDBMS to keep a single source of truth and access. Existing solutions performing graph operations on business-critical data either use a combination of SQL and application logic or employ a graph data management system. For the first approach, relying solely on SQL results in poor execution performance caused by the functional mismatch between typical graph operations and the relational algebra. Worse, graph algorithms exhibit a tremendous variety in structure and functionality, owing to their often domain-specific implementations, and can therefore hardly be integrated into a database management system other than through custom coding. Since the majority of these enterprise-critical applications run exclusively on relational DBMSs, employing a specialized system for storing and processing graph data is typically not sensible. Besides the maintenance overhead of keeping the systems in sync, combining graph and relational operations is hard to realize, as it requires data transfer across system boundaries. Traversal operations are a basic ingredient of graph queries and algorithms, and a fundamental component of any database management system that aims at storing, manipulating, and querying graph data. Well-established graph traversal algorithms are standalone implementations relying on optimized data structures. Integrating graph traversals as an operator into a database management system requires tight integration into the existing database environment and the development of new components, such as a graph-topology-aware optimizer with accompanying graph statistics, graph-specific secondary index structures to speed up traversals, and an accompanying graph query language. In this thesis, we introduce and describe GRAPHITE, a hybrid graph-relational data management system. GRAPHITE is a performance-oriented graph data management system built into an RDBMS, allowing graph data to be processed seamlessly alongside relational data in the same system. We propose a columnar storage representation for graph data to leverage the existing and mature data management and query processing infrastructure of relational database management systems. At the core of GRAPHITE, we propose an execution engine based solely on set operations and graph traversals. Our design is driven by the observation that different graph topologies impose different algorithmic requirements on the design of a graph traversal operator. We derive two graph traversal implementations targeting the most common graph topologies and demonstrate how graph-specific statistics can be leveraged to select the optimal physical traversal operator. To accelerate graph traversals, we devise a set of graph-specific, updateable secondary index structures to improve the performance of vertex neighborhood expansion. Finally, we introduce a domain-specific language with an intuitive programming model to extend graph traversals with custom application logic at runtime. We use the LLVM compiler framework to generate efficient code that tightly integrates the user-specified application logic with our highly optimized built-in graph traversal operators. Our experimental evaluation shows that GRAPHITE can outperform native graph management systems by several orders of magnitude while providing all the features of an RDBMS, such as transaction support, backup and recovery, and security and user management, making it a promising alternative to specialized graph management systems that lack many of these features and require expensive data replication and maintenance processes.
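    To make the traversal-on-columns idea concrete, the following hypothetical sketch (not GRAPHITE's code) runs a level-synchronous traversal over a CSR-style two-column edge representation, the kind of set-oriented operator the abstract describes:

```cpp
// Illustrative sketch: level-synchronous breadth-first traversal over a
// compressed-sparse-row (CSR) columnar edge layout. Not GRAPHITE's actual
// implementation, only a minimal example of a set-oriented traversal operator.
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

struct ColumnarGraph {
    // offsets[v] .. offsets[v+1] delimit v's out-neighbors in `targets`,
    // i.e. a CSR layout stored as two plain columns.
    std::vector<std::size_t> offsets;
    std::vector<std::size_t> targets;
};

// Expands `frontier` one hop at a time until no new vertices are discovered;
// returns all vertices reachable from the frontier.
std::vector<std::size_t> traverse(const ColumnarGraph& g,
                                  std::vector<std::size_t> frontier) {
    std::vector<bool> visited(g.offsets.size() - 1, false);
    std::vector<std::size_t> reachable;
    for (auto v : frontier) visited[v] = true;
    while (!frontier.empty()) {
        std::vector<std::size_t> next;
        for (auto v : frontier) {
            reachable.push_back(v);
            for (auto i = g.offsets[v]; i < g.offsets[v + 1]; ++i) {
                auto w = g.targets[i];
                if (!visited[w]) { visited[w] = true; next.push_back(w); }
            }
        }
        frontier = std::move(next);
    }
    return reachable;
}

int main() {
    // 4 vertices with edges 0->1, 0->2, 1->3, 2->3.
    ColumnarGraph g{{0, 2, 3, 4, 4}, {1, 2, 3, 3}};
    for (auto v : traverse(g, {0})) std::cout << v << ' ';  // prints: 0 1 2 3
    std::cout << '\n';
}
```

    Because the adjacency data lives in ordinary columns, such an operator can reuse the column store's scan, compression, and memory-management machinery, which is the design argument the abstract makes.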

    Fostering Participation and Capacity Building with Neighborhood Information Systems.

    Applying information to decision making, monitoring neighborhood conditions, targeting resources, and recommending action have long been key urban planning functions. Increasingly, nonprofit organizations like community development corporations (CDCs) carry out these functions in distressed urban areas. Scholars in multiple disciplines argue that “data democratization”, that is, increased access to data, would support a wide range of community change efforts. Proponents of a specific data delivery tool, neighborhood information systems (NIS), claim that the technology can increase public participation and build capacity in distressed urban neighborhoods. This research evaluates these claims in Cleveland, where the mortgage foreclosure crisis has left a glut of vacant and abandoned properties and a dire need to prioritize activities with limited resources. The research provides an integrated theoretical framework, bringing together four distinct bodies of knowledge for the first time: science and technology studies; participation, capacity, and capacity building; geographic information systems; and management information systems. The mixed-methods approach includes interviews with sixty community development professionals in Cleveland and a longitudinal regression analysis of thirty CDCs’ housing rehabilitation outcomes between July 1, 2007 and June 30, 2011. NIS increased the networking capacity of CDCs engaged in the city’s Code Enforcement Partnership by improving communication between partners. NIS also increased programmatic capacity, especially as measured by the percentage of CDC-owned properties sold to new owners who pay taxes on those properties. Staff in one CDC successfully leveraged NIS to improve public participation, a measure of political capacity. The findings also suggest that access to NIS does not fundamentally change CDC priorities. This research helps to fill specific gaps in multiple bodies of knowledge and features an in-depth analysis of threats to validity, practical implications for decision-making with NIS, and recommendations for NIS developers and funders. Developers and funders in other cities may wish to consider their role as not just democratizing data, but providing a platform for partnerships by enabling organizations to better share data in order to achieve shared objectives.
    Ph.D., Urban and Regional Planning, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/111415/1/davideps_1.pd

    Transactional and analytical data management on persistent memory

    The increasing number of smart devices and sensors, as well as the rise of social media, causes the volume of data, and with it the required processing speed, to grow steadily. At the same time, many applications need to store data persistently or even comply with strict transactional guarantees. The novel storage technology Persistent Memory (PMem), with its unique properties, seems to be a natural candidate to meet these requirements efficiently: compared to DRAM it is more scalable, less expensive, and durable, and in contrast to disks it is significantly faster and directly addressable. This dissertation therefore investigates how PMem can be deliberately employed to meet the needs of modern applications. After presenting the fundamentals of how PMem works and how to work with it, we focus primarily on three aspects of data management. First, we disassemble several persistent data and index structures into their underlying design primitives to reveal the trade-offs for various access patterns. This allows us to identify their best use cases and weaknesses, but also to gain general insights into the design of PMem-based data structures. Second, we propose two storage layouts that target analytical workloads and enable efficient query execution on arbitrary attributes. While the first approach employs a linked list of multi-dimensional clustered blocks that potentially span several storage layers, the second approach is a multi-dimensional index that caches nodes in DRAM. Third, we show how to improve stream and event processing systems involving transactional state management, using the preceding data structures and insights. In this context, we propose a novel Transactional Stream Processing (TSP) model with appropriate consistency and concurrency protocols adapted to PMem. Together, the discussed aspects are intended to provide a foundation for developing even more sophisticated PMem-enabled systems. At the same time, they show how data management tasks can take advantage of PMem by opening up new application domains; improving performance, scalability, and recovery guarantees; simplifying code complexity; and reducing economic and environmental costs.
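    A recurring building block behind such PMem-based data structures is the flush-and-fence ordering that makes individual updates crash-consistent. The following is a minimal sketch assuming an x86 CPU with the CLWB instruction (compile with -mclwb); it is illustrative only, and production systems would typically use a PMem library such as PMDK over memory mapped from an actual persistent device:

```cpp
// Illustrative sketch of the flush-and-fence idiom for crash-consistent
// updates on PMem. Assumes x86 with CLWB support; the struct and functions
// are invented for this example.
#include <immintrin.h>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kCacheLine = 64;

// Write back every cache line covering [addr, addr+len) and order the
// write-backs before subsequent stores.
inline void persist(const void* addr, std::size_t len) {
    auto p = reinterpret_cast<uintptr_t>(addr) & ~(kCacheLine - 1);
    for (; p < reinterpret_cast<uintptr_t>(addr) + len; p += kCacheLine)
        _mm_clwb(reinterpret_cast<void*>(p));
    _mm_sfence();
}

struct Record {
    char payload[56];
    uint8_t valid;  // published last, so a crash never exposes a torn payload
};

// Crash-consistent publish: make the payload durable first, then set and
// persist the valid flag.
void publish(Record* r, const char* data, std::size_t n) {
    std::memcpy(r->payload, data, n < sizeof(r->payload) ? n : sizeof(r->payload));
    persist(r->payload, sizeof(r->payload));  // step 1: payload durable
    r->valid = 1;
    persist(&r->valid, sizeof(r->valid));     // step 2: flag durable
}

int main() {
    Record r{};  // stand-in for a PMem-resident record; ordinary DRAM here,
                 // so this only demonstrates the ordering discipline
    publish(&r, "hello", 5);
}
```

    Recovery code then trusts only records whose valid flag is set, so a crash between the two persist steps loses the in-flight record but never exposes a torn one.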