Void-and-Cluster Sampling of Large Scattered Data and Trajectories
We propose a data reduction technique for scattered data based on statistical
sampling. Our void-and-cluster sampling technique finds a representative subset
that is optimally distributed in the spatial domain with respect to the blue
noise property. In addition, it can adapt to a given density function, which we
use to sample regions of high complexity in the multivariate value domain more
densely. Moreover, our sampling technique implicitly defines an ordering on the
samples that enables progressive data loading and a continuous level-of-detail
representation. We extend our technique to sample time-dependent trajectories,
for example pathlines in a time interval, using an efficient and iterative
approach. Furthermore, we introduce a local and continuous error measure to
quantify how well a set of samples represents the original dataset. We apply
this error measure during sampling to guide the number of samples that are
taken. Finally, we use this error measure and other quantities to evaluate the
quality, performance, and scalability of our algorithm.
Comment: To appear in IEEE Transactions on Visualization and Computer Graphics
as a special issue from the proceedings of VIS 201
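The ordering property mentioned in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: it is a simplified greedy void-filling loop, assuming a Gaussian kernel with a hypothetical bandwidth `sigma` and a brute-force O(n^2) update. The function name `void_filling_order` is made up for this example. Each step picks the unchosen point lying in the largest "void" (lowest accumulated kernel density), so every prefix of the returned order is a spatially spread subset.

```python
import math

def void_filling_order(points, sigma=1.0):
    """Return point indices in a greedy void-filling order (illustrative sketch)."""
    n = len(points)
    density = [0.0] * n        # kernel density contributed by already chosen samples
    chosen = [False] * n
    order = []
    for _ in range(n):
        # The unchosen point with the lowest density sits in the largest void.
        # Ties are resolved in favour of the lowest index.
        best = min((i for i in range(n) if not chosen[i]), key=lambda i: density[i])
        chosen[best] = True
        order.append(best)
        # Spread the new sample's influence over all points (assumed Gaussian kernel).
        for i in range(n):
            d2 = math.dist(points[i], points[best]) ** 2
            density[i] += math.exp(-d2 / (2.0 * sigma * sigma))
    return order

if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (5.0, 0.0)]
    order = void_filling_order(pts)
    # Loading only the first k entries of `order` gives a coarse level of detail.
    print(order[:2], order)
```

Taking the first k indices of the result gives a coarse subset, and reading further indices refines it, which is the kind of progressive loading the abstract describes.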
Multivariate Pointwise Information-Driven Data Sampling and Visualization
With increasing computing capabilities of modern supercomputers, the size of
the data generated from the scientific simulations is growing rapidly. As a
result, application scientists need effective data summarization techniques
that can reduce large-scale multivariate spatiotemporal data sets while
preserving the important data properties so that the reduced data can answer
domain-specific queries involving multiple variables with sufficient accuracy.
While analyzing complex scientific events, domain experts often analyze and
visualize two or more variables together to obtain a better understanding of
the characteristics of the data features. Therefore, data summarization
techniques are required to analyze multi-variable relationships in detail and
then perform data reduction such that the important features involving multiple
variables are preserved in the reduced data. To achieve this, we
propose a data sub-sampling algorithm for statistical data
summarization that leverages pointwise information-theoretic measures to
quantify the statistical association of data points across multiple
variables and generates sub-sampled data that preserves the statistical
association among those variables. Using such reduced sampled data, we show
that multivariate feature query and analysis can be done effectively. The
efficacy of the proposed multivariate association driven sampling algorithm is
presented by applying it on several scientific data sets.
Comment: 25 page
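One common pointwise information-theoretic measure is pointwise mutual information (PMI), log2(p(x, y) / (p(x) p(y))). The abstract above does not say which measure the paper uses or how scores drive the sampling, so the following is only a sketch under assumptions: two variables that are already discretized into bins, and a hypothetical helper `pointwise_mutual_information` that scores each data point.

```python
import math
from collections import Counter

def pointwise_mutual_information(xs, ys):
    """Return the PMI (in bits) of each (x, y) data point, given binned values."""
    n = len(xs)
    count_x = Counter(xs)             # marginal counts of variable x
    count_y = Counter(ys)             # marginal counts of variable y
    count_xy = Counter(zip(xs, ys))   # joint counts
    scores = []
    for x, y in zip(xs, ys):
        p_xy = count_xy[(x, y)] / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        scores.append(math.log2(p_xy / (p_x * p_y)))
    return scores

if __name__ == "__main__":
    # Perfectly associated variables: every point scores 1 bit.
    print(pointwise_mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))
    # Independent variables: every point scores 0 bits.
    print(pointwise_mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

Points with high scores are those whose variable combination occurs more often than independence would predict; a sampler could favour such points, though that selection rule is an assumption here.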
Hillview: A trillion-cell spreadsheet for big data
Hillview is a distributed spreadsheet for browsing very large datasets that
cannot be handled by a single machine. As a spreadsheet, Hillview provides a
high degree of interactivity that permits data analysts to explore information
quickly along many dimensions while switching visualizations on a whim. To
provide the required responsiveness, Hillview introduces visualization
sketches, or vizketches, as a simple idea to produce compact data
visualizations. Vizketches combine algorithmic techniques for data
summarization with computer graphics principles for efficient rendering. While
simple, vizketches are effective at scaling the spreadsheet by parallelizing
computation, reducing communication, providing progressive visualizations, and
offering precise accuracy guarantees. Using Hillview running on eight servers,
we can navigate and visualize datasets of tens of billions of rows and
trillions of cells, much beyond the published capabilities of competing
systems.
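The idea of a compact, parallelizable summary described in the abstract above can be sketched with a mergeable histogram. This is not Hillview's API: `histogram_sketch` and `merge_sketches` are hypothetical names, and a fixed-range, fixed-bucket histogram is assumed. Each worker summarizes its own partition, and only the small bucket arrays are communicated and added together.

```python
def histogram_sketch(values, lo, hi, buckets):
    """Summarize one data partition as bucket counts over the range [lo, hi)."""
    counts = [0] * buckets
    width = (hi - lo) / buckets
    for v in values:
        if lo <= v < hi:
            # Clamp guards against floating-point rounding at the upper edge.
            index = min(int((v - lo) / width), buckets - 1)
            counts[index] += 1
    return counts

def merge_sketches(a, b):
    """Combine two partial summaries; the result matches a single-pass histogram."""
    return [x + y for x, y in zip(a, b)]

if __name__ == "__main__":
    part1 = [1, 2, 3]        # rows held by one worker
    part2 = [7, 8, 9.5]      # rows held by another worker
    merged = merge_sketches(
        histogram_sketch(part1, 0, 10, 5),
        histogram_sketch(part2, 0, 10, 5),
    )
    print(merged)
```

Because merging is a plain element-wise sum, partial results can be combined in any order, and a partially merged histogram can be shown as a progressive preview.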
Efficient Point-Cloud Processing with Primitive Shapes
This thesis presents methods for efficient processing of point-clouds based on primitive shapes. The set of considered simple parametric shapes consists of planes, spheres, cylinders, cones and tori. The algorithms developed in this work are targeted at scenarios in which the occurring surfaces can be well represented by this set of shape primitives, which is the case in many man-made environments such as industrial compounds, cities or building interiors. A primitive subsumes a set of corresponding points in the point-cloud and serves as a proxy for them. Therefore, primitives are well suited to directly address the unavoidable oversampling of large point-clouds and lay the foundation for efficient point-cloud processing algorithms. The first contribution of this thesis is a novel shape primitive detection method that is efficient even on very large and noisy point-clouds. Several applications for the detected primitives are subsequently explored, resulting in a set of novel algorithms for primitive-based point-cloud processing in the areas of compression, recognition and completion. Each of these applications directly exploits and benefits from one or more of the detected primitives' properties, such as approximation, abstraction, segmentation and continuability.
Architectural Principles for Database Systems on Storage-Class Memory
Database systems have long been optimized to hide the higher latency of storage media, yielding complex persistence mechanisms. With the advent of large DRAM capacities, it became possible to keep a full copy of the data in DRAM. Systems that leverage this possibility, such as main-memory databases, keep two copies of the data in two different formats: one in main memory and the other one in storage. The two copies are kept synchronized using snapshotting and logging. This main-memory-centric architecture yields nearly two orders of magnitude faster analytical processing than traditional, disk-centric ones. The rise of Big Data emphasized the importance of such systems with an ever-increasing need for more main memory. However, DRAM is hitting its scalability limits: It is intrinsically hard to further increase its density.
Storage-Class Memory (SCM) is a group of novel memory technologies that promise to alleviate DRAM's scalability limits. They combine the non-volatility, density, and economic characteristics of storage media with the byte-addressability and a latency close to that of DRAM. Therefore, SCM can serve as persistent main memory, thereby bridging the gap between main memory and storage. In this dissertation, we explore the impact of SCM as persistent main memory on database systems. Assuming a hybrid SCM-DRAM hardware architecture, we propose a novel software architecture for database systems that places primary data in SCM and directly operates on it, eliminating the need for explicit IO. This architecture yields many benefits: First, it obviates the need to reload data from storage to main memory during recovery, as data is discovered and accessed directly in SCM. Second, it allows replacing the traditional logging infrastructure by fine-grained, cheap micro-logging at data-structure level. Third, secondary data can be stored in DRAM and reconstructed during recovery. Fourth, system runtime information can be stored in SCM to improve recovery time. Finally, the system may retain and continue in-flight transactions in case of system failures.
However, SCM is no panacea as it raises unprecedented programming challenges. Given its byte-addressability and low latency, processors can access, read, modify, and persist data in SCM using load/store instructions at a CPU cache line granularity. The path from CPU registers to SCM is long and mostly volatile, including store buffers and CPU caches, leaving the programmer with little control over when data is persisted. Therefore, there is a need to enforce the order and durability of SCM writes using persistence primitives, such as cache line flushing instructions. This in turn creates new failure scenarios, such as missing or misplaced persistence primitives.
We devise several building blocks to overcome these challenges. First, we identify the programming challenges of SCM and present a sound programming model that solves them. Then, we tackle memory management, as the first required building block to build a database system, by designing a highly scalable SCM allocator, named PAllocator, that fulfills the versatile needs of database systems. Thereafter, we propose the FPTree, a highly scalable hybrid SCM-DRAM persistent B+-Tree that bridges the gap between the performance of transient and persistent B+-Trees. Using these building blocks, we realize our envisioned database architecture in SOFORT, a hybrid SCM-DRAM columnar transactional engine. We propose an SCM-optimized MVCC scheme that eliminates write-ahead logging from the critical path of transactions. Since SCM-resident data is near-instantly available upon recovery, the new recovery bottleneck is rebuilding DRAM-based data. To alleviate this bottleneck, we propose a novel recovery technique that achieves nearly instant responsiveness of the database by accepting queries right after recovering SCM-based data, while rebuilding DRAM-based data in the background. Additionally, SCM brings new failure scenarios that existing testing tools cannot detect. Hence, we propose an online testing framework that is able to automatically simulate power failures and detect missing or misplaced persistence primitives. Finally, our proposed building blocks can serve to build more complex systems, paving the way for future database systems on SCM.
Audit implications of electronic document management; Auditing procedure study;
Stochastic Volume Rendering of Multi-Phase SPH Data
In this paper, we present a novel method for the direct volume rendering of large smoothed-particle hydrodynamics (SPH) simulation data without transforming the unstructured data to an intermediate representation. By directly visualizing the unstructured particle data, we avoid long preprocessing times and large storage requirements. This enables the visualization of large, time-dependent, and multivariate data both as a post-process and in situ. To address the computational complexity, we introduce stochastic volume rendering that considers only a subset of particles at each step during ray marching. The sample probabilities for selecting this subset at each step are thereby determined both in a view-dependent manner and based on the spatial complexity of the data. Our stochastic volume rendering enables us to scale continuously from a fast, interactive preview to a more accurate volume rendering at higher cost. Lastly, we discuss the visualization of free-surface and multi-phase flows by including a multi-material model with volumetric and surface shading into the stochastic volume rendering.