2,190 research outputs found

    CubiST++: Evaluating Ad-Hoc CUBE Queries Using Statistics Trees

    Get PDF
    We report on a new, efficient encoding for the data cube, which results in a drastic speed-up of OLAP queries that aggregate along any combination of dimensions over numerical and categorical attributes. We are focusing on a class of queries called cube queries, which return aggregated values rather than sets of tuples. Our approach, termed CubiST++ (Cubing with Statistics Trees Plus Families), represents a drastic departure from existing relational (ROLAP) and multi-dimensional (MOLAP) approaches in that it does not use the view lattice to compute and materialize new views from existing views in some heuristic fashion. Instead, CubiST++ encodes all possible aggregate views in the leaves of a new data structure called statistics tree (ST) during a one-time scan of the detailed data. In order to optimize the queries involving constraints on hierarchy levels of the underlying dimensions, we select and materialize a family of candidate trees, which represent superviews over the different hierarchical levels of the dimensions. Given a query, our query evaluation algorithm selects the smallest tree in the family, which can provide the answer. Extensive evaluations of our prototype implementation have demonstrated its superior run-time performance and scalability when compared with existing MOLAP and ROLAP systems

    Parallel Algorithms for Geometric Graph Problems

    Full text link
    We give algorithms for geometric graph problems in the modern parallel models inspired by MapReduce. For example, for the Minimum Spanning Tree (MST) problem over a set of points in the two-dimensional space, our algorithm computes a (1+ϵ)(1+\epsilon)-approximate MST. Our algorithms work in a constant number of rounds of communication, while using total space and communication proportional to the size of the data (linear space and near linear time algorithms). In contrast, for general graphs, achieving the same result for MST (or even connectivity) remains a challenging open problem, despite drawing significant attention in recent years. We develop a general algorithmic framework that, besides MST, also applies to Earth-Mover Distance (EMD) and the transportation cost problem. Our algorithmic framework has implications beyond the MapReduce model. For example it yields a new algorithm for computing EMD cost in the plane in near-linear time, n1+oϵ(1)n^{1+o_\epsilon(1)}. We note that while recently Sharathkumar and Agarwal developed a near-linear time algorithm for (1+ϵ)(1+\epsilon)-approximating EMD, our algorithm is fundamentally different, and, for example, also solves the transportation (cost) problem, raised as an open question in their work. Furthermore, our algorithm immediately gives a (1+ϵ)(1+\epsilon)-approximation algorithm with nδn^{\delta} space in the streaming-with-sorting model with 1/δO(1)1/\delta^{O(1)} passes. As such, it is tempting to conjecture that the parallel models may also constitute a concrete playground in the quest for efficient algorithms for EMD (and other similar problems) in the vanilla streaming model, a well-known open problem

    The Skip Quadtree: A Simple Dynamic Data Structure for Multidimensional Data

    Full text link
    We present a new multi-dimensional data structure, which we call the skip quadtree (for point data in R^2) or the skip octree (for point data in R^d, with constant d>2). Our data structure combines the best features of two well-known data structures, in that it has the well-defined "box"-shaped regions of region quadtrees and the logarithmic-height search and update hierarchical structure of skip lists. Indeed, the bottom level of our structure is exactly a region quadtree (or octree for higher dimensional data). We describe efficient algorithms for inserting and deleting points in a skip quadtree, as well as fast methods for performing point location and approximate range queries.Comment: 12 pages, 3 figures. A preliminary version of this paper appeared in the 21st ACM Symp. Comp. Geom., Pisa, 2005, pp. 296-30

    Sparse Volumetric Deformation

    Get PDF
    Volume rendering is becoming increasingly popular as applications require realistic solid shape representations with seamless texture mapping and accurate filtering. However rendering sparse volumetric data is difficult because of the limited memory and processing capabilities of current hardware. To address these limitations, the volumetric information can be stored at progressive resolutions in the hierarchical branches of a tree structure, and sampled according to the region of interest. This means that only a partial region of the full dataset is processed, and therefore massive volumetric scenes can be rendered efficiently. The problem with this approach is that it currently only supports static scenes. This is because it is difficult to accurately deform massive amounts of volume elements and reconstruct the scene hierarchy in real-time. Another problem is that deformation operations distort the shape where more than one volume element tries to occupy the same location, and similarly gaps occur where deformation stretches the elements further than one discrete location. It is also challenging to efficiently support sophisticated deformations at hierarchical resolutions, such as character skinning or physically based animation. These types of deformation are expensive and require a control structure (for example a cage or skeleton) that maps to a set of features to accelerate the deformation process. The problems with this technique are that the varying volume hierarchy reflects different feature sizes, and manipulating the features at the original resolution is too expensive; therefore the control structure must also hierarchically capture features according to the varying volumetric resolution. This thesis investigates the area of deforming and rendering massive amounts of dynamic volumetric content. The proposed approach efficiently deforms hierarchical volume elements without introducing artifacts and supports both ray casting and rasterization renderers. This enables light transport to be modeled both accurately and efficiently with applications in the fields of real-time rendering and computer animation. Sophisticated volumetric deformation, including character animation, is also supported in real-time. This is achieved by automatically generating a control skeleton which is mapped to the varying feature resolution of the volume hierarchy. The output deformations are demonstrated in massive dynamic volumetric scenes

    A framework for multidimensional indexes on distributed and highly-available data stores

    Get PDF
    Spatial Big Data is considered an essential trend in future scientific and business applications. Indeed, research instruments, medical devices, and social networks generate hundreds of peta bytes of spatial data per year. However, as many authors have pointed out, the lack of specialized frameworks dealing with such kind of data is limiting possible applications and probably precluding many scientific breakthroughs. In this thesis, we describe three HPC scientific applications, ranging from molecular dynamics, neuroscience analysis, and physics simulations, where we experience first hand the limits of the existing technologies. Thanks to our experience, we define the desirable missing functionalities, and we focus on two features that when combined significantly improve the way scientific data is analyzed. On one side, scientific simulations generate complex datasets where multiple correlated characteristics describe each item. For instance, a particle might have a space position (x,y,z) at a given time (t). If we want to find all elements within the same area and period, we either have to scan the whole dataset, or we must organize the data so that all items in the same space and time are stored together. The second approach is called Multidimensional Indexing (MI), and it uses different techniques to cluster and to organize similar data together. On the other side, approximate analytics has been often indicated as a smart and flexible way to explore large datasets in a short period. Approximate analytics includes a broad family of algorithms which aims to speed up analytical workloads by relaxing the precision of the results within a specific interval of confidence. For instance, if we want to know the average age in a group with 1-year precision, we can consider just a random fraction of all the people, thus reducing the amount of calculation. But if we also want less I/O operations, we need efficient data sampling, which means organizing data in a way that we do not need to scan the whole data set to generate a random sample of it. According to our analysis, combining Multidimensional Indexing with efficient data Sampling (MIS) is a vital missing feature not available in the current distributed data management solutions. This thesis aims to solve such a shortcoming and it provides novel scalable solutions. At first, we describe the existing data management alternatives; then we motivate our preference for NoSQL key-value databases. Secondly, we propose an analytical model to study the influence of data models on the scalability and performance of this kind of distributed database. Thirdly, we use the analytical model to design two novel multidimensional indexes with efficient data sampling: the D8tree and the AOTree. Our first solution, the D8tree, improves state of the art for approximate spatial queries on static and mostly read dataset. Later, we enhanced the data ingestion capability or our approach by introducing the AOTree, an algorithm that enables the query performance of the D8tree even for HPC write-intensive applications. We compared our solution with PostgreSQL and plain storage, and we demonstrate that our proposal has better performance and scalability. Finally, we describe Qbeast, the novel distributed system that implements the D8tree and the AOTree using NoSQL technologies, and we illustrate how Qbeast simplifies the workflow of scientists in various HPC applications providing a scalable and integrated solution for data analysis and management.La gestión de BigData con información espacial está considerada como una tendencia esencial en el futuro de las aplicaciones científicas y de negocio. De hecho, se generan cientos de petabytes de datos espaciales por año mediante instrumentos de investigación, dispositivos médicos y redes sociales. Sin embargo, tal y como muchos autores han señalado, la falta de entornos especializados en manejar este tipo de datos está limitando sus posibles aplicaciones y está impidiendo muchos avances científicos. En esta tesis, describimos 3 aplicaciones científicas HPC, que cubren los ámbitos de dinámica molecular, análisis neurocientífico y simulaciones físicas, donde hemos experimentado en primera mano las limitaciones de las tecnologías existentes. Gracias a nuestras experiencias, hemos podido definir qué funcionalidades serían deseables y no existen, y nos hemos centrado en dos características que, al combinarlas, mejoran significativamente la manera en la que se analizan los datos científicos. Por un lado, las simulaciones científicas generan conjuntos de datos complejos, en los que cada elemento es descrito por múltiples características correlacionadas. Por ejemplo, una partícula puede tener una posición espacial (x, y, z) en un momento dado (t). Si queremos encontrar todos los elementos dentro de la misma área y periodo, o bien recorremos y analizamos todo el conjunto de datos, o bien organizamos los datos de manera que se almacenen juntos todos los elementos que comparten área en un momento dado. Esta segunda opción se conoce como Indexación Multidimensional (IM) y usa diferentes técnicas para agrupar y organizar datos similares. Por otro lado, se suele señalar que las analíticas aproximadas son una manera inteligente y flexible de explorar grandes conjuntos de datos en poco tiempo. Este tipo de analíticas incluyen una amplia familia de algoritmos que acelera el tiempo de procesado, relajando la precisión de los resultados dentro de un determinado intervalo de confianza. Por ejemplo, si queremos saber la edad media de un grupo con precisión de un año, podemos considerar sólo un subconjunto aleatorio de todas las personas, reduciendo así la cantidad de cálculo. Pero si además queremos menos operaciones de entrada/salida, necesitamos un muestreo eficiente de datos, que implica organizar los datos de manera que no necesitemos recorrerlos todos para generar una muestra aleatoria. De acuerdo con nuestros análisis, la combinación de Indexación Multidimensional con Muestreo eficiente de datos (IMM) es una característica vital que no está disponible en las soluciones actuales de gestión distribuida de datos. Esta tesis pretende resolver esta limitación y proporciona unas soluciones novedosas que son escalables. En primer lugar, describimos las alternativas de gestión de datos que existen y motivamos nuestra preferencia por las bases de datos NoSQL basadas en clave-valor. En segundo lugar, proponemos un modelo analítico para estudiar la influencia que tienen los modelos de datos sobre la escalabilidad y el rendimiento de este tipo de bases de datos distribuidas. En tercer lugar, usamos el modelo analítico para diseñar dos novedosos algoritmos IMM: el D8tree y el AOTree. Nuestra primera solución, el D8tree, mejora el estado del arte actual para consultas espaciales aproximadas, cuando el conjunto de datos es estático y mayoritariamente de lectura. Después, mejoramos la capacidad de ingestión introduciendo el AOTree, un algoritmo que conserva el rendimiento del D8tree incluso para aplicaciones HPC intensivas en escritura. Hemos comparado nuestra solución con PostgreSQL y almacenamiento plano demostrando que nuestra propuesta mejora tanto el rendimiento como la escalabilidad. Finalmente, describimos Qbeast, el sistema que implementa los algoritmos D8tree y AOTree, e ilustramos cómo Qbeast simplifica el flujo de trabajo de los científicos ofreciendo una solución escalable e integraPostprint (published version

    Diamond-based models for scientific visualization

    Get PDF
    Hierarchical spatial decompositions are a basic modeling tool in a variety of application domains including scientific visualization, finite element analysis and shape modeling and analysis. A popular class of such approaches is based on the regular simplex bisection operator, which bisects simplices (e.g. line segments, triangles, tetrahedra) along the midpoint of a predetermined edge. Regular simplex bisection produces adaptive simplicial meshes of high geometric quality, while simplifying the extraction of crack-free, or conforming, approximations to the original dataset. Efficient multiresolution representations for such models have been achieved in 2D and 3D by clustering sets of simplices sharing the same bisection edge into structures called diamonds. In this thesis, we introduce several diamond-based approaches for scientific visualization. We first formalize the notion of diamonds in arbitrary dimensions in terms of two related simplicial decompositions of hypercubes. This enables us to enumerate the vertices, simplices, parents and children of a diamond. In particular, we identify the number of simplices involved in conforming updates to be factorial in the dimension and group these into a linear number of subclusters of simplices that are generated simultaneously. The latter form the basis for a compact pointerless representation for conforming meshes generated by regular simplex bisection and for efficiently navigating the topological connectivity of these meshes. Secondly, we introduce the supercube as a high-level primitive on such nested meshes based on the atomic units within the underlying triangulation grid. We propose the use of supercubes to associate information with coherent subsets of the full hierarchy and demonstrate the effectiveness of such a representation for modeling multiresolution terrain and volumetric datasets. Next, we introduce Isodiamond Hierarchies, a general framework for spatial access structures on a hierarchy of diamonds that exploits the implicit hierarchical and geometric relationships of the diamond model. We use an isodiamond hierarchy to encode irregular updates to a multiresolution isosurface or interval volume in terms of regular updates to diamonds. Finally, we consider nested hypercubic meshes, such as quadtrees, octrees and their higher dimensional analogues, through the lens of diamond hierarchies. This allows us to determine the relationships involved in generating balanced hypercubic meshes and to propose a compact pointerless representation of such meshes. We also provide a local diamond-based triangulation algorithm to generate high-quality conforming simplicial meshes

    BROWSING LARGE ONLINE DATA USING GENERALIZED QUERY PREVIEWS

    Get PDF
    Companies, government agencies, and other organizations are making their data available to the world over the Internet. These organizations store their data in large tables. These tables are usually kept in relational databases. Online access to such databases is common. Users query these databases with different front-ends. These front-ends use command languages, menus, or form fillin interfaces. Many of these interfaces rarely give users information about the contents and distribution of the data. This leads users to waste time and network resources posing queries that have zero-hit or mega-hit results. Generalized query previews forms a user interface architecture for efficient browsing of large online data. Generalized query previews supplies distribution information to the users. This provides an overview of the data. Generalized query previews gives continuous feedback about the size of the results as the query is being formed. This provides a preview of the results. Generalized query previews allows users to visually browse all of the attributes of the data. Users can select from these attributes to form a view. Views are used to display the distribution information. Queries are incrementally and visually formed by selecting items from numerous charts attached to these views. Users continuously get feedback on the distribution information while they make their selections. Later, users fetch the desired portions of the data by sending their queries over the network. As they make informed queries, they can avoid submitting queries that will generate zero-hit or mega-hit results. Generalized query previews works on distributions. Distribution information tends to be smaller than raw data. This aspect of generalized query previews also contributes to better network performance. This dissertation presents the development of generalized query previews, field studies on various platforms, and experimental results. It also presents an architecture of the algorithms and data structures for the generalized query previews. There are three contributions of this dissertation. First, this work offers a general user interface architecture for browsing large online data. Second, it presents field studies and experimental work that define the application domain for generalized query previews. Third, it contributes to the field of algorithms and data structures. (UMIACS-TR-2001-70) (HCIL-TR-2001-22
    • …
    corecore