Fine-grained visualization pipelines and lazy functional languages
The pipeline model in visualization has evolved from a conceptual model of data processing into a widely used architecture for implementing visualization systems. In the process, a number of capabilities have been introduced, including streaming of data in chunks, distributed pipelines, and demand-driven processing. Visualization systems have invariably been built on stateful programming technologies, and these capabilities have had to be implemented explicitly within the lower layers of a complex hierarchy of services. The good news for developers is that applications built on top of this hierarchy can access these capabilities without concern for how they are implemented. The bad news is that by freezing capabilities into low-level services, expressive power and flexibility are lost. In this paper we express visualization systems in a programming language that more naturally supports this kind of processing model. Lazy functional languages support fine-grained demand-driven processing, a natural form of streaming, and pipeline-like function composition for assembling applications. The technology thus appears well suited to visualization applications. Using surface extraction algorithms as illustrative examples, and the lazy functional language Haskell, we argue the benefits of clear and concise expression combined with fine-grained, demand-driven computation. Just as visualization provides insight into data, functional abstraction provides new insight into visualization.
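To make the composition model concrete, here is a minimal Haskell sketch, with hypothetical stage names and a placeholder extraction step, of a pipeline built from ordinary function composition over lazy lists; laziness means each stage computes only the portion of its input that downstream stages actually demand.

```haskell
-- Minimal sketch (hypothetical names): a visualization pipeline as
-- composed functions over lazy lists, which act as demand-driven streams.
type Sample  = Double
type Contour = (Double, Double)   -- a stand-in for extracted geometry

-- An unbounded source; only the demanded prefix is ever evaluated.
readData :: [Sample]
readData = map sin [0.0, 0.1 ..]

threshold :: Double -> [Sample] -> [Sample]
threshold t = filter (> t)

-- Placeholder "surface extraction" stage.
extract :: [Sample] -> [Contour]
extract xs = zip xs (map (* 2) xs)

-- The pipeline is just function composition.
pipeline :: [Contour]
pipeline = extract (threshold 0.5 readData)

main :: IO ()
main = print (take 5 pipeline)    -- demand drives the computation
```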
Rumble: Data Independence for Large Messy Data Sets
This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous, and nested collections of JSON objects, leveraging the parallel capabilities of Spark so as to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR clauses to Spark SQL on DataFrames. We have developed a working implementation of these mappings showing that JSONiq can efficiently run on Spark to query billions of objects, scaling at least into the terabyte range. The JSONiq code is concise in comparison to Spark's host languages while seamlessly supporting the nested, heterogeneous data sets that Spark SQL does not. The ability to process this kind of input, which is commonly encountered, is paramount for data cleaning and curation. The experimental analysis indicates that there is no excessive performance loss, and occasionally even a gain, over Spark SQL for structured data, and a performance gain over PySpark. This demonstrates that a language such as JSONiq is a simple and viable approach to large-scale querying of denormalized, heterogeneous, arborescent data sets, in the same way that SQL can be leveraged for structured data sets. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does for highly structured tables.
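As an illustration of insight (i), a FLWOR iteration is essentially a map/filter pipeline; the following Haskell sketch (an analogy with invented field names, not Rumble's code) shows the correspondence that makes push-down onto bulk Spark transformations possible.

```haskell
-- Sketch by analogy: a JSONiq FLWOR expression like
--   for $o in $orders where $o.qty gt 100 return {"id": $o.id}
-- corresponds to an iterate/filter/project pipeline, the shape of
-- transformations an engine can run in parallel over partitions.
import qualified Data.Map as M

type Json = M.Map String Int   -- heterogeneous objects simplified to Int fields

flwor :: [Json] -> [Json]
flwor orders =
  [ M.fromList [("id", i)]                    -- return clause: projection
  | o <- orders                               -- for clause: iteration
  , Just q <- [M.lookup "qty" o], q > 100     -- where clause: filter
  , Just i <- [M.lookup "id" o] ]

main :: IO ()
main = print (flwor [ M.fromList [("id", 1), ("qty", 250)]
                    , M.fromList [("id", 2), ("qty",  40)] ])
```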
Data Provenance and Management in Radio Astronomy: A Stream Computing Approach
New approaches for data provenance and data management (DPDM) are required for mega-science projects like the Square Kilometre Array, which are characterized by extremely large data volumes and intense data rates, and therefore demand innovative and highly efficient computational paradigms. In this context, we explore a stream-computing approach with an emphasis on the use of accelerators. In particular, we make use of a new generation of high-performance stream-based parallelization middleware known as InfoSphere Streams. Its viability for managing and ensuring the interoperability and integrity of signal processing data pipelines is demonstrated in radio astronomy. IBM InfoSphere Streams embraces the stream-computing paradigm: a shift from conventional data mining techniques (involving analysis of existing data from databases) towards real-time analytic processing. We discuss using InfoSphere Streams for effective DPDM in radio astronomy and propose a way in which InfoSphere Streams can be utilized for large antenna arrays. We present a case study, an InfoSphere Streams implementation of an autocorrelating spectrometer, and use this example to discuss the advantages of the stream-computing approach and the utilization of hardware accelerators.
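The computational core of such a spectrometer is lag-wise autocorrelation over windows of the incoming signal; the following is a minimal Haskell sketch of that kernel, with a placeholder input stream and window size, not the InfoSphere Streams implementation.

```haskell
-- Core computation sketched: for each window of samples, the
-- autocorrelation at lags 0..maxLag,
--   r(k) = sum over n of x[n] * x[n+k],
-- whose Fourier transform yields the power spectrum.
autocorr :: Int -> [Double] -> [Double]
autocorr maxLag xs =
  [ sum (zipWith (*) xs (drop k xs)) | k <- [0 .. maxLag] ]

-- Chop an incoming stream into fixed-size windows, one result per window.
windows :: Int -> [Double] -> [[Double]]
windows n = takeWhile ((== n) . length) . map (take n) . iterate (drop n)

main :: IO ()
main = mapM_ (print . autocorr 3) (windows 8 signal)
  where signal = map sin [0.0, 0.5 .. 7.5]   -- placeholder input stream
```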
Convex Cubes
In various approaches, data cubes are pre-computed in order to answer OLAP queries efficiently. The notion of data cube has been refined in various ways: iceberg cubes, range cubes, and differential cubes. In this paper, we introduce the concept of the convex cube, which captures all the tuples of a datacube satisfying a constraint combination. It can be represented in a very compact way in order to optimize both computation time and required storage space. The convex cube is not an additional structure appended to the list of cube variants; rather, we propose it as a unifying structure that we use to characterize, in a simple, sound, and homogeneous way, the other quoted types of cubes. Finally, we introduce the concept of the emerging cube, which captures significant trend inversions.
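As a rough illustration of the idea (with invented data and thresholds), a convex selection keeps exactly the aggregated tuples whose measure satisfies a conjunction of a monotone and an antimonotone constraint, as in an iceberg-style COUNT range:

```haskell
-- Sketch (hypothetical data): a "convex" selection over cube tuples,
-- i.e. those whose measure lies between a monotone lower bound and an
-- antimonotone upper bound, as in COUNT >= minsup AND COUNT <= maxsup.
type CubeTuple = ([String], Int)   -- (dimension values, aggregated count)

convexCube :: Int -> Int -> [CubeTuple] -> [CubeTuple]
convexCube minsup maxsup =
  filter (\(_, c) -> c >= minsup && c <= maxsup)

main :: IO ()
main = print (convexCube 2 10
         [ (["east", "tv"   ], 12)    -- pruned: above maxsup
         , (["east", "radio"],  5)    -- kept
         , (["west", "tv"   ],  1) ]) -- pruned: below minsup
```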
Using Fuzzy Linguistic Representations to Provide Explanatory Semantics for Data Warehouses
A data warehouse integrates large amounts of extracted and summarized data from multiple sources for direct querying and analysis. While it provides decision makers with easy access to such historical and aggregate data, the real meaning of the data has been ignored. For example, "whether a total sales amount of 1,000 items indicates a good or bad sales performance" is still unclear. From the decision makers' point of view, it is the semantics conveying the meaning of the data, rather than the raw numbers, that is important. In this paper, we explore the use of fuzzy technology to provide this semantics for the summarizations and aggregates developed in data warehousing systems. A three-layered data warehouse semantic model, consisting of quantitative (numerical) summarization, qualitative (categorical) summarization, and quantifier summarization, is proposed for capturing and explicating the semantics of warehoused data. Based on the model, several algebraic operators are defined. We also extend the SQL language to allow for flexible queries against such enhanced data warehouses.
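As a hedged illustration of the qualitative layer, the abstract's own example, whether 1,000 items is good or bad sales, can be captured by membership functions that map a raw aggregate to degrees of linguistic labels; the functions below are invented for the sketch.

```haskell
-- Sketch with invented membership functions: map a raw aggregate to
-- fuzzy degrees of linguistic labels (the qualitative layer).
ramp :: Double -> Double -> Double -> Double
ramp lo hi x = max 0 (min 1 ((x - lo) / (hi - lo)))

goodSales, badSales :: Double -> Double
goodSales = ramp 500 1500          -- degree of "good" rises from 500 to 1500
badSales  x = 1 - goodSales x      -- complement, for illustration only

describe :: Double -> (String, Double)
describe x
  | g >= 0.5  = ("good sales performance", g)
  | otherwise = ("bad sales performance", 1 - g)
  where g = goodSales x

main :: IO ()
main = print (describe 1000)       -- ("good sales performance", 0.5)
```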
Dwarf: A Complete System for Analyzing High-Dimensional Data Sets
The need for data analysis in industries including telecommunications, retail, manufacturing, and financial services has generated a flurry of research, highly sophisticated methods, and commercial products. However, all current attempts are haunted by the so-called "high-dimensionality curse": the complexity of space and time increases exponentially with the number of analysis "dimensions". This means that all existing approaches are limited to coarse levels of analysis and/or to approximate answers with reduced precision. As the need for detailed analysis keeps increasing, along with the volume and the detail of the data being stored, these approaches are very quickly rendered unusable. I have developed a unique method for efficiently performing analysis that is not affected by the high dimensionality of the data and scales only polynomially (and almost linearly) with the dimensions, without sacrificing any accuracy in the returned results. I have implemented a complete system (called "Dwarf") and performed an extensive experimental evaluation that demonstrated tremendous improvements over existing methods in all aspects of analysis: initial computation, storage, querying, and updating.
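The published Dwarf structure achieves this largely by coalescing redundant prefixes and suffixes of cube cells; the following toy Haskell sketch shows prefix sharing only (suffix coalescing omitted), with invented data, and is an illustration rather than the actual system.

```haskell
-- Toy sketch of prefix coalescing: tuples sharing a dimension-value
-- prefix share one path in a trie, so common prefixes of cube cells
-- are stored exactly once. Suffix coalescing is omitted here.
import qualified Data.Map as M

data Trie = Trie { total :: Int, children :: M.Map String Trie }

emptyT :: Trie
emptyT = Trie 0 M.empty

insert :: [String] -> Int -> Trie -> Trie
insert []     v (Trie t cs) = Trie (t + v) cs
insert (d:ds) v (Trie t cs) =
  Trie (t + v) (M.insertWith (\_ old -> insert ds v old) d
                             (insert ds v emptyT) cs)

-- Aggregate along any prefix (a cube query with trailing ALLs).
query :: [String] -> Trie -> Int
query []     tr = total tr
query (d:ds) tr = maybe 0 (query ds) (M.lookup d (children tr))

main :: IO ()
main = do
  let cube = foldr (\(p, v) -> insert p v) emptyT
               [ (["east", "tv"   ], 12)
               , (["east", "radio"],  5)
               , (["west", "tv"   ],  7) ]
  print (query ["east"] cube)   -- 17: the shared "east" prefix stored once
```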
I have extended my research to the "data-streaming" model, where updates are performed on-line; this exacerbates any concurrent analysis but has a very high impact on applications like security, network management/monitoring, router traffic control, and sensor networks. I have devised streaming algorithms that provide complex statistics within user-specified relative-error bounds over a data stream. I introduced the class of "distinct implicated statistics", which is much more general than the established class of "distinct count" statistics. The latter has proved invaluable in applications such as analyzing and monitoring the distinct count of species in a population, or even in query optimization. The "distinct implicated statistics" class provides invaluable information about the correlations in the stream and is necessary for applications such as security. My algorithms are designed to use bounded amounts of memory and processing, so that they can even be implemented in hardware for resource-limited environments such as network routers or sensors, and also to work in "noisy" environments, where some data may be flawed either implicitly due to the extraction process or explicitly.
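For the established "distinct count" class, bounded-memory estimation is classically done with probabilistic counting; the sketch below is a Flajolet-Martin-style stand-in (with a toy hash function), illustrating the one-pass, constant-state discipline rather than the thesis's own algorithms.

```haskell
-- Bounded-memory distinct counting in the Flajolet-Martin style
-- (a stand-in illustration): track the maximum number of trailing zero
-- bits seen in hashed stream items; the estimate is about 2^maxZeros.
import Data.Bits (popCount, (.&.), xor, shiftR)
import Data.Word (Word64)

-- Toy 64-bit mixer standing in for a proper hash function.
hash :: Word64 -> Word64
hash x0 = let x1 = (x0 `xor` (x0 `shiftR` 33)) * 0xff51afd7ed558ccd
              x2 = (x1 `xor` (x1 `shiftR` 33)) * 0xc4ceb9fe1a85ec53
          in  x2 `xor` (x2 `shiftR` 33)

trailingZeros :: Word64 -> Int
trailingZeros 0 = 64
trailingZeros w = popCount ((w .&. negate w) - 1)

-- One pass, O(1) state: suitable for routers or sensors.
estimateDistinct :: [Word64] -> Double
estimateDistinct xs = 2 ^^ maximum (0 : map (trailingZeros . hash) xs)

main :: IO ()
main = print (estimateDistinct (map (`mod` 1000) [1 .. 100000]))
```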
Analysis of Hardware Descriptions
The design process for integrated circuits requires extensive analysis of circuit descriptions. An important class of analyses determines how easy it will be to detect whether a physical component suffers from any manufacturing errors. As circuit complexities grow rapidly, the problem of testing circuits also becomes increasingly difficult. This thesis explores the potential for analysing a recent high-level hardware description language called Ruby. In particular, we are interested in performing testability analyses of Ruby circuit descriptions. Ruby is amenable to algebraic manipulation, so we have sought transformations that improve testability while preserving behaviour. The analysis of Ruby descriptions is performed by adapting a technique called abstract interpretation, which has been used successfully to analyse functional programs. This technique is most applicable where the analysis to be captured operates over structures isomorphic to the structure of the circuit. Many digital systems analysis tools require the circuit description to be given in some special form. This can lead to inconsistency between representations, and involves additional work converting between representations. We propose using the original description medium, in this case Ruby, for performing analyses. A related technique, called non-standard interpretation, is shown to be very useful for capturing many circuit analyses. An implementation of a system that performs non-standard interpretation forms the central part of the work. This allows Ruby descriptions to be analysed using alternative interpretations such as test pattern generation and circuit layout interpretations. This system follows a similar approach to Boute's system semantics work and O'Donnell's work on Hydra; however, we have allowed a larger class of interpretations to be captured and offer a richer description language. The implementation presented here is constructed to allow a large degree of code sharing between different analyses. Several analyses have been implemented, including simulation, test pattern generation, and circuit layout. Non-standard interpretation provides a good framework for implementing these analyses. A general model for making non-standard interpretations is presented. Combining forms that merge two interpretations to produce a new interpretation are also introduced. This allows complex circuit analyses to be decomposed in a modular manner into smaller circuit analyses which can be built independently.
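Non-standard interpretation, one description evaluated under several semantics, is naturally expressed in a typed functional setting; the following minimal Haskell sketch (not the thesis's Ruby system) gives one circuit description two interpretations, simulation and a stand-in gate-count analysis.

```haskell
-- Minimal sketch of non-standard interpretation: one circuit
-- description, many semantics. Each instance of the class is a
-- different "interpretation" of the same description.
class Circuit s where
  inv  :: s -> s
  and2 :: s -> s -> s

-- The circuit is written once, against the abstract interface.
nand2 :: Circuit s => s -> s -> s
nand2 a b = inv (and2 a b)

-- Standard interpretation: simulation over booleans.
instance Circuit Bool where
  inv  = not
  and2 = (&&)

-- Non-standard interpretation: gate counting (a stand-in for analyses
-- such as testability estimation or layout).
newtype Gates = Gates Int deriving Show

instance Circuit Gates where
  inv  (Gates a)           = Gates (a + 1)
  and2 (Gates a) (Gates b) = Gates (a + b + 1)

main :: IO ()
main = do
  print (nand2 True False)            -- simulation: True
  print (nand2 (Gates 0) (Gates 0))   -- gate count: Gates 2
```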