Search CORE

3,899 research outputs found

A Survey on Array Storage, Query Languages, and Systems

Author: Cheng Yu
Rusu Florin
Publication venue
Publication date: 19/02/2013
Field of study

Since scientific investigation is one of the most important providers of massive amounts of ordered data, there is a renewed interest in array data processing in the context of Big Data. To the best of our knowledge, a unified resource that summarizes and analyzes array processing research over its long existence is currently missing. In this survey, we provide a guide for past, present, and future research in array processing. The survey is organized along three main topics. Array storage discusses all the aspects related to array partitioning into chunks. The identification of a reduced set of array operators to form the foundation for an array query language is analyzed across multiple such proposals. Lastly, we survey real systems for array processing. The result is a thorough survey on array data storage and processing that should be consulted by anyone interested in this research topic, independent of experience level. The survey is not complete though. We greatly appreciate pointers towards any work we might have forgotten to mention.Comment: 44 page

arXiv.org e-Print Archive

CiteSeerX

Formal Representation of the SS-DB Benchmark and Experimental Evaluation in EXTASCID

Author: Cheng Yu
Rusu Florin
Publication venue
Publication date: 01/01/2013
Field of study

Evaluating the performance of scientific data processing systems is a difficult task considering the plethora of application-specific solutions available in this landscape and the lack of a generally-accepted benchmark. The dual structure of scientific data coupled with the complex nature of processing complicate the evaluation procedure further. SS-DB is the first attempt to define a general benchmark for complex scientific processing over raw and derived data. It fails to draw sufficient attention though because of the ambiguous plain language specification and the extraordinary SciDB results. In this paper, we remedy the shortcomings of the original SS-DB specification by providing a formal representation in terms of ArrayQL algebra operators and ArrayQL/SciQL constructs. These are the first formal representations of the SS-DB benchmark. Starting from the formal representation, we give a reference implementation and present benchmark results in EXTASCID, a novel system for scientific data processing. EXTASCID is complete in providing native support both for array and relational data and extensible in executing any user code inside the system by the means of a configurable metaoperator. These features result in an order of magnitude improvement over SciDB at data loading, extracting derived data, and operations over derived data.Comment: 32 pages, 3 figure

arXiv.org e-Print Archive

CiteSeerX

eScholarship - University of California

Reconstructing Rational Functions with $\texttt{FireFly}$

Author: Klappert Jonas
Lange Fabian
Publication venue: 'Elsevier BV'
Publication date: 11/10/2019
Field of study

We present the open-source

\texttt{C++}

library

\texttt{FireFly}

for the reconstruction of multivariate rational functions over finite fields. We discuss the involved algorithms and their implementation. As an application, we use

\texttt{FireFly}

in the context of integration-by-parts reductions and compare runtime and memory consumption to a fully algebraic approach with the program

\texttt{Kira}

.Comment: 46 pages, 3 figures, 6 tables; v2: matches published versio

arXiv.org e-Print Archive

Publikationsserver der RWTH Aachen University

RAM: array processing over a relational DBMS

Author: Ballegooij A.R. van
Kersten M.L. (Martin)
Vries A.P. (Arjen) de
Publication venue: CWI
Publication date: 01/01/2003
Field of study

Developing multimedia applications in relational databases is hindered by a mismatch in computational frameworks. Efficient manipulation of multimedia data calls for array-based processing, which at best is available as a database add-on, not supported by the query optimizer. As a result, array-based processing ends up in dedicated programs outside the DBMS: non-reusable black boxes. The goal of our research is to reduce this gap between user-needs and system functionality by developing a seemless integration of array processing in a relational algebra engine. The paper introduces a declarative language for array-expressions based on the array comprehension, and its mapping to a relational kernel in a prototype implementation. The layered architecture of the resulting array database management system allows the use of structural knowledge available in the array data type. This additional source of information can be exploited for query optimization, which is demonstrated with a case study. The experiments show how the performance of a standard tool for matrix computations can be achieved without sacrificing data independence, highlighting however a critical aspect in the DBMS architecture proposed

CWI's Institutional Repository

Similarity of Semantic Relations

Author: Morris Jane
Peter D. Turney
Publication venue
Publication date: 01/01/2006
Field of study

There are at least two kinds of similarity. Relational similarity is correspondence between relations, in contrast with attributional similarity, which is correspondence between attributes. When two words have a high degree of attributional similarity, we call them synonyms. When two pairs of words have a high degree of relational similarity, we say that their relations are analogous. For example, the word pair mason:stone is analogous to the pair carpenter:wood. This paper introduces Latent Relational Analysis (LRA), a method for measuring relational similarity. LRA has potential applications in many areas, including information extraction, word sense disambiguation, and information retrieval. Recently the Vector Space Model (VSM) of information retrieval has been adapted to measuring relational similarity, achieving a score of 47% on a collection of 374 college-level multiple-choice word analogy questions. In the VSM approach, the relation between a pair of words is characterized by a vector of frequencies of predefined patterns in a large corpus. LRA extends the VSM approach in three ways: (1) the patterns are derived automatically from the corpus, (2) the Singular Value Decomposition (SVD) is used to smooth the frequency data, and (3) automatically generated synonyms are used to explore variations of the word pairs. LRA achieves 56% on the 374 analogy questions, statistically equivalent to the average human score of 57%. On the related problem of classifying semantic relations, LRA achieves similar gains over the VSM

arXiv.org e-Print Archive

CiteSeerX

NRC Publications Archive

Crossref

CogPrints Cognitive Sciences Eprint Archive

On the Effect of Semantically Enriched Context Models on Software Modularization

Author: Hage Jurriaan
Jansen Slinger
Khadka Ravi
Saeidi Amir
Publication venue: 'Aspect-Oriented Software Association (AOSA)'
Publication date: 04/08/2017
Field of study

Many of the existing approaches for program comprehension rely on the linguistic information found in source code, such as identifier names and comments. Semantic clustering is one such technique for modularization of the system that relies on the informal semantics of the program, encoded in the vocabulary used in the source code. Treating the source code as a collection of tokens loses the semantic information embedded within the identifiers. We try to overcome this problem by introducing context models for source code identifiers to obtain a semantic kernel, which can be used for both deriving the topics that run through the system as well as their clustering. In the first model, we abstract an identifier to its type representation and build on this notion of context to construct contextual vector representation of the source code. The second notion of context is defined based on the flow of data between identifiers to represent a module as a dependency graph where the nodes correspond to identifiers and the edges represent the data dependencies between pairs of identifiers. We have applied our approach to 10 medium-sized open source Java projects, and show that by introducing contexts for identifiers, the quality of the modularization of the software systems is improved. Both of the context models give results that are superior to the plain vector representation of documents. In some cases, the authoritativeness of decompositions is improved by 67%. Furthermore, a more detailed evaluation of our approach on JEdit, an open source editor, demonstrates that inferred topics through performing topic analysis on the contextual representations are more meaningful compared to the plain representation of the documents. The proposed approach in introducing a context model for source code identifiers paves the way for building tools that support developers in program comprehension tasks such as application and domain concept location, software modularization and topic analysis

arXiv.org e-Print Archive

Heriot Watt Pure

Crossref

ZENODO

Utrecht University Repository

FigShare

Recommended from our members

Injecting Inductive Biases into Distributed Representations of Text

Author: Prokhorov Victor
Publication venue: University of Cambridge
Publication date: 12/11/2021
Field of study

Distributed real-valued vector representations of text (a.k.a. embeddings), learned by neural networks, encode various (linguistic) knowledge. To encode this knowledge into the embeddings the common approach is to train a large neural network on large corpora. There is, however, a growing concern regarding the sustainability and rationality of pursuing this approach further. We depart from the mainstream trend and instead, to incorporate the desired properties into embeddings, use inductive biases. First, we use Knowledge Graphs (KGs) as a data-based inductive bias to derive the semantic representation of words and sentences. The explicit semantics that is encoded in a structure of a KG allows us to acquire the semantic representations without the need of employing a large amount of text. We use graph embedding techniques to learn the semantic representation of words and the sequence-to-sequence model to learn the semantic representation of sentences. We demonstrate the efficacy of the inductive bias for learning embeddings for rare words and the ability of sentence embeddings to encode topological dependencies that exist between entities of a KG. Then, we explore the amount of information and sparsity as two key (data-agnostic) inductive biases to regulate the utilisation of the representation space. We impose these properties with Variational Autoencoders (VAEs). First, we regulate the amount of information encoded in a sentence embedding via constraint optimisation of a VAE objective function. We show that increasing amount of information allows to better discriminate sentences. Afterwards, to impose distributed sparsity we design a state-of-the-art Hierarchical Sparse VAE with a flexible posterior which captures the statistical characteristics of text effectively. While sparsity, in general, has desired computational and statistical representational properties, it is known to compensate task performance. We illustrate that with distributed sparsity, task performance could be maintained or even improved. The findings of the thesis advocate further development of inductive biases that could mitigate the dependence of representation learning quality on large data and model sizes

Apollo (Cambridge)

Density-Aware Linear Algebra in a Column-Oriented In-Memory Database System

Author: Kernert David
Publication venue
Publication date: 20/09/2016
Field of study

Linear algebra operations appear in nearly every application in advanced analytics, machine learning, and of various science domains. Until today, many data analysts and scientists tend to use statistics software packages or hand-crafted solutions for their analysis. In the era of data deluge, however, the external statistics packages and custom analysis programs that often run on single-workstations are incapable to keep up with the vast increase in data volume and size. In particular, there is an increasing demand of scientists for large scale data manipulation, orchestration, and advanced data management capabilities. These are among the key features of a mature relational database management system (DBMS). With the rise of main memory database systems, it now has become feasible to also consider applications that built up on linear algebra. This thesis presents a deep integration of linear algebra functionality into an in-memory column-oriented database system. In particular, this work shows that it has become feasible to execute linear algebra queries on large data sets directly in a DBMS-integrated engine (LAPEG), without the need of transferring data and being restricted by hard disc latencies. From various application examples that are cited in this work, we deduce a number of requirements that are relevant for a database system that includes linear algebra functionality. Beside the deep integration of matrices and numerical algorithms, these include optimization of expressions, transparent matrix handling, scalability and data-parallelism, and data manipulation capabilities. These requirements are addressed by our linear algebra engine. In particular, the core contributions of this thesis are: firstly, we show that the columnar storage layer of an in-memory DBMS yields an easy adoption of efficient sparse matrix data types and algorithms. Furthermore, we show that the execution of linear algebra expressions significantly benefits from different techniques that are inspired from database technology. In a novel way, we implemented several of these optimization strategies in LAPEG’s optimizer (SpMachO), which uses an advanced density estimation method (SpProdest) to predict the matrix density of intermediate results. Moreover, we present an adaptive matrix data type AT Matrix to obviate the need of scientists for selecting appropriate matrix representations. The tiled substructure of AT Matrix is exploited by our matrix multiplication to saturate the different sockets of a multicore main-memory platform, reaching up to a speed-up of 6x compared to alternative approaches. Finally, a major part of this thesis is devoted to the topic of data manipulation; where we propose a matrix manipulation API and present different mutable matrix types to enable fast insertions and deletes. We finally conclude that our linear algebra engine is well-suited to process dynamic, large matrix workloads in an optimized way. In particular, the DBMS-integrated LAPEG is filling the linear algebra gap, and makes columnar in-memory DBMS attractive as efficient, scalable ad-hoc analysis platform for scientists

Technische Universität Dresden: Qucosa