2,455 research outputs found

Data generator for evaluating ETL process quality

Author: Abelló Gamazo Alberto
Jovanovic Petar
Nakuçi Emona
Theodorou Vasileios
Publication venue: Elsevier BV
Publication date: 01/01/2017

Obtaining the right set of data for evaluating the fulfillment of different quality factors in the extract-transform-load (ETL) process design is rather challenging. First, the real data might be out of reach due to different privacy constraints, while manually providing a synthetic set of data is known to be a labor-intensive task that needs to take various combinations of process parameters into account. More importantly, a single dataset usually does not represent the evolution of data throughout the complete process lifespan, and hence misses the plethora of possible test cases. To facilitate such a demanding task, in this paper we propose an automatic data generator (i.e., Bijoux). Starting from a given ETL process model, Bijoux extracts the semantics of data transformations, analyzes the constraints they imply over input data, and automatically generates testing datasets. Bijoux is highly modular and configurable to enable end-users to generate datasets for a variety of interesting test scenarios (e.g., evaluating specific parts of an input ETL process design, with different input dataset sizes, different distributions of data, and different operation selectivities). We have developed a running prototype that implements the functionality of our data generation framework and here we report our experimental findings showing the effectiveness and scalability of our approach. Peer reviewed. Postprint (author's final draft)
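To make the idea of generating data for a chosen operation selectivity concrete, here is a minimal sketch. It is not Bijoux's algorithm or API: the function name `generate_rows`, the `amount` column, and the filter `amount > threshold` are all hypothetical, chosen only to show how a constraint taken from a single filter operation could drive the generated values.

```python
import random

def generate_rows(n, selectivity, threshold=100, seed=0):
    """Generate n rows so that round(n * selectivity) of them satisfy
    the hypothetical filter predicate `amount > threshold`."""
    rng = random.Random(seed)
    n_pass = round(n * selectivity)
    rows = []
    for i in range(n):
        if i < n_pass:
            # Value chosen to satisfy the filter constraint.
            amount = rng.randint(threshold + 1, threshold + 100)
        else:
            # Value chosen to violate the filter constraint.
            amount = rng.randint(threshold - 100, threshold)
        rows.append({"id": i, "amount": amount})
    rng.shuffle(rows)  # avoid an artificial ordering of passing rows
    return rows

rows = generate_rows(1000, 0.25)
passed = [r for r in rows if r["amount"] > 100]
print(len(passed) / len(rows))
```

A real generator would have to reconcile the constraints of many chained operations at once; this sketch handles one filter only.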

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Contextual Media Retrieval Using Natural Language Queries

Author: Bulling Andreas
Chowdhury Sreyasi Nag
Fritz Mario
Malinowski Mateusz
Publication date: 01/01/2016

The widespread integration of cameras in hand-held and head-worn devices as well as the ability to share content online enables a large and diverse visual capture of the world that millions of users build up collectively every day. We envision these images as well as associated meta information, such as GPS coordinates and timestamps, to form a collective visual memory that can be queried while automatically taking the ever-changing context of mobile users into account. As a first step towards this vision, in this work we present Xplore-M-Ego: a novel media retrieval system that allows users to query a dynamic database of images and videos using spatio-temporal natural language queries. We evaluate our system using a new dataset of real user queries as well as through a usability study. One key finding is that there is a considerable amount of inter-user variability, for example in the resolution of spatial relations in natural language utterances. We show that our retrieval system can cope with this variability using personalisation through an online learning-based retrieval formulation. Comment: 8 pages, 9 figures, 1 table

arXiv.org e-Print Archive

CISPA – Helmholtz-Zentrum für Informationssicherheit

MPG.PuRe

Querying Spatio-temporal Patterns in Mobile Phone-Call Databases

Author: Enrique Frías-Martínez
Marcos R. Vieira
Petko Bakalov
Vanessa Frías-Martínez
Vassilis J. Tsotras
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2010

Call Detail Record (CDR) databases contain millions of records with information about cell phone calls, including the position of the user when the call was made or received. This huge amount of spatiotemporal data opens the door for the study of human trajectories on a large scale without the bias that other sources (like GPS or WLAN networks) introduce in the population studied. It also provides a platform for the development of a wide variety of studies, ranging from the spread of diseases to the planning of public transport. Nevertheless, previous work on spatiotemporal queries does not provide a framework flexible enough for expressing the complexity of human trajectories. In this paper we present the Spatiotemporal Pattern System (STPS) to query spatiotemporal patterns in very large CDR databases. STPS defines a regular-expression query language that is intuitive and that allows for any combination of spatial and temporal predicates with constraints, including the use of variables. The design of the language took into consideration the layout of the areas being covered by the cellular towers, as well as “areas” that label places of interest (e.g., neighborhoods, parks) and topological operators. STPS includes an underlying indexing structure and algorithms for query processing using different evaluation strategies. A full implementation of the STPS is currently running with real, very large CDR databases at Telefónica Research Labs. An extensive performance evaluation of the STPS shows that it can efficiently find complex mobility patterns in large CDR databases.
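As a rough illustration of what a regular-expression query over trajectories can look like, the sketch below encodes each visited area as one character and matches the resulting string with Python's standard `re` module. This is not the STPS query language: the area labels, the `SYMBOLS` mapping, and the helper names are hypothetical, and temporal predicates, variables, topological operators, and indexing are all left out.

```python
import re

# Hypothetical area labels, each mapped to a one-character symbol.
SYMBOLS = {"home": "H", "office": "O", "park": "P", "mall": "M"}

def encode(trajectory):
    """Turn a list of (area, hour) call records into a symbol string."""
    return "".join(SYMBOLS[area] for area, _hour in trajectory)

def matches(trajectory, pattern):
    """True if the whole trajectory matches the spatial pattern."""
    return re.fullmatch(pattern, encode(trajectory)) is not None

# Toy call records, ordered by time, for two made-up users.
calls = {
    "user_a": [("home", 8), ("office", 9), ("park", 18)],
    "user_b": [("home", 8), ("mall", 12), ("home", 20)],
}

# Pattern: starts at home, passes through any areas, ends at the park.
pattern = r"H.*P"
hits = [user for user, traj in calls.items() if matches(traj, pattern)]
print(hits)
```

Scanning every trajectory like this would not scale to very large CDR databases, which is why the abstract pairs the language with an indexing structure and dedicated evaluation strategies.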

CiteSeerX

Crossref

MonALISA : A Distributed Monitoring Service Architecture

Author: Cirstoiu C.
Galvez P.
Legrand I. C.
Newman H. B.
Voicu R.
Publication date: 16/06/2003

The MonALISA (Monitoring Agents in A Large Integrated Services Architecture) system provides a distributed monitoring service. MonALISA is based on a scalable Dynamic Distributed Services Architecture which is designed to meet the needs of physics collaborations for monitoring global Grid systems, and is implemented using JINI/JAVA and WSDL/SOAP technologies. The scalability of the system derives from three features: multithreaded Station Servers that host a variety of loosely coupled, self-describing dynamic services; the ability of each service to register itself and then be discovered and used by any other services or clients that require such information; and the ability of all services and clients subscribing to a set of events (state changes) in the system to be notified automatically. The framework integrates several existing monitoring tools and procedures to collect parameters describing computational nodes, applications and network performance. It has built-in SNMP support and network-performance monitoring algorithms that enable it to monitor end-to-end network performance as well as the performance and state of site facilities in a Grid. MonALISA is currently running around the clock on the US CMS test Grid as well as an increasing number of other sites. It is also being used to monitor the performance and optimize the interconnections among the reflectors in the VRVS system. Comment: Talk from the 2003 Computing in High Energy and Nuclear Physics (CHEP03), La Jolla, CA, USA, March 2003, 8 pages, pdf. PSN MOET00
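The register, discover, and subscribe interactions described above can be sketched with a small in-memory registry. This is only an illustration of the pattern, not MonALISA's JINI-based implementation: the `Registry` class, its method names, and the service attributes are hypothetical.

```python
from collections import defaultdict

class Registry:
    """Toy lookup service: services register, clients discover and subscribe."""

    def __init__(self):
        self.services = {}                    # service name -> attribute dict
        self.subscribers = defaultdict(list)  # event name -> callbacks

    def register(self, name, attributes):
        """Store a self-describing service and notify subscribers."""
        self.services[name] = attributes
        self.publish("registered", name)

    def discover(self, **wanted):
        """Return names of services whose attributes match all criteria."""
        return [
            name for name, attrs in self.services.items()
            if all(attrs.get(k) == v for k, v in wanted.items())
        ]

    def subscribe(self, event, callback):
        """Ask to be called whenever `event` is published."""
        self.subscribers[event].append(callback)

    def publish(self, event, payload):
        """Deliver `payload` to every callback subscribed to `event`."""
        for callback in self.subscribers[event]:
            callback(payload)

registry = Registry()
seen = []
registry.subscribe("registered", seen.append)
registry.register("cpu-monitor", {"kind": "monitor", "site": "site-a"})
registry.register("file-catalog", {"kind": "storage", "site": "site-a"})
monitors = registry.discover(kind="monitor")
```

In a distributed deployment the registry itself would be a networked lookup service and notifications would travel as remote events; here everything runs in one process for brevity.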

arXiv.org e-Print Archive

CERN Document Server

A query processing system for very large spatial databases using a new map algebra

Author: Firouzabadi Seyed-Ali
Publication venue: Universite de Sherbrooke
Publication date: 01/01/2002

In this thesis we introduce a query processing approach for spatial databases and explain the main concepts we defined and developed: a spatial algebra and a graph-based approach used in the optimizer.
The spatial algebra was defined to express queries and transformation rules during the different steps of query optimization. To cover a wide variety of potential applications, we tried to make the algebra as complete as possible. The algebra treats spatial data as maps of spatial objects. The algebraic operators act on maps and produce new maps. Aggregate functions can act on maps and objects and produce objects or basic values (characters, numbers, etc.). The optimizer receives the query as an algebraic expression and produces one efficient QEP (Query Evaluation Plan) through two main consecutive blocks: QEG (Query Evaluation Graph) generation and QEP generation. In QEG generation we construct a graph equivalent to the algebraic expression and then apply graph transformation rules to produce one efficient QEG. In QEP generation we receive the efficient QEG, perform predicate ordering and approximation, and then generate the efficient QEP. The QEP is a set of consecutive phases that must be executed in the specified order. Each phase consists of one or more primitive operations. All primitive operations in the same phase can be executed in parallel. We implemented the optimizer, a random spatial query generator, and a simulated spatial database. The query generator produces random queries for the purpose of testing the optimizer. The simulated spatial database is a set of functions that simulate primitive spatial operations. They return the cost of the corresponding primitive operation according to the input parameters. We submitted randomly generated queries to the optimizer, collected the generated QEPs, and submitted them to the spatial database simulator. We used the experimental results to discuss the optimizer's characteristics and performance. The optimizer was designed for databases with a very large number of spatial objects; nevertheless, most of the concepts we used can be applied to all spatial information systems. (Abstract abridged by UMI)
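The execution model of the QEP (phases run in order, operations within a phase run in parallel) can be sketched as follows. This is an assumption-laden illustration, not the thesis implementation: `run_plan` and the string-returning operations are hypothetical stand-ins for primitive spatial operations.

```python
from concurrent.futures import ThreadPoolExecutor

def run_plan(phases):
    """Execute phases sequentially; run each phase's operations in parallel.

    `phases` is a list of lists of zero-argument callables. A phase only
    starts after every operation of the previous phase has finished.
    """
    results = []
    with ThreadPoolExecutor() as pool:
        for phase in phases:
            futures = [pool.submit(op) for op in phase]
            # Waiting on every future acts as a barrier between phases.
            results.append([f.result() for f in futures])
    return results

# Hypothetical two-phase plan: two independent scans, then an overlay.
plan = [
    [lambda: "scan roads", lambda: "scan rivers"],
    [lambda: "overlay roads, rivers"],
]
out = run_plan(plan)
print(out)
```

In this sketch the operations do not pass data to each other; a fuller version would feed the maps produced in one phase into the operations of the next.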

Savoirs UdeS