Universal Indexes for Highly Repetitive Document Collections

Author: Claude Francisco
Fariña Antonio
Martínez-Prieto Miguel A.
Navarro Gonzalo
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016

Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower. Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
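
The idea of run-length compressing a differential inverted list can be illustrated with a minimal sketch. It is not the authors' implementation; the function names and the sample posting list are hypothetical, and it assumes that near-copy versions receive consecutive document identifiers, so that a word persisting across versions yields long runs of gap 1.

```python
def gaps(postings):
    """Differential (gap) encoding of a strictly increasing posting list."""
    prev = 0
    out = []
    for doc_id in postings:
        out.append(doc_id - prev)
        prev = doc_id
    return out


def run_length(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]


def decode(runs):
    """Rebuild the original posting list from (gap, count) pairs."""
    postings, current = [], 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings


# Hypothetical posting list: the word occurs in versions 3..7 and 20..22.
postings = [3, 4, 5, 6, 7, 20, 21, 22]
encoded = run_length(gaps(postings))
```

Here eight postings shrink to four (gap, count) pairs; the longer the runs of unchanged versions, the larger the saving over plain gap encoding.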

arXiv.org e-Print Archive

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Académico de la Universidad de Chile

RDF-TR: Exploiting structural redundancies to boost RDF compression

Author: Fernández Javier D.
Hernández-Illera Antonio
Martínez-Prieto Miguel A.
Publication venue: 'Elsevier BV'
Publication date: 01/01/2020

The number and volume of semantic data have grown impressively over the last decade, promoting compression as an essential tool for RDF preservation, sharing and management. In contrast to universal compressors, RDF compression techniques are able to detect and exploit specific forms of redundancy in RDF data. Thus, state-of-the-art RDF compressors excel at exploiting syntactic and semantic redundancies, i.e., repetitions in the serialization format and information that can be inferred implicitly. However, little attention has been paid to the existence of structural patterns within the RDF dataset, i.e., structural redundancy. In this paper, we analyze structural regularities in real-world datasets, and show three schema-based sources of redundancies that underpin the schema-relaxed nature of RDF. Then, we propose RDF-Tr (RDF Triples Reorganizer), a preprocessing technique that discovers and removes this kind of redundancy before the RDF dataset is effectively compressed. In particular, RDF-Tr groups subjects that are described by the same predicates, and locally re-codes the objects related to these predicates. Finally, we integrate RDF-Tr with two RDF compressors, HDT and k2-triples. Our experiments show that using RDF-Tr with these compressors improves their original effectiveness by up to 2.3 times, outperforming the most prominent state-of-the-art techniques.
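
The grouping step mentioned in the abstract (subjects described by the same predicates) can be sketched as follows. This is only an illustration under stated assumptions: the function name and the toy triples are hypothetical, triples are plain `(subject, predicate, object)` tuples, and the local re-coding of objects is not covered.

```python
from collections import defaultdict


def group_by_predicate_set(triples):
    """Group subjects that are described by exactly the same set of predicates."""
    predicates_of = defaultdict(set)
    for subject, predicate, _obj in triples:
        predicates_of[subject].add(predicate)

    groups = defaultdict(list)
    for subject, predicates in predicates_of.items():
        groups[frozenset(predicates)].append(subject)

    # Sort subjects so the result is deterministic.
    return {preds: sorted(subjects) for preds, subjects in groups.items()}


# Hypothetical toy dataset.
triples = [
    ("a", "name", "Alice"),
    ("a", "age", "30"),
    ("b", "name", "Bob"),
    ("b", "age", "25"),
    ("c", "name", "Carol"),
]
groups = group_by_predicate_set(triples)
```

Subjects `a` and `b` share the predicate set `{name, age}` and land in one group, while `c` forms its own group.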

Elektronische Publikationen der Wirtschaftsuniversität Wien

Compressed k2-Triples for Full-In-Memory RDF Engines

Author: Brisaboa Nieves R.
Fernández Javier D.
Martínez-Prieto Miguel A.
Álvarez-García Sandra
Publication date: 19/05/2011

The current "data deluge" has flooded the Web of Data with very large RDF datasets. They are hosted and queried through SPARQL endpoints, which act as nodes of a semantic net built on the principles of the Linked Data project. Although this is a realistic philosophy for global data publishing, query performance suffers when the RDF engines behind the endpoints manage these huge datasets. Their indexes cannot be fully loaded in main memory, hence these systems need to perform slow disk accesses to solve SPARQL queries. This paper addresses this problem with a compact indexed RDF structure (called k2-triples) that applies compact k2-tree structures to the well-known vertical-partitioning technique. It obtains an ultra-compressed representation of large RDF graphs and allows SPARQL queries to be resolved fully in memory without decompression. We show that k2-triples clearly outperforms state-of-the-art compressibility and traditional vertical-partitioning query resolution, remaining very competitive with multi-index solutions. Comment: In Proc. of AMCIS'201

arXiv.org e-Print Archive

AIS Electronic Library (AISeL)

An Empirical Study of Real-World SPARQL Queries

Author: Arias Mario
de la Fuente Pablo
Fernández Javier D.
Martínez-Prieto Miguel A.
Publication date: 01/01/2011

Understanding how users tailor their SPARQL queries is crucial when designing query evaluation engines or fine-tuning RDF stores with performance in mind. In this paper we analyze 3 million real-world SPARQL queries extracted from logs of the DBpedia and SWDF public endpoints. We aim to find the most used language elements from both syntactical and structural perspectives, paying special attention to triple patterns and joins, since they are among the most expensive SPARQL operations in the evaluation phase. We have determined that most of the queries are simple and include few triple patterns and joins, with Subject-Subject, Subject-Object and Object-Object being the most common join types. The graph patterns are usually star-shaped and, although triple pattern chains exist, they are generally short. Comment: 1st International Workshop on Usage Analysis and the Web of Data (USEWOD2011) in the 20th International World Wide Web Conference (WWW2011), Hyderabad, India, March 28th, 201
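
To make the join types named in the abstract concrete, the following minimal sketch labels each variable shared by two triple patterns according to the positions it occupies. It is an illustration only, not the analysis tool used in the study; the function name is hypothetical, and triple patterns are assumed to be 3-tuples of strings in which variables start with `?`.

```python
from collections import Counter
from itertools import combinations

POSITIONS = ("Subject", "Predicate", "Object")


def join_types(patterns):
    """Count join types between every pair of triple patterns.

    A join is a variable shared by two patterns; its type is given by the
    positions of that variable in each pattern, e.g. "Subject-Object".
    """
    counts = Counter()
    for first, second in combinations(patterns, 2):
        for i, term_a in enumerate(first):
            for j, term_b in enumerate(second):
                if term_a.startswith("?") and term_a == term_b:
                    low, high = sorted((i, j))
                    counts[f"{POSITIONS[low]}-{POSITIONS[high]}"] += 1
    return dict(counts)


# Star-shaped pattern: both triple patterns share the subject ?x.
star = [("?x", "foaf:name", "?n"), ("?x", "foaf:age", "?a")]

# Chain: the object of the first pattern is the subject of the second.
chain = [("?x", "foaf:knows", "?y"), ("?y", "foaf:name", "?n")]
```

Under these assumptions, `join_types(star)` reports one Subject-Subject join and `join_types(chain)` reports one Subject-Object join.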

arXiv.org e-Print Archive

CiteSeerX

MapReduce-based Solutions for Scalable SPARQL Querying

Author: Javier D. Fernández
José M. Giménez-Garcia
Miguel A. Martínez-Prieto
Publication venue: RonPub
Publication date: 01/01/2014

The use of RDF to expose semantic data on the Web has seen a dramatic increase over the last few years. Nowadays, RDF datasets are so big and interconnected that classical mono-node solutions present significant scalability problems when trying to manage big semantic data. MapReduce, a standard framework for distributed processing of large quantities of data, is earning a place among the distributed solutions facing RDF scalability issues. In this article, we survey the most important works addressing RDF management and querying through diverse MapReduce approaches, with a focus on their main strategies, optimizations and results.

CiteSeerX

RonPub -- Research Online Publishing

HDTourist: exploring urban data on Android

Author: Corcho Oscar
Fernández Javier D.
Hervalejo Elena
Martínez Prieto Miguel A.
Publication venue: E.T.S. de Ingenieros Informáticos (UPM)
Publication date: 01/01/2014

The Web of Data currently comprises approximately 62 billion triples from more than 2,000 different datasets covering many fields of knowledge. This volume of structured Linked Data can be seen as a particular case of Big Data, referred to as Big Semantic Data [4]. Obviously, powerful computational configurations are traditionally required to deal with the scalability problems arising from Big Semantic Data. It is not surprising that this "data revolution" has competed in parallel with the growth of mobile computing. Smartphones and tablets are massively used at the expense of traditional computers but, to date, mobile devices have more limited computation resources. Therefore, one question that we may ask ourselves would be: can (potentially large) semantic datasets be consumed natively on mobile devices? Currently, only a few mobile apps (e.g., [1, 9, 2, 8]) make use of semantic data that they store in the mobile devices, while many others access existing SPARQL endpoints or Linked Data directly. Two main reasons can be considered for this fact. On the one hand, in spite of some initial approaches [6, 3], there are no well-established triplestores for mobile devices. This is an important limitation because any potential app must assume both RDF storage and SPARQL resolution. On the other hand, the particular features of these devices (little storage space, less computational power or more limited bandwidths) limit the adoption of semantic data for different uses and purposes. This paper introduces our HDTourist mobile application prototype. It consumes urban data from DBpedia to help tourists visiting a foreign city. Although it is a simple app, its functionality illustrates how semantic data can be stored and queried with limited resources. Our prototype is implemented for Android, but its foundations, explained in Section 2, can be deployed on any other platform. The app is described in Section 3, and Section 4 summarizes our current achievements and outlines future work.

CiteSeerX

Archivo Digital UPM

Mathematical models of cytotoxic effects in endpoint tumor cell line assays: Critical assessment of the application of a single parametric value as a standard criterion to quantify the dose-response effects and new unexplored proposal formats

Author: Calhelha Ricardo C.
Ferreira Isabel C.F.R.
Martínez Mireia A.
Prieto Lage Miguel A.
Publication venue: 'Royal Society of Chemistry (RSC)'
Publication date: 01/01/2017

The development of convenient tools for describing and quantifying the effects of standard and novel therapeutic agents is essential for the research community, to perform more precise evaluations. Although mathematical models and quantification criteria have been exchanged in the last decade between different fields of study, there are relevant methodologies that lack proper mathematical descriptions and standard criteria to quantify their responses. Therefore, part of the relevant information that can be drawn from the experimental results obtained and the quantification of its statistical reliability are lost. Despite its relevance, there is not a standard form for the in vitro endpoint tumor cell lines' assays (TCLA) that enables the evaluation of the cytotoxic dose-response effects of anti-tumor drugs. The analysis of all the specific problems associated with the diverse nature of the available TCLA used is unfeasible. However, since most TCLA share the main objectives and similar operative requirements, we have chosen the sulforhodamine B (SRB) colorimetric assay for cytotoxicity screening of tumor cell lines as an experimental case study. In this work, the common biological and practical non-linear dose-response mathematical models are tested against experimental data and, following several statistical analyses, the model based on the Weibull distribution was confirmed as the convenient approximation to test the cytotoxic effectiveness of anti-tumor compounds. Then, the advantages and disadvantages of all the different parametric criteria derived from the model, which enable the quantification of the dose-response drug-effects, are extensively discussed. Therefore, model and standard criteria for easily performing the comparisons between different compounds are established. 
The advantages include a simple application, provision of parametric estimations that characterize the response as standard criteria, economization of experimental effort and enabling rigorous comparisons among the effects of different compounds and experimental approaches. In all experimental data fitted, the calculated parameters were always statistically significant, the equations proved to be consistent and the correlation coefficient of determination was, in most of the cases, higher than 0.98. The authors are grateful to the Foundation for Science and Technology (FCT) of Portugal and FEDER for financial support to CIMO (UID/AGR/00690/2013); and to the Xunta de Galicia for financial support for the post-doctoral research of M. A. Prieto.

Biblioteca Digital do IPB

Use of synthetic RGB images in train

Author: Calbet Xavier
Martínez Rubio Miguel Ángel
Prieto Fernández José Ignacio
Tjemkes Stephen A.
Publication date: 01/01/2010

Paper presented at: 2010 EUMETSAT Meteorological Satellite Conference, held 20-24 September 2010 in Córdoba. This paper compares actual and synthetic (calculated) airmass RGB composites, before and after applying different corrections: (a) synthetic airmass RGB calculated with the satellite zenith angle corresponding to the pixel, versus others with fixed zenith angles (0°, 15°, 30°, 45°, 60° and 75°), to observe the blue shift close to the disk boundary due to a longer atmospheric path for the signal in oblique views; (b) synthetic airmass RGB in areas of sinking tropopause with ozone intrusion, for different values of ozone concentration and humidity. The synthetic RGBs for MSG are based on the ECMWF model with the RTTOV package and the METEOSAT Second Generation coefficients. Synthetic MTG-IRS sounder RGBs have also been generated. Using the RTTOV package with the IASI coefficients for one GRIB file of the ECMWF model, synthetic IASI RTTOV radiances are converted to synthetic MTG-IRS RTTOV brightness temperatures (BTs). After selecting the most suitable MTG-IRS BTs, two MTG-sounder synthetic RGBs are created. This is a proving ground experiment for the MTG sounder era.

Agencia Estatal de Meteorología