
    Universal Indexes for Highly Repetitive Document Collections

    Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower. Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
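
The contrast the abstract draws between gap-encoding and run-length compression of differential inverted lists can be illustrated with a minimal Python sketch. It assumes a posting list is a sorted list of document identifiers and that near-copies receive consecutive identifiers. The function names (`gaps`, `run_length`) and the sample list are hypothetical and are not taken from the paper.

```python
from itertools import groupby

def gaps(postings):
    """Differential (gap) encoding: first id, then differences between consecutive ids."""
    if not postings:
        return []
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def run_length(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# Hypothetical posting list: a word present in documents 10..15 and 40..43,
# as might happen when consecutive ids are versions of the same document.
postings = [10, 11, 12, 13, 14, 15, 40, 41, 42, 43]

d = gaps(postings)      # many gaps equal to 1
rle = run_length(d)     # runs of equal gaps collapse into single pairs
print(d)
print(rle)
```

Under these assumptions the ten gaps collapse into four (value, count) pairs. This is only meant to show why runs in the differential list are compressible, not how the paper's indexes are actually built.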

    RDF-TR: Exploiting structural redundancies to boost RDF compression

    The number and volume of semantic data have grown impressively over the last decade, promoting compression as an essential tool for RDF preservation, sharing and management. In contrast to universal compressors, RDF compression techniques are able to detect and exploit specific forms of redundancy in RDF data. Thus, state-of-the-art RDF compressors excel at exploiting syntactic and semantic redundancies, i.e., repetitions in the serialization format and information that can be inferred implicitly. However, little attention has been paid to the existence of structural patterns within the RDF dataset, i.e., structural redundancy. In this paper, we analyze structural regularities in real-world datasets, and show three schema-based sources of redundancies that underpin the schema-relaxed nature of RDF. Then, we propose RDF-Tr (RDF Triples Reorganizer), a preprocessing technique that discovers and removes this kind of redundancy before the RDF dataset is effectively compressed. In particular, RDF-Tr groups subjects that are described by the same predicates, and locally re-codes the objects related to these predicates. Finally, we integrate RDF-Tr with two RDF compressors, HDT and k2-triples. Our experiments show that using RDF-Tr with these compressors improves their original effectiveness by up to 2.3 times, outperforming the most prominent state-of-the-art techniques.
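
The grouping step mentioned in the abstract (subjects described by the same predicates are grouped together) might look roughly like the following Python sketch. It covers only the grouping and not the local re-coding of objects. The function name `group_subjects_by_predicates` and the sample triples are hypothetical illustrations, not the authors' implementation.

```python
from collections import defaultdict

def group_subjects_by_predicates(triples):
    """Map each distinct predicate set to the sorted subjects described by exactly that set."""
    preds_of = defaultdict(set)
    for s, p, _o in triples:
        preds_of[s].add(p)

    groups = defaultdict(list)
    for s, preds in preds_of.items():
        # frozenset makes the predicate set hashable so it can serve as a key
        groups[frozenset(preds)].append(s)
    return {k: sorted(v) for k, v in groups.items()}

# Hypothetical triples (subject, predicate, object)
triples = [
    ("alice", "name", "Alice"), ("alice", "age", "30"),
    ("bob", "name", "Bob"), ("bob", "age", "25"),
    ("acme", "name", "ACME"),
]

groups = group_subjects_by_predicates(triples)
for preds, subjects in groups.items():
    print(sorted(preds), subjects)
```

Here `alice` and `bob` share the predicate set {name, age} and fall into one group, while `acme` forms its own.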

    Compressed k2-Triples for Full-In-Memory RDF Engines

    The current "data deluge" has flooded the Web of Data with very large RDF datasets. They are hosted and queried through SPARQL endpoints, which act as nodes of a semantic net built on the principles of the Linked Data project. Although this is a realistic philosophy for global data publishing, query performance suffers when the RDF engines behind the endpoints manage these huge datasets. Their indexes cannot be fully loaded in main memory, hence these systems need to perform slow disk accesses to solve SPARQL queries. This paper addresses this problem with a compact indexed RDF structure (called k2-triples) that applies compact k2-tree structures to the well-known vertical-partitioning technique. It obtains an ultra-compressed representation of large RDF graphs and allows SPARQL queries to be performed fully in memory without decompression. We show that k2-triples clearly outperforms state-of-the-art compressibility and traditional vertical-partitioning query resolution, remaining very competitive with multi-index solutions. Comment: In Proc. of AMCIS'201
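
As a rough illustration of the vertical-partitioning idea the abstract builds on, the Python sketch below splits triples into one (subject, object) relation per predicate and resolves a simple (s, p, ?o) pattern against it. A plain Python set stands in for the k2-tree that the paper uses to store each relation. The function names and the integer ids are assumptions made for the example.

```python
from collections import defaultdict

def vertical_partition(triples):
    """Split (s, p, o) triples into one set of (s, o) pairs per predicate."""
    parts = defaultdict(set)
    for s, p, o in triples:
        parts[p].add((s, o))
    return dict(parts)

def objects_of(parts, s, p):
    """Resolve the pattern (s, p, ?o): all objects linked to s by predicate p."""
    return sorted(o for (subj, o) in parts.get(p, ()) if subj == s)

# Hypothetical dictionary-encoded triples (integer ids for subject, predicate, object)
triples = [(1, 7, 2), (1, 7, 3), (2, 7, 3), (1, 8, 4)]

parts = vertical_partition(triples)
print(objects_of(parts, 1, 7))
```

Each per-predicate relation can be seen as a sparse binary matrix of subjects against objects, which is the kind of structure a k2-tree is designed to represent compactly.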

    An Empirical Study of Real-World SPARQL Queries

    Understanding how users tailor their SPARQL queries is crucial when designing query evaluation engines or fine-tuning RDF stores with performance in mind. In this paper we analyze 3 million real-world SPARQL queries extracted from logs of the DBPedia and SWDF public endpoints. We aim to find the most used language elements from both syntactical and structural perspectives, paying special attention to triple patterns and joins, since they are among the most expensive SPARQL operations at the evaluation phase. We have determined that most of the queries are simple and include few triple patterns and joins, with Subject-Subject, Subject-Object and Object-Object being the most common join types. The graph patterns are usually star-shaped and, although triple pattern chains exist, they are generally short. Comment: 1st International Workshop on Usage Analysis and the Web of Data (USEWOD2011) in the 20th International World Wide Web Conference (WWW2011), Hyderabad, India, March 28th, 201

    Injury assessment of common nage-waza judo techniques for amateur judokas

    There are few detailed publications that allow associations to be drawn between technical aspects and the occurrence of injuries. The purpose of this study was to apply a methodology based on recorded material to assess injury risk factors. Common nage-waza judo techniques during regular training of amateur judokas were used as a case study. Novice students (n=193; 100 men and 93 women) from the University of Vigo were filmed over five academic years (2003 to 2008) during the ordinary training period while executing ten nage-waza techniques. The obtained data were evaluated using descriptive statistics and T-patterns analysis. Thus, it was possible to identify typical inaccuracies during execution of the techniques, uncovering the main temporal sequence of errors and allowing us to link our findings with the occurrence of injuries. In order to narrow down the unexpected causes of accidents related to poor technique in regular training, this research provides the hidden temporal sequence of errors in common throwing techniques, helping professionals to correct the key technical errors in order to prevent diverse types of injuries. The methodology developed here could easily be extended to other martial sports.

    Modeling the potential distribution and richness of cetaceans in the Azores from Fisheries Observer Program data

    © The Author(s), 2016. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published in Frontiers in Marine Science 2 (2016): 202, doi:10.3389/fmars.2016.00202. Marine spatial planning and ecological research call for high-resolution species distribution data. However, those data are still not available for most marine large vertebrates. The dynamic nature of oceanographic processes and the wide-ranging behavior of many marine vertebrates create further difficulties, as distribution data must incorporate both the spatial and temporal dimensions. Cetaceans play an essential role in structuring and maintaining marine ecosystems and face increasing threats from human activities. The Azores holds a high diversity of cetaceans but the information about spatial and temporal patterns of distribution for this marine megafauna group in the region is still very limited. To tackle this issue, we created monthly predictive cetacean distribution maps for spring and summer months, using data collected by the Azores Fisheries Observer Programme between 2004 and 2009. We then combined the individual predictive maps to obtain species richness maps for the same period. Our results reflect a great heterogeneity in distribution among species and within species among different months. This heterogeneity reflects a contrasting influence of oceanographic processes on the distribution of cetacean species. However, some persistent areas of increased species richness could also be identified from our results. We argue that policies aimed at effectively protecting cetaceans and their habitats must include the principle of dynamic ocean management coupled with other area-based management such as marine spatial planning. This work was supported by FEDER funds, through the Competitiveness Factors Operational Programme - COMPETE, by national funds, through FCT - Foundation for Science and Technology, under project TRACE (PTDC/MAR/74071/2006), and by regional funds, through DRCT/SRCTE, under projects MAPCET (M2.1.2/F/012/2011) and 2020 (M2.1.2/I/026/2011). We acknowledge funds provided by FCT to MARE, through the strategic project UID/MAR/04292/2013. RP is supported by an FCT postdoctoral grant (SFRH/BPD/108007/2015); MAS is supported by Program Investigator FCT (IF/00943/2013) and MT was supported by a research fellowship under the Exploratory project (IF/00943/2013/CP1199/CT0001) that also paid the fees for this open-access publication. IF/00943/2013 and IF/00943/2013/CP1199/CT0001 are funded by FSE and MCTES, through POPH and QREN

    Epigenetics override pro-inflammatory PTGS transcriptomic signature towards selective hyperactivation of PGE2 in colorectal cancer

    This is an Open Access article distributed under the terms of the Creative Commons Attribution License. [Background]: Misregulation of the PTGS (prostaglandin endoperoxide synthase, also known as cyclooxygenase or COX) pathway may lead to the accumulation of pro-inflammatory signals, which constitutes a hallmark of cancer. To get insight into the role of this signaling pathway in colorectal cancer (CRC), we have characterized the transcriptional and epigenetic landscapes of the PTGS pathway genes in normal and cancer cells. [Results]: Data from four independent series of CRC patients (502 tumors including adenomas and carcinomas and 222 adjacent normal tissues) and two series of colon mucosae from 69 healthy donors have been included in the study. Gene expression was analyzed by real-time PCR and Affymetrix U219 arrays. DNA methylation was analyzed by bisulfite sequencing, dissociation curves, and HumanMethylation450K arrays. Most CRC patients show selective transcriptional deregulation of the enzymes involved in the synthesis of prostanoids and their receptors in both tumor and its adjacent mucosa. DNA methylation alterations exclusively affect the tumor tissue (both adenomas and carcinomas), redirecting the transcriptional deregulation to activation of prostaglandin E2 (PGE2) function and blockade of other biologically active prostaglandins. In particular, PTGIS, PTGER3, PTGFR, and AKR1B1 were hypermethylated in more than 40% of all analyzed tumors. [Conclusions]: The transcriptional and epigenetic profiling of the PTGS pathway provides important clues on the biology of the tumor and its microenvironment. This analysis renders candidate markers with potential clinical applicability in risk assessment and early diagnosis and for the design of new therapeutic strategies. IC was funded by Fundação para a Ciência e a Tecnologia (SFRH/BD/28464/2006); JC was funded by a FPI fellowship. ADV was supported in part by a contract from the Ministerio de Economía y Competitividad (MINECO) (PTC2011-1091). This work was supported by the MINECO (SAF2011/23638, SAF2014/52492), the Catalan Institute of Oncology and the Instituto de Salud Carlos III (grant PI11-01439, RD12/0042/0019 and CIBERESP CB06/02/2005), the Generalitat de Catalunya (grant 2014SGR647), and the Asociación Española Contra el Cáncer (AECC). Peer Reviewe

    Mechanistic studies on the photogeneration of o- and p-xylylenes from α,α'-dichloroxylenes

    Two-colour two-laser techniques have unambiguously proved that photolysis of the o-/p-(chloromethyl)benzyl radical leads to the sequential two-photon generation of o-/p-xylylene from α,α'-dichloro-o-/p-xylene. Perez Prieto, Julia