
    Dynamic Range Majority Data Structures

    Given a set P of coloured points on the real line, we study the problem of answering range α-majority (or "heavy hitter") queries on P. More specifically, for a query range Q, we want to return each colour that is assigned to more than an α-fraction of the points contained in Q. We present a new data structure for answering range α-majority queries on a dynamic set of points, where α ∈ (0,1). Our data structure uses O(n) space, supports queries in O((lg n)/α) time, and updates in O((lg n)/α) amortized time. If the coordinates of the points are integers, then the query time can be improved to O(lg n/(α lg lg n) + (lg(1/α))/α). For constant values of α, this improved query time matches an existing lower bound, for any data structure with polylogarithmic update time. We also generalize our data structure to handle sets of points in d dimensions, for d ≥ 2, as well as dynamic arrays, in which each entry is a colour. Comment: 16 pages, Preliminary version appeared in ISAAC 201
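    To make the query semantics concrete, the following is a brute-force range α-majority query in Python. It only illustrates what the data structure computes; the structure itself answers the same query in O((lg n)/α) time without scanning the range.

```python
from collections import Counter

def range_alpha_majority(points, lo, hi, alpha):
    """Brute-force range alpha-majority: return every colour assigned to
    more than an alpha-fraction of the points with coordinate in [lo, hi].
    Illustrates the query semantics only, not the paper's O(n)-space
    dynamic structure."""
    colours = [c for (x, c) in points if lo <= x <= hi]
    counts = Counter(colours)
    threshold = alpha * len(colours)
    return {c for c, k in counts.items() if k > threshold}

pts = [(1, "red"), (2, "blue"), (3, "red"), (4, "red"), (5, "green")]
# In [1, 4], "red" covers 3 of 4 points, exceeding the 1/2-fraction.
```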

    Cross-Document Pattern Matching

    We study a new variant of the string matching problem called cross-document string matching: the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear-space solutions are proposed with query time bounds that either do not depend on the pattern size at all or depend on it in a very limited (doubly logarithmic) way. As a side result, we propose an improved solution to the weighted level ancestor problem.
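    A naive baseline clarifies the cross-document setting: the pattern is specified as a substring of one document and searched in another. This sketch takes time proportional to the text and pattern lengths, whereas the proposed indexes answer such queries with at most a doubly-logarithmic dependence on the pattern size.

```python
def cross_doc_match(docs, pat_doc, i, j, target_doc):
    """Naive cross-document matching: the pattern is docs[pat_doc][i:j],
    searched for inside docs[target_doc]. Returns all match positions.
    A baseline for the problem statement only, not the indexed solution."""
    pattern = docs[pat_doc][i:j]
    text = docs[target_doc]
    hits, pos = [], text.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = text.find(pattern, pos + 1)
    return hits
```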

    WeR-trees

    The R-tree has proven to be one of the most practical and well-behaved data structures for accommodating dynamic massive sets of low-dimensionality geometric objects and for answering a very diverse set of queries on such data sets in real-world applications. In this paper, we present weighted R-trees (WeR-trees), a new practical and efficient scheme for the dynamic manipulation of multi-dimensional data sets, which applies the technique of partial rebuildings to the R-tree family for the first time. Partial rebuilding refers to the method of progressively reconstructing whole subtrees along update paths in order to keep them in perfect balance from a performance perspective. An analytical investigation establishes amortized bounds for the update operations, while detailed experimental results on both synthetic and real data sets confirm the applicability of the proposed method and demonstrate its superiority over R*-trees, the most well-behaved and widely accepted variant of R-trees: node space utilization reaches up to 98.1%, query savings vary between 25% and 50% (and even more for skewed data), while the scheme scales linearly with the number of inserted items. (c) 2007 Elsevier B.V. All rights reserved.
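    The idea of partial rebuilding can be illustrated on a plain binary search tree (the paper applies it to R-trees): when a subtree on the update path drifts out of balance, it is flattened and rebuilt perfectly balanced, instead of being repaired by local rotations. A minimal sketch, with nodes as (key, left, right) tuples:

```python
def rebuild(node):
    """Flatten a subtree to a sorted key list, then rebuild it perfectly
    balanced. This is the core of partial rebuilding: the whole subtree
    is reconstructed wholesale rather than adjusted incrementally."""
    def flatten(n):
        return flatten(n[1]) + [n[0]] + flatten(n[2]) if n else []

    def build(keys):
        if not keys:
            return None
        m = len(keys) // 2  # median key becomes the subtree root
        return (keys[m], build(keys[:m]), build(keys[m + 1:]))

    return build(flatten(node))
```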

    Computing scientometrics in large-scale academic search engines with MapReduce

    Apart from the well-established facility of searching for research articles, modern academic search engines also provide information regarding the scientists themselves. Until recently, this information was limited to the articles each scientist has authored, accompanied by their corresponding citations. Presently, the most popular scientific databases have enriched this information with scientometrics, that is, metrics which evaluate the research activity of a scientist. Although the computation of scientometrics is relatively easy for small data sets, at larger scales the problem becomes more challenging, since the involved data is huge and cannot be handled efficiently by a single workstation. In this paper we address this problem by employing MapReduce, a distributed, fault-tolerant framework used to solve large-scale problems without dealing with complex network programming details. We demonstrate that by formulating the problem in a manner compatible with MapReduce, we can achieve an effective and scalable solution. We propose four algorithms which exploit the features of the framework, and we compare their efficiency by conducting experiments on a large dataset comprising roughly 1.8 million scientific documents. © 2012 Springer-Verlag
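    As a concrete example of one scientometric in this style, an author's h-index fits the map/reduce split naturally: a map step emits (author, citations) pairs and a reduce step aggregates them per author. A single-process Python simulation of the idea (the paper's four algorithms run on an actual MapReduce cluster):

```python
from collections import defaultdict

def map_phase(papers):
    """Map: emit an (author, citation_count) pair for every paper."""
    for author, citations in papers:
        yield author, citations

def reduce_phase(pairs):
    """Reduce: gather each author's citation counts and compute the
    h-index: the largest h such that h papers have >= h citations."""
    grouped = defaultdict(list)
    for author, c in pairs:
        grouped[author].append(c)
    result = {}
    for author, cites in grouped.items():
        cites.sort(reverse=True)
        h = 0
        for i, c in enumerate(cites, start=1):
            if c >= i:
                h = i
        result[author] = h
    return result
```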

    Grid-file: Towards a flash efficient multi-dimensional index

    Spatial indexes are of great importance for multidimensional query processing. Traditional data structures have been optimized for magnetic disks in the storage layer. In recent years, flash solid state disks have been widely adopted as a result of their exceptional features. However, the specifics of flash memory (asymmetric read/write speeds, erase-before-update, wear-out) introduce new challenges: algorithms and data structures designed for magnetic disks suffer reduced performance on flash. Most research efforts on flash-aware spatial indexes concern the R-tree and its variants. In contrast to previous works, we investigate the performance of the Grid File on flash and highlight the constraints and opportunities towards a flash-efficient Grid File. We conducted experiments on mainstream and high-performance SSD devices, and the Grid File outperforms the R*-tree in all cases. © Springer International Publishing Switzerland 2015
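    For readers unfamiliar with the structure, a Grid File answers a point query by locating a cell through two linear scales and following the directory pointer to a bucket. A minimal in-memory sketch (on disk, both the directory and the buckets live in pages):

```python
from bisect import bisect_right

def grid_lookup(directory, x_scale, y_scale, x, y):
    """Grid File point lookup: the linear scales partition each axis;
    bisecting them yields the directory cell, which points to the
    bucket holding (x, y). In-memory toy, not an on-disk layout."""
    i = bisect_right(x_scale, x)  # column index from the x linear scale
    j = bisect_right(y_scale, y)  # row index from the y linear scale
    return directory[i][j]
```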

    Effective Unsupervised Matching of Product Titles with k-Combinations and Permutations

    The problem of matching product titles is of particular interest to both users and marketers. The former frequently search the Web with the aim of comparing prices and characteristics, or obtaining and aggregating information provided by other users. The latter often require wide knowledge of competitive policies, prices, and features to organize a promotional campaign for a group of products. To address this interesting problem, recent studies have attempted to enrich the product titles by exploiting Web search engines. More specifically, these methods suggest that a query should be submitted for each product title. After the results have been collected, the most important words appearing in the results are identified and appended to the titles. In the sequel, each word is assigned an importance score and, finally, a similarity measure is applied to identify whether two or more titles refer to the same product. Nonetheless, these methods have multiple problems, including poor scalability, slow retrieval of the required additional search results, and lack of flexibility. In this paper, we present a different approach which addresses all these issues and is based on the morphological analysis of the product titles. In particular, our method operates in two phases. In the first phase, we compute the combinations of the words of the titles and record several statistics, such as word proximity and frequency values. In the second phase, we use this information to assign a score to each combination. The highest-scoring combination is then declared the label of the cluster which contains each product. The experimental evaluation of the algorithm on a real-world dataset demonstrated that, compared to three popular string similarity metrics, our approach achieves up to 36% better matching performance and at least 13 times faster execution. © 2018 IEEE
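    A toy version of the two-phase idea in Python: phase one enumerates the k-combinations of title words and counts them across the cluster, and phase two picks the highest-scoring combination as the cluster label. This uses plain frequency as the score; the actual method additionally records proximity statistics and a richer scoring function.

```python
from itertools import combinations
from collections import Counter

def best_label(titles, k=2):
    """Count every k-combination of words across a cluster of product
    titles and return the most frequent one as the cluster label.
    Frequency-only scoring; a simplification of the described method."""
    counts = Counter()
    for t in titles:
        words = sorted(set(t.lower().split()))
        for combo in combinations(words, k):
            counts[combo] += 1
    return max(counts, key=counts.get)
```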

    Effective ranking fusion methods for personalized metasearch engines

    Metasearch engines are a significant part of the information retrieval process. Most Web users use them, directly or indirectly, to access information from more than one data source. The cornerstone of their technology is the rank aggregation method, that is, the algorithm they use to rank the collected results. In this paper we present three new rank aggregation methods. First, we propose a method that takes into consideration regional data about the user and the pages, and assigns scores according to a variety of user-defined parameters. In the second method, not all component engines are treated equally: the user is free to define the importance of each engine by setting appropriate weights. The third algorithm is designed to rank pages whose URLs contain subdomains. The three presented methods are combined into a single, personalized scoring formula, the Global KE. All algorithms have been implemented in QuadSearch, an experimental metasearch engine available at http://quadsearch.csd.auth.gr. © 2008 IEEE
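    The engine-weighting idea can be sketched as a weighted Borda count (an illustrative fusion rule, not the paper's exact KE formula): each result earns positional points from every engine's list, scaled by that engine's user-assigned weight.

```python
def weighted_borda(rankings, weights):
    """Fuse ranked result lists from several engines: each URL earns
    (list_length - rank) points from every list it appears in, scaled
    by the engine's weight; final order is by total score."""
    scores = {}
    for ranking, w in zip(rankings, weights):
        n = len(ranking)
        for rank, url in enumerate(ranking):
            scores[url] = scores.get(url, 0.0) + w * (n - rank)
    return sorted(scores, key=scores.get, reverse=True)
```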

    LB-Grid: An SSD efficient Grid File

    Recent advances in non-volatile memory technology have led to the introduction of solid state drives (SSDs). NVMe SSDs are the latest development in flash-based solid state drives, designed to deliver low latency and high bandwidth. Many research studies seek to take advantage of this new technology to accelerate data management. Multidimensional indexes are fundamental for the efficiency of spatial query processing. In this work, we study the implications of high-performance NVMe drives for spatial indexing. More specifically, we present an in-depth performance analysis of the Grid File on flash storage, and we introduce LB-Grid, a write-efficient variant of the Grid File for flash-based solid state drives. We present new query algorithms for both LB-Grid and the Grid File that exploit the high internal parallelism and I/O bandwidth of NVMe SSDs. Experimental results demonstrate the efficiency of the proposed algorithms. On a test set of 500M points, LB-Grid is up to 2.26 times faster than the Grid File, up to 5.5 times faster than the R*-tree, and up to 3.3 times faster than the FAST R-tree in update-intensive workloads. On the other hand, the Grid File performs better in read-intensive workloads; exploiting a batch read operation, it achieves speedups of up to 10.2x in range queries, up to 1.56x in kNN queries, and up to 4.6x in group point queries. © 2019 Elsevier B.V.
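    The batched-read idea can be sketched generically: instead of issuing one synchronous page read at a time, a batch of reads is submitted concurrently so the NVMe device's internal parallelism can serve them together. Here `read_page` is a hypothetical stand-in for the index's low-level page fetch.

```python
from concurrent.futures import ThreadPoolExecutor

def batch_point_queries(read_page, page_ids, workers=8):
    """Issue a batch of page reads concurrently, keeping the NVMe
    device's queue full, rather than one synchronous read at a time.
    Results come back in the order of page_ids. Generic sketch only."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_page, page_ids))
```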

    A spatial index for hybrid storage

    The introduction of flash SSDs has accelerated the performance of DBMSes. However, the intrinsic characteristics of flash have motivated many researchers to investigate new, efficient data structures. The emergence of 3DXPoint, a new non-volatile memory, sets new challenges: 3DXPoint features low latency and high IOPS even at small queue depths. However, the cost of 3DXPoint is 4 times higher than that of a flash-based device, rendering hybrid storage systems a good alternative. In this paper we exploit the efficiency of both 3DXPoint and flash-based devices by introducing H-Grid, a variant of the Grid File for hybrid storage. H-Grid uses a flash SSD as the main store and a small 3DXPoint device to persist the hottest data. The performance of the proposed index is evaluated experimentally against GFFM, a flash-efficient implementation of the Grid File. The results show that H-Grid is faster than GFFM running on a flash SSD, reducing single point search time by 35% up to 43%. © 2019 Association for Computing Machinery
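    The hot/cold placement idea can be sketched as follows, under heavy simplifying assumptions: raw access counts decide hotness, and the hottest buckets are routed to the small fast tier. The class name, counters, and policy are illustrative, not H-Grid's actual design.

```python
class HybridStore:
    """Toy model of tiered placement: the fast_capacity hottest buckets
    live on the fast tier (3DXPoint), the rest on the large flash tier.
    Hotness is a plain access count; a simplification for illustration."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.hits = {}
        self.fast_tier = set()

    def access(self, bucket):
        # Count the access, then re-rank buckets by hotness.
        self.hits[bucket] = self.hits.get(bucket, 0) + 1
        hottest = sorted(self.hits, key=self.hits.get, reverse=True)
        self.fast_tier = set(hottest[:self.fast_capacity])
        return "3DXPoint" if bucket in self.fast_tier else "flash"
```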

    A supervised machine learning classification algorithm for research articles

    The issue of automatically classifying research articles into one or more fields of science is of primary importance for scientific databases and digital libraries. A sophisticated classification strategy renders searching more effective and assists users in locating similar relevant items. Although most publishing services require authors to categorize their articles themselves, there are still cases where older documents remain unclassified, or the taxonomy changes over time. In this work we address this interesting problem by introducing a machine learning algorithm which combines several parameters and meta-data of a research article. In particular, our model exploits the training set to correlate keywords, authors, co-authorships, and publishing journals with a number of labels of the taxonomy. In the sequel, it applies this information to classify the rest of the documents. The experiments we conducted on a large dataset comprising about 1.5 million articles demonstrate that, in this specific application, our model outperforms the AdaBoost.MH and SVM methods. Copyright 2013 ACM
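    A minimal sketch of the correlate-then-classify loop, under simplifying assumptions (one label per training article, plain count-based scoring; the actual model weighs several meta-data signals):

```python
from collections import defaultdict

def train(labelled_articles):
    """Correlate each feature (keyword, author, journal name, ...) with
    the taxonomy labels it co-occurs with in the training set."""
    feature_labels = defaultdict(lambda: defaultdict(int))
    for features, label in labelled_articles:
        for f in features:
            feature_labels[f][label] += 1
    return feature_labels

def classify(model, features):
    """Score each label by summing its counts over the article's
    features; return the best label, or None if nothing matched."""
    scores = defaultdict(int)
    for f in features:
        for label, count in model.get(f, {}).items():
            scores[label] += count
    return max(scores, key=scores.get) if scores else None
```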