511 research outputs found

    CiNCT: Compression and retrieval for massive vehicular trajectories via relative movement labeling

    In this paper, we present a compressed data structure for moving object trajectories in a road network, which are represented as sequences of road edges. Unlike existing compression methods for trajectories in a network, our method supports pattern matching and decompression from an arbitrary position while retaining high compressibility with theoretical guarantees. Specifically, our method is based on the FM-index, a fast and compact data structure for pattern matching. To enhance the compression, we incorporate the sparsity of road networks into the data structure. In particular, we present the novel concepts of relative movement labeling and PseudoRank, each contributing to significant reductions in data size and query processing time. Our theoretical analysis and experimental studies reveal the advantages of our proposed method over existing trajectory compression methods and FM-index variants.
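
The FM-index pattern matching that the abstract builds on can be illustrated with a minimal sketch. It is a plain, uncompressed FM-index over a single trajectory of integer edge IDs, not the CiNCT structure itself: relative movement labeling and PseudoRank are not reproduced. The function names `build_fm_index` and `count_occurrences`, the sentinel value -1, and the assumption that edge IDs are non-negative are all illustrative choices.

```python
def build_fm_index(edges):
    """Build a toy FM-index (C table + occurrence table) for a list of edge IDs.

    Assumption: edge IDs are non-negative integers, so -1 can act as the
    end-of-sequence sentinel that sorts before every real symbol.
    """
    s = list(edges) + [-1]
    # Naive suffix array construction: fine for a sketch, too slow for real data.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # Burrows-Wheeler transform: the symbol preceding each sorted suffix.
    bwt = [s[i - 1] for i in sa]
    symbols = sorted(set(s))
    # c_table[c] = number of symbols in s that are strictly smaller than c.
    c_table, total = {}, 0
    for c in symbols:
        c_table[c] = total
        total += s.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] for c in symbols}
    for b in bwt:
        for c in symbols:
            occ[c].append(occ[c][-1] + (b == c))
    return c_table, occ, len(s)


def count_occurrences(index, pattern):
    """Count occurrences of `pattern` (a list of edge IDs) by backward search."""
    c_table, occ, n = index
    lo, hi = 0, n  # half-open range of suffix array rows matching the suffix read so far
    for c in reversed(pattern):
        if c not in c_table:
            return 0
        lo = c_table[c] + occ[c][lo]
        hi = c_table[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


trajectory = [1, 2, 3, 1, 2, 4, 1, 2, 3]
index = build_fm_index(trajectory)
print(count_occurrences(index, [1, 2]))
print(count_occurrences(index, [2, 4]))
```

Backward search touches the pattern once from right to left, so the number of steps depends on the pattern length rather than on the trajectory length. The occurrence table stored here in full is the part that practical FM-index variants compress.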

    De Novo Assembly of Nucleotide Sequences in a Compressed Feature Space

    Sequencing technologies allow for an in-depth analysis of biological species, but the size of the generated datasets introduces a number of analytical challenges. Recently, we demonstrated the application of numerical sequence representations and data transformations for the alignment of short reads to a reference genome. Here, we expand our approach to de novo assembly of short reads. Our results demonstrate that highly compressed data can encapsulate the signal sufficiently to accurately assemble reads into large contigs or complete genomes.

    Compressed data structures for trajectory representation

    Programa Oficial de Doutoramento en Computación. 5009V01
    [Abstract] The proliferation of GPS devices in smartphones, vehicles and sports wearables on the one hand, and of geolocation mechanisms (such as smart cards in public transportation) on the other, has produced an unprecedented capacity to obtain and store the trajectories that people generate as they move through their daily schedules. However, no standard data models exist to represent these trajectories, and neither traditional databases nor newer NoSQL databases are adequate for representing and exploiting the complex spatio-temporal data of which these trajectories consist. The picture becomes even more complex once we consider that storing information about public transportation passengers, customers inside a mall, or simply vehicles moving in a city means dealing with a true Big Data scenario in which guaranteeing an efficient response can be very challenging. Consequently, in this thesis we address the design of compact data structures for representing trajectories, both for vehicles and/or people moving in urban or periurban spaces and for itineraries of commuters in public transportation. In addition to designing compact data structures that can represent the Big Data scenario usually seen in this application domain, we have designed algorithms that allow the efficient exploitation of that information. These algorithms solve classic spatio-temporal queries, such as obtaining the position of a moving object at a time instant, reconstructing the trajectory of an object, or spatio-temporal window queries (which objects are inside a spatial range either within a time window or at a time instant), and they also solve more specialized queries for analyzing the trajectories that travelers make. 
For instance, we have designed algorithms to query the number of travelers that start (or finish) their trip in a certain place within a given time interval, the number of travelers that switch from one line of the public transportation network to another at a particular stop, or the number of travelers that start their trip in a certain place (either a stop or a whole neighborhood) and finish it in another. Both the designed structures and the querying algorithms, which are available at https://github.com/dgalaktionov/compact-trip-representation, have been experimentally evaluated. With these structures we are able to represent, in a compact space of 100 MiB, a collection of approximately one and a half million taxi trajectories, or alternatively ten million trajectories consisting of itineraries over public transportation networks, since the latter are more compact. In both cases, we can solve most of the considered exploitation queries in the order of microseconds, with algorithms that scale logarithmically with the number of stored trajectories. Finally, given that this work is an industrial thesis, the research was required to be of a clearly applied nature, which led us to develop a web application with Geographic Information Systems technology that integrates with our compressed structures and algorithms instead of relying on common spatial databases. This application, which provides a simple and intuitive user interface that represents the map of a transportation network, enables an end user to run the aforementioned algorithms over a large collection of historic trajectories. 
Likewise, this interface presents the query results in a graphical and intuitive way.
    Funding: Xunta de Galicia; IN848D 2017 2350417. Xunta de Galicia; IN852A 2018/14. Xunta de Galicia; ED431G/01. Xunta de Galicia; ED431C 2017/58. Ministerio de Economía y Competitividad; TIN2016-78011-C4-1-R. Ministerio de Economía y Competitividad; TIN2015-69951-R. Ministerio de Ciencia e Innovación; RTI-2018-098309-B-C3

    A simple and dynamic data structure for pattern matching in texts

    2011 Summer. Includes bibliographical references.
    The demand for pattern matching algorithms is currently on the rise in diverse areas such as string search, image matching, voice recognition and bioinformatics. In particular, string search or matching algorithms have been growing in popularity as they have been applied to areas such as text editors, search engines and bioinformatics. To satisfy these various demands, many string matching methods have been developed to search for substrings (pattern strings) within a text, and several techniques employ tree data structures, deterministic finite automata, and other structures. The problem of string matching is defined as finding all locations of a pattern string P within a text T, where preprocessing of T is allowed in order to facilitate the queries. There has been significant success in finding a pattern string in O(m + k) time, where m is the length of the pattern string and k is the number of occurrences, using data structures that can be constructed in O(n) time, where n is the length of T. Suffix trees and directed acyclic word graphs are such data structures. All of these data structures support searches of the indexed text in O(m + k) time. However, the difficulty of understanding and programming the construction algorithms is rarely mentioned. Also, they have significant space requirements and take Θ(n) time to update even if only one character of T is changed. To solve these problems, we propose the augmented position heap. It can be built in O(n) time, and can be used to search for a pattern string in O(m + k) time. Most importantly, when a block of j characters is inserted or deleted, the asymptotic time for updating it is O((h(T) + j) h(T)), where h(T) is the length of the longest substring X of T that occurs at least ||X|| times in T, where ||X|| is the length of X. 
For texts arising from practical applications, h(T) is typically a slowly growing function of ||T||; for a random text T, its expected value is O(log n). Another issue in data structures that must be addressed is the space requirement. The most space-efficient data structure for string search is the suffix array, which uses 2n words and supports searches in O(n log n + m + k). A compact representation of the position heap proposed in this thesis also takes 2n words and can be updated in O((h(T) + j) h(T)) time, but takes O(m^2 + k) time for a search. The best known bound for updating the suffix array or the directed acyclic word graph is O(n), and they both take considerably more space. A compact representation proposed in this thesis for the augmented position heap takes 4n words, can be updated just as efficiently as the position heap, and takes O(m + k) time for a search.
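
To make the position heap mentioned in the abstract more concrete, here is a minimal sketch of the plain (non-augmented) position heap as it is usually described: suffixes are inserted from shortest to longest into a trie, and each insertion adds exactly one node labeled with the suffix's start position. This corresponds to the O(m^2 + k) search variant, not the augmented structure or the update algorithm of the thesis. The names `Node`, `build_position_heap` and `search` are illustrative.

```python
class Node:
    """Trie node holding one text position; the root holds None."""

    def __init__(self, pos):
        self.pos = pos
        self.children = {}


def build_position_heap(text):
    """Insert suffixes right to left; each insertion creates exactly one node."""
    root = Node(None)
    n = len(text)
    for i in range(n - 1, -1, -1):
        node, j = root, i
        # Follow the longest prefix of text[i:] that is already a path in the trie.
        while j < n and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
        # Every existing node is shallower than len(text[i:]), so j < n here.
        node.children[text[j]] = Node(i)
    return root


def search(root, text, pattern):
    """Return the sorted start positions of `pattern` in `text`."""
    m = len(pattern)
    hits = []
    node, depth = root, 0
    while depth < m and pattern[depth] in node.children:
        node = node.children[pattern[depth]]
        depth += 1
        # A node above depth m only matches a prefix of the pattern,
        # so its position must be verified against the text directly.
        if depth < m and text.startswith(pattern, node.pos):
            hits.append(node.pos)
    if depth == m:
        # The whole pattern labels a path: every position in this subtree matches.
        stack = [node]
        while stack:
            cur = stack.pop()
            if cur.pos is not None:
                hits.append(cur.pos)
            stack.extend(cur.children.values())
    return sorted(hits)


text = "abaababa"
heap = build_position_heap(text)
print(search(heap, text, "aba"))
```

The verification step is where the quadratic term comes from: up to m ancestor nodes may each need a comparison of up to m characters. The augmented version described in the abstract removes that cost at the price of extra space.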

    Space-efficient representation of semi-structured document formats using succinct data structures

    Doctoral dissertation, Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2021. Srinivasa Rao Satti.
    Big data are generated from a plethora of sources. Most of the data stored as files have no fixed schema, so the files are suitable to be maintained in semi-structured document formats. A number of such formats, such as XML (eXtensible Markup Language), JSON (JavaScript Object Notation), and YAML (YAML Ain't Markup Language), have been proposed to preserve the hierarchy in the original corpora of data. Several data models that structure the gathered data, including RDF (Resource Description Framework), depend on semi-structured document formats to be serialized and transferred for future processing. Since the semi-structured document formats focus on readability and verbosity, redundant space is required to organize and maintain the document. Even though general-purpose compression schemes are widely used to compact the documents, applying those algorithms hinders future handling of the corpora, owing to the loss of internal structure. Succinct data structures are widely investigated in theory: they answer queries while the encoded data occupy space close to the information-theoretic lower bound. Bit vectors and trees are the notable succinct data structures. Nevertheless, there have been few attempts to apply the idea of succinct data structures to represent semi-structured documents in a space-efficient manner. In this dissertation we propose a unified, space-efficient representation of various semi-structured document formats. The core functionality of this representation is its compactness and query-ability, derived from the enriched functions of succinct data structures. The compact representation is engineered by combining (a) bit indexed arrays, (b) succinct ordinal trees, and (c) compression techniques. 
We implement this representation in practice, and show by experiments that constructing this representation decreases disk usage by up to 60% while occupying 90% less RAM. We also allow a document to be processed in parts, which makes it possible to process a larger corpus of big data even in a constrained environment. In parallel with establishing the aforementioned compact semi-structured document representation, we provide and reinforce some existing compression schemes in this dissertation. We first suggest an idea to encode an array of integers that is not necessarily sorted. This compaction scheme improves upon the existing universal code systems with the assistance of a succinct bit vector structure. We show that our suggested algorithm reduces space usage by up to 44% while consuming 15% less encoding time than the original code system, and additionally supports random access to elements of the encoded array. We also reinforce the SBH bitmap index compression algorithm. The main strength of this scheme is the use of an intermediate super-bucket during operations, giving better performance when querying through a combination of compressed bitmap indexes. Inspired by the splits done during the intermediate process of the SBH algorithm, we give an improved compression mechanism supporting parallelism that can be used on both CPUs and GPUs. We show by experiments that this CPU parallel processing optimization reduces compression and decompression times by up to 38% on a 4-core machine without modifying the compressed bitmap form. For GPUs, the new algorithm gives 48% faster query processing time in the experiments, compared to the previously existing bitmap index compression schemes.

    Contents
    Chapter 1 Introduction
        1.1 Contribution
        1.2 Organization
    Chapter 2 Background
        2.1 Model of Computation
        2.2 Succinct Data Structures
    Chapter 3 Space-efficient Representation of Integer Arrays
        3.1 Introduction
        3.2 Preliminaries
            3.2.1 Universal Code System
            3.2.2 Bit Vector
        3.3 Algorithm Description
            3.3.1 Main Principle
            3.3.2 Optimization in the Implementation
        3.4 Experimental Results
    Chapter 4 Space-efficient Parallel Compressed Bitmap Index Processing
        4.1 Introduction
        4.2 Related Work
            4.2.1 Byte-aligned Bitmap Code (BBC)
            4.2.2 Word-Aligned Hybrid (WAH)
            4.2.3 WAH-derived Algorithms
            4.2.4 GPU-based WAH Algorithms
            4.2.5 Super Byte-aligned Hybrid (SBH)
        4.3 Parallelizing SBH
            4.3.1 CPU Parallelism
            4.3.2 GPU Parallelism
        4.4 Experimental Results
            4.4.1 Plain Version
            4.4.2 Parallelized Version
            4.4.3 Summary
    Chapter 5 Space-efficient Representation of Semi-structured Document Formats
        5.1 Preliminaries
            5.1.1 Semi-structured Document Formats
            5.1.2 Resource Description Framework
            5.1.3 Succinct Ordinal Tree Representations
            5.1.4 String Compression Schemes
        5.2 Representation
            5.2.1 Bit String Indexed Array
            5.2.2 Main Structure
            5.2.3 Single Document as a Collection of Chunks
            5.2.4 Supporting Queries
        5.3 Experimental Results
            5.3.1 Datasets
            5.3.2 Construction Time
            5.3.3 RAM Usage during Construction
            5.3.4 Disk Usage and Serialization Time
            5.3.5 Chunk Division
            5.3.6 String Compression
            5.3.7 Query Time
    Chapter 6 Conclusion
    Bibliography
    Abstract (in Korean)
    Acknowledgements
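
The abstract's idea of pairing variable-length integer codes with a bit vector to regain random access can be sketched generically. The example below is not the dissertation's algorithm: it stores each integer in plain binary, keeps a marker bit vector with a 1 at the start of every codeword, and answers `select` through sampled positions plus a short scan. The class name `MarkedIntArray`, the sampling rate, and the use of Python strings as bit containers are illustrative assumptions; a real succinct bit vector would answer `select` in constant time with sublinear extra space.

```python
class MarkedIntArray:
    """Variable-length binary codes plus a marker bit vector for random access."""

    SAMPLE = 4  # assumption: remember the position of every 4th codeword start

    def __init__(self, values):
        codes, marks = [], []
        for v in values:
            if v < 0:
                raise ValueError("only non-negative integers are supported")
            code = format(v, "b")  # minimal binary representation, "0" for zero
            codes.append(code)
            marks.append("1" + "0" * (len(code) - 1))
        self.payload = "".join(codes)  # concatenated codewords
        self.marks = "".join(marks)    # 1 marks the first bit of each codeword
        self.n = len(values)
        # Sampled select directory: start position of codewords 0, SAMPLE, 2*SAMPLE, ...
        self.samples = []
        seen = 0
        for pos, bit in enumerate(self.marks):
            if bit == "1":
                if seen % self.SAMPLE == 0:
                    self.samples.append(pos)
                seen += 1

    def _select(self, i):
        """Position of the i-th (0-based) set bit in the marker vector."""
        pos = self.samples[i // self.SAMPLE]
        for _ in range(i % self.SAMPLE):
            pos = self.marks.index("1", pos + 1)
        return pos

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        start = self._select(i)
        end = self._select(i + 1) if i + 1 < self.n else len(self.payload)
        return int(self.payload[start:end], 2)


values = [5, 0, 1023, 7, 1, 64]
arr = MarkedIntArray(values)
print([arr[i] for i in range(len(values))])
print(len(arr.payload), "payload bits for", len(values), "integers")
```

The point of the marker vector is that no codeword has to be decoded to reach element i, which is what plain universal codes such as Elias gamma lack when used on their own.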

    Energy Consumption in Compact Integer Vectors: A Study Case

    [Abstract] In the analysis and design of algorithms and data structures, most researchers focus only on the space/time trade-off, and little attention has been paid to energy consumption. Moreover, most of the efforts in the field of Green Computing have been devoted to hardware-related issues, with green software still in its infancy. Optimizing the usage of computing resources, minimizing power consumption and increasing battery life are some of the goals of this field of research. To address the most recent sustainability challenges, we must incorporate energy consumption as a first-class constraint when designing new compact data structures. Thus, as preliminary work towards that goal, we first need to understand the factors that affect energy consumption and their relation to compression. In this work, we study the energy consumption required by several integer vector representations. We execute typical operations over datasets of different nature. We observe that, as commonly believed, energy consumption is highly related to the time required by the process, but not always. We also analyze other parameters, such as the number of instructions, the number of CPU cycles and memory loads, among others.
    Funding: Ministerio de Ciencia, Innovación y Universidades; TIN2016-77158-C4-3-R. Ministerio de Ciencia, Innovación y Universidades; RTC-2017-5908-7. Xunta de Galicia (co-funded with ERDF); ED431C 2017/58. Xunta de Galicia; ED431G/01. Comisión Nacional de Investigación Científica y Tecnológica; 3170534

    Succinct Representations in Collaborative Filtering: A Case Study using Wavelet Tree on 1,000 Cores

    The User-Item (U-I) matrix has been used as the dominant data infrastructure of Collaborative Filtering (CF). To reduce the space consumption in runtime and storage caused by data sparsity and the growing need to accommodate side information in CF design, one needs to go beyond the U-I matrix. In this paper, we present a case study of succinct representations in Collaborative Filtering as an alternative to the U-I matrix. Our key insight is to introduce Succinct Data Structures as a new infrastructure of CF. Towards this, we implemented a User-based K-Nearest-Neighbor CF prototype via Wavelet Tree, first designing an Accessible Compressed Documents (ACD) structure to compress U-I data in a Wavelet Tree, which is efficient in both storage and runtime. Then, we showed that ACD can be applied to develop an efficient intersection algorithm without decompression, by taking advantage of ACD's characteristics. We evaluated our design on 1,000 cores of the Tianhe-II supercomputer, with one of the largest public data sets, ml-20m. The results showed that our prototype could deliver the results in 3.7 minutes on average.
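
As background for the wavelet tree the abstract relies on, the following is a minimal pointer-based sketch supporting `rank` over a sequence of integer item IDs. It is not the paper's ACD structure or its intersection algorithm, and it stores plain prefix-count lists where a real implementation would use compressed bit vectors. The class name `WaveletTree` and the sample item sequence are illustrative.

```python
class WaveletTree:
    """Toy wavelet tree over integers; rank(value, i) counts value in seq[:i]."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)  # assumption: top-level sequence is non-empty
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        self.prefix = [0]
        if lo == hi or not seq:
            return  # leaf, or an empty node that is only ever queried with i == 0
        mid = (lo + hi) // 2
        # prefix[i] = how many of the first i elements are routed to the left child.
        for x in seq:
            self.prefix.append(self.prefix[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def rank(self, value, i):
        """Number of occurrences of `value` among the first i elements."""
        if i == 0 or value < self.lo or value > self.hi:
            return 0
        if self.lo == self.hi:
            return i  # every element that reaches this leaf equals `value`
        mid = (self.lo + self.hi) // 2
        went_left = self.prefix[i]
        if value <= mid:
            return self.left.rank(value, went_left)
        return self.right.rank(value, i - went_left)


# Hypothetical log of rated item IDs, in the order the ratings were stored.
items = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
tree = WaveletTree(items)
print(tree.rank(5, len(items)))
print(tree.rank(1, 4))
```

Each `rank` call descends one root-to-leaf path, so its cost grows with the logarithm of the ID range rather than with the length of the sequence, which is why the structure suits queries over compressed rating data.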